Archive for January, 2015

Your Weekly Snowden Dribble

Saturday, January 31st, 2015

Levitation program tracked file-sharing sites by David Meyer.

From the post:

The Canadian spy agency CSE monitors activity across over 100 free file upload sites, a newly-revealed PowerPoint document from NSA whistleblower Edward Snowden’s cache has shown.

The document describing CSE’s Levitation program was published on Wednesday by The Intercept, reporting alongside Canadian broadcaster CBC. Although Canada has long been known to be a member of the core Anglophone “Five Eyes” spying club, this is the first Snowden revelation putting it at the forefront of one of the Eyes’ mass surveillance programs.

I was truly surprised that the Canadians were monitoring file sharing sites. Shouldn’t that be contracted out to the mad dogs at the RIAA and MPAA, etc.? Cheaper and doubtful you could find anyone more persistent.

On the downside they would want the phone logs of every teenager suspected of copying a song off the radio. To find clusters of copyright thieves. Maybe it is better to have the Canadians do it.

Effective Communication with Colleagues, the Public, Students and Users

Saturday, January 31st, 2015

I chose the title of this post as a replacement for: Synthesis of the Value and Benefits of SoTL (Scholarship of Teaching and Learning) Experienced by the Contributors. I know, it’s hard to get past the title but if you do, Rick Reis advises:

The posting below is Chapter 20 of the book, Doing the Scholarship of Teaching and Learning (SoTL) in Mathematics. It has implications well beyond the teaching of mathematics and represents the synthesis of the editors, Jacqueline Dewar and Curtis Bennett of Loyola Marymount University, of the contributing authors’ perceptions of the value of SoTL. In it, they reinterpret Shulman’s (1999) “taxonomy of pedago-pathology” consisting of amnesia, fantasia, and inertia, which he had used to describe pitfalls of student learning, to show how the same 3 labels can describe pathologies of teaching, and then discuss how SoTL can operate as an antidote for these. Dewar, J., & Bennett, C. (Eds.). (2015). Doing the scholarship of teaching and learning in mathematics. Washington, DC: Mathematical Association of America. Copyright © 2015. Mathematical Association of America. All rights reserved. Reprinted with permission.

The triple, “amnesia, fantasia, and inertia” originally were references to:

  • Amnesia – inability of students to remember what they have learned
  • Fantasia – remembering of incorrect information by students
  • Inertia – students’ inability to apply what they have learned

The essays turn the tables on professors who teach classes where:

  • Amnesia – inability of a professor to remember what worked or what didn’t in prior courses
  • Fantasia – relying on assumptions about students rather than exploring why they haven’t mastered material
  • Inertia – teaching classes the same way, even though prior students failed to master the materials

I can see that triple, “amnesia, fantasia, and inertia” in the treatment of users:

  • Amnesia – inability to remember what has worked in the past for users
  • Fantasia – no user testing of UI and/or documentation, “just not trying hard enough.”
  • Inertia – writing poor or no documentation, “this time will be different”

I suppose the inertia part is also a reference to how we all take every opportunity to be lazy if at all possible.

I am sure the techniques for SoTL aren’t directly transferable to CS in academia or in practice but every nudge in a better direction helps.

The MAA page offers this description:

The four chapters in Part I provide background on this form of scholarship and specific instructions for undertaking a SoTL investigation in mathematics. Part II contains fifteen examples of SoTL projects in mathematics from fourteen different institutions, both public and private, spanning the spectrum of higher educational institutions from community colleges to research universities. These chapters “reveal the process of doing SoTL” by illustrating many of the concepts, issues, methods and procedures discussed in Part I. An Editors’ Commentary opens each contributed chapter to highlight one or more aspects of the process of doing SoTL revealed within. Toward the end of each chapter the contributing authors describe the benefits that accrued to them and their careers from participating in SoTL.

In PDF it is only $23.00 and appears to be worth every penny of it.

“…we have not identified a single instance…” Ineffectual Phone Surveillance

Saturday, January 31st, 2015

Tim Cushing has a great piece, Privacy Board Says NSA Doesn’t Know How Effective Its Collection Programs Are, Doesn’t Much Care Either at TechDirt on the latest report of the Privacy and Civil Liberties Oversight Board (PCLOB) on government surveillance in the United States.

As a result of his post, I went searching for the Board’s earlier report on vacuuming up phone records. In part that earlier report reads:

The threat of terrorism faced today by the United States is real. The Section 215 telephone records program was intended as one tool to combat this threat—a tool that would help investigators piece together the networks of terrorist groups and the patterns of their communications with a speed and comprehensiveness not otherwise available. However, we conclude that the Section 215 program has shown minimal value in safeguarding the nation from terrorism. Based on the information provided to the Board, including classified briefings and documentation, we have not identified a single instance involving a threat to the United States in which the program made a concrete difference in the outcome of a counterterrorism investigation. Moreover, we are aware of no instance in which the program directly contributed to the discovery of a previously unknown terrorist plot or the disruption of a terrorist attack. And we believe that in only one instance over the past seven years has the program arguably contributed to the identification of an unknown terrorism suspect. Even in that case, the suspect was not involved in planning a terrorist attack and there is reason to believe that the FBI may have discovered him without the contribution of the NSA’s program.

The Board’s review suggests that where the telephone records collected by the NSA under its Section 215 program have provided value, they have done so primarily in two ways: by offering additional leads regarding the contacts of terrorism suspects already known to investigators, and by demonstrating that foreign terrorist plots do not have a U.S. nexus. The former can help investigators confirm suspicions about the target of an inquiry or about persons in contact with that target. The latter can help the intelligence community focus its limited investigatory resources by avoiding false leads and channeling efforts where they are needed most. But with respect to the former, our review suggests that the Section 215 program offers little unique value but largely duplicates the FBI’s own information gathering efforts. And with respect to the latter, while the value of proper resource allocation in time-sensitive situations is not to be discounted, we question whether the American public should accept the government’s routine collection of all of its telephone records because it helps in cases where there is no threat to the United States. (emphasis added)

What amazes me is that Tim’s review of the current report reflects that the vacuuming of phone records continues and there has been no effort to develop metrics to test the effectiveness of surveillance programs.

Recommendation 10: Develop a Methodology to Assess the Value of Counterterrorism Programs

Status: Not implemented

I wonder why the “…we have not identified a single instance…” line doesn’t come up in every presidential news conference, every interview with candidates for the House or the Senate, interviews with presidential hopefuls? It should come up repeatedly until the program is terminated by the executive branch or Congress defunds the NSA.

The president and others may be testy because they want to spin some other tale for public consumption but here we have a bi-partisan board that has seen the classified evidence and has reported to the public that the mass collection of phone records is ineffective. (full stop) The effectiveness of mass phone surveillance is no longer up for debate. The facts are in.

The question is what will the public do with those facts? Not vote for anyone who refuses to defund the NSA or terminate the phone surveillance program?

That’s a start but just to emphasize the point, call the White House, your representative and both senators daily to request the ending of the bulk phone records surveillance program. That will also illustrate how the NSA program captures constituents communicating with their elected representatives.

Unsustainable Museum Data

Friday, January 30th, 2015

Unsustainable Museum Data by Matthew Lincoln.

From the post:

In which I ask museums to give less API, more KISS and LOCKSS, please.

“How can we ensure our [insert big digital project title here] is sustainable?” So goes the cry from many a nascent digital humanities project, and rightly so! We should be glad that many new ventures are starting out by asking this question, rather than waiting until the last minute to come up with a sustainability plan. But Adam Crymble asks whether an emphasis on web-based digital projects instead of producing and sharing static data files is needlessly worsening our sustainability problem. Rather than allowing users to download the underlying data files (a passel of data tables, or marked-up text files, or even serialized linked data), these web projects mediate those data with user interfaces and guided searching, essentially making the data accessible to the casual user. But serving data piecemeal to users has its drawbacks, notes Crymble. If and when the web server goes down, access to the data disappears:

When something does go wrong we quickly realise it wasn’t the website we needed. It was the data, or it was the functionality. The online element, which we so often see as an asset, has become a liability.

I would broaden the scope of this call to include library and other data as well. Yes, APIs can be very useful but so can a copy of the original data.

Matthew mentions “creative re-use” near the end of his post but I would highlight that as a major reason for providing the original data. No doubt museums and others work very hard at offering good APIs of data but any API is only one way to obtain and view data.

For data, any data, to reach its potential, it needs to be available for multiple views of the same data. Some you may think are better, some you may think are worse than the original. But it is the potential for a multiplicity of views that opens up those possibilities. Keeping data behind an API is an act of preventing data from reaching its potential.
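To make the “multiplicity of views” point concrete, here is a minimal sketch, assuming a museum published its collection as a plain CSV file. The column names and sample rows are invented for illustration; they do not come from Matthew’s post or from any real museum data set.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical sample standing in for a museum's downloadable CSV dump.
SAMPLE_CSV = """object_id,title,artist,year,medium
1,Landscape with Mill,A. Painter,1650,oil
2,Portrait of a Lady,B. Artist,1652,oil
3,Harbor Study,A. Painter,1655,etching
"""


def load_records(text):
    """Parse the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def works_per_artist(records):
    """View 1: how many works each artist contributes."""
    return Counter(row["artist"] for row in records)


def titles_by_medium(records):
    """View 2: titles grouped by medium, a cut a given API may not offer."""
    grouped = defaultdict(list)
    for row in records:
        grouped[row["medium"]].append(row["title"])
    return dict(grouped)


records = load_records(SAMPLE_CSV)
print(works_per_artist(records))
print(titles_by_medium(records))
```

Neither view needs the publisher to have anticipated it, which is the whole argument for shipping the file alongside the API.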

So You’d Like To Make a Map Using Python

Friday, January 30th, 2015

So You’d Like To Make a Map Using Python by Stephan Hügel.

From the post:

Making thematic maps has traditionally been the preserve of a ‘proper’ GIS, such as ArcGIS or QGIS. While these tools make it easy to work with shapefiles, and expose a range of common everyday GIS operations, they aren’t particularly well-suited to exploratory data analysis. In short, if you need to obtain, reshape, and otherwise wrangle data before you use it to make a map, it’s easier to use a data analysis tool (such as Pandas), and couple it to a plotting library. This tutorial will be demonstrating the use of:

  • Pandas
  • Matplotlib
  • The matplotlib Basemap toolkit, for plotting 2D data on maps
  • Fiona, a Python interface to OGR
  • Shapely, for analyzing and manipulating planar geometric objects
  • Descartes, which turns said geometric objects into matplotlib “patches”
  • PySAL, a spatial analysis library

The approach I’m using here uses an interactive REPL (IPython Notebook) for data exploration and analysis, and the Descartes package to render individual polygons (in this case, wards in London) as matplotlib patches, before adding them to a matplotlib axes instance. I should stress that many of the plotting operations could be more quickly accomplished, but my aim here is to demonstrate how to precisely control certain operations, in order to achieve e.g. the precise line width, colour, alpha value or label position you want.

I didn’t catch this when it was originally published (2013) so you will probably have to update some of the specific package versions.

Still, this looks like an incredibly useful exercise.

Not just for learning Python and map creation but deeper knowledge about particular cities as well. On a good day I can find my way around the older parts of Rome from the Trevi Fountain but my knowledge fades pretty rapidly.

Creating a map using Python could help flesh out your knowledge of cities that are otherwise just names on the news. Isn’t that one of those quadruple learning environments? Geography + Cartography + Programming + Demographics? That’s how I would pitch it in any event.
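If you want a feel for what the geometry libraries in that list are doing for you, the core question behind a ward-level map is “which polygon does this point fall in?” The sketch below is not Stephan’s code and uses none of the listed packages; it is a bare-bones ray-casting test in plain Python, and the two square “wards” and their coordinates are made up. For real work, Shapely’s geometry predicates are the better choice.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    polygon is a list of (x, y) vertices in order. Points lying exactly
    on an edge are not treated specially in this sketch.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            # Count crossings to the right of the point.
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical "wards": two unit squares stacked on top of each other.
WARDS = {
    "North": [(0, 1), (1, 1), (1, 2), (0, 2)],
    "South": [(0, 0), (1, 0), (1, 1), (0, 1)],
}


def ward_for(x, y):
    """Return the name of the first ward containing the point, or None."""
    for name, polygon in WARDS.items():
        if point_in_polygon(x, y, polygon):
            return name
    return None


def count_by_ward(points):
    """Tally points per ward, the raw material for a thematic map."""
    counts = {name: 0 for name in WARDS}
    for x, y in points:
        name = ward_for(x, y)
        if name is not None:
            counts[name] += 1
    return counts


print(count_by_ward([(0.5, 0.5), (0.5, 1.5), (0.2, 0.3), (5, 5)]))
```

Counts like these are what you would then hand to a plotting library to shade each ward.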

I first saw this in a tweet by YHat, Inc.

Data Science in Python

Friday, January 30th, 2015

Data Science in Python by Greg.

From the webpage:

Last September we gave a tutorial on Data Science with Python at DataGotham right here in NYC. The conference was great and I highly suggest it! The “data prom” event the night before the main conference was particularly fun!

… (image omitted)

We’ve published the entire tutorial as a collection of IPython Notebooks. You can find the entire presentation on github or checkout the links to nbviewer below.

…(image omitted)

Table of Contents

A nice surprise for the weekend!

Curious, out of the government data that is online, local, state, federal, what data would you like most to see for holding government accountable?

Data science is a lot of fun in and of itself but results that afflict the comfortable are amusing as well.

I first saw this in a tweet by YHat, Inc.

Want Privacy? Go Old School.

Friday, January 30th, 2015

Drug Dealers Swapping Down To Old Cellphones To Stay One Step Ahead In The ‘Tech Arms Race’ by Tim Cushing.

From the post:

…in the UK, some criminals have discovered one way to stay a step ahead of the cops is to take a few steps backward.

A dealer in Handsworth, Birmingham—who would only give his name as “K2”—told me: “I’ve got three Nokia 8210 phones and have been told they can be trusted, unlike these iPhones and new phones, which the police can easily [use to] find out where you’ve been… Every dealer I know uses old phones, and the Nokia 8210 is the one everyone wants because of how small it is and how long the battery lasts.

Old tech beats new tech, at least in some business ventures. The 8210 has 50-150 hours of standby time and an infrared port to quickly beam data from one phone to another (handy for burners or compromised phones). But other than its connection to cell towers, the phone provides no other means of connectivity: no Bluetooth, no WiFi, no WLAN. Nothing.

Full marks for imaginative thinking about technology, although the best line in the post belongs to a copper:

…They’re durable, cheap and unlike today’s smartphones, aren’t “just GPS ankle monitors that double up as pizza-ordering devices,” as Vice’s Mike Zacharanda puts it.

Too late for this year but is anyone interested in a Kickstarter campaign for an “iPhone as GPS ankle monitor” public service spot for the Super Bowl in 2016?

One Week of Harassment on Twitter

Thursday, January 29th, 2015

One Week of Harassment on Twitter by Anita Sarkeesian.

From the post:

Ever since I began my Tropes vs Women in Video Games project, two and a half years ago, I’ve been harassed on a daily basis by irate gamers angry at my critiques of sexism in video games. It can sometimes be difficult to effectively communicate just how bad this sustained intimidation campaign really is. So I’ve taken the liberty of collecting a week’s worth of hateful messages sent to me on Twitter. The following tweets were directed at my @femfreq account between 1/20/15 and 1/26/15.

The limited vocabularies of the posters to one side, one hundred and fifty-six (156) hate messages is an impressive number. I pay no more attention to postings by illiterates than I do to cat pictures but I can understand why that would get to be a drag.

Many others have commented more usefully on the substance of this topic than I can but as a technical matter, how would you:

  • Begin to ferret out the origins and backgrounds on these posters?
  • Automate response networks (use your imagination about the range of responses)?
  • Automate filtering for an account under such attacks?
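On the filtering question, the crudest possible starting point is a scoring rule that holds suspicious mentions for later review instead of showing them. The following is only a sketch under stated assumptions: the Tweet fields, the placeholder term list and the thresholds are all hypothetical, and nothing here talks to Twitter’s actual API.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    # Hypothetical fields; a real client would map API data into these.
    author: str
    text: str
    author_account_age_days: int


# Placeholder terms; a real list would be maintained by the account owner.
BLOCKED_TERMS = {"slur1", "slur2", "threat"}


def score(tweet, blocked_terms=BLOCKED_TERMS, new_account_days=30):
    """One point per distinct blocked term, plus one for a very new account."""
    words = {w.strip(".,!?\"'").lower() for w in tweet.text.split()}
    points = len(words & blocked_terms)
    if tweet.author_account_age_days < new_account_days:
        points += 1
    return points


def filter_mentions(tweets, threshold=2):
    """Split mentions into (shown, held); held ones await human review."""
    shown, held = [], []
    for tweet in tweets:
        (held if score(tweet) >= threshold else shown).append(tweet)
    return shown, held


mentions = [
    Tweet("a", "Nice video!", 400),
    Tweet("b", "threat slur1", 2),
    Tweet("c", "That was a threat?", 500),
]
shown, held = filter_mentions(mentions)
print([t.author for t in shown], [t.author for t in held])
```

Keyword rules like this are easy to evade and prone to false positives, which is why the held pile goes to review and not straight to the trash.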

Lacking any type of effective governance structure, think any unexplored and ungoverned territory, security and safety on the Internet is a question of alliances for mutual protection. Eventually governance will evolve for the Internet but since that will require relinquishing of some national sovereignty, I don’t expect to see it in our lifetimes.

In the meantime, we need stop-gap measures that can set the tone for the governance structures that will eventually evolve.


PS: Some people urge petitioning current governments for protection. Since their interests are in inherent conflict with the first truly transnational artifact (the Internet), I don’t see that as being terribly useful. I prefer whatever other stick comes to hand.

I first saw this in a tweet by

MapR Offers Free Hadoop Training and Certifications

Thursday, January 29th, 2015

MapR Offers Free Hadoop Training and Certifications by Thor Olavsrud.

From the post:

In an effort to make Hadoop training for developers, analysts and administrators more accessible, Hadoop distribution specialist MapR Technologies Tuesday unveiled a free on-demand training program. Another track for HBase developers will be added later this quarter.

“This represents a $50 million, in-kind contribution to the Hadoop community,” says Jack Norris, CMO of MapR. “The focus is overcoming what many people consider the major obstacle to the adoption of big data, particularly Hadoop.”

The developer track is about building big data applications in Hadoop. The topics range from the basics of Hadoop and related technologies to advanced topics like designing and developing MapReduce and HBase applications with hands-on labs. The courses include:

  • Hadoop Essentials. This course, which is immediately available, provides an introduction to Hadoop, the ecosystem, common solutions and use cases.
  • Developing Hadoop Applications. This course is also immediately available and focuses on designing and writing effective Hadoop applications with MapReduce and YARN.
  • HBase Schema Design and Modeling. This course will become available in February and will focus on architecture, schema design and data modeling on HBase.
  • Developing HBase Applications. This course will also debut in February and focuses on real-world application design in HBase (Time Series and Social Application examples).
  • Hadoop Data Analysis – Drill. Slated for debut in March, this course covers interactive SQL on Hadoop for structured, semi-structured and nested data.

I remember how expensive the Novell training classes were back in the Netware 4.11 days. (Yes, that has been a while.)

I wonder whose software will come to mind after completing the MapR training courses and passing the certification exams?

That’s what I think too. Send kudos to MapR for this effort!

Looking forward to seeing some of you at Hadoop certification exams later this year!

I first saw this in a tweet by Kirk Borne.

‘Open Up’ Digital Democracy Commission’s Report published

Thursday, January 29th, 2015

‘Open Up’ Digital Democracy Commission’s Report published

From the post:

The Speaker’s Commission on Digital Democracy has published its report ‘Open Up’. The report recommends how Parliament can use digital technology to help it to be more transparent, inclusive, and better able to engage the public with democracy.

  • Read the Digital Democracy Commission’s full report
  • Read the Summary of the report
  • Read the plain language version of the report (PDF 355 KB)
  • Information on events for the launch of the report

Commenting on the report, the Rt Hon John Bercow MP, Speaker of the House of Commons said:

    “I set up the Digital Democracy Commission to explore how Parliament could make better use of digital technology to enhance and improve its work. I am very grateful to all those who contributed to the Commission’s work, and have been particularly struck by the enthusiastic contributions from those who expressed a desire to participate in the democratic process, but felt that barriers existed that prevented them from doing so.

    This report provides a comprehensive roadmap to break down barriers to public participation. It also makes recommendations to facilitate better scrutiny and improve the legislative process.

    In a year where we reflect on our long democratic heritage, it is imperative that we look also to the future and how we can modernise our democracy to meet the changing needs of modern society.”

    … (emphasis in the original)

    Do you think I should forward the U.S. Congress the full report or the plain language summary? 😉

    I was particularly encouraged by the methodology of the report:

    We asked people to tell us their views online or in person and we heard from a wide range of people. They included not just experts, MPs and interest groups, but members of the public—people of different ages and backgrounds and people with varying levels of interest in politics and the work of Parliament.

    I wonder if that has ever occurred to the various groups drafting standards for IT in government? To ask an actual citizen? They aren’t rare or so I have been told.

    Whatever sort of government you want or want to preserve, this is a good lesson in how to “feel the pulse” that drives the average citizen. Even more useful if you are interested in democratic institutions.

    PS: There is an IT rumor that Texas tried legislative transparency a number of years ago, for maybe a day or two. It was so transparent and disruptive of the usual skullduggery of the legislature that they jerked the system. I heard the story from more than one very reliable source with first hand knowledge of the project. I suspect there is documentation in the possession of some office at the Texas legislature to corroborate that rumor. Anyone feeling leaky?

    If you put the Texas legislature in jail, the odds of imprisoning an innocent depend on whether it was bring your child to work day or not.

    (Update)

    Wednesday, January 28th, 2015

    I first wrote about this site in a post dated October 17, 2011.

    A customer story from Microsoft: WorldWide Science Alliance and Deep Web Technologies made me revisit the site.

    My original test query was “partially observable Markov processes” which resulted in 453 “hits” from at least 3266 found (2011 results). Today, running the same query resulted in “…1,342 top results from at least 25,710 found.” The top ninety-seven (97) were displayed.

    A current description of the system from the customer story:

    In June 2010, Deep Web Technologies and the Alliance launched multilingual search and translation capabilities with, which today searches across more than 100 databases in more than 70 countries. Users worldwide can search databases and translate results in 10 languages: Arabic, Chinese, English, French, German, Japanese, Korean, Portuguese, Russian, and Spanish. The solution also takes advantage of the Microsoft Audio Video Indexing Service (MAVIS). In 2011, multimedia search capabilities were added so that users could retrieve speech-indexed content as well as text.

    The site handles approximately 70,000 queries and 1 million page views each month, and all traffic, including that from automated crawlers and search engines, amounts to approximately 70 million transactions per year. When a user enters a search term, instantly provides results clustered by topic, country, author, date, and more. Results are ranked by relevance, and users can choose to look at papers, multimedia, or research data. Divided into tabs for easy usability, the interface also provides details about each result, including a summary, date, author, location, and whether the full text is available. Users can print the search results or attach them to an email. They can also set up an alert that notifies them when new material is available.

    Automated searching and translation can’t give you the semantic nuances possible by human authoring but it certainly can provide you with the source materials to build a specialized information resource with such semantics.

    Very much a site to bookmark and use on a regular basis.

    Links for subjects without them otherwise:

    Deep Web Technologies

    Microsoft Translator

    Bughunter cracks “absolute privacy” Blackphone – by sending it a text message

    Wednesday, January 28th, 2015

    Bughunter cracks “absolute privacy” Blackphone – by sending it a text message by Paul Ducklin.

    From the post:

    Serial Aussie bugfinder Mark Dowd has been at it again.

    He loves to look for security flaws in interesting and important places.

    This time, he turned his attention to a device that most users acquired precisely because of its security pedigree, namely the Blackphone.

    What Dowd found is that text messages received by a Blackphone are processed by the messaging software in an insecure way that could lead to remote code execution.

    Simply put, the sender of a message can format it so that instead of being decoded and displayed safely as text, the message tricks the phone into processing and executing it as if it were a miniature program.

    Dowd’s paper is a great read if you’re a programmer, because it explains the precise details of how the exploit works, which just happens to make it pretty obvious what the programmers did wrong.

    That means his article can help you avoid this sort of error in your own code.

    Don’t get too excited because Blackphone has already issued a patch for the problem.

    On the other hand, Paul’s lay explanation of the exploit could lead to a hard copy demonstration of the bug for educating purchasers of programming services. Imagine a contract that specifies the resulting software is free from this specific type of defect. That can only happen with better educated consumers of software programming services.

    Are there existing hard copy demonstrations of common software bugs? Where a person can fill out a common form such as Paul’s change of address and see the problem with the data they have entered?

    Beyond this particular exploit, what other common exploits are subject to similar analogies?

    This could be an entirely new market for security based educational materials, particularly for online and financial communities.
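For the programmers in the audience, the general lesson as I read Paul’s summary is to treat an incoming message as untrusted data and check its shape before acting on it. The sketch below is an analogy only. It is not the Blackphone code and not Dowd’s exploit, and the JSON layout, field names and length limit are all assumptions made up for illustration.

```python
import json

MAX_BODY_CHARS = 1000  # hypothetical limit, chosen only for this sketch


def parse_text_message(raw):
    """Decode an untrusted message and return its body as plain text.

    Raises ValueError for anything that is not exactly the expected shape,
    so callers never act on a field of the wrong type.
    """
    try:
        msg = json.loads(raw)
    except (ValueError, TypeError) as exc:
        raise ValueError("not valid JSON") from exc
    if not isinstance(msg, dict):
        raise ValueError("message must be a JSON object")
    if msg.get("type") != "text":
        raise ValueError("unsupported message type")
    body = msg.get("body")
    if not isinstance(body, str):
        raise ValueError("body must be a string")
    if len(body) > MAX_BODY_CHARS:
        raise ValueError("body too long")
    return body


print(parse_text_message('{"type": "text", "body": "hello"}'))
```

The point is that every branch either returns a plain string or refuses the message; nothing from the sender decides what the receiver does next.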

    NIST developing database to help advance forensics

    Tuesday, January 27th, 2015

    NIST developing database to help advance forensics by Greg Otto.

    From the post:

    While the National Institute of Standards and Technology has been spending a lot of time advancing the technology behind forensics, the agency can only go so far. With all of the ways people can be identified, researchers still lack sufficient data that would allow them to further already existing technology.

    To overcome that burden, NIST has been working on a catalog that would help the agency, academics and other interested parties discover data sets that will allow researchers to further their work. The Biometric and Forensic Research Database Catalog aims to be a one-stop shop for those looking to gather enough data or find better quality data for their projects.

    Not all national agencies in the United States do a bad job. Some of them, NIST being among them, do very good jobs.

    Take the Biometric and Forensic Research Database Catalog (BDbC) for example. Forensic data is hard to find and to cure that problem, NIST has created a curated data collection that is available for anyone to search online.

    Perhaps the U.S. Axis of Surveillance (FBI/DEA/CIA/NSA, etc.) doesn’t understand the difference between a data vacuum cleaner and a librarian. Any fool can run a data vacuum cleaner, fortunately, or the Axis of Surveillance would have no results at all.

    Fortunately, Erica Firment can help the Axis of Surveillance with the difference:

    Why you should fall to your knees and worship a librarian

    Ok, sure. We’ve all got our little preconceived notions about who librarians are and what they do.

    Many people think of librarians as diminutive civil servants, scuttling about “Sssh-ing” people and stamping things. Well, think again buster.

    Librarians have degrees. They go to graduate school for Information Science and become masters of data systems and human/computer interaction. Librarians can catalog anything from an onion to a dog’s ear. They could catalog you.

    Librarians wield unfathomable power. With a flip of the wrist they can hide your dissertation behind piles of old Field and Stream magazines. They can find data for your term paper that you never knew existed. They may even point you toward new and appropriate subject headings.

    People become librarians because they know too much. Their knowledge extends beyond mere categories. They cannot be confined to disciplines. Librarians are all-knowing and all-seeing. They bring order to chaos. They bring wisdom and culture to the masses. They preserve every aspect of human knowledge. Librarians rule. And they will kick the crap out of anyone who says otherwise.

    Everybody has a favorite line but mine is:

    People become librarians because they know too much.

    There is a corollary which Erica doesn’t mention:

    People resort to data vacuuming because they know too little. A condition that data vacuuming cannot fix.

    Think of it as being dumb and getting dumber.

    There are solutions to that problem but since the intelligence community isn’t paying me, it isn’t worth writing them down.

    PS: Go to the Library Avengers store for products by Erica.

    The DEA is Stalking You!

    Tuesday, January 27th, 2015

    When I wrote about Waze earlier today, Google asked to muzzle Waze ‘police-stalking’ app, I had no idea that the Wall Street Journal had dropped the hammer on yet another mass surveillance program.

    In U.S. Justice Department Built Secret, Nationwide License-Plate Tracking Database (Car & Driver) by Robert Sorokanich, Robert reports:

    Bad news for anyone who values privacy: The Wall Street Journal reports that the U.S. Justice Department has been secretly expanding its license-plate scanning program to create a real-time national vehicle tracking database monitoring hundreds of millions of motorists.

    WSJ pulls no punches in describing the program, calling it nothing less than “a secret domestic intelligence-gathering program.” The program, established by the Drug Enforcement Agency in 2008, originated as a way of tracking down and seizing cars, money, and other assets involved in drug trafficking in areas of Arizona, California, Nevada, New Mexico, and Texas where illicit drugs are funneled across the border.

    The program uses camera systems at strategic points on major U.S. highways to record time, location, and direction of vehicle travel. Some locations take photos of drivers and passengers, which are sometimes detailed enough to confirm identity, WSJ reports.

    Perhaps more chillingly, the documents reviewed by the news outlets indicate that the DEA has also employed license-plate-reading technology to create a “far-reaching, constantly updating database of electronic eyes scanning traffic on the roads to steer police toward suspects.”

    My first reaction was, as a close friend often says, “…it’s hard to be cynical enough.”

    My second reaction is that I need to get a WSJ subscription so I can check for late breaking news in areas of interest.

    But more to the point, we should all start tracking all police, everywhere and posting that data to Waze. In particular we need to track DEA, FBI, NSA, CIA, all elected and appointed federal officials, etc.

    You see, I happen to trust the town and county police where I live, and I trust the state police almost as much. I’m sure they would disagree politically with some of the things I say but for the most part, they are doing a thankless job for lower pay than I would take for the same work. Where my trust of the police and government breaks down is once you move off of the state level.

    Not to deny there are bad apples in every lot, but as you go up to the national level, the percentage of bad apples increases rapidly. What agenda they are seeking to serve I cannot say but I do know it isn’t one that is consistent with the Constitution or intended to benefit any ordinary citizens.

    Turn your cellphone cameras on and legally park outside every known DEA, FBI, etc. office and photograph everyone coming or going. Obey all laws and instructions from law enforcement officials. Then post all of your photos and invite others to do the same.

    Actually I would call up my local police and ask for their assistance in tracking DEA, FBI, etc. agents. The local police don’t need interference from people who don’t understand your local community. You may find the local police are your best allies in ferreting out overreaching by the federal government.

    The police (read local police) aren’t the privacy problem. The privacy problem is with federal data vacuums and police wannabes who think people are the sum of their data. People are so much more than that. Ask your local police if you don’t believe me.

    Data Science and Hadoop: Predicting Airline Delays – Part 3

    Tuesday, January 27th, 2015

    Data Science and Hadoop: Predicting Airline Delays – Part 3 by Ofer Mendelevitch and Beau Plath.

    From the post:

    In our series on Data Science and Hadoop, predicting airline delays, we demonstrated how to build predictive models with Apache Hadoop, using existing tools. In part 1, we employed Pig and Python; part 2 explored Spark, ML-Lib and Scala.

    Throughout the series, the thesis, theme, topic, and algorithms were similar. That is, we wanted to dismiss the misconception that data scientists – when applying predictive learning algorithms, like Linear Regression, Random Forest or Neural Networks to large datasets – require dramatic changes to the tooling; that they need dedicated clusters; and that existing tools will not suffice.

    Instead, we used the same HDP cluster configuration, the same machine learning techniques, the same data sets, and the same familiar tools like PIG, Python and Scikit-learn and Spark.

    For the final part, we resort to Scalding and R. R is a very popular, robust and mature environment for data exploration, statistical analysis, plotting and machine learning. We will use R for data exploration, graphics as well as for building our predictive models with Random Forest and Gradient Boosted Trees. Scalding, on the other hand, provides Scala libraries that abstract Hadoop MapReduce and implement data pipelines. We demonstrate how to pre-process the data into a feature matrix using the Scalding framework.

    For brevity I shall spare summarizing the methodology here, since both previous posts (and their accompanying IPython Notebooks) expound the steps, iteration and implementation code. Instead, I would urge that you read all parts as well as try the accompanying IPython Notebooks.

    Finally, for this last installment in the series in Scalding and R, read its IPython Notebook for implementation details.

    Given the brevity of this post, you are definitely going to need Part 1 and Part 2.

    The data science world could use more demonstrations like this series.

    LDAvis: Interactive Visualization of Topic Models

    Tuesday, January 27th, 2015

    LDAvis: Interactive Visualization of Topic Models by Carson Sievert and Kenny Shirley.

    From the webpage:

    Tools to create an interactive web-based visualization of a topic model that has been fit to a corpus of text data using Latent Dirichlet Allocation (LDA). Given the estimated parameters of the topic model, it computes various summary statistics as input to an interactive visualization built with D3.js that is accessed via a browser. The goal is to help users interpret the topics in their LDA topic model.

    From the description:

    This video (recorded September 2014) shows how interactive visualization is used to help interpret a topic model using LDAvis. LDAvis is an R package which extracts information from a topic model and creates a web-based visualization where users can interactively explore the model. More details, examples, and instructions for using LDAvis can be found here —

    Excellent exploration of a data set using LDAvis.

    With all due respect to “agile” programming, modeling before you understand a data set isn’t a winning proposition.

    Eigenvectors and eigenvalues: Explained Visually

    Tuesday, January 27th, 2015

    Eigenvectors and eigenvalues: Explained Visually by Victor Powell and Lewis Lehe

    Very impressive explanation/visualization of eigenvectors and eigenvalues. What is more, it concludes with pointers to additional resources.

    This is only one part of a larger project of algorithm visualizations at: Explained Visually.

    Looking forward to seeing more visualizations on this site.

    Coding is not the new literacy

    Tuesday, January 27th, 2015

    Coding is not the new literacy by Chris Granger.

    From the post:

    Despite the good intentions behind the movement to get people to code, both the basic premise and approach are flawed. The movement sits on the idea that "coding is the new literacy," but that takes a narrow view of what literacy really is.

    If you ask google to define literacy it gives a mechanical definition:

    the ability to read and write.

    This is certainly accurate, but defining literacy as interpreting and making marks on a sheet of paper is grossly inadequate. Reading and writing are the physical actions we use to employ something far more important: external, distributable storage for the mind. Being literate isn't simply a matter of being able to put words on the page, it's solidifying our thoughts such that they can be written. Interpreting and applying someone else's thoughts is the equivalent for reading. We call these composition and comprehension. And they are what literacy really is.

    Before you assume that Chris is going to diss programming, go read his post.

    Chris is arguing for a skill set that will make anyone a much better programmer and that will spill over into other analytical tasks as well.

    Take the title as a provocation to read the post. By the end of the post, you will have learned something valuable or have been reminded of something valuable that you already knew.


    Business Analytics Error: Learn from Uber’s Mistake During the Sydney Terror Attack

    Tuesday, January 27th, 2015

    Business Analytics Error: Learn from Uber’s Mistake During the Sydney Terror Attack by RK Paleru.

    From the post:

    Recently, as a sad day of terror ended in Sydney, a bad case of Uber’s analytical approach to pricing came to light – an “algorithm based price surge.” Uber’s algorithm driven price surge started overcharging people fleeing the Central Business District (CBD) of Sydney following the terror attack.

    I’m not sure the algorithm got it wrong. If you asked me to drive into a potential war zone to ferry strangers out, I suspect a higher fee than normal is to be expected.

    The real dilemma for Uber is that not all ground transportation has surge price algorithms. When buses, subways, customary taxis, etc. all have surge price algorithms, the price hikes won’t appear to be abnormal.

    One of the consequences of an algorithm/data-driven world is that factors known or unknown to you may be driving the price or service. To say it another way, your “expectations” of system behavior may be at odds with how the system will behave.

    The inventory algorithm at my local drugstore thought a recent prescription was too unusual to warrant stocking. My drugstore had to order it from a regional warehouse. Just-in-time inventory I think they call it. That was five (5) days ago. That isn’t “just-in-time” for the customer (me) but that isn’t the goal of most cost/pricing algorithms. Particularly when the customer has little choice about the service.

    I first saw this in a tweet by Kirk Borne.

    Nature: A recap of a successful year in open access, and introducing CC BY as default

    Tuesday, January 27th, 2015

    A recap of a successful year in open access, and introducing CC BY as default by Carrie Calder, the Director of Strategy for Open Research, Nature Publishing Group/Palgrave Macmillan.

    From the post:

    We’re pleased to start 2015 with an announcement that we’re now using Creative Commons Attribution license CC BY 4.0 as default. This will apply to all of the 18 fully open access journals Nature Publishing Group owns, and will also apply to any future titles we launch. Two society- owned titles have introduced CC BY as default today and we expect to expand this in the coming months.

    This follows a transformative 2014 for open access and open research at Nature Publishing Group. We’ve always been supporters of new technologies and open research (for example, we’ve had a liberal self-archiving policy in place for ten years now. In 2013 we had 65 journals with an open access option) but in 2014 we:

    • Built a dedicated team of over 100 people working on Open Research across journals, books, data and author services
    • Conducted research on whether there is an open access citation benefit, and researched authors’ views on OA
    • Introduced the Nature Partner Journal series of high-quality open access journals and announced our first ten NPJs
    • Launched Scientific Data, our first open access publication for Data Descriptors
    • And last but not least switched Nature Communications to open access, creating the first Nature-branded fully open access journal

    We did this not because it was easy (trust us, it wasn’t always) but because we thought it was the right thing to do. And because we don’t just believe in open access; we believe in driving open research forward, and in working with academics, funders and other publishers to do so. It’s obviously making a difference already. In 2013, 38% of our authors chose to publish open access immediately upon publication – in 2014, this percentage rose to 44%. Both Scientific Reports and Nature Communications had record years in terms of submissions for publication.

    Open access is on its way to becoming the expected model for publishing. That isn’t to say that there aren’t economics and kinks to be worked out, but the fundamental principles of open access have been widely accepted.

    Not everywhere of course. There are areas of scholarship that think self-isolation makes them important. They shun open access as an attack on their traditions of “Doctor Fathers” and access to original materials as a privilege. Strategies that make them all the more irrelevant in the modern world. Pity because there is so much they could contribute to the public conversation. But a public conversation means you are not insulated from questions that don’t accept “because I say so” as an adequate answer.

    If you are working in such an area or know of one, press for emulation of Nature and the many other efforts to provide open access to both primary and secondary materials. There are many areas of the humanities that already follow that model, but not all. Let’s keep pressing until open access is the default for all disciplines.

    Kudos to Nature for their ongoing efforts on open access.

    I first saw the news about Nature’s post in a tweet by Ethan White.

    Message of Ayatollah Seyyed Ali Khamenei To the Youth in Europe and North America

    Tuesday, January 27th, 2015

    #LETTER4U Message of Ayatollah Seyyed Ali Khamenei To the Youth in Europe and North America

    Unlike many news sources I will not attempt to analyze this message from Ayatollah Seyyed Ali Khamenei.

    You should read the message for yourself and not rely on the interpretations of others.

    Ayatollah Seyyed Ali Khamenei’s request is an honorable one and should be granted. You will find it an exercise in attempting (one never really succeeds) to understand the context of another. That is one of the key skills in creating topic maps that traverse the contextual boundaries of departments, enterprises, government offices and cultures.

    It isn’t easy to stray from one’s own cultural context but even making the effort is worthwhile.

    Google asked to muzzle Waze ‘police-stalking’ app

    Tuesday, January 27th, 2015

    Google asked to muzzle Waze ‘police-stalking’ app by Lisa Vaas.

    From the post:

    GPS trackers on vehicles; stingray devices to siphon mobile phone IDs and their owners’ locations; gunshot-detection sensors; license plate readers: these are just some of the types of surveillance technologies used by law enforcement, often without warrants.

    Now, US police are protesting the fact that citizens are using technology to track them, and they want Google to pull the plug on it.

    The technology being used to track police – regardless of whether they’re on their lunch break, assisting with a broken-down vehicle on the highway, or hiding in wait to nab speeders – is part of a popular mobile app, Waze, that Google picked up in 2013.

    Don’t you find it interesting that law enforcement has no apparent objection to mass surveillance and stalking of citizens but quickly rallies when effective crowd sourcing creates surveillance of the police?

    Lisa seizes on the highly unusual killing of two New York police officers in Brooklyn last December to make this a police safety issue. Random events are going to happen whether citizens report police locations or not.

    And if we are going to be “data-driven,” the numbers of line of duty deaths for police officers have been going downward for the past three years. 2011 – 171, 2012 – 122, 2013 – 102, so if Waze is having an impact on officer safety, it isn’t showing up in the data. Those numbers were collected by the National Law Enforcement Officers Memorial Fund.
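    If you want to check the arithmetic yourself, here is a minimal sketch that computes the year-over-year change from the three figures quoted above. The variable and function names are my own, and the counts are only the ones cited in this post. A falling count does not prove anything about Waze either way; it only shows that no increase appears in these numbers.

```python
# Line-of-duty deaths as quoted in this post (National Law Enforcement
# Officers Memorial Fund figures); no other years are assumed.
deaths_by_year = {2011: 171, 2012: 122, 2013: 102}


def year_over_year_change(counts):
    """Return the percent change for each year relative to the prior year."""
    years = sorted(counts)
    return {
        later: (counts[later] - counts[earlier]) / counts[earlier] * 100.0
        for earlier, later in zip(years, years[1:])
    }


changes = year_over_year_change(deaths_by_year)
for year, pct in changes.items():
    print(f"{year}: {pct:+.1f}% versus the prior year")
# prints:
# 2012: -28.7% versus the prior year
# 2013: -16.4% versus the prior year
```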

    Shouldn’t policy decisions on surveillance/stalking be driven by data? If you oppose Waze, where is your data?

    Those are simple enough questions.

    Forward this post to your local newspaper and police department. If there is data to oppose Waze, let’s everyone see it.


    PS: You can find more information on Waze at: Not only do I hope Waze keeps posting the location of police officers but I hope they add politicians, bankers, CIA/NSA staff to their map. Effective crowd sourcing may be our only defense against government overreaching. Help keep everyone free with your location contributions.

    Chandra Celebrates the International Year of Light

    Monday, January 26th, 2015

    Chandra Celebrates the International Year of Light by Janet Anderson and Megan Watzke.

    From the webpage:

    The year of 2015 has been declared the International Year of Light (IYL) by the United Nations. Organizations, institutions, and individuals involved in the science and applications of light will be joining together for this yearlong celebration to help spread the word about the wonders of light.

    In many ways, astronomy uses the science of light. By building telescopes that can detect light in its many forms, from radio waves on one end of the “electromagnetic spectrum” to gamma rays on the other, scientists can get a better understanding of the processes at work in the Universe.

    NASA’s Chandra X-ray Observatory explores the Universe in X-rays, a high-energy form of light. By studying X-ray data and comparing them with observations in other types of light, scientists can develop a better understanding of objects like stars and galaxies that generate temperatures of millions of degrees and produce X-rays.

    To recognize the start of IYL, the Chandra X-ray Center is releasing a set of images that combine data from telescopes tuned to different wavelengths of light. From a distant galaxy to the relatively nearby debris field of an exploded star, these images demonstrate the myriad ways that information about the Universe is communicated to us through light.

    Five objects at various distances that have been observed by Chandra

    SNR 0519-69.0: When a massive star exploded in the Large Magellanic Cloud, a satellite galaxy to the Milky Way, it left behind an expanding shell of debris called SNR 0519-69.0. Here, multimillion degree gas is seen in X-rays from Chandra (blue). The outer edge of the explosion (red) and stars in the field of view are seen in visible light from Hubble.

    Cygnus A: This galaxy, at a distance of some 700 million light years, contains a giant bubble filled with hot, X-ray emitting gas detected by Chandra (blue). Radio data from the NSF’s Very Large Array (red) reveal “hot spots” about 300,000 light years out from the center of the galaxy where powerful jets emanating from the galaxy’s supermassive black hole end. Visible light data (yellow) from both Hubble and the DSS complete this view.

    There are more images but one of the reasons I posted about Chandra is that the online news reports I have seen all omitted the most important information of all: Where to find more information!

    At the bottom of this excellent article on Chandra (which also doesn’t appear as a link in the news stories I have read), you will find:

    For more information on “Light: Beyond the Bulb,” visit the website at

    For more information on the International Year of Light, go to

    For more information and related materials, visit:

    For more Chandra images, multimedia and related materials, visit:

    Granted it took a moment or two to insert the hyperlinks but now any child or teacher or anyone else who wants more information can avoid the churn and chum of searching and go directly to the sources for more information.

    That doesn’t detract from my post. On the contrary, I hope that readers find that sort of direct linking to more resources helpful and a reason to return to my site.

    Granted I don’t have advertising and won’t so keeping people at my site is no financial advantage to me. But if I have to trap people into remaining at my site, it must not be a very interesting one. Yes?

    Why Internet Memory Is Important – Auschwitz

    Monday, January 26th, 2015

    After posting a note about Jill Lepore’s essay The Cobweb: Can the Internet be archived?, I found a great example of why memory and sources (like a footnote) are important.

    Today, 26 January 2015, is the 70th anniversary of the liberation of Auschwitz. The Telegraph gave this lead-in to its reprinting of the obituary of Rudolf Vrba:

    Rudolf Vrba escaped from Auschwitz in 1944 and was one of the first people to give first-hand evidence of the gas chambers, mass murder and plans to exterminate a million Jews. Nearly 70 years on from the liberation of the concentration camp, the Telegraph looks back on his legacy

    So horrific was the testimony from Rudolf Vrba, that the members of the Jewish Council in Hungary couldn’t quite believe what they were hearing.

    Vrba and Alfred Wetzler, who escaped with him in April 1944, drew up a detailed plan of Auschwitz and its gas chambers, providing compelling evidence of what had previously been considered embellishment. It has since emerged that reports from inside Auschwitz, compiled by the Polish Underground State and the Polish Government in Exile and written by Jan Karski and Witold Pilecki among others, had in fact reached some Western allies before 1944, but action had not been taken.

    Vrba and Wetzler’s detailed, first-hand report about how Nazis were systematically killing Jews was compiled into the Wetzler-Vrba report and sent shockwaves around the world when it was circulated and picked up by international media in 1944.

    It still took some weeks before the report was accepted and credited after it was written – something that Vrba said had contributed to the deaths of an estimated 50,000 Hungarian Jews. Just weeks before their escape, German forces had invaded Hungary, and Jews there were already being shipped to Auschwitz. It wasn’t until the report made the headlines in international media that Hungary stopped the deportation in July of 1944.

    Ahead of the 70th anniversary of the liberation of Auschwitz on Monday 26th January, here is the Telegraph’s obituary of Vrba, who died in 2006, and is credited for opening the world’s eyes to the horrors of Auschwitz:

    The obituary is very moving but you need to read The Auschwitz Protocol / The Vrba-Wetzler Report to get a true sense of the horror that was Auschwitz.

    The report is all the more chilling because of its lack of hype and its matter-of-fact tone. Quite different from the news we experience every day.

    Remembering an event such as Auschwitz is important, not to relive old wrongs but to attempt to avoid repeating them. Remembering Auschwitz did not prevent any of the bloodiness of the second half of the 20th century. Which, if anything, exceeded the bloodiness of the first half, when famine, drought, disease and human neglect or malice are taken into account.

    But Auschwitz will live on in the memories of survivors and their children. Equally important, it will live on as a well documented event. Dislodging it from the historical record will take more than time.

    Can the same be said about many of the events and reports of events that now live only in digital media? We have done badly enough with revisionist history on actual events (see who defeated Germany). How much worse will we do when “history” can simply disappear? (As much already has from government archives no doubt.)

    Preserving discovery and analysis of the content of archives presumes there are archives to be mined for subjects and relationships between them. Talk to your local librarian about how to best support long term archiving in your organization, locality and national government. The history we lose could well be your own.

    I first saw the basis for this post in Vintage Infodesign [105].

    The Cobweb: Can the Internet be archived?

    Monday, January 26th, 2015

    The Cobweb: Can the Internet be archived? by Jill Lepore.

    From the post:

    Malaysia Airlines Flight 17 took off from Amsterdam at 10:31 A.M. G.M.T. on July 17, 2014, for a twelve-hour flight to Kuala Lumpur. Not much more than three hours later, the plane, a Boeing 777, crashed in a field outside Donetsk, Ukraine. All two hundred and ninety-eight people on board were killed. The plane’s last radio contact was at 1:20 P.M. G.M.T. At 2:50 P.M. G.M.T., Igor Girkin, a Ukrainian separatist leader also known as Strelkov, or someone acting on his behalf, posted a message on VKontakte, a Russian social-media site: “We just downed a plane, an AN-26.” (An Antonov 26 is a Soviet-built military cargo plane.) The post includes links to video of the wreckage of a plane; it appears to be a Boeing 777.

    Two weeks before the crash, Anatol Shmelev, the curator of the Russia and Eurasia collection at the Hoover Institution, at Stanford, had submitted to the Internet Archive, a nonprofit library in California, a list of Ukrainian and Russian Web sites and blogs that ought to be recorded as part of the archive’s Ukraine Conflict collection. Shmelev is one of about a thousand librarians and archivists around the world who identify possible acquisitions for the Internet Archive’s subject collections, which are stored in its Wayback Machine, in San Francisco. Strelkov’s VKontakte page was on Shmelev’s list. “Strelkov is the field commander in Slaviansk and one of the most important figures in the conflict,” Shmelev had written in an e-mail to the Internet Archive on July 1st, and his page “deserves to be recorded twice a day.”

    On July 17th, at 3:22 P.M. G.M.T., the Wayback Machine saved a screenshot of Strelkov’s VKontakte post about downing a plane. Two hours and twenty-two minutes later, Arthur Bright, the Europe editor of the Christian Science Monitor, tweeted a picture of the screenshot, along with the message “Grab of Donetsk militant Strelkov’s claim of downing what appears to have been MH17.” By then, Strelkov’s VKontakte page had already been edited: the claim about shooting down a plane was deleted. The only real evidence of the original claim lies in the Wayback Machine.

    If you aren’t a daily user of the Internet Archive (home of the WayBack Machine) you are missing out on a very useful resource.

    Jill tells the story about the archive, its origins and challenges as well as I have heard it told. Very much worth your time to read.

    Hopefully after reading the story you will find ways to contribute/support the Internet Archive.

    Without the Internet Archive, the memory of the web would be distributed, isolated and in peril of erasure and neglect.

    I am sure many governments and corporations wish the memory of the web could be altered. Let’s disappoint them!

    New Member of the Axis of Evil – Greece

    Monday, January 26th, 2015

    In case you haven’t heard, Greece has a new government, a leftist government. How convenient that CNN ran Add this to Greece’s list of problems: It’s an emerging hub for terrorists today.

    I won’t repeat the bogeyman rumors reported by CNN but suffice it to say that it is a first step towards establishing that Greece can’t control its borders and so is a highway for terrorists.

    It doesn’t take a lot of imagination to realize who might want to “assist” Greece in controlling its borders. Assist as in “insist” on Greece controlling its borders. Should it fail to do so, well, there are always international coalitions willing to assist with such duties.

    The U.S. Dept. of Fear jumped on this yesterday. A great Twitter account to follow if you are interested in the smoke and mirrors that are the illusion of fighting terrorism.

    PS: Tell me, do you know if the Dulles brothers had any grandchildren? You may remember their efforts in Guatemala and Honduras on behalf of United Fruit Company. That included overthrowing governments, etc. Independence from the United States is possible, but ask Vietnam, at what cost?

    Cost To Be A Terrorist Hits Rock Bottom

    Monday, January 26th, 2015

    The cost of being a terrorist has dropped dramatically since 9/11 but it hit a new low when a free Twitter account was used to ground two planes with bomb threats. Both planes were “escorted” by F-16 fighters to safe landings. No bombs were found.

    The @KingZortic twitter account, as of last Saturday, is reported to have eleven (11) tweets, ten of which were threats.

    You can find more details at: F-16s Scrambled to Escort Jets After Twitter Bomb Threat. You can find the same account with varying verbiage at any number of media outlets. I just happened upon that one first.

    Does ten tweets being “credible evidence” tell you something about the confidence of government officials in their airport security systems?

    Being a tweet-literate terrorist allows you to avoid the unpleasantness of terrorism camps, being traced back to such camps, travel expenses, the camp fees and extra charges for ammunition, food, etc.

    No, under no circumstances should you become a tweeting terrorist, but on the other hand, you should not become a terrorist that uses cruise missiles to attack wedding parties either.

    How such activities will be treated depends on your national government and who you are terrorizing.

    Because of the very low bar to being a terrorist (in Atlanta, a free Twitter account; in Paris, easy-to-obtain automatic weapons), or to being declared a terrorist, everyone should back away from terrorism as an instrument of state or near-state policy.

    That includes Western powers that are even now conducting terrorist campaigns in the Middle East. What else would you call it when a cruise missile or bomb kills? It is no less terrorizing than a car bomb or an AK-47. The trite line about trying to avoid civilian casualties is further evidence of Western moral arrogance, deciding who will live and who will die.

    Terrorism and the war on terrorism are equally wasteful of resources, lives and the economies of nations. What other priorities should replace terrorism and the war on terrorism I don’t know. What I do know is that the current efforts for and against terrorism are waste, waste pure and simple.

    Humpty-Dumpty on Being Data-Driven

    Monday, January 26th, 2015

    What’s Hampering Corporate Efforts to be Data-Driven? by Michael Essany.

    Michael summarizes a survey from Teradata that reports:

    • 47% of CEOs, or about half, believe that all employees have access to the data they need, while only 27% of other respondents agree.
    • 43% of CEOs think relevant data are captured and made available in real time, as opposed to 29% of other respondents.
    • CEOs are also more likely to think that employees extract relevant insights from data – 38% of them hold this belief, as compared to 24% among the rest of respondents.
    • 53% of CEOs think data utilization has made decision-making less hierarchical and further empowered employees, as compared to only 36% of the employees themselves.
    • 51% of CEOs believe data availability has improved employee engagement, satisfaction and retention, while only 35% of the rest agree.

    As marketing literature, Teradata’s survey is targeted at laying the failure to become “data-driven” at the door of CEOs.

    But Teradata didn’t ask, or Michael did not report the answer to, several other relevant questions:

    What are the characteristics of a business that can benefit from being “data-driven?” If you are going to promote being “data-driven,” shouldn’t there be data to establish that being “data-driven” benefits a business? Real data, not the hand-wavy PowerPoint slide stuff.

    Who signs the check for the enterprise is a more relevant question than the CEO’s opinion about “data-driven,” IT in general or global warming.

    And as Humpty-Dumpty would say, in a completely different context: “The question is, which is to be master, that’s all!”

    I suppose as marketing glam it’s not bad but not all that impressive either. Data-driven marketing should be based on hard data and case studies with references. Upstairs/downstairs differences in perception hardly qualify as hard data.

    I first saw this in a tweet by Kirk Borne.

    Machine Learning Etudes in Astrophysics: Selection Functions for Mock Cluster Catalogs

    Monday, January 26th, 2015

    Machine Learning Etudes in Astrophysics: Selection Functions for Mock Cluster Catalogs by Amir Hajian, Marcelo Alvarez, J. Richard Bond.


    Making mock simulated catalogs is an important component of astrophysical data analysis. Selection criteria for observed astronomical objects are often too complicated to be derived from first principles. However the existence of an observed group of objects is a well-suited problem for machine learning classification. In this paper we use one-class classifiers to learn the properties of an observed catalog of clusters of galaxies from ROSAT and to pick clusters from mock simulations that resemble the observed ROSAT catalog. We show how this method can be used to study the cross-correlations of thermal Sunya’ev-Zeldovich signals with number density maps of X-ray selected cluster catalogs. The method reduces the bias due to hand-tuning the selection function and is readily scalable to large catalogs with a high-dimensional space of astrophysical features.

    From the introduction:

    In many cases the number of unknown parameters is so large that explicit rules for deriving the selection function do not exist. A sample of the objects does exist (the very objects in the observed catalog) however, and the observed sample can be used to express the rules for the selection function. This “learning from examples” is the main idea behind classification algorithms in machine learning. The problem of selection functions can be re-stated in the statistical machine learning language as: given a set of samples, we would like to detect the soft boundary of that set so as to classify new points as belonging to that set or not. (emphasis added)
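
    The “soft boundary” idea in that passage can be sketched in a few lines of Python. This is not the authors’ method or code. It is a deliberately crude stand-in for a one-class classifier, using only the standard library: scale each feature, measure distance from the mean of the observed samples, and draw the boundary so that most (not all) of the observed samples fall inside. The function names, the two features and the 0.95 cut-off are all my assumptions, made up for illustration.

```python
import math
import random


def fit_one_class(samples, keep_fraction=0.95):
    """Learn a soft boundary around `samples` (a list of equal-length tuples).

    Crude stand-in for a real one-class classifier: scale each feature by
    its standard deviation, measure distance from the sample mean, and set
    the boundary radius so that about `keep_fraction` of the samples fit.
    """
    n, dims = len(samples), len(samples[0])
    means = [sum(s[d] for s in samples) / n for d in range(dims)]
    stds = [
        # Fall back to 1.0 for a constant feature to avoid dividing by zero.
        math.sqrt(sum((s[d] - means[d]) ** 2 for s in samples) / n) or 1.0
        for d in range(dims)
    ]

    def distance(point):
        return math.sqrt(
            sum(((point[d] - means[d]) / stds[d]) ** 2 for d in range(dims))
        )

    radii = sorted(distance(s) for s in samples)
    # "Soft" boundary: a few observed samples are allowed to fall outside.
    radius = radii[min(n - 1, int(keep_fraction * n))]
    return distance, radius


def resembles(model, point):
    """True if `point` falls inside the learned boundary."""
    distance, radius = model
    return distance(point) <= radius


# Hypothetical "observed catalog": two invented features per object.
rng = random.Random(0)
observed = [(rng.gauss(5.0, 1.0), rng.gauss(0.3, 0.05)) for _ in range(500)]
model = fit_one_class(observed)

# Hypothetical "mock" objects: keep only those that resemble the observed set.
mock = [(5.1, 0.31), (4.2, 0.27), (12.0, 0.9)]
selected = [m for m in mock if resembles(model, m)]
```

    Here the first two mock objects sit close to the observed samples and are kept, while the third lies far outside the boundary and is rejected. A real application would swap in a proper one-class classifier and real features. The point is only that the boundary is learned from examples rather than written down as explicit rules.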

    Does the sentence:

    In many cases the number of unknown parameters is so large that explicit rules for deriving the selection function do not exist.

    sound like they could be describing people?

    I mention this as a reason why you should read broadly in machine learning in particular and IR in general.

    What if all the known data about known terrorists, sans all the idle speculation by intelligence analysts, were gathered into a data set? Machine learning on that data set could then be tested against a simulation of potential terrorists, to help avoid the biases of intelligence analysts.

    That might keep the undeserved fixation on Muslims from blinding security services to other potential threats, such as governments bent on devouring their own populations.

    I first saw this in a tweet by Stat.ML.

    Understanding Context

    Sunday, January 25th, 2015

    Understanding Context by Andrew Hinton.

    From the post:

    Technology is destabilizing the way we understand our surroundings. From social identity to ubiquitous mobility, digital information keeps changing what here means, how to get there, and even who we are. Why does software so easily confound our perception and scramble meaning? And how can we make all this complexity still make sense to our users?

    Understanding Context — written by Andrew Hinton of The Understanding Group — offers a powerful toolset for grasping and solving the challenges of contextual ambiguity. By starting with the foundation of how people perceive the world around them, it shows how users touch, navigate, and comprehend environments made of language and pixels, and how we can make those places better.

    Understanding Context is ideal for information architects, user experience professionals, and designers of digital products and services of any scope. If what you create connects one context to another, you need this book.


    Amazon summarizes in part:

    You’ll discover not only how to design for a given context, but also how design participates in making context.

    • Learn how people perceive context when touching and navigating digital environments
    • See how labels, relationships, and rules work as building blocks for context
    • Find out how to make better sense of cross-channel, multi-device products or services
    • Discover how language creates infrastructure in organizations, software, and the Internet of Things
    • Learn models for figuring out the contextual angles of any user experience

    This book is definitely going on my birthday wish list at Amazon. (There, done!)

    I am looking forward to a slow read and, in the meantime, will start looking for items from the bibliography.

    My question, of course, is: after expending all the effort to discover and/or design a context, how do I pass that context on to another?

    To someone coming from a slightly different context? (Assuming always that the designer is “in” a context.)

    From a topic map perspective, what subjects do I need to represent to capture a visual context? Even more difficult, what properties of those subjects do I need to capture to enable their discovery by others? Or to facilitate mapping those subjects to another context/domain?

    Definitely a volume I would assign as reading for a course on topic maps.

    I first saw this in a tweet by subjectcentric.