## Archive for November, 2015

### Connecting Roll Call Votes to Members of Congress (XQuery)

Monday, November 30th, 2015

Apologies for the lack of posting today but I have been trying to connect up roll call votes in the House of Representatives to additional information on members of Congress.

In case you didn’t know, roll call votes are reported in XML and have this form:

<recorded-vote><legislator name-id="A000374" sort-field="Abraham" unaccented-name="Abraham" party="R" state="LA" role="legislator">Abraham</legislator><vote>Aye</vote></recorded-vote>
...
<recorded-vote><legislator name-id="A000371" sort-field="Aguilar" unaccented-name="Aguilar" party="D" state="CA" role="legislator">Aguilar</legislator><vote>Aye</vote></recorded-vote>
...


For a full example: http://clerk.house.gov/evs/2015/roll643.xml

With the name-id attribute value, I can automatically construct URIs to the Biographical Directory of the United States Congress, for example, the entry on Abraham, Ralph.
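For readers who want to try that step outside of XQuery, here is a minimal Python sketch that pulls each name-id out of a roll call fragment and builds a directory URI from it. The URL pattern in `BIOGUIDE_BASE` is an assumption about how the Biographical Directory addresses its entries, the `vote-data` wrapper is only there to make the fragment well-formed, and the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed URL pattern for Biographical Directory entries; verify before use.
BIOGUIDE_BASE = "http://bioguide.congress.gov/scripts/biodisplay.pl?index="

# Two records from the post, wrapped in a root element so they parse.
SAMPLE = """
<vote-data>
<recorded-vote><legislator name-id="A000374" sort-field="Abraham" unaccented-name="Abraham" party="R" state="LA" role="legislator">Abraham</legislator><vote>Aye</vote></recorded-vote>
<recorded-vote><legislator name-id="A000371" sort-field="Aguilar" unaccented-name="Aguilar" party="D" state="CA" role="legislator">Aguilar</legislator><vote>Aye</vote></recorded-vote>
</vote-data>
"""


def bioguide_links(xml_text):
    """Return (name, vote, uri) tuples, one per recorded vote."""
    root = ET.fromstring(xml_text.strip())
    links = []
    for recorded in root.iter("recorded-vote"):
        legislator = recorded.find("legislator")
        uri = BIOGUIDE_BASE + legislator.get("name-id")
        links.append((legislator.text, recorded.findtext("vote"), uri))
    return links


for row in bioguide_links(SAMPLE):
    print(row)
```

The same idea carries over to XQuery directly: select the `name-id` attribute and concatenate it onto the base URI.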

More information than a poke with a sharp stick would give you but it’s only self-serving cant.

One of the things that would be nice to link up with roll call votes would be the homepages of those voting.

Continuing with Ralph Abraham, mapping A000374 to https://abraham.house.gov/ would be helpful in gathering other information, such as the various offices where Representative Abraham can be contacted.

If you are reading the URIs, you might think just prepending the last name of each representative to “house.gov” would be sufficient. Well, it would be, except that there are eighty-three cases where representatives share last names and/or a new naming scheme uses more than the last name + house.gov.

After I was satisfied that there wasn’t a direct mapping between the current uses of name-id and House member websites, I started creating such a mapping that you can drop into XQuery as a lookup table and/or use as an external file.
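To make the shape of such a lookup table concrete, here is a small Python sketch rather than the XQuery version. Only the Abraham entry comes from this post; the names `HOMEPAGES`, `homepage` and `naive_guess` are hypothetical, and a real table would need an entry for every member.

```python
# name-id -> homepage. Only the first entry is taken from the post;
# a complete table would list every current member.
HOMEPAGES = {
    "A000374": "https://abraham.house.gov/",
}


def homepage(name_id):
    """Return the mapped homepage, or None when the id has no entry yet."""
    return HOMEPAGES.get(name_id)


def naive_guess(last_name):
    """The last-name shortcut: fine for unique names, wrong for shared ones."""
    return "https://{}.house.gov/".format(last_name.lower())


print(homepage("A000374"))      # mapped entry
print(homepage("Z999999"))      # made-up id with no entry, so None
print(naive_guess("Abraham"))   # happens to agree with the table here
```

Returning `None` for a missing id, instead of silently falling back to the last-name guess, keeps the shared-name cases visible until someone adds them to the table.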

The lookup table should be finished tomorrow so check back.

PS: Yes, I am aware there are tables of contact information for members of Congress but I have yet to see one that lists all their local offices. Moreover, a lookup table for XQuery may encourage people to connect more data to their representatives. Such as articles in local newspapers, property deeds and other such material.

### Idiomatic Python Resources

Sunday, November 29th, 2015

Idiomatic Python Resources by Andrew Montalenti.

From the post:

Let’s say you’ve just joined my team and want to become an idiomatic Python programmer. Where do you begin?

There are twenty-three resources listed and the benefits of being an idiomatic Python programmer (or an idiomatic programmer in any other language) aren’t limited to employment with Andrew. 😉

One of the advantages to being an idiomatic programmer is that you will be more easily understood by other programmers. Being understood isn’t a bad thing. Really.

Another advantage to being an idiomatic programmer is that it will influence the programmers around you and result in code that is easier for you to understand. Again, understanding isn’t a bad thing.

As if that weren’t enough, perusing the resources that Andrew lists will make you a better programmer overall, which is never a bad thing.

Enjoy!

### arXiv Sanity Preserver

Sunday, November 29th, 2015

arXiv Sanity Preserver by Andrej Karpathy.

From the webpage:

There are way too many arxiv papers, so I wrote a quick webapp that lets you search and sort through the mess in a pretty interface, similar to my pretty conference format.

It’s super hacky and was written in 4 hours. I’ll keep polishing it a bit over time perhaps but it serves its purpose for me already. The code uses Arxiv API to download the most recent papers (as many as you want – I used the last 1100 papers over last 3 months), and then downloads all papers, extracts text, creates tfidf vectors for each paper, and lastly is a flask interface for searching through and filtering similar papers using the vectors.

Main functionality is a search feature, and most useful is that you can click “sort by tfidf similarity to this”, which returns all the most similar papers to that one in terms of tfidf bigrams. I find this quite useful.

You can see this rather remarkable tool online at: https://karpathy23-5000.terminal.com/

Beyond its obvious utility for researchers, this could be used as a framework for experimenting with other similarity measures.
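As a rough illustration of what “sort by tfidf similarity” involves, and of where another similarity measure could be swapped in, here is a standard-library Python sketch. It is not Andrej’s code: the function names, the smoothed idf formula and the three toy “papers” are all assumptions made for the example.

```python
import math
from collections import Counter


def bigrams(text):
    """Split text into lowercase word bigrams."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def tfidf_vectors(docs):
    """Build one sparse tf-idf vector (a dict) per document."""
    counts = [Counter(bigrams(doc)) for doc in docs]
    # Document frequency: in how many documents each bigram appears.
    doc_freq = Counter(term for c in counts for term in c)
    n = len(docs)
    return [
        {term: tf * (math.log((1 + n) / (1 + doc_freq[term])) + 1.0)
         for term, tf in c.items()}
        for c in counts
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(docs, index, similarity=cosine):
    """Rank every other document by similarity to docs[index]."""
    vectors = tfidf_vectors(docs)
    scores = [(similarity(vectors[index], v), i)
              for i, v in enumerate(vectors) if i != index]
    return sorted(scores, reverse=True)


papers = [
    "deep neural networks for image recognition",
    "convolutional neural networks for image classification",
    "topic maps and semantic integration",
]
print(most_similar(papers, 0))
```

Passing a different function as `similarity` is all it takes to experiment with another measure over the same vectors.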

Enjoy!

I first saw this in a tweet by Lynn Cherny.

### The First Draft Toolbox for newsgathering and verification

Saturday, November 28th, 2015

If you are not Donald Trump or some other form of a pathological liar, then you will enjoy: The First Draft Toolbox for newsgathering and verification by Alastair Reid.

From the post:

Welcome to the First Draft Toolbox, a list of tools and sites recommended by the First Draft Coalition to help in social newsgathering, verification and more.

We will be updating the page regularly with new tools as well as more detailed explainers and guides of those listed here already. If you have any suggestions of something we may have missed or are launching a tool you think should be featured here, please let us know by emailing our editor Alastair Reid.

You can also get email alerts for when we update the page using ChangeDetection or other available tools.

So many options can be overwhelming though, and putting them into practice can be daunting when just starting out. The best advice has always been to experiment with everything but find the tools that work for you, and keep up with thought leaders and case studies to see what the experts use and how they use them.

By rough count I make it thirty-eight separate resources for newsgathering and verification. The big categories are: Social newsgathering and search tools, Location checking tools, Source verification, Image verification, YouTube Data Viewer, and Translation.

An impressive collection, several new to me and more than you will probably use at any one time. Try the most needed ones first and then branch out. Over time you will develop favorites and skill at using them.

The one omission that surprised me was Alastair failing to mention Snopes.com.

Snopes.com is one of the premier debunking sites on the WWW. For example:

Undercover Parcel Service No, UPS isn’t smuggling refugees into the United States in the dead of night.

Cetacean Harvestation No, cranberry farmers aren’t netting and canning dolphins during the harvest season.

Does that help explain Donald Trump’s standings in the polls?

Ask not only whether statements are “true,” but also what the speaker has to gain from giving them to you?

### Paris Terrorists and the Kansas City Shuffle

Saturday, November 28th, 2015

The Kansas City shuffle, as described by Mr. Goodkat (Bruce Willis) in Lucky Number Slevin, “is when everybody looks right, you go left.”

The security forces in Paris were victims of a self-inflicted Kansas City shuffle.

Stacy Meichtry and Joshua Robinson detail in: Paris Attacks Plot Was Hatched in Plain Sight how the Paris attackers:

• used their real names
• used their real IDs
• used unencrypted, simple messaging to coordinate

While the Paris attackers were in plain sight, “left,” the Paris security and intelligence services were laboring over electronic debris from innocent civilians, worrying about encrypted messages and other futile and meaningless activities, “right.”

No doubt while preparations for the Paris attack were ongoing, intelligence agencies around the world were laboring to decrypt encrypted messages, mining ever-growing databases composed primarily of electronic debris from innocent civilians, and engaging in other utterly futile and meaningless activities.

Moreover, none, repeat none of the current data mining activities would have identified the Paris terrorists before the attack or have disclosed their plans.

Intelligence agencies have no profile for a terrorist, short of participating in a terrorist attack, and even then, apprehending a known terrorist taxes their capabilities.

Without a useful terrorist profile, all the data mining in the world won’t help intelligence agencies stop terrorist attacks.

If anything, looking “right,” and wasting government funding on more looking “right,” while terrorists go “left,” is as classic a Kansas City shuffle as I can imagine.

Is your police or intelligence agency victimizing itself and you with the Kansas City shuffle?

### Better Than An Erector Set! — The Deep Sweep (2015)

Saturday, November 28th, 2015

The Deep Sweep (2015) High-altitude Signal Research

From the introduction:

The Deep Sweep is an aerospace probe scanning the otherwise out-of-reach signal space between land and stratosphere, with special interest placed in UAV/drone to satellite communication.

Taking the form of a high-altitude weather balloon, tiny embedded computer and RF equipment, The Deep Sweep project is being developed to function as a low-cost, aerial signal-intelligence (SIGINT) platform. Intended for assembly and deployment by public, it enables surveying and studying the vast and often secretive world of signal in our skies.

Two launches have been performed so far, from sites in Germany, landing in Poland and Belarus respectively.

We intend to make many more, in Europe and beyond.

What a cool homebrew project!

Warning: There are legitimate concerns for air safety when performing this type of research. Governments that engage in questionable practices with UAV/drone hardware are unlikely to welcome detection of their nefarious activities.

I liked the notion of bugs (surveillance devices) that “bite” upon discovery in Marooned in Realtime. Depending upon your appetite for risk, you may want to consider such measures in a hostile environment.

The biggest risk of the narrated approach is that you have to physically recover the probe. All sorts of things could go sideways depending on your operating environment.

Still, a good read and quite instructive on what has been done.

Future improvements could include capturing data, injecting data, and taking control of UAV/drone vehicles that are not yours, just to name a few.

Up to you to create what comes next.

### Saxon 9.7 Release!

Saturday, November 28th, 2015

I saw a tweet from Michael Kay announcing that Saxon 9.7 has been released!

Saxon 9.7 is up to date with the XSLT 3.0 Candidate Recommendation released from the W3C November 19, 2015.

From the details page:

• XSLT 3.0 implementation largely complete (requires Saxon-PE or Saxon-EE): The new XSLT 3.0 Candidate Recommendation was published on 19 November 2015, and Saxon 9.7 is a complete implementation with a very small number of exceptions. Apart from general updating of Saxon as the spec has developed, the main area of new functionality is in packaging, which allows stylesheet modules to be independently compiled and distributed, and provides much more “software engineering” control over public and private interfaces, and the like. The ability to save packages in compiled form gives much faster loading of frequently used stylesheets.
• Schema validation: improved error reporting. The schema validator now offers customised error reporting, with an option to create an XML report detailing all validation errors found. This has structured information about each error so the report can readily be customised; it has been developed in conjunction with some of our IDE partners who can use this information to provide an improved interactive presentation of the validation report.
• Arrays, Maps, and JSON: Arrays are implemented as a new data type (defined in XPath 3.1). Along with maps, which were already available in Saxon 9.6, this provides the infrastructure for full support of JSON, including functions such as parse-json() which converts JSON to a structure of maps and arrays, and the JSON serialization method which does the inverse.
• Miscellaneous new functions: two of the most interesting are random-number-generator(), and parse-ietf-date().
• Streaming: further improvements to the set of constructs that can be streamed, and the diagnostics when constructs cannot be streamed.
• Collections: In line with XPath 3.1 changes, a major overhaul of the way collections work. They can now contain any kind of item, and new abstractions are provided to give better control over asynchronous and parallel resource fetching, parsing, and validation.
• Concurrency improvements: Saxon 9.6 already offered various options for executing stylesheets in parallel to take advantage of multi-core processors. These facilities have now been tuned for performance and made more robust, by taking advantage of more advanced concurrency features in the JDK platform. The Saxon NamePool, which could be a performance bottleneck in high throughput workloads, has been completely redesigned to allow much higher concurrency.
• Cost-based optimization: Saxon’s optimizer now makes cost estimates in order to decide the best execution strategy. Although the estimates are crude, they have been found to make a vast difference to the execution speed of some stylesheets. Saxon 9.7 particularly addresses the performance of XSLT pattern matching.

There was no indication of when these features will appear in the free “home edition.”

In the meantime, you can go to the Online Shop for Saxonica.

Currency conversion rates vary but as of today, Saxon-PE (Professional Edition) is about $75 U.S. and some change.

I’m considering treating myself to Saxon-PE as a Christmas present. And you?

### Raspberry Pi Zero — The $5 Tiny Computer is Here [Paper Thoughts?]

Saturday, November 28th, 2015

Raspberry Pi Zero — The $5 Tiny Computer is Here by Swati Khandelwal.

From the post:

Get ready for a ThanksGiving celebration from the Raspberry Pi Foundation.

Raspberry Pi, the charitable foundation behind the United Kingdom’s best-selling computer, has just unveiled its latest wonder – the Raspberry Pi Zero.

Raspberry Pi Zero is a programmable computer that costs just $5 (or £4), may rank as the world’s cheapest computer.

The Raspberry Pi Zero is on sale from today and is also given away with this month’s copy of the Raspberry Pi own magazine MagPi (available at Barnes & Noble and Microcenter).

Do you intend to use your Raspberry Pi Zero, which far exceeds anything available during the early years of atomic bomb development “…as [a] really fast paper emulator?”

The quote is from:

…how the media in which we choose to represent our ideas shape (and too often, limit) what ideas we can have. “We have these things called computers, and we’re basically just using them as really fast paper emulators,” he says. “With the invention of the printing press, we invented a form of knowledge work which meant sitting at a desk, staring at little tiny rectangles and moving your hand a little bit. It used to be those tiny rectangles were papers or books and you’re moving your hand with a pen.

Now we’re staring at computer screens and moving our hands on a keyboard, but it’s basically the same thing. We’re computer users thinking paper thoughts.”

Is the Raspberry Pi Zero going to be where you or your child steps beyond “…paper thoughts?”

Or doing the same activities of yesteryear, only faster?

Enjoy!

### The Utopian UI Architect [the power of representation]

Saturday, November 28th, 2015

Following all the links and projects mentioned in this post will take some time but the concluding paragraph will provide enough incentive:

“The example I like to give is back in the days of Roman numerals, basic multiplication was considered this incredibly technical concept that only official mathematicians could handle,” he continues. “But then once Arabic numerals came around, you could actually do arithmetic on paper, and we found that 7-year-olds can understand multiplication. It’s not that multiplication itself was difficult. It was just that the representation of numbers — the interface — was wrong.”

Imagine that. A change in representation changed multiplication from a professional activity to one for 7-year olds.

Now that is testimony to the power of representation.

What other representations, common logic, RDF, category theory, compilers, etc., are making those activities more difficult than necessary?

There is no known or general answer to that question but Bret Victor’s work may spark clues from others.

I first saw this in a tweet by Max Roser.

### Optimizing Hash-Array Mapped Tries…

Saturday, November 28th, 2015

Adrian’s review of Optimizing Hash-Array Mapped Tries for Fast and Lean Immutable JVM Collections by Steindorfer & Vinju, 2015, starts this way:

You’d think that the collection classes in modern JVM-based languages would be highly efficient at this point in time – and indeed they are. But the wonderful thing is that there always seems to be room for improvement. Today’s paper examines immutable collections on the JVM – in particular, in Scala and Clojure – and highlights a new CHAMPion data structure that offers 1.3-6.7x faster iteration, and 3-25.4x faster equality checking.

CHAMP stands for Compressed Hash-Array Mapped Prefix-tree.

The use of immutable collections is on the rise…

Immutable collections are a specific area most relevant to functional/object-oriented programming such as practiced by Scala and Clojure programmers. With the advance of functional language constructs in Java 8 and functional APIs such as the stream processing API, immutable collections become more relevant to Java as well. Immutability for collections has a number of benefits: it implies referential transparency without giving up on sharing data; it satisfies safety requirements for having co-variant sub-types; it allows to safely share data in presence of concurrency.

Both Scala and Clojure use a Hash-Array Mapped Trie (HAMT) data structure for immutable collections. The HAMT data structure was originally developed by Bagwell in C/C++. It becomes less efficient when ported to the JVM due to the lack of control over memory layout and the extra indirection caused by arrays also being objects. This paper is all about the quest for an efficient JVM-based derivative of HAMTs.

Fine-tuning data structures for cache locality usually improves their runtime performance. However, HAMTs inherently feature many memory indirections due to their tree-based nature, notably when compared to array-based data structures such as hashtables. Therefore HAMTs presents an optimization challenge on the JVM. Our goal is to optimize HAMT-based data structures such that they become a strong competitor of their optimized array-based counterparts in terms of speed and memory footprints.

Adrian had me at: “a new CHAMPion data structure that offers 1.3-6.7x faster iteration, and 3-25.4x faster equality checking.”

If you want experience with the proposed data structures, the authors have implemented them in the Rascal Metaprogramming Language.
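If the bitmap trick at the heart of a HAMT is new to you, the following Python sketch shows a single node. It illustrates the classic bitmap-indexed node only, not the CHAMP layout from the paper, and the class and function names are made up for the example.

```python
class BitmapNode:
    """One HAMT-style node: a 32-bit bitmap plus a dense tuple of slots."""

    def __init__(self, bitmap=0, slots=()):
        self.bitmap = bitmap
        self.slots = slots

    def _index(self, chunk):
        # The number of set bits below this chunk's bit is its position
        # in the dense slots tuple.
        bit = 1 << chunk
        return bin(self.bitmap & (bit - 1)).count("1")

    def get(self, chunk, default=None):
        """Return the value stored for a 5-bit chunk, if any."""
        if self.bitmap & (1 << chunk):
            return self.slots[self._index(chunk)]
        return default

    def with_slot(self, chunk, value):
        """Return a new node with the slot set; this node is unchanged."""
        bit = 1 << chunk
        i = self._index(chunk)
        if self.bitmap & bit:
            slots = self.slots[:i] + (value,) + self.slots[i + 1:]
        else:
            slots = self.slots[:i] + (value,) + self.slots[i:]
        return BitmapNode(self.bitmap | bit, slots)


def chunk_of(key_hash, level, bits=5):
    """Take the slice of the hash that selects a slot at a trie level."""
    return (key_hash >> (level * bits)) & ((1 << bits) - 1)


empty = BitmapNode()
node = empty.with_slot(3, "a").with_slot(17, "b").with_slot(1, "c")
print(node.slots, node.get(17), empty.slots)
```

Only occupied slots take up space, and every update returns a fresh node, which is what lets immutable collections share structure between versions.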

I first saw this in a tweet by Atabey Kaygun.

### Docker and Jupyter [Advantages over VMware or VirtualBox?]

Saturday, November 28th, 2015

From the post:

Configuring a data science environment can be a pain. Dealing with inconsistent package versions, having to dive through obscure error messages, and having to wait hours for packages to compile can be frustrating. This makes it hard to get started with data science in the first place, and is a completely arbitrary barrier to entry.

The past few years have seen the rise of technologies that help with this by creating isolated environments. We’ll be exploring one in particular, Docker. Docker makes it fast and easy to create new data science environments, and use tools such as Jupyter notebooks to explore your data.

With Docker, we can download an image file that contains a set of packages and data science tools. We can then boot up a data science environment using this image within seconds, without the need to manually install packages or wait around. This environment is called a Docker container. Containers eliminate configuration problems – when you start a Docker container, it has a known good state, and all the packages work properly.

A nice walk through on installing a Docker container and Jupyter. I do wonder about the advantages claimed over VMware and VirtualBox:

Although virtual machines enable Linux development to take place on Windows, for example, they have some downsides. Virtual machines take a long time to boot up, they require significant system resources, and it’s hard to create a virtual machine from an image, install some packages, and then create another image. Linux containers solve this problem by enabling multiple isolated environments to run on a single machine. Think of containers as a faster, easier way to get started with virtual machines.

I have never noticed long boot times on VirtualBox and “require significant system resources” is too vague to evaluate.

As far as “it’s hard to create a virtual machine from an image, install some packages, and then create another image,” I thought the point of the post was to facilitate quick access to a data science environment?

In that case, I would download an image of my choosing, import it into VirtualBox and then fire it up. How hard is that?

There are pre-configured images with Solr, Solr plus web search engines, and a host of other options.

For more details, visit VirtualBox.org and for a stunning group of “appliances” see VirtualBoxImages.com.

You can use VMs with Docker so it isn’t strictly an either/or choice.

I first saw this in a tweet by Data Science Renee.

Update: Data Science Renee encountered numerous issues trying to follow this install on Windows 7 Professional 64-bit, using VirtualBox 5.0.10 r104061. You can read more about her travails here: Trouble setting up default, maybe caused by virtualbox. After 2 nights of effort, she succeeded! Great!

Error turned out to (apparently) be in VirtualBox. Or at least upgrading to a test version of VirtualBox fixed the problem. I know, I was surprised too. My assumption was that it was Windows. 😉

### Hello Barbie (Hello NSA) [Barbie Spy edition]

Friday, November 27th, 2015

From the post:

Mattel’s “Hello Barbie” is one of the hottest toys this holiday season, but researchers warn that a security flaw that affects the Wi-Fi-enabled doll is capable of quickly turning Christmas into the creepiest time of the year.

Retailing for about $75, the “Hello Barbie” is perhaps the most advanced action figure on the market: between being Wi-Fi-ready and equipped with speech recognition technology, Mattel claims the doll “can interact uniquely with each child by holding conversations, playing games, sharing stories and even telling jokes.”

Take every occasion to teach your children to be cybersecurity aware. The new ‘Hello Barbie’ toy is the latest in such occasions.

The moral here is that anything you say out loud, even to a seemingly innocent doll, can be captured and used by those who intend you ill.

Watch for post-Christmas stories of holiday “activities” captured by rogue ‘Hello Barbie’ toys.

Who would have thought Americans would pay for the privilege of bugging their own homes? Go figure.

Update: As of 17:00 UTC on November 29, 2015, a popular search engine reports 9,020 “hits” on ‘hijack “hello barbie”‘. No sales figures have been reported as of yet.

### Best Paper Awards in Computer Science (2014)

Friday, November 27th, 2015

Best Paper Awards in Computer Science (2014)

From the webpage:

Jeff Huang’s list of the best paper awards from 29 CS conferences since 1996 up to and including 2014.

I saw a tweet about Jeff’s site being updated to include papers from 2014.

If you are looking for reading material in a particular field, this is a good place to start.

For a complete list of the organizations and conferences, with abbreviations expanded, see: Best Paper Awards in Computer Science (2013). None of them have changed so I didn’t see the point of repeating them.

### UK – Investigatory Powers Bill – Volunteer Targets

Friday, November 27th, 2015

I saw a tweet earlier today that indicates the drafters of the UK Investigatory Powers Bill have fouled themselves, again.
Section 195, General Definitions (1) has a list of unnumbered definitions which includes:

“data” includes any information which is not data,

However creative the English courts may be, I think that passage is going to prove to be a real challenge.

Which makes me even more worried than I was before.

A cleanly drafted bill that strips every citizen of the UK of their rights presents a well-defined target for opposition.

In this semantic morass, terms could mean what they say, the opposite and also be slang for a means of execution.

Because of the Paris bombings, there is a push on to approve something, anything, to be seen as taking steps against terrorism.

Instead of the Investigatory Powers Bill, Parliament should acquire 5 acres of land outside of London and erect a podium at its center. Members of Parliament will take turns reading Shakespeare aloud for two hours, eight hours a day, every day of the year.

Terrorists prefer high-value targets over low and so members of Parliament can save all the people of the UK from fearing terrorist attacks. Their presence as targets will attract terrorists and simplify the task of locating potential terrorists.

Any member of Parliament who is killed while reading Shakespeare at the designated location should be posthumously made a peer of the realm.

A bill like that would protect the rights of every citizen of the UK, assist in the hunting of terrorists by drawing them to a common location and help prevent future crimes against the English language as are found in the Investigatory Powers Bill.

What’s there not to like?

### What Should the Media Do When Donald Trump Blatantly Lies? [Try Not Reporting Lies]

Thursday, November 26th, 2015

From the post:

Political speech is a unique animal, especially during election season. It often mixes hyperbole with flowery language and aggressive rhetoric designed to inflame a particular passion. But Republican presidential candidate Donald Trump is arguably in a category unto himself.
More than almost any other 2016 candidate, he is prone to telling flat-out lies, making up facts, and distorting the truth to a prodigious extent.

This kind of behavior creates a tricky problem for the press. How should media companies deal with Trump and his falsehoods? If he were just a joke candidate without a hope of ever being the Republican nominee, it would be easy enough to ignore him. But he appears to stand a better than even chance of getting the nomination — he has been leading in the polls for months.

If media outlets attack Trump’s lying directly, they run the risk of being accused of bias by his supporters and Republicans in general. In fact, that kind of reaction is already occurring in response to a New York Times editorial that accused the billionaire businessman of playing fast and loose with the truth on a number of issues, including whether Muslims in New Jersey cheered the Sept. 11, 2001 terrorist attacks.

Part of the problem is that Trump and his candidacy are to some extent a creation of the mainstream media. At the very least, the two have developed a disturbingly co-dependent relationship.

As disturbing as the article is on media coverage of lies by Donald Trump, the crux of the dilemma was put this way:

since the U.S. news media is based on the commercial model—and more eyeballs on the page or the screen is good for business—the networks love it when someone like Donald Trump says outrageous stuff. Fact-checking rains on the parade of that revenue model.

Perhaps news rooms need a new version of First they came for:

First Trump lied about the refugees, and I reported it—
Because I was not a refugee.

Then Trump lied about blacks, and I reported it—
Because I was not black.

Then Trump lied about Jews, and I reported it—
Because I was not a Jew.

Donald Trump lied his way into the White House, and I made it possible-
Because fact checking conflicted with the bottom-line.
When I think about journalists who risk their lives reporting on drug cartels and violent governments, I wonder what they must think of the moral cowardice of political coverage in the United States?

### MagSpoof – credit card/magstripe spoofer [In Time For Black Friday]

Wednesday, November 25th, 2015

MagSpoof – credit card/magstripe spoofer by Samy Kamkar.

From the webpage:

• Allows you to store all of your credit cards and magstripes in one device
• Works on traditional magstripe readers wirelessly (no NFC/RFID required)
• Can disable Chip-and-PIN (code not included)
• Correctly predicts Amex credit card numbers + expirations from previous card number (code not included)
• Supports all three magnetic stripe tracks, and even supports Track 1+2 simultaneously
• Easy to build using Arduino or other common parts

MagSpoof is a device that can spoof/emulate any magnetic stripe or credit card. It can work “wirelessly”, even on standard magstripe/credit card readers, by generating a strong electromagnetic field that emulates a traditional magnetic stripe card.

Note: MagSpoof does not enable you to use credit cards that you are not legally authorized to use. The Chip-and-PIN and Amex information is not implemented and using MagSpoof requires you to have/own the magstripes that you wish to emulate. Simply having a credit card number and expiration is not enough to perform transactions. MagSpoof does allow you to perform research in other areas of magstripes, microcontrollers, and electromagnetism, as well as learn about and create your own devices similar to other existing, commercial technologies such as Samsung MST and Coin.

Non-legal use of MagSpoof is left as an exercise for the reader.

I first saw this in Four Short Links: 25 November 2015 by Nat Torkington.
### Quantum Walks with Gremlin [Graph Day, Austin]

Wednesday, November 25th, 2015

Abstract:

A quantum walk places a traverser into a superposition of both graph location and traversal “spin.” The walk is defined by an initial condition, an evolution determined by a unitary coin/shift-operator, and a measurement based on the sampling of the probability distribution generated from the quantum wavefunction. Simple quantum walks are studied analytically, but for large graph structures with complex topologies, numerical solutions are typically required. For the quantum theorist, the Gremlin graph traversal machine and language can be used for the numerical analysis of quantum walks on such structures. Additionally, for the graph theorist, the adoption of quantum walk principles can transform what are currently side-effect laden traversals into pure, stateless functional flows. This is true even when the constraints of quantum mechanics are not fully respected (e.g. reversible and unitary evolution). In sum, Gremlin allows both types of theorist to leverage each other’s constructs for the advancement of their respective disciplines.

Best not to tackle this new paper on Gremlin and quantum graph walks after a heavy meal. 😉

Marko will be presenting at Graph Day, 17 January 2016, Austin, Texas. Great opportunity to hear him speak along with other cutting edge graph folks.

The walk Marko describes is located in a Hilbert space. Understandable because numerical solutions require the use of a metric space.

However, if you are modeling semantics in different universes of discourse, realize that semantics don’t possess metric spaces. Semantics lie outside of metric space, although I concede that many have imposed varying arbitrary metrics on semantics.

For example, if I am mapping the English term for “black,” as in a color, to the term “schwarz” in German, I need a “traverser” that enables the existence of both terms at separate locations, one for each universe in the graph.
You may protest that is overly complex for the representation of synonyms, but consider that “schwarz” occupies a different location in the universe of German and etymology from “black.”

For advertising, subtleties of language may not be useful, but for reading medical or technical works, an “approximate” or “almost right” meaning may be more damaging than helpful.

Who knows? Perhaps quantum computers will come closer to modeling semantics across domains better than any computer to date. Not perfectly but closer.

### Apple Watches Lowers Your IQ – Still Want One For Christmas?

Wednesday, November 25th, 2015

The vast majority of those uses are not to check the time.

The reports Philip summarizes say that interactions last only a few seconds but how long does it take to break your train of thought?

Which reminded me of Vanessa Loder’s post: Why Multi-Tasking Is Worse Than Marijuana For Your IQ.

From Vanessa’s post:

What makes you more stupid – smoking marijuana, emailing while talking on the phone or losing a night’s sleep?

Researchers at the Institute of Psychiatry at the University of London studied 1,100 workers at a British company and found that multitasking with electronic media caused a greater decrease in IQ than smoking pot or losing a night’s sleep.

For those of you in Colorado, this means you should put down your phone and pick up your pipe! In all seriousness, in today’s tech heavy world, the temptation to multi-task is higher than it’s ever been. And this has become a major issue. We don’t focus and we do too many things at once. We also aren’t efficient or effective when we stay seated too long.

If a colleague gives you an Apple Watch for Christmas, be very wary.

Apple is likely to complain that my meta-comparison isn’t the same as a controlled study and I have to admit, it’s not.

If Apple wants to get one hundred people together for about a month, with enough weed, beer, snack food, PS4s, plus Apple Watches, my meta-analysis can be put to the test.
The Consumer Product Safety Commission should sponsor that type of testing. Imagine, being a professional stoner. 😉

### Need a Bigoted, Racist Uncle for Holiday Meal?

Wednesday, November 25th, 2015

I don’t know why uncles are always singled out as being racist and bigoted when difficult holiday meals are discussed but they are.

It is “Thanksgiving” in the United States on the fourth Thursday in November, to mark the beginning of the first and only known case where immigrants took over a country and butchered almost all of its inhabitants.

The idea came too late for a marketable product this year, but would you be interested in an app that substitutes for having a bigoted, racist uncle for a holiday meal?

You know that all 1.6 billion comments on Reddit are available for download. (October 2007 to May 2015, + updates)

Some design issues for an app by Thanksgiving next year:

• List of topics? Top 20? Top 50? Search?
• Number of comments for each topic?
• Generated voice?
• Random or timed delivery?
• Choose a side?
• Other features?

It will be as close to having a bigoted, racist aunt/uncle of your own at the table as technically possible.

All suggestions and comments welcome!

PS: No “trigger” warnings.

### Cassini-Tools (for astronomers on your gift list)

Wednesday, November 25th, 2015

Cassini-Tools by Jon Keegan.

Code for imagery and metadata from the Cassini space probe’s ISS cameras.

From the NASA mission description:

Cassini completed its initial four-year mission to explore the Saturn System in June 2008 and the first extended mission, called the Cassini Equinox Mission, in September 2010. Now, the healthy spacecraft is seeking to make exciting new discoveries in a second extended mission called the Cassini Solstice Mission.

The mission’s extension, which goes through September 2017, is named for the Saturnian summer solstice occurring in May 2017. The northern summer solstice marks the beginning of summer in the northern hemisphere and winter in the southern hemisphere.
Since Cassini arrived at Saturn just after the planet’s northern winter solstice, the extension will allow for the first study of a complete seasonal period.

Cassini launched in October 1997 with the European Space Agency’s Huygens probe. The probe was equipped with six instruments to study Titan, Saturn’s largest moon. It landed on Titan’s surface on Jan. 14, 2005, and returned spectacular results.

Meanwhile, Cassini’s 12 instruments have returned a daily stream of data from Saturn’s system since arriving at Saturn in 2004.

Among the most important targets of the mission are the moons Titan and Enceladus, as well as some of Saturn’s other icy moons. Towards the end of the mission, Cassini will make closer studies of the planet and its rings.

The best recommendation for the Cassini-Tools is the Meanwhile, Near Saturn… 11 Years of Cassini Saturn Photos site by Jon Keegan.

Eleven years’ worth of images and other data should keep your astronomer friend busy for a while. 😉

### The Limitations of Deep Learning in Adversarial Settings [The other type of setting would be?]

Tuesday, November 24th, 2015

Abstract:

Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs.
In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.

I recommended deep learning for parsing lesser known languages earlier today. The utility of deep learning isn’t in doubt, but its vulnerability to “adversarial” input should give us pause.

Adversarial input isn’t likely to be labeled as such. In fact, it may be concealed in ordinary open data that is freely available for download.

As the authors note, the more prevalent deep learning becomes, the greater the incentive for the manipulation of input into a deep neural network (DNN).

Although phrased as “adversaries,” the manipulation of input into DNNs isn’t limited to the implied “bad actors.” The choice or “cleaning” of input could be considered manipulation of input, from a certain point of view.

This paper is notice that input into a DNN is as important in evaluating its results as any other factor, if not more so.

Or to put it more bluntly, no disclosure of DNN data = no trust of DNN results.

### Graphical Linear Algebra

Tuesday, November 24th, 2015

Graphical Linear Algebra by Pawel Sobocinski.

From Episode 1, Makélélé and Linear Algebra.

Linear algebra is the Claude Makélélé of science and mathematics. Makélélé is a well-known, retired football player, a French international. He played in the famous Real Madrid team of the early 2000s. That team was full of “galácticos”, the most famous and glamorous players of their generation.
Players like Zidane, Figo, Ronaldo and Roberto Carlos. Makélélé was hardly ever in the spotlight, he was paid less than his more celebrated colleagues and was frequently criticised by fans and journalists. His style of playing wasn’t glamorous. To the casual fan, there wasn’t much to get excited about: he didn’t score goals, he played boring, unimaginative, short sideways passes, he hardly ever featured in match highlights. In 2003 he signed for Chelsea for relatively little money, and many Madrid fans cheered. But their team started losing matches.

The importance of Makélélé’s role was difficult to appreciate for the non-specialist. But football insiders regularly described him as the work-horse, the engine room, the battery of the team. He sat deep in midfield, was always in the right place to disrupt opposition attacks, recovered possession, and got the ball out quickly to his teammates, turning defence into attack. Without Makélélé, the galácticos didn’t look quite so galactic.

Similarly, linear algebra does not get very much time in the spotlight. But many galáctico subjects of modern scientific research: e.g. artificial intelligence and machine learning, control theory, solving systems of differential equations, computer graphics, “big data”, and even quantum computing have a dirty secret: their engine rooms are powered by linear algebra.

Linear algebra is not very glamorous. It is normally taught to science undergraduates in their first year, to prepare for the more exciting stuff ahead. It is background knowledge. Everyone has to learn what a matrix is, and how to add and multiply matrices.

I have only read the first three or four posts but Pawel’s posts look like a good way to refresh or acquire a “background” in linear algebra.

Math is important for “big data” and as Renee Teate reminded us in A Challenge to Data Scientists, bias can be lurking anywhere, data, algorithms, us, etc.
Or as I am fond of saying, “if you let me pick the data or the algorithm, I can produce a specified result, every time.”

Bear that in mind when someone tries to hurry past your questions about data, its acquisition, processing before you saw it, and/or wanting to know the details of an algorithm and how it was applied.

There’s a reason why people want to gloss over such matters and the answer isn’t a happy one, at least from the questioner’s perspective.

Refresh or get a background in linear algebra! The more you know, the less vulnerable you will be to manipulation and/or fraud.

I first saw this in a tweet by Algebra Fact.

### 20 Years of GIMP, release of GIMP 2.8.16 [Happy Anniversary GIMP!]

Tuesday, November 24th, 2015

20 Years of GIMP, release of GIMP 2.8.16

From the post:

This week the GIMP project celebrates its 20th anniversary.

Back in 1995, University of California students, Peter Mattis and Spencer Kimball, were members of the eXperimental Computing Facility, a Berkeley campus organization of undergraduate students enthusiastic about computers and programming. In June of that year, the two hinted at their intentions to write a free graphical image manipulation program as a means of giving back to the free software community.

On November 21st, 20 years ago today, Peter Mattis announced the availability of the “General Image Manipulation Program” on Usenet (later on, the acronym would be redefined to stand for the “GNU Image Manipulation Program”).

Drop by the GIMP homepage and grab a copy of GIMP 2.8.16 to celebrate!

Enjoy!

### XQuery and XPath Full Text 3.0 (Recommendation)

Tuesday, November 24th, 2015

XQuery and XPath Full Text 3.0

From 1.1 Full-Text Search and XML:

As XML becomes mainstream, users expect to be able to search their XML documents. This requires a standard way to do full-text search, as well as structured searches, against XML documents. A similar requirement for full-text search led ISO to define the SQL/MM-FT [SQL/MM] standard.
SQL/MM-FT defines extensions to SQL to express full-text searches providing functionality similar to that defined in this full-text language extension to XQuery 3.0 and XPath 3.0.

XML documents may contain highly structured data (fixed schemas, known types such as numbers, dates), semi-structured data (flexible schemas and types), markup data (text with embedded tags), and unstructured data (untagged free-flowing text). Where a document contains unstructured or semi-structured data, it is important to be able to search using Information Retrieval techniques such as scoring and weighting.

Full-text search is different from substring search in many ways:

1. A full-text search searches for tokens and phrases rather than substrings. A substring search for news items that contain the string “lease” will return a news item that contains “Foobar Corporation releases version 20.9 …”. A full-text search for the token “lease” will not.
2. There is an expectation that a full-text search will support language-based searches which substring search cannot. An example of a language-based search is “find me all the news items that contain a token with the same linguistic stem as ‘mouse’” (finds “mouse” and “mice”). Another example based on token proximity is “find me all the news items that contain the tokens ‘XML’ and ‘Query’ allowing up to 3 intervening tokens”.
3. Full-text search must address the vagaries and nuances of language. Search results are often of varying usefulness. When you search a web site for cameras that cost less than $100, this is an exact search. There is a set of cameras that matches this search, and a set that does not. Similarly, when you do a string search across news items for “mouse”, there is only 1 expected result set. When you do a full-text search for all the news items that contain the token “mouse”, you probably expect to find news items containing the token “mice”, and possibly “rodents”, or possibly “computers”. Not all results are equal.
Some results are more “mousey” than others. Because full-text search may be inexact, we have the notion of score or relevance. We generally expect to see the most relevant results at the top of the results list.

Note:

As XQuery and XPath evolve, they may apply the notion of score to querying structured data. For example, when making travel plans or shopping for cameras, it is sometimes useful to get an ordered list of near matches in addition to exact matches. If XQuery and XPath define a generalized inexact match, we expect XQuery and XPath to utilize the scoring framework provided by XQuery and XPath Full Text 3.0.

[Definition: Full-text queries are performed on tokens and phrases. Tokens and phrases are produced via tokenization.] Informally, tokenization breaks a character string into a sequence of tokens, units of punctuation, and spaces.

Tokenization, in general terms, is the process of converting a text string into smaller units that are used in query processing. Those units, called tokens, are the most basic text units that a full-text search can refer to. Full-text operators typically work on sequences of tokens found in the target text of a search. These tokens are characterized by integers that capture the relative position(s) of the token inside the string, the relative position(s) of the sentence containing the token, and the relative position(s) of the paragraph containing the token. The positions typically comprise a start and an end position.

Tokenization, including the definition of the term “tokens”, SHOULD be implementation-defined. Implementations SHOULD expose the rules and sample results of tokenization as much as possible to enable users to predict and interpret the results of tokenization. Tokenization operates on the string value of an item; for element nodes this does not include the content of attribute nodes, but for attribute nodes it does. Tokenization is defined more formally in 4.1 Tokenization.

[Definition: A token is a non-empty sequence of characters returned by a tokenizer as a basic unit to be searched. Beyond that, tokens are implementation-defined.] [Definition: A phrase is an ordered sequence of any number of tokens. Beyond that, phrases are implementation-defined.]
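To make the quoted description concrete, here is a minimal Python sketch of one possible tokenizer. It is not the tokenizer of any XQuery Full Text implementation: the splitting rules, the `Token` fields and the helper `contains_token` are all assumptions made for illustration, and each position is a single integer rather than the start/end pair the specification mentions. It also replays the “lease” versus “releases” contrast between substring and token search.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str        # normalized (lower-cased) token text
    position: int    # 1-based position of the token in the whole string
    sentence: int    # 1-based position of the containing sentence
    paragraph: int   # 1-based position of the containing paragraph


def tokenize(text: str) -> list[Token]:
    """Split text into word tokens with token, sentence and paragraph positions.

    The rules here (blank line = paragraph break, ./!/? = sentence end,
    \\w+ = token) are illustrative choices, not mandated by the specification.
    """
    tokens: list[Token] = []
    position = 0
    sentence = 0
    for para_no, para in enumerate(re.split(r"\n\s*\n", text.strip()), start=1):
        # Naive sentence split: break after ., ! or ? when followed by whitespace.
        for sent in re.split(r"(?<=[.!?])\s+", para.strip()):
            if not sent:
                continue
            sentence += 1
            for word in re.findall(r"\w+", sent):
                position += 1
                tokens.append(Token(word.lower(), position, sentence, para_no))
    return tokens


def contains_token(text: str, word: str) -> bool:
    """Full-text style match: the word must equal a whole token."""
    return any(t.text == word.lower() for t in tokenize(text))


news = "Foobar Corporation releases version 20.9 today.\n\nThe lease was signed."
first_paragraph = news.split("\n\n")[0]

print("lease" in first_paragraph)                # substring search matches "releases"
print(contains_token(first_paragraph, "lease"))  # token search does not
print(tokenize(news)[-3:])                       # tokens from the second paragraph
```

Because tokens are implementation-defined, a different tokenizer (for example, one that keeps “20.9” as a single token) would be equally valid; the point is only that queries run against the token sequence, not the raw string.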

Not a fast read but a welcome one!

XQuery and XPath increase the value of all XML-encoded documents, at least down to the level of their markup. Beyond nodes, you are on your own.

XQuery and XPath Full Text 3.0 extend XQuery and XPath beyond existing markup in documents. Content that was too expensive or simply not of enough interest to encode, can still be reached in a robust and reliable way.

If you can “see” it with your computer, you can annotate it.

You might have to possess a copy of the copyrighted content, but still, it isn’t a closed box that resists annotation. Enabling you to sell the annotation as a value-add to the copyrighted content.

XQuery and XPath Full Text 3.0 says token and phrase are implementation defined.

Imagine the “(user name) commented” version of X movie: a driver file that has XQuery links into the DVD playing on your computer (or rather, into its data stream).

I rather like that idea.

PS: Check with a lawyer before you commercialize that annotation idea. I am not familiar with all EULAs and national laws.

### What’s the significance of 0.05 significance?

Tuesday, November 24th, 2015

From the post:

Why do we tend to use a statistical significance level of 0.05? When I teach statistics or mentor colleagues brushing up, I often get the sense that a statistical significance level of α = 0.05 is viewed as some hard and fast threshold, a publishable / not publishable step function. I’ve seen grad students finish up an empirical experiment and groan to find that p = 0.052. Depressed, they head for the pub. I’ve seen the same grad students extend their experiment just long enough for statistical variation to swing in their favor to obtain p = 0.049. Happy, they head for the pub.

Clearly, 0.05 is not the only significance level used. 0.1, 0.01 and some smaller values are common too. This is partly related to field. In my experience, the ecological literature and other fields that are often plagued by small sample sizes are more likely to use 0.1. Engineering and manufacturing where larger samples are easier to obtain tend to use 0.01. Most people in most fields, however, use 0.05. It is indeed the default value in most statistical software applications.

This “standard” 0.05 level is typically associated with Sir R. A. Fisher, a brilliant biologist and statistician that pioneered many areas of statistics, including ANOVA and experimental design. However, the true origins make for a much richer story.
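The grad-student anecdote in the quoted post (extend the experiment until p dips under 0.05) can be simulated. The sketch below is an illustration under stated assumptions, not anything from the post: it draws data with a true mean of zero, applies a two-sided z-test with known standard deviation 1, and compares testing once at the planned sample size against checking at every tenth observation. The function names and the parameters (`start_n`, `step`, `max_n`) are hypothetical.

```python
import math
import random


def two_sided_p(sample_sum: float, n: int) -> float:
    """Two-sided p-value of a z-test for mean 0 with known standard deviation 1."""
    z = sample_sum / math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(runs: int = 2000, start_n: int = 20, step: int = 10,
             max_n: int = 200, alpha: float = 0.05,
             seed: int = 1) -> tuple[float, float]:
    """Return (fixed-sample rate, peeking rate) of 'significant' results.

    The null hypothesis is true in every run, so each 'significant'
    result counted here is a false positive.
    """
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(runs):
        total = 0.0
        looks = []  # p-values at each interim look, ending with the look at max_n
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)
            if n >= start_n and (n - start_n) % step == 0:
                looks.append(two_sided_p(total, n))
        fixed_hits += looks[-1] < alpha    # test once, at the planned sample size
        peeking_hits += min(looks) < alpha  # stop at the first look that "works"
    return fixed_hits / runs, peeking_hits / runs


fixed, peeking = simulate()
print(f"false positives, single test at n=200: {fixed:.3f}")
print(f"false positives, stopping when p < 0.05: {peeking:.3f}")
```

The single test should land near the nominal 5%, while stopping at the first favorable look can only match or exceed it, which is why p = 0.049 obtained that way means less than it appears to.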

One of the best history/explanations of 0.05 significance I have ever read. Highly recommended!

In part because in the retelling of this story Carl includes references that will allow you to trace the story in even greater detail.

What is dogma today, 0.05 significance, started as a convention among scientists, without theory, without empirical proof, without any of the gatekeepers associated with scientific publishing today.

Over time 0.05 significance has proved its utility. The question for you is what other dogmas of today rely on the chance practices of yesteryear?

I first saw this in a tweet by Kirk Borne.

### Cancel Thanksgiving and Christmas Travel Plans (U.S. State Department)

Monday, November 23rd, 2015

The State Department has issued a “Worldwide Travel Alert” in effect from November 23, 2015 until February 24, 2016.

This is not a joke, or at least the State Department doesn’t consider it to be a joke.

The State Department alerts U.S. citizens to possible risks of travel due to increased terrorist threats. Current information suggests that ISIL (aka Da’esh), al-Qa’ida, Boko Haram, and other terrorist groups continue to plan terrorist attacks in multiple regions. These attacks may employ a wide variety of tactics, using conventional and non-conventional weapons and targeting both official and private interests. This Travel Alert expires on February 24, 2016.

Authorities believe the likelihood of terror attacks will continue as members of ISIL/Da’esh return from Syria and Iraq. Additionally, there is a continuing threat from unaffiliated persons planning attacks inspired by major terrorist organizations but conducted on an individual basis. Extremists have targeted large sporting events, theatres, open markets, and aviation services. In the past year, there have been multiple attacks in France, Nigeria, Denmark, Turkey, and Mali. ISIL/Da’esh has claimed responsibility for the bombing of a Russian airliner in Egypt.

U.S. citizens should exercise vigilance when in public places or using transportation. Be aware of immediate surroundings and avoid large crowds or crowded places. Exercise particular caution during the holiday season and at holiday festivals or events. U.S. citizens should monitor media and local information sources and factor updated information into personal travel plans and activities. Persons with specific safety concerns should contact local law enforcement authorities who are responsible for the safety and security of all visitors to their host country.

The State Department left out a more likely danger, that you are crushed by a coin-operated beverage machine you are trying to cheat out of a drink or treat.

I know that agency budgets are under assault but asking U.S. citizens to shelter in place, that’s what don’t travel means, for the next three (3) months is a bit extreme.

Next thing you know, the Department of Homeland Security will start storing grenades and ammunition at every tenth house just in case they are cut off by terrorists from their supply base.

Every agency will try to outdo the others in whipping up fear of terrorists.

Let’s tell the State Department thanks but no thanks for the injection of paranoia into our holiday season.

In fact, the State Department makes it easy for you to send that message:

Call 1-888-407-4747 toll-free in the United States and Canada or 1-202-501-4444 from other countries from 8:00 a.m. to 8:00 p.m. Eastern Standard Time, Monday through Friday (except U.S. federal holidays).

That’s 13:00 UTC until 01:00 UTC the next day, in case you are overseas.
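If you want to check the arithmetic, here is a short Python sketch. It assumes a fixed UTC-5 offset for Eastern Standard Time, which holds for the winter months the alert covers; the helper name `est_to_utc` and the sample date (the Monday the alert was issued) are my choices for illustration.

```python
from datetime import datetime, timedelta, timezone

# Eastern Standard Time is a fixed five hours behind UTC (assumption: no
# daylight saving, which is correct between November and February).
EST = timezone(timedelta(hours=-5), "EST")


def est_to_utc(year: int, month: int, day: int, hour: int, minute: int = 0) -> datetime:
    """Interpret the given wall-clock time as EST and convert it to UTC."""
    local = datetime(year, month, day, hour, minute, tzinfo=EST)
    return local.astimezone(timezone.utc)


opens = est_to_utc(2015, 11, 23, 8)    # 8:00 a.m. EST
closes = est_to_utc(2015, 11, 23, 20)  # 8:00 p.m. EST

print(opens.isoformat())   # 2015-11-23T13:00:00+00:00
print(closes.isoformat())  # 2015-11-24T01:00:00+00:00
```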

Relevant U.S. federal holidays are: Thanksgiving Day (26 November 2015), Christmas Day (25 December 2015), New Year’s Day (1 January 2016), Martin Luther King, Jr. Day (18 January 2016), George Washington’s Birthday (15 February 2016).

Enjoy your holidays despite terrorists and their cheerleaders in the State Department and press. Imagine how little news coverage terrorists would get if left to their own devices.

### Televisions, Furniture and Appliances (TFA) versus Terrorists (TFA is winning)

Monday, November 23rd, 2015

Holiday concerns in the United States should be focused on unstable televisions, furniture and appliances (TFA) rather than terrorists.

The U.S. Consumer Product Safety Commission reports:

…injuries and fatalities associated with television, furniture, and appliance product instability or tip-over.

Of the estimated annual average of 38,000 emergency department-treated injuries (2011–2013) and the 430 reported fatalities occurring between 2000 and 2013 associated with tip-overs, staff noted the following:

Breakdown by victim (image to replicate the formatting):

While all levels of government spend billions of dollars hunting terrorists in the United States and coming up dry (see You Are Safer Than You Think), we know that televisions, furniture and appliances are injuring and killing far more U.S. citizens than terrorists. Spend a few extra dollars this holiday season and ensure the stability of televisions, furniture and appliances, new or old, in your home. That expenditure will increase your safety measurably more than the billions being spent by the government on terrorists they can’t seem to find.

Until after the fact of a terrorist attack that is.

### Christmas Comes Early for Law Enforcement

Sunday, November 22nd, 2015

From the post:

According to a document prepared by the New York District Attorney’s Office, older versions of Android can easily be remotely reset by Google if compelled by a court order, allowing investigators to easily view the contents of a device.

Ben reports that:

74.1 percent of devices are still using a version of Android that can be remotely accessed at any time.

I want a box of burner phones for Christmas, along with an electromagnet powerful enough to down the motor in a refrigerator.

The trick will be to be standing next to the electromagnet when unfriendly visitors arrive. 😉

### Why you should understand (a little) about TCP

Sunday, November 22nd, 2015

Why you should understand (a little) about TCP by Julia Evans.

From the post:

This isn’t about understanding everything about TCP or reading through TCP/IP Illustrated. It’s about how a little bit of TCP knowledge is essential. Here’s why.

When I was at the Recurse Center, I wrote a TCP stack in Python (and wrote about what happens if you write a TCP stack in Python). This was a fun learning experience, and I thought that was all.

A year later, at work, someone mentioned on Slack “hey I’m publishing messages to NSQ and it’s taking 40ms each time”. I’d already been thinking about this problem on and off for a week, and hadn’t gotten anywhere.

A little background: NSQ is a queue that you send messages to. The way you publish a message is to make an HTTP request on localhost. It really should not take 40 milliseconds to send a HTTP request to localhost. Something was terribly wrong. The NSQ daemon wasn’t under high CPU load, it wasn’t using a lot of memory, it didn’t seem to be a garbage collection pause. Help.

Then I remembered an article I’d read a week before called In search of performance – how we shaved 200ms off every POST request. In that article, they talk about why every one of their POST requests were taking 200 extra milliseconds. That’s.. weird. Here’s the key paragraph from the post

Julia’s posts are generally useful and entertaining to read and this one is no exception.

As Julia demonstrates in this post, TCP isn’t as low-level as you might think. 😉
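The excerpt above stops before the diagnosis, so treat the following as a sketch under an assumption: that a fixed delay of tens of milliseconds on small writes comes from Nagle’s algorithm interacting with delayed ACKs, a common cause of exactly this symptom. In that case the usual remedy is to set `TCP_NODELAY` on the client socket. The helper name `nodelay_before_and_after` is mine; the socket calls are from Python’s standard `socket` module.

```python
import socket


def nodelay_before_and_after() -> tuple[int, int]:
    """Create a TCP socket and report TCP_NODELAY before and after enabling it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        before = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
        # Disable Nagle's algorithm: small writes are sent immediately instead
        # of waiting for outstanding data to be acknowledged.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        after = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
    finally:
        sock.close()
    return before, after


before, after = nodelay_before_and_after()
print("TCP_NODELAY before:", before)
print("TCP_NODELAY after:", after)
```

Whether a given HTTP client library exposes this option, and whether it is already on by default, varies; check the library before assuming either way.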

The other lesson to draw here is the greater your general knowledge of how things work, the more likely you can fix (or cause) problems with a minimal amount of effort.

Learn a little TCP with Julia and keep deeper resources bookmarked should the need arise.

### Deep Learning and Parsing

Sunday, November 22nd, 2015

Jason Baldridge tweets that the work of James Henderson (Google Scholar) should get more cites for deep learning and parsing.

Jason points to the following two works (early 1990’s) in particular:

Description Based Parsing in a Connectionist Network by James B. Henderson.

Abstract:

Recent developments in connectionist architectures for symbolic computation have made it possible to investigate parsing in a connectionist network while still taking advantage of the large body of work on parsing in symbolic frameworks. This dissertation investigates syntactic parsing in the temporal synchrony variable binding model of symbolic computation in a connectionist network. This computational architecture solves the basic problem with previous connectionist architectures, while keeping their advantages. However, the architecture does have some limitations, which impose computational constraints on parsing in this architecture. This dissertation argues that, despite these constraints, the architecture is computationally adequate for syntactic parsing, and that these constraints make significant linguistic predictions. To make these arguments, the nature of the architecture’s limitations are first characterized as a set of constraints on symbolic computation. This allows the investigation of the feasibility and implications of parsing in the architecture to be investigated at the same level of abstraction as virtually all other investigations of syntactic parsing. Then a specific parsing model is developed and implemented in the architecture. The extensive use of partial descriptions of phrase structure trees is crucial to the ability of this model to recover the syntactic structure of sentences within the constraints. Finally, this parsing model is tested on those phenomena which are of particular concern given the constraints, and on an approximately unbiased sample of sentences to check for unforeseen difficulties. The results show that this connectionist architecture is powerful enough for syntactic parsing. They also show that some linguistic phenomena are predicted by the limitations of this architecture. In particular, explanations are given for many cases of unacceptable center embedding, and for several significant constraints on long distance dependencies. These results give evidence for the cognitive significance of this computational architecture and parsing model. This work also shows how the advantages of both connectionist and symbolic techniques can be unified in natural language processing applications. By analyzing how low level biological and computational considerations influence higher level processing, this work has furthered our understanding of the nature of language and how it can be efficiently and effectively processed.

Connectionist Syntactic Parsing Using Temporal Variable Binding by James Henderson.

Abstract:

Recent developments in connectionist architectures for symbolic computation have made it possible to investigate parsing in a connectionist network while still taking advantage of the large body of work on parsing in symbolic frameworks. The work discussed here investigates syntactic parsing in the temporal synchrony variable binding model of symbolic computation in a connectionist network. This computational architecture solves the basic problem with previous connectionist architectures, while keeping their advantages. However, the architecture does have some limitations, which impose constraints on parsing in this architecture. Despite these constraints, the architecture is computationally adequate for syntactic parsing. In addition, the constraints make some significant linguistic predictions. These arguments are made using a specific parsing model. The extensive use of partial descriptions of phrase structure trees is crucial to the ability of this model to recover the syntactic structure of sentences within the constraints imposed by the architecture.

Enjoy!