Pandering for Complaints

August 21st, 2015

Yesterday, in Censorship of Google Spreads to the UK, I mentioned that the UK has joined the ranks of Google's censors and is attempting to fine-tune the search results for a given name.

Today, Simon Rice of the Information Commissioner’s Office (ICO) posted: Personal data in leaked datasets is still personal data.

Simon starts off by mentioning the Ashley Madison data dumps and then says:

Anyone in the UK who might download, collect or otherwise process the leaked data needs to be aware they could be taking on data protection responsibilities defined in the UK’s Data Protection Act.

Similarly, seeking to identify an individual from a leaked dataset will be an intrusion into their private life and could also lead to a breach of the DPA.

Individuals will have a range of personal reasons for having created an account with particular online services (or even had an account created without their knowledge) and any publication of further personal data without their consent can cause them significant damage or distress.

It’s worth noting too that any individual or organisation seeking to rely on the journalism exemption should be reminded that this is not a blanket exemption to the DPA and be encouraged to read our detailed guide on how the DPA applies to journalism.

Talk about chilling free speech. You shouldn’t even look to see if the data is genuine. Just don’t look!

You could let your “betters” in the professional press tell you what they want you to know, but I suspect you are brighter than that. What are the press's motives behind what you see and what you don’t?

To make matters even worse, Simon closes with a solicitation for complaints:

If you find your personal data being published online then you have a right to go to that publisher and request that the information is removed. This applies equally to information being shared on social media. If the publisher is based in the UK and fails to remove your information you can complain to the ICO.

I don’t have a lot of extra webspace, but if you get a complaint from the ICO, I’m willing to host whatever data I can. It won’t be much, so don’t get too excited about free space.

We all need to step up and offer storage space for content censored by the UK and others.

Disclosing Government Contracts

August 21st, 2015

The More the Merrier? How much information on government contracts should be published and who will use it by Gavin Hayman.

From the post:

A huge bunch of flowers to Rick Messick for his excellent post asking two key questions about open contracting. And some luxury cars, expensive seafood and a vat or two of cognac.

Our lavish offerings all come from Slovakia, where in 2013 the Government Public Procurement Office launched a new portal publishing all its government contracts. All these items were part of the excessive government contracting uncovered by journalists, civil society and activists. In the case of the flowers, teachers investigating spending at the Department of Education uncovered florists’ bills for thousands of euros. Spending on all of these has subsequently declined: a small victory for fiscal probity.

The flowers, cars, and cognac help to answer the first of two important questions that Rick posed: Will anyone look at contracting information? In the case of Slovakia, it is clear that lowering the barriers to access information did stimulate some form of response and oversight.

The second question was equally important: “How much contracting information should be disclosed?”, especially in commercially sensitive circumstances.

These are two of key questions that we have been grappling with in our strategy at the Open Contracting Partnership. We thought that we would share our latest thinking below, in a post that is a bit longer than usual. So grab a cup of tea and have a read. We’ll be definitely looking forward to your continued thoughts on these issues.

It's not a short read, so grab some coffee (outside of Europe) and settle in.

Disclosure: I’m financially interested in government disclosure in general and contracts in particular. With openness comes more effort to conceal semantics, which increases the need for topic maps to pierce the darkness.

I don’t think openness reduces the amount of fraud and misconduct in government; it only gives the alignment between citizens and the career interests of a prosecutor a sporting chance to catch someone out.

Disclosure should be as open as possible and what isn’t disclosed voluntarily, well, one hopes for brave souls who will leak the remainder.

Support disclosure of government contracts and leakers of the same.

If you need help “connecting the dots,” consider topic maps.

TSA Master Luggage Keys

August 21st, 2015

The paragons of security who keep you safe (sic) in the air, the TSA, helpfully offered a photo op for some of their master keys to your luggage.

tsa-master-keys

Frantic discussion of images of these keys should be tempered with the knowledge that an ordinary screwdriver will open more suitcase locks than the entire set of TSA master keys.

Oh, but what if they want to keep it a secret? You mean like the people with the keys who put flyers in your bag when they open it? Yes?

Anyone else opening your luggage is looking for the quickest way possible and relocking isn’t a value to them.

Still, an example of the highly trained and security aware public servants who are making air travel safe, while never catching a single terrorist.

Must be like driving all the snakes out of Ireland.

Solving the Stable Marriage problem…

August 21st, 2015

Solving the Stable Marriage problem with Erlang by Yan Cui.

With all the Ashley Madison hack publicity, I didn’t know there was a “stable marriage problem.” ;-)

Turns out it is like the Eight Queens problem: it is a “problem,” but not one you are likely to encounter outside of a CS textbook.

Yan sets up the problem with this quote from Wikipedia:

The stable marriage problem is commonly stated as:

Given n men and n women, where each person has ranked all members of the opposite sex with a unique number between 1 and n in order of preference, marry the men and women together such that there are no two people of opposite sex who would both rather have each other than their current partners. If there are no such people, all the marriages are “stable”. (It is assumed that the participants are binary gendered and that marriages are not same-sex).

The wording is a bit awkward. I would rephrase it to say that a matching is stable when no pair of people both prefer each other to their current partners. One partner can prefer someone else, but if that someone else does not share the preference, both marriages remain “stable.”

The Wikipedia article does observe:

While the solution is stable, it is not necessarily optimal from all individuals’ points of view.

Yan sets up the problem and then walks through the required code.
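Yan's solution is in Erlang; for readers who want a quick feel for the classic Gale-Shapley algorithm before reading his walkthrough, here is a minimal Python sketch of my own (an illustration, not Yan's code):

```python
def gale_shapley(men_prefs, women_prefs):
    """Return a stable matching {man: woman} via Gale-Shapley.

    men_prefs / women_prefs map each person to a ranked list of the
    opposite sex, most preferred first.
    """
    # Lower index = more preferred; precomputed for fast comparisons.
    rank = {w: {m: i for i, m in enumerate(prefs)}
            for w, prefs in women_prefs.items()}
    free = list(men_prefs)                 # men with no partner yet
    next_pick = {m: 0 for m in men_prefs}  # index of next woman to propose to
    engaged = {}                           # woman -> man

    while free:
        m = free.pop()
        w = men_prefs[m][next_pick[m]]
        next_pick[m] += 1
        if w not in engaged:
            engaged[w] = m                 # she was free; provisional match
        elif rank[w][m] < rank[w][engaged[w]]:
            free.append(engaged[w])        # she trades up; old partner is free
            engaged[w] = m
        else:
            free.append(m)                 # rejected; he proposes again later

    return {m: w for w, m in engaged.items()}


men = {"adam": ["alice", "beth"], "bob": ["alice", "beth"]}
women = {"alice": ["bob", "adam"], "beth": ["bob", "adam"]}
print(gale_shapley(men, women))  # {'bob': 'alice', 'adam': 'beth'}
```

Both men prefer alice, but alice prefers bob, so adam ends up with beth; no pair would both rather swap, which is exactly the stability condition above.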

Enjoy!

Conversations With Datomic

August 21st, 2015

Conversations With Datomic by Carin Meier. (See Conversations With Datomic Part 2 as well.)

Perhaps not “new” but certainly an uncommon approach to introducing users to a database system.

Carin has a “conversation” with Datomic that starts from the very beginning of creating a database and goes forward.

Rewarding and a fun read!

Enjoy!

SEC To Accept Inline XBRL?

August 20th, 2015

SEC commits to human- and machine-readable format that could fix agency’s open data problems by Justin Duncan.

The gist of the story is that the SEC has, unfortunately, been accepting both plain text and XBRL filings since 2009. Since XBRL results in open data, you can imagine the effort put into that data by filers.

At the urging of Congress, the SEC has stated in writing:

SEC staff is currently developing recommendations for the Commission’s consideration to allow filers to submit XBRL data inline as part of their core filings, rather than filing XBRL data in an exhibit.

Before you celebrate too much, note that the SEC didn’t offer any dates for acceptance of inline XBRL filings.

Still, better an empty promise than no promise at all.

If you are interested in making sure that inline XBRL data does result in meaningful disclosure, if and when it is used for SEC filings, consult the following:

Inline XBRL 1.1 (standard)

An Integrator’s Guide to Inline XBRL

Plus you may want to consider how you would use XQuery to harvest and combine XBRL data with other data sources. It’s not too early to be thinking about “enhanced” results.
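As a taste of what harvesting inline XBRL looks like, here is a standard-library Python sketch. The ix:nonFraction element and the 2013 inline XBRL namespace come from the Inline XBRL 1.1 standard linked above; the sample document and the concept name are my own invention for illustration:

```python
import xml.etree.ElementTree as ET

# Inline XBRL 1.1 embeds facts in XHTML using elements in this namespace.
IX = "http://www.xbrl.org/2013/inlineXBRL"

# A toy inline XBRL fragment; the concept and context names are illustrative.
doc = """
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ix="http://www.xbrl.org/2013/inlineXBRL">
  <body>
    <p>Revenue was
      <ix:nonFraction name="us-gaap:Revenues" contextRef="FY2014"
                      unitRef="usd" decimals="0">1234567</ix:nonFraction>
      dollars.
    </p>
  </body>
</html>
"""

def harvest_facts(xhtml):
    """Yield (concept, context, value) for each tagged numeric fact."""
    root = ET.fromstring(xhtml)
    for el in root.iter(f"{{{IX}}}nonFraction"):
        yield el.get("name"), el.get("contextRef"), el.text

facts = list(harvest_facts(doc))
print(facts)  # [('us-gaap:Revenues', 'FY2014', '1234567')]
```

The point of inline XBRL is exactly this: the same document a human reads in a browser carries machine-readable facts you can pull out and combine with other data.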

Censorship of Google Spreads to the UK

August 20th, 2015

Google ordered to remove links to ‘right to be forgotten’ removal stories by Samuel Gibbs.

From the post:

Google has been ordered by the Information Commissioner’s office to remove nine links to current news stories about older reports which themselves were removed from search results under the ‘right to be forgotten’ ruling.

The search engine had previously removed links relating to a 10 year-old criminal offence by an individual after requests made under the right to be forgotten ruling. Removal of those links from Google’s search results for the claimant’s name spurred new news posts detailing the removals, which were then indexed by Google’s search engine.

Google refused to remove links to these later news posts, which included details of the original criminal offence, despite them forming part of search results for the claimant’s name, arguing that they are an essential part of a recent news story and in the public interest.

Google now has 35 days from the 18 August to remove the links from its search results for the claimant’s name. Google has the right to appeal to the General Regulatory Chamber against the notice.

It is spectacularly sad that this wasn’t the work of the gnomes who run the EU bureaucracy, looking for something pointless to occupy their time and the time of others.

No, this was the Information Commissioner’s Office:

The UK’s independent authority set up to uphold information rights in the public interest, promoting openness by public bodies and data privacy for individuals.

Despite this being a story of public interest and conceding that the public has an interest in finding stories about delisted searches:

27. Journalistic context — The Commissioner accepts that the search results in this case relate to journalistic content. Further, the Commissioner does not dispute that journalistic content relating to decisions to delist search results may be newsworthy and in the public interest. However, that interest can be adequately and properly met without a search made on the basis of the complaint’s name providing links to articles which reveal information about the complainant’s spent conviction.

The decision goes on to give Google 35 days from the 18th of August to delist websites which appear in search results on the basis of a censored name. And of course, the links are censored as well.

While the StageFright vulnerability, which impacts 950 million Android users, remains unpatched, the Information Commissioner’s Office wants to fine-tune the search results for a given name to exclude particular websites.

In the not too distant future, the search results displayed in Google will represent a vetting by the most oppressive regimes in the world to the silliest.

Google should not appeal this decision but simply ignore it.

It is an illegal and illegitimate intrusion both on the public’s right to search by any means or manner it chooses and Google’s right to truthfully report the results of searches.

Free Packtpub Books (Legitimate Ones)

August 20th, 2015

Packtpub Books is running a “free book per day” event. Most of you know Packtpub already so I won’t belabor the quality of their publications, etc.

The important news is that for 24 hours each day in August, Packtpub Books is offering a different book for free download! The current free book offer appears to expire at the end of August, 2015.

Packtpub Books – Free Learning

This is a great way to introduce non-Packtpub customers to Packtpub publications.

Please share this news widely (and with other publishers). ;-)

950 Million Users – Scoped and Bracketed – StageFright

August 20th, 2015

Summary: StageFright patch flawed – 950 Million Android users still vulnerable.

Jordan Gruskovnjak / @jgrusko (technical details) and Aaron Portnoy / @aaronportnoy (commentary) in Stagefright: Mission Accomplished? offer these findings on the StageFright patch from Google:

  • The flaw was initially reported over 120 days ago to Google, which exceeds even their own 90-day disclosure deadline
  • The patch is 4 lines of code and was (presumably) reviewed by Google engineers prior to shipping. The public at large believes the current patch protects them when it in fact does not.
  • The flaw affects an estimated 950 million Google customers.
  • Despite our notification (and their confirmation), Google is still currently distributing the faulty patch to Android devices via OTA updates
  • There has been an inordinate amount of attention drawn to the bug–we believe we are likely not the only ones to have noticed it is flawed. Others may have malicious intentions.
  • Google has not given us any indication of a timeline for correcting the faulty patch, despite our queries.
  • The Stagefright Detector application released by Zimperium (the company behind the initial discovery) reports “Congratulations! Your device is not affected by vulnerabilities in Stagefright!” when in fact it is, leading to a false sense of security among users.

Read the full post by Jordan Gruskovnjak and Aaron Portnoy for technical details and commentary on this failure to patch StageFright.

The Gremlin Graph Traversal Language (slides)

August 19th, 2015

The Gremlin Graph Traversal Language by Marko Rodriguez.

Forty-Five (45) out of fifty (50) slides have working Gremlin code!

Ninety percent (90%) of the slides have code you can enter!

It isn’t as complete as The Gremlin Graph Traversal Machine and Language, but on the other hand, it is a hell of a lot easier to follow along.

Enjoy!

“True” Size?

August 19th, 2015

This interactive map shows how ‘wrong’ other maps are by Adam Taylor.

From the post:

Given how popular the Mercator projection is, it’s wise to question how it makes us view the world. Many have noted, for example, how the distortion around the poles makes Africa look smaller than Greenland, when in reality Africa is about 14.5 times as big. In 2010, graphic artist Kai Krause made a map to illustrate just how big the African continent is. He found that he was able to fit the United States, India and much of Europe inside the outline of the African continent.

Inspired by Krause’s map, James Talmage, and Damon Maneice, two computer developers based out of Detroit, created an interactive graphic that really puts the distortion caused by the Mercator map into perspective. The tool, dubbed “The True Size” allows you to type in the name of any country and move the outline around to see how the scale of the country gets distorted the closer it gets to the poles.

Of course, one thing the map shows well is the sheer size of Africa. Here it is compared with the United States, China and India.

africa-clip

This is a great resource for anyone who wants to learn more about the physical size of countries, but it is also an illustration that no map is “wrong”; some simply display the information you seek better than others.

For another interesting take on world maps, see WorldMapper where you will find gems like:

GDP Wealth

gdp-wealth

Absolute Poverty

poverty-less-than-2-day

Or you can rank countries by their contributions to science:

Science Research

science-research

None of these maps is more “true” than the others.

Which one you choose depends on the cause you want to advance.

Sketchy

August 19th, 2015

Sketchy

sketchy

From the announcement of Sketchy:

One of the features we wanted to see in Scumblr was the ability to collect screenshots and text content from potentially malicious sites – this allows security analysts to preview Scumblr results without the risk of visiting the site directly. We wanted this collection system to be isolated from Scumblr and also resilient to sites that may perform malicious actions. We also decided it would be nice to build an API that we could use in other applications outside of Scumblr.

Although a variety of tools and frameworks exist for taking screenshots, we discovered a number of edge cases that made taking reliable screenshots difficult – capturing screenshots from AJAX-heavy sites, cut-off images with virtual X drivers, and SSL and compression issues in the PhantomJS driver for Selenium, to name a few. In order to solve these challenges, we decided to leverage the best possible tools and create an API framework that would allow for reliable, scalable, and easy to use screenshot and text scraping capabilities. Sketchy to the rescue!

Sketchy wiki

Docker for Sketchy

An interesting companion to Scumblr, especially since an action from Scumblr may visit an “unsafe” site in your absence.

Ways to monitor malicious sites once they are discovered? Suggestions?

Scumblr

August 19th, 2015

Scumblr

scumblr

If, like me, you missed the Ashley Madison dump to an .onion site on Tuesday, 18 August 2015 (one that is now thought to be authentic), you need a tool like Scumblr!

From the Netflix description of Scumblr:

What is Scumblr?

Scumblr is a web application that allows performing periodic searches and storing / taking actions on the identified results. Scumblr uses the Workflowable gem to allow setting up flexible workflows for different types of results.

How do I use Scumblr?

Scumblr is a web application based on Ruby on Rails. In order to get started, you’ll need to setup / deploy a Scumblr environment and configure it to search for things you care about. You’ll optionally want to setup and configure workflows so that you can track the status of identified results through your triage process.

What can Scumblr look for?

Just about anything! Scumblr searches utilize plugins called Search Providers. Each Search Provider knows how to perform a search via a certain site or API (Google, Bing, eBay, Pastebin, Twitter, etc.). Searches can be configured from within Scumblr based on the options available by the Search Provider. What are some things you might want to look for? How about:

  • Compromised credentials
  • Vulnerability / hacking discussion
  • Attack discussion
  • Security relevant social media discussion

These are just a few examples of things that you may want to keep an eye on!

Scumblr found stuff, now what?

Up to you! You can create simple or complex workflows to be used along with your results. This can be as simple as marking results as “Reviewed” once they’ve been looked at, or much more complex involving multiple steps with automated actions occurring during the process.

Sounds great! How do I get started?

Take a look at the wiki for detailed instructions on setup, configuration, and use!

This looks particularly useful if you are watching for developments and/or discussions in software forums, blogs, etc.
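Scumblr itself is Ruby on Rails, but the Search Provider idea is simple enough to sketch in a few lines of Python. All of the names below are my own illustration, not Scumblr's actual API:

```python
class SearchProvider:
    """A plugin that knows how to search one source for a term."""
    def search(self, term):
        raise NotImplementedError

class StaticProvider(SearchProvider):
    """Stand-in for a real provider (Google, Bing, Pastebin, ...)."""
    def __init__(self, name, documents):
        self.name, self.documents = name, documents
    def search(self, term):
        return [(self.name, doc) for doc in self.documents if term in doc]

def run_searches(providers, terms):
    """Periodic job: run every term against every provider.

    New results enter the workflow as 'unreviewed', awaiting triage.
    """
    results = []
    for provider in providers:
        for term in terms:
            for source, doc in provider.search(term):
                results.append({"source": source, "match": doc,
                                "status": "unreviewed"})
    return results

providers = [StaticProvider("pastebin",
                            ["dump of leaked credentials", "cat pictures"])]
print(run_searches(providers, ["credentials"]))
```

The real system adds scheduling, persistence, and Workflowable-driven state transitions on top of that loop, but the plugin-per-source shape is the core idea.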

dgol – Distributed Game Of Life

August 19th, 2015

dgol – Distributed Game Of Life by Mirko Bonadei and Gabriele Lana.

From the webpage:

This project is an implementation of the Game of life done by Gabriele Lana and me during the last months.

We took it as a “toy project” to explore all the nontrivial decisions that need to be made when you have to program a distributed system (eg: choose the right supervision strategy, how to make sub-systems communicate each other, how to store data to make it fault tolerant, ecc…).

It is inspired by the Torben Hoffman’s version and on the talk Thinking like an Erlanger.

The project is still under development, at the moment we are doing a huge refactoring of the codebase because we are reorganizing the supervision strategy.

Don’t just nod at the Thinking like an Erlanger link. Part of its description reads:

If you find Erlang is a bit tough, or if testing gives you headaches, this webinar is for you. We will spend most of this intensive session looking at how to design systems with asynchronous message passing between processes that do not share any memory.

Definitely watch the video and progress in this project!

Non-News: Algorithms Are Biased

August 19th, 2015

Programming and prejudice

From the post:

Software may appear to operate without bias because it strictly uses computer code to reach conclusions. That’s why many companies use algorithms to help weed out job applicants when hiring for a new position.

But a team of computer scientists from the University of Utah, University of Arizona and Haverford College in Pennsylvania have discovered a way to find out if an algorithm used for hiring decisions, loan approvals and comparably weighty tasks could be biased like a human being.

The researchers, led by Suresh Venkatasubramanian, an associate professor in the University of Utah’s School of Computing, have discovered a technique to determine if such software programs discriminate unintentionally and violate the legal standards for fair access to employment, housing and other opportunities. The team also has determined a method to fix these potentially troubled algorithms.

Venkatasubramanian presented his findings Aug. 12 at the 21st Association for Computing Machinery’s Conference on Knowledge Discovery and Data Mining in Sydney, Australia.

“There’s a growing industry around doing resume filtering and resume scanning to look for job applicants, so there is definitely interest in this,” says Venkatasubramanian. “If there are structural aspects of the testing process that would discriminate against one community just because of the nature of that community, that is unfair.”

It’s a puff piece and therefore misses that all algorithms are biased, but some algorithms are biased in ways not permitted under current law.

The paper, which this piece avoids citing for some reason, is Certifying and removing disparate impact by Michael Feldman, Sorelle Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian.

The abstract for the paper does a much better job of setting the context for this research:

What does it mean for an algorithm to be biased? In U.S. law, unintentional bias is encoded via disparate impact, which occurs when a selection process has widely different outcomes for different groups, even as it appears to be neutral. This legal determination hinges on a definition of a protected class (ethnicity, gender, religious practice) and an explicit description of the process.

When the process is implemented using computers, determining disparate impact (and hence bias) is harder. It might not be possible to disclose the process. In addition, even if the process is open, it might be hard to elucidate in a legal setting how the algorithm makes its decisions. Instead of requiring access to the algorithm, we propose making inferences based on the data the algorithm uses.

We make four contributions to this problem. First, we link the legal notion of disparate impact to a measure of classification accuracy that while known, has received relatively little attention. Second, we propose a test for disparate impact based on analyzing the information leakage of the protected class from the other data attributes. Third, we describe methods by which data might be made unbiased. Finally, we present empirical evidence supporting the effectiveness of our test for disparate impact and our approach for both masking bias and preserving relevant information in the data. Interestingly, our approach resembles some actual selection practices that have recently received legal scrutiny.

If you are a bank, you want a loan algorithm to be biased against people with a poor history of paying their debts. The distinction is that this is a legitimate basis for discriminating among loan applicants.

The lesson here is that all algorithms are biased; the question is whether the bias is in your favor or not.
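The legal test behind "disparate impact" is often operationalized as the EEOC's "80% rule": if one group's selection rate is less than 80% of the highest group's rate, disparate impact is suspected. A quick sketch, with invented hiring numbers:

```python
def selection_rate(selected, total):
    """Fraction of a group's applicants that were selected."""
    return selected / total

def disparate_impact_ratio(rate_a, rate_b):
    """Ratio of the lower selection rate to the higher one."""
    return min(rate_a, rate_b) / max(rate_a, rate_b)

# Invented numbers for illustration.
group_a = selection_rate(selected=50, total=100)   # 0.50
group_b = selection_rate(selected=30, total=100)   # 0.30

ratio = disparate_impact_ratio(group_a, group_b)
print(f"ratio = {ratio:.2f}")  # 0.60, below the 0.8 threshold
print("disparate impact suspected" if ratio < 0.8 else "passes 80% rule")
```

The paper's contribution is detecting this kind of imbalance from the algorithm's input data alone, when you can't inspect (or can't legally explain) the algorithm itself.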

Suggestion: Only bet when using your own dice (algorithm).

4,000 Deep Web Links [but no AshleyMadison]

August 19th, 2015

4,000 Deep Web Links by Nikoloz Kokhreidze.

This listing was generated in early June, 2015, so it doesn’t include the most recent AshleyMadison data dump.

Apparently an authentic (according to some commentators) data dump from AshleyMadison was posted yesterday but I haven’t been able to find a report with an address for the dump.

One commentator, with a major technical site, sniffed that the dump was news but:

I’m not really interested in actively outing anyone’s private information

What a twit!

Why should I take even the New York Times’ word for the contents of the dump when the data should be searchable by anyone?

It may be that N email addresses end in .mil, but how many end in nytimes.com? Unlikely to see that reported in the New York Times. Yes?

Summarizing data and attempting to persuade me your summary is both useful and accurate is great. Saves me the time and trouble of wrangling the data.

However, the raw data that you have summarized should be available for verification by others. That applies to data held by Wikileaks, the New York Times or any government.

The history of government and corporate leaks has one overriding lesson: Deception is universal.

Suggested motto for data geeks:

In God We Trust, All Others Must Provide Raw Data.

PS: A good place to start touring the Deep/Dark Web: http://7g5bqm7htspqauum.onion/ – The Hidden Wiki (requires Tor for access).

The Gremlin Graph Traversal Machine and Language

August 18th, 2015

The Gremlin Graph Traversal Machine and Language by Marko A. Rodriguez.

Abstract:

Gremlin is a graph traversal machine and language designed, developed, and distributed by the Apache TinkerPop project. Gremlin, as a graph traversal machine, is composed of three interacting components: a graph G, a traversal \Psi, and a set of traversers T. The traversers move about the graph according to the instructions specified in the traversal, where the result of the computation is the ultimate locations of all halted traversers. A Gremlin machine can be executed over any supporting graph computing system such as an OLTP graph database and/or an OLAP graph processor. Gremlin, as a graph traversal language, is a functional language implemented in the user’s native programming language and is used to define the $\Psi$ of a Gremlin machine. This article provides a mathematical description of Gremlin and details its automaton and functional properties. These properties enable Gremlin to naturally support imperative and declarative querying, host language agnosticism, user-defined domain specific languages, an extensible compiler/optimizer, single- and multi-machine execution models, hybrid depth-and breadth-first evaluation, as well as the existence of a Universal Gremlin Machine and its respective entailments.

Why Marko wants to overload terms like Gremlin, I don’t know. (Hi! Marko!) Despite that overloading, if you are fond of Gremlin (in any sense) and are looking for a challenging read, you have found it.

This won’t be a quick read. ;-)

Why anyone would desire imperative querying is another unknown.

Static results are an edge case of the dynamic systems they purport to represent. (An insight suggested to me by Sam Hunting.)

Think about it for a moment. There are no non-dynamic systems, only non-dynamic representations of dynamic systems. Which makes non-dynamic representations false, albeit sometimes useful, but also an edge case.

Sorry, didn’t mean to get distracted.

Deeply recommend this “Gremlin”-overloaded paper, and that you check out the recently released TinkerPop3!

Graphs, Source Code Auditing, Vulnerabilities

August 18th, 2015

While I was skimming the Praetorian website, I ran across this blog entry: Why You Should Add Joern to Your Source Code Audit Toolkit by Kelby Ludwig.

From the post:

What is Joern?

Joern is a static analysis tool for C / C++ code. It builds a graph that models syntax. The graphs are built out using Joern’s fuzzy parser. The fuzzy parser allows for Joern to parse code that is not necessarily in a working state (i.e., does not have to compile). Joern builds this graph with multiple useful properties that allow users to define meaningful traversals. These traversals can be used to identify potentially vulnerable code with a low false-positive rate.

Joern is easy to set up and import code with. The graph traversals, which are written using a graph database query language called Gremlin, are simple to write and easy to understand.

Why use Joern?

Joern builds a Code Property Graph out of the imported source code. Code Property Graphs combine the properties of Abstract Syntax Trees, Control Flow Graphs, and Program Dependence Graphs. By leveraging various properties from each of these three source code representations, Code Property Graphs can model many different types of vulnerabilities. Code Property Graphs are explained in much greater detail in the whitepaper on the subject. Example queries can be found in a presentation on Joern’s capabilities. While the presentation does an excellent job of demonstrating the impact of running Joern on the source code for the Linux kernel (running two queries led to seven 0-days out of the 11 total results!), we will be running a slightly more general query on a simple code snippet. By following the query outlined in the presentation, we can write similar queries for other potentially dangerous methods.

There are graphs, Gremlin, discovery of zero-day vulnerabilities, this is a post that pushes so many buttons!

Consider it to be a “lite” introduction to Joern, which I have mentioned before.
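Joern's real queries are Gremlin traversals over a Code Property Graph of C/C++ code. As a much lighter illustration of the same idea, querying a program's syntax tree for dangerous call sites, here is a toy example using Python's standard ast module (this is not Joern, just the flavor of the technique):

```python
import ast

# Toy source to audit; eval on untrusted input is the classic sin.
SOURCE = """
user_input = input()
result = eval(user_input)
safe = len(user_input)
"""

def find_calls(source, names):
    """Return (line, name) for each call to a function in `names`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in names):
            hits.append((node.lineno, node.func.id))
    return hits

print(find_calls(SOURCE, {"eval", "exec"}))  # [(3, 'eval')]
```

Joern's Code Property Graph lets you go much further, combining syntax with control flow and data dependence so a query can ask "does attacker-controlled data reach this call?", not just "does this call exist?".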

1,002 Things To Do With Your Drone

August 18th, 2015

You saw the Code.org Facebook post last January referring to Drones Will Be Everywhere Watching, Listening, and…Planting Millions of Trees? as “1,001 uses for drones.”

Here is use #1,002:

How Drones Can Find and Hack Internet-of-Things Network Things From the Sky by Mohit Kumar.

From the post:

Security researchers have developed a Flying Drone with a custom-made tracking tool capable of sniffing out data from the devices connected to the Internet – better known as the Internet-of-things.

Under its Internet of Things Map Project, a team of security researchers at the Texas-based firm Praetorian wanted to create a searchable database that will be the Shodan search engine for SCADA devices.

Located More Than 1600+ Devices Using Drone

To make it possible, the researchers devised a drone with their custom built connected-device tracking appliance and flew it over Austin, Texas in real time.

During an 18 minute flight, the drone found nearly 1,600 Internet-connected devices, of which 453 IoT devices are made by Sony and 110 by Philips. You can see the full Austin map here.

The map of Austin is way cool! What IoT map do you want to create?

Which reminds me, how do you defend against the intrusion of a drone? According to the Wall Street Journal, your options are limited and expensive.

I didn’t see the IoT scanning drone at Praetorian in either finished or kit form.

But I expect IoT scanning drones on the virtual shelves of online retailers long before the holiday season of 2015.

You will be able to spot popular holiday shopping venues by the clouds of drones sniffing for vulnerable automobiles.

PS: Give the military a couple of years to get into the IoT. Flying an IoT sniffing drone over the Pentagon should be a real hoot.

Top 5 Security Practices and the Art of Summarization

August 18th, 2015

Iulia Ion, Rob Reeder, and Sunny Consolvo craft the best summary of a thirty (30) page research paper I have seen in: New research: Comparing how security experts and non-experts stay safe online.

The paper reported the results of a survey of experts and non-experts to discover their security practices online. The gist of the paper was summarized as follows:

Experts’ and non-experts’ top 5 security practices

Here are experts’ and non-experts’ top security practices, according to our study. We asked each participant to list 3 practices:

[Infographic: Beutler_Google_Security-practices-v6]

The full paper is quite good and worth your time to read.

If the behavior of experts influences your software security policies, consider the difference in software updates versus antivirus software:

35% of experts and only 2% of non-experts said that installing software updates was one of their top security practices. Experts recognize the benefits of updates—“Patch, patch, patch,” said one expert—while non-experts not only aren’t clear on them, but are concerned about the potential risks of software updates. A non-expert told us: “I don’t know if updating software is always safe. What [if] you download malicious software?” and “Automatic software updates are not safe in my opinion, since it can be abused to update malicious content.”

Meanwhile, 42% of non-experts vs. only 7% of experts said that running antivirus software was one of the top three things they do to stay safe online. Experts acknowledged the benefits of antivirus software, but expressed concern that it might give users a false sense of security since it’s not a bulletproof solution.

I would summarize that difference as choosing between repairing broken software and adding more broken software to your IT stack.

Which one is a higher priority for you?
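The headline percentages in the study are simply per-group frequencies of each mentioned practice across the participants' top-3 lists. A minimal sketch of that tabulation, using made-up responses rather than the paper's data:

```python
# Compute the fraction of participants who mentioned each security practice.
# The responses below are invented for illustration, not the study's data.
from collections import Counter

def practice_rates(responses):
    """responses: one list of named practices (up to 3) per participant.
    Returns {practice: fraction of participants mentioning it}."""
    mentions = Counter(p for r in responses for p in set(r))
    n = len(responses)
    return {practice: count / n for practice, count in mentions.items()}

experts = [["install updates", "unique passwords", "2FA"],
           ["install updates", "password manager", "2FA"]]
rates = practice_rates(experts)
print(rates["install updates"])  # 1.0 -- both (made-up) experts listed it
```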

101 webscraping and research tasks for the data journalist

August 17th, 2015

101 webscraping and research tasks for the data journalist by Dan Nguyen.

From the webpage:

This repository contains 101 Web data-collection tasks in Python 3 that I assigned to my Computational Journalism class in Spring 2015 to give them regular exercise in programming and conducting research, and to expose them to the variety of data published online.

The hard part of many of these tasks is researching and finding the actual data source. The scripts need only concern themselves with fetching the data and printing the answer in the least painful way possible. Since the Computational Journalism class wasn’t intended to be an actual programming class, adherence to idioms and best code practices was not emphasized…(especially since I’m new to Python myself!)

Too good of an idea to not steal! Practical and immediate results, introduction to coding, etc.

What 101 tasks do you want to document and with what tool?

PS: The Computational Journalism class site has a nice set of online references for Python.
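In the spirit of the repository — fetch the data, print the answer, least pain possible — here is one hypothetical task solved with only the Python 3 standard library. A real task script would fetch the page with urllib.request; the HTML is inlined here so the sketch runs without a network connection:

```python
# Hypothetical task: how many outbound links does a page contain?
from html.parser import HTMLParser

class LinkCounter(HTMLParser):
    """Count <a> tags that actually carry an href attribute."""
    def __init__(self):
        super().__init__()
        self.links = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(k == "href" for k, _ in attrs):
            self.links += 1

# Inlined stand-in for a fetched page.
page = '<p><a href="/a">one</a> <a href="/b">two</a> <a name="x">anchor</a></p>'
parser = LinkCounter()
parser.feed(page)
print(parser.links)  # 2 -- the name-only <a> is not a link
```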

Programming for Humanists at TAMU [and Business Types]

August 17th, 2015

Programming for Humanists at TAMU

From the webpage:

[What is DH?] Digital Humanities studies the intersection and mutual influence of humanities ideas and digital methods, with the goal of understanding how the use of digital technologies and approaches alters the practice and theory of humanities scholarship. In this sense it is concerned with studying the emergence of scholarly disciplines and communicative practices at a time when those are in flux, under the influence of rapid technological, institutional and cultural change. As a way of identifying digital interests and efforts within traditional humanities fields, the term “digital humanities” also identifies, in a general way, any kind of critical engagement with digital tools and methods in a humanities context. This includes the creation of digital editions and digital text or image collections, and the creation and use of digital tools for the investigation and analysis of humanities research materials. – Julia Flanders, Northeastern University (http://goo.gl/BJeXk2)

 

Programming4Humanists is a two-semester course sequence designed to introduce participants to methodologies, coding, and programming languages associated with the Digital Humanities. We focus on creation, editing, and searchability of digital archives, but also introduce students to data mining and statistical analysis. Our forte at Texas A&M University (TAMU) is Optical Character Recognition of early modern texts, a skill we learned in completing the Early Modern OCR Project, or eMOP. Another strength that the Initiative for Digital Humanities, Media, and Culture (http://idhmc.tamu.edu) at TAMU brings to this set of courses is the Texas A&M University Press book series called “Programming for Humanists.” We use draft and final versions of these books, as well as many additional resources available on companion web pages, for participants in the workshop. The books in this series are of course upon publication available to anyone, along with the companion sites, whether the person has participated in the workshop or not. However, joining the Programming4Humanists course enables participants to communicate with the authors of these books for the sake of asking questions and indeed, through their questioning, helping us to improve the books and web materials. Our goal is to help people learn Digital Humanities methods and techniques.

Participants

Those who should attend include faculty, staff, librarians, undergraduates, and graduate students, interested in making archival and cultural materials available to a wide audience while encoding them digitally according to best practices, standards that will allow them to submit their digital editions for peer review by organizations such as the MLA Committee for Scholarly Edition and NINES / 18thConnect. Librarians will be especially interested in learning our OCR procedures as a means for digitizing large archives. Additionally, scholars, students, and librarians will receive an introduction to text mining and XQuery, the latter used for analyzing semantically rich data sets. This course gives a good overview of what textual and archival scholars are accomplishing in the field of Digital Humanities, even though the course is primarily concerned with teaching skills to participants. TAMU graduate and undergraduate students may take this course for 2 credit hours, see Schedule of Classes for Fall 2015: LBAR 489 or 689 Digital Scholarship and Publication.

Prerequisites

No prior knowledge is required but some familiarity with TEI/XML, HTML, and CSS will be helpful (See previous Programming 4 Humanists course syllabus). Certificate registrants will receive certificates confirming that they have a working knowledge of Drupal, XSLT, XQuery, and iPython Notebooks. Registration for those getting a certificate includes continued access to all class videos during the course period and an oXygen license. Non-certificate registrants will have access to the class videos for one week.

Everything that Julia says is true and this course will be very valuable for traditional humanists.

It will also be useful for business types who aren’t quants or CS majors/minors. The same “friendly” learning curve suits both audiences.

You won’t be a “power user” at the end of this course but you will sense when CS folks are blowing smoke. It happens.

State of the Haskell ecosystem – August 2015

August 17th, 2015

State of the Haskell ecosystem – August 2015 by Gabriel Gonzalez.

From the webpage:

In this post I will describe the current state of the Haskell ecosystem to the best of my knowledge and its suitability for various programming domains and tasks. The purpose of this post is to discuss both the good and the bad by advertising where Haskell shines while highlighting where I believe there is room for improvement.

This post is grouped into two sections: the first section covers Haskell’s suitability for particular programming application domains (i.e. servers, games, or data science) and the second section covers Haskell’s suitability for common general-purpose programming needs (such as testing, IDEs, or concurrency).

The topics are roughly sorted from greatest strengths to greatest weaknesses. Each programming area will also be summarized by a single rating of either:

  • Best in class: the best experience in any language
  • Mature: suitable for most programmers
  • Immature: only acceptable for early-adopters
  • Bad: pretty unusable

The more positive the rating the more I will support the rating with success stories in the wild. The more negative the rating the more I will offer constructive advice for how to improve things.

There is nothing that provokes discussion more than a listing of items with quality rankings!

Enjoy!

AT&T’s Betrayal of Its Customers

August 16th, 2015

NSA Spying Relies on AT&T’s ‘Extreme Willingness to Help’ by Julia Angwin and Jeff Larson, ProPublica; Charlie Savage and James Risen, The New York Times; and Henrik Moltke and Laura Poitras, special to ProPublica.

From the post:

The National Security Agency’s ability to spy on vast quantities of Internet traffic passing through the United States has relied on its extraordinary, decades-long partnership with a single company: the telecom giant AT&T.

While it has been long known that American telecommunications companies worked closely with the spy agency, newly disclosed NSA documents show that the relationship with AT&T has been considered unique and especially productive. One document described it as “highly collaborative,” while another lauded the company’s “extreme willingness to help.”

Timelines, source documents, and analysis sketch a damning outline of AT&T’s betrayal of its customers over more than a decade.

If you are an AT&T customer, this article is a must read. If you know someone who is an AT&T customer, please forward this article to their attention. Post it to Facebook, Twitter, etc.

You may not be able to force changes in government spy programs but as customers, collectively we can impact the bottom line of their co-conspirators.

I saw a cartoon that is a fair take on government rhetoric in this area today:

[Cartoon: shadow-jihadist]

The word to pass on to vendors is: You can be my friend or a friend of the government. Choose carefully.

New Organizations to Support Astroinformatics and Astrostatistics

August 15th, 2015

New Organizations to Support Astroinformatics and Astrostatistics by Eric D. Feigelson, Željko Ivezić, Joseph Hilbe, Kirk D. Borne.

Abstract:

In the past two years, the environment within which astronomers conduct their data analysis and management has rapidly changed. Working Groups associated with international societies and Big Data projects have emerged to support and stimulate the new fields of astroinformatics and astrostatistics. Sponsoring societies include the International Statistical Institute, International Astronomical Union, American Astronomical Society, and Large Synoptic Survey Telescope project. They enthusiastically support cross-disciplinary activities where the advanced capabilities of computer science, statistics and related fields of applied mathematics are applied to advance research on planets, stars, galaxies and the Universe. The ADASS community is encouraged to join these organizations and to explore and engage in their public communication Web site, the Astrostatistics and Astroinformatics Portal (http://asaip.psu.edu).

I don’t suppose that any of the terminology is going to change as astroinformatics and astrostatistics develop. Do you?

Whether any of us will be clever enough to capture those changes as they happen, rather than after the fact in large legacy data projects, remains to be seen.

Do visit: Astrostatistics and Astroinformatics Portal (http://asaip.psu.edu). There are a large number of exciting resources.

Modeling and Analysis of Complex Systems

August 15th, 2015

Introduction to the Modeling and Analysis of Complex Systems by Hiroki Sayama.

From the webpage:

Introduction to the Modeling and Analysis of Complex Systems introduces students to mathematical/computational modeling and analysis developed in the emerging interdisciplinary field of Complex Systems Science. Complex systems are systems made of a large number of microscopic components interacting with each other in nontrivial ways. Many real-world systems can be understood as complex systems, where critically important information resides in the relationships between the parts and not necessarily within the parts themselves. This textbook offers an accessible yet technically-oriented introduction to the modeling and analysis of complex systems. The topics covered include: fundamentals of modeling, basics of dynamical systems, discrete-time models, continuous-time models, bifurcations, chaos, cellular automata, continuous field models, static networks, dynamic networks, and agent-based models. Most of these topics are discussed in two chapters, one focusing on computational modeling and the other on mathematical analysis. This unique approach provides a comprehensive view of related concepts and techniques, and allows readers and instructors to flexibly choose relevant materials based on their objectives and needs. Python sample codes are provided for each modeling example.

This textbook is available for purchase in both grayscale and color via Amazon.com and CreateSpace.com.

Do us all a favor and pass along the purchase options for classroom hard copies. This style of publishing will last only so long as a majority of us support it. Thanks!

From the introduction:

This is an introductory textbook about the concepts and techniques of mathematical/computational modeling and analysis developed in the emerging interdisciplinary field of complex systems science. Complex systems can be informally defined as networks of many interacting components that may arise and evolve through self-organization. Many real-world systems can be modeled and understood as complex systems, such as political organizations, human cultures/languages, national and international economies, stock markets, the Internet, social networks, the global climate, food webs, brains, physiological systems, and even gene regulatory networks within a single cell; essentially, they are everywhere. In all of these systems, a massive amount of microscopic components are interacting with each other in nontrivial ways, where important information resides in the relationships between the parts and not necessarily within the parts themselves. It is therefore imperative to model and analyze how such interactions form and operate in order to understand what will emerge at a macroscopic scale in the system.

Complex systems science has gained an increasing amount of attention from both inside and outside of academia over the last few decades. There are many excellent books already published, which can introduce you to the big ideas and key take-home messages about complex systems. In the meantime, one persistent challenge I have been having in teaching complex systems over the last several years is the apparent lack of accessible, easy-to-follow, introductory-level technical textbooks. What I mean by technical textbooks are the ones that get down to the “wet and dirty” details of how to build mathematical or computational models of complex systems and how to simulate and analyze them. Other books that go into such levels of detail are typically written for advanced students who are already doing some kind of research in physics, mathematics, or computer science. What I needed, instead, was a technical textbook that would be more appropriate for a broader audience—college freshmen and sophomores in any science, technology, engineering, and mathematics (STEM) areas, undergraduate/graduate students in other majors, such as the social sciences, management/organizational sciences, health sciences and the humanities, and even advanced high school students looking for research projects who are interested in complex systems modeling.

Can you imagine that? A technical textbook appropriate for a broad audience?

Perish the thought!

I could name several W3C standards that could have used that editorial stance as opposed to: “…we know what we meant….”

I should consider that a market opportunity: translating insider jargon (deliberately insider, at that) into more generally accessible language. It might even help with uptake of the standards.

While I think about that, enjoy this introduction to complex systems, with Python, no less.
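Since the textbook supplies Python sample code for each modeling example, here is a comparable minimal sketch of one topic it covers — a discrete-time model, the logistic map — written independently of the book's own code:

```python
# Discrete-time model: the logistic map x_{t+1} = r * x_t * (1 - x_t).
# For moderate r (here 2.5) the trajectory settles to the fixed point
# 1 - 1/r; near r = 4 the same one-line rule behaves chaotically.
def logistic_trajectory(r, x0, steps):
    """Iterate the map from x0 and return the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

xs = logistic_trajectory(r=2.5, x0=0.1, steps=100)
print(round(xs[-1], 4))  # 0.6 -- the fixed point 1 - 1/2.5
```

Swap r to 3.9 and plot the trajectory to see the chaotic regime the book analyzes.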

Proofing R Functions Cheatsheet?

August 14th, 2015

Lillian Pierson has posted a cheatsheet of R functions.

If you want to do a good deed this weekend, lend a hand with the proofing.

What do you make of the ordering under each heading? I prefer alphabetical. You?

Enjoy!

Conspiring to Provide Material Support to Terrorists (buy a plane ticket)

August 14th, 2015

Brooklyn, New York, Resident Pleads Guilty to Conspiring to Provide Material Support to Terrorists

And what was the conspiracy and act in furtherance of it that resulted in this charge?


According to previous court filings, in August 2014, Juraboev posted a threat on an Uzbek-language website to kill President Obama in an act of martyrdom on behalf of ISIL. In subsequent interviews by federal agents, Juraboev stated his belief in ISIL’s terrorist agenda, including the establishment by force of an Islamic caliphate in Iraq and Syria. Juraboev stated that he wanted to travel to Syria to fight on behalf of ISIL but lacked the means to travel. He stated that if he were unable to travel, he would engage in an act of martyrdom on U.S. soil if ordered to do so by ISIL, such as killing the President or planting a bomb on Coney Island, New York. During the next several months, Juraboev and a co-conspirator discussed plans to travel to Syria to fight on behalf of ISIL, culminating in Juraboev’s purchase on Dec. 27, 2014, of a ticket to travel from John F. Kennedy International Airport in Queens, New York, to Istanbul, departing on March 29, 2015.

The longer the “war on terrorism” lasts the more absurd the government’s conduct.

You may be interested to learn that despite being interviewed twice by FBI agents about his desire to either join the Islamic State of Iraq and Syria or commit a terrorist act in the United States, nothing was done to discourage Juraboev from such actions. As a matter of fact, a paid informant later made contact with Juraboev, and it is only after the appearance of the confidential informant that the case began to look like more than random talk.

Murtaza Hussain has a great recounting of the chain of events in Confidential Informant Played Key Role in FBI Foiling Its Own Terror Plot.

Terrorists are so scarce in the United States that the FBI has to pay informants to encourage people to cross over the line into criminal behavior.

That should give you some idea of how much a non-problem terrorism is in the United States.

Death and Rebirth? (a moon shot in government IT)

August 14th, 2015

[Image: mooning-garden-gnome]

No, not that sort of moon shot!

Steve O’Keeffe writes in Death and Rebirth?

The OPM breach – and the subsequent Cyber Sprint – may be just the jolt we need to euthanize our geriatric Fed IT. According to Tony Scott and GAO at this week’s FITARA Forum, we spend more than 80 percent of the $80 billion IT budget on operations and maintenance for legacy systems. You see with the Cyber Sprint we’ve been looking hard at how to secure our systems. And, the simple truth of the matter is – it’s impossible. It’s impossible to apply two-factor authentication to systems and applications built in the ’60s, ’70s, ’80s, ’90s, and naughties.

Here’s an opportunity for real leadership – to move away from advocating for incremental change, like Cloud First, Mobile First, FDCCI, HSPD-12, TIC, etc. These approaches have clearly failed us. Now’s the time for a moon shot in government IT – a digital Interstate Highway program. I’m going to call this .usa 2020 – the idea to completely replace our aging Federal IT infrastructure by 2020. You see, IT is the highway artery system that connects America today. I’m proposing that we take inspiration from the OPM disaster – and the next cyber disaster lurking oh so inevitably around the next corner – to undertake a mainstream modernization of the Federal government’s IT infrastructure and applications. It’s not about transformation, it’s about death and rebirth.

To be clear, this is not simply about moving to the cloud. It’s about really reinventing government IT. It’s not just that our Federal IT systems are decrepit and insecure – it’s about the fact they’re dysfunctional. How can it be that the top five addresses in America received 4,900 tax refunds in 2014? How did a single address in Lithuania get 699 tax refunds? How can we have 777 supply chain systems in the Federal government?

You can’t see it but when Steve asked the tax refunds question, my hand looked like a helicopter blade in the air. ;-) I know the answer to that question.

First, the IRS estimates 249 million tax returns will be filed in 2015.

Second, in addition to IT reductions, the IRS is required by law to “pay first, ask questions later,” and to deliver refunds within thirty (30) days. Tax-refund fraud to hit $21 billion, and there’s little the IRS can do.
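Even under "pay first, ask questions later," flagging those absurd refund concentrations after the fact is a one-liner: group refunds by address and surface the outliers. A toy sketch (the addresses and threshold are invented, not IRS data or policy):

```python
# Flag addresses receiving an anomalous number of tax refunds.
# Data and threshold are invented for illustration.
from collections import Counter

def flag_addresses(refund_addresses, threshold=100):
    """Count refunds per address; return addresses over the threshold."""
    counts = Counter(refund_addresses)
    return {addr: n for addr, n in counts.items() if n > threshold}

# One invented hot address amid many ordinary single-refund addresses.
refunds = ["123 Main St"] * 699 + [f"{i} Oak Ave" for i in range(500)]
print(flag_addresses(refunds))  # {'123 Main St': 699}
```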

I agree that federal IT systems could be improved, but if funds are not available for present systems, what are the odds of adequate funding being available for a complete overhaul?

BTW, the “80 percent of the $80 billion IT budget” works out to about $64 billion. If you were getting part of that $64 billion now, how hard would you resist changes that eliminated your part of that $64 billion?

Bear in mind that elimination of legacy systems also means users of those legacy systems will have to be re-trained on the replacement systems. We all know how popular forsaking legacy applications is among users.

As a practical matter, rip-n-replace proposals buy you virulent opposition from people currently enjoying $64 billion in payments every year and the staffers who use those legacy systems.

On the other hand, layering solutions, like topic maps, buy you support from people currently enjoying $64 billion in payments every year and the staffers who use those legacy systems.

Being a bright, entrepreneurial sort of person, which option are you going to choose?

Spreadsheets – 90+ million End User Programmers…

August 13th, 2015

Spreadsheets – 90+ million End User Programmers With No Comment Tracking or Version Control by Patrick Durusau and Sam Hunting.

From all available reports, Sam Hunting did a killer job presenting our paper at the Balisage conference on Wednesday of this week! Way to go Sam!

I will be posting the slides and the files shown in the presentation tomorrow.

BTW, development of the topic map for one or more Enron spreadsheets will continue.

Watch this blog for future developments!