Archive for the ‘Python’ Category


Monday, June 12th, 2017

FreeDiscovery: Open Source e-Discovery and Information Retrieval Engine

From the webpage:

FreeDiscovery is built on top of existing machine learning libraries (scikit-learn) and provides a REST API for information retrieval applications. It aims to benefit existing e-Discovery and information retrieval platforms with a focus on text categorization, semantic search, document clustering, duplicates detection and e-mail threading.

In addition, FreeDiscovery can be used as Python package and exposes several estimators with a scikit-learn compatible API.

Python 3.5+ required.

Homepage has command line examples, with a pointer to: for more examples.

The additional examples use a subset of the TREC 2009 legal collection. Cool!

I saw this in a tweet by Lynn Cherny today.


Python for Data Journalists: Analyzing Money in Politics

Friday, May 19th, 2017

Python for Data Journalists: Analyzing Money in Politics by Knight Center.

From the webpage:

Data journalists are the newest rock stars of the newsroom. Using computer programming and data journalism techniques, they have the power to cull through big data to find original and important stories.

Learn these techniques and some savvy computer programming to produce your own bombshell investigations in the latest massive open online course (MOOC) from the Knight Center, “Python for Data Journalists: Analyzing Money in Politics.”

Instructor Ben Welsh, editor of the Los Angeles Times Data Desk and co-founder of the California Civic Data Coalition, will show students how to turn big data into great journalism with speed and veracity. The course takes place from June 12 to July 9, 2017, so register now.

A high priority for your summer because:

  1. You will learn techniques for data analysis
  2. Learning #1 enables you to perform data analysis
  3. Learning #1 enables you to better question data analysis

I skimmed the post and did not see any coverage of obtaining concealed information.

Perhaps that will be the subject of a wholly anonymous MOOC. 😉

Do register! This looks useful and fun!

PS: Developing a relationship with a credit bureau or bank staffer should be an early career goal. No one is capable of obtaining “extra” money and just sitting on it forever.

Web Scraping Reference: …

Thursday, April 6th, 2017

Web Scraping Reference: A Simple Cheat Sheet for Web Scraping with Python by Hartley Brody.

From the post:

Once you’ve put together enough web scrapers, you start to feel like you can do it in your sleep. I’ve probably built hundreds of scrapers over the years for my own projects, as well as for clients and students in my web scraping course.

Occasionally though, I find myself referencing documentation or re-reading old code looking for snippets I can reuse. One of the students in my course suggested I put together a “cheat sheet” of commonly used code snippets and patterns for easy reference.

I decided to publish it publicly as well – as an organized set of easy-to-reference notes – in case they’re helpful to others.

Brody uses Beautiful Soup, a Python library that will parse even the worst formed HTML.

I mention this so I will remember, the next time I scrape Wikileaks, that there are easier ways to do the job than download, repair with Tidy, then parse with Saxon/XQuery!


Mining Twitter Data with Python [Trump Years Ahead]

Wednesday, December 21st, 2016

Marco Bonzanini, author of Mastering Social Media Mining with Python, has a seven part series of posts on mining Twitter with Python.

If you haven’t been mining Twitter before now, President-elect Donald Trump is about to change all that.

What if Trump continues to tweet as President and authorizes his appointees to do the same? Spontaneity isn’t the same thing as openness but it could prove to be interesting.

How to get superior text processing in Python with Pynini

Saturday, November 19th, 2016

How to get superior text processing in Python with Pynini by Kyle Gorman and Richard Sproat.

From the post:

It’s hard to beat regular expressions for basic string processing. But for many problems, including some deceptively simple ones, we can get better performance with finite-state transducers (or FSTs). FSTs are simply state machines which, as the name suggests, have a finite number of states. But before we talk about all the things you can do with FSTs, from fast text annotation—with none of the catastrophic worst-case behavior of regular expressions—to simple natural language generation, or even speech recognition, let’s explore what a state machine is, what they have to do with regular expressions.

Reporters, researchers and others will face a 2017 where the rate of information has increased, along with noise from media spasms over the latest taunt from president-elect Trump.

Robust text mining/filtering will be daily necessities, if they aren’t already.

Tagging text is the first example. Think about auto-generating graphs from emails with “to:,” “from:,” “date:,” and key terms in the email. Tagging the key terms is essential to that process.

Once tagged, you can slice and dice the text as more information is uncovered.
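A minimal sketch of the tagging idea, using the standard library's re and email modules rather than Pynini's finite-state transducers. The function names, the `<term>` markup and the sample addresses are illustrative assumptions, not anything from the Pynini post.

```python
import re
from email import message_from_string
from email.utils import getaddresses


def tag_terms(text, terms):
    """Wrap each occurrence of a key term in <term>...</term> markers."""
    if not terms:
        return text
    # Longest terms first, so a longer phrase wins over a term it contains.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: "<term>" + m.group(0) + "</term>", text)


def header_fields(raw):
    """Pull the to/from/date fields that would become graph attributes."""
    msg = message_from_string(raw)
    return {
        "from": [addr for _, addr in getaddresses(msg.get_all("from", []))],
        "to": [addr for _, addr in getaddresses(msg.get_all("to", []))],
        "date": msg["date"],
    }
```

Regular expressions are the simpler tool for a sketch like this; the post's argument is that FSTs scale better once the term list and the rules grow.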


Python Data Science Handbook

Saturday, November 19th, 2016

Python Data Science Handbook (Github)

From the webpage:

Jupyter notebook content for my O’Reilly book, the Python Data Science Handbook.


See also the free companion project, A Whirlwind Tour of Python: a fast-paced introduction to the Python language aimed at researchers and scientists.

This repository will contain the full listing of IPython notebooks used to create the book, including all text and code. I am currently editing these, and will post them as I make my way through. See the content here:


Parsing Emails With Python, A Quick Tip

Monday, October 31st, 2016

While some stuff runs in the background, a quick tip on parsing email with Python.

I got the following error message from Python:

Traceback (most recent call last):
File “”, line 20, in
date = dateutil.parser.parse(msg[‘date’])
File “/usr/lib/python2.7/dist-packages/dateutil/”, line 697, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File “/usr/lib/python2.7/dist-packages/dateutil/”, line 301, in parse
res = self._parse(timestr, **kwargs)
File “/usr/lib/python2.7/dist-packages/dateutil/”, line 349, in _parse
l = _timelex.split(timestr)
File “/usr/lib/python2.7/dist-packages/dateutil/”, line 143, in split
return list(cls(s))
File “/usr/lib/python2.7/dist-packages/dateutil/”, line 137, in next
token = self.get_token()
File “/usr/lib/python2.7/dist-packages/dateutil/”, line 68, in get_token
nextchar =
AttributeError: ‘NoneType’ object has no attribute ‘read’

I have edited the email header in question but it reproduces the original error:

Received: by with SMTP id w14cs34683wfw;
Wed, 5 Nov 2008 08:11:39 -0800 (PST)
Received: by with SMTP id r1mr728791wad.136.1225901498795;
Wed, 05 Nov 2008 08:11:38 -0800 (PST)
Received: from ( [])
by with ESMTP id m26si29354pof.3.2008.;
Wed, 05 Nov 2008 08:11:38 -0800 (PST)
Received-SPF: pass ( domain of designates
Received: from ([])
by with comcast
id bUBY1a0010b6N64A9UBeJl; Wed, 05 Nov 2008 16:11:38 +0000
Received: from ([])
by with comcast
id bUAV1a00L2JMgtY8PUAV7G; Wed, 05 Nov 2008 16:10:30 +0000
X-Authority-Analysis: v=1.0 c=1 a=1Ht49J2nGmlg0oY3xr8A:9
a=8nxvWDfACCTtBObdks-tTUtrMyYA:4 a=OA_lqj45gZcA:10 a=diNjy0DT58-4uIkuavEA:9
a=e0_VUgpf8QEu0XMU188OmzzKrzoA:4 a=37WNUvjkh6kA:10
Received: from [] by;
Wed, 05 Nov 2008 16:10:28 +0000

To: “Podesta” ,
CC: “Denis McDonough OFA” ,”,,
Subject: DOD leadership – immediate attention
Date: Wed, 05 Nov 2008 16:10:28 +0000
Message-Id: <110520081610.3048.4911C574000C2E2100000BE82216>
X-Mailer: AT&T Message Center Version 1 (Oct 30 2007)
X-Authenticated-Sender: c2V3YWxsY29ucm95QGNvbWNhc3QubmV0
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=”NextPart_Webmail_9m3u9jl4l_3048_1225901428_0″

Content-Type: text/plain
Content-Transfer-Encoding: 8bit

I’m comparing “Date” to similar emails and getting no joy.

Absence is hard to notice, but once you know the rule, it’s obvious:

RFC822: Standard for ARPA Internet Text Messages says in part:

3. Lexical Analysis of Messages

3.1 General Description

A message consists of header fields and, optionally, a body. The body is simply a sequence of lines containing ASCII characters. It is separated from the headers by a null line (i.e., a line with nothing preceding the CRLF). (emphasis added)

Yep, the blank line I introduced while removing an errant double-quote on a line by itself created the start of the body of the message.

Meaning that my Python script failed to find the “Date:” field and returned what someone thought would be a useful error message.

When you get errors parsing emails with Python (and I assume in other languages), check the format of your messages!
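The effect of a stray blank line can be reproduced with the standard library's email module. This is a small sketch with made-up headers and addresses (not the edited message above): the first blank line ends the header block, so a later “Date:” line is treated as body text and the lookup returns None.

```python
import email

# A well-formed message: headers, one blank line, then the body.
GOOD = (
    "To: someone@example.com\n"
    "Subject: test\n"
    "Date: Wed, 05 Nov 2008 16:10:28 +0000\n"
    "\n"
    "Body text\n"
)

# The same message with a stray blank line after the first header.
BROKEN = (
    "To: someone@example.com\n"
    "\n"
    "Subject: test\n"
    "Date: Wed, 05 Nov 2008 16:10:28 +0000\n"
    "\n"
    "Body text\n"
)


def get_date_header(raw):
    """Return the Date header, or None when the parser never saw one."""
    msg = email.message_from_string(raw)
    return msg["date"]
```

Checking for None before handing the value to a date parser turns an obscure AttributeError into a message that names the real problem.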

RFC822 has an appendix of parsing rules and a few examples.

Suggested listings of the most common email/email header format errors?

Clinton/Podesta 19, DKIM-verified-podesta-19.txt.gz, DKIM-complete-podesta-19.txt.gz

Wednesday, October 26th, 2016

Michael Best, @NatSecGeek, posted release 19 of the Clinton/Podesta emails at: today.

A total of 1518 emails, zero (0) of which broke my script!

Three hundred and sixty-three were DKIM verified! DKIM-verified-podesta-19.txt.gz.

The full set of emails, verified and not: DKIM-complete-podesta-19.txt.gz.

I’m still pondering how to best organize the DKIM verified material for access.

I could segregate “verified” emails for indexing. So any “hits” from those searches are from “verified” emails?

Ditto for indexing only attachments of “verified” emails.

What about a graph constructed solely from “verified” emails?

Or should I make verified a property of the emails as nodes? Reasoning that aside from exploring the email importation in Gephi 8.2, it would not be that much more difficult to build node and adjacency lists from the raw emails.
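A rough sketch of the second option, assuming each email arrives as a pair of raw text and a DKIM result computed elsewhere. Emails and addresses both become nodes, and `verified` is a property on the email nodes. The function name, the node attribute names and the fallback ids are all hypothetical.

```python
from email import message_from_string
from email.utils import getaddresses


def build_lists(messages):
    """Build node and edge lists from (raw_text, verified) pairs."""
    nodes = {}   # node id -> attribute dict
    edges = []   # (source id, target id) pairs
    for index, (raw, verified) in enumerate(messages):
        msg = message_from_string(raw)
        # Use the Message-Id when present, otherwise a positional id.
        email_id = msg["message-id"] or "email-%d" % index
        nodes[email_id] = {"type": "email", "verified": bool(verified)}

        senders = getaddresses(msg.get_all("from", []))
        recipients = getaddresses(
            msg.get_all("to", []) + msg.get_all("cc", [])
        )
        for _, addr in senders:
            if addr:
                nodes.setdefault(addr.lower(), {"type": "address"})
                edges.append((addr.lower(), email_id))
        for _, addr in recipients:
            if addr:
                nodes.setdefault(addr.lower(), {"type": "address"})
                edges.append((email_id, addr.lower()))
    return nodes, edges
```

Keeping `verified` as a node property means one graph can serve both views: filter on the property for a "verified only" graph, or ignore it for the full set.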


Serious request for help.

Like Gollum, I know what I keep in my pockets, but I have no idea what other people keep in theirs.

What would make this data useful to you?

Clinton/Podesta 1-18,,

Tuesday, October 25th, 2016

After a long day of waiting for scripts to finish and re-running them to cross-check the results, I am happy to present:

DKIM-verified-podesta-1-18.txt.gz, which consists of the Podesta emails (7526) which returned true for a test of their DKIM signature.

The complete set of the results for all 31,819 emails, can be found in:


An email that has been “verified” has a cryptographic guarantee that it was sent even as it appears to you now.

An email that fails verification may be just as trustworthy, but its DKIM signature has failed for any number of reasons.

One of my motivations for classifying these emails is to enable the exploration of why DKIM verification failed on some of these emails.

Question: What would make this data more useful/accessible to journalists/bloggers?

I ask because, while dumping data and/or transformations of data can be useful, it is synthesizing data into a coherent narrative that is the essence of journalism/reporting.

I would enjoy doing the first in hopes of furthering the second.

PS: More emails will be added to this data set as they become available.

Corrupt (fails with my script) files in Clinton/Podesta Emails (14 files out of 31,819)

Tuesday, October 25th, 2016

You may use some other definition of “file corruption” but that’s mine and I’m sticking to it.


The following are all the files that failed against my script and the actions I took to proceed with parsing the files. Not today but I will make a sed script to correct these files as future accumulations of emails appear.

13544 00047141.eml

Date string parse failed:

Date: Wed, 17 Dec 2008 12:35:42 -0700 (GMT-07:00)

Deleted (GMT-07:00).

15431 00059196.eml

Date string parse failed:

Date: Tue, 22 Sep 2015 06:00:43 +0800 (GMT+08:00)

Deleted (GMT+08:00).

155 00049680.eml

Date string parse failed:

Date: Mon, 27 Jul 2015 03:29:35 +0000

Assuming, as the email reports, was the sender and was the intended receiver, then the offset from UT is clearly wrong (+0000).

Deleted +0000.

6793 00059195.eml

Date string parse failed:

Date: Tue, 22 Sep 2015 05:57:54 +0800 (GMT+08:00)

Deleted (GMT+08:00).

9404 0015843.eml DKIM failure

All of the DKIM parse failures take the form:

Traceback (most recent call last):
File “”, line 18, in
verified = dkim.verify(data)
File “/usr/lib/python2.7/dist-packages/dkim/”, line 604, in verify
return d.verify(dnsfunc=dnsfunc)
File “/usr/lib/python2.7/dist-packages/dkim/”, line 506, in verify
File “/usr/lib/python2.7/dist-packages/dkim/”, line 181, in validate_signature_fields
if int(sig[b'x']) < int(sig[b't']):
KeyError: 't'

I simply deleted the DKIM-Signature in question. Will go down that rabbit hole another day.

21960 00015764.eml

DKIM signature parse failure.

Deleted DKIM signature.

23177 00015850.eml

DKIM signature parse failure.

Deleted DKIM signature.

23728 00052706.eml

Invalid character in RFC822 header.

I discovered an errant ‘”‘ (double quote mark) at the start of a line.

Deleted the double quote mark.

And deleted ^M line endings.

25040 00015842.eml

DKIM signature parse failure.

Deleted DKIM signature.

26835 00015848.eml

DKIM signature parse failure.

Deleted DKIM signature.

28237 00015840.eml

DKIM signature parse failure.

Deleted DKIM signature.

29052 0001587.eml

DKIM signature parse failure.

Deleted DKIM signature.

29099 00015759.eml

DKIM signature parse failure.

Deleted DKIM signature.

29593 00015851.eml

DKIM signature parse failure.

Deleted DKIM signature.

Here’s an odd pattern for you: all nine (9) of the failures to parse DKIM signatures were on mail originating from:

From: Gene Karpinski

But there are approximately thirty-three (33) emails from Karpinski so it doesn’t fail every time.

The file numbers are based on the 1-18 distribution of Podesta emails created by Michael Best, @NatSecGeek, at: Podesta Emails (zipped).
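Three of the four date-string failures above share a trailing parenthetical such as (GMT-07:00). As a stand-in for the sed script mentioned earlier, here is a minimal Python sketch that strips that trailing comment before parsing. It uses the standard library's email.utils parser rather than dateutil, and the function names are illustrative.

```python
import re
from email.utils import parsedate_to_datetime

# A single parenthesized comment at the very end of the header value.
TRAILING_COMMENT = re.compile(r"\s*\([^()]*\)\s*$")


def clean_date(value):
    """Drop a trailing '(...)' comment from a Date header value."""
    return TRAILING_COMMENT.sub("", value)


def parse_date(value):
    """Parse a Date header value after removing any trailing comment."""
    return parsedate_to_datetime(clean_date(value))
```

This does nothing for the fourth failure, where the offset itself was removed by hand; that one still needs a human decision.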

Finding “unknown string format” in 1.7 GB of files – Parsing Clinton/Podesta Emails

Tuesday, October 25th, 2016

Testing my “dirty” script against Podesta Emails (1.7 GB), some 17,296 files, I got the following message:

Traceback (most recent call last):
File “”, line 20, in
date = dateutil.parser.parse(msg[‘date’])
File “/usr/lib/python2.7/dist-packages/dateutil/”, line 697, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File “/usr/lib/python2.7/dist-packages/dateutil/”, line 303, in parse
raise ValueError, “unknown string format”
ValueError: unknown string format

Now I have to find the file that broke the script.

Beginning Python programmers are laughing at this point because they know using:

for name in glob.glob('*.eml'):

is going to make finding the offending file difficult.


Consulting the programming oracle (Stack Overflow) on ordering of glob.glob in Python I learned:

By checking the source code of glob.glob you see that it internally calls os.listdir, described here:

Key sentence: os.listdir(path) Return a list containing the names of the entries in the directory given by path. The list is in arbitrary order. It does not include the special entries ‘.’ and ‘..’ even if they are present in the directory.

Arbitrary order. 🙂

Interesting but not quite an actionable answer!

Take a look out:

Order is arbitrary, but you can sort them yourself

If you want sorted by name:

sorted(glob.glob('*.png'))

sorted by modification time:

import os
sorted(glob.glob('*.png'), key=os.path.getmtime)

sorted by size:

import os
sorted(glob.glob('*.png'), key=os.path.getsize)


So for ease in finding the offending file(s) I adjusted:

for name in glob.glob('*.eml'):

to:

for name in sorted(glob.glob('*.eml')):

Now I can tail the results file in question and the next file is where the script failed.
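A small sketch that takes the same idea one step further: walk the files in sorted order and record the name of each file whose handler raises, instead of stopping at the first failure. The function name `process_sorted` and the `handler` argument are illustrative and not part of the original script.

```python
import glob


def process_sorted(pattern, handler):
    """Run handler on each matching file in name order.

    Returns a list of (filename, error description) pairs for the
    files that raised, so none of them has to be hunted down later.
    """
    failures = []
    for name in sorted(glob.glob(pattern)):
        try:
            handler(name)
        except Exception as exc:  # deliberately broad: we want every failure
            failures.append((name, repr(exc)))
    return failures
```

Called as `process_sorted('*.eml', parse_one)`, where `parse_one` is whatever function parses a single file, it returns the full list of offenders in one pass.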

More on the files that failed in a separate post.

Data Science for Political and Social Phenomena [Special Interest Search Interface]

Sunday, October 23rd, 2016

Data Science for Political and Social Phenomena by Chris Albon.

From the webpage:

I am a data scientist and quantitative political scientist. I specialize in the technical and organizational aspects of applying data science to political and social issues.

Years ago I noticed a gap in the existing data literature. On one side was data science, with roots in mathematics and computer science. On the other side were the social sciences, with hard-earned expertise modeling and predicting complex human behavior. The motivation for this site and ongoing book project is to bridge that gap: to create a practical guide to applying data science to political and social phenomena.

Chris has organized three hundred and twenty-eight pages on Data Wrangling, Python, R, etc.

If you like learning from examples, this is the site for you!

Including this site, what other twelve (12) sites would you include in a Python/R Data Science search interface?

That is an interface that has indexed only that baker’s dozen of sites. So you don’t spend time wading through “the G that is not named” search results.

Serious question.

Not that I would want to maintain such a beast for external use, but having a local search engine tuned to your particular interests could be nice.

Python and Machine Learning in Astronomy (Rejuvenate Your Emotional Health)

Saturday, October 22nd, 2016

Python and Machine Learning in Astronomy (Episode #81) (Jake VanderPlas)

From the webpage:

The advances in Astronomy over the past century are both evidence of and confirmation of the highest heights of human ingenuity. We have learned by studying the frequency of light that the universe is expanding. By observing the orbit of Mercury that Einstein’s theory of general relativity is correct.

It probably won’t surprise you to learn that Python and data science play a central role in modern day Astronomy. This week you’ll meet Jake VanderPlas, an astrophysicist and data scientist from University of Washington. Join Jake and me while we discuss the state of Python in Astronomy.

Links from the show:

Jake on Twitter: @jakevdp

Jake on the web:

Python Data Science Handbook:

Python Data Science Handbook on GitHub:

Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data:

PyData Talk:

eScience Institute: @UWeScience

Large Synoptic Survey Telescope:

AstroML: Machine Learning and Data Mining for Astronomy:

Astropy project:

altair package:

If your social media feeds have been getting you down, rejoice! This interview with Jake VanderPlas covers Python, machine learning and astronomy.

Nary a mention of current social dysfunction around the globe!

Replace an hour of TV this weekend with this podcast. (Or more hours with others.)

Not only will you have more knowledge, you will be in much better emotional shape to face the coming week!

Watch your Python script with strace

Sunday, September 11th, 2016


Modern operating systems sandbox each process inside of a virtual memory map from which direct I/O operations are generally impossible. Instead, a process has to ask the operating system every time it wants to modify a file or communicate bytes over the network. By using operating system specific tools to watch the system calls a Python script is making — using “strace” under Linux or “truss” under Mac OS X — you can study how a program is behaving and address several different kinds of bugs.

Brandon Rhodes does a delightful presentation on using strace with Python.

Slides for Tracing Python with strace or truss.

I deeply enjoyed this presentation, which I discovered while looking at a Python regex issue.

I anticipate running strace on the Python script this week and will report back on any results or failure to obtain results! (Unlike in academic publishing, experiments and investigations do fail.)

Dark Web OSINT With Python Part Three: Visualization

Thursday, September 1st, 2016

Dark Web OSINT With Python Part Three: Visualization by Justin.

From the post:

Welcome back! In this series of blog posts we are wrapping the awesome OnionScan tool and then analyzing the data that falls out of it. If you haven’t read parts one and two in this series then you should go do that first. In this post we are going to analyze our data in a new light by visualizing how hidden services are linked together as well as how hidden services are linked to clearnet sites.

One of the awesome things that OnionScan does is look for links between hidden services and clearnet sites and makes these links available to us in the JSON output. Additionally it looks for IP address leaks or references to IP addresses that could be used for deanonymization.

We are going to extract these connections and create visualizations that will assist us in looking at interesting connections, popular hidden services with a high number of links and along the way learn some Python and how to use Gephi, a visualization tool. Let’s get started!

Justin tops off this great series on OnionScan by teaching the rudiments of using Gephi to visualize and explore the resulting data.

Can you map yourself from the Dark Web to a visible site?

If so, you aren’t hidden well enough.

A Whirlwind Tour of Python (Excellent!)

Tuesday, August 23rd, 2016

A Whirlwind Tour of Python by Jake VanderPlas.

From the webpage:

To tap into the power of Python’s open data science stack—including NumPy, Pandas, Matplotlib, Scikit-learn, and other tools—you first need to understand the syntax, semantics, and patterns of the Python language. This report provides a brief yet comprehensive introduction to Python for engineers, researchers, and data scientists who are already familiar with another programming language.

Author Jake VanderPlas, an interdisciplinary research director at the University of Washington, explains Python’s essential syntax and semantics, built-in data types and structures, function definitions, control flow statements, and more, using Python 3 syntax.

You’ll explore:

  • Python syntax basics and running Python code
  • Basic semantics of Python variables, objects, and operators
  • Built-in simple types and data structures
  • Control flow statements for executing code blocks conditionally
  • Methods for creating and using reusable functions
  • Iterators, list comprehensions, and generators
  • String manipulation and regular expressions
  • Python’s standard library and third-party modules
  • Python’s core data science tools
  • Recommended resources to help you learn more

Jake VanderPlas is a long-time user and developer of the Python scientific stack. He currently works as an interdisciplinary research director at the University of Washington, conducts his own astronomy research, and spends time advising and consulting with local scientists from a wide range of fields.

A Whirlwind Tour of Python can be recommended without reservation.

In addition to the book, the Jupyter notebooks behind the book have been posted.


29 common beginner Python errors on one page [Something Similar For XQuery?]

Friday, August 19th, 2016

29 common beginner Python errors on one page

From the webpage:

A few times a year, I have the job of teaching a bunch of people who have never written code before how to program from scratch. The nature of programming being what it is, the same error crop up every time in a very predictable pattern. I usually encourage my students to go through a step-by-step troubleshooting process when trying to fix misbehaving code, in which we go through these common errors one by one and see if they could be causing the problem. Today, I decided to finally write this troubleshooting process down and turn it into a flowchart in non-threatening colours.

Behold, the “my code isn’t working” step-by-step troubleshooting guide! Follow the arrows to find the likely cause of your problem – if the first thing you reach doesn’t work, then back up and try again.

Click the image for full-size, and click here for a printable PDF. Colour scheme from Luna Rosa.

Useful for Python beginners and should be inspirational for other languages.

Thoughts on something similar for XQuery Errors? Suggestions for collecting the “most common” XQuery errors?

Readable Regexes In Python?

Friday, August 19th, 2016

Doug Mahugh retweeted Raymond Hettinger tweeting:

#python tip: Complicated regexes can be organized into readable, commented chunks.

Twitter hasn’t gotten around to censoring Python related tweets for accuracy so I did check the reference:


This flag allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments. Whitespace within the pattern is ignored, except when in a character class or when preceded by an unescaped backslash. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.

This means that the two following regular expression objects that match a decimal number are functionally equal:
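A short sketch of such a pair, written with the re.X (re.VERBOSE) flag the passage describes. The variable names and the test strings used to compare them are my own.

```python
import re

# Verbose form: whitespace is ignored and each part carries a comment.
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)

# Compact form of the same pattern.
b = re.compile(r"\d+\.\d*")


def same_result(text):
    """Report whether both patterns agree on a full match of text."""
    return bool(a.fullmatch(text)) == bool(b.fullmatch(text))
```

Both objects accept strings such as "3.14" and "10." and reject "abc"; only the readability differs.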

Which is the better question?

Why would anyone want to produce a readable regex in Python?


Why would anyone NOT produce a readable regex given the opportunity?


PS: It occurs to me that with a search expression you could address such strings as subjects in a topic map. A more robust form of documentation than # syntax.

Grokking Deep Learning

Wednesday, August 17th, 2016

Grokking Deep Learning by Andrew W. Trask.

From the description:

Artificial Intelligence is the most exciting technology of the century, and Deep Learning is, quite literally, the “brain” behind the world’s smartest Artificial Intelligence systems out there. Loosely based on neuron behavior inside of human brains, these systems are rapidly catching up with the intelligence of their human creators, defeating the world champion Go player, achieving superhuman performance on video games, driving cars, translating languages, and sometimes even helping law enforcement fight crime. Deep Learning is a revolution that is changing every industry across the globe.

Grokking Deep Learning is the perfect place to begin your deep learning journey. Rather than just learn the “black box” API of some library or framework, you will actually understand how to build these algorithms completely from scratch. You will understand how Deep Learning is able to learn at levels greater than humans. You will be able to understand the “brain” behind state-of-the-art Artificial Intelligence. Furthermore, unlike other courses that assume advanced knowledge of Calculus and leverage complex mathematical notation, if you’re a Python hacker who passed high-school algebra, you’re ready to go. And at the end, you’ll even build an A.I. that will learn to defeat you in a classic Atari game.

In the Manning Early Access Program (MEAP) with three (3) chapters presently available.

A much more plausible undertaking than DARPA’s quest for “Explainable AI” or “XAI.” (DARPA WANTS ARTIFICIAL INTELLIGENCE TO EXPLAIN ITSELF) DARPA reasons that:

Potential applications for defense are endless—autonomous aerial and undersea war-fighting or surveillance, among others—but humans won’t make full use of AI until they trust it won’t fail, according to the Defense Advanced Research Projects Agency. A new DARPA effort aims to nurture communication between machines and humans by investing in AI that can explain itself as it works.

If non-failure is the criterion for trust, U.S. troops should refuse to leave their barracks in view of the repeated failures of military strategy since the end of WWII.

DARPA should choose a less stringent criterion for trusting an AI. However, failing less often than the Joint Chiefs of Staff may be too low a bar to set.


Wednesday, August 17th, 2016

Pandas by Reuven M. Lerner.

From the post:

Serious practitioners of data science use the full scientific method, starting with a question and a hypothesis, followed by an exploration of the data to determine whether the hypothesis holds up. But in many cases, such as when you aren’t quite sure what your data contains, it helps to perform some exploratory data analysis—just looking around, trying to see if you can find something.

And, that’s what I’m going to cover here, using tools provided by the amazing Python ecosystem for data science, sometimes known as the SciPy stack. It’s hard to overstate the number of people I’ve met in the past year or two who are learning Python specifically for data science needs. Back when I was analyzing data for my PhD dissertation, just two years ago, I was told that Python wasn’t yet mature enough to do the sorts of things I needed, and that I should use the R language instead. I do have to wonder whether the tables have turned by now; the number of contributors and contributions to the SciPy stack is phenomenal, making it a more compelling platform for data analysis.

In my article “Analyzing Data“, I described how to filter through logfiles, turning them into CSV files containing the information that was of interest. Here, I explain how to import that data into Pandas, which provides an additional layer of flexibility and will let you explore the data in all sorts of ways—including graphically. Although I won’t necessarily reach any amazing conclusions, you’ll at least see how you can import data into Pandas, slice and dice it in various ways, and then produce some basic plots.

Of course, scientific articles are written as though questions drop out of the sky and data is interrogated for the answer.

Aside from being rhetoric to badger others with, does anyone really think that is how science operates in fact?

Whether you have delusions about how science works in fact or not, you will find that Pandas will assist you in exploring data.

Dark Web OSINT with Python Part Two: … [Prizes For Unmasking Government Sites?]

Wednesday, August 10th, 2016

Dark Web OSINT with Python Part Two: SSH Keys and Shodan by Justin.

From the post:

Welcome back good Python soldiers. In Part One of this series we created a wrapper around OnionScan, a fantastic tool created by Sarah Jamie Lewis (@sarajamielewis). If you haven’t read Part One then go do so now. Now that you have a bunch of data (or you downloaded it from here) we want to do some analysis and further intelligence gathering with it. Here are a few objectives we are going to cover in the rest of the series.

  1. Attempt to discover clearnet servers that share SSH fingerprints with hidden services, using Shodan. As part of this we will also analyze whether the same SSH key is shared amongst hidden services.
  2. Map out connections between hidden services, clearnet sites and any IP address leaks.
  3. Discover clusters of sites that are similar based on their index pages, this can help find knockoffs or clones of “legitimate” sites. We’ll use a machine learning library called scikit-learn to achieve this.

The scripts that were created for this series are quick little one-offs, so there is some shared code between each script. Feel free to tighten this up into a function or a module you can import. The goal is to give you little chunks of code that will teach you some basics on how to begin analyzing some of the data and more importantly to give you some ideas on how you can use it for your own purposes.

In this post we are going to look at how to connect hidden services by their SSH public key fingerprints, as well as how to expand our intelligence gathering using Shodan. Let’s get started!
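The core of the fingerprint comparison can be sketched without Shodan: group scan results by SSH key fingerprint and keep only the fingerprints seen on more than one host. The field names `sshKey` and `hiddenService` are assumptions about the OnionScan JSON output, passed as parameters so they can be changed, and the function name is hypothetical.

```python
from collections import defaultdict


def group_by_fingerprint(scan_results,
                         key_field="sshKey",            # assumed field name
                         host_field="hiddenService"):   # assumed field name
    """Map each shared SSH fingerprint to the hosts that present it.

    scan_results is an iterable of dicts, one per scanned service.
    Fingerprints that appear on a single host are dropped.
    """
    groups = defaultdict(list)
    for result in scan_results:
        fingerprint = result.get(key_field)
        host = result.get(host_field)
        if fingerprint and host:
            groups[fingerprint].append(host)
    return {
        fingerprint: sorted(hosts)
        for fingerprint, hosts in groups.items()
        if len(hosts) > 1
    }
```

Any fingerprint left in the result is a candidate link between services that were presumably meant to look unrelated.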

Expand your Dark Web OSINT intell skills!

Being mindful that if you can discover your Dark Web site, so can others.

Anyone awarding Black Hat conference registrations for unmasking government sites on the Dark Web?

Pandas Exercises

Saturday, July 30th, 2016

Pandas Exercises

From the post:

Fed up with a ton of tutorials but no easy way to find exercises I decided to create a repo just with exercises to practice pandas. Don’t get me wrong, tutorials are great resources, but to learn is to do. So unless you practice you won’t learn.

There will be three different types of files:

  1. Exercise instructions
  2. Solutions without code
  3. Solutions with code and comments

My suggestion is that you learn a topic in a tutorial or video and then do exercises. Learn one more topic and do exercises. If you got the answer wrong, don’t go to the solution with code, follow this advice instead.

Suggestions and collaborations are more than welcome. 🙂

I’m sure you will find this useful but when I search for pandas exercise python, I get 298,000 “hits.”

Adding exercises here isn’t going to improve the findability of pandas for particular subject areas or domains.

Perhaps as exercises are added here, links to exercises by subject area can be added as well.

With nearly 300K potential sources, there is no shortage of exercises to go around!

Dark Web OSINT With Python and OnionScan: Part One

Saturday, July 30th, 2016

Dark Web OSINT With Python and OnionScan: Part One by Justin.

When you tire of what passes for political discussion on Twitter and/or Facebook this weekend, why not try your hand at something useful?

Like looking for data leaks on the Dark Web?

You could, in theory at least, notify the sites of their data leaks. 😉

One aspect of announced leaks that never ceases to amaze me is the reports that read:

Well, we pwned the (some string of letters) database and then notified them of the issue.

Before getting a copy of the entire database? What’s the point?

All you have accomplished is making another breach more difficult and demonstrating your ability to breach a system where the root password was most likely “god.”

Anyway, Justin gets you started on seeking data leaks on the Dark Web saying:

You may have heard of this awesome tool called OnionScan that is used to scan hidden services in the dark web looking for potential data leaks. Recently the project released some cool visualizations and a high level description of what their scanning results looked like. What they didn’t provide is how to actually go about scanning as much of the dark web as possible, and then how to produce those very cool visualizations that they show.

At a high level we need to do the following:

  1. Setup a server somewhere to host our scanner 24/7 because it takes some time to do the scanning work.
  2. Get TOR running on the server.
  3. Get OnionScan setup.
  4. Write some Python to handle the scanning and some of the other data management to deal with the scan results.
  5. Write some more Python to make some cool graphs. (Part Two of the series)

Let’s get started!
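
Step 4 above can be sketched roughly as follows. This is my own illustration, not the code from Justin's post: the scanner command is passed in as a list, and the demo uses a Python one-liner as a stand-in scanner, so no OnionScan flags are assumed here. The function names and the `.onion` addresses are hypothetical.

```python
import json
import subprocess
import sys


def scan_one(command, address, timeout=300):
    """Run command + [address]; return parsed JSON stdout, or None on any failure."""
    try:
        proc = subprocess.run(
            command + [address], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return None  # the scan hung; the caller can retry later
    if proc.returncode != 0:
        return None
    try:
        return json.loads(proc.stdout)
    except json.JSONDecodeError:
        return None


def scan_all(command, addresses, timeout=300):
    """Scan every address; return (results keyed by address, addresses to retry)."""
    results, retry = {}, []
    for address in addresses:
        report = scan_one(command, address, timeout)
        if report is None:
            retry.append(address)
        else:
            results[address] = report
    return results, retry


# Stand-in scanner: echoes the address back as JSON (not a real scanner).
FAKE_SCANNER = [
    sys.executable,
    "-c",
    "import json, sys; print(json.dumps({'hiddenService': sys.argv[1]}))",
]

results, retry = scan_all(FAKE_SCANNER, ["example1.onion", "example2.onion"], timeout=30)
print(len(results), "scanned,", len(retry), "to retry")
```

Keeping a retry list matters because hidden services drop on and off the network; a failed scan today says little about tomorrow.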

Very much looking forward to Part 2!


greek-accentuation 1.0.0 Released

Thursday, July 28th, 2016

greek-accentuation 1.0.0 Released by James Tauber.

From the post:

greek-accentuation has finally hit 1.0.0 with a couple more functions and a module layout change.

The library (which I’ve previously written about here) has been sitting on 0.9.9 for a while and I’ve been using it successfully in my inflectional morphology work for 18 months. There were, however, a couple of functions that lived in the inflectional morphology repos that really belonged in greek-accentuation. They have now been moved there.

If that sounds a tad obscure, some additional explanation from an earlier post by James:

It [greek-accentuation] consists of three modules:

  • characters
  • syllabify
  • accentuation

The characters module provides basic analysis and manipulation of Greek characters in terms of their Unicode diacritics as if decomposed. So you can use it to add, remove or test for breathing, accents, iota subscript or length diacritics.

The syllabify module provides basic analysis and manipulation of Greek syllables. It can syllabify words, give you the onset, nucleus, coda, rime or body of a syllable, judge syllable length or give you the accentuation class of word.

The accentuation module uses the other two modules to accentuate Ancient Greek words. As well as listing possible_accentuations for a given unaccented word, it can produce recessive and (given another form with an accent) persistent accentuations.
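
To make "in terms of their Unicode diacritics as if decomposed" concrete, here is a small sketch using only Python's standard `unicodedata` module. It is not the greek-accentuation API; the helper names below are my own, chosen for illustration.

```python
import unicodedata

SMOOTH_BREATHING = "\u0313"  # COMBINING COMMA ABOVE
ROUGH_BREATHING = "\u0314"   # COMBINING REVERSED COMMA ABOVE
ACUTE = "\u0301"             # COMBINING ACUTE ACCENT


def decompose(text):
    """Split precomposed characters into base letters plus combining marks (NFD)."""
    return unicodedata.normalize("NFD", text)


def strip_diacritics(text):
    """Remove every combining mark, leaving only the base letters."""
    base = "".join(ch for ch in decompose(text) if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", base)


def has_mark(text, mark):
    """Test whether a given combining mark occurs anywhere in the text."""
    return mark in decompose(text)


def add_mark(letter, mark):
    """Append a combining mark to a letter and recompose (NFC)."""
    return unicodedata.normalize("NFC", decompose(letter) + mark)


print(strip_diacritics("ἄνθρωπος"))
print(has_mark("ὁ", ROUGH_BREATHING))
print(add_mark("α", ACUTE))
```

Working on the decomposed form means a test for rough breathing is a simple membership check, whatever other marks the letter carries.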

Another name from my past and a welcome reminder that not all of computer science is focused on recommending ephemera for our consumption.

Volumetric Data Analysis – yt

Friday, June 17th, 2016

One of those rotating homepages:

Volumetric Data Analysis – yt

yt is a python package for analyzing and visualizing volumetric, multi-resolution data from astrophysical simulations, radio telescopes, and a burgeoning interdisciplinary community.

Quantitative Analysis and Visualization

yt is more than a visualization package: it is a tool to seamlessly handle simulation output files to make analysis simple. yt can easily knit together volumetric data to investigate phase-space distributions, averages, line integrals, streamline queries, region selection, halo finding, contour identification, surface extraction and more.

Many formats, one language

yt aims to provide a simple uniform way of handling volumetric data, regardless of where it is generated. yt currently supports FLASH, Enzo, Boxlib, Athena, arbitrary volumes, Gadget, Tipsy, ART, RAMSES and MOAB. If your data isn’t already supported, why not add it?

From the non-rotating part of the homepage:

To get started using yt to explore data, we provide resources including documentation, workshop material, and even a fully-executable quick start guide demonstrating many of yt’s capabilities.

But if you just want to dive in and start using yt, we have a long list of recipes demonstrating how to do various tasks in yt. We even have sample datasets from all of our supported codes on which you can test these recipes. While yt should just work with your data, here are some instructions on loading in datasets from our supported codes and formats.

Professional astronomical data and tools like yt put exploration of the universe at your fingertips!


Computer Programming for Lawyers:… [Educating a Future Generation of Judges]

Friday, May 6th, 2016

Computer Programming for Lawyers: An Introduction by Paul Ohm and Jonathan Frankle.

From the syllabus:

This class provides an introduction to computer programming for law students. The programming language taught may vary from year-to-year, but it will likely be a language designed to be both easy to learn and powerful, such as Python or JavaScript. There are no prerequisites, and even students without training in computer science or engineering should be able successfully to complete the class.

The course is based on the premise that computer programming has become a vital skill for non-technical professionals generally and for future lawyers and policymakers specifically. Lawyers, irrespective of specialty or type of practice, organize, evaluate, and manipulate large sets of text-based data (e.g. cases, statutes, regulations, contracts, etc.) Increasingly, lawyers are asked to deal with quantitative data and complex databases. Very simple programming techniques can expedite and simplify these tasks, yet these programming techniques tend to be poorly understood in legal practice and nearly absent in legal education. In this class, students will gain proficiency in various programming-related skills.

A secondary goal for the class is to introduce students to computer programming and computer scientific concepts they might encounter in the substantive practice of law. Students might discuss, for example, how programming concepts illuminate and influence current debates in privacy, intellectual property, consumer protection, antidiscrimination, antitrust, and criminal procedure.

The language for this year is Python. The course website does not have any problem sets posted yet. Be sure to check back for those.

Recommend this to any and all lawyers you encounter. It isn’t possible to predict who will or will not be a judge someday. Judges with a basic understanding of computing could improve the overall quality of decisions on computer technology.

Like discounting DOJ-spun D&D tales about juvenile behavior.

Hello World – Machine Learning Recipes #1

Saturday, April 16th, 2016

Hello World – Machine Learning Recipes #1 by Josh Gordon.

From the description:

Six lines of Python is all it takes to write your first machine learning program! In this episode, we’ll briefly introduce what machine learning is and why it’s important. Then, we’ll follow a recipe for supervised learning (a technique to create a classifier from examples) and code it up.

The first in a promised series on machine learning using scikit learn and TensorFlow.

The quality of video that you wish were available for intermediate and advanced treatments.

Quite a treat! Pass it on to anyone interested in machine learning.
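
If you want a feel for the supervised learning recipe (collect labeled examples, train a classifier, predict) before watching, here is a sketch in plain Python. It is not the code from the video, which uses scikit-learn. I have swapped in a one-rule "decision stump" so the example runs with no installs, and the fruit data is a toy set made up for illustration.

```python
def train_stump(features, labels):
    """Find the single feature/threshold rule with the fewest training errors.

    Assumes binary labels 0/1 and at least one feature with two distinct values.
    Returns (feature_index, threshold, label_if_at_or_below, label_if_above).
    """
    best = None
    for j in range(len(features[0])):
        values = sorted({row[j] for row in features})
        # Candidate thresholds sit halfway between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            for below, above in ((0, 1), (1, 0)):
                errors = sum(
                    (below if row[j] <= threshold else above) != label
                    for row, label in zip(features, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, threshold, below, above)
    return best[1:]


def predict(stump, row):
    """Apply the learned rule to one example."""
    j, threshold, below, above = stump
    return below if row[j] <= threshold else above


# Toy examples: [weight in grams, texture (1 = smooth, 0 = bumpy)].
features = [[140, 1], [130, 1], [150, 0], [170, 0]]
labels = [0, 0, 1, 1]  # 0 = apple, 1 = orange

stump = train_stump(features, labels)
print(predict(stump, [160, 0]))
```

The point of the recipe is that the rule comes from the examples: change the examples and the rule changes with them, without touching the code.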


PySparNN [nearest neighbors in sparse, high dimensional spaces (like text documents).]

Thursday, April 7th, 2016


From the post:

Approximate Nearest Neighbor Search for Sparse Data in Python! This library is well suited to finding nearest neighbors in sparse, high dimensional spaces (like text documents).

Out of the box, PySparNN supports Cosine Distance (i.e. 1 – cosine_similarity).

PySparNN benefits:

  • Designed to be efficient on sparse data (memory & cpu).
  • Implemented leveraging existing python libraries (scipy & numpy).
  • Easily extended with other metrics: Manhattan, Euclidean, Jaccard, etc.
  • Work in progress – Min, Max distance thresholds can be set at query time (not index time). Example: return the k closest items on the interval [0.8, 0.9] from a query point.

If your data is NOT SPARSE – please consider annoy. Annoy uses a similar-ish method and I am a big fan of it. As of this writing, annoy performs ~8x faster on their introductory example.
General rule of thumb – annoy performs better if you can get your data to fit into memory (as a dense vector).

The most comparable library to PySparNN is scikit-learn’s LSHForest module. As of this writing, PySparNN is ~1.5x faster on the 20newsgroups dataset. A more robust benchmarking on sparse data is desired. Here is the comparison.

I included the text snippet in the title because PySparNN isn’t clueful, at least not at first glance.
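
For anyone equally unclued, a sketch of what "cosine distance on sparse data" means, in plain Python. This is not PySparNN's API, and it is an exact brute-force search, not the approximate search the library performs. Documents are stored as sparse `{word: count}` mappings; the function names and sample sentences are my own.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two sparse {term: weight} vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, docs, k=1):
    """Return the names of the k documents closest to the query (exact search)."""
    return sorted(docs, key=lambda name: cosine_distance(query, docs[name]))[:k]


docs = {
    "a": Counter("the cat sat on the mat".split()),
    "b": Counter("stock markets fell sharply".split()),
    "c": Counter("the dog sat on the log".split()),
}
query = Counter("cat on a mat".split())

print(nearest(query, docs, k=2))
```

Only the words two documents share contribute to the dot product, which is why sparse storage pays off for text: most words never appear in most documents.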

I looked for a good explanation of nearest neighbors and encountered this lecture by Patrick Winston (MIT OpenCourseWare):

The lecture has a number of gems, including the observation that:

Town and Country readers tend to be social parasites.

Observations on text and nearest neighbors, time marks 17:30 – 24:17.

You should make an effort to watch the entire video. You will have a broader appreciation for the sheer power of nearest neighbor analysis and as a bonus, some valuable insights on why going without sleep is a very bad idea.

I first saw this in a tweet by Lynn Cherny.

Advanced Data Mining with Weka – Starts 25 April 2016

Wednesday, April 6th, 2016

Advanced Data Mining with Weka by Ian Witten.

From the webpage:

This course follows on from Data Mining with Weka and More Data Mining with Weka. It provides a deeper account of specialized data mining tools and techniques. Again the emphasis is on principles and practical data mining using Weka, rather than mathematical theory or advanced details of particular algorithms. Students will analyse time series data, mine data streams, use Weka to access other data mining packages including the popular R statistical computing language, script Weka in Python, and deploy it within a cluster computing framework. The course also includes case studies of applications such as classifying tweets, functional MRI data, image classification, and signal peptide prediction.

The syllabus:

Advanced Data Mining with Weka is open for enrollment and starts 25 April 2016.

Five very intense weeks await!

Will you be there?

I first saw this in a tweet by Alyona Medelyan.

Python Code + Data + Visualization (Little to No Prose)

Tuesday, April 5th, 2016

Up and Down the Python Data and Web Visualization Stack

Using the “USGS dataset listing every wind turbine in the United States,” this notebook walks you through data analysis and visualization with only code and visualizations.

That’s it.

Aside from very few comments, there is no prose in this notebook at all.

You will either hate it or be rushing off to do a similar notebook on a topic of interest to you.

Looking forward to seeing the results of those choices!