Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 10, 2018

PyCoder’s Weekly Archive 2012-2018 [Indexing Data Set?]

Filed under: Indexing,Python,Search Engines,Searching — Patrick Durusau @ 8:53 pm

PyCoder’s Weekly Archive 2012-2018

Python programmers already know about PyCoder’s Weekly, but if you don’t: it’s a weekly newsletter with headline Python news, discussions, Python jobs, articles & tutorials, projects & code, and events. Yeah, every week!

I mention it here as a potential indexing data set for search software. My reasoning: you are more likely to devote effort to indexing material of interest than out-of-copyright newspapers. Besides, you will be better able to judge a good search result from a bad one when indexing PyCoder’s Weekly.
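If you want to experiment, here is a minimal sketch of that kind of personal index, assuming the whoosh library and a hypothetical issues/ directory of saved newsletter text; it is an illustration of the idea, not a recommendation of any particular engine:

import glob
import os

from whoosh.fields import ID, TEXT, Schema
from whoosh.index import create_in
from whoosh.qparser import QueryParser

# Create the index directory if it doesn't exist yet.
if not os.path.exists("pycoders_index"):
    os.makedirs("pycoders_index")

schema = Schema(path=ID(stored=True), body=TEXT)
ix = create_in("pycoders_index", schema)

# Add each saved issue (hypothetical issues/*.txt layout) as one document.
writer = ix.writer()
for path in glob.glob("issues/*.txt"):
    with open(path) as f:
        writer.add_document(path=path, body=f.read())
writer.commit()

# Query the index and judge the results yourself.
with ix.searcher() as searcher:
    query = QueryParser("body", ix.schema).parse("asyncio")
    for hit in searcher.search(query):
        print(hit["path"])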

Enjoy!

September 26, 2018

pandas: powerful Python data analysis toolkit & Data Skepticism

Filed under: Pandas,Python,Skepticism — Patrick Durusau @ 12:52 pm

pandas: powerful Python data analysis toolkit

From the webpage:

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the
fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

pandas is well suited for many different kinds of data:

  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time series data.
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
  • Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

[if you need more enticement]

Here are just a few of the things that pandas does well:

  • Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
  • Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes (possible to have multiple labels per tick)
  • Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
  • Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
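A minimal sketch (mine, not from the pandas documentation) touching three of those features, missing data, group by, and aggregation:

import numpy as np
import pandas as pd

# Missing data (NaN) handling plus split-apply-combine in a few lines.
df = pd.DataFrame({
    "precinct": ["A", "A", "B", "B"],
    "votes": [120, np.nan, 90, 60],
})

df["votes"] = df["votes"].fillna(df["votes"].mean())  # fill the missing value
print(df.groupby("precinct")["votes"].sum())          # group by, then aggregate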

I need to spend more time with pandas but have to confess that meta-issues with data interest me more than “alleged” data distributed by governments, corporations and others.

I say “alleged” data because unless you know the means by which it was collected, the criteria for that collection, what was available but excluded from collection, plus a host of other questions about any data set, about all you know is that X claims the “alleged” data means “something.”

The “something” claimed for data varies with who is reporting it and what purpose they have in telling you. I immediately discount explanations that involve my or the public’s benefit. No, rather say the data was released in hopes that I or the public would see it as a benefit; that’s a bit closer to the truth.

All that said, there are any number of interesting ways that processing data shades it, so a deep appreciation for pandas will help you spot those tricks as well.

PS: I don’t mean to contend we can ever be bias-free, but I do think we can aspire to expose the biases of others.

I first saw this in a tweet by Kirk Borne.

August 22, 2018

Data and the Midterm Elections:… [Enigma contest, swag prizes, September 21 deadline]

Filed under: Data Science,Government,Python — Patrick Durusau @ 4:44 pm

Data and the Midterm Elections: Enigma Public Call for Submissions

Calling all public data enthusiasts! To celebrate the launch of Enigma Public’s Python SDK, Enigma is hosting a contest for projects – ranging from data science to data visualization, data journalism and more – featuring Enigma’s public data in exploration of the upcoming U.S. elections.

We are excited to incentivize the creation of data-driven projects, exploring the critical U.S. midterm elections this fall. In this turbulent and confusing period in U.S. politics, data can help us interpret and understand both the news we’re reading and changes we’re seeing.

One of the suggested ideas:

Census Bureau data on voter registration by demographic category.

shows that Lakoff’s point about Clinton losing educated women around Philadelphia, “her” demographic, has failed to register with political types.

Let me say it in bold type: Demographics are not a reliable indicator of voting behavior.

Twice? Demographics are not a reliable indicator of voting behavior.

Demographics are easy to gather. Demographics are easy to analyze. But easy to gather and analyze does not equal useful in planning campaign strategy.

Here’s an idea: Don’t waste money on traditional demographics, voting patterns, etc., but enlist vendors who market to those voting populations to learn what they focus on for their products.

There’s no silver bullet, but repeating the mistakes of the past is a step towards repeating the failures of the past. (How would you like to be known as the only candidate for president beaten by a WWF promoter? That’s got to sting.)

May 6, 2018

Natural Language Toolkit (NLTK) 3.3 Drops!

Filed under: Linguistics,Natural Language Processing,NLTK,Python — Patrick Durusau @ 7:52 pm

Natural Language Toolkit (NLTK) 3.3 has arrived!

From NLTK News:

NLTK 3.3 release: May 2018

Support Python 3.6, New interface to CoreNLP, Support synset retrieval by sense key, Minor fixes to CoNLL Corpus Reader, AlignedSent, Fixed minor inconsistencies in APIs and API documentation, Better conformance to PEP8, Drop Moses Tokenizer (incompatible license)

Whether or not you harbor pre-Derrida illusions (fantasies about propaganda turning voters into robots, or the belief that “persuasion” is a matter of “facts”), the NLTK is a must-have weapon in such debates.
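If you haven’t used it before, a minimal tokenize-and-tag sketch (nothing specific to the 3.3 release) looks like this:

import nltk

# One-time corpus/model downloads:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
sentence = "Persuasion is rarely just a matter of facts."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))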

Enjoy!

March 6, 2018

Numba Versus C++ – On Wolfram CAs

Filed under: C/C++,Cellular Automata,Programming,Python — Patrick Durusau @ 7:49 pm

Numba Versus C++ by David Butts, Gautham Dharuman, Bill Punch and Michael S. Murillo.

Python is a programming language that first appeared in 1991; soon, it will have its 27th birthday. Python was created not as a fast scientific language, but rather as a general-purpose language. You can use Python as a simple scripting language or as an object-oriented language or as a functional language…and beyond; it is very flexible. Today, it is used across an extremely wide range of disciplines and is used by many companies. As such, it has an enormous number of libraries and conferences that attract thousands of people every year.

But, Python is an interpreted language, so it is very slow. Just how slow? It depends, but you can count on about 10-100 times as slow as, say, C/C++. If you want fast code, the general rule is: don’t use Python. However, a few more moments of thought lead to a more nuanced perspective. What if you spend most of the time coding, and little time actually running the code? Perhaps your familiarity with the (slow) language, or its vast set of libraries, actually saves you time overall? And, what if you learned a few tricks that made your Python code itself a bit faster? Maybe that is enough for your needs? In the end, for true high performance computing applications, you will want to explore fast languages like C++; but, not all of our needs fall into that category.

As another example, consider the fact that many applications use two languages, one for the core code and one for the wrapper code; this allows for a smoother interface between the user and the core code. A common use case is C or C++ wrapped by, of course, Python. As a user, you may not even know that the code you are using is in another language! Such a situation is referred to as the “two-language problem”. This situation is great provided you don’t need to work in the core code, or you don’t mind working in two languages – some people don’t mind, but some do. The question then arises: if you are one of those people who would like to work only in the wrapper language, because it was chosen for its user friendliness, what options are available to make that language (Python in this example) fast enough that it can also be used for the core code?

We wanted to explore these ideas a bit further by writing a code in both Python and C++. Our past experience suggested that while Python is very slow, it could be made about as fast as C using the crazily-simple-to-use library Numba. Our basic comparisons here are: basic Python, Numba and C++. Because we are not religious about Python, and you shouldn’t be either, we invited expert C++ programmers to have the chance to speed up the C++ as much as they could (and, boy could they!).

This webpage is highly annoying, in both Mozilla and Chrome. You’ll have to visit to get the full impact.

It is, however, also a great post on using Numba to obtain much faster results while still using Python. The use of Wolfram CAs (cellular automata) as examples is an added bonus.
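The pattern the authors benchmark is simple to try yourself: write the hot loop in plain Python/NumPy and decorate it with Numba’s @njit. Here is a sketch of a Wolfram rule 30 automaton (my illustration, not the code from the post):

import numpy as np
from numba import njit

@njit
def rule30(width, steps):
    # One row per time step; cell i at t+1 depends on cells i-1, i, i+1 at t.
    grid = np.zeros((steps, width), dtype=np.uint8)
    grid[0, width // 2] = 1
    for t in range(steps - 1):
        for i in range(1, width - 1):
            left, center, right = grid[t, i - 1], grid[t, i], grid[t, i + 1]
            grid[t + 1, i] = left ^ (center | right)  # Wolfram rule 30
    return grid

print(rule30(101, 50).sum())  # first call includes JIT compilation time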

Enjoy!

February 12, 2018

Reducing the Emotional Toll of Debating Bigots, Fascists and Misogynists

Filed under: Keras,Politics,Python,TensorFlow — Patrick Durusau @ 5:08 pm

Victims of bigots, fascists and misogynists on social media can (and many have) recounted the emotional toll of engaging with them.

How would you like to reduce your emotional toll and consume minutes if not hours of their time?

I thought you might be interested. 😉

Follow the link to DeepPavlov. (Ignore the irony of the name considering the use case I’m outlining.)

From the webpage:

An open source library for building end-to-end dialog systems and training chatbots.

We are in a really early Alfa release. You have to be ready for hard adventures.

An open-source conversational AI library, built on TensorFlow and Keras, and designed for

  • NLP and dialog systems research
  • implementation and evaluation of complex conversational systems

Our goal is to provide researchers with:

  • a framework for implementing and testing their own dialog models with subsequent sharing of that models
  • set of predefined NLP models / dialog system components (ML/DL/Rule-based) and pipeline templates
  • benchmarking environment for conversational models and systematized access to relevant datasets

and AI-application developers with:

  • framework for building conversational software
  • tools for application integration with adjacent infrastructure (messengers, helpdesk software etc.)

… (emphasis in the original)

Only one component for a social media engagement bot to debate bigots, fascists and misogynists but a very important one. A trained AI can take the emotional strain off of victims/users and at least in some cases, inflict that toll on your opponents.

For OpSec reasons, don’t announce the accounts used by such an AI backed system.

PS: AI ethics debaters. This use of an AI isn’t a meaningful interchange of ideas online. My goals are: reduce the emotional toll on victims, waste the time of their attackers. Disclosing you aren’t hurting someone on the other side (the bot) isn’t a requirement in my view.

February 6, 2018

What the f*ck Python! 🐍

Filed under: Programming,Python — Patrick Durusau @ 8:32 pm

What the f*ck Python! 🐍

From the post:

Python, being a beautifully designed high-level and interpreter-based programming language, provides us with many features for the programmer’s comfort. But sometimes, the outcomes of a Python snippet may not seem obvious to a regular user at first sight.

Here is a fun project to collect such tricky & counter-intuitive examples and lesser-known features in Python, attempting to discuss what exactly is happening under the hood!

While some of the examples you see below may not be WTFs in the truest sense, but they’ll reveal some of the interesting parts of Python that you might be unaware of. I find it a nice way to learn the internals of a programming language, and I think you’ll find them interesting as well!

If you’re an experienced Python programmer, you can take it as a challenge to get most of them right in first attempt. You may be already familiar with some of these examples, and I might be able to revive sweet old memories of yours being bitten by these gotchas 😅

If you’re a returning reader, you can learn about the new modifications here.

So, here we go…

What better way to learn than being really pissed off that your code isn’t working? Or isn’t working as expected.

😉

This looks like a real hoot! Too late today to do much with it but I’ll be returning to it.
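For a taste of the genre, here is one classic Python gotcha of the same flavor (a well-known example, not necessarily drawn from the project’s list): default argument values are evaluated once, at function definition time.

def append_item(item, bucket=[]):  # the default list is created once, at def time
    bucket.append(item)
    return bucket

print(append_item(1))  # [1]
print(append_item(2))  # [1, 2]  <- the same list as the first call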

Enjoy!

January 31, 2018

Python’s One Hundred and Thirty-Nine Week Lectionary Cycle

Filed under: Programming,Python — Patrick Durusau @ 7:41 pm

Python 3 Module of the Week by Doug Hellmann

From the webpage:

PyMOTW-3 is a series of articles written by Doug Hellmann to demonstrate how to use the modules of the Python 3 standard library….

Hellmann documents one hundred and thirty-nine (139) modules in the Python standard library.

How many of them can you name?

To improve your score, use Hellmann’s list as a one hundred and thirty-nine (139) week lectionary cycle on Python.

Some modules may take less than a week, but some, re — Regular Expressions, will take more than a week.

Even if you don’t finish a longer module, push on after two weeks so you can keep that feeling of progress and encountering new material.

January 12, 2018

Getting Started with Python/CLTK for Historical Languages

Filed under: Classics,Language,Python — Patrick Durusau @ 2:03 pm

Getting Started with Python/CLTK for Historical Languages by Patrick J. Burns.

From the post:

This is a ongoing project to collect online resources for anybody looking to get started with working with Python for historical languages, esp. using the Classical Language Toolkit. If you have suggestions for this lists, email me at patrick[at]diyclassics[dot]org.

What classic or historical language resources would you recommend?

December 14, 2017

Twitter Bot Template – If You Can Avoid Twitter Censors

Filed under: Bots,Python,Twitter — Patrick Durusau @ 11:04 am

Twitter Bot Template

From the webpage:

Boilerplate for creating simple, non-interactive twitter bots that post periodically. My comparisons bot, @botaphor, is an example of how I use this template in practice.

This is intended for coders familiar with Python and bash.

If you can avoid Twitter censors (new rules, erratically enforced, a regular “feature”), then this Twitter bot template may interest you.
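The template itself lives on GitHub; the core of any simple, periodic, non-interactive bot (sketched here with the tweepy library rather than the template’s own helpers, and with placeholder credentials) boils down to a few lines you run from cron:

import tweepy

# Credentials come from your Twitter app settings; placeholders here.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)

# Whatever your bot generates goes here; cron supplies the "periodically."
api.update_status("Scheduled, non-interactive post goes here.")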

Make tweet filtering a commercial opportunity and Twitter could drop tweet censorship, a cost with no profit center.

Unlikely because policing other people is such a power turn-on.

Still, this is the season for wishes.

November 14, 2017

pynlp – Pythonic Wrapper for Stanford CoreNLP [& Rand Paul]

Filed under: Natural Language Processing,Python,Stanford NLP — Patrick Durusau @ 4:36 pm

pynlp – Pythonic Wrapper for Stanford CoreNLP by Sina.

The example text for this wrapper:

text = (
    'GOP Sen. Rand Paul was assaulted in his home in Bowling Green, Kentucky, on Friday, '
    'according to Kentucky State Police. State troopers responded to a call to the senator\'s '
    'residence at 3:21 p.m. Friday. Police arrested a man named Rene Albert Boucher, who they '
    'allege "intentionally assaulted" Paul, causing him "minor injury. Boucher, 59, of Bowling '
    'Green was charged with one count of fourth-degree assault. As of Saturday afternoon, he '
    'was being held in the Warren County Regional Jail on a $5,000 bond.')

[Warning: Reformatted for readability. See the Github page for the text]

Nice to see examples using contemporary texts. Any of the recent sexual abuse apologies or non-apologies would work as well.

Enjoy!

November 12, 2017

Scipy Lecture Notes

Filed under: Programming,Python,Scientific Computing — Patrick Durusau @ 9:10 pm

Scipy Lecture Notes edited by Gaël Varoquaux, Emmanuelle Gouillart, Olav Vahtras.

From the webpage:

Tutorials on the scientific Python ecosystem: a quick introduction to central tools and techniques. The different chapters each correspond to a 1 to 2 hours course with increasing level of expertise, from beginner to expert.

In PDF format, some six-hundred and fifty-seven pages of top quality material on Scipy.

In addition to the main editors, there are fourteen chapter editors and seventy-three contributors.

Good documentation needs maintenance, so if you have improvements or examples to offer, perhaps your name will appear here in the not too distant future.

Enjoy!

October 26, 2017

SciPy 1.0.0! [Awaiting Your Commands]

Filed under: Programming,Python — Patrick Durusau @ 10:50 am

SciPy 1.0.0

From the webpage:

We are extremely pleased to announce the release of SciPy 1.0, 16 years after version 0.1 saw the light of day. It has been a long, productive journey to get here, and we anticipate many more exciting new features and releases in the future.

Why 1.0 now?

A version number should reflect the maturity of a project – and SciPy was a mature and stable library that is heavily used in production settings for a long time already. From that perspective, the 1.0 version number is long overdue.

Some key project goals, both technical (e.g. Windows wheels and continuous integration) and organisational (a governance structure, code of conduct and a roadmap), have been achieved recently.

Many of us are a bit perfectionist, and therefore are reluctant to call something “1.0” because it may imply that it’s “finished” or “we are 100% happy with it”. This is normal for many open source projects, however that doesn’t make it right. We acknowledge to ourselves that it’s not perfect, and there are some dusty corners left (that will probably always be the case). Despite that, SciPy is extremely useful to its users, on average has high quality code and documentation, and gives the stability and backwards compatibility guarantees that a 1.0 label imply.

In case your hands are trembling too much to type in the URLs:

SciPy.org

SciPy Cookbook

Scipy 1.0.0 Reference Guide, [HTML+zip], [PDF]

Like most tools, it isn’t weaponized until you apply it to data.
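For instance, a two-minute taste of applying it to data (a sketch, nothing from the release notes): fit a decaying exponential with scipy.optimize and run a normality test with scipy.stats.

import numpy as np
from scipy import optimize, stats

x = np.linspace(0, 4, 50)
y = 2.5 * np.exp(-1.3 * x) + 0.1 * np.random.randn(50)  # noisy synthetic data

# Fit a, b in a * exp(-b * x) to the noisy samples.
popt, _ = optimize.curve_fit(lambda t, a, b: a * np.exp(-b * t), x, y)
print("fitted a, b:", popt)

# And a quick normality test on a fresh sample.
print(stats.normaltest(np.random.randn(200)))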

Enjoy!

PS: If you want to get ahead of a co-worker, give them this URL: http://planet.scipy.org/. Don’t look, it’s a blog feed for SciPy. Sorry, you looked, didn’t you?

August 10, 2017

Why Astronomers Love Python And Why You Should Too (Search Woes)

Filed under: Astroinformatics,Python — Patrick Durusau @ 3:27 pm

https://www.youtube.com/watch?v=W9dwGZ6yY0k

From the description:

The Python programming language is a widely used tool for basic and advanced research in Astronomy. Watch this amazing presentation to learn specifics of using Python by astronomers. (Jake Vanderplas, speaker)

The only downside to the presentation is Vanderplas mentions software being on Github, but doesn’t supply the URLs.

For example, if you go to Github and search for “Large Synoptic Survey Telescope” you get two (2) results:

Both “hits” are relevant but what did we miss?

Try searching for LSSTC.

There are twelve (12) “hits” with the first one being highly relevant and completely missed by the prior search.
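You can reproduce the comparison programmatically with the GitHub search API (unauthenticated, so rate-limited; a sketch, not part of the presentation):

import requests

for q in ('"Large Synoptic Survey Telescope"', "LSSTC"):
    r = requests.get("https://api.github.com/search/repositories",
                     params={"q": q})
    items = r.json().get("items", [])
    print(q, "->", len(items), "repositories:",
          [repo["full_name"] for repo in items[:5]])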

Two lessons here:

  1. Search is a lossy way to navigate Github.
  2. Do NOT wave your hands in the direction of Github for software. Give URLs.

Links from above:

bho4/LSST Placeholder, no content.

LSSTC-DSFP-Sessions

Lecture slides, Jupyter notebooks, and other material from the LSSTC Data Science Fellowship Program

smonkewitz/scisql

Science-specific tools and extensions for SQL. Currently the project contains user defined functions (UDFs) for MySQL including spatial geometry, astronomy specific functions and mathematical functions. The project was motivated by the needs of the Large Synoptic Survey Telescope (LSST).

July 11, 2017

Graphing the distribution of English letters towards…

Filed under: Language,Linguistics,Python — Patrick Durusau @ 9:05 pm

Graphing the distribution of English letters towards the beginning, middle or end of words by David Taylor.

From the post:

(partial image)

Some data visualizations tell you something you never knew. Others tell you things you knew, but didn’t know you knew. This was the case for this visualization.

Many choices had to be made to visually present this essentially semi-quantitative data (how do you compare a 3- and a 13-letter word?). I semi-exhaustively explain everything on my other, geekier blog, prooffreaderplus, and provide the code I used; I’ll just repeat the most crucial here:

The counts here were generated from Brown corpus, which is composed of texts printed in 1961.
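Taylor’s code is on prooffreaderplus; a back-of-the-envelope sketch of the underlying computation, using NLTK’s copy of the Brown corpus, might look like this:

from nltk.corpus import brown  # nltk.download('brown') the first time

# For each letter, record where it falls in a word as a fraction of word length.
positions = {}
for word in brown.words():
    word = word.lower()
    if not word.isalpha() or len(word) < 2:
        continue
    for i, letter in enumerate(word):
        positions.setdefault(letter, []).append(i / (len(word) - 1))

for letter in "aeq":
    vals = positions[letter]
    print(letter, round(sum(vals) / len(vals), 3))  # mean relative position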

Take Taylor’s post as an inducement to read both Prooffreader Plus and Prooffreader on a regular basis.

June 12, 2017

FreeDiscovery

Filed under: Python,Scikit-Learn,Search Engines — Patrick Durusau @ 4:30 pm

FreeDiscovery: Open Source e-Discovery and Information Retrieval Engine

From the webpage:

FreeDiscovery is built on top of existing machine learning libraries (scikit-learn) and provides a REST API for information retrieval applications. It aims to benefit existing e-Discovery and information retrieval platforms with a focus on text categorization, semantic search, document clustering, duplicates detection and e-mail threading.

In addition, FreeDiscovery can be used as Python package and exposes several estimators with a scikit-learn compatible API.

Python 3.5+ required.

Homepage has command line examples, with a pointer to: http://freediscovery.io/doc/stable/examples/ for more examples.

The additional examples use a subset of the TREC 2009 legal collection. Cool!

I saw this in a tweet by Lynn Cherny today.

Enjoy!

May 19, 2017

Python for Data Journalists: Analyzing Money in Politics

Filed under: Journalism,News,Politics,Python,Reporting — Patrick Durusau @ 4:33 pm

Python for Data Journalists: Analyzing Money in Politics by Knight Center.

From the webpage:

Data journalists are the newest rock stars of the newsroom. Using computer programming and data journalism techniques, they have the power to cull through big data to find original and important stories.

Learn these techniques and some savvy computer programming to produce your own bombshell investigations in the latest massive open online course (MOOC) from the Knight Center, “Python for Data Journalists: Analyzing Money in Politics.”

Instructor Ben Welsh, editor of the Los Angeles Times Data Desk and co-founder of the California Civic Data Coalition, will show students how to turn big data into great journalism with speed and veracity. The course takes place from June 12 to July 9, 2017, so register now.

A high priority for your summer because:

  1. You will learn techniques for data analysis
  2. Learning #1 enables you to perform data analysis
  3. Learning #1 enables you to better question data analysis

I skimmed the post and did not see any coverage of obtaining concealed information.

Perhaps that will be the subject of a wholly anonymous MOOC. 😉

Do register! This looks useful and fun!

PS: Developing a relationship with a credit bureau or bank staffer should be an early career goal. No one is capable of obtaining “extra” money and just sitting on it forever.

April 6, 2017

Web Scraping Reference: …

Filed under: Python,Web Scrapers — Patrick Durusau @ 1:15 pm

Web Scraping Reference: A Simple Cheat Sheet for Web Scraping with Python by Hartley Brody.

From the post:

Once you’ve put together enough web scrapers, you start to feel like you can do it in your sleep. I’ve probably built hundreds of scrapers over the years for my own projects, as well as for clients and students in my web scraping course.

Occasionally though, I find myself referencing documentation or re-reading old code looking for snippets I can reuse. One of the students in my course suggested I put together a “cheat sheet” of commonly used code snippets and patterns for easy reference.

I decided to publish it publicly as well – as an organized set of easy-to-reference notes – in case they’re helpful to others.

Brody uses Beautiful Soup, a Python library that will parse even the worst formed HTML.

I mention this so I will remember, the next time I scrape Wikileaks, that instead of downloading, repairing with Tidy, and then parsing with Saxon/XQuery, there are easier ways to do the job!
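The shape of most such scrapers, and of the snippets in Brody’s cheat sheet, is simply this (stand-in URL; my sketch rather than his):

import requests
from bs4 import BeautifulSoup

# Fetch a page and pull out every link and its text, however messy the HTML.
response = requests.get("https://example.com/")
soup = BeautifulSoup(response.text, "html.parser")

for a in soup.find_all("a"):
    print(a.get("href"), a.get_text(strip=True))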

Enjoy!

December 21, 2016

Mining Twitter Data with Python [Trump Years Ahead]

Filed under: Data Mining,Python,Twitter — Patrick Durusau @ 5:24 pm

Marco Bonzanini, author of Mastering Social Media Mining with Python, has a seven part series of posts on mining Twitter with Python.

If you haven’t been mining Twitter before now, President-elect Donald Trump is about to change all that.

What if Trump continues to tweet as President and authorizes his appointees to do the same? Spontaneity isn’t the same thing as openness but it could prove to be interesting.

November 19, 2016

How to get superior text processing in Python with Pynini

Filed under: FSTs,Journalism,News,Python,Reporting,Text Mining — Patrick Durusau @ 9:35 pm

How to get superior text processing in Python with Pynini by Kyle Gorman and Richard Sproat.

From the post:

It’s hard to beat regular expressions for basic string processing. But for many problems, including some deceptively simple ones, we can get better performance with finite-state transducers (or FSTs). FSTs are simply state machines which, as the name suggests, have a finite number of states. But before we talk about all the things you can do with FSTs, from fast text annotation—with none of the catastrophic worst-case behavior of regular expressions—to simple natural language generation, or even speech recognition, let’s explore what a state machine is, what they have to do with regular expressions.

Reporters, researchers and others will face a 2017 where the rate of information has increased, along with noise from media spasms over the latest taunt from president-elect Trump.

Robust text mining/filtering will be among your daily necessities, if it isn’t already.

Tagging text is the first example. Think about auto-generating graphs from emails with “to:,” “from:,” “date:,” and key terms in the email. Tagging the key terms is essential to that process.

Once tagged, you can slice and dice the text as more information is uncovered.
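Not Pynini code, but a sketch of the plumbing for that email-graph idea, using the standard library’s email module (the key-term tagging step would hang further labels on each edge):

import email
import glob

# Build (from, to, date) edges from a directory of .eml files.
edges = []
for name in glob.glob("*.eml"):
    with open(name) as f:
        msg = email.message_from_file(f)
    sender = msg.get("from")
    for recipient in (msg.get_all("to") or []):
        edges.append((sender, recipient, msg.get("date")))

for edge in edges[:10]:
    print(edge)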

Interested?

Python Data Science Handbook

Filed under: Data Science,Programming,Python — Patrick Durusau @ 5:27 pm

Python Data Science Handbook (Github)

From the webpage:

Jupyter notebook content for my O’Reilly book, the Python Data Science Handbook.


See also the free companion project, A Whirlwind Tour of Python: a fast-paced introduction to the Python language aimed at researchers and scientists.

This repository will contain the full listing of IPython notebooks used to create the book, including all text and code. I am currently editing these, and will post them as I make my way through. See the content here:

Enjoy!

October 31, 2016

Parsing Emails With Python, A Quick Tip

Filed under: Data Mining,Email,Python — Patrick Durusau @ 1:32 pm

While some stuff runs in the background, a quick tip on parsing email with Python.

I got the following error message from Python:

Traceback (most recent call last):
  File "test-clinton-script-31Oct2016.py", line 20, in <module>
    date = dateutil.parser.parse(msg['date'])
  File "/usr/lib/python2.7/dist-packages/dateutil/parser.py", line 697, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/usr/lib/python2.7/dist-packages/dateutil/parser.py", line 301, in parse
    res = self._parse(timestr, **kwargs)
  File "/usr/lib/python2.7/dist-packages/dateutil/parser.py", line 349, in _parse
    l = _timelex.split(timestr)
  File "/usr/lib/python2.7/dist-packages/dateutil/parser.py", line 143, in split
    return list(cls(s))
  File "/usr/lib/python2.7/dist-packages/dateutil/parser.py", line 137, in next
    token = self.get_token()
  File "/usr/lib/python2.7/dist-packages/dateutil/parser.py", line 68, in get_token
    nextchar = self.instream.read(1)
AttributeError: 'NoneType' object has no attribute 'read'

I have edited the email header in question but it reproduces the original error:

Delivered-To: john.podesta@gmail.com
Received: by 10.142.49.14 with SMTP id w14cs34683wfw;
Wed, 5 Nov 2008 08:11:39 -0800 (PST)
Received: by 10.114.144.1 with SMTP id r1mr728791wad.136.1225901498795;
Wed, 05 Nov 2008 08:11:38 -0800 (PST)
Return-Path:
Received: from QMTA09.emeryville.ca.mail.comcast.net (qmta09.emeryville.ca.mail.comcast.net [76.96.30.96])
by mx.google.com with ESMTP id m26si29354pof.3.2008.11.05.08.11.38;
Wed, 05 Nov 2008 08:11:38 -0800 (PST)
Received-SPF: pass (google.com: domain of sewallconroy@comcast.net designates
Received: from OMTA03.emeryville.ca.mail.comcast.net ([76.96.30.27])
by QMTA09.emeryville.ca.mail.comcast.net with comcast
id bUBY1a0010b6N64A9UBeJl; Wed, 05 Nov 2008 16:11:38 +0000
Received: from amailcenter06.comcast.net ([204.127.225.106])
by OMTA03.emeryville.ca.mail.comcast.net with comcast
id bUAV1a00L2JMgtY8PUAV7G; Wed, 05 Nov 2008 16:10:30 +0000
X-Authority-Analysis: v=1.0 c=1 a=1Ht49J2nGmlg0oY3xr8A:9
a=8nxvWDfACCTtBObdks-tTUtrMyYA:4 a=OA_lqj45gZcA:10 a=diNjy0DT58-4uIkuavEA:9
a=e0_VUgpf8QEu0XMU188OmzzKrzoA:4 a=37WNUvjkh6kA:10
Received: from [24.34.75.99] by amailcenter06.comcast.net;
Wed, 05 Nov 2008 16:10:28 +0000
From: sewallconroy@comcast.net

To: “Podesta” , ricesusane@aol.com
CC: “Denis McDonough OFA” ,
djsberg@gmail.com”, marklippert@yahoo.com,
Subject: DOD leadership – immediate attention
Date: Wed, 05 Nov 2008 16:10:28 +0000
Message-Id: <110520081610.3048.4911C574000C2E2100000BE82216 55799697019D02010C04040E990A9C@comcast.net>
X-Mailer: AT&T Message Center Version 1 (Oct 30 2007)
X-Authenticated-Sender: c2V3YWxsY29ucm95QGNvbWNhc3QubmV0
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=”NextPart_Webmail_9m3u9jl4l_3048_1225901428_0″

–NextPart_Webmail_9m3u9jl4l_3048_1225901428_0
Content-Type: text/plain
Content-Transfer-Encoding: 8bit

I’m comparing “Date” to similar emails and getting no joy.

Absence is hard to notice, but once you know the rule, it’s obvious:

RFC822: Standard for ARPA Internet Text Messages says in part:

3. Lexical Analysis of Messages

3.1 General Description

A message consists of header fields and, optionally, a body. The body is simply a sequence of lines containing ASCII characters. It is separated from the headers by a null line (i.e., a line with nothing preceding the CRLF). (emphasis added)

Yep, the blank line I introduced while removing an errant double-quote on a line by itself created the start of the body of the message.

Meaning that my Python script failed to find the “Date:” field and returned what someone thought would be a useful error message.

When you get errors parsing emails with Python (and I assume in other languages), check the format of your messages!
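A defensive sketch of the same parsing step (hypothetical file name), which at least names the real problem instead of dying inside dateutil:

import email
import dateutil.parser

with open("00012345.eml") as f:  # hypothetical file name
    msg = email.message_from_file(f)

if msg['date'] is None:
    # A premature blank line ends the headers, so "Date:" (and every header
    # after it) silently becomes part of the body.
    print("No Date header - check the message for an early blank line")
else:
    print(dateutil.parser.parse(msg['date']))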

RFC822 has an appendix of parsing rules and a few examples.

Suggested listings of the most common email/email header format errors?

October 26, 2016

Clinton/Podesta 19, DKIM-verified-podesta-19.txt.gz, DKIM-complete-podesta-19.txt.gz

Filed under: Data Mining,Hillary Clinton,Python — Patrick Durusau @ 8:02 pm

Michael Best, @NatSecGeek, posted release 19 of the Clinton/Podesta emails at: https://archive.org/details/PodestaEmailszipped today.

A total of 1518 emails, zero (0) of which broke my script!

Three hundred and sixty-three were DKIM verified! DKIM-verified-podesta-19.txt.gz.

The full set of emails, verified and not: DKIM-complete-podesta-19.txt.gz.

I’m still pondering how to best organize the DKIM verified material for access.

I could segregate “verified” emails for indexing. So any “hits” from those searches are from “verified” emails?

Ditto for indexing only attachments of “verified” emails.

What about a graph constructed solely from “verified” emails?

Or should I make verified a property of the emails as nodes? Reasoning that aside from exploring the email importation in Gephi 8.2, it would not be that much more difficult to build node and adjacency lists from the raw emails.

Thoughts/suggestions?

Serious request for help.

Like Gollum, I know what I keep in my pockets, but I have no idea what other people keep in theirs.

What would make this data useful to you?

October 25, 2016

Clinton/Podesta 1-18, DKIM-verified-podesta-1-18.txt.zip, DKIM-complete-podesta-1-18.txt.zip

Filed under: Data Mining,Hillary Clinton,Python — Patrick Durusau @ 9:51 pm

After a long day of waiting for scripts to finish and re-running them to cross-check the results, I am happy to present:

DKIM-verified-podesta-1-18.txt.gz, which consists of the Podesta emails (7526) which returned true for a test of their DKIM signature.

The complete set of the results for all 31,819 emails, can be found in:

DKIM-complete-podesta-1-18.txt.gz.

An email that has been “verified” has a cryptographic guarantee that it was sent even as it appears to you now.

An email that fails verification, may be just as trustworthy, but its DKIM signature has failed for any number of reasons.

One of my motivations for classifying these emails is to enable the exploration of why DKIM verification failed on some of these emails.
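The classification itself is simple enough to sketch with the dkimpy library (an outline, not my actual script):

import glob
import dkim  # the dkimpy package

# Sort .eml files into verified / unverified piles by their DKIM signatures.
for name in sorted(glob.glob('*.eml')):
    with open(name, 'rb') as f:
        data = f.read()
    try:
        verified = dkim.verify(data)
    except Exception:  # malformed signatures raise, as some of these files do
        verified = False
    print('%s\t%s' % (name, verified))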

Question: What would make this data more useful/accessible to journalists/bloggers?

I ask because, while dumping data and/or transformations of data can be useful, it is synthesizing data into a coherent narrative that is the essence of journalism/reporting.

I would enjoy doing the first in hopes of furthering the second.

PS: More emails will be added to this data set as they become available.

Corrupt (fails with my script) files in Clinton/Podesta Emails (14 files out of 31,819)

Filed under: Data Mining,Hillary Clinton,Python — Patrick Durusau @ 7:32 pm

You may use some other definition of “file corruption” but that’s mine and I’m sticking to it.

😉

The following are all the files that failed against my script and the actions I took to proceed with parsing the files. Not today, but I will make a sed script to correct these files as future accumulations of emails appear.

13544 00047141.eml

Date string parse failed:

Date: Wed, 17 Dec 2008 12:35:42 -0700 (GMT-07:00)

Deleted (GMT-07:00).

15431 00059196.eml

Date string parse failed:

Date: Tue, 22 Sep 2015 06:00:43 +0800 (GMT+08:00)

Deleted (GMT+08:00).

155 00049680.eml

Date string parse failed:

Date: Mon, 27 Jul 2015 03:29:35 +0000

Assuming, as the email reports, info@centerpeace.org was the sender and podesta@law.georgetown.edu was the intended receiver, then the offset from UT is clearly wrong (+0000).

Deleted +0000.

6793 00059195.eml

Date string parse fail:

Date: Tue, 22 Sep 2015 05:57:54 +0800 (GMT+08:00)

Deleted (GMT+08:00).

9404 0015843.eml DKIM failure

All of the DKIM parse failures take the form:

Traceback (most recent call last):
  File "test-clinton-script-24Oct2016.py", line 18, in <module>
    verified = dkim.verify(data)
  File "/usr/lib/python2.7/dist-packages/dkim/__init__.py", line 604, in verify
    return d.verify(dnsfunc=dnsfunc)
  File "/usr/lib/python2.7/dist-packages/dkim/__init__.py", line 506, in verify
    validate_signature_fields(sig)
  File "/usr/lib/python2.7/dist-packages/dkim/__init__.py", line 181, in validate_signature_fields
    if int(sig[b'x']) < int(sig[b't']):
KeyError: 't'

I simply deleted the DKIM-Signature in question. Will go down that rabbit hole another day.

21960 00015764.eml

DKIM signature parse failure.

Deleted DKIM signature.

23177 00015850.eml

DKIM signature parse failure.

Deleted DKIM signature.

23728 00052706.eml

Invalid character in RFC822 header.

I discovered an errant ‘”‘ (double quote mark) at the start of a line.

Deleted the double quote mark.

And deleted ^M line endings.

25040 00015842.eml

DKIM signature parse failure.

Deleted DKIM signature.

26835 00015848.eml

DKIM signature parse failure.

Deleted DKIM signature.

28237 00015840.eml

DKIM signature parse failure.

Deleted DKIM signature.

29052 0001587.eml

DKIM signature parse failure.

Deleted DKIM signature.

29099 00015759.eml

DKIM signature parse failure.

Deleted DKIM signature.

29593 00015851.eml

DKIM signature parse failure.

Deleted DKIM signature.

Here’s an odd pattern for you: all nine (9) of the failures to parse DKIM signatures were on mail originating from:

From: Gene Karpinski

But there are approximately thirty-three (33) emails from Karpinski so it doesn’t fail every time.

The file numbers are based on the 1-18 distribution of Podesta emails created by Michael Best, @NatSecGeek, at: Podesta Emails (zipped).
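For the four date-string failures above, a Python stand-in for the promised sed script might look like this (a sketch of the cleanup, not the eventual script):

import re
import dateutil.parser

def parse_date(value):
    # Strip the trailing "(GMT±HH:MM)" comment that tripped the parser.
    cleaned = re.sub(r'\s*\(GMT[+-]\d{1,2}:\d{2}\)\s*$', '', value)
    return dateutil.parser.parse(cleaned)

print(parse_date('Wed, 17 Dec 2008 12:35:42 -0700 (GMT-07:00)'))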

Finding “unknown string format” in 1.7 GB of files – Parsing Clinton/Podesta Emails

Filed under: Data Mining,Hillary Clinton,Python — Patrick Durusau @ 4:26 pm

Testing my “dirty” script against Podesta Emails (1.7 GB), some 17,296 files, I got the following message:

Traceback (most recent call last):
  File "test-clinton-script-24Oct2016.py", line 20, in <module>
    date = dateutil.parser.parse(msg['date'])
  File "/usr/lib/python2.7/dist-packages/dateutil/parser.py", line 697, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/usr/lib/python2.7/dist-packages/dateutil/parser.py", line 303, in parse
    raise ValueError, "unknown string format"
ValueError: unknown string format

Now I have to find the file that broke the script.

Beginning Python programmers are laughing at this point because they know using:

for name in glob.glob('*.eml'):

is going to make finding the offending file difficult.

Why?

Consulting the programming oracle (Stack Overflow) on ordering of glob.glob in Python I learned:

By checking the source code of glob.glob you see that it internally calls os.listdir, described here:

http://docs.python.org/library/os.html?highlight=os.listdir#os.listdir

Key sentence: os.listdir(path) Return a list containing the names of the entries in the directory given by path. The list is in arbitrary order. It does not include the special entries ‘.’ and ‘..’ even if they are present in the directory.

Arbitrary order. 🙂

Interesting but not quite an actionable answer!

Take a look at:

Order is arbitrary, but you can sort them yourself

If you want sorted by name:

sorted(glob.glob('*.png'))

sorted by modification time:

import os
sorted(glob.glob('*.png'), key=os.path.getmtime)

sorted by size:

import os
sorted(glob.glob('*.png'), key=os.path.getsize)

etc.

So for ease in finding the offending file(s) I adjusted:

for name in glob.glob('*.eml'):

to:

for name in sorted(glob.glob('*.eml')):

Now I can tail the results file in question: the next file in sorted order is where the script failed.
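An alternative sketch reports the offending file directly instead of tailing the results file:

import email
import glob
import dateutil.parser

for name in sorted(glob.glob('*.eml')):
    with open(name) as f:
        msg = email.message_from_file(f)
    try:
        dateutil.parser.parse(msg['date'])
    except (ValueError, TypeError, AttributeError):
        # ValueError for odd date strings; TypeError/AttributeError when the
        # Date header is missing entirely.
        print('date parse failed in %s' % name)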

More on the files that failed in a separate post.

October 23, 2016

Data Science for Political and Social Phenomena [Special Interest Search Interface]

Filed under: Data Science,Python,R,Social Sciences — Patrick Durusau @ 3:53 pm

Data Science for Political and Social Phenomena by Chris Albon.

From the webpage:

I am a data scientist and quantitative political scientist. I specialize in the technical and organizational aspects of applying data science to political and social issues.

Years ago I noticed a gap in the existing data literature. On one side was data science, with roots in mathematics and computer science. On the other side were the social sciences, with hard-earned expertise modeling and predicting complex human behavior. The motivation for this site and ongoing book project is to bridge that gap: to create a practical guide to applying data science to political and social phenomena.

Chris has organized three hundred and twenty-eight pages on Data Wrangling, Python, R, etc.

If you like learning from examples, this is the site for you!

Including this site, what other twelve (12) sites would you include in a Python/R Data Science search interface?

That is an interface that has indexed only that baker’s dozen of sites. So you don’t spend time wading through “the G that is not named” search results.

Serious question.

Not that I would want to maintain such a beast for external use, but having a local search engine tuned to your particular interests could be nice.

October 22, 2016

Python and Machine Learning in Astronomy (Rejuvenate Your Emotional Health)

Filed under: Astroinformatics,Machine Learning,Python — Patrick Durusau @ 10:11 am

Python and Machine Learning in Astronomy (Episode #81) (Jake VanderPlas)

From the webpage:

The advances in Astronomy over the past century are both evidence of and confirmation of the highest heights of human ingenuity. We have learned by studying the frequency of light that the universe is expanding. By observing the orbit of Mercury that Einstein’s theory of general relativity is correct.

It probably won’t surprise you to learn that Python and data science play a central role in modern day Astronomy. This week you’ll meet Jake VanderPlas, an astrophysicist and data scientist from University of Washington. Join Jake and me while we discuss the state of Python in Astronomy.

Links from the show:

Jake on Twitter: @jakevdp

Jake on the web: staff.washington.edu/jakevdp

Python Data Science Handbook: shop.oreilly.com/product/0636920034919.do

Python Data Science Handbook on GitHub: github.com/jakevdp/PythonDataScienceHandbook

Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data: press.princeton.edu/titles/10159.html

PyData Talk: youtube.com/watch?v=qOOk6l-CHNw

eScience Institute: @UWeScience

Large Synoptic Survey Telescope: lsst.org

AstroML: Machine Learning and Data Mining for Astronomy: astroml.org

Astropy project: astropy.org

altair package: pypi.org/project/altair

If your social media feeds have been getting you down, rejoice! This interview with Jake VanderPlas covers Python, machine learning and astronomy.

Nary a mention of current social dysfunction around the globe!

Replace an hour of TV this weekend with this podcast. (Or more hours with others.)

Not only will you have more knowledge, you will be in much better emotional shape to face the coming week!

September 11, 2016

Watch your Python script with strace

Filed under: Profiling,Programming,Python — Patrick Durusau @ 7:21 pm

Description:

Modern operating systems sandbox each process inside of a virtual memory map from which direct I/O operations are generally impossible. Instead, a process has to ask the operating system every time it wants to modify a file or communicate bytes over the network. By using operating system specific tools to watch the system calls a Python script is making — using “strace” under Linux or “truss” under Mac OS X — you can study how a program is behaving and address several different kinds of bugs.

Brandon Rhodes does a delightful presentation on using strace with Python.

Slides for Tracing Python with strace or truss.

I deeply enjoyed this presentation, which I discovered while looking at a Python regex issue.

I anticipate running strace on a Python script this week and will report back on any results or failure to obtain results! (Unlike in academic publishing, experiments and investigations do fail.)
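If you want to drive it from Python itself, a minimal sketch (hypothetical script name) is just a subprocess call:

import subprocess

# -f follows child processes; -o writes the system-call trace to a file.
subprocess.call(['strace', '-f', '-o', 'trace.out', 'python', 'my_script.py'])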

September 1, 2016

Dark Web OSINT With Python Part Three: Visualization

Filed under: Dark Web,Open Source Intelligence,Python,Tor — Patrick Durusau @ 4:40 pm

Dark Web OSINT With Python Part Three: Visualization by Justin.

From the post:

Welcome back! In this series of blog posts we are wrapping the awesome OnionScan tool and then analyzing the data that falls out of it. If you haven’t read parts one and two in this series then you should go do that first. In this post we are going to analyze our data in a new light by visualizing how hidden services are linked together as well as how hidden services are linked to clearnet sites.

One of the awesome things that OnionScan does is look for links between hidden services and clearnet sites and makes these links available to us in the JSON output. Additionally it looks for IP address leaks or references to IP addresses that could be used for deanonymization.

We are going to extract these connections and create visualizations that will assist us in looking at interesting connections, popular hidden services with a high number of links and along the way learn some Python and how to use Gephi, a visualization tool. Let’s get started!

Justin tops off this great series on OnionScan by teaching the rudiments of using Gephi to visualize and explore the resulting data.
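If you want to skip ahead, a rough sketch of the export step (the JSON field names here are illustrative, not OnionScan’s exact schema) builds a hidden-service-to-clearnet graph with networkx and writes GraphML, which Gephi opens directly:

import glob
import json
import networkx as nx

g = nx.DiGraph()
for path in glob.glob('onionscan_results/*.json'):
    with open(path) as f:
        scan = json.load(f)
    onion = scan.get('hiddenService', path)           # illustrative key name
    for site in scan.get('linkedClearnetSites', []):  # illustrative key name
        g.add_edge(onion, site)

nx.write_graphml(g, 'onion_links.graphml')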

Can you map yourself from the Dark Web to a visible site?

If so, you aren’t hidden well enough.
