Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

September 14, 2015

Data Science from Scratch

Filed under: Data Science,Python — Patrick Durusau @ 8:30 pm

Data Science from Scratch by Joel Grus.

Joel provides a whirlwind tour of Python that is part of the employee orientation at DataSciencester. Not everything you need to know about Python but a good sketch of why it is important to data scientists.

I first saw this in a tweet by Kirk Borne.

August 31, 2015

Intermediate Python

Filed under: Programming,Python — Patrick Durusau @ 7:43 pm

Intermediate Python by Muhammad Yasoob Ullah Khalid.


Python is an amazing language with a strong and friendly community of programmers. However, there is a lack of documentation on what to learn after getting the basics of Python down your throat. Through this book I aim to solve this problem. I would give you bits of information about some interesting topics which you can further explore.

The topics which are discussed in this book open up your mind towards some nice corners of Python language. This book is an outcome of my desire to have something like this when I was beginning to learn Python.

If you are a beginner, intermediate or even an advanced programmer there is something for you in this book.

Read online at Python Tips or get the donation version at Gumroad.

I first saw this in a tweet by Christophe Lalanne.

August 29, 2015

DataPyR


Filed under: Data Science,Python,R — Patrick Durusau @ 3:19 pm

DataPyR by Kranthi Kumar.

Twenty (20) lists of programming resources on data science, Python and R.

A much easier collection of resources to scan than attempting to search for resources on any of these topics.

At the same time, you have to visit each resource and mine it for an answer to any particular problem.

For example, there is a list of Python Packages for Datamining, which is useful, but even more useful would be a list of common datamining tasks with pointers to particular data mining libraries. That would enable users to search across multiple libraries by task, as opposed to exploring each library.

Expand that across a set of resources on data science, Python and R and you’re talking about saving time and resources across the entire community.
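To make the task-first idea concrete, here is a minimal sketch of such an index in Python. The library names are real, but the task assignments are hypothetical placeholders chosen for illustration, not a survey of what each library covers; the helper names `invert` and `libraries_for` are likewise made up for this example.

```python
from collections import defaultdict

# Hypothetical entries: how most resource lists are organized (library -> tasks).
LIBRARY_TASKS = {
    "scikit-learn": ["classification", "clustering", "regression"],
    "NLTK": ["tokenization", "classification"],
    "pandas": ["data cleaning", "time series"],
    "statsmodels": ["regression", "time series"],
}


def invert(library_tasks):
    """Turn a library -> tasks mapping into a task -> libraries index."""
    index = defaultdict(list)
    for library, tasks in library_tasks.items():
        for task in tasks:
            index[task].append(library)
    # Sort each list so lookups give stable, readable results.
    return {task: sorted(libs) for task, libs in index.items()}


TASK_INDEX = invert(LIBRARY_TASKS)


def libraries_for(task):
    """Return every library listed for a task (empty list if none)."""
    return TASK_INDEX.get(task.lower(), [])


print(libraries_for("regression"))
```

The point of the inversion is that a user starts from the problem ("regression") and lands on candidate libraries, rather than reading each library's documentation in turn.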

I first saw this in a tweet by Kirk Borne.

August 17, 2015

101 webscraping and research tasks for the data journalist

Filed under: Journalism,News,Python,Reporting,Web Scrapers — Patrick Durusau @ 4:56 pm

101 webscraping and research tasks for the data journalist by Dan Nguyen.

From the webpage:

This repository contains 101 Web data-collection tasks in Python 3 that I assigned to my Computational Journalism class in Spring 2015 to give them regular exercise in programming and conducting research, and to expose them to the variety of data published online.

The hard part of many of these tasks is researching and finding the actual data source. The scripts need only concern itself with fetching the data and printing the answer in the least painful way possible. Since the Computational Journalism class wasn’t intended to be an actual programming class, adherence to idioms and best codes practices was not emphasized…(especially since I’m new to Python myself!)

Too good of an idea to not steal! Practical and immediate results, introduction to coding, etc.
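For a sense of scale, the "fetch the data and print the answer" pattern can be sketched with nothing but the standard library. This is not one of the 101 tasks; the task (count links to PDF files), the sample page and the names `LinkCollector` and `count_pdf_links` are all invented for illustration, and an inline string stands in for a downloaded page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def count_pdf_links(html):
    """Answer a made-up task: how many links on the page point at PDF files?"""
    parser = LinkCollector()
    parser.feed(html)
    return sum(1 for link in parser.links if link.lower().endswith(".pdf"))


# Stand-in for a fetched page. A real script would download it first,
# for example with urllib.request.urlopen(url).read().decode("utf-8").
SAMPLE_PAGE = """
<html><body>
  <a href="/reports/2014.pdf">2014 report</a>
  <a href="/reports/2015.PDF">2015 report</a>
  <a href="/about.html">About</a>
</body></html>
"""

print(count_pdf_links(SAMPLE_PAGE))
```

As the quoted description says, the code is the easy part; finding the right page to feed it is where the research happens.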

What 101 tasks do you want to document and with what tool?

PS: The Computational Journalism class site has a nice set of online references for Python.

August 15, 2015

Modeling and Analysis of Complex Systems

Filed under: Analytics,Complexity,Modeling,Python — Patrick Durusau @ 8:07 pm

Introduction to the Modeling and Analysis of Complex Systems by Hiroki Sayama.

From the webpage:

Introduction to the Modeling and Analysis of Complex Systems introduces students to mathematical/computational modeling and analysis developed in the emerging interdisciplinary field of Complex Systems Science. Complex systems are systems made of a large number of microscopic components interacting with each other in nontrivial ways. Many real-world systems can be understood as complex systems, where critically important information resides in the relationships between the parts and not necessarily within the parts themselves. This textbook offers an accessible yet technically-oriented introduction to the modeling and analysis of complex systems. The topics covered include: fundamentals of modeling, basics of dynamical systems, discrete-time models, continuous-time models, bifurcations, chaos, cellular automata, continuous field models, static networks, dynamic networks, and agent-based models. Most of these topics are discussed in two chapters, one focusing on computational modeling and the other on mathematical analysis. This unique approach provides a comprehensive view of related concepts and techniques, and allows readers and instructors to flexibly choose relevant materials based on their objectives and needs. Python sample codes are provided for each modeling example.

This textbook is available for purchase in both grayscale and color.

Do us all a favor and pass along the purchase options for classroom hard copies. This style of publishing will last only so long as a majority of us support it. Thanks!

From the introduction:

This is an introductory textbook about the concepts and techniques of mathematical/computational modeling and analysis developed in the emerging interdisciplinary field of complex systems science. Complex systems can be informally defined as networks of many interacting components that may arise and evolve through self-organization. Many real-world systems can be modeled and understood as complex systems, such as political organizations, human cultures/languages, national and international economies, stock markets, the Internet, social networks, the global climate, food webs, brains, physiological systems, and even gene regulatory networks within a single cell; essentially, they are everywhere. In all of these systems, a massive amount of microscopic components are interacting with each other in nontrivial ways, where important information resides in the relationships between the parts and not necessarily within the parts themselves. It is therefore imperative to model and analyze how such interactions form and operate in order to understand what will emerge at a macroscopic scale in the system.

Complex systems science has gained an increasing amount of attention from both inside and outside of academia over the last few decades. There are many excellent books already published, which can introduce you to the big ideas and key take-home messages about complex systems. In the meantime, one persistent challenge I have been having in teaching complex systems over the last several years is the apparent lack of accessible, easy-to-follow, introductory-level technical textbooks. What I mean by technical textbooks are the ones that get down to the “wet and dirty” details of how to build mathematical or computational models of complex systems and how to simulate and analyze them. Other books that go into such levels of detail are typically written for advanced students who are already doing some kind of research in physics, mathematics, or computer science. What I needed, instead, was a technical textbook that would be more appropriate for a broader audience—college freshmen and sophomores in any science, technology, engineering, and mathematics (STEM) areas, undergraduate/graduate students in other majors, such as the social sciences, management/organizational sciences, health sciences and the humanities, and even advanced high school students looking for research projects who are interested in complex systems modeling.

Can you imagine that? A technical textbook appropriate for a broad audience?

Perish the thought!

I could name several W3C standards that could have used that editorial stance as opposed to: “…we know what we meant….”

I should consider that as a market opportunity, to translate insider jargon (and deliberately so) into more generally accessible language. Might even help with uptake of the standards.

While I think about that, enjoy this introduction to complex systems, with Python no less.
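As a taste of what a discrete-time model looks like in Python, here is a minimal sketch of the logistic map, x[t+1] = r * x[t] * (1 - x[t]). It is a standard textbook example chosen for illustration; it is not taken from Sayama's sample code, and the function name and parameter values are assumptions of this sketch.

```python
def logistic_map(r, x0, steps):
    """Iterate x[t+1] = r * x[t] * (1 - x[t]) and return the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs


# With r = 2.5 the trajectory settles onto the fixed point 1 - 1/r = 0.6.
settled = logistic_map(2.5, 0.1, 200)
print(settled[-1])

# With r = 4.0 the map is in its chaotic regime: two nearly identical
# starting points typically drift far apart within a few dozen steps.
a = logistic_map(4.0, 0.2, 50)
b = logistic_map(4.0, 0.2 + 1e-9, 50)
print(max(abs(p - q) for p, q in zip(a, b)))
```

One update rule, two values of r, two very different behaviors: a small-scale version of the bifurcations and chaos the table of contents promises.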

July 26, 2015

Learning Data Science Using Functional Python

Filed under: Data Science,Functional Programming,Python — Patrick Durusau @ 8:14 pm

Learning Data Science Using Functional Python by Joel Grus.

Something fun to start the week off!

Apologies for the “lite” posting of late. I am munging some small but very ugly data for a report this coming week. The data sources range from spreadsheets to forms delivered in PDF, in no particular order and some without the original numbering. What fun!

Complaints about updating URLs that were redirects were met with replies that “private redirects” weren’t of interest and they would continue to use the original URLs. Something tells me the responsible parties didn’t quite get what URL redirects are about.

Another day or so and I will be back at full force with more background on the Balisage presentation and more useful posts every day.

July 21, 2015

Ibis on Impala: Python at Scale for Data Science

Filed under: Cloudera,Python — Patrick Durusau @ 7:24 pm

Ibis on Impala: Python at Scale for Data Science by Marcel Kornacker and Wes McKinney.

From the post:

Ibis: Same Great Python Ecosystem at Hadoop Scale

Co-founded by the respective architects of the Python pandas toolkit and Impala and now incubating in Cloudera Labs, Ibis is a new data analysis framework with the goal of enabling advanced data analysis on a 100% Python stack with full-fidelity data. With Ibis, for the first time, developers and data scientists will be able to utilize the last 15 years of advances in high-performance Python tools and infrastructure in a Hadoop-scale environment—without compromising user experience for performance. It’s exactly the same Python you know and love, only at scale!

In this initial (unsupported) Cloudera Labs release, Ibis offers comprehensive support for the analytical capabilities presently provided by Impala, enabling Python users to run Big Data workloads in a manner similar to that of “small data” tools like pandas. Next, we’ll extend Impala and Ibis in several ways to make the Python ecosystem a seamless part of the stack:

  • First, Ibis will enable more natural data modeling by leveraging Impala’s upcoming support for nested types (expected by end of 2015).
  • Second, we’ll add support for Python user-defined logic so that Ibis will integrate with the existing Python data ecosystem—enabling custom Python functions at scale.
  • Finally, we’ll accelerate performance further through low-level integrations between Ibis and Impala with a new Python-friendly, in-memory columnar format and Python-to-LLVM code generation. These updates will accelerate Python to run at native hardware speed.

See: Getting Started with Ibis and How to Contribute (same authors, opposite order) in order to cut to the chase and get started.


June 25, 2015

Eidyia (Scientific Python)

Filed under: Python,Science — Patrick Durusau @ 2:30 pm


From the webpage:

A scientific Python 3 environment configured with Vagrant. This environment is designed to be used by professionals and students, with ease of access a priority.

Libraries included:


Eidyia also includes MongoDB and PostgreSQL

Getting Started

With Vagrant and VirtualBox installed:

Watch out for the Vagrant link on the GitHub page; it is broken. The correct link appears above. (I am posting an issue about the link to GitHub.)

The more experience I have with virtual environments, the more I like them. Mostly from a configuration perspective. I don’t have to worry about library upgrades stepping on other programs, port confusion, etc.


June 15, 2015

A gallery of interesting IPython Notebooks

Filed under: Programming,Python — Patrick Durusau @ 3:05 pm

A gallery of interesting IPython Notebooks by David Mendler.

From the webpage:

This page is a curated collection of IPython notebooks that are notable for some reason. Feel free to add new content here, but please try to only include links to notebooks that include interesting visual or technical content; this should not simply be a dump of a Google search on every ipynb file out there.

The table of contents:

  1. Entire books or other large collections of notebooks on a topic
  2. Scientific computing and data analysis with the SciPy Stack
  3. General Python Programming
  4. Notebooks in languages other than Python
  5. Miscellaneous topics about doing various things with the Notebook itself
  6. Reproducible academic publications
  7. Other publications using the Notebook
  8. Data-driven journalism
  9. Whimsical notebooks
  10. Videos of IPython being used in the wild

Yes, quoting the table of contents may impact my ranking by Google but I prefer content that is useful to me and hopefully you. Please bookmark this site and pass it on.

June 13, 2015

Python Mode for Processing

Filed under: Processing,Python,Visualization — Patrick Durusau @ 3:20 pm

Python Mode for Processing

From the webpage:

You write Processing code. In Python.

Processing is a programming language, development environment, and online community. Since 2001, Processing has promoted software literacy within the visual arts and visual literacy within technology. Today, there are tens of thousands of students, artists, designers, researchers, and hobbyists who use Processing for learning, prototyping, and production.

Processing was initially released with a Java-based syntax, and with a lexicon of graphical primitives that took inspiration from OpenGL, Postscript, Design by Numbers, and other sources. With the gradual addition of alternative programming interfaces — including JavaScript, Python, and Ruby — it has become increasingly clear that Processing is not a single language, but rather, an arts-oriented approach to learning, teaching, and making things with code.

We are thrilled to make available this public release of the Python Mode for Processing, and its associated documentation. More is on the way! If you’d like to help us improve the implementation of Python Mode and its documentation, please find us on Github!

A screen shot of part of one image will give you a glimpse of the power of Processing:


BTW, this screen shot pales in comparison to the original image.

Enough said?

June 11, 2015

NumPy / SciPy / Pandas Cheat Sheet

Filed under: Numpy,Python — Patrick Durusau @ 9:53 am

NumPy / SciPy / Pandas Cheat Sheet from Quandl.

Useful but also an illustration of the tension between a true cheatsheet (one page, tiny print) and edging towards a legible but multi-page booklet.

I suspect the greatest benefit of a “cheatsheet” accrues to its author: the chores of selecting, typing and correcting are the repetition that leads to memorization of the material.

I first saw this in a tweet by Kirk Borne.

June 2, 2015

Statistical and Mathematical Functions with DataFrames in Spark

Filed under: Data Frames,Python,Spark — Patrick Durusau @ 2:59 pm

Statistical and Mathematical Functions with DataFrames in Spark by Burak Yavuz and Reynold Xin.

From the post:

We introduced DataFrames in Spark 1.3 to make Apache Spark much easier to use. Inspired by data frames in R and Python, DataFrames in Spark expose an API that’s similar to the single-node data tools that data scientists are already familiar with. Statistics is an important part of everyday data science. We are happy to announce improved support for statistical and mathematical functions in the upcoming 1.4 release.

In this blog post, we walk through some of the important functions, including:

  1. Random data generation
  2. Summary and descriptive statistics
  3. Sample covariance and correlation
  4. Cross tabulation (a.k.a. contingency table)
  5. Frequent items
  6. Mathematical functions

We use Python in our examples. However, similar APIs exist for Scala and Java users as well.
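If you want to see what items 3 and 4 in that list compute before building Spark, here is a plain-Python sketch of sample covariance, Pearson correlation and a cross tabulation. It is deliberately not the Spark DataFrame API: the function names and data are assumptions of this sketch, and everything runs on ordinary lists on a single machine.

```python
from collections import Counter
from math import sqrt


def sample_cov(xs, ys):
    """Sample covariance with the usual n - 1 denominator."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson_corr(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return sample_cov(xs, ys) / sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))


def crosstab(rows, cols):
    """Contingency table as a Counter keyed by (row value, column value)."""
    return Counter(zip(rows, cols))


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # exactly 2 * xs, so the correlation should be 1
print(sample_cov(xs, ys), pearson_corr(xs, ys))

names = ["a", "a", "b", "a"]
items = ["x", "x", "y", "y"]
print(crosstab(names, items))
```

The arithmetic is the same whatever the scale; what Spark adds is doing it across a cluster without pulling the data onto one machine.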

You do know you have to build Spark yourself to find these features before the release of 1.4. Yes? For that:

Have you ever heard the expression “used in anger?”

That’s what Spark and its components deserve, to be “used in anger.”


May 28, 2015

Content Recommendation From Links Shared on Twitter Using Neo4j and Python

Filed under: Cypher,Graphs,Neo4j,Python,Twitter — Patrick Durusau @ 4:50 pm

Content Recommendation From Links Shared on Twitter Using Neo4j and Python by William Lyon.

From the post:


I’ve spent some time thinking about generating personalized recommendations for articles since I began working on an iOS reading companion for the bookmarking service. One of the features I want to provide is a feed of recommended articles for my users based on articles they’ve saved and read. In this tutorial we will look at how to implement a similar feature: how to recommend articles for users based on articles they’ve shared on Twitter.


The main tools we will use are Python and Neo4j, a graph database. We will use Python for fetching the data from Twitter, extracting keywords from the articles shared and for inserting the data into Neo4j. To find recommendations we will use Cypher, the Neo4j query language.

Very clear and complete!
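The core idea, recommending articles that share keywords with what a user has already shared, can be sketched in a few lines of plain Python. This is an in-memory stand-in for illustration only: it uses dictionaries rather than Neo4j and Cypher, and the users, article ids, keywords and the `recommend` function are all hypothetical.

```python
from collections import Counter

# Hypothetical data: keywords extracted per article, and who shared what.
ARTICLE_KEYWORDS = {
    "a1": {"python", "graphs"},
    "a2": {"neo4j", "graphs"},
    "a3": {"cooking"},
    "a4": {"python", "neo4j"},
}
USER_SHARES = {"alice": {"a1"}, "bob": {"a3"}}


def recommend(user, limit=3):
    """Rank unshared articles by keyword overlap with the user's shared ones."""
    shared = USER_SHARES.get(user, set())
    # Count how often each keyword appears in the user's shared articles.
    interests = Counter(kw for article in shared for kw in ARTICLE_KEYWORDS[article])
    scores = {}
    for article, keywords in ARTICLE_KEYWORDS.items():
        if article in shared:
            continue
        score = sum(interests[kw] for kw in keywords)
        if score:
            scores[article] = score
    # Highest score first; ties broken by article id for repeatable output.
    return sorted(scores, key=lambda a: (-scores[a], a))[:limit]


print(recommend("alice"))
```

In a graph database the same question becomes a path traversal (user to article to keyword to other articles), which is where a query language like Cypher earns its keep.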


May 24, 2015

LOFAR Transients Pipeline (“TraP”)

Filed under: Astroinformatics,Python,Science — Patrick Durusau @ 5:41 pm

LOFAR Transients Pipeline (“TraP”)

From the webpage:

The LOFAR Transients Pipeline (“TraP”) provides a means of searching a stream of N-dimensional (two spatial, frequency, polarization) image “cubes” for transient astronomical sources. The pipeline is developed specifically to address data produced by the LOFAR Transients Key Science Project, but may also be applicable to other instruments or use cases.

The TraP codebase provides the pipeline definition itself, as well as a number of supporting routines for source finding, measurement, characterization, and so on. Some of these routines are also available as stand-alone tools.

High-level overview

The TraP consists of a tightly-coupled combination of a “pipeline definition” – effectively a Python script that marshals the flow of data through the system – with a library of analysis routines written in Python and a database, which not only contains results but also performs a key role in data processing.

Broadly speaking, as images are ingested by the TraP, a Python-based source-finding routine scans them, identifying and measuring all point-like sources. Those sources are ingested by the database, which associates them with previous measurements (both from earlier images processed by the TraP and from other catalogues) to form a lightcurve. Measurements are then performed at the locations of sources which were expected to be seen in this image but which were not detected. A series of statistical analyses are performed on the lightcurves constructed in this way, enabling the quick and easy identification of potential transients. This process results in two key data products: an archival database containing the lightcurves of all point-sources included in the dataset being processed, and community alerts of all transients which have been identified.

Exploiting the results of the TraP involves understanding and analysing the resulting lightcurve database. The TraP itself provides no tools directly aimed at this. Instead, the Transients Key Science Project has developed the Banana web interface to the database, which is maintained separately from the TraP. The database may also be interrogated by end-user developed tools using SQL.

While it uses the term “association,” I think you will conclude it is much closer to merging in a topic map sense:

The association procedure knits together (“associates”) the measurements in extractedsource which are believed to originate from a single astronomical source. Each such source is given an entry in the runningcatalog table which ties together all of the measurements by means of the assocxtrsource table. Thus, an entry in runningcatalog can be thought of as a reference to the lightcurve of a particular source.
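To see why that reads like merging, here is a deliberately simplified sketch of association by position. It assumes flat (x, y) coordinates and a fixed matching radius, neither of which is claimed to be how TraP actually decides that two measurements belong together; the function, field names and data are hypothetical, with `running_catalog` borrowed loosely from the table name in the quote.

```python
from math import hypot


def associate(measurements, radius):
    """Group (x, y, flux) measurements into sources by positional proximity.

    Each source keeps the position of its first measurement and the list of
    fluxes tied to it (its "lightcurve").
    """
    running_catalog = []
    for x, y, flux in measurements:
        # Find the closest known source within the association radius.
        best = None
        for source in running_catalog:
            distance = hypot(x - source["x"], y - source["y"])
            if distance <= radius and (best is None or distance < best[0]):
                best = (distance, source)
        if best is None:
            # Nothing close enough: start a new catalogue entry.
            running_catalog.append({"x": x, "y": y, "lightcurve": [flux]})
        else:
            best[1]["lightcurve"].append(flux)
    return running_catalog


measurements = [
    (10.0, 10.0, 1.2),
    (50.0, 50.0, 0.4),
    (10.1, 9.9, 1.5),
    (50.0, 50.2, 0.5),
    (90.0, 10.0, 3.0),
]
for source in associate(measurements, radius=0.5):
    print(source)
```

Five measurements collapse into three subjects, each carrying everything known about it: that is the merging a topic map would do, with position standing in for the subject identifier.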

Perhaps not of immediate use but good reading and a diversion from corruption, favoritism, oppression and other usual functions of government.

May 2, 2015

On The Bleeding Edge – PySpark, DataFrames, and Cassandra

Filed under: Cassandra,Data Frames,Python — Patrick Durusau @ 8:17 pm

On The Bleeding Edge – PySpark, DataFrames, and Cassandra.

From the post:

A few months ago I wrote a post on Getting Started with Cassandra and Spark.

I’ve worked with Pandas for some small personal projects and found it very useful. The key feature is the data frame, which comes from R. Data Frames are new in Spark 1.3 and was covered in this blog post. Till now I’ve had to write Scala in order to use Spark. This has resulted in me spending a lot of time looking for libraries that would normally take me less than a second to recall the proper Python library (JSON being an example) since I don’t know Scala very well.

If you need help deciding whether to read this post, take a look at Spark SQL and DataFrame Guide to see what you stand to gain.


New Natural Language Processing and NLTK Videos

Filed under: Natural Language Processing,NLTK,Python — Patrick Durusau @ 3:59 pm

“Natural Language Processing With Python and NLTK p.1 Tokenizing words and Sentences” and “Stop Words – Natural Language Processing With Python and NLTK p.2” by Harrison Kinsley.

From part 1:

Natural Language Processing is the task we give computers to read and understand (process) written text (natural language). By far, the most popular toolkit or API to do natural language processing is the Natural Language Toolkit for the Python programming language.

The NLTK module comes packed full of everything from trained algorithms to identify parts of speech to unsupervised machine learning algorithms to help you train your own machine to understand a specific bit of text.

NLTK also comes with a large corpora of data sets containing things like chat logs, movie reviews, journals, and much more!

Bottom line, if you’re going to be doing natural language processing, you should definitely look into NLTK!

Playlist link:…

sample code:

Use the Playlist link:… link as I am sure more videos will be appearing in the near future.
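If you want a feel for what the first two videos cover before installing anything, here is a naive standard-library sketch of sentence tokenizing, word tokenizing and stop word removal. It is not NLTK and makes no claim to match its results; NLTK's own `sent_tokenize`, `word_tokenize` and `stopwords` corpus handle the many cases these regular expressions miss. The tiny stop word list and function names are assumptions of this sketch.

```python
import re

# A tiny, hand-picked stop word list for illustration only.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "this", "to"}


def tokenize_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize_words(text):
    """Naive word tokenizer: runs of letters, digits and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


def remove_stop_words(words):
    """Drop words that appear in STOP_WORDS, ignoring case."""
    return [w for w in words if w.lower() not in STOP_WORDS]


text = "This is an example. The toolkit splits text to words!"
print(tokenize_sentences(text))
print(remove_stop_words(tokenize_words(text)))
```

Abbreviations such as "p.1" are exactly where a splitter this simple falls over, which is a good argument for using a trained tokenizer.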


May 1, 2015

Large-Scale Social Phenomena – Data Mining Demo

Filed under: Data Mining,Python — Patrick Durusau @ 7:48 pm

Large-Scale Social Phenomena – Data Mining Demo by Artemy Kolchinsky.

From the webpage:

For your mid-term hack-a-thons, you will be expected to quickly acquire, analyze and draw conclusion from some real-world datasets. The goal of this tutorial is to provide you with some tools that will hopefully enable you to spend less time debugging and more time generating and testing interesting ideas.

Here, I chose to focus on Python. It is beautiful language that is quickly developing an ecosystem of powerful and free scientific computing and data mining tools (e.g. the Homogenization of scientific computing, or why Python is steadily eating other languages’ lunch). For this reason, as well as my own familiarity with it, I encourage (though certainly not require) you to use it for your mid-term hack-a-thons. From my own experience, getting comfortable with these tools will pay off in terms of making many future data analysis projects (including perhaps your final projects) easier & more enjoyable.

Just in time for the weekend! I first saw this in a tweet by Lynn Cherny.

Suggestions of odd data sources for mining?

April 26, 2015

Getting Started with Spark (in Python)

Filed under: Hadoop,MapReduce,Python,Spark — Patrick Durusau @ 2:21 pm

Getting Started with Spark (in Python) by Benjamin Bengfort.

From the post:

Hadoop is the standard tool for distributed computing across really large data sets and is the reason why you see "Big Data" on advertisements as you walk through the airport. It has become an operating system for Big Data, providing a rich ecosystem of tools and techniques that allow you to use a large cluster of relatively cheap commodity hardware to do computing at supercomputer scale. Two ideas from Google in 2003 and 2004 made Hadoop possible: a framework for distributed storage (The Google File System), which is implemented as HDFS in Hadoop, and a framework for distributed computing (MapReduce).

These two ideas have been the prime drivers for the advent of scaling analytics, large scale machine learning, and other big data appliances for the last ten years! However, in technology terms, ten years is an incredibly long time, and there are some well-known limitations that exist, with MapReduce in particular. Notably, programming MapReduce is difficult. You have to chain Map and Reduce tasks together in multiple steps for most analytics. This has resulted in specialized systems for performing SQL-like computations or machine learning. Worse, MapReduce requires data to be serialized to disk between each step, which means that the I/O cost of a MapReduce job is high, making interactive analysis and iterative algorithms very expensive; and the thing is, almost all optimization and machine learning is iterative.

To address these problems, Hadoop has been moving to a more general resource management framework for computation, YARN (Yet Another Resource Negotiator). YARN implements the next generation of MapReduce, but also allows applications to leverage distributed resources without having to compute with MapReduce. By generalizing the management of the cluster, research has moved toward generalizations of distributed computation, expanding the ideas first imagined in MapReduce.

Spark is the first fast, general purpose distributed computing paradigm resulting from this shift and is gaining popularity rapidly. Spark extends the MapReduce model to support more types of computations using a functional programming paradigm, and it can cover a wide range of workflows that previously were implemented as specialized systems built on top of Hadoop. Spark uses in-memory caching to improve performance and, therefore, is fast enough to allow for interactive analysis (as though you were sitting on the Python interpreter, interacting with the cluster). Caching also improves the performance of iterative algorithms, which makes it great for data theoretic tasks, especially machine learning.

In this post we will first discuss how to set up Spark to start easily performing analytics, either simply on your local machine or in a cluster on EC2. We then will explore Spark at an introductory level, moving towards an understanding of what Spark is and how it works (hopefully motivating further exploration). In the last two sections we will start to interact with Spark on the command line and then demo how to write a Spark application in Python and submit it to the cluster as a Spark job.

Be forewarned, this post uses the “F” word (functional) to describe the programming paradigm of Spark. Just so you know. 😉

If you aren’t already using Spark, this is about as easy a learning curve as can be expected.
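For anyone nervous about the “F” word, here is the map-then-reduce shape of a word count in plain, single-machine Python. It is a sketch of the programming style only, not Spark code; in PySpark the corresponding steps are usually written with `flatMap`, `map` and `reduceByKey` on an RDD. The sample lines and the `add_pair` helper are made up for this example.

```python
from functools import reduce
from itertools import chain

lines = ["to be or not to be", "to do is to be"]

# Map: split each line into words and emit a (word, 1) pair per word.
pairs = chain.from_iterable(((word, 1) for word in line.split()) for line in lines)


def add_pair(counts, pair):
    """Reduce step: fold one (word, n) pair into a new counts dict."""
    word, n = pair
    return {**counts, word: counts.get(word, 0) + n}


# Reduce: combine all pairs into a single word -> count mapping.
word_counts = reduce(add_pair, pairs, {})
print(word_counts)
```

No step mutates shared state, which is what lets a framework split the same computation across many machines.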


I first saw this in a tweet by DataMining.

April 25, 2015

pandas: powerful Python data analysis toolkit Release 0.16

Filed under: Data Analysis,Programming,Python — Patrick Durusau @ 7:42 pm

pandas: powerful Python data analysis toolkit Release 0.16 by Wes McKinney and PyData Development Team.

I mentioned Wes’ 2011 paper on pandas in 2011 and a lot has changed since then.

From the homepage:

pandas: powerful Python data analysis toolkit

Date: March 24, 2015 Version: 0.16.0

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

pandas is well suited for many different kinds of data:

  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time series data.
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
  • Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R’s data.frame provides and much more. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.

Here are just a few of the things that pandas does well:

  • Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
  • Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes (possible to have multiple labels per tick)
  • Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
  • Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

Many of these principles are here to address the shortcomings frequently experienced using other languages / scientific research environments. For data scientists, working with data is typically divided into multiple stages: munging and cleaning data, analyzing / modeling it, then organizing the results of the analysis into a form suitable for plotting or tabular display. pandas is the ideal tool for all of these tasks.

Some other notes

  • pandas is fast. Many of the low-level algorithmic bits have been extensively tweaked in Cython code. However, as with anything else generalization usually sacrifices performance. So if you focus on one feature for your application you may be able to create a faster specialized tool.
  • pandas is a dependency of statsmodels, making it an important part of the statistical computing ecosystem in Python.
  • pandas has been used extensively in production in financial applications.


This documentation assumes general familiarity with NumPy. If you haven’t used NumPy much or at all, do invest some time in learning about NumPy first.
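Two items from the feature list above, missing-data handling and group by, fit in a few lines. This is a minimal sketch assuming pandas and NumPy are installed; the column names and values are invented, and it was written against a recent pandas, so details may differ on 0.16.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["Austin", "Austin", "Boston", "Boston"],
    "temp": [30.0, np.nan, 20.0, 24.0],
})

# Missing data: NaN is skipped by default in reductions such as mean().
overall_mean = df["temp"].mean()

# Split-apply-combine: split by city, apply mean, combine into a Series.
by_city = df.groupby("city")["temp"].mean()

# Fill each gap with the mean of the row's own group.
filled = df["temp"].fillna(df.groupby("city")["temp"].transform("mean"))

print(overall_mean)
print(by_city)
print(filled.tolist())
```

The missing Austin reading is filled from the other Austin row, not from the overall mean, because `transform` returns group results aligned to the original rows.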

Not that I’m one to make editorial suggestions, ;-), but with almost 200 pages of What’s New entries going back to September of 2011 and topping out at over 1600 pages, I would move all but the latest What’s New to the end. Yes?

BTW, at 1600 pages, you may already be behind in your reading. Are you sure you want to get further behind?

Not only will the reading be entertaining, it will have the side benefit of improving your data analysis skills as well.


I first saw this mentioned in a tweet by Kirk Borne.

April 16, 2015

GOBLET: The Global Organisation for Bioinformatics Learning, Education and Training

Filed under: Bioinformatics,Python — Patrick Durusau @ 1:21 pm

GOBLET: The Global Organisation for Bioinformatics Learning, Education and Training by Teresa K. Atwood, et al. (PLOS Published: April 9, 2015 DOI: 10.1371/journal.pcbi.1004143)


In recent years, high-throughput technologies have brought big data to the life sciences. The march of progress has been rapid, leaving in its wake a demand for courses in data analysis, data stewardship, computing fundamentals, etc., a need that universities have not yet been able to satisfy—paradoxically, many are actually closing “niche” bioinformatics courses at a time of critical need. The impact of this is being felt across continents, as many students and early-stage researchers are being left without appropriate skills to manage, analyse, and interpret their data with confidence. This situation has galvanised a group of scientists to address the problems on an international scale. For the first time, bioinformatics educators and trainers across the globe have come together to address common needs, rising above institutional and international boundaries to cooperate in sharing bioinformatics training expertise, experience, and resources, aiming to put ad hoc training practices on a more professional footing for the benefit of all.

Great background on GOBLET.

One of the functions of GOBLET is to share training materials in bioinformatics and that is well underway. The Training Portal has eighty-nine (89) sets of training materials as of today, ranging from Pathway and Network Analysis 2014 Module 1 – Introduction to Gene Lists to Parsing data records using Python programming and points in between!

If your training materials aren’t represented, perhaps it is time for you to correct that oversight.


I first saw this in a tweet by Mick Watson.

April 8, 2015

PyCon 2015 Scikit-learn Tutorial

Filed under: Machine Learning,Python,Scikit-Learn — Patrick Durusau @ 8:45 am

PyCon 2015 Scikit-learn Tutorial by Jake VanderPlas.


Machine learning is the branch of computer science concerned with the development of algorithms which can be trained by previously-seen data in order to make predictions about future data. It has become an important aspect of work in a variety of applications: from optimization of web searches, to financial forecasts, to studies of the nature of the Universe.

This tutorial will explore machine learning with a hands-on introduction to the scikit-learn package. Beginning from the broad categories of supervised and unsupervised learning problems, we will dive into the fundamental areas of classification, regression, clustering, and dimensionality reduction. In each section, we will introduce aspects of the Scikit-learn API and explore practical examples of some of the most popular and useful methods from the machine learning literature.

The strengths of scikit-learn lie in its uniform and well-documented interface, and its efficient implementations of a large number of the most important machine learning algorithms. Those present at this tutorial will gain a basic practical background in machine learning and the use of scikit-learn, and will be well poised to begin applying these tools in many areas, whether for work, for research, for Kaggle-style competitions, or for their own pet projects.

You can view the tutorial at: PyCon 2015 Scikit-Learn Tutorial Index.

Jake is presenting today (April 8, 2015), so this is very current news!


April 6, 2015

Fast Lane to Python…

Filed under: Programming,Python — Patrick Durusau @ 12:43 pm

Fast Lane to Python – A quick, sensible route to the joys of Python coding by Norm Matloff.

From the preface:

My approach here is different from that of most Python books, or even most Python Web tutorials. The usual approach is to painfully go over all details from the beginning, with little or no context. For example, the usual approach would be to first state all possible forms that a Python integer can take on, all possible forms a Python variable name can have, and for that matter how many different ways one can launch Python with.

I avoid this here. Again, the aim is to enable the reader to quickly acquire a Python foundation. He/she should then be able to delve directly into some special topic if and when the need arises. So, if you want to know, say, whether Python variable names can include underscores, you’ve come to the wrong place. If you want to quickly get into Python programming, this is hopefully the right place. (emphasis in the original)

You may know Norm Matloff as the author of Algorithms to Z-Scores:… or Programming on Parallel Machines, both open source textbooks.

What do you think about Norm’s approach to teaching Python? Note that we don’t teach children language by sitting them down with a grammar but through corrected usage, and lots of it. At some point they learn or can look up the edge cases. Parallels to Norm’s approach?

I first saw this in a tweet by Christophe Lalanne.

April 5, 2015

How to format Python code without really trying

Filed under: Programming,Python — Patrick Durusau @ 7:08 pm

How to format Python code without really trying by Bill Wendling.

From the post:

Years of writing and maintaining Python code have taught us the value of automated tools for code formatting, but the existing ones didn’t quite do what we wanted. In the best traditions of the open source community, it was time to write yet another Python formatter.

YAPF takes a different approach to formatting Python code: it reformats the entire program, not just individual lines or constructs that violate a style guide rule. The ultimate goal is to let engineers focus on the bigger picture and not worry about the formatting. The end result should look the same as if an engineer had worried about the formatting.

You can run YAPF on the entire program or just a part of the program. It’s also possible to flag certain parts of a program which YAPF shouldn’t alter, which is useful for generated files or sections with large literals.

One step towards readable code!
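The "flag certain parts" feature is easiest to see in a file. A minimal sketch, assuming the `# yapf: disable` and `# yapf: enable` comment markers described in the YAPF README; the matrix and the `trace` function are made up for illustration:

```python
# Hand-aligned literal that a formatter should leave alone.
# The two marker comments fence off the region YAPF will not touch.

# yapf: disable
IDENTITY = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
# yapf: enable


def trace(matrix):
    """Sum the diagonal; YAPF remains free to reformat this function."""
    return sum(matrix[i][i] for i in range(len(matrix)))
```

The markers are ordinary comments, so the file runs the same whether or not YAPF is installed.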


March 19, 2015

GPU-Accelerated Graph Analytics in Python with Numba

Filed under: GPU,Graph Analytics,Python — Patrick Durusau @ 6:28 pm

GPU-Accelerated Graph Analytics in Python with Numba by Siu Kwan Lam.


Numba is an open-source just-in-time (JIT) Python compiler that generates native machine code for X86 CPU and CUDA GPU from annotated Python Code. (Mark Harris introduced Numba in the post “NumbaPro: High-Performance Python with CUDA Acceleration”.) Numba specializes in Python code that makes heavy use of NumPy arrays and loops. In addition to JIT compiling NumPy array code for the CPU or GPU, Numba exposes “CUDA Python”: the CUDA programming model for NVIDIA GPUs in Python syntax.

By speeding up Python, we extend its ability from a glue language to a complete programming environment that can execute numeric code efficiently.

Python enthusiasts, I would not take the “…from a glue language to a complete programming environment…” comment to heart.

The author also says:

Numba helps by letting you write pure Python code and run it with speed comparable to a compiled language, like C++. Your development cycle shortens when your prototype Python code can scale to process the full dataset in a reasonable amount of time.

and then summarizes the results of code in the post:

Our GPU PageRank implementation completed in just 163 seconds on the full graph of 623 million edges and 43 million nodes using a single NVIDIA Tesla K20 GPU accelerator. Our equivalent Numba CPU-JIT version took at least 5 times longer on a smaller graph.

plus points out techniques for optimizing the code.

I’d say no hard feelings. Yes? 😉
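If you have not met PageRank as code, here is a rough sketch of the power iteration in plain Python. It is not the GPU implementation from the post and uses no Numba at all; the function name, the damping factor of 0.85, and the toy graph are my assumptions. It is the sort of loop-heavy numeric code that a JIT compiler is meant to speed up.

```python
def pagerank(edges, num_nodes, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    out_degree = [0] * num_nodes
    for src, _ in edges:
        out_degree[src] += 1

    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in range(num_nodes) if out_degree[n] == 0)
        base = (1.0 - damping) / num_nodes + damping * dangling / num_nodes
        new_rank = [base] * num_nodes
        # Each node passes an equal share of its rank along every out-edge.
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# Toy graph (hypothetical): 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
ranks = pagerank(toy_edges, num_nodes=3)
```

Node 2 collects links from both other nodes and ends up ranked highest; node 1, with a single half-share link from node 0, ranks lowest.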

March 16, 2015

Computing the optimal road trip across the U.S.

Filed under: Mapping,Maps,Python — Patrick Durusau @ 4:31 pm

Computing the optimal road trip across the U.S. by Randal S. Olson.

From the webpage:

This notebook provides the methodology and code used in the blog post, Computing the optimal road trip across the U.S..

This is a nice surprise for a Monday!

The original post goes into the technical details and is quite good.

DIY Web Server

Filed under: Python,Software,WWW — Patrick Durusau @ 8:01 am

Let’s Build A Web Server. Part 1. by Ruslan Spivak.

From the post:

Out for a walk one day, a woman came across a construction site and saw three men working. She asked the first man, “What are you doing?” Annoyed by the question, the first man barked, “Can’t you see that I’m laying bricks?” Not satisfied with the answer, she asked the second man what he was doing. The second man answered, “I’m building a brick wall.” Then, turning his attention to the first man, he said, “Hey, you just passed the end of the wall. You need to take off that last brick.” Again not satisfied with the answer, she asked the third man what he was doing. And the man said to her while looking up in the sky, “I am building the biggest cathedral this world has ever known.” While he was standing there and looking up in the sky the other two men started arguing about the errant brick. The man turned to the first two men and said, “Hey guys, don’t worry about that brick. It’s an inside wall, it will get plastered over and no one will ever see that brick. Just move on to another layer.”

The moral of the story is that when you know the whole system and understand how different pieces fit together (bricks, walls, cathedral), you can identify and fix problems faster (errant brick).

What does it have to do with creating your own Web server from scratch?

I believe to become a better developer you MUST get a better understanding of the underlying software systems you use on a daily basis and that includes programming languages, compilers and interpreters, databases and operating systems, web servers and web frameworks. And, to get a better and deeper understanding of those systems you MUST re-build them from scratch, brick by brick, wall by wall. (emphasis in original)

You probably don’t want to try this with an office suite package but for a basic web server this could be fun!

More installments to follow.
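To give a feel for how small the first brick can be, here is a minimal sketch of a socket-level server that answers one request with a fixed page. It is my own illustration, not the code from the post; the names `serve_once` and `fetch`, the loopback address, and the fixed reply are all assumptions.

```python
import socket
import threading


def serve_once(listen_socket):
    """Accept one connection, read the request, send a fixed reply."""
    connection, _ = listen_socket.accept()
    with connection:
        connection.recv(1024)  # the request is read but not parsed here
        body = b"Hello, World!"
        headers = [
            b"HTTP/1.1 200 OK",
            b"Content-Type: text/plain",
            b"Content-Length: " + str(len(body)).encode("ascii"),
            b"Connection: close",
        ]
        connection.sendall(b"\r\n".join(headers) + b"\r\n\r\n" + body)


def fetch(port):
    """Tiny client: send a GET and read until the server closes."""
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")
        chunks = []
        while True:
            data = client.recv(1024)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


listen_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listen_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listen_socket.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listen_socket.listen(1)
port = listen_socket.getsockname()[1]

server = threading.Thread(target=serve_once, args=(listen_socket,))
server.start()
response = fetch(port)
server.join()
listen_socket.close()
```

A real server would loop on `accept`, parse the request line, and handle errors; this one stops after a single exchange so the whole round trip fits in one script.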


March 14, 2015

Mapping Your Music Collection [Seeing What You Expect To See]

Filed under: Audio,Machine Learning,Music,Python,Visualization — Patrick Durusau @ 4:11 pm

Mapping Your Music Collection by Christian Peccei.

From the post:

In this article we’ll explore a neat way of visualizing your MP3 music collection. The end result will be a hexagonal map of all your songs, with similar sounding tracks located next to each other. The color of different regions corresponds to different genres of music (e.g. classical, hip hop, hard rock). As an example, here’s a map of three albums from my music collection: Paganini’s Violin Caprices, Eminem’s The Eminem Show, and Coldplay’s X&Y.


To make things more interesting (and in some cases simpler), I imposed some constraints. First, the solution should not rely on any pre-existing ID3 tags (e.g. Artist, Genre) in the MP3 files—only the statistical properties of the sound should be used to calculate the similarity of songs. A lot of my MP3 files are poorly tagged anyways, and I wanted to keep the solution applicable to any music collection no matter how bad its metadata. Second, no other external information should be used to create the visualization—the only required inputs are the user’s set of MP3 files. It is possible to improve the quality of the solution by leveraging a large database of songs which have already been tagged with a specific genre, but for simplicity I wanted to keep this solution completely standalone. And lastly, although digital music comes in many formats (MP3, WMA, M4A, OGG, etc.) to keep things simple I just focused on MP3 files. The algorithm developed here should work fine for any other format as long as it can be extracted into a WAV file.

Creating the music map is an interesting exercise. It involves audio processing, machine learning, and visualization techniques.

It would take longer than a weekend to complete this project with a sizable music collection but it would be a great deal of fun!

Great way to become familiar with several Python libraries.

BTW, when I saw Coldplay, I thought of Coal Chamber by mistake. Not exactly the same subject. 😉

I first saw this in a tweet by Kirk Borne.

March 9, 2015

Kalman and Bayesian Filters in Python

Filed under: Bayesian Models,Filters,Kalman Filter,Python — Patrick Durusau @ 6:39 pm

Kalman and Bayesian Filters in Python by Roger Labbe.

Apologies for the lengthy quote but Roger makes a great case for interactive textbooks, IPython notebooks, writing for the reader as opposed to making the author feel clever, and finally, making content freely available.

It is a quote that I am going to make a point to read on a regular basis.

And all of that before turning to the subject at hand!


From the preface:

This is a book for programmers that have a need or interest in Kalman filtering. The motivation for this book came out of my desire for a gentle introduction to Kalman filtering. I’m a software engineer that spent almost two decades in the avionics field, and so I have always been ‘bumping elbows’ with the Kalman filter, but never implemented one myself. They always had a fearsome reputation for difficulty, and I did not have the requisite education. Everyone I met that did implement them had multiple graduate courses on the topic and extensive industrial experience with them. As I moved into solving tracking problems with computer vision the need to implement them myself became urgent. There are classic textbooks in the field, such as Grewal and Andrews’ excellent Kalman Filtering. But sitting down and trying to read many of these books is a dismal and trying experience if you do not have the background. Typically the first few chapters fly through several years of undergraduate math, blithely referring you to textbooks on, for example, Itō calculus, and presenting an entire semester’s worth of statistics in a few brief paragraphs. These books are good textbooks for an upper undergraduate course, and an invaluable reference to researchers and professionals, but the going is truly difficult for the more casual reader. Symbology is introduced without explanation, different texts use different words and variable names for the same concept, and the books are almost devoid of examples or worked problems. I often found myself able to parse the words and comprehend the mathematics of a definition, but had no idea as to what real world phenomena these words and math were attempting to describe. “But what does that mean?” was my repeated thought.

However, as I began to finally understand the Kalman filter I realized the underlying concepts are quite straightforward. A few simple probability rules, some intuition about how we integrate disparate knowledge to explain events in our everyday life and the core concepts of the Kalman filter are accessible. Kalman filters have a reputation for difficulty, but shorn of much of the formal terminology the beauty of the subject and of their math became clear to me, and I fell in love with the topic.

As I began to understand the math and theory more difficulties presented themselves. A book or paper’s author makes some statement of fact and presents a graph as proof. Unfortunately, why the statement is true is not clear to me, nor is the method by which you might make that plot obvious. Or maybe I wonder “is this true if R=0?” Or the author provides pseudocode – at such a high level that the implementation is not obvious. Some books offer Matlab code, but I do not have a license to that expensive package. Finally, many books end each chapter with many useful exercises. Exercises which you need to understand if you want to implement Kalman filters for yourself, but exercises with no answers. If you are using the book in a classroom, perhaps this is okay, but it is terrible for the independent reader. I loathe that an author withholds information from me, presumably to avoid ‘cheating’ by the student in the classroom.

None of this is necessary, from my point of view. Certainly if you are designing a Kalman filter for an aircraft or missile you must thoroughly master all of the mathematics and topics in a typical Kalman filter textbook. I just want to track an image on a screen, or write some code for my Arduino project. I want to know how the plots in the book are made, and choose different parameters than the author chose. I want to run simulations. I want to inject more noise in the signal and see how a filter performs. There are thousands of opportunities for using Kalman filters in everyday code, and yet this fairly straightforward topic is the province of rocket scientists and academics.

I wrote this book to address all of those needs. This is not the book for you if you program avionics for Boeing or design radars for Raytheon. Go get a degree at Georgia Tech, UW, or the like, because you’ll need it. This book is for the hobbyist, the curious, and the working engineer that needs to filter or smooth data.

This book is interactive. While you can read it online as static content, I urge you to use it as intended. It is written using IPython Notebook, which allows me to combine text, python, and python output in one place. Every plot, every piece of data in this book is generated from Python that is available to you right inside the notebook. Want to double the value of a parameter? Click on the Python cell, change the parameter’s value, and click ‘Run’. A new plot or printed output will appear in the book.

This book has exercises, but it also has the answers. I trust you. If you just need an answer, go ahead and read the answer. If you want to internalize this knowledge, try to implement the exercise before you read the answer.

This book has supporting libraries for computing statistics, plotting various things related to filters, and for the various filters that we cover. This does require a strong caveat; most of the code is written for didactic purposes. It is rare that I chose the most efficient solution (which often obscures the intent of the code), and in the first parts of the book I did not concern myself with numerical stability. This is important to understand – Kalman filters in aircraft are carefully designed and implemented to be numerically stable; the naive implementation is not stable in many cases. If you are serious about Kalman filters this book will not be the last book you need. My intention is to introduce you to the concepts and mathematics, and to get you to the point where the textbooks are approachable.

Finally, this book is free. The cost for the books required to learn Kalman filtering is somewhat prohibitive even for a Silicon Valley engineer like myself; I cannot believe they are within the reach of someone in a depressed economy, or a financially struggling student. I have gained so much from free software like Python, and free books like those from Allen B. Downey here [1]. It’s time to repay that. So, the book is free, it is hosted on free servers, and it uses only free and open software such as IPython and mathjax to create the book.
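To back up Roger's claim that the core concepts are straightforward, here is a rough one-dimensional sketch of the predict/update cycle. It is my own illustration, not code from the book or its supporting libraries; the function name, the constant-state model, the noise variances, and the readings are all assumptions. It is also the naive form, with none of the numerical-stability care the preface warns about.

```python
def kalman_1d(measurements, measurement_var, process_var,
              initial_estimate=0.0, initial_var=1000.0):
    """Filter noisy scalar readings of a (nearly) constant value."""
    estimate, variance = initial_estimate, initial_var
    estimates = []
    for z in measurements:
        # Predict: the state is assumed unchanged, but uncertainty grows.
        variance += process_var
        # Update: blend prediction and measurement, weighted by the gain.
        gain = variance / (variance + measurement_var)
        estimate += gain * (z - estimate)
        variance *= (1.0 - gain)
        estimates.append(estimate)
    return estimates, variance


# Hypothetical noisy readings of a value near 10.
readings = [10.2, 9.7, 10.1, 9.9, 10.3, 10.0]
estimates, final_var = kalman_1d(readings, measurement_var=0.25,
                                 process_var=0.0001)
```

The gain is the whole trick: when the filter is unsure of its own estimate the gain is near 1 and it trusts the measurement, and as the variance shrinks it leans more on what it already believes.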

I first saw this in a tweet by nixCraft.

March 8, 2015

clf – Command line tool to search snippets on

Filed under: Linux OS,Python — Patrick Durusau @ 6:30 pm

clf – Command line tool to search snippets on by Nicolas Crocfer.

From the webpage: is the place to record awesome command-line snippets. This tool allows you to search and view the results into your terminal.

What a very clever idea!

Imagine if all the sed/awk scripts were collected from various archive sites, deduped and made searchable via such an interface!


March 7, 2015

Hands-on with machine learning

Filed under: Journalism,Machine Learning,Python,Scikit-Learn — Patrick Durusau @ 5:20 pm

Hands-on with machine learning by Chase Davis.

From the webpage:

First of all, let me be clear about one thing: You’re not going to “learn” machine learning in 60 minutes.

Instead, the goal of this session is to give you some sense of how to approach one type of machine learning in practice, specifically

For this exercise, we’ll be training a simple classifier that learns how to categorize bills from the California Legislature based only on their titles. Along the way, we’ll focus on three steps critical to any supervised learning application: feature engineering, model building and evaluation.

To help us out, we’ll be using a Python library called, which is the easiest to understand machine learning library I’ve seen in any language.

That’s a lot to pack in, so this session is going to move fast, and I’m going to assume you have a strong working knowledge of Python. Don’t get caught up in the syntax. It’s more important to understand the process.

Since we only have time to hit the very basics, I’ve also included some additional points you might find useful under the “What we’re not covering” heading of each section below. There are also some resources at the bottom of this document that I hope will be helpful if you decide to learn more about this on your own.

A great starting place for journalists or anyone else who wants to understand basic machine learning.
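The three steps Chase names (feature engineering, model building, evaluation) fit in a few lines even without a library. A rough sketch in plain Python, assuming a bag-of-words naive Bayes classifier with add-one smoothing; it is not the code from the session, and the bill titles and category labels are invented for illustration:

```python
import math
from collections import Counter, defaultdict


def tokenize(title):
    """Feature engineering: lower-cased bag of words."""
    return title.lower().split()


def train(labeled_titles):
    """Model building: count words per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for title, label in labeled_titles:
        label_counts[label] += 1
        word_counts[label].update(tokenize(title))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary


def predict(model, title):
    """Pick the label with the highest smoothed log-probability."""
    word_counts, label_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in tokenize(title):
            score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training and held-out titles.
training = [
    ("school funding for public education", "education"),
    ("teacher credentialing and school districts", "education"),
    ("community college student fees", "education"),
    ("highway construction and road repair", "transportation"),
    ("vehicle registration fees", "transportation"),
    ("public transit and rail funding", "transportation"),
]
held_out = [
    ("school district teacher pay", "education"),
    ("road and highway safety", "transportation"),
]

model = train(training)
# Evaluation: fraction of held-out titles labeled correctly.
correct = sum(predict(model, title) == label for title, label in held_out)
accuracy = correct / len(held_out)
```

Six training titles prove nothing about real accuracy, of course. The point is the shape of the process: turn text into features, fit on one set, score on another.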

I first saw this in a tweet by Hanna Wallach.
