Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 23, 2016

A Whirlwind Tour of Python (Excellent!)

Filed under: Programming,Python — Patrick Durusau @ 12:35 pm

A Whirlwind Tour of Python by Jake VanderPlas.

From the webpage:

To tap into the power of Python’s open data science stack—including NumPy, Pandas, Matplotlib, Scikit-learn, and other tools—you first need to understand the syntax, semantics, and patterns of the Python language. This report provides a brief yet comprehensive introduction to Python for engineers, researchers, and data scientists who are already familiar with another programming language.

Author Jake VanderPlas, an interdisciplinary research director at the University of Washington, explains Python’s essential syntax and semantics, built-in data types and structures, function definitions, control flow statements, and more, using Python 3 syntax.

You’ll explore:

  • Python syntax basics and running Python code
  • Basic semantics of Python variables, objects, and operators
  • Built-in simple types and data structures
  • Control flow statements for executing code blocks conditionally
  • Methods for creating and using reusable functions
  • Iterators, list comprehensions, and generators
  • String manipulation and regular expressions
  • Python’s standard library and third-party modules
  • Python’s core data science tools
  • Recommended resources to help you learn more

Jake VanderPlas is a long-time user and developer of the Python scientific stack. He currently works as an interdisciplinary research director at the University of Washington, conducts his own astronomy research, and spends time advising and consulting with local scientists from a wide range of fields.

A Whirlwind Tour of Python can be recommended without reservation.

In addition to the book, the Jupyter notebooks behind the book have been posted.

Enjoy!

August 19, 2016

29 common beginner Python errors on one page [Something Similar For XQuery?]

Filed under: Programming,Python — Patrick Durusau @ 3:55 pm

29 common beginner Python errors on one page

From the webpage:

A few times a year, I have the job of teaching a bunch of people who have never written code before how to program from scratch. The nature of programming being what it is, the same errors crop up every time in a very predictable pattern. I usually encourage my students to go through a step-by-step troubleshooting process when trying to fix misbehaving code, in which we go through these common errors one by one and see if they could be causing the problem. Today, I decided to finally write this troubleshooting process down and turn it into a flowchart in non-threatening colours.

Behold, the “my code isn’t working” step-by-step troubleshooting guide! Follow the arrows to find the likely cause of your problem – if the first thing you reach doesn’t work, then back up and try again.

Click the image for full-size, and click here for a printable PDF. Colour scheme from Luna Rosa.

Useful for Python beginners and it should inspire similar guides for other languages.

Thoughts on something similar for XQuery Errors? Suggestions for collecting the “most common” XQuery errors?

Readable Regexes In Python?

Filed under: Python,Regex — Patrick Durusau @ 10:45 am

Doug Mahugh retweeted Raymond Hettinger tweeting:

#python tip: Complicated regexes can be organized into readable, commented chunks.
https://docs.python.org/3/library/re.html#re.X

Twitter hasn’t gotten around to censoring Python-related tweets for accuracy, so I did check the reference:

re.X
re.VERBOSE

This flag allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments. Whitespace within the pattern is ignored, except when in a character class or when preceded by an unescaped backslash. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.

This means that the two following regular expression objects that match a decimal number are functionally equal:
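
The example the docs give at that point (reproduced here from the re module documentation) compiles the same decimal-number pattern both ways:

import re

# Verbose form: whitespace and comments are ignored under re.X / re.VERBOSE.
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)

# Compact form: the same pattern, minus the documentation.
b = re.compile(r"\d+\.\d*")

assert a.match("3.14") and b.match("3.14")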

Which is the better question?

Why would anyone want to produce a readable regex in Python?

or,

Why would anyone NOT produce a readable regex given the opportunity?

Enjoy!

PS: It occurs to me that with a search expression you could address such strings as subjects in a topic map. A more robust form of documentation than # syntax.

August 17, 2016

Grokking Deep Learning

Filed under: Deep Learning,Military,Numpy,Python — Patrick Durusau @ 8:58 pm

Grokking Deep Learning by Andrew W. Trask.

From the description:

Artificial Intelligence is the most exciting technology of the century, and Deep Learning is, quite literally, the “brain” behind the world’s smartest Artificial Intelligence systems out there. Loosely based on neuron behavior inside of human brains, these systems are rapidly catching up with the intelligence of their human creators, defeating the world champion Go player, achieving superhuman performance on video games, driving cars, translating languages, and sometimes even helping law enforcement fight crime. Deep Learning is a revolution that is changing every industry across the globe.

Grokking Deep Learning is the perfect place to begin your deep learning journey. Rather than just learn the “black box” API of some library or framework, you will actually understand how to build these algorithms completely from scratch. You will understand how Deep Learning is able to learn at levels greater than humans. You will be able to understand the “brain” behind state-of-the-art Artificial Intelligence. Furthermore, unlike other courses that assume advanced knowledge of Calculus and leverage complex mathematical notation, if you’re a Python hacker who passed high-school algebra, you’re ready to go. And at the end, you’ll even build an A.I. that will learn to defeat you in a classic Atari game.

In the Manning Early Access Program (MEAP) with three (3) chapters presently available.

A much more plausible undertaking than DARPA’s quest for “Explainable AI” or “XAI.” (DARPA WANTS ARTIFICIAL INTELLIGENCE TO EXPLAIN ITSELF) DARPA reasons that:


Potential applications for defense are endless—autonomous aerial and undersea war-fighting or surveillance, among others—but humans won’t make full use of AI until they trust it won’t fail, according to the Defense Advanced Research Projects Agency. A new DARPA effort aims to nurture communication between machines and humans by investing in AI that can explain itself as it works.

If non-failure is the criterion for trust, U.S. troops should refuse to leave their barracks in view of the repeated failures of military strategy since the end of WWII.

DARPA should choose a less stringent criterion for trusting an AI. However, failing less often than the Joint Chiefs of Staff may be too low a bar to set.

Pandas

Filed under: Data Science,Pandas,Python — Patrick Durusau @ 8:19 pm

Pandas by Reuven M. Lerner.

From the post:

Serious practitioners of data science use the full scientific method, starting with a question and a hypothesis, followed by an exploration of the data to determine whether the hypothesis holds up. But in many cases, such as when you aren’t quite sure what your data contains, it helps to perform some exploratory data analysis—just looking around, trying to see if you can find something.

And, that’s what I’m going to cover here, using tools provided by the amazing Python ecosystem for data science, sometimes known as the SciPy stack. It’s hard to overstate the number of people I’ve met in the past year or two who are learning Python specifically for data science needs. Back when I was analyzing data for my PhD dissertation, just two years ago, I was told that Python wasn’t yet mature enough to do the sorts of things I needed, and that I should use the R language instead. I do have to wonder whether the tables have turned by now; the number of contributors and contributions to the SciPy stack is phenomenal, making it a more compelling platform for data analysis.

In my article “Analyzing Data“, I described how to filter through logfiles, turning them into CSV files containing the information that was of interest. Here, I explain how to import that data into Pandas, which provides an additional layer of flexibility and will let you explore the data in all sorts of ways—including graphically. Although I won’t necessarily reach any amazing conclusions, you’ll at least see how you can import data into Pandas, slice and dice it in various ways, and then produce some basic plots.

Of course, scientific articles are written as though questions drop out of the sky and data is interrogated for the answer.

Aside from being rhetoric to badger others with, does anyone really think that is how science operates in fact?

Whether you have delusions about how science works in fact or not, you will find that Pandas will assist you in exploring data.
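
If you want a taste before reading Lerner's article, a minimal exploratory session looks something like this (the file and column names are placeholders, not his logfile data):

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("access_log.csv")

print(df.head())                    # first rows, to see what we have
print(df.describe())                # summary statistics for numeric columns
print(df["status"].value_counts()) # how often each HTTP status appears

# Slice and dice: requests per day, plotted as a bar chart.
df["timestamp"] = pd.to_datetime(df["timestamp"])
df.groupby(df["timestamp"].dt.date).size().plot(kind="bar")
plt.show()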

August 10, 2016

Dark Web OSINT with Python Part Two: … [Prizes For Unmasking Government Sites?]

Filed under: Dark Web,Open Source Intelligence,Python,Tor — Patrick Durusau @ 4:31 pm

Dark Web OSINT with Python Part Two: SSH Keys and Shodan by Justin.

From the post:

Welcome back good Python soldiers. In Part One of this series we created a wrapper around OnionScan, a fantastic tool created by Sarah Jamie Lewis (@sarajamielewis). If you haven’t read Part One then go do so now. Now that you have a bunch of data (or you downloaded it from here) we want to do some analysis and further intelligence gathering with it. Here are a few objectives we are going to cover in the rest of the series.

  1. Attempt to discover clearnet servers that share SSH fingerprints with hidden services, using Shodan. As part of this we will also analyze whether the same SSH key is shared amongst hidden services.
  2. Map out connections between hidden services, clearnet sites and any IP address leaks.
  3. Discover clusters of sites that are similar based on their index pages, this can help find knockoffs or clones of “legitimate” sites. We’ll use a machine learning library called scikit-learn to achieve this.

The scripts that were created for this series are quick little one-offs, so there is some shared code between each script. Feel free to tighten this up into a function or a module you can import. The goal is to give you little chunks of code that will teach you some basics on how to begin analyzing some of the data and more importantly to give you some ideas on how you can use it for your own purposes.

In this post we are going to look at how to connect hidden services by their SSH public key fingerprints, as well as how to expand our intelligence gathering using Shodan. Let’s get started!
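
The flavor of the Shodan step, as a minimal sketch (not Justin's exact code; the API key and fingerprint values are placeholders):

import shodan

SHODAN_API_KEY = "YOUR_API_KEY"
# A fingerprint recovered from a hidden service's SSH banner.
fingerprint = "aa:bb:cc:dd:ee:ff:00:11:22:33:44:55:66:77:88:99"

api = shodan.Shodan(SHODAN_API_KEY)
results = api.search(fingerprint)  # any clearnet host exposing the same key?

for match in results["matches"]:
    print(match["ip_str"], match.get("hostnames"))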

Expand your Dark Web OSINT intel skills!

Being mindful that if you can discover your Dark Web site, so can others.

Anyone awarding Black Hat conference registrations for unmasking government sites on the Dark Web?

July 30, 2016

Pandas Exercises

Filed under: Pandas,Programming,Python — Patrick Durusau @ 4:53 pm

Pandas Exercises

From the post:

Fed up with a ton of tutorials but no easy way to find exercises I decided to create a repo just with exercises to practice pandas. Don’t get me wrong, tutorials are great resources, but to learn is to do. So unless you practice you won’t learn.

There will be three different types of files:

  1. Exercise instructions
  2. Solutions without code
  3. Solutions with code and comments

My suggestion is that you learn a topic in a tutorial or video and then do exercises. Learn one more topic and do exercises. If you got the answer wrong, don’t go to the solution with code, follow this advice instead.

Suggestions and collaborations are more than welcome. 🙂

I’m sure you will find this useful but when I search for pandas exercise python, I get 298,000 “hits.”

Adding exercises here isn’t going to improve the findability of pandas for particular subject areas or domains.

Perhaps as exercises are added here, links to exercises by subject area can be added as well.

With nearly 300K potential sources, there is no shortage of exercises to go around!

Dark Web OSINT With Python and OnionScan: Part One

Filed under: Dark Web,Open Source Intelligence,Python — Patrick Durusau @ 10:47 am

Dark Web OSINT With Python and OnionScan: Part One by Justin.

When you tire of what passes for political discussion on Twitter and/or Facebook this weekend, why not try your hand at something useful?

Like looking for data leaks on the Dark Web?

You could, in theory at least, notify the sites of their data leaks. 😉

One aspect of announced leaks that never ceases to amaze me is reports that read:

Well, we pwned the (some string of letters) database and then notified them of the issue.

Before getting a copy of the entire database? What’s the point?

All you have accomplished is making another breach more difficult and demonstrating your ability to breach a system where the root password was most likely “god.”

Anyway, Justin gets you started on seeking data leaks on the Dark Web saying:

You may have heard of this awesome tool called OnionScan that is used to scan hidden services in the dark web looking for potential data leaks. Recently the project released some cool visualizations and a high level description of what their scanning results looked like. What they didn’t provide is how to actually go about scanning as much of the dark web as possible, and then how to produce those very cool visualizations that they show.

At a high level we need to do the following:

  1. Setup a server somewhere to host our scanner 24/7 because it takes some time to do the scanning work.
  2. Get TOR running on the server.
  3. Get OnionScan setup.
  4. Write some Python to handle the scanning and some of the other data management to deal with the scan results.
  5. Write some more Python to make some cool graphs. (Part Two of the series)

Let’s get started!
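
Step 4 can be as simple as a subprocess loop around OnionScan. A minimal sketch, assuming onionscan is on your PATH, Tor is running, and onions.txt lists one address per line (the --jsonReport flag is an assumption; check onionscan's help output):

import json
import subprocess

with open("onions.txt") as f:
    onions = [line.strip() for line in f if line.strip()]

for onion in onions:
    try:
        result = subprocess.run(
            ["onionscan", "--jsonReport", onion],
            capture_output=True, text=True, timeout=300,
        )
    except subprocess.TimeoutExpired:
        continue  # hidden services time out a lot; move on
    if result.returncode == 0 and result.stdout:
        report = json.loads(result.stdout)
        with open(onion + ".json", "w") as out:
            json.dump(report, out)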

Very much looking forward to Part 2!

Enjoy!

July 28, 2016

greek-accentuation 1.0.0 Released

Filed under: Greek,Language,Parsing,Python — Patrick Durusau @ 4:32 pm

greek-accentuation 1.0.0 Released by James Tauber.

From the post:

greek-accentuation has finally hit 1.0.0 with a couple more functions and a module layout change.

The library (which I’ve previously written about here) has been sitting on 0.9.9 for a while and I’ve been using it successfully in my inflectional morphology work for 18 months. There were, however, a couple of functions that lived in the inflectional morphology repos that really belonged in greek-accentuation. They have now been moved there.

If that sounds a tad obscure, some additional explanation from an earlier post by James:

It [greek-accentuation] consists of three modules:

  • characters
  • syllabify
  • accentuation

The characters module provides basic analysis and manipulation of Greek characters in terms of their Unicode diacritics as if decomposed. So you can use it to add, remove or test for breathing, accents, iota subscript or length diacritics.

The syllabify module provides basic analysis and manipulation of Greek syllables. It can syllabify words, give you the onset, nucleus, coda, rime or body of a syllable, judge syllable length or give you the accentuation class of a word.

The accentuation module uses the other two modules to accentuate Ancient Greek words. As well as listing possible_accentuations for a given unaccented word, it can produce recessive and (given another form with an accent) persistent accentuations.
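
A minimal sketch of the syllabify module in action, assuming the layout James describes (the example word is mine, and exact function names may differ between releases):

from greek_accentuation.syllabify import syllabify, onset, nucleus, coda

s = syllabify("γυναικός")
print(s)                                         # the word split into syllables
print(onset(s[-1]), nucleus(s[-1]), coda(s[-1])) # pieces of the final syllable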

Another name from my past and a welcome reminder that not all of computer science is focused on recommending ephemera for our consumption.

June 17, 2016

Volumetric Data Analysis – yt

Filed under: Astroinformatics,Data Analysis,Python,Science,Visualization — Patrick Durusau @ 7:23 pm

One of those rotating homepages:

Volumetric Data Analysis – yt

yt is a python package for analyzing and visualizing volumetric, multi-resolution data from astrophysical simulations, radio telescopes, and a burgeoning interdisciplinary community.

Quantitative Analysis and Visualization

yt is more than a visualization package: it is a tool to seamlessly handle simulation output files to make analysis simple. yt can easily knit together volumetric data to investigate phase-space distributions, averages, line integrals, streamline queries, region selection, halo finding, contour identification, surface extraction and more.

Many formats, one language

yt aims to provide a simple uniform way of handling volumetric data, regardless of where it is generated. yt currently supports FLASH, Enzo, Boxlib, Athena, arbitrary volumes, Gadget, Tipsy, ART, RAMSES and MOAB. If your data isn’t already supported, why not add it?

From the non-rotating part of the homepage:

To get started using yt to explore data, we provide resources including documentation, workshop material, and even a fully-executable quick start guide demonstrating many of yt’s capabilities.

But if you just want to dive in and start using yt, we have a long list of recipes demonstrating how to do various tasks in yt. We even have sample datasets from all of our supported codes on which you can test these recipes. While yt should just work with your data, here are some instructions on loading in datasets from our supported codes and formats.

Professional astronomical data and tools like yt put exploration of the universe at your fingertips!
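
Getting a first picture out of yt is remarkably compact. A minimal sketch, assuming you have downloaded the project's IsolatedGalaxy sample dataset:

import yt

ds = yt.load("IsolatedGalaxy/galaxy0030/galaxy0030")
slc = yt.SlicePlot(ds, "z", "density")  # slice through the gas density field
slc.save("density_slice.png")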

Enjoy!

May 6, 2016

Computer Programming for Lawyers:… [Educating a Future Generation of Judges]

Filed under: Law,Programming,Python — Patrick Durusau @ 8:46 pm

Computer Programming for Lawyers: An Introduction by Paul Ohm and Jonathan Frankle.

From the syllabus:

This class provides an introduction to computer programming for law students. The programming language taught may vary from year-to-year, but it will likely be a language designed to be both easy to learn and powerful, such as Python or JavaScript. There are no prerequisites, and even students without training in computer science or engineering should be able successfully to complete the class.

The course is based on the premise that computer programming has become a vital skill for non-technical professionals generally and for future lawyers and policymakers specifically. Lawyers, irrespective of specialty or type of practice, organize, evaluate, and manipulate large sets of text-based data (e.g. cases, statutes, regulations, contracts, etc.) Increasingly, lawyers are asked to deal with quantitative data and complex databases. Very simple programming techniques can expedite and simplify these tasks, yet these programming techniques tend to be poorly understood in legal practice and nearly absent in legal education. In this class, students will gain proficiency in various programming-related skills.

A secondary goal for the class is to introduce students to computer programming and computer scientific concepts they might encounter in the substantive practice of law. Students might discuss, for example, how programming concepts illuminate and influence current debates in privacy, intellectual property, consumer protection, antidiscrimination, antitrust, and criminal procedure.

The language for this year is Python. The course website, http://paulohm.com/classes/cpl16/, does not have any problem sets posted yet. Be sure to check back for those.

Recommend this to any and all lawyers you encounter. It isn’t possible to predict who will or will not be a judge someday. Judges with a basic understanding of computing could improve the overall quality of decisions on computer technology.

Like discounting DOJ-spun D&D tales about juvenile behavior.

April 16, 2016

Hello World – Machine Learning Recipes #1

Filed under: Machine Learning,Python,Scikit-Learn,TensorFlow — Patrick Durusau @ 7:43 pm

Hello World – Machine Learning Recipes #1 by Josh Gordon.

From the description:

Six lines of Python is all it takes to write your first machine learning program! In this episode, we’ll briefly introduce what machine learning is and why it’s important. Then, we’ll follow a recipe for supervised learning (a technique to create a classifier from examples) and code it up.
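
For flavor, the six lines look roughly like this (reconstructed from the video, so treat it as a sketch rather than a transcript; features are [weight, texture] for pieces of fruit, 1 = smooth, 0 = bumpy, and labels are 0 = apple, 1 = orange):

from sklearn import tree

features = [[140, 1], [130, 1], [150, 0], [170, 0]]
labels = [0, 0, 1, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)
print(clf.predict([[160, 0]]))  # a heavy, bumpy fruit: predicts 1 (orange)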

The first in a promised series on machine learning using scikit-learn and TensorFlow.

It is the quality of video you wish were available for intermediate and advanced treatments.

Quite a treat! Pass onto anyone interested in machine learning.

Enjoy!

April 7, 2016

PySparNN [nearest neighbors in sparse, high dimensional spaces (like text documents).]

Filed under: Nearest Neighbor,Python,Sparse Data — Patrick Durusau @ 8:28 pm

PySparNN

From the post:

Approximate Nearest Neighbor Search for Sparse Data in Python! This library is well suited to finding nearest neighbors in sparse, high dimensional spaces (like text documents).

Out of the box, PySparNN supports Cosine Distance (i.e. 1 – cosine_similarity).

PySparNN benefits:

  • Designed to be efficient on sparse data (memory & CPU).
  • Implemented leveraging existing python libraries (scipy & numpy).
  • Easily extended with other metrics: Manhattan, Euclidean, Jaccard, etc.
  • Work in progress – Min, Max distance thresholds can be set at query time (not index time). Example: return the k closest items on the interval [0.8, 0.9] from a query point.

If your data is NOT SPARSE – please consider annoy. Annoy uses a similar-ish method and I am a big fan of it. As of this writing, annoy performs ~8x faster on their introductory example.
General rule of thumb – annoy performs better if you can get your data to fit into memory (as a dense vector).

The most comparable library to PySparNN is scikit-learn’s LSHForest module. As of this writing, PySparNN is ~1.5x faster on the 20newsgroups dataset. A more robust benchmarking on sparse data is desired. Here is the comparison.
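
For orientation, usage follows the README's pattern, roughly like this minimal sketch (the toy sentences are mine, and the index class name is an assumption that may differ between releases):

import pysparnn.cluster_index as ci
from sklearn.feature_extraction.text import TfidfVectorizer

data = [
    "hello world",
    "oh hello there",
    "Play it",
    "Play it again Sam",
]

tv = TfidfVectorizer()
features = tv.fit_transform(data)  # sparse document vectors

cp = ci.MultiClusterIndex(features, data)  # build the search index
print(cp.search(tv.transform(["oh there"]), k=1, return_distance=False))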

I included the text snippet in the title because the name PySparNN isn’t self-explanatory, at least not at first glance.

I looked for a good explanation of nearest neighbors and encountered this lecture by Patrick Winston (MIT OpenCourseWare):

The lecture has a number of gems, including the observation that:

Town and Country readers tend to be social parasites.

Observations on text and nearest neighbors run from time marks 17:30 – 24:17.

You should make an effort to watch the entire video. You will gain a broader appreciation for the sheer power of nearest neighbor analysis and, as a bonus, some valuable insights on why going without sleep is a very bad idea.

I first saw this in a tweet by Lynn Cherny.

April 6, 2016

Advanced Data Mining with Weka – Starts 25 April 2016

Filed under: Machine Learning,Python,R,Spark,Weka — Patrick Durusau @ 4:43 pm

Advanced Data Mining with Weka by Ian Witten.

From the webpage:

This course follows on from Data Mining with Weka and More Data Mining with Weka. It provides a deeper account of specialized data mining tools and techniques. Again the emphasis is on principles and practical data mining using Weka, rather than mathematical theory or advanced details of particular algorithms. Students will analyse time series data, mine data streams, use Weka to access other data mining packages including the popular R statistical computing language, script Weka in Python, and deploy it within a cluster computing framework. The course also includes case studies of applications such as classifying tweets, functional MRI data, image classification, and signal peptide prediction.

The syllabus: https://weka.waikato.ac.nz/advanceddataminingwithweka/assets/pdf/syllabus.pdf.

Advanced Data Mining with Weka is open for enrollment and starts 25 April 2016.

Five very intense weeks await!

Will you be there?

I first saw this in a tweet by Alyona Medelyan.

April 5, 2016

Python Code + Data + Visualization (Little to No Prose)

Filed under: Graphics,Programming,Python,Visualization — Patrick Durusau @ 12:46 pm

Up and Down the Python Data and Web Visualization Stack

Using the “USGS dataset listing every wind turbine in the United States,” this notebook walks you through data analysis and visualization with only code and visualizations.

That’s it.

Aside from very few comments, there is no prose in this notebook at all.

You will either hate it or be rushing off to do a similar notebook on a topic of interest to you.
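
In that no-prose spirit, a starter cell might look like this (the file and column names are hypothetical placeholders for the USGS data):

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("wind_turbines.csv")
df.plot.scatter(x="longitude", y="latitude", s=1)
plt.title("US wind turbines")
plt.show()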

Looking forward to seeing the results of those choices!

February 13, 2016

You Can Confirm A Gravity Wave!

Filed under: Physics,Python,Science,Signal Processing,Subject Identity,Subject Recognition — Patrick Durusau @ 5:35 pm

Unless you have been unconscious since last Wednesday, you have heard about the confirmation of Einstein’s 1916 prediction of gravitational waves.

A very incomplete list of popular reports includes:

Einstein, A Hunch And Decades Of Work: How Scientists Found Gravitational Waves (NPR)

Einstein’s gravitational waves ‘seen’ from black holes (BBC)

Gravitational Waves Detected, Confirming Einstein’s Theory (NYT)

Gravitational waves: breakthrough discovery after a century of expectation (Guardian)

For the full monty, see the LIGO Scientific Collaboration itself.

Which brings us to the IPython notebook with the gravitational wave discovery data: Signal Processing with GW150914 Open Data

From the post:

Welcome! This ipython notebook (or associated python script GW150914_tutorial.py ) will go through some typical signal processing tasks on strain time-series data associated with the LIGO GW150914 data release from the LIGO Open Science Center (LOSC):

To begin, download the ipython notebook, readligo.py, and the data files listed below, into a directory / folder, then run it. Or you can run the python script GW150914_tutorial.py. You will need the python packages: numpy, scipy, matplotlib, h5py.

On Windows, or if you prefer, you can use a python development environment such as Anaconda (https://www.continuum.io/why-anaconda) or Enthought Canopy (https://www.enthought.com/products/canopy/).

Questions, comments, suggestions, corrections, etc: email losc@ligo.org

v20160208b

Unlike the toadies at the New England Journal of Medicine (see Parasitic Re-use of Data? Institutionalizing Toadyism and Addressing The Concerns Of The Selfish), the scientists who have labored for decades on the gravitational wave question are giving their data away for free!

Not only giving the data away, but striving to help others learn to use it!

Beyond simply “doing the right thing,” and setting an example for other scientists, this is a great opportunity to learn more about signal processing.

Signal processing being an important method of “subject identification” when you stop to think about it in a large number of domains.

Detecting a gravity wave is beyond your personal means, but with the data freely available, further analysis is a matter of interest and perseverance.
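
To give you a sense of the first step, here is a minimal sketch of loading the strain data and plotting its amplitude spectral density (the file name and HDF5 paths follow the LOSC layout described in the tutorial, but treat them as assumptions):

import h5py
import matplotlib.pyplot as plt
import numpy as np
from scipy import signal

with h5py.File("H-H1_LOSC_4_V1-1126259446-32.hdf5", "r") as f:
    strain = f["strain/Strain"][...]
    dt = f["strain/Strain"].attrs["Xspacing"]

fs = int(1.0 / dt)  # sample rate; 4096 Hz for these files

# Amplitude spectral density: where is the detector quiet enough to see a chirp?
freqs, psd = signal.welch(strain, fs=fs, nperseg=4 * fs)
plt.loglog(freqs, np.sqrt(psd))
plt.xlabel("frequency (Hz)")
plt.ylabel("strain ASD (1 / sqrt(Hz))")
plt.show()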

December 14, 2015

Data Science Lessons [Why You Need To Practice Programming]

Filed under: Data Science,Programming,Python — Patrick Durusau @ 7:30 pm

Data Science Lessons by Shantnu Tiwari.

Shantnu has authored several programming books using Python and has a series of videos (with more forthcoming) on doing data science with Python.

Shantnu had me when he used data from the Hubble Space Telescope in his Introduction to Pandas with Practical examples.

The videos build one upon another and new users will appreciate that not every move is the correct one. 😉

If I had to pick one video to share, of those presently available, it would be:

Why You Need To Practice Programming.

It’s not new advice but it certainly is advice that needs repeating.

This anecdote is told about Pablo Casals (world famous cellist):

When Casals (then age 93) was asked why he continued to practice the cello three hours a day, he replied, “I’m beginning to notice some improvement.”

What are you practicing three hours a day?

December 12, 2015

Previously Unknown Hip Replacement Side Effect: Web Crawler Writing In Python

Filed under: Python,Search Engines,Searching,Webcrawler — Patrick Durusau @ 8:02 pm

Crawling the web with Python 3.x by Doug Mahugh.

From the post:

These days, most everyone is familiar with the concept of crawling the web: a piece of software that systematically reads web pages and the pages they link to, traversing the world-wide web. It’s what Google does, and countless tech firms crawl web pages to accomplish tasks ranging from searches to archiving content to statistical analyses and so on. Web crawling is a task that has been automated by developers in every programming language around, many times — for example, a search for web crawling source code yields well over a million hits.

So when I recently came across a need to crawl some web pages for a project I’ve been working on, I figured I could just go find some source code online and hack it into what I need. (Quick aside: the project is a Python library for managing EXIF metadata on digital photos. More on that in a future blog post.)

But I spent a couple of hours searching and playing with the samples I found, and didn’t get anywhere. Mostly because I’m working in Python version 3, and the most popular Python web crawling code is Scrapy, which is only available for Python 2. I found a few Python 3 samples, but they all seemed to be either too trivial (not avoiding re-scanning the same page, for example) or too needlessly complex. So I decided to write my own Python 3.x web crawler, as a fun little learning exercise and also because I need one.

In this blog post I’ll go over how I approached it and explain some of the code, which I posted on GitHub so that others can use it as well.
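
The core idea fits in a page of the standard library. A minimal sketch (my code, not Doug's), with the visited set that keeps it from re-scanning the same page:

from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    # Collect href values from anchor tags.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=50):
    queue, seen = [start_url], {start_url}
    while queue and len(seen) <= max_pages:  # roughly caps how many URLs we touch
        url = queue.pop(0)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue  # unreachable or non-text page; skip it
        print(url)
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urldefrag(urljoin(url, link))[0]
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)

crawl("https://example.com/")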

Doug has been writing publicly about his hip replacement surgery so I don’t think this has any privacy issues. 😉

I am interested to see what he writes once he is fully recovered.

My contacts at the American Medical Association disavow any knowledge of hip replacement surgery driving patients to write in Python and/or to write web crawlers.

I suppose there could be liability implications, especially for C/C++ programmers who lose their programming skills except for Python following such surgery.

Still, glad to hear Doug has been making great progress and hope that it continues!

December 11, 2015

Cleaning CSV Data… [Interview Questions?]

Filed under: CSV,Python — Patrick Durusau @ 10:03 pm

Cleaning CSV Data Using the Command Line and csvkit, Part 1 by Srini Kadamati.

From the post:

The Museum of Modern Art is one of the most influential museums in the world and they have released a dataset on the artworks in their collection. The dataset has some data quality issues, however, and requires cleanup.

In a previous post, we discussed how we used Python and Pandas to clean the dataset. In this post, we’ll learn about how to use the csvkit library to acquire and explore tabular data.

Why the command line?

Great question! When working in cloud data science environments, you sometimes only have access to a server’s shell. In these situations, proficiency with command line data science is a true superpower. As you become more proficient, using the command line for some data science tasks is much quicker than writing a Python script or a Hadoop job. Lastly, the command line has a rich ecosystem of tools and integration into the file system. This makes certain kinds of tasks, especially those involving multiple files, incredibly easy.

Some experience working in the command line is expected for this post. If you’re new to the command line, I recommend checking out our interactive command line course.

csvkit

csvkit is a library optimized for working with CSV files. It’s written in Python but the primary interface is the command line. You can install csvkit using pip:

pip install csvkit

You’ll need this library to follow along with this post.

If you want to be a successful data scientist, may I suggest you follow this series and similar posts on data cleaning techniques?

Reports vary, but the general figure is that 50% to 90% of a data scientist’s time is spent cleaning data. Report: Data scientists spend bulk of time cleaning up

Being able to clean data, the 50% to 90% of your future duties, may not get you past the data scientist interview.

There are several 100+ data scientist interview question sets that don’t have any questions about data cleaning.

Seriously, not a single question.

I won’t name names in order to protect the silly, but I can say that SAS does have one data cleaning question out of twenty. Err, that’s 5%, for those of you comparing against duties that occupy 50% to 90% of a data scientist’s time. Of course, the others I reviewed had 0%, so they were even worse.

Oh, the SAS question on data cleaning:

Give examples of data cleaning techniques you have used in the past.

You have to wonder about a data science employer who asks so many questions unrelated to the day-to-day duties of data scientists.

Maybe when asked some arcane question you can ask back:

And when in the last six (6) months has your average data scientist hire used that concept/technique?

It might not land you a job but do you really want to work at a firm that can’t apply data science to its own hiring process?

Data science employers, heal yourselves!

PS: I rather doubt most data science interviewers understand the epistemological assumptions behind most algorithms so you can memorize a bit of that for your interview.

It will convince them that customers will believe your success is just short of divine intervention in their problem.

It’s an old but reliable technique.

December 7, 2015

Jupyter on Apache Spark [Holiday Game]

Filed under: Python,Reddit,Spark — Patrick Durusau @ 4:46 pm

Using Jupyter on Apache Spark: Step-by-Step with a Terabyte of Reddit Data by Austin Ouyang.

From the post:

The DevOps series covers how to get started with the leading open source distributed technologies. In this tutorial, we step through how to install Jupyter on your Spark cluster and use PySpark for some ad hoc analysis of reddit comment data on Amazon S3.

The following tutorial installs Jupyter on your Spark cluster in standalone mode on top of Hadoop and also walks through some transformations and queries on the reddit comment data on Amazon S3. We assume you already have an AWS EC2 cluster up with Spark 1.4.1 and Hadoop 2.7 installed. If not, you can go to our previous post on how to quickly deploy your own Spark cluster.
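
Once the notebook is talking to the cluster, the ad hoc analysis is pleasantly short. A minimal sketch, assuming a PySpark kernel where sqlContext is already defined (the S3 path is a placeholder, not the tutorial's bucket):

# Load the reddit comment JSON straight from S3 into a DataFrame.
comments = sqlContext.read.json("s3n://my-bucket/reddit/comments/*.json")

# Which subreddits generate the most comments?
(comments
 .groupBy("subreddit")
 .count()
 .sort("count", ascending=False)
 .show(10))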

In “Need a Bigoted, Racist Uncle for Holiday Meal?” I mentioned the 1.6 billion Reddit comments that are the subject of this tutorial.

If you can’t find comments offensive to your guests in the Reddit comment collection, they are comatose and/or inanimate objects.

Big Data Holiday Game:

Divide into teams with at least one Jupyter/Apache Spark user on each team.

Play three timed rounds (time for each round dependent on your local schedule) where each team attempts to discover a Reddit comment that is the most offensive for the largest number of guests.

The winner gets bragging rights until next year, you get to show off your data mining skills, not to mention, you get a free pass on saying offensive things to your guests.

Watch for more formalized big data games of this nature by the holiday season for 2016!

Enjoy!

I first saw this in a tweet by Data Science Renee.

November 29, 2015

Idiomatic Python Resources

Filed under: Programming,Python — Patrick Durusau @ 4:57 pm

Idiomatic Python Resources by Andrew Montalenti.

From the post:

Let’s say you’ve just joined my team and want to become an idiomatic Python programmer. Where do you begin?

There are twenty-three resources listed and the benefits of being an idiomatic Python programmer (or an idiomatic programmer in any other language) aren’t limited to employment with Andrew. 😉

One of the advantages to being an idiomatic programmer is that you will be more easily understood by other programmers. Being understood isn’t a bad thing. Really.

Another advantage to being an idiomatic programmer is that it will influence the programmers around you and result in code that is easier for you to understand. Again, understanding isn’t a bad thing.

As if that weren’t enough, perusing the resources that Andrew lists will make you a better programmer overall, which is never a bad thing.
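
If you want a one-glance example of what “idiomatic” buys you (my example, not from Andrew's list):

names = ["ada", "grace", "barbara"]

# Non-idiomatic: index bookkeeping obscures the intent.
labeled = []
for i in range(len(names)):
    labeled.append(str(i) + ": " + names[i])

# Idiomatic: enumerate and a list comprehension say what they mean.
labeled = ["{}: {}".format(i, name) for i, name in enumerate(names)]
print(labeled)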

Enjoy!

November 28, 2015

Docker and Jupyter [Advantages over VMware or VirtualBox?]

Filed under: Python,Virtual Machines — Patrick Durusau @ 10:47 am

How to setup a data science environment in minutes using Docker and Jupyter by Vik Paruchuri.

From the post:

Configuring a data science environment can be a pain. Dealing with inconsistent package versions, having to dive through obscure error messages, and having to wait hours for packages to compile can be frustrating. This makes it hard to get started with data science in the first place, and is a completely arbitrary barrier to entry.

The past few years have seen the rise of technologies that help with this by creating isolated environments. We’ll be exploring one in particular, Docker. Docker makes it fast and easy to create new data science environments, and use tools such as Jupyter notebooks to explore your data.

With Docker, we can download an image file that contains a set of packages and data science tools. We can then boot up a data science environment using this image within seconds, without the need to manually install packages or wait around. This environment is called a Docker container. Containers eliminate configuration problems – when you start a Docker container, it has a known good state, and all the packages work properly.

A nice walk-through of installing a Docker container and Jupyter. I do wonder about the advantages claimed over VMware and VirtualBox:


Although virtual machines enable Linux development to take place on Windows, for example, they have some downsides. Virtual machines take a long time to boot up, they require significant system resources, and it’s hard to create a virtual machine from an image, install some packages, and then create another image. Linux containers solve this problem by enabling multiple isolated environments to run on a single machine. Think of containers as a faster, easier way to get started with virtual machines.

I have never noticed long boot times on VirtualBox and “require significant system resources” is too vague to evaluate.

As far as “it’s hard to create a virtual machine from an image, install some packages, and then create another image,” I thought the point of the post was to facilitate quick access to a data science environment?

In that case, I would download an image of my choosing, import it into VirtualBox and then fire it up. How hard is that?

There are pre-configured images with Solr, Solr plus web search engines, and a host of other options.

For more details, visit VirtualBox.org and for a stunning group of “appliances” see VirtualBoxImages.com.

You can use VMs with Docker so it isn’t strictly an either/or choice.

I first saw this in a tweet by Data Science Renee.


Update: Data Science Renee encountered numerous issues trying to follow this install on Windows 7 Professional 64-bit, using VirtualBox 5.0.10 r104061. You can read more about her travails here: Trouble setting up default, maybe caused by virtualbox. After 2 nights of effort, she succeeded! Great!

The error turned out (apparently) to be in VirtualBox. Or at least, upgrading to a test version of VirtualBox fixed the problem. I know, I was surprised too. My assumption was that it was Windows. 😉

November 3, 2015

Python Mode for Processing

Filed under: Graphics,Processing,Python — Patrick Durusau @ 6:35 pm

Python Mode for Processing

From the post:

Python Mode for Processing 3 is out! Download it through the contributions manager, and try it out.

Processing is a programming language, development environment, and online community. Since 2001, Processing has promoted software literacy within the visual arts and visual literacy within technology. Today, there are tens of thousands of students, artists, designers, researchers, and hobbyists who use Processing for learning, prototyping, and production.

Processing was initially released with a Java-based syntax, and with a lexicon of graphical primitives that took inspiration from OpenGL, Postscript, Design by Numbers, and other sources. With the gradual addition of alternative programming interfaces — including JavaScript, Python, and Ruby — it has become increasingly clear that Processing is not a single language, but rather, an arts-oriented approach to learning, teaching, and making things with code.

We are thrilled to make available this public release of the Python Mode for Processing, and its associated documentation. More is on the way! If you’d like to help us improve the implementation of Python Mode and its documentation, please find us on Github!
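
A minimal Python Mode sketch (my example, not from the release notes): Processing supplies size(), background(), ellipse(), mouseX and mouseY as built-ins, so this runs inside the Processing environment rather than plain CPython.

def setup():
    size(400, 400)

def draw():
    background(255)
    ellipse(mouseX, mouseY, 40, 40)  # a circle that follows the mouse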

When I see new language support, I am reminded that semantic diversity is far more commonplace than you would think.

Enjoy!

I first saw this in a tweet by Lynn Cherny.

October 16, 2015

scikit-learn 0.17b1 is out!

Filed under: Python,Scikit-Learn — Patrick Durusau @ 3:14 pm

scikit-learn 0.17b1 is out! by Olivier Grisel.

From the announcement:

The 0.17 beta release of scikit-learn has been uploaded to PyPI. As of now only the source tarball is available. I am waiting for the CI server to build the binary packages for the Windows and Mac OSX platform. They should be online tonight or tomorrow morning.

https://pypi.python.org/pypi/scikit-learn/0.17b1

Please test it as much as possible especially if you have a test suite for a project that has scikit-learn as a dependency.

If you find regressions from 0.16.1 please open issues on github and put `[REGRESSION]` in the title of the issue:

https://github.com/scikit-learn/scikit-learn/issues

Any bugfix will have to be merged to the master branch first and then we will do a cherrypick of the fix into the 0.17.X branch that will be used to generate 0.17.0 final, probably in less than 2 weeks.

Just in time for the weekend! 😉

Comment early and often.

Enjoy!

October 13, 2015

Rodeo 1.0: a Python IDE on your Desktop

Filed under: Programming,Python — Patrick Durusau @ 7:05 pm

Rodeo 1.0: a Python IDE on your Desktop by Greg.

From the post:

When we released our in-browser IDE for Python earlier this year, we couldn’t believe the response. Thousands of our readers all over the world saddled up and told their friends and colleagues to do the same (no more puns, we promise).

That reaction, as well as the endless search for hacks to make our lives easier, got us thinking about how to make Rodeo even better. Over the past few months, we’ve been working on Rodeo 1.0, a version of Rodeo that runs right on your desktop. Download the installers for Windows, OS X, or Linux here.

Something new for Python readers!

I grabbed the 64-bit version for Linux and will install it tomorrow.

Enjoy!

October 12, 2015

Python Week 2015 (Packt Publishing)

Filed under: Books,Python — Patrick Durusau @ 7:48 pm

Python Week 2015 (Packt Publishing)

Packt Publishing is giving away free ebooks and offering 20% off their top selling Python books and videos.

The free book for today (good for approximately 22 hours from this posting):

Building Machine Learning Systems with Python

Expand your Python knowledge and learn all about machine-learning libraries in this user-friendly manual. ML is the next big breakthrough in technology and this book will give you the head-start you need.

  • Master Machine Learning using a broad set of Python libraries and start building your own Python-based ML systems
  • Covers classification, regression, feature engineering, and much more guided by practical examples
  • A scenario-based tutorial to get into the right mind-set of a machine learner (data exploration) and successfully implement this in your new or existing projects

I didn’t know this was Python week! 😉

BTW, there is a website devoted to awareness days, weeks, months: http://www.national-awareness-days.com/

They seem to take the idea quite seriously but they didn’t have Python week on their calendar.

September 26, 2015

Writing “Python Machine Learning”

Filed under: Books,Machine Learning,Python — Patrick Durusau @ 8:46 pm

Writing “Python Machine Learning” by Sebastian Raschka.

From the post:

It’s been about time. I am happy to announce that “Python Machine Learning” was finally released today! Sure, I could just send an email around to all the people who were interested in this book. On the other hand, I could put down those 140 characters on Twitter (minus what it takes to insert a hyperlink) and be done with it. Even so, writing “Python Machine Learning” really was quite a journey for a few months, and I would like to sit down in my favorite coffeehouse once more to say a few words about this experience.

A delightful tale for those of us who have authored books and an inspiration (with some practical suggestions) for anyone who hopes to write a book.

Sebastian’s productivity hints will ring familiar for those with similar habits and bear study by those who hope to become more productive.

Sebastian never comes out and says it but his writing approach breaks each stage of the book into manageable portions. It is far easier to say (and do) “write an outline” than to “write the complete and fixed outline for an almost 500-page book.”

If the task is too large (the complete and immutable outline), you won’t build up enough momentum to make a reasonable start.

After reading Sebastian’s post, what book are you thinking about writing?

September 22, 2015

Python for Scientists [Warning – Sporadic Content Ahead]

Filed under: Programming,Python,Science — Patrick Durusau @ 10:36 am

Python for Scientists: A Curated Collection of Chapters from the O’Reilly Data and Programming Libraries

From the post:

More and more, scientists are seeing tech seep into their work. From data collection to team management, various tools exist to make your lives easier. But, where to start? Python is growing in popularity in scientific circles, due to its simple syntax and seemingly endless libraries. This free ebook gets you started on the path to a more streamlined process. With a collection of chapters from our top scientific books, you’ll learn about the various options that await you as you strengthen your computational thinking.

This free ebook includes chapters from:

  • Python for Data Analysis
  • Effective Computation in Physics
  • Bioinformatics Data Skills
  • Python Data Science Handbook

Warning: You give your name and email to the O’Reilly marketing machine and get:

Python for Data Analysis

  • Python Language Essentials Appendix

Effective Computation in Physics

  • Chapter 1: Introduction to the Command Line
  • Chapter 7: Analysis and Visualization
  • Chapter 20: Publication

Bioinformatics Data Skills

  • Chapter 4: Working with Remote Machines
  • Chapter 5: Git for Scientists

Python Data Science Handbook

  • Chapter 3: Introduction to NumPy
  • Chapter 4: Introduction to Pandas

The content present is very good. The content missing is vast.

Topic Modeling and Twitter

Filed under: Latent Dirichlet Allocation (LDA),Python,Twitter — Patrick Durusau @ 9:57 am

Alex Perrier has two recent posts of interest to Twitter users and topic modelers:

Topic Modeling of Twitter Followers

In this post, we explore LDA, an unsupervised topic modeling method, in the context of twitter timelines. Given a twitter account, is it possible to find out what subjects its followers are tweeting about?

Knowing the evolution or the segmentation of an account’s followers can give a marketing department actionable insights into the near real time concerns of existing or potential customers. Carrying out topic analysis of the followers of politicians can produce a complementary view of opinion polls.

Segmentation of Twitter Timelines via Topic Modeling

Following up on our first post on the subject, Topic Modeling of Twitter Followers, we compare different unsupervised methods to further analyze the timelines of the followers of the @alexip account. We compare the results obtained through Latent Semantic Analysis and Latent Dirichlet Allocation and we segment Twitter timelines based on the inferred topics. We find the optimal number of clusters using silhouette scoring.
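
For orientation, the LDA step itself is compact. A minimal sketch using scikit-learn's implementation (not Alex's code; the tweet list stands in for followers' timelines, and n_topics / get_feature_names match the 0.17-era API):

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    "deep learning tutorial with python and numpy",
    "election polls tighten ahead of the debate",
    "new scikit-learn release improves topic models",
    "senate vote expected on the budget this week",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(tweets)

lda = LatentDirichletAllocation(n_topics=2, random_state=0)
lda.fit(X)

# Print the top words for each inferred topic.
words = vec.get_feature_names()
for k, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[-4:][::-1]]
    print("topic %d: %s" % (k, ", ".join(top)))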

Alex has Python code, an interesting topic, great suggestions for additional reading, what is there not to like?

LDA, machine learning types follow @alexip but privacy advocates should as well.

Consider this recent tweet by Alex:

In the end the best way to protect your privacy is to behave erratically so that the Machine Learning algo will detect you as an outlier!

Perhaps, perhaps, but I suspect outliers/outsiders are classed as dangerous by several government agencies in the US.

September 21, 2015

Python & R codes for Machine Learning

Filed under: Machine Learning,Python,R — Patrick Durusau @ 7:54 pm

While I am thinking about machine learning, I wanted to mention: Cheatsheet – Python & R codes for common Machine Learning Algorithms by Manish Saraswat.

From the post:

In his famous book, Think and Grow Rich, Napoleon Hill narrates the story of Darby, who after digging for a gold vein for a few years walked away when he was three feet from it!

Now, I don’t know whether the story is true or false. But, I surely know of a few Data Darbys around me. These people understand the purpose of machine learning and its execution, and use just a set of 2 to 3 algorithms on whatever problem they are working on. They don’t update themselves with better algorithms or techniques, because those are too tough or too time consuming.

Like Darby, they are surely missing from a lot of action after reaching this close! In the end, they give up on machine learning by saying it is very computation heavy or it is very difficult or I can’t improve my models above a threshold – what’s the point? Have you heard them?

Today’s cheat sheet aims to change a few Data Darby’s to machine learning advocates. Here’s a collection of 10 most commonly used machine learning algorithms with their codes in Python and R. Considering the rising usage of machine learning in building models, this cheat sheet is good to act as a code guide to help you bring these machine learning algorithms to use. Good Luck!

Here’s a very good idea! Whether you want to learn these algorithms or a new Emacs mode. 😉

Sure, you can always look up the answer but that breaks your chain of thought, over and over again.

Enjoy!
