Archive for the ‘Python’ Category

GPU Scripting and Code Generation with PyCUDA

Saturday, May 11th, 2013

GPU Scripting and Code Generation with PyCUDA by Andreas Klöckner, Nicolas Pinto, Bryan Catanzaro, Yunsup Lee, Paul Ivanov, Ahmed Fasih.

Abstract:

High-level scripting languages are in many ways polar opposites to GPUs. GPUs are highly parallel, subject to hardware subtleties, and designed for maximum throughput, and they offer a tremendous advance in the performance achievable for a significant number of computational problems. On the other hand, scripting languages such as Python favor ease of use over computational speed and do not generally emphasize parallelism. PyCUDA is a package that attempts to join the two together. This chapter argues that in doing so, a programming environment is created that is greater than just the sum of its two parts. We would like to note that nearly all of this chapter applies in unmodified form to PyOpenCL, a sister project of PyCUDA, whose goal it is to realize the same concepts as PyCUDA for OpenCL.

The authors argue that while measurements of the productivity gains from PyCUDA are missing, the widespread use of PyCUDA is an indication of its usefulness.

Point taken.

More important, in my view, is PyCUDA’s potential to make use of GPUs more widespread.

Widespread use will uncover better algorithms, data structures, appropriate problems for GPUs, etc., potentially more quickly than occasional use.

Hacking Secret Ciphers with Python

Tuesday, April 16th, 2013

“Hacking Secret Ciphers with Python” Released by Al Sweigart.

From the post:

My third book, Hacking Secret Ciphers with Python, is finished. It is free to download under a Creative Commons license, and available for purchase as a physical book on Amazon for $25 (which qualifies it for free shipping). This book is aimed at people who have no experience programming or with cryptography. The book goes through writing Python programs that not only implement several ciphers but also can hack these ciphers.

100% of the proceeds from the book sales will be donated to the Electronic Frontier Foundation, Creative Commons, and The Tor Project.

This looks like fun!
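To give a taste of what “implement a cipher, then hack it” means, here is a minimal sketch of the simplest case I can think of: a Caesar shift and a brute-force attack that tries every key. This is my own illustration, not code from the book, and the function names caesar and brute_force are made up for the example.

```python
import string

ALPHABET = string.ascii_uppercase


def caesar(text, key):
    """Shift each letter by `key` positions; leave other characters alone."""
    out = []
    for ch in text.upper():
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + key) % 26])
        else:
            out.append(ch)
    return "".join(out)


def brute_force(ciphertext):
    """Return all 26 candidate decryptions, keyed by the shift that was tried."""
    return {key: caesar(ciphertext, -key) for key in range(26)}


secret = caesar("MEET ME AT NOON", 3)
candidates = brute_force(secret)
print(secret)
print(candidates[3])
```

With only 26 possible keys, the “hack” is simply reading the candidates and picking the one that looks like English.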

Unlike the secrecy cultists in cybersecurity, I think new ideas and insights into cryptography can come from anyone who spends time working on it.

To paraphrase Buffalo Springfield, “…increase the government’s paranoia like looking in a mirror and seeing the public working on cryptography….”

I never claimed to be a song writer. ;-)

PS: Download a copy and buy a hard copy to give to someone.

Or donate the hard copy to your local library!

PyData and More Tools…

Thursday, April 11th, 2013

PyData and More Tools for Getting Started with Python for Data Scientists by Sean Murphy.

From the post:

It would turn out that people are very interested in learning more about python and our last post, “Getting Started with Python for Data Scientists,” generated a ton of comments and recommendations. So, we wanted to give back those comments and a few more in a new post. As luck would have it, John Dennison, who helped co-author this post (along with Abhijit), attended both PyCon and PyData and wanted to sneak in some awesome developments he learned at the two conferences.

I make out at least seventeen (17) different Python resources, libraries, etc.

Enough to keep you busy for more than a little while. ;-)

Scrapely

Tuesday, April 9th, 2013

Scrapely

From the webpage:

Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.

A tool for data mining similar HTML pages.

Supports a command line interface.

A Programmer’s Guide to Data Mining

Saturday, April 6th, 2013

A Programmer’s Guide to Data Mining – The Ancient Art of the Numerati by Ron Zacharski.

From the webpage:

Before you is a tool for learning basic data mining techniques. Most data mining textbooks focus on providing a theoretical foundation for data mining, and as result, may seem notoriously difficult to understand. Don’t get me wrong, the information in those books is extremely important. However, if you are a programmer interested in learning a bit about data mining you might be interested in a beginner’s hands-on guide as a first step. That’s what this book provides.

This guide follows a learn-by-doing approach. Instead of passively reading the book, I encourage you to work through the exercises and experiment with the Python code I provide. I hope you will be actively involved in trying out and programming data mining techniques. The textbook is laid out as a series of small steps that build on each other until, by the time you complete the book, you have laid the foundation for understanding data mining techniques. This book is available for download for free under a Creative Commons license (see link in footer). You are free to share the book, and remix it. Someday I may offer a paper copy, but the online version will always be free.

If you are looking for explanations of data mining that fall between the “dummies” variety and arXiv.org papers, you are at the right place!

Not new information but well presented information, always a rare thing.

Take the time to read this book.

If not for the content, to get some ideas on how to improve your next book.

Implementing the RAKE Algorithm with NLTK

Monday, March 25th, 2013

Implementing the RAKE Algorithm with NLTK by Sujit Pal.

From the post:

The Rapid Automatic Keyword Extraction (RAKE) algorithm extracts keywords from text, by identifying runs of non-stopwords and then scoring these phrases across the document. It requires no training, the only input is a list of stop words for a given language, and a tokenizer that splits the text into sentences and sentences into words.

The RAKE algorithm is described in the book Text Mining Applications and Theory by Michael W Berry (free PDF). There is a (relatively) well-known Python implementation and somewhat less well-known Java implementation.

I started looking for something along these lines because I needed to parse a block of text before vectorizing it and using the resulting features as input to a predictive model. Vectorizing text is quite easy with Scikit-Learn as shown in its Text Processing Tutorial. What I was trying to do was to cut down the noise by extracting keywords from the input text and passing a concatenation of the keywords into the vectorizer. It didn’t improve results by much in my cross-validation tests, however, so I ended up not using it. But keyword extraction can have other uses, so I decided to explore it a bit more.

I had started off using the Python implementation directly from my application code (by importing it as a module). I soon noticed that it was doing a lot of extra work because it was implemented in pure Python. I was using NLTK anyway for other stuff in this application, so it made sense to convert it to also use NLTK so I could hand off some of the work to NLTK’s built-in functions. So here is another RAKE implementation, this time using Python and NLTK.
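The idea in the first paragraph of the quote fits in a few lines. What follows is a rough sketch using only the standard library, not Sujit’s NLTK code: the stop word list is a toy one of my own, sentence splitting is a crude regex, and the scoring is the degree/frequency word score from the RAKE description, summed over the words in each phrase.

```python
import re
from collections import defaultdict

# Toy stop word list (an assumption for this example, not a real language list).
STOPWORDS = {"a", "an", "and", "the", "of", "is", "for", "in", "to", "by", "are", "on"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into runs of consecutive non-stopwords."""
    phrases = []
    for sentence in re.split(r"[.!?,;:]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9]+", sentence):
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake(text, stopwords=STOPWORDS):
    """Score each candidate phrase as the sum of degree(w) / frequency(w)."""
    phrases = candidate_phrases(text, stopwords)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    scores = {}
    for phrase in set(phrases):
        scores[" ".join(phrase)] = sum(degree[w] / freq[w] for w in phrase)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


text = ("Keyword extraction is the task of finding key phrases. "
        "Rapid automatic keyword extraction needs a list of stop words.")
ranked = rake(text)
for phrase, score in ranked:
    print(f"{score:5.1f}  {phrase}")
```

Longer runs of non-stopwords score higher, which is why the ranking favors multi-word phrases over single words.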

Reminds me of the “statistically insignificant phrases” at Amazon. Or was that “statistically improbable phrases?”

If you search on “statistically improbable phrases,” you get twenty (20) “hits” under books at Amazon.com.

Could be a handy tool to quickly extract candidates for topics in a topic map.

PyCon US 2013

Sunday, March 24th, 2013

PyCon US 2013

In case you missed it, videos from PyCon US 2013 are online!

I am just beginning to scroll through the presentations and will be pulling out some favorites.

What are yours?

How Sharehoods Created Neomodel Along The Way [London]

Friday, March 22nd, 2013

How Sharehoods Created Neomodel Along The Way

EVENT DETAILS

What: Neo4J User Group:CASE STUDY: How Sharehoods Created Neomodel Along The Way
Where: The Skills Matter eXchange, London
When: 27 Mar 2013 Starts at 18:30

From the description:

Sharehoods is a global online portal for foreigners and the first place where new-comers to a city can build their social relationships and network – online or from a mobile phone.

In this talk, Sharehoods Head of Technology Robin Edwards will explain why and how Neo4j is used at this exciting tech startup. Robin will also give a whirlwind tour of neomodel, a new Python framework for neo4j and its integration with the Django stack.

Join this talk if you’d like to learn how to get productive with Neo4j, Python and Django.

Entity disambiguation:

I don’t think they mean:

Jamie Foxx

I think they mean:

django software The Web framework for perfectionists with deadlines.

If you attend, drop me a note to confirm my suspicions. ;-)

SnoPy – SNOBOL Pattern Matching for Python

Friday, March 22nd, 2013

SnoPy – SNOBOL Pattern Matching for Python

Description:

SnoPy – A Python alternative to regular expressions. Borrowed from SNOBOL this alternative is both easier to use and more powerful than regular expressions. NO backslashes to count.

See also: SnoPy – SNOBOL Pattern Matching for Python Web Site.

For cross-disciplinary data mining, what could be more useful than SNOBOL pattern matching?

I first saw this in a Facebook link posted by Sam Hunting.

Pyrallel – Parallel Data Analytics in Python

Wednesday, March 20th, 2013

Pyrallel – Parallel Data Analytics in Python by Olivier Grisel.

From the webpage:

Overview: experimental project to investigate distributed computation patterns for machine learning and other semi-interactive data analytics tasks.

Scope:

  • focus on small to medium dataset that fits in memory on a small (10+ nodes) to medium cluster (100+ nodes).
  • focus on small to medium data (with data locality when possible).
  • focus on CPU bound tasks (e.g. training Random Forests) while trying to limit disk / network access to a minimum.
  • do not focus on HA / Fault Tolerance (yet).
  • do not try to invent new set of high level programming abstractions (yet): use a low level programming model (IPython.parallel) to finely control the cluster elements and messages transferred and help identify what are the practical underlying constraints in distributed machine learning setting.

Disclaimer: the public API of this library will probably not be stable soon as the current goal of this project is to experiment.

This project brought to mind two things:

  1. Experimentation can lead to new approaches, such as “Think like a vertex.” (GraphLab: A Distributed Abstraction…), and
  2. A conference anecdote about a Python application written so the customer would need to upgrade for higher performance. The prototype performed so well that the customer didn’t need the fuller version. I thought that was a tribute to Python and the programmer. Opinions differed.

Million Song Dataset in Minutes!

Thursday, March 7th, 2013

Million Song Dataset in Minutes! (Video)

Actually 5:35 as per the video.

The summary of the video reads:

Created Web Project [zero install]

Loaded data from S3

Developed in Pig and Python [watch for the drop down menus of pig fragments]

ILLUSTRATE’d our work [perhaps the most impressive feature, tests code against sample of data]

Ran on Hadoop [drop downs to create a cluster]

Downloaded results [50 "densest songs", see the video]

It’s not all “hands free” or without intellectual effort on your part.

But, a major step towards a generally accessible interface for Hadoop/MapReduce data processing.

PyData Videos

Tuesday, February 26th, 2013

PyData Videos

All great but here are five (5) to illustrate the range of what awaits:

Connecting Data Science to business value, Josh Hemann.

GPU and Python, Andreas Klöckner, Ph.D.

Network X and Gephi, Gilad Lotan.

NLTK and Text Processing, Andrew Montalenti.

Wikipedia Indexing And Analysis, Didier Deshommes.

Forty-seven (47) videos in all so my list is missing forty-two (42) other great ones!

Which ones are your favorites?

Flatten entire HBase column families… [Mixing Labels and Data]

Monday, February 11th, 2013

Flatten entire HBase column families with Pig and Python UDFs by Chase Seibert.

From the post:

Most Pig tutorials you will find assume that you are working with data where you know all the column names ahead of time, and that the column names themselves are just labels, versus being composites of labels and data. For example, when working with HBase, it’s actually not uncommon for both of those assumptions to be false. Being a columnar database, it’s very common to be working with rows that have thousands of columns. Under that circumstance, it’s also common for the column names themselves to encode two dimensions, such as date and counter type.

How do you solve this mismatch? If you’re in the early stages of designing a schema, you could reconsider a more row based approach. If you have to work with an existing schema, however, you can with the help of Pig UDFs.

Now there’s an ugly problem.

You can split the label from the data as shown, but that doesn’t help when the label/data is still in situ.

Saying: “Don’t do that!” doesn’t help because it is already being done.

If anything, topic maps need to take subjects as they are found, not as we might wish for them to be.

Curious, would you write an identifier as a regex that parses such a mix of label and data, assigning each to further processing?
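To make my question concrete, here is a minimal sketch of what I have in mind. The column naming scheme (a date, a colon, then a counter type) is purely hypothetical, as are the names COLUMN_NAME and flatten_row; a real schema would need its own pattern.

```python
import re

# Hypothetical scheme: "<YYYY-MM-DD>:<counter_type>", e.g. "2013-02-11:clicks".
COLUMN_NAME = re.compile(r"^(?P<date>\d{4}-\d{2}-\d{2}):(?P<counter>\w+)$")


def flatten_row(row_key, columns):
    """Turn one wide row into (row_key, date, counter, value) tuples."""
    flat = []
    for name, value in sorted(columns.items()):
        match = COLUMN_NAME.match(name)
        if match is None:
            continue  # a plain label with no data embedded in the name
        flat.append((row_key, match.group("date"), match.group("counter"), value))
    return flat


row = {
    "2013-02-10:clicks": 7,
    "2013-02-10:views": 120,
    "2013-02-11:clicks": 9,
    "owner": "someone",
}
records = flatten_row("user42", row)
for record in records:
    print(record)
```

The regex acts as the identifier: anything it matches is a composite of label and data, and the named groups hand each part to further processing.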

Suggestions?

I first saw this at Flatten Entire HBase Column Families With Pig and Python UDFs by Alex Popescu.

PyPLN: a Distributed Platform for Natural Language Processing

Friday, February 8th, 2013

PyPLN: a Distributed Platform for Natural Language Processing by Flávio Codeço Coelho, Renato Rocha Souza, Álvaro Justen, Flávio Amieiro, Heliana Mello.

Abstract:

This paper presents a distributed platform for Natural Language Processing called PyPLN. PyPLN leverages a vast array of NLP and text processing open source tools, managing the distribution of the workload on a variety of configurations: from a single server to a cluster of linux servers. PyPLN is developed using Python 2.7.3 but makes it very easy to incorporate other softwares for specific tasks as long as a linux version is available. PyPLN facilitates analyses both at document and corpus level, simplifying management and publication of corpora and analytical results through an easy to use web interface. In the current (beta) release, it supports English and Portuguese languages with support to other languages planned for future releases. To support the Portuguese language PyPLN uses the PALAVRAS parser (Bick, 2000). Currently PyPLN offers the following features: Text extraction with encoding normalization (to UTF-8), part-of-speech tagging, token frequency, semantic annotation, n-gram extraction, word and sentence repertoire, and full-text search across corpora. The platform is licensed as GPL-v3.

Demo: http://demo.pypln.org

Source code: http://pypln.org.

Have you noticed that tools for analysis are getting easier, not harder to use?

Is there a lesson there for tools to create topic map content?

A Quick Guide to Hadoop Map-Reduce Frameworks

Thursday, February 7th, 2013

A Quick Guide to Hadoop Map-Reduce Frameworks by Alex Popescu.

Alex has assembled links to guides to MapReduce frameworks.

Thanks Alex!

Confluently Persistent Sets and Maps

Wednesday, January 23rd, 2013

Confluently Persistent Sets and Maps by Olle Liljenzin.

Abstract:

Ordered sets and maps play important roles as index structures in relational data models. When a shared index in a multi-user system is modified concurrently, the current state of the index will diverge into multiple versions containing the local modifications performed in each work flow. The confluent persistence problem arises when versions should be melded in commit and refresh operations so that modifications performed by different users become merged.

Confluently Persistent Sets and Maps are functional binary search trees that support efficient set operations both when operands are disjoint and when they are overlapping. Treap properties with hash values as priorities are maintained and with hash-consing of nodes a unique representation is provided. Non-destructive set merge algorithms that skip inspection of equal subtrees and a conflict detecting meld algorithm based on set merges are presented. The meld algorithm is used in commit and refresh operations. With m modifications in one flow and n items in total, the expected cost of the operations is O(m log(n/m)).
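To see why hash values as priorities plus hash-consing give a unique representation, here is a minimal sketch. It is my own illustration, not the paper’s code: a functional treap built from nested tuples, with a SHA-256 digest of each key as its priority and a dictionary standing in for the hash-consing table. The names make, split, insert and keys are hypothetical, and the set merge and meld algorithms from the abstract are not shown.

```python
import hashlib

_table = {}  # hash-consing table: structurally equal nodes are stored once


def priority(key):
    """Deterministic priority derived from the key itself."""
    return hashlib.sha256(repr(key).encode()).hexdigest()


def make(left, key, right):
    """Return the shared node for (left, key, right), creating it if needed."""
    node = (left, key, right)
    return _table.setdefault(node, node)


def split(tree, key):
    """Split tree into (keys < key, keys > key) without modifying it."""
    if tree is None:
        return None, None
    left, k, right = tree
    if key < k:
        smaller, larger = split(left, key)
        return smaller, make(larger, k, right)
    if key > k:
        smaller, larger = split(right, key)
        return make(left, k, smaller), larger
    return left, right


def insert(tree, key):
    """Return a new treap containing key; the old version stays valid."""
    if tree is None:
        return make(None, key, None)
    left, k, right = tree
    if key == k:
        return tree
    if priority(key) > priority(k):  # new key belongs above this subtree
        smaller, larger = split(tree, key)
        return make(smaller, key, larger)
    if key < k:
        return make(insert(left, key), k, right)
    return make(left, k, insert(right, key))


def keys(tree):
    """In-order traversal."""
    if tree is None:
        return []
    left, k, right = tree
    return keys(left) + [k] + keys(right)


a = None
for item in [5, 1, 9, 3, 7]:
    a = insert(a, item)
b = None
for item in [9, 7, 5, 3, 1]:
    b = insert(b, item)
print(keys(a), a is b)
```

Because the shape depends only on the keys, two work flows that insert the same keys in different orders end up holding the very same object, so checking whether two subtrees are equal is a pointer comparison.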

Is this an avenue for coordination between distinct topic maps?

Or is consistency of distinct topic maps an application-based requirement?

Assembling a Python Machine Learning Toolkit

Wednesday, January 23rd, 2013

Assembling a Python Machine Learning Toolkit by Sujit Pal.

From the post:

I had been meaning to read Peter Harrington’s book Machine Learning In Action (MLIA) for a while now, and I finally finished reading it earlier this week (my review on Amazon is here). The book provides Python implementations of 8 of the 10 Top Algorithms in Data Mining listed in this paper (PDF). The math package used in the examples is Numpy, and the charts are built using Matplotlib.

In the past, the little ML work I have done has been in Java, because that was the language and ecosystem I knew best. However, given the experimental, iterative nature of ML work, it’s probably not the most ideal language to use. However, there are lots of options when it comes to languages for ML – over the last year, I have learned Octave (open-source version of MATLAB) for the Coursera Machine Learning class and R for the Coursera Statistics One and Computing for Data Analysis classes (still doing the second one). But because I know Python already, Python/Numpy looks easier to use than Octave, and Python/Matplotlib looks as simple as using R graphics. There is also the pandas package which provides R-like features, although I haven’t used it yet.

Looking around on the net, I find that many other people have reached similar conclusions – ie, that Python seems to be the way to go for initial prototyping work in ML. I wanted to set up a small toolbox of Python libraries that will allow me to do this also. I settled on an initial list of packages based on the Scipy Superpack, but since I am still on Mac OS (Snow Leopard) I could not use the script from there. There were some issues I had to work through to make this to work, so I document this here, so if you are in the same situation this may help you.

Unlike the Scipy Superpack, which seems to prefer versions that are often the bleeding edge development versions, I decided to stick to the latest stable release versions for each of the libraries. Here they are:

Sujit’s post will save you a few steps in assembling your Python machine learning toolkit.

Pass it on.

Machine Learning and Data Mining – Association Analysis with Python

Thursday, January 17th, 2013

Machine Learning and Data Mining – Association Analysis with Python by Marcel Caraciolo.

From the post:

Recently I’ve been working with recommender systems and association analysis. This last one, specially, is one of the most used machine learning algorithms to extract from large datasets hidden relationships.

The famous example related to the study of association analysis is the history of the baby diapers and beers. This history reports that a certain grocery store in the Midwest of the United States increased their beer sales by putting them near where the diapers were placed. In fact, what happened is that the association rules pointed out that men bought diapers and beers on Thursdays. So the store could have profited by placing those products together, which would increase the sales.

Association analysis is the task of finding interesting relationships in large data sets. These hidden relationships are then expressed as a collection of association rules and frequent item sets. Frequent item sets are simply a collection of items that frequently occur together. And association rules suggest a strong relationship that exists between two items.
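As a rough illustration of frequent item sets and association rules, here is a small sketch over an invented set of shopping baskets. Support and confidence are the usual measures for this kind of analysis rather than anything taken from the excerpt above, and the data and function names are made up for the example.

```python
from itertools import combinations

# Invented transactions, one set of items per basket.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"diapers", "beer", "bread"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the baskets containing antecedent, the share that also contain consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def frequent_pairs(transactions, min_support):
    """All item pairs whose support is at least min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for pair in combinations(items, 2):
        value = support(pair, transactions)
        if value >= min_support:
            result[pair] = value
    return result


frequent = frequent_pairs(transactions, min_support=0.5)
rule = confidence({"diapers"}, {"beer"}, transactions)
print(frequent)
print(f"diapers -> beer confidence: {rule:.2f}")
```

Note that nothing here says what kind of relationship diapers and beer have, only that they co-occur.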

When I think of associations in a topic map, I assume I am at least starting with the roles and the players of those roles.

As this post demonstrates, that may be overly optimistic on my part.

What if I discover an association but not its type or the roles in it? And yet I still want to preserve the discovery for later use?

An incomplete association as it were.

Suggestions?

A Guide to Python Frameworks for Hadoop

Wednesday, January 9th, 2013

A Guide to Python Frameworks for Hadoop by Uri Laserson.

From the post:

I recently joined Cloudera after working in computational biology/genomics for close to a decade. My analytical work is primarily performed in Python, along with its fantastic scientific stack. It was quite jarring to find out that the Apache Hadoop ecosystem is primarily written in/for Java. So my first order of business was to investigate some of the options that exist for working with Hadoop from Python.

In this post, I will provide an unscientific, ad hoc review of my experiences with some of the Python frameworks that exist for working with Hadoop, including:

  • Hadoop Streaming
  • mrjob
  • dumbo
  • hadoopy
  • pydoop
  • and others

Ultimately, in my analysis, Hadoop Streaming is the fastest and most transparent option, and the best one for text processing. mrjob is best for rapidly working on Amazon EMR, but incurs a significant performance penalty. dumbo is convenient for more complex jobs (objects as keys; multistep MapReduce) without incurring as much overhead as mrjob, but it’s still slower than Streaming.

Read on for implementation details, performance comparisons, and feature comparisons.
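Part of why Streaming is so transparent is its contract: the mapper and reducer are ordinary programs that read lines and write tab-separated key/value lines, with a sort by key in between. Here is a minimal sketch of that contract as two Python generators, run locally with sorted() standing in for Hadoop’s shuffle. Counting bigrams is my own stand-in task, not necessarily what Uri benchmarks, and in a real job each function would live in its own script reading from standard input.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'bigram<TAB>1' record per pair of adjacent words."""
    for line in lines:
        words = line.split()
        for first, second in zip(words, words[1:]):
            yield f"{first} {second}\t1"


def reducer(sorted_records):
    """Sum the counts per key; records must arrive sorted by key."""
    parsed = (record.rstrip("\n").split("\t") for record in sorted_records)
    for key, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{key}\t{total}"


# Local stand-in for the map -> sort -> reduce pipeline.
sample = ["the cat sat", "the cat ran"]
result = list(reducer(sorted(mapper(sample))))
for record in result:
    print(record)
```

The reducer relies on the sort: groupby only merges adjacent records, which is exactly the guarantee Streaming gives each reducer.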

A non-word count Hadoop example? Who would have thought? ;-)

Enjoy!

IPython Notebook Viewer

Friday, December 28th, 2012

IPython Notebook Viewer

From the webpage:

A Simple way to share your IP[y]thon Notebook as Gists.

Share your own notebook, or browse others’

Scientific Python retweeted a post from Hilary Mason on the IPython Notebook Viewer so I had to go look.

For details on IPython and notebooks, see: IP[y]: IPython Interactive Computing:

IPython provides a rich toolkit to help you make the most out of using Python, with:

  • Powerful Python shells (terminal and Qt-based).
  • A web-based notebook with the same core features but support for code, text, mathematical expressions, inline plots and other rich media.
  • Support for interactive data visualization and use of GUI toolkits.
  • Flexible, embeddable interpreters to load into your own projects.
  • Easy to use, high performance tools for parallel computing.

As Hilary says in her tweet: “…one of the coolest things I’ve seen in a long time. It makes analysis more collaborative!”

Useful for exchanging data analysis.

Possibly a good lesson on what a “merging data by example” resource might look like.

Yes?

A Python Compiler for Big Data

Tuesday, December 18th, 2012

A Python Compiler for Big Data by Stephen Diehl.

From the post:

Blaze is the next generation of NumPy, Python’s extremely popular array library. At Continuum Analytics we aim to tackle some of the hardest problems in large data analytics with our Python stack of Numba and Blaze, which together will form the basis of distributed computation and storage system which is simultaneously able to generate optimized machine code specialized to the data being operated on.

Blaze aims to extend the structural properties of NumPy arrays to a wider variety of table and array-like structures that support commonly requested features such as missing values, type heterogeneity, and labeled arrays.

(images omitted)

Unlike NumPy, Blaze is designed to handle out-of-core computations on large datasets that exceed the system memory capacity, as well as on distributed and streaming data. Blaze is able to operate on datasets transparently as if they behaved like in-memory NumPy arrays.

We aim to allow analysts and scientists to productively write robust and efficient code, without getting bogged down in the details of how to distribute computation, or worse, how to transport and convert data between databases, formats, proprietary data warehouses, and other silos.

Just a thumbnail sketch but enough to get you interested in learning more.

Rosalind

Saturday, December 15th, 2012

Rosalind

From the homepage:

Rosalind is a platform for learning bioinformatics through problem solving.

Rather than teaching topic maps from the “basics” forward, what about teaching problems for which topic maps are a likely solution?

And introduce syntax/practices as solutions to particular issues?

Suggestions for problems?

Continuum Unleashes Anaconda on Python Analytics Community

Tuesday, December 4th, 2012

Continuum Unleashes Anaconda on Python Analytics Community

From the post:

Python-based data analytics solutions and services company, Continuum Analytics, today announced the release of the latest version of Anaconda, its collection of libraries for Python that includes Numba Pro, IOPro and wiseRF all in one package.

Anaconda enables large-scale data management, analysis, and visualization for business intelligence, scientific analysis, engineering, machine learning, and more. The latest release, version 1.2.1, includes improved performance and feature enhancements for Numba Pro and IOPro.

Available for Windows, Mac OS X and Linux, Anaconda packages more than 80 popular numerical and scientific Python libraries used by scientists, engineers and data analysts, with a single integrated and flexible installer. The company says its goal is to seamlessly support switching between multiple versions of Python and other packages, via a “Python environments” feature that allows mixing and matching different versions of Python, Numpy and Scipy.

New features and upgrades in the latest version of Anaconda include performance and feature enhancements to Numba Pro and IOPro, improved conda command and in addition, Continuum has added Qt to Linux versions and has also added mdp, MLTK and pytest.

Oh, you might like the Continuum Analytics link.

And the direct Anaconda link as well.

I expect people to go elsewhere after reading my analysis or finding a resource of interest.

Isn’t that what the web is supposed to be about?

Python Scientific Lecture Notes

Tuesday, December 4th, 2012

Python Scientific Lecture Notes edited by Valentin Haenel, Emmanuelle Gouillart and Gaël Varoquaux.

From the description:

Teaching material on the scientific Python ecosystem, a quick introduction to central tools and techniques. The different chapters each correspond to a 1 to 2 hours course with increasing level of expertise, from beginner to expert.

Coverage? Here is the top level of the table of contents:

1. Getting started with Python for science
1.1. Scientific computing with tools and workflow
1.2. The Python language
1.3. NumPy: creating and manipulating numerical data
1.4. Getting help and finding documentation
1.5. Matplotlib: plotting
1.6. Scipy : high-level scientific computing
2. Advanced topics
2.1. Advanced Python Constructs
2.2. Advanced Numpy
2.3. Debugging code
2.4. Optimizing code
2.5. Sparse Matrices in SciPy
2.6. Image manipulation and processing using Numpy and Scipy
2.7. Mathematical optimization: finding minima of functions
2.8. Traits
2.9. 3D plotting with Mayavi
2.10. Sympy : Symbolic Mathematics in Python
2.11. scikit-learn: machine learning in Python

The contents are available in single and double sided PDF, HTML and example files, plus source code.

I first saw this in a tweet from Scientific Python.

Face detection using Python and OpenCV

Saturday, November 17th, 2012

Face detection using Python and OpenCV by Paolo D’Incau.

From the post:

Most of the posts you will find in this blog are Erlang related (of course they are!), but sometimes I like writing also about my experiences at University of Trento as I am doing right now. During the last couple of years I have attended many courses about Computer Vision and Digital Signal Processing so today I would like to show you something about it.

In this post I will write about making some code for face detection purposes using python and OpenCV. This post will have no code, actually you can just grab my original code from here (the files needed are faces.py and haarcascade_frontalface_alt.xml).

Face detection is a computer technology that determines the locations and sizes of human faces in images or video. It detects facial features and ignores anything else, such as buildings, trees and bodies.

I can imagine any number of topic map applications that could use or be enhanced by face detection capabilities.

A Wordcloud in Python

Saturday, November 17th, 2012

A Wordcloud in Python by Andreas Mueller.

From the post:

Last week I was at Pycon DE, the German Python conference. After hacking on scikit-learn a lot last week, I decided to do something different on my way back, that I had planned for quite a while: doing a Wordle-like word cloud.

I know, word clouds are a bit out of style but I kind of like them any way. My motivation to think about word clouds was that I thought these could be combined with topic-models to give somewhat more interesting visualizations.

So I looked around to find a nice open-source implementation of word-clouds … only to find none. (This has been a while, maybe it has changed since).

“Andy” walks through the construction of a word cloud in Python.

Looking at his renderings, I think I know why I don’t appreciate word clouds as much as they deserve.

I am trying to “read” the words as text, not observing them in unknown relationships to each other.

Word clouds may work for you or your users and if they do, use them.

But be aware there are users who find them nearly useless.

Videos From PyData NYC

Wednesday, November 14th, 2012

Videos From PyData NYC

From the post:

If you weren’t able to attend PyData NYC, or would like another opportunity to watch a talk or tutorial, you now have the chance. Conference videos are posted on Vimeo at: https://vimeo.com/channels/pydata.

Four talks by the Continuum team were among the many great presentations at PyData. Be sure and check out their videos as well as the others.

Francesc Alted gave a tutorial on PyTables. He explained the basics of using HDF5 through PyTables and how it leverages HDF5 to allow Python to perform efficient computations over extremely large datasets that do not fit in memory.

Stephen Diehl spoke on Blaze, a next-generation NumPy sponsored by Continuum. It is designed as a foundational set of abstractions on which to build out-of-core and distributed algorithms. He explained how Blaze generalizes many of the ideas found in popular PyData projects such as Numpy, Pandas, and Theano into one generalized data-structure.

Hugo Shi and Travis Oliphant taught a tutorial on SciPy that included an overview of the modules that are most relevant for data analysis.

Stefan Urbanek presented, Python for Business Intelligence, an introduction to business intelligence, data warehousing and online analytical processing with Cubes.

A welcome alternative to the upcoming season of tawdry news conferences.

Analysis of the statistics blogosphere

Sunday, November 11th, 2012

Analysis of the statistics blogosphere by John Johnson.

From the post:

My analysis of the statistics blogosphere for the Coursera Social Networking Analysis class is up. The Python code and the data are up at my github repository. Enjoy!

Included are most of the Python code I used to obtain blog content, some of my attempts to automate the building of the network (I ended up using a manual process in the end), and my analysis. I also included the data. (You can probably see some of your own content.)

Excellent post on mining blog content.

A rich source of data for a topic map on the subject of your dreams.

Introducing Wakari

Sunday, November 11th, 2012

Introducing Wakari by Paddy Mullen.

From the post:

We are proud to introduce Wakari, our hosted Python data analysis environment.

We believe that programmers, scientists, and analysts should spend their time writing code, not working to setup a system. Data should be shareable, and analysis should be repeatable. We built Wakari to achieve these goals.

Sample Use Cases

We think Wakari will be useful for many people in all types of industries. Here are just three of the many use cases that Wakari will help with.

Learning Python

If you want to learn Python, Wakari is the perfect environment. Wakari makes it easy to start writing code immediately, without needing to install any software on your own computer. You will be able to show instructors your code and get feedback as to where you’re getting hung up.

Academia

If you’re an academic frustrated by setting up computing environments and annoyed that your colleagues can’t easily run your code, Wakari is made for you. Wakari handles all of the problems related to setting up a Python scientific computing environment. Because Wakari builds on Anaconda, useful libraries like SciKit, mpi4py and NumPy are right at your fingertips without compilation gymnastics.

Since you run code on our servers through a web browser, it is easy for your colleagues to re-run your code to repeat your analysis, or try out variations on their own. At Continuum, we understand that reproducibility is an important part of the scientific process that your results be consistent for reviewers and colleagues.

Finance

(graphic omitted)

For users who work in finance, Wakari lets you avoid the drudgery of emailing Excel files to share analysis, data, and visuals. Since data feeds are integrated into the Python environment, it is effortless to import financial data into your coding environment. When it is time to share results, you can email colleagues a URL that links to running code. Interactive charts are easy to create and share from Python. Since Wakari is built on top of Anaconda, great libraries like NumPy, Scipy, Matplotlib, and Pandas are already installed. Wakari includes support for Anaconda’s multiple environments, so you can easily change between versions of Python (including Python 3.3!) and versions of fundamental libraries.

Interesting in part because Wakari further blurs the distinction between “your” computer and the “host.”

If you are performing analysis on data (assuming a high speed connection), does it really matter if “your” computer is running the analysis or simply displaying the results from some remote host?

Not a completely new concept for those of you who remember desktops that booted from servers.

Interesting as well as a model for how authoring aids for topic maps (or at least their results) could be delivered to topic map authors.

Want a concordance of text at Y location? Enter the URI. Want other NLP routines? Choose from this list. Separate and apart from any authoring engine. (It’s called modularity.)
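
As a rough illustration of the kind of stand-alone routine meant here, the following is a minimal keyword-in-context concordance in Python, using only the standard library. The function name `concordance`, the `width` parameter and the sample text are all hypothetical, and fetching the text from a URI is left out; this is a sketch of the idea, not a description of any existing service.

```python
import re
from collections import defaultdict


def concordance(text, width=30):
    """Map each lowercased word to the contexts in which it occurs."""
    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z']+", text):
        # Take up to `width` characters on either side of the word.
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        # Collapse whitespace so each context prints on a single line.
        context = " ".join(text[start:end].split())
        index[match.group().lower()].append(context)
    return dict(index)


# Hypothetical sample text, standing in for a document fetched from a URI.
sample = "Topic maps merge subjects. Topic maps also record where subjects occur."
kwic = concordance(sample, width=12)
for line in kwic["topic"]:
    print(line)
```

Because the routine takes text in and hands a plain dictionary back, it could sit behind a URI and serve any authoring tool, which is the modularity point.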

CodernityDB [Origin of the 3 V's of Big Data]

Sunday, November 11th, 2012

CodernityDB

From the webpage:

CodernityDB pure python, NoSQL, fast database

CodernityDB is opensource, pure Python (no 3rd party dependency), fast (really fast check Speed if you don’t believe in words), multiplatform, schema-less, NoSQL database. It has optional support for HTTP server version (CodernityDB-HTTP), and also Python client library (CodernityDB-PyClient) that aims to be 100% compatible with embedded version.

“The hills are alive, with the sound of NoSQL databases….”

Sorry, I usually only sing in the shower. ;-)

I haven’t done a statistical survey (that may be in the offing) but it does seem like the stream of NoSQL databases continues unabated.

What I don’t know and you might: Has there always been a rumble of alternative databases, and looking makes them appear larger/more numerous? As in a side view mirror.

If we can discover what makes NoSQL databases popular now, that may apply to semantic integration.

I don’t buy the 3 V’s, Velocity, Volume, Variety, as an explanation for NoSQL database adoption.

Doug Laney, now of Gartner, Inc., then of Meta Group, coined that phrase in “3D Data Management: Controlling Data Volume, Velocity and Variety”, dated 6 February 2001:*



E-Commerce, in particular, has exploded data management challenges along three dimensions: volumes, velocity, and variety.


I don’t recall the current level of interest in NoSQL databases when we faced the same problems in 2001.

So what else has changed? (I don’t know or I would say.)

Comments/suggestions/pointers?


* I was alerted to the origin of the three V’s by a reference to Doug Laney by Stephen Swoyer in Big Data — Why the 3Vs Just Don’t Make Sense and then followed a reference in Big Data (Wikipedia) to find the link I reproduce above.