Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

December 28, 2012

IPython Notebook Viewer

Filed under: Programming,Python,Visualization — Patrick Durusau @ 8:08 pm

IPython Notebook Viewer

From the webpage:

A Simple way to share your IP[y]thon Notebook as Gists.

Share your own notebook, or browse others’

Scientific Python retweeted a post from Hilary Mason on the IPython Notebook Viewer so I had to go look.

For details on IPython and notebooks, see: IP[y]: IPython Interactive Computing:

IPython provides a rich toolkit to help you make the most out of using Python, with:

  • Powerful Python shells (terminal and Qt-based).
  • A web-based notebook with the same core features but support for code, text, mathematical expressions, inline plots and other rich media.
  • Support for interactive data visualization and use of GUI toolkits.
  • Flexible, embeddable interpreters to load into your own projects.
  • Easy to use, high performance tools for parallel computing.

As Hilary says in her tweet: “…one of the coolest things I’ve seen in a long time. It makes analysis more collaborative!”

Useful for exchanging data analysis.

Possibly a good lesson in what a “merging data by example” resource might look like.

Yes?

December 18, 2012

A Python Compiler for Big Data

Filed under: Arrays,Python — Patrick Durusau @ 6:41 am

A Python Compiler for Big Data by Stephen Diehl.

From the post:

Blaze is the next generation of NumPy, Python’s extremely popular array library. At Continuum Analytics we aim to tackle some of the hardest problems in large data analytics with our Python stack of Numba and Blaze, which together will form the basis of a distributed computation and storage system that is simultaneously able to generate optimized machine code specialized to the data being operated on.

Blaze aims to extend the structural properties of NumPy arrays to a wider variety of table and array-like structures that support commonly requested features such as missing values, type heterogeneity, and labeled arrays.

(images omitted)

Unlike NumPy, Blaze is designed to handle out-of-core computations on large datasets that exceed the system memory capacity, as well as on distributed and streaming data. Blaze is able to operate on datasets transparently as if they behaved like in-memory NumPy arrays.

We aim to allow analysts and scientists to productively write robust and efficient code, without getting bogged down in the details of how to distribute computation, or worse, how to transport and convert data between databases, formats, proprietary data warehouses, and other silos.

Just a thumbnail sketch but enough to get you interested in learning more.
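To see the gap Blaze is aiming at, remember that plain NumPy assumes everything fits in memory, and the usual workaround today is a memory-mapped array. A minimal sketch (plain NumPy, not Blaze, whose API was still in flux when this was written):

import numpy as np

# In-memory NumPy: fine until the array no longer fits in RAM.
a = np.random.rand(10**6)
print(a.mean())

# The usual out-of-core workaround: a memory-mapped array on disk.
# You still write ordinary array code; the OS pages chunks in and out.
m = np.memmap("big.dat", dtype="float64", mode="w+", shape=(10**7,))
m[:] = np.random.rand(10**7)
print(m.mean())

Blaze’s pitch is that the same expressions should keep working when the data is out-of-core, distributed, or streaming.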

December 15, 2012

Rosalind

Filed under: Bioinformatics,Python,Teaching — Patrick Durusau @ 8:16 pm

Rosalind

From the homepage:

Rosalind is a platform for learning bioinformatics through problem solving.

Rather than teaching topic maps from the “basics” forward, what about teaching problems for which topic maps are a likely solution?

And introduce syntax/practices as solutions to particular issues?

Suggestions for problems?

December 4, 2012

Continuum Unleashes Anaconda on Python Analytics Community

Filed under: Analytics,Python — Patrick Durusau @ 1:05 pm

Continuum Unleashes Anaconda on Python Analytics Community

From the post:

Python-based data analytics solutions and services company, Continuum Analytics, today announced the release of the latest version of Anaconda, its collection of libraries for Python that includes Numba Pro, IOPro and wiseRF all in one package.

Anaconda enables large-scale data management, analysis, and visualization for business intelligence, scientific analysis, engineering, machine learning, and more. The latest release, version 1.2.1, includes improved performance and feature enhancements for Numba Pro and IOPro.

Available for Windows, Mac OS X and Linux, Anaconda includes more than 80 popular numerical and scientific Python libraries used by scientists, engineers and data analysts, with a single integrated and flexible installer. The company says its goal is to seamlessly support switching between multiple versions of Python and other packages, via a “Python environments” feature that allows mixing and matching different versions of Python, Numpy and Scipy.

New features and upgrades in the latest version of Anaconda include performance and feature enhancements to Numba Pro and IOPro and an improved conda command; in addition, Continuum has added Qt to the Linux versions, along with mdp, MLTK and pytest.

Oh, you might like the Continuum Analytics link.

And the direct Anaconda link as well.

I expect people to go elsewhere after reading my analysis or finding a resource of interest.

Isn’t that what the web is supposed to be about?

Python Scientific Lecture Notes

Filed under: Programming,Python — Patrick Durusau @ 12:24 pm

Python Scientific Lecture Notes edited by Valentin Haenel, Emmanuelle Gouillart and Gaël Varoquaux.

From the description:

Teaching material on the scientific Python ecosystem, a quick introduction to central tools and techniques. The different chapters each correspond to a 1 to 2 hours course with increasing level of expertise, from beginner to expert.

Coverage? Here is the top level of the table of contents:

1. Getting started with Python for science
1.1. Scientific computing with tools and workflow
1.2. The Python language
1.3. NumPy: creating and manipulating numerical data
1.4. Getting help and finding documentation
1.5. Matplotlib: plotting
1.6. Scipy : high-level scientific computing
2. Advanced topics
2.1. Advanced Python Constructs
2.2. Advanced Numpy
2.3. Debugging code
2.4. Optimizing code
2.5. Sparse Matrices in SciPy
2.6. Image manipulation and processing using Numpy and Scipy
2.7. Mathematical optimization: finding minima of functions
2.8. Traits
2.9. 3D plotting with Mayavi
2.10. Sympy : Symbolic Mathematics in Python
2.11. scikit-learn: machine learning in Python

The contents are available in single and double sided PDF, HTML and example files, plus source code.

I first saw this in a tweet from Scientific Python.
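For a taste of the level the early chapters start at, here is the kind of NumPy/Matplotlib fragment the notes build up from (my own generic sketch, not an excerpt from the notes):

import numpy as np
import matplotlib.pyplot as plt

# Create and manipulate numerical data (chapter 1.3), then plot it (chapter 1.5).
x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x) * np.exp(-x / 5)

plt.plot(x, y, label="damped sine")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()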

November 17, 2012

Face detection using Python and OpenCV

Filed under: Face Detection,OpenCV,Python — Patrick Durusau @ 4:23 pm

Face detection using Python and OpenCV by Paolo D’Incau.

From the post:

Most of the posts you will find in this blog are Erlang related (of course they are!), but sometimes I like writing also about my experiences at University of Trento as I am doing right now. During the last couple of years I have attended many courses about Computer Vision and Digital Signal Processing so today I would like to show you something about it.

In this post I will write about making some code for face detection purposes using python and OpenCV. This post will have no code, actually you can just grab my original code from here (the files needed are faces.py and haarcascade_frontalface_alt.xml).

Face detection is a computer technology that determines the locations and sizes of human faces in images or video. It detects facial features and ignores anything else, such as buildings, trees and bodies.

I can imagine any number of topic map applications that could use or be enhanced by face detection capabilities.
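For the curious, the core of a Haar-cascade face detector in OpenCV’s cv2 bindings is only a few lines. A minimal sketch (not Paolo’s faces.py; the file names are placeholders):

import cv2

# Load the pretrained frontal-face cascade that ships with OpenCV.
cascade = cv2.CascadeClassifier("haarcascade_frontalface_alt.xml")

img = cv2.imread("group_photo.jpg")           # placeholder image path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # detection runs on grayscale

# Returns (x, y, width, height) rectangles, one per detected face.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("faces_marked.jpg", img)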

A Wordcloud in Python

Filed under: Python,Visualization,Word Cloud — Patrick Durusau @ 3:15 pm

A Wordcloud in Python by Andreas Mueller.

From the post:

Last week I was at PyCon DE, the German Python conference. After hacking on scikit-learn a lot last week, I decided to do something different on my way back, something I had planned for quite a while: doing a Wordle-like word cloud.

I know, word clouds are a bit out of style but I kind of like them any way. My motivation to think about word clouds was that I thought these could be combined with topic-models to give somewhat more interesting visualizations.

So I looked around to find a nice open-source implementation of word-clouds … only to find none. (This has been a while, maybe it has changed since).

“Andy” walks through the construction of a word cloud in Python.
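As far as I can tell, Andreas’s code later grew into the open-source word_cloud package. If you just want the end result, a minimal sketch of that package’s interface (my example, not code from his post):

from wordcloud import WordCloud

text = open("some_corpus.txt").read()   # placeholder text file

# Lay out the most frequent words; size encodes frequency.
wc = WordCloud(width=800, height=400, background_color="white").generate(text)
wc.to_file("cloud.png")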

Looking at his renderings, I think I know why I don’t appreciate word clouds as much as they deserve.

I am trying to “read” the words as text, not observing them in unknown relationships to each other.

Word clouds may work for you or your users and if they do, use them.

But be aware there are users who find them nearly useless.

November 14, 2012

Videos From PyData NYC

Filed under: Conferences,Python — Patrick Durusau @ 1:48 pm

Videos From PyData NYC

From the post:

If you weren’t able to attend PyData NYC, or would like another opportunity to watch a talk or tutorial, you now have the chance. Conference videos are posted on Vimeo at: https://vimeo.com/channels/pydata.

Four talks by the Continuum team were among the many great presentations at PyData. Be sure and check out their videos as well as the others.

Francesc Alted gave a tutorial on PyTables. He explained the basics of using HDF5 through PyTables and how it leverages HDF5 to allow Python to perform efficient computations over extremely large datasets that do not fit in memory.

Stephen Diehl spoke on Blaze, a next-generation NumPy sponsored by Continuum. It is designed as a foundational set of abstractions on which to build out-of-core and distributed algorithms. He explained how Blaze generalizes many of the ideas found in popular PyData projects such as Numpy, Pandas, and Theano into one generalized data-structure.

Hugo Shi and Travis Oliphant taught a tutorial on SciPy that included an overview of the modules that are most relevant for data analysis.

Stefan Urbanek presented Python for Business Intelligence, an introduction to business intelligence, data warehousing and online analytical processing with Cubes.

A welcome alternative to the upcoming season of tawdry news conferences.

November 11, 2012

Analysis of the statistics blogosphere

Filed under: Blogs,Data Mining,Python,Social Networks — Patrick Durusau @ 8:11 pm

Analysis of the statistics blogosphere by John Johnson.

From the post:

My analysis of the statistics blogosphere for the Coursera Social Networking Analysis class is up. The Python code and the data are up at my github repository. Enjoy!

Included are most of the Python code I used to obtain blog content, some of my attempts to automate the building of the network (I ended up using a manual process in the end), and my analysis. I also included the data. (You can probably see some of your own content.)

Excellent post on mining blog content.

A rich source of data for a topic map on the subject of your dreams.

Introducing Wakari

Filed under: Data Analysis,Programming,Python — Patrick Durusau @ 1:30 pm

Introducing Wakari by Paddy Mullen.

From the post:

We are proud to introduce Wakari, our hosted Python data analysis environment.

We believe that programmers, scientists, and analysts should spend their time writing code, not working to setup a system. Data should be shareable, and analysis should be repeatable. We built Wakari to achieve these goals.

Sample Use Cases

We think Wakari will be useful for many people in all types of industries. Here are just three of the many use cases that Wakari will help with.

Learning Python

If you want to learn Python, Wakari is the perfect environment. Wakari makes it easy to start writing code immediately, without needing to install any software on your own computer. You will be able to show instructors your code and get feedback as to where you’re getting hung up.

Academia

If you’re an academic frustrated by setting up computing environments and annoyed that your colleagues can’t easily run your code, Wakari is made for you. Wakari handles all of the problems related to setting up a Python scientific computing environment. Because Wakari builds on Anaconda, useful libraries like SciKit, mpi4py and NumPy are right at your fingertips without compilation gymnastics.

Since you run code on our servers through a web browser, it is easy for your colleagues to re-run your code to repeat your analysis, or try out variations on their own. At Continuum, we understand that reproducibility is an important part of the scientific process: your results must be consistent for reviewers and colleagues.

Finance

(graphic omitted)

For users who work in finance, Wakari lets you avoid the drudgery of emailing Excel files to share analysis, data, and visuals. Since data feeds are integrated into the Python environment, it is effortless to import financial data into your coding environment. When it is time to share results, you can email colleagues a URL that links to running code. Interactive charts are easy to create and share from Python. Since Wakari is built on top of Anaconda, great libraries like NumPy, Scipy, Matplotlib, and Pandas are already installed. Wakari includes support for Anaconda’s multiple environments, so you can easily change between versions of Python (including Python 3.3!) and versions of fundamental libraries.

Interesting in part because Wakari further blurs the distinction between “your” computer and the “host.”

If you are performing analysis on data (assuming a high speed connection), does it really matter if “your” computer is running the analysis or simply displaying the results from some remote host?

Not a completely new concept for those of you who remember desktops that booted from servers.

Interesting as well as a model for how authoring aids for topic maps could be delivered (or at least their results) to topic map authors.

Want a concordance of text at Y location? Enter the URI. Want other NLP routines? Choose from this list. Separate and apart from any authoring engine. (It’s called modularity.)

CodernityDB [Origin of the 3 V’s of Big Data]

Filed under: CodernityDB,NoSQL,Python — Patrick Durusau @ 10:06 am

CodernityDB

From the webpage:

CodernityDB: pure Python, NoSQL, fast database

CodernityDB is an open-source, pure Python (no 3rd party dependency), fast (really fast; check Speed if you don’t believe in words), multiplatform, schema-less, NoSQL database. It has optional support for an HTTP server version (CodernityDB-HTTP), and also a Python client library (CodernityDB-PyClient) that aims to be 100% compatible with the embedded version.
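Embedded usage, as I recall it from the project’s quick start, looks roughly like the sketch below. Treat the module path and method names as assumptions to check against the documentation:

from CodernityDB.database import Database

db = Database("/tmp/example_db")
db.create()                         # create the on-disk structures

# Schema-less inserts: plain Python dicts.
db.insert({"kind": "note", "text": "hello"})
db.insert({"kind": "note", "text": "world"})

# Iterate everything via the built-in 'id' index.
for record in db.all("id"):
    print(record)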

“The hills are alive, with the sound of NoSQL databases….”

Sorry, I usually only sing in the shower. 😉

I haven’t done a statistical survey (that may be in the offing) but it does seem like the stream of NoSQL databases continues unabated.

What I don’t know and you might: Has there always been a rumble of alternative databases, and does looking make them appear larger/more numerous? As in a side view mirror.

If we can discover what makes NoSQL databases popular now, that may apply to semantic integration.

I don’t buy the 3 V’s, Velocity, Volume, Variety, as an explanation for NoSQL database adoption.

Doug Laney, now of Gartner, Inc., then of Meta Group, coined that phrase in “3D Data Management: Controlling Data Volume, Velocity and Variety”, dated 6 February 2001:*



E-Commerce, in particular, has exploded data management challenges along three dimensions: volumes, velocity, and variety.


I don’t recall anything like the current level of interest in NoSQL databases when we faced the same problems in 2001.

So what else has changed? (I don’t know or I would say.)

Comments/suggestions/pointers?


* I was alerted to the origin of the three V’s by a reference to Doug Laney by Stephen Swoyer in Big Data — Why the 3Vs Just Don’t Make Sense and then followed a reference in Big Data (Wikipedia) to find the link I reproduce above.

Python interface to Stanford Core NLP tools v1.3.3

Filed under: Natural Language Processing,Python,Stanford NLP — Patrick Durusau @ 5:25 am

Python interface to Stanford Core NLP tools v1.3.3

From the README.md:

This is a Python wrapper for Stanford University’s NLP group’s Java-based CoreNLP tools. It can either be imported as a module or run as a JSON-RPC server. Because it uses many large trained models (requiring 3GB RAM on 64-bit machines and usually a few minutes loading time), most applications will probably want to run it as a server.

  • Python interface to Stanford CoreNLP tools: tagging, phrase-structure parsing, dependency parsing, named entity resolution, and coreference resolution.
  • Runs a JSON-RPC server that wraps the Java server and outputs JSON.
  • Outputs parse trees which can be used by nltk.

It requires pexpect and (optionally) unidecode to handle non-ASCII text. This script includes and uses code from jsonrpc and python-progressbar.

It runs the Stanford CoreNLP jar in a separate process, communicates with the java process using its command-line interface, and makes assumptions about the output of the parser in order to parse it into a Python dict object and transfer it using JSON. The parser will break if the output changes significantly, but it has been tested on Core NLP tools version 1.3.3 released 2012-07-09.

If you have NLP requirements and work in Python, this may be of interest.
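In-process use, going from the README, looks something like the sketch below. The class and method names are assumptions on my part; check the repository before relying on them:

# Launches the CoreNLP jars via pexpect, so expect roughly 3GB of RAM
# and a few minutes of model loading.
from corenlp import StanfordCoreNLP

nlp = StanfordCoreNLP()             # path to the CoreNLP install is configurable
result = nlp.parse("Stanford University is located in California.")
print(result)                       # JSON: tokens, POS tags, parse trees, coreference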

November 10, 2012

MDP – Modular toolkit for Data Processing

Filed under: Data Analysis,Python — Patrick Durusau @ 1:36 pm

MDP – Modular toolkit for Data Processing

From the webpage:

Modular toolkit for Data Processing (MDP) is a Python data processing framework.

From the user’s perspective, MDP is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures.

From the scientific developer’s perspective, MDP is a modular framework, which can easily be expanded. The implementation of new algorithms is easy and intuitive. The newly implemented units are then automatically integrated with the rest of the library.

The base of available algorithms is steadily increasing and includes signal processing methods (Principal Component Analysis, Independent Component Analysis, Slow Feature Analysis), manifold learning methods ([Hessian] Locally Linear Embedding), several classifiers, probabilistic methods (Factor Analysis, RBM), data pre-processing methods, and many others.

If you are using Python, you might want to give MDP a try.
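A minimal sketch of the “combine units into a flow” idea (my own toy example, not taken from the MDP documentation):

import numpy as np
import mdp

# 500 samples of 10-dimensional data.
x = np.random.RandomState(0).randn(500, 10)

# Chain two processing units into a feed-forward flow:
# PCA down to 5 dimensions, then fast ICA on the result.
flow = mdp.Flow([mdp.nodes.PCANode(output_dim=5),
                 mdp.nodes.FastICANode()])
flow.train(x)
y = flow(x)          # execute the trained flow
print(y.shape)       # (500, 5)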

I first saw this in a tweet by Chris@SocialTexture.

November 8, 2012

100% Big Data 0% Hadoop 0% Java

Filed under: BigData,Erjang,Hadoop,Python — Patrick Durusau @ 3:36 pm

100% Big Data 0% Hadoop 0% Java by Pavlo Baron.

If you are guessing Python and Erlang, take another cookie!

Not a lot of details (it’s slides) but take a look at: https://github.com/pavlobaron, Disco in particular.

Hadoop can only be improved by insights gained from alternative approaches.

Recall that we took the “one answer only” road in databases not long ago. Yes?

November 6, 2012

Galry

Filed under: Galry,Graphics,OpenGL,Python,Visualization — Patrick Durusau @ 3:52 pm

Galry – High-performance interactive 2D visualization in Python

From the webpage:

Galry is a high performance interactive 2D visualization package in Python. It lets you visualize and navigate into very large 2D plots (signals, points, textures…) in real time, by using the graphics card as much as possible (with OpenGL). On a 2012 computer with a recent 250$ graphics card, one can interactively visualize as much as 100 million points at a reasonable framerate.

Galry is not meant to generate high-quality plots (like matplotlib), and is more “low-level”. It can be used to write complex interactive visualization GUIs that deal with large 2D datasets (only with QT for now).

It is based on PyOpenGL and Numpy and is meant to work on any platform (Windows/Linux/MacOS). Mandatory dependencies include Python 2.7, Numpy, either PyQt4 or PySide, PyOpenGL, matplotlib.

Optional dependencies include IPython, hdf5, PyOpenCL (the last two are not currently used but may be in the future).

And:

Important note: Galry is still an experimental project with an unstable programming interface that is likely to change at any time. Do not use it in production yet.

If you need 2D visualization performance, take a look at Galry.

The demo video is impressive to say the least.

November 4, 2012

Atepassar Recommendations [social network recommender]

Filed under: MapReduce,Python,Recommendation — Patrick Durusau @ 3:56 pm

Atepassar Recommendations: Recommending friends with MapReduce and Python by Marcel Caraciolo.

From the post:

In this post I will present one of the techniques used at Atépassar, a Brazilian social network that helps students around Brazil pass the exams for civil service jobs: our recommender system.

(graphic omitted)

I will describe some of the data models that we use and discuss our approach to algorithmic innovation that combines offline machine learning with online testing. For this task we use distributed computing since we deal with over 140 thousand users. MapReduce is a powerful technique and we use it by writing Python code with the framework MrJob. I recommend you read further about it at my last post here.

One of our recommender techniques is the simple ‘people you might know‘ recommender algorithm. Indeed, there are several components behind the algorithm since at Atépassar, users can follow other people as well as be followed by other people. In this post I will talk about the basic idea of the algorithm, from which those other components can be derived. The idea of the algorithm is that if person A and person B do not know each other but they have a lot of mutual friends, then the system should recommend that they connect with each other.
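Stripped of the MapReduce plumbing, the “people you might know” core is just counting mutual friends between pairs who are not already connected. A toy sketch with made-up data (not Atépassar’s code):

from itertools import combinations
from collections import Counter

# Toy graph: user -> set of friends.
friends = {
    "ana":    {"bia", "carlos", "duda"},
    "bia":    {"ana", "carlos", "duda"},
    "carlos": {"ana", "bia"},
    "duda":   {"ana", "bia"},
}

# For every pair not already connected, count the friends they share.
suggestions = Counter()
for a, b in combinations(friends, 2):
    if b not in friends[a]:
        suggestions[(a, b)] = len(friends[a] & friends[b])

# Highest mutual-friend counts make the strongest suggestions.
for pair, mutual in suggestions.most_common():
    print(pair, mutual)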

Is there a presumption in social recommendation programs that there are no duplicate people in the network? Using different names? If two people have exactly the same friends, is there some chance they could be the same person?

How many “same” friends would you require? 20? 30? 50? Some other number?

Curious, because determining personal identity, and the identity of the people behind two or more entries, may be a matter of pattern matching.

BTW, this is an interesting looking blog. You may want to browse older entries or even subscribe.

October 31, 2012

LEARN programming by visualizing code execution

Filed under: Programming,Python,Visualization — Patrick Durusau @ 8:05 am

LEARN programming by visualizing code execution

From the webpage:

Online Python Tutor is a free educational tool that helps students overcome a fundamental barrier to learning programming: understanding what happens as the computer executes each line of a program’s source code. Using this tool, a teacher or student can write a Python program directly in the web browser and visualize what the computer is doing step-by-step as it executes the program.

Of immediate significance for anyone learning or teaching Python.

Longer range, something similar for merging data from different sources could be useful as well.

At its simplest, representing the percentage of information from particular sources by color, for the map or for items in the map. Or illustrating “what if we take away X?” as a source-type analysis.

I first saw this at Christophe Lalanne’s A bag of tweets / October 2012.

October 26, 2012

First Steps with NLTK

Filed under: Machine Learning,NLTK,Python — Patrick Durusau @ 3:18 pm

First Steps with NLTK by Sujit Pal.

From the post:

Most of what I know about NLP is as a byproduct of search, ie, find named entities in (medical) text and annotating them with concept IDs (ie node IDs in our taxonomy graph). My interest in NLP so far has been mostly as a user, like using OpenNLP to do POS tagging and chunking. I’ve been meaning to learn a bit more, and I did take the Stanford Natural Language Processing class from Coursera. It taught me a few things, but still not enough for me to actually see where a deeper knowledge would actually help me. Recently (over the past month and a half), I have been reading the NLTK Book and the NLTK Cookbook in an effort to learn more about NLTK, the Natural Language Toolkit for Python.

This is not the first time I’ve been through the NLTK book, but it is the first time I have tried working out all the examples and (some of) the exercises (available on GitHub here), and I feel I now understand the material a lot better than before. I also realize that there are parts of NLP that I can safely ignore at my (user) level, either because they are not that baked out yet or because their scope of applicability is rather narrow. In this post, I will describe what I learned, where NLTK shines, and what one can do with it.

You will find the structured listing of links into the NLTK PyDocs very useful.
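If you have never touched NLTK, the first steps really are this short (assuming the tokenizer and tagger models have been fetched with nltk.download()):

import nltk

sentence = "The Natural Language Toolkit makes tokenizing and tagging easy."

tokens = nltk.word_tokenize(sentence)   # ['The', 'Natural', 'Language', ...]
tagged = nltk.pos_tag(tokens)           # [('The', 'DT'), ...]

# Chunk the tagged tokens into a shallow parse of noun phrases.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(tagged))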

October 10, 2012

Explore Python, machine learning, and the NLTK library

Filed under: Machine Learning,NLTK,Python — Patrick Durusau @ 4:18 pm

Explore Python, machine learning, and the NLTK library by Chris Joakim (cjoakim@bellsouth.net), Senior Software Engineer, Primedia Inc.

From the post:

The challenge: Use machine learning to categorize RSS feeds

I was recently given the assignment to create an RSS feed categorization subsystem for a client. The goal was to read dozens or even hundreds of RSS feeds and automatically categorize their many articles into one of dozens of predefined subject areas. The content, navigation, and search functionality of the client website would be driven by the results of this daily automated feed retrieval and categorization.

The client suggested using machine learning, perhaps with Apache Mahout and Hadoop, as she had recently read articles about those technologies. Her development team and ours, however, are fluent in Ruby rather than Java™ technology. This article describes the technical journey, learning process, and ultimate implementation of a solution.

If a wholly automated publication process leaves you feeling uneasy, imagine the same system feeding content to subject matter experts for further processing.

Think of it as processing raw ore on the way to finding diamonds and then deciding which ones get polished.

October 1, 2012

Scikit-learn 0.12 released

Filed under: Machine Learning,Python,Scikit-Learn — Patrick Durusau @ 10:31 am

Scikit-learn 0.12 released by Andreas Mueller.

From the post:

Last night I uploaded the new version 0.12 of scikit-learn to pypi. Also the updated website is up and running and development now starts towards 0.13.

The new release has some nifty new features (see whatsnew):

  • Multidimensional scaling
  • Multi-Output random forests (like these)
  • Multi-task Lasso
  • More loss functions for ensemble methods and SGD
  • Better text feature extraction

Even so, the majority of changes in this release are somewhat “under the hood”.

Vlad developed and set up a continuous performance benchmark for the main algorithms during his Google Summer of Code. I am sure this will help improve performance.

There already has been a lot of work in improving performance, by Vlad, Immanuel, Gilles and others for this release.

Just in case you haven’t been keeping up with Scikit-learn.
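Multidimensional scaling, one of the additions, takes only a couple of lines to try (a toy example of mine, not from the release notes):

import numpy as np
from sklearn.manifold import MDS

# Embed 6-dimensional points into 2 dimensions while preserving distances.
rng = np.random.RandomState(0)
X = rng.rand(50, 6)

embedding = MDS(n_components=2, random_state=0).fit_transform(X)
print(embedding.shape)   # (50, 2)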

Troll Detection with Scikit-Learn

Filed under: Machine Learning,Python,Scikit-Learn — Patrick Durusau @ 9:52 am

Troll Detection with Scikit-Learn by Andreas Mueller.

I had thought that troll detection was one of those “field guide” sort of things:

(image of troll dolls omitted)

After reading Andreas’ post, apparently not. 😉

From the post:

Cross-post from Peekaboo, Andreas Mueller‘s computer vision and machine learning blog. This post documents his experience in the Impermium Detecting Insults in Social Commentary competition, but the rest of the blog is well worth a read, especially for those interested in computer vision and the Python scikit-learn and scikit-image libraries.

Recently I entered my first kaggle competition – for those who don’t know it, it is a site running machine learning competitions. A data set and time frame is provided and the best submission gets a money prize, often something between 5000$ and 50000$.

I found the approach quite interesting and could definitely use a new laptop, so I entered Detecting Insults in Social Commentary.

My weapon of choice was Python with scikit-learn – for those who haven’t read my blog before: I am one of the core devs of the project and never shut up about it.

During the competition I was visiting Microsoft Research, so this is where most of my time and energy went, in particular at the end of the competition, as it was also the end of my internship. And there was also the scikit-learn release in between. Maybe I can spend a bit more time on the next competition.

Disco [Erlang/Python – MapReduce]

Filed under: Disco,Erlang,MapReduce,Python — Patrick Durusau @ 9:16 am

Disco

From the webpage:

Disco is a distributed computing framework based on the MapReduce paradigm. Disco is open-source; developed by Nokia Research Center to solve real problems in handling massive amounts of data.

Disco is powerful and easy to use, thanks to Python. Disco distributes and replicates your data, and schedules your jobs efficiently. Disco even includes the tools you need to index billions of data points and query them in real-time.

Install Disco on your laptop, cluster or cloud of choice and become a part of the Disco community!

I rather like the MapReduce graphic you will see at About.

I first saw this in Guido Kollerie’s post on the recent Python users meeting in the Netherlands. Guido details his 5 minute presentation on Disco.
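The canonical Disco example is a word count whose map and reduce are plain Python functions, adapted here from the Disco tutorial (check the current docs for the exact input URL and API):

from disco.core import Job, result_iterator

def wc_map(line, params):
    # Emit (word, 1) for every word on the line.
    for word in line.split():
        yield word, 1

def wc_reduce(iter, params):
    # Group the sorted stream by word and sum the counts.
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == "__main__":
    job = Job().run(input=["http://discoproject.org/media/text/chekhov.txt"],
                    map=wc_map, reduce=wc_reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print(word, count)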

September 6, 2012

Ternary graph isomorphism in polynomial time, after Luks

Filed under: Graphs,Networks,Python — Patrick Durusau @ 4:04 pm

Ternary graph isomorphism in polynomial time, after Luks by Adria Alcala Mena and Francesc Rossello.

Abstract:

The graph isomorphism problem has a long history in mathematics and computer science, with applications in computational chemistry and biology, and it is believed to be neither solvable in polynomial time nor NP-complete. E. Luks proposed in 1982 the best algorithm so far for the solution of this problem, which moreover runs in polynomial time if an upper bound for the degrees of the nodes in the graphs is taken as a constant. Unfortunately, Luks’ algorithm is purely theoretical, very difficult to use in practice, and, in particular, we have not been able to find any implementation of it in the literature. The main goal of this paper is to present an efficient implementation of this algorithm for ternary graphs in the SAGE system, as well as an adaptation to fully resolved rooted phylogenetic networks on a given set of taxa.

Building on his master’s thesis, Adria focuses on implementation issues of Luks’ graph isomorphism algorithm.

August 6, 2012

r3 redistribute reduce reuse

Filed under: MapReduce,Python,Redis — Patrick Durusau @ 10:30 am

r3 redistribute reduce reuse

From the project homepage:

r³ is a map-reduce engine written in python using redis as a backend

r³ is a map reduce engine written in python using a redis backend. Its purpose is to be simple.

r³ has only three concepts to grasp: input streams, mappers and reducers.

You need to visit this project. It is simple, efficient and effective.

I found this by following r³ – A quick demo of usage, which I came across at Demoing the Python-Based Map-Reduce R3 Against GitHub Data, on Alex Popescu’s myNoSQL.

July 29, 2012

AstroPython

Filed under: Astroinformatics,Python — Patrick Durusau @ 3:09 pm

AstroPython

From the webpage:

The purpose of this web site is to act as a community knowledge base for performing astronomy research with Python. It provides lists of useful resources, a forum for general discussion, advice, or relevant news items, collecting users’ code snippets or scripts, and longer tutorials on specific topics. The topics within these pages are presented in a list view with the ability to sort by date or topic. A traditional “blog” view of the most recently posted topics is visible from the site Home page.

Along with the other astronomy applications I have mentioned this weekend I thought you might find this useful.

Skills with Python, data processing, and subject identification/mapping transfer across disciplines.

July 27, 2012

Anaconda: Scalable Python Computing

Filed under: Anaconda,Data Analysis,Machine Learning,Python,Statistics — Patrick Durusau @ 10:19 am

Anaconda: Scalable Python Computing

Easy, Scalable Distributed Data Analysis

Anaconda is a distribution that combines the most popular Python packages for data analysis, statistics, and machine learning. It has several tools for a variety of types of cluster computations, including MapReduce batch jobs, interactive parallelism, and MPI.

All of the packages in Anaconda are built, tested, and supported by Continuum. Having a unified runtime for distributed data analysis makes it easier for the broader community to share code, examples, and best practices — without getting tangled in a mess of versions and dependencies.

Good way to avoid dependency issues!

On scaling, I am reminded of a developer who designed a Python application to require upgrading for “heavy” use. Much to their disappointment, Python scaled under “heavy” use with no need for an upgrade. 😉

I saw this in Christophe Lalanne’s Bag of Tweets for July 2012.

July 20, 2012

The Art of Social Media Analysis with Twitter and Python

Filed under: Python,Social Graphs,Social Media,Tweets — Patrick Durusau @ 4:59 am

The Art of Social Media Analysis with Twitter and Python by Krishna Sankar.

All that social media data in your topic map has to come from somewhere. 😉

Covers both the basics of the Twitter API and social graph analysis. With code of course.

I first saw this at KDNuggets.

July 9, 2012

Hadoop Streaming Made Simple using Joins and Keys with Python

Filed under: Hadoop,Python,Stream Analytics — Patrick Durusau @ 10:48 am

Hadoop Streaming Made Simple using Joins and Keys with Python

From the post:

There are a lot of different ways to write MapReduce jobs!!!

Sample code for this post https://github.com/joestein/amaunet

I find streaming scripts a good way to interrogate data sets (especially when I have not worked with them yet or am creating new ones) and enjoy the lifecycle when the initial elaboration of the data sets leads to the construction of the finalized scripts for an entire job (or series of jobs as is often the case).

When doing streaming with Hadoop you do have a few library options. If you are a Ruby programmer then wukong is awesome! For Python programmers you can use dumbo and more recently released mrjob.

I like working under the hood myself and getting down and dirty with the data and here is how you can too.
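The streaming contract itself is tiny: a mapper reads lines from stdin and writes tab-separated key/value pairs, Hadoop sorts by key, and a reducer reads the sorted stream. A minimal pair of scripts, a generic word count rather than Joe’s join example:

# --- mapper.py : emit "word<TAB>1" for each word on stdin ---
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

# --- reducer.py : sum the counts for each key of the sorted stream ---
import sys
from itertools import groupby

def parse(stream):
    for line in stream:
        key, value = line.rstrip("\n").split("\t", 1)
        yield key, int(value)

for key, group in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
    print("%s\t%d" % (key, sum(v for _, v in group)))

You hand both scripts to the hadoop-streaming jar with -mapper, -reducer, -input and -output; the same scripts also run locally as cat input.txt | python mapper.py | sort | python reducer.py.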

Interesting post and good tips on data exploration. Can’t really query/process the unknown.

Suggestions of other data exploration examples? (Not so much processing the known but looking to “learn” about data sources.)

July 6, 2012

BigMl 0.3.1 Release

Filed under: Machine Learning,Predictive Analytics,Python — Patrick Durusau @ 9:45 am

BigMl 0.3.1 Release

From the webpage:

An open source binding to BigML.io, the public BigML API


BigML makes machine learning easy by taking care of the details required to add data-driven decisions and predictive power to your company. Unlike other machine learning services, BigML creates beautiful predictive models that can be easily understood and interacted with.

These BigML Python bindings allow you to interact with BigML.io, the API for BigML. You can use it to easily create, retrieve, list, update, and delete BigML resources (i.e., sources, datasets, models, and predictions).
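The bindings follow the source, dataset, model, prediction flow of the API. Roughly, going from the project’s README as I remember it (the calls below are my recollection, and you need BIGML_USERNAME/BIGML_API_KEY credentials set):

from bigml.api import BigML

api = BigML()   # reads BIGML_USERNAME and BIGML_API_KEY from the environment

# The standard flow: raw data -> dataset -> model -> prediction.
source = api.create_source("./data/iris.csv")
dataset = api.create_dataset(source)
model = api.create_model(dataset)
prediction = api.create_prediction(model, {"sepal length": 5, "sepal width": 2.5})
print(prediction)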

There’s that phrase again, predictive models.

Don’t people read patent literature anymore? 😉 I don’t care for absurdist fiction so I tend to avoid it. People claiming invention for having a patent lawyer write common art up in legal prose. Good for patent lawyers, bad for researchers and true inventors.

June 9, 2012

Hadoop Streaming Support for MongoDB

Filed under: Hadoop,Javascript,MapReduce,MongoDB,Python,Ruby — Patrick Durusau @ 7:13 pm

Hadoop Streaming Support for MongoDB

From the post:

MongoDB has some native data processing tools, such as the built-in Javascript-oriented MapReduce framework, and a new Aggregation Framework in MongoDB v2.2. That said, there will always be a need to decouple persistence and computational layers when working with Big Data.

Enter MongoDB+Hadoop: an adapter that allows Apache’s Hadoop platform to integrate with MongoDB.

[graphic omitted]

Using this adapter, it is possible to use MongoDB as a real-time datastore for your application while shifting large aggregation, batch processing, and ETL workloads to a platform better suited for the task.

[graphic omitted]

Well, the engineers at 10gen have taken it one step further with the introduction of the streaming assembly for Mongo-Hadoop.

What does all that mean?

The streaming assembly lets you write MapReduce jobs in languages like Python, Ruby, and JavaScript instead of Java, making it easy for developers that are familiar with MongoDB and popular dynamic programing languages to leverage the power of Hadoop.

I like that, “…popular dynamic programming languages…” 😉

Any improvement that increases usability without requiring religious conversion (using a programming language that is not your favorite) is a good move.

