Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

July 11, 2013

Bayesian Methods for Hackers

Bayesian Methods for Hackers by a community of authors!

From the readme:

The Bayesian method is the natural approach to inference, yet it is hidden from readers behind chapters of slow, mathematical analysis. The typical text on Bayesian inference involves two to three chapters on probability theory, then enters what Bayesian inference is. Unfortunately, due to mathematical intractability of most Bayesian models, the reader is only shown simple, artificial examples. This can leave the user with a so-what feeling about Bayesian inference. In fact, this was the author’s own prior opinion.

After some recent success of Bayesian methods in machine-learning competitions, I decided to investigate the subject again. Even with my mathematical background, it took me three straight-days of reading examples and trying to put the pieces together to understand the methods. There was simply not enough literature bridging theory to practice. The problem with my misunderstanding was the disconnect between Bayesian mathematics and probabilistic programming. That being said, I suffered then so the reader would not have to now. This book attempts to bridge the gap.

If Bayesian inference is the destination, then mathematical analysis is a particular path towards it. On the other hand, computing power is cheap enough that we can afford to take an alternate route via probabilistic programming. The latter path is much more useful, as it denies the necessity of mathematical intervention at each step, that is, we remove often-intractable mathematical analysis as a prerequisite to Bayesian inference. Simply put, this latter computational path proceeds via small intermediate jumps from beginning to end, whereas the first path proceeds by enormous leaps, often landing far away from our target. Furthermore, without a strong mathematical background, the analysis required by the first path cannot even take place.

Bayesian Methods for Hackers is designed as an introduction to Bayesian inference from a computational/understanding-first, and mathematics-second, point of view. Of course as an introductory book, we can only leave it at that: an introductory book. For the mathematically trained, they may cure the curiosity this text generates with other texts designed with mathematical analysis in mind. For the enthusiast with less mathematical-background, or one who is not interested in the mathematics but simply the practice of Bayesian methods, this text should be sufficient and entertaining.

(…)

Useful in case all the knowledge you want to put in a topic map is far from certain. 😉
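
To get a feel for the computational path the authors describe, here is a minimal sketch (mine, not from the book) using PyMC, the probabilistic programming library the book builds on: estimating a coin's bias from a handful of flips, written against the PyMC 2 API that was current when the book was written.

    import numpy as np
    import pymc as pm

    data = np.array([1, 0, 1, 1, 0, 1, 1, 1])        # 6 heads in 8 flips

    p = pm.Uniform("p", 0, 1)                         # prior belief about the bias
    obs = pm.Bernoulli("obs", p, value=data, observed=True)

    mcmc = pm.MCMC([p, obs])
    mcmc.sample(20000, burn=5000)                     # draw posterior samples
    print(mcmc.trace("p")[:].mean())                  # posterior mean, roughly 0.7

No calculus required, just sampling; that is the trade the quoted passage is describing.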

July 2, 2013

Running Python and R inside Emacs

Filed under: Editor,Programming,Python,R — Patrick Durusau @ 2:45 pm

Running Python and R inside Emacs by John D. Cook.

From the post:

Emacs org-mode lets you manage blocks of source code inside a text file. You can execute these blocks and have the output display in your text file. Or you could export the file, say to HTML or PDF, and show the code and/or the results of executing the code.

Here I’ll show some of the most basic possibilities. For much more information, see orgmode.org. And for the use of org-mode in research, see A Multi-Language Computing Environment for Literate Programming and Reproducible Research.

Not recent (2012) but looks quite interesting.

Well, you have to already like Emacs! 😉

Follow John’s post for basic usage and, if you like it, check out orgmode.org.
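
For a concrete taste, a minimal org file with an executable Python source block looks like this (assuming org-babel's Python support has been enabled via org-babel-load-languages):

    #+BEGIN_SRC python :results output
    import math
    # print the golden ratio to five decimal places
    print("%.5f" % ((1 + math.sqrt(5)) / 2))
    #+END_SRC

    #+RESULTS:
    : 1.61803

Pressing C-c C-c inside the block runs it and inserts the #+RESULTS: section you see above; exporting to HTML or PDF carries both the code and the results along.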

Glue

Filed under: Python,Visualization — Patrick Durusau @ 9:50 am

Glue: multidimensional data exploration.

From the webpage:

Glue is a Python library to explore relationships within and among related datasets. Its main features include:

  • Linked Statistical Graphics. With Glue, users can create scatter plots, histograms and images (2D and 3D) of their data. Glue is focused on the brushing and linking paradigm, where selections in any graph propagate to all others.
  • Flexible linking across data. Glue uses the logical links that exist between different data sets to overlay visualizations of different data, and to propagate selections across data sets. These links are specified by the user, and are arbitrarily flexible.
  • Full scripting capability. Glue is written in Python, and built on top of its standard scientific libraries (i.e., Numpy, Matplotlib, Scipy). Users can easily integrate their own python code for data input, cleaning, and analysis.
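
Scripting-wise, the quickest way in (a sketch on my part, assuming Glue's documented qglue convenience function; the data and column names are made up) is to hand it related tables that share an identifier:

    import numpy as np
    import pandas as pd
    from glue import qglue   # convenience entry point from the Glue docs

    catalog = pd.DataFrame({"id": np.arange(100),
                            "ra": np.random.uniform(0, 360, 100),
                            "dec": np.random.uniform(-90, 90, 100)})
    fluxes = pd.DataFrame({"id": np.arange(100),
                           "flux": np.random.lognormal(size=100)})

    # Opens the Glue application with both tables loaded; the id-to-id
    # link between them is then declared in the GUI's link editor.
    qglue(catalog=catalog, fluxes=fluxes)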

There is a series of videos by Chris Beaumont on Glue:

What is Glue?

Getting Started with Glue

Glue FAQ: How do I overplot a catalog on an image?

Linking Data in Glue

Glue Demo: World Wide Telescope

I like Glue because of its use of astronomy data for examples but it isn’t limited to astronomical data.

From the FAQ:

What data formats does Glue understand?

Glue relies on several libraries to parse different file formats:

  • Astropy for FITS images and tables, a variety of ascii table formats, and VO tables.
  • scikit-image to read popular image formats like .jpeg and .tiff
  • h5py to read HDF5 files
  • If Glue’s predefined data loaders don’t fit your needs, you can also write your own loader and plug it into Glue.

    Searching for particular information or data is one task.

    Exploring a data set to see what you may encounter is another.

    What data sets do you want to explore with Glue?

    I first saw this in Christophe Lalanne’s A bag of tweets / June 2013.

    PS: The mapping function in “Getting Started With Glue” is particularly interesting. What mapping function will you plug in?

    June 30, 2013

    SciPy2013 Videos

    Filed under: Python,Scikit-Learn,Statistics — Patrick Durusau @ 6:13 pm

    SciPy2013 Videos

    A really nice set of videos, including tutorials, from SciPy2013.

    Due to the limitations of YouTube, the listing is a mess.

    If I have time later this week I will try to produce a cleaned up listing.

    In the meantime, enjoy!

    June 21, 2013

    Graphillion

    Filed under: Graphillion,Graphs,Networks,Python — Patrick Durusau @ 6:07 pm

    Graphillion

    From the webpage:

    Graphillion is a Python library for efficient graphset operations. Unlike existing graph tools such as NetworkX, which are designed to manipulate just a single graph at a time, Graphillion handles a large set of graphs very efficiently. Surprisingly, trillions of trillions of graphs can be processed on a single computer with Graphillion.

    You may be curious about an uncommon concept of graphset, but it comes along with any graph or network when you consider multiple subgraphs cut from the graph; e.g., considering possible driving routes on a road map, examining feasible electric flows on a power grid, or evaluating the structure of chemical reaction networks. The number of such subgraphs can be trillions even in a graph with just a few hundreds of edges, since subgraphs increase exponentially with the graph size. It takes millions of years to examine all subgraphs with a naive approach as demonstrated in the funny movie above; Graphillion is our answer to resolve this issue.

    Graphillion allows you to exhaustively but efficiently search a graphset with complex, even nonconvex, constraints. In addition, you can find top-k optimal graphs from the complex graphset, and can also extract common properties among all graphs in the set. Thanks to these features, Graphillion has a variety of applications including graph database, combinatorial optimization, and a graph structure analysis. We will show some practical use cases in the following tutorial, including evaluation of power distribution networks.

    Just skimming the tutorial, this looks way cool!
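
    The README-style usage is compact; a sketch on a toy 2x2 grid (edges listed by hand, so no tutorial helpers are needed):

        from graphillion import GraphSet

        # Universe: a 2x2 grid with vertices 1..4.
        GraphSet.set_universe([(1, 2), (1, 3), (2, 4), (3, 4)])

        # Every simple path from vertex 1 to vertex 4, stored as one compressed graphset.
        paths = GraphSet.paths(1, 4)
        print(len(paths))      # 2 on this grid
        for p in paths:
            print(p)           # each path comes back as a list of edges

    The same object works at the trillions-of-subgraphs scale the authors mention because the set is never expanded into an explicit list.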

    Be sure to check out the references:

    • Takeru Inoue, Hiroaki Iwashita, Jun Kawahara, and Shin-ichi Minato: “Graphillion: Software Library Designed for Very Large Sets of Graphs in Python,” Hokkaido University, Division of Computer Science, TCS Technical Reports, TCS-TR-A-13-65, June 2013.
      (pdf)
    • Takeru Inoue, Keiji Takano, Takayuki Watanabe, Jun Kawahara, Ryo Yoshinaka, Akihiro Kishimoto, Koji Tsuda, Shin-ichi Minato, and Yasuhiro Hayashi, “Loss Minimization of Power Distribution Networks with Guaranteed Error Bound,” Hokkaido University, Division of Computer Science, TCS Technical Reports, TCS-TR-A-12-59, 2012. (pdf)
    • Ryo Yoshinaka, Toshiki Saitoh, Jun Kawahara, Koji Tsuruma, Hiroaki Iwashita, and Shin-ichi Minato, “Finding All Solutions and Instances of Numberlink and Slitherlink by ZDDs,” Algorithms 2012, 5(2), pp.176-213, 2012. (doi)
    • DNET – Distribution Network Evaluation Tool

    I first saw this in a tweet by David Gutelius.

    June 15, 2013

    Indexing web sites in Solr with Python

    Filed under: Indexing,Python,Solr — Patrick Durusau @ 3:44 pm

    Indexing web sites in Solr with Python by Martijn Koster.

    From the post:

    In this post I will show a simple yet effective way of indexing web sites into a Solr index, using Scrapy and Python.

    We see a lot of advanced Solr-based applications, with sophisticated custom data pipelines that combine data from multiple sources, or that have large scale requirements. Equally we often see people who want to start implementing search in a minimally-invasive way, using existing websites as integration points rather than implementing a deep integration with particular CMSes or databases which may be maintained by other groups in an organisation. While crawling websites sounds fairly basic, you soon find that there are gotchas, with the mechanics of crawling, but more importantly, with the structure of websites. If you simply parse the HTML and index the text, you will index a lot of text that is not actually relevant to the page: navigation sections, headers and footers, ads, links to related pages. Trying to clean that up afterwards is often not effective; you’re much better off preventing that cruft going into the index in the first place. That involves parsing the content of the web page, and extracting information intelligently. And there’s a great tool for doing this: Scrapy. In this post I will give a simple example of its use. See Scrapy’s tutorial for an introduction and further information.
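
    The shape of such a spider is small enough to sketch here (my sketch, not Martijn's code: it uses current Scrapy conventions rather than the 2013-era API, pysolr as the Solr client, and placeholder selectors you would swap for your site's real content markup):

        import pysolr
        import scrapy

        solr = pysolr.Solr("http://localhost:8983/solr/mysite")

        class SiteSpider(scrapy.Spider):
            name = "site"
            start_urls = ["http://www.example.com/"]

            def parse(self, response):
                # Index only the main content area, not navigation/footer cruft.
                doc = {
                    "id": response.url,
                    "title": response.css("title::text").get(),
                    "content": " ".join(response.css("div.content ::text").getall()),
                }
                solr.add([doc], commit=True)
                # Follow in-site links and keep crawling.
                for href in response.css("a::attr(href)").getall():
                    yield response.follow(href, callback=self.parse)

    Run it with scrapy runspider; the cruft-avoidance Martijn describes lives in how narrowly you write those selectors.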

    Good practice with Solr, not to mention your search activities are yours to keep private if you like. 😉

    June 6, 2013

    edX Code

    Filed under: Education,Python — Patrick Durusau @ 2:28 pm

    edX Code

    From the homepage:

    Welcome to edX Code, where developers around the globe are working to create a next-generation online learning platform that will bring quality education to students around the world.

    EdX is a not-for-profit enterprise composed of 27 leading global institutions, the xConsortium. Since our founding in May 2012, edX has been committed to an open source vision. We believe in pursuing non-profit, open-source opportunities for expanding online education around the world. We believe it’s important to support these efforts in visible and substantive ways, and that’s why we are opening up our platform and inviting the world to help us make it better.

    If you think topic maps are relevant to education, then they should be relevant to online education.

    Yes?

    A standalone topic map application is not needed in this context, but I don’t recall any standalone application requirement.

    I first saw this at: edX learning platform now all open source.

    June 5, 2013

    Trends in Machine Learning [SciPy]

    Filed under: Machine Learning,Python — Patrick Durusau @ 8:11 am

    Trends in Machine Learning by Olivier Grisel.

    Slides from presentation at Paris DataGeeks 2013.

    Focus is on Python and SciPy.

    Covers probabilistic programming, deep learning, and has links at the end.

    Good way to check your currency on machine learning with Python.

    June 2, 2013

    MapReduce with Python and mrjob on Amazon EMR

    Filed under: Amazon EMR,MapReduce,Natural Language Processing,Python — Patrick Durusau @ 10:59 am

    MapReduce with Python and mrjob on Amazon EMR by Sujit Pal.

    From the post:

    I’ve been doing the Introduction to Data Science course on Coursera, and one of the assignments involved writing and running some Pig scripts on Amazon Elastic Map Reduce (EMR). I’ve used EMR in the past, but have avoided it ever since I got burned pretty badly for leaving it on. Being required to use it was a good thing, since I got over the inertia and also saw how much nicer the user interface had become since I last saw it.

    I was doing another (this time Python based) project for the same class, and figured it would be educational to figure out how to run Python code on EMR. From a quick search on the Internet, mrjob from Yelp appeared to be the one to use on EMR, so I wrote my code using mrjob.

    The code reads an input file of sentences, and builds up trigram, bigram and unigram counts of the words in the sentences. It also normalizes the text, lowercasing, replacing numbers and stopwords with placeholder tokens, and Porter stemming the remaining words. Here’s the code; as you can see, it’s fairly straightforward:
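
    Sujit's full script is in the post; the skeleton of an mrjob job is worth seeing on its own, so here is a stripped-down sketch (a plain unigram counter, without his normalization or bigram/trigram passes):

        from mrjob.job import MRJob

        class MRWordFreq(MRJob):
            """Emit (word, count) pairs from lines of text."""

            def mapper(self, _, line):
                for word in line.lower().split():
                    yield word, 1

            def reducer(self, word, counts):
                yield word, sum(counts)

        if __name__ == "__main__":
            MRWordFreq.run()

    Locally that runs as python wordfreq.py input.txt; the same file runs on EMR with -r emr once your AWS credentials are in mrjob's configuration.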

    Knowing how to exit and confirm exit from a cloud service are the first things to learn about a cloud system.

    May 30, 2013

    Stepping up to Big Data with R and Python…

    Filed under: BigData,Python,R — Patrick Durusau @ 2:08 pm

    Stepping up to Big Data with R and Python: A Mind Map of All the Packages You Will Ever Need by Abhijit Dasgupta.

    From the post:

    On May 8, we kicked off the transformation of R Users DC to Statistical Programming DC (SPDC) with a meetup at iStrategyLabs in Dupont Circle. The meetup, titled “Stepping up to big data with R and Python,” was an experiment in collective learning as Marck and I guided a lively discussion of strategies to leverage the “traditional” analytics stack in R and Python to work with big data.

    [images omitted]

    R and Python are two of the most popular open-source programming languages for data analysis. R developed as a statistical programming language with a large ecosystem of user-contributed packages (over 4500, as of 4/26/2013) aimed at a variety of statistical and data mining tasks. Python is a general programming language with an increasingly mature set of packages for data manipulation and analysis. Both languages have their pros and cons for data analysis, which have been discussed elsewhere, but each is powerful in its own right. Both Marck and I have used R and Python in different situations where each has brought something different to the table. However, since both ecosystems are very large, we didn’t even try to cover everything, and we didn’t believe that any one or two people could cover all the available tools. We left it to our attendees (and to you, our readers) to fill in the blanks with favorite tools in R and Python for particular data analytic tasks.

    See the post for links to preliminary maps of the two ecosystems.

    I like the maps but the background seems distracting.

    You?

    May 24, 2013

    Wakari.IO Web-based Python Data Analysis

    Filed under: Data Analysis,Python — Patrick Durusau @ 6:20 pm

    Wakari.IO Web-based Python Data Analysis

    From: Continuum Analytics Launches Full-Featured, In-Browser Data Analytics Environment by Corinna Bahr.

    Continuum Analytics, the premier provider of Python-based data analytics solutions and services, today announced the release of Wakari version 1.0, an easy-to-use, cloud-based, collaborative Python environment for analyzing, exploring and visualizing large data sets.

    Hosted on Amazon’s Elastic Compute Cloud (EC2), Wakari gives users the ability to share analyses and results via IPython notebook, visualize with Matplotlib, easily switch between multiple versions of Python and its scientific libraries, and quickly collaborate on analyses without having to download data locally to their laptops or workstations. Users can share code and results as simple web URLs, from which other users can easily create their own copies to modify and explore.

    Previously in beta, the version 1.0 release of Wakari boasts a number of new features, including:

    • Premium access to SSH, ipcluster configuration, and the full range of Amazon compute nodes and clusters via a drop-down menu
    • Enhanced IPython notebook support, most notably an IPython notebook gallery and an improved UI for sharing
    • Bundles for simplified sharing of files, folders, and Python library dependencies
    • Expanded Wakari documentation
    • Numerous enhancements to the user interface

    This looks quite interesting. There is a free option if you are undecided.

    I first saw this at: Wakari: Continuum In-Browser Data Analytics Environment.

    May 11, 2013

    GPU Scripting and Code Generation with PyCUDA

    Filed under: CUDA,GPU,Python — Patrick Durusau @ 10:47 am

    GPU Scripting and Code Generation with PyCUDA by Andreas Klöckner, Nicolas Pinto, Bryan Catanzaro, Yunsup Lee, Paul Ivanov, Ahmed Fasih.

    Abstract:

    High-level scripting languages are in many ways polar opposites to GPUs. GPUs are highly parallel, subject to hardware subtleties, and designed for maximum throughput, and they offer a tremendous advance in the performance achievable for a significant number of computational problems. On the other hand, scripting languages such as Python favor ease of use over computational speed and do not generally emphasize parallelism. PyCUDA is a package that attempts to join the two together. This chapter argues that in doing so, a programming environment is created that is greater than just the sum of its two parts. We would like to note that nearly all of this chapter applies in unmodified form to PyOpenCL, a sister project of PyCUDA, whose goal it is to realize the same concepts as PyCUDA for OpenCL.
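
    The flavor of that joining is easiest to see in the canonical example from PyCUDA's documentation: CUDA C lives in a Python string, gets compiled at runtime, and numpy arrays cross the boundary.

        import numpy as np
        import pycuda.autoinit                  # sets up the CUDA context
        import pycuda.driver as drv
        from pycuda.compiler import SourceModule

        # CUDA C compiled on the fly.
        mod = SourceModule("""
        __global__ void double_them(float *a)
        {
            int idx = threadIdx.x + blockIdx.x * blockDim.x;
            a[idx] *= 2.0f;
        }
        """)
        double_them = mod.get_function("double_them")

        a = np.random.randn(400).astype(np.float32)
        double_them(drv.InOut(a), block=(400, 1, 1), grid=(1, 1))   # copy in, run, copy back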

    The authors argue that while measurements of the productivity gains from PyCUDA are missing, widespread use of PyCUDA is an indication of its usefulness.

    Point taken.

    More importantly, in my view, is PyCUDA’s potential to make use of GPUs more widespread.

    Widespread use will uncover better algorithms, data structures, appropriate problems for GPUs, etc., potentially more quickly than occasional use.

    April 16, 2013

    Hacking Secret Ciphers with Python

    Filed under: Cryptography,Python — Patrick Durusau @ 6:40 pm

    “Hacking Secret Ciphers with Python” Released by Al Sweigart.

    From the post:

    My third book, Hacking Secret Ciphers with Python, is finished. It is free to download under a Creative Commons license, and available for purchase as a physical book on Amazon for $25 (which qualifies it for free shipping). This book is aimed at people who have no experience programming or with cryptography. The book goes through writing Python programs that not only implement several ciphers but also can hack these ciphers.

    100% of the proceeds from the book sales will be donated to the Electronic Frontier Foundation, Creative Commons, and The Tor Project.

    This looks like fun!
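
    In the spirit of the book (this snippet is mine, not Sweigart's), implementing a Caesar cipher and then "hacking" it by brute force fits in a dozen lines:

        import string

        ALPHABET = string.ascii_uppercase

        def caesar(message, key):
            """Shift letters by key positions; encrypt with +key, decrypt with -key."""
            shifted = ALPHABET[key % 26:] + ALPHABET[:key % 26]
            return message.upper().translate(str.maketrans(ALPHABET, shifted))

        ciphertext = caesar("TOPIC MAPS", 13)
        print(ciphertext)                        # GBCVP ZNCF

        # The "hack": try all 26 keys and eyeball which output is English.
        for key in range(26):
            print(key, caesar(ciphertext, -key))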

    Unlike the secrecy cultists in cybersecurity, I think new ideas and insights into cryptography can come from anyone who spends time working on it.

    To paraphrase Buffalo Springfield, “…increase the government’s paranoia like looking in a mirror and seeing the public working on cryptography….”

    I never claimed to be a song writer. 😉

    PS: Download a copy and buy a hard copy to give to someone.

    Or donate the hard copy to your local library!

    April 11, 2013

    PyData and More Tools…

    Filed under: Data Science,PyData,Python — Patrick Durusau @ 3:14 pm

    PyData and More Tools for Getting Started with Python for Data Scientists by Sean Murphy.

    From the post:

    It would turn out that people are very interested in learning more about python and our last post, “Getting Started with Python for Data Scientists,” generated a ton of comments and recommendations. So, we wanted to give back those comments and a few more in a new post. As luck would have it, John Dennison, who helped co-author this post (along with Abhijit), attended both PyCon and PyData and wanted to sneak in some awesome developments he learned at the two conferences.

    I make out at least seventeen (17) different Python resources, libraries, etc.

    Enough to keep you busy for more than a little while. 😉

    April 9, 2013

    Scrapely

    Filed under: Data Mining,Python — Patrick Durusau @ 9:36 am

    Scrapely

    From the webpage:

    Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.

    A tool for data mining similar HTML pages.

    Supports a command line interface.
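
    The train-by-example workflow from the project's README comes down to a few calls; a sketch (URLs and field values are placeholders):

        from scrapely import Scraper

        s = Scraper()

        # Show scrapely one example page and the data you want from it...
        s.train("http://example.com/products/1",
                {"name": "Example Product", "price": "19.99"})

        # ...and it induces a parser that pulls the same fields from similar pages.
        print(s.scrape("http://example.com/products/2"))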

    April 6, 2013

    A Programmer’s Guide to Data Mining

    Filed under: Data Mining,Python — Patrick Durusau @ 8:56 am

    A Programmer’s Guide to Data Mining – The Ancient Art of the Numerati by Ron Zacharski.

    From the webpage:

    Before you is a tool for learning basic data mining techniques. Most data mining textbooks focus on providing a theoretical foundation for data mining, and as a result, may seem notoriously difficult to understand. Don’t get me wrong, the information in those books is extremely important. However, if you are a programmer interested in learning a bit about data mining you might be interested in a beginner’s hands-on guide as a first step. That’s what this book provides.

    This guide follows a learn-by-doing approach. Instead of passively reading the book, I encourage you to work through the exercises and experiment with the Python code I provide. I hope you will be actively involved in trying out and programming data mining techniques. The textbook is laid out as a series of small steps that build on each other until, by the time you complete the book, you have laid the foundation for understanding data mining techniques. This book is available for download for free under a Creative Commons license (see link in footer). You are free to share the book, and remix it. Someday I may offer a paper copy, but the online version will always be free.

    If you are looking for explanations of data mining that fall between the “dummies” variety and arXiv.org papers, you are at the right place!

    Not new information but well presented information, always a rare thing.

    Take the time to read this book.

    If not for the content, to get some ideas on how to improve your next book.

    March 25, 2013

    Implementing the RAKE Algorithm with NLTK

    Filed under: Authoring Topic Maps,Natural Language Processing,NLTK,Python — Patrick Durusau @ 3:09 pm

    Implementing the RAKE Algorithm with NLTK by Sujit Pal.

    From the post:

    The Rapid Automatic Keyword Extraction (RAKE) algorithm extracts keywords from text, by identifying runs of non-stopwords and then scoring these phrases across the document. It requires no training, the only input is a list of stop words for a given language, and a tokenizer that splits the text into sentences and sentences into words.

    The RAKE algorithm is described in the book Text Mining Applications and Theory by Michael W Berry (free PDF). There is a (relatively) well-known Python implementation and somewhat less well-known Java implementation.

    I started looking for something along these lines because I needed to parse a block of text before vectorizing it and using the resulting features as input to a predictive model. Vectorizing text is quite easy with Scikit-Learn as shown in its Text Processing Tutorial. What I was trying to do was to cut down the noise by extracting keywords from the input text and passing a concatenation of the keywords into the vectorizer. It didn’t improve results by much in my cross-validation tests, however, so I ended up not using it. But keyword extraction can have other uses, so I decided to explore it a bit more.

    I had started off using the Python implementation directly from my application code (by importing it as a module). I soon noticed that it was doing a lot of extra work because it was implemented in pure Python. I was using NLTK anyway for other stuff in this application, so it made sense to convert it to also use NLTK so I could hand off some of the work to NLTK’s built-in functions. So here is another RAKE implementation, this time using Python and NLTK.
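
    Sujit's NLTK version is in the post; the core of RAKE is small enough to sketch from scratch (my simplification, not his code): candidate phrases are maximal runs of non-stopword tokens, and each phrase is scored by the degree-to-frequency ratio of its words.

        import string
        from collections import defaultdict

        import nltk
        from nltk.corpus import stopwords   # needs nltk.download('stopwords') and 'punkt'

        STOP = set(stopwords.words("english")) | set(string.punctuation)

        def rake_keywords(text, top_n=5):
            # Split into candidate phrases at stopwords and punctuation.
            phrases, current = [], []
            for token in nltk.word_tokenize(text.lower()):
                if token in STOP:
                    if current:
                        phrases.append(current)
                    current = []
                else:
                    current.append(token)
            if current:
                phrases.append(current)

            # Word scores: degree (summed phrase lengths) over frequency.
            freq, degree = defaultdict(int), defaultdict(int)
            for phrase in phrases:
                for word in phrase:
                    freq[word] += 1
                    degree[word] += len(phrase)
            scores = {w: degree[w] / freq[w] for w in freq}

            ranked = sorted(phrases, key=lambda p: sum(scores[w] for w in p), reverse=True)
            return [" ".join(p) for p in ranked[:top_n]]

        print(rake_keywords("Compatibility of systems of linear constraints over the set of natural numbers."))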

    Reminds me of the “statistically insignificant phrases” at Amazon. Or was that “statistically improbable phrases?”

    If you search on “statistically improbable phrases,” you get twenty (20) “hits” under books at Amazon.com.

    Could be a handy tool to quickly extract candidates for topics in a topic map.

    March 24, 2013

    PyCon US 2013

    Filed under: Python — Patrick Durusau @ 6:38 pm

    PyCon US 2013

    In case you missed it, videos from PyCon US 2013 are online!

    I am just beginning to scroll through the presentations and will be pulling out some favorites.

    What are yours?

    March 22, 2013

    How Sharehoods Created Neomodel Along The Way [London]

    Filed under: Django,Graphs,Neo4j,Python — Patrick Durusau @ 12:53 pm

    How Sharehoods Created Neomodel Along The Way

    EVENT DETAILS

    What: Neo4J User Group: CASE STUDY: How Sharehoods Created Neomodel Along The Way
    Where: The Skills Matter eXchange, London
    When: 27 Mar 2013 Starts at 18:30

    From the description:

    Sharehoods is a global online portal for foreigners, and the first place where newcomers to a city can build their social relationships and network – online or from a mobile phone.

    In this talk, Sharehoods Head of Technology Robin Edwards will explain why and how Neo4j is used at this exciting tech startup. Robin will also give a whirlwind tour of neomodel, a new Python framework for neo4j and its integration with the Django stack.

    Join this talk if you’d like to learn how to get productive with Neo4j, Python and Django.
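
    If you can't make it to London, neomodel's declarative models give the flavor; this is roughly the README example (the API may have moved since 2013, so treat it as a sketch):

        from neomodel import StructuredNode, StringProperty, IntegerProperty, RelationshipTo

        class Country(StructuredNode):
            code = StringProperty(unique_index=True, required=True)

        class Person(StructuredNode):
            name = StringProperty(unique_index=True)
            age = IntegerProperty(index=True, default=0)
            country = RelationshipTo(Country, "IS_FROM")

        jim = Person(name="Jim", age=3).save()
        de = Country(code="DE").save()
        jim.country.connect(de)       # creates the IS_FROM relationship in Neo4j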

    Entity disambiguation:

    I don’t think they mean:

    Jamie Foxx

    I think they mean:

    Django, “The Web framework for perfectionists with deadlines.”

    If you attend, drop me a note to confirm my suspicions. 😉

    SnoPy – SNOBOL Pattern Matching for Python

    Filed under: Pattern Matching,Python,SNOBOL — Patrick Durusau @ 9:32 am

    SnoPy – SNOBOL Pattern Matching for Python

    Description:

    SnoPy – A Python alternative to regular expressions. Borrowed from SNOBOL this alternative is both easier to use and more powerful than regular expressions. NO backslashes to count.

    See also: SnoPy – SNOBOL Pattern Matching for Python Web Site.

    For cross-disciplinary data mining, what could be more useful than SNOBOL pattern matching?

    I first saw this in a Facebook link posted by Sam Hunting.

    March 20, 2013

    Pyrallel – Parallel Data Analytics in Python

    Filed under: Data Analysis,Parallel Programming,Programming,Python — Patrick Durusau @ 6:12 am

    Pyrallel – Parallel Data Analytics in Python by Olivier Grisel.

    From the webpage:

    Overview: experimental project to investigate distributed computation patterns for machine learning and other semi-interactive data analytics tasks.

    Scope:

    • focus on small to medium dataset that fits in memory on a small (10+ nodes) to medium cluster (100+ nodes).
    • focus on small to medium data (with data locality when possible).
    • focus on CPU bound tasks (e.g. training Random Forests) while trying to limit disk / network access to a minimum.
    • do not focus on HA / Fault Tolerance (yet).
    • do not try to invent new set of high level programming abstractions (yet): use a low level programming model (IPython.parallel) to finely control the cluster elements and messages transferred and help identify what are the practical underlying constraints in distributed machine learning setting.

    Disclaimer: the public API of this library will probably not be stable soon as the current goal of this project is to experiment.
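
    For orientation, the low-level model Pyrallel builds on (IPython.parallel, as it was named at the time) looks roughly like this once an ipcluster is running (ipcluster start -n 4):

        from IPython.parallel import Client

        rc = Client()                       # connect to the running cluster
        view = rc.load_balanced_view()

        def fit_one(seed):
            # Stand-in for a CPU-bound task, e.g. training one tree of a forest.
            import random
            random.seed(seed)
            return sum(random.random() for _ in range(10 ** 6))

        results = view.map_sync(fit_one, range(8))   # farm tasks out to the engines
        print(results)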

    This project brought to mind two things:

    1. Experimentation can lead to new approaches, such as “Think like a vertex.” (GraphLab: A Distributed Abstraction…), and
    2. A conference anecdote about a Python prototype written with the expectation that the customer would upgrade to a fuller version for higher performance. The prototype performed so well the customer never needed the fuller version. I thought that was a tribute to Python and the programmer. Opinions differed.

    March 7, 2013

    Million Song Dataset in Minutes!

    Filed under: Hadoop,MapReduce,Mortar,Pig,Python — Patrick Durusau @ 3:50 pm

    Million Song Dataset in Minutes! (Video)

    Actually 5:35 as per the video.

    The summary of the video reads:

    Created Web Project [zero install]

    Loaded data from S3

    Developed in Pig and Python [watch for the drop down menus of pig fragments]

    ILLUSTRATE’d our work [perhaps the most impressive feature, tests code against sample of data]

    Ran on Hadoop [drop downs to create a cluster]

    Downloaded results [50 “densest songs”, see the video]

    It’s not all “hands free” or without intellectual effort on your part.

    But, a major step towards a generally accessible interface for Hadoop/MapReduce data processing.

    February 26, 2013

    PyData Videos

    Filed under: Data,Python — Patrick Durusau @ 1:53 pm

    PyData Videos

    All great but here are five (5) to illustrate the range of what awaits:

    Connecting Data Science to business value, Josh Hemann.

    GPU and Python, Andreas Klöckner, Ph.D.

    NetworkX and Gephi, Gilad Lotan.

    NLTK and Text Processing, Andrew Montalenti.

    Wikipedia Indexing And Analysis, Didier Deshommes.

    Forty-seven (47) videos in all so my list is missing forty-two (42) other great ones!

    Which ones are your favorites?

    February 11, 2013

    Flatten entire HBase column families… [Mixing Labels and Data]

    Filed under: HBase,Pig,Python — Patrick Durusau @ 4:24 pm

    Flatten entire HBase column families with Pig and Python UDFs by Chase Seibert.

    From the post:

    Most Pig tutorials you will find assume that you are working with data where you know all the column names ahead of time, and that the column names themselves are just labels, versus being composites of labels and data. For example, when working with HBase, it’s actually not uncommon for both of those assumptions to be false. Being a columnar database, it’s very common to be working with rows that have thousands of columns. Under that circumstance, it’s also common for the column names themselves to encode dimensions, such as date and counter type.

    How do you solve this mismatch? If you’re in the early stages of designing a schema, you could reconsider a more row based approach. If you have to work with an existing schema, however, you can with the help of Pig UDFs.
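
    Stripped of the Pig registration plumbing, the heart of such a UDF is just string surgery on the column name; a sketch (the name layout is hypothetical):

        def split_column_name(column_name):
            """Split a composite HBase column name like '2013-02-08_pageviews'
            into the data it encodes: (date, counter_type)."""
            date, _, counter_type = column_name.partition("_")
            return date, counter_type

        print(split_column_name("2013-02-08_pageviews"))   # ('2013-02-08', 'pageviews')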

    Now there’s an ugly problem.

    You can split the label from the data as shown, but that doesn’t help when the label/data is still in situ.

    Saying: “Don’t do that!” doesn’t help because it is already being done.

    If anything, topic maps need to take subjects as they are found, not as we might wish for them to be.

    Curious, would you write an identifier as a regex that parses such a mix of label and data, assigning each to further processing?

    Suggestions?

    I first saw this at Flatten Entire HBase Column Families With Pig and Python UDFs by Alex Popescu.

    February 8, 2013

    PyPLN: a Distributed Platform for Natural Language Processing

    Filed under: Linguistics,Natural Language Processing,Python — Patrick Durusau @ 5:16 pm

    PyPLN: a Distributed Platform for Natural Language Processing by Flávio Codeço Coelho, Renato Rocha Souza, Álvaro Justen, Flávio Amieiro, Heliana Mello.

    Abstract:

    This paper presents a distributed platform for Natural Language Processing called PyPLN. PyPLN leverages a vast array of NLP and text processing open source tools, managing the distribution of the workload on a variety of configurations: from a single server to a cluster of linux servers. PyPLN is developed using Python 2.7.3 but makes it very easy to incorporate other softwares for specific tasks as long as a linux version is available. PyPLN facilitates analyses both at document and corpus level, simplifying management and publication of corpora and analytical results through an easy to use web interface. In the current (beta) release, it supports English and Portuguese languages with support to other languages planned for future releases. To support the Portuguese language PyPLN uses the PALAVRAS parser (Bick, 2000). Currently PyPLN offers the following features: Text extraction with encoding normalization (to UTF-8), part-of-speech tagging, token frequency, semantic annotation, n-gram extraction, word and sentence repertoire, and full-text search across corpora. The platform is licensed as GPL-v3.

    Demo: http://demo.pypln.org

    Source code: http://pypln.org.

    Have you noticed that tools for analysis are getting easier, not harder to use?

    Is there a lesson there for tools to create topic map content?

    February 7, 2013

    A Quick Guide to Hadoop Map-Reduce Frameworks

    Filed under: Hadoop,Hive,MapReduce,Pig,Python,Scalding,Scoobi,Scrunch,Spark — Patrick Durusau @ 10:45 am

    A Quick Guide to Hadoop Map-Reduce Frameworks by Alex Popescu.

    Alex has assembled links to guides to MapReduce frameworks:

    Thanks Alex!

    January 23, 2013

    Confluently Persistent Sets and Maps

    Filed under: Functional Programming,Maps,Python,Sets — Patrick Durusau @ 7:42 pm

    Confluently Persistent Sets and Maps by Olle Liljenzin.

    Abstract:

    Ordered sets and maps play important roles as index structures in relational data models. When a shared index in a multi-user system is modified concurrently, the current state of the index will diverge into multiple versions containing the local modifications performed in each work flow. The confluent persistence problem arises when versions should be melded in commit and refresh operations so that modifications performed by different users become merged.

    Confluently Persistent Sets and Maps are functional binary search trees that support efficient set operations both when operands are disjoint and when they are overlapping. Treap properties with hash values as priorities are maintained and with hash-consing of nodes a unique representation is provided. Non-destructive set merge algorithms that skip inspection of equal subtrees and a conflict detecting meld algorithm based on set merges are presented. The meld algorithm is used in commit and refresh operations. With m modifications in one flow and n items in total, the expected cost of the operations is O(m log(n/m)).

    Is this an avenue for coordination between distinct topic maps?

    Or is consistency of distinct topic maps an application-based requirement?

    Assembling a Python Machine Learning Toolkit

    Filed under: Machine Learning,Python — Patrick Durusau @ 7:40 pm

    Assembling a Python Machine Learning Toolkit by Sujit Pal.

    From the post:

    I had been meaning to read Peter Harrington’s book Machine Learning In Action (MLIA) for a while now, and I finally finished reading it earlier this week (my review on Amazon is here). The book provides Python implementations of 8 of the 10 Top Algorithms in Data Mining listed in this paper (PDF). The math package used in the examples is Numpy, and the charts are built using Matplotlib.

    In the past, the little ML work I have done has been in Java, because that was the language and ecosystem I knew best. However, given the experimental, iterative nature of ML work, it’s probably not the most ideal language to use. However, there are lots of options when it comes to languages for ML – over the last year, I have learned Octave (open-source version of MATLAB) for the Coursera Machine Learning class and R for the Coursera Statistics One and Computing for Data Analysis classes (still doing the second one). But because I know Python already, Python/Numpy looks easier to use than Octave, and Python/Matplotlib looks as simple as using R graphics. There is also the pandas package which provides R-like features, although I haven’t used it yet.

    Looking around on the net, I find that many other people have reached similar conclusions – ie, that Python seems to be the way to go for initial prototyping work in ML. I wanted to set up a small toolbox of Python libraries that will allow me to do this also. I settled on an initial list of packages based on the Scipy Superpack, but since I am still on Mac OS (Snow Leopard) I could not use the script from there. There were some issues I had to work through to make this to work, so I document this here, so if you are in the same situation this may help you.

    Unlike the Scipy Superpack, which seems to prefer versions that are often the bleeding edge development versions, I decided to stick to the latest stable release versions for each of the libraries. Here they are:

    Sujit’s post will save you a few steps in assembling your Python machine learning toolkit.
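
    Once everything is installed, a quick sanity check is to import the stack and print versions (a trivial sketch, assuming you installed the usual suspects):

        import numpy, scipy, matplotlib, sklearn

        for module in (numpy, scipy, matplotlib, sklearn):
            print(module.__name__, module.__version__)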

    Pass it on.

    January 17, 2013

    Machine Learning and Data Mining – Association Analysis with Python

    Machine Learning and Data Mining – Association Analysis with Python by Marcel Caraciolo.

    From the post:

    Recently I’ve been working with recommender systems and association analysis. This last one, especially, is one of the most used machine learning algorithms to extract hidden relationships from large datasets.

    The famous example related to the study of association analysis is the history of the baby diapers and beers. This history reports that a certain grocery store in the Midwest of the United States increased their beer sales by putting them near where the diapers were placed. In fact, what happened is that the association rules pointed out that men bought diapers and beers on Thursdays. So the store could have profited by placing those products together, which would increase the sales.

    Association analysis is the task of finding interesting relationships in large data sets. These hidden relationships are then expressed as a collection of association rules and frequent item sets. Frequent item sets are simply a collection of items that frequently occur together. And association rules suggest a strong relationship that exists between two items.
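
    Marcel walks through a full Apriori implementation; the first step, counting frequent itemsets, can be sketched with nothing but the standard library:

        from collections import Counter
        from itertools import combinations

        transactions = [
            {"beer", "diapers", "bread"},
            {"beer", "diapers"},
            {"bread", "milk"},
            {"beer", "diapers", "milk"},
        ]
        min_support = 0.5   # an itemset must appear in at least half the baskets

        counts = Counter()
        for basket in transactions:
            for size in (1, 2):
                counts.update(frozenset(c) for c in combinations(sorted(basket), size))

        n = len(transactions)
        frequent = {items: count / n for items, count in counts.items() if count / n >= min_support}

        for items, support in sorted(frequent.items(), key=lambda kv: -kv[1]):
            print(set(items), support)

    On this toy data {beer, diapers} is as frequent as either item alone, which is the relationship the diapers-and-beer story turns on; Apriori's contribution is pruning the candidate itemsets so the counting scales.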

    When I think of associations in a topic map, I assume I am at least starting with the roles and the players of those roles.

    As this post demonstrates, that may be overly optimistic on my part.

    What if I discover an association but not its type or the roles in it? And yet I still want to preserve the discovery for later use?

    An incomplete association as it were.

    Suggestions?

    January 9, 2013

    A Guide to Python Frameworks for Hadoop

    Filed under: Hadoop,MapReduce,Python — Patrick Durusau @ 12:03 pm

    A Guide to Python Frameworks for Hadoop by Uri Laserson.

    From the post:

    I recently joined Cloudera after working in computational biology/genomics for close to a decade. My analytical work is primarily performed in Python, along with its fantastic scientific stack. It was quite jarring to find out that the Apache Hadoop ecosystem is primarily written in/for Java. So my first order of business was to investigate some of the options that exist for working with Hadoop from Python.

    In this post, I will provide an unscientific, ad hoc review of my experiences with some of the Python frameworks that exist for working with Hadoop, including:

    • Hadoop Streaming
    • mrjob
    • dumbo
    • hadoopy
    • pydoop
    • and others

    Ultimately, in my analysis, Hadoop Streaming is the fastest and most transparent option, and the best one for text processing. mrjob is best for rapidly working on Amazon EMR, but incurs a significant performance penalty. dumbo is convenient for more complex jobs (objects as keys; multistep MapReduce) without incurring as much overhead as mrjob, but it’s still slower than Streaming.

    Read on for implementation details, performance comparisons, and feature comparisons.
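
    For reference, Hadoop Streaming (the option Uri ranks fastest) is just two scripts talking over stdin/stdout; here is a bigram-counting sketch, since word count has been done to death:

        # mapper.py -- emit (bigram, 1) for every adjacent word pair
        import sys

        for line in sys.stdin:
            words = line.strip().lower().split()
            for first, second in zip(words, words[1:]):
                print("%s %s\t%d" % (first, second, 1))

        # reducer.py -- sum counts per bigram (input arrives sorted by key)
        import sys

        current_key, current_count = None, 0
        for line in sys.stdin:
            key, count = line.rstrip("\n").split("\t")
            if key != current_key:
                if current_key is not None:
                    print("%s\t%d" % (current_key, current_count))
                current_key, current_count = key, 0
            current_count += int(count)

        if current_key is not None:
            print("%s\t%d" % (current_key, current_count))

    You hand both scripts to the hadoop-streaming jar with -mapper, -reducer, -input and -output; everything else surveyed in the post is about what the frameworks add on top of exactly this.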

    A non-word count Hadoop example? Who would have thought? 😉

    Enjoy!
