Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

February 19, 2014

Analyzing PubMed Entries with Python and NLTK

Filed under: NLTK,PubMed,Python — Patrick Durusau @ 2:35 pm

Analyzing PubMed Entries with Python and NLTK by Themos Kalafatis.

From the post:

I decided to take my first steps of learning Python with the following task: Retrieve all entries from PubMed and then analyze those entries using Python and the Text Mining library NLTK.

We assume that we are interested in learning more about a condition called Sudden Hearing Loss. Sudden Hearing Loss is considered a medical emergency and has several causes although usually it is idiopathic (a disease or condition the cause of which is not known or that arises spontaneously according to Wikipedia).

At the moment of writing, the PubMed Query for sudden hearing loss returns 2919 entries:

A great illustration not only of using NLTK but also of the iterative nature of successful querying.

Some queries, quite simple ones, can and do succeed on the first attempt.

Themos demonstrates how to use NLTK to explore a data set where the first response isn’t all that helpful.

This is a starting idea for weekly NLTK exercises, each emphasizing a different aspect of the toolkit.
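For readers who want to try the same loop without reading the full post, here is a minimal sketch of the retrieve-then-analyze cycle using Biopython's Entrez module and NLTK. This is not Themos's code; the email address, search term cap, and output are placeholders.

```python
from Bio import Entrez
import nltk

Entrez.email = "you@example.com"   # NCBI asks for a contact address

# Search PubMed and fetch abstracts for the first 20 hits.
handle = Entrez.esearch(db="pubmed", term="sudden hearing loss", retmax=20)
ids = Entrez.read(handle)["IdList"]

handle = Entrez.efetch(db="pubmed", id=",".join(ids),
                       rettype="abstract", retmode="text")
abstracts = handle.read()

# Tokenize and count the most frequent non-stopword terms.
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
stops = set(nltk.corpus.stopwords.words("english"))
words = [w.lower() for w in nltk.word_tokenize(abstracts)
         if w.isalpha() and w.lower() not in stops]
print(nltk.FreqDist(words).most_common(10))
```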

February 15, 2014

Anaconda 1.9

Filed under: Anaconda,Data Mining,Python — Patrick Durusau @ 10:22 am

Anaconda 1.9

From the homepage:

Completely free enterprise-ready Python distribution for large-scale data processing, predictive analytics, and scientific computing

  • 125+ of the most popular Python packages for science, math, engineering, data analysis
  • Completely free – including for commercial use and even redistribution
  • Cross platform on Linux, Windows, Mac
  • Installs into a single directory and doesn’t affect other Python installations on your system. Doesn’t require root or local administrator privileges.
  • Stay up-to-date by easily updating packages from our free, online repository
  • Easily switch between Python 2.6, 2.7, 3.3, and experiment with multiple versions of libraries, using our conda package manager and its great support for virtual environments

In addition to maintaining Anaconda as a free Python distribution, Continuum Analytics offers consulting/training services and commercial packages to enhance your use of Anaconda.

Before hitting “download,” know that the Linux 64-bit distribution is just short of 649 MB. Not an issue for most folks but there are some edge cases where it might be.

February 14, 2014

SunPy

Filed under: Astroinformatics,Numpy,Python — Patrick Durusau @ 4:54 pm

SunPy

From the webpage:

The SunPy project is a free and open-source software library for solar physics.

SunPy is a community-developed free and open-source software package for solar physics. SunPy is meant to be a free alternative to the SolarSoft data analysis environment, which is based on the IDL scientific programming language sold by Exelis. Though SolarSoft is open source, IDL is not and can be prohibitively expensive.

The aim of the SunPy project is to provide the software tools necessary so that anyone can analyze solar data. SunPy is written using the Python programming language and is built upon the scientific Python environment, which includes the core packages NumPy and SciPy. The development of SunPy is associated with that of Astropy. SunPy was first created in 2011 by a small group of scientists and developers at the NASA Goddard Space Flight Center on nights and weekends.
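To give a sense of how little code stands between you and a solar image, here is a minimal sketch using SunPy's bundled sample data. It reflects the current API; attribute names may differ in older releases.

```python
import sunpy.map
import sunpy.data.sample   # downloads a small set of sample FITS files on first use

# Load a sample AIA 171 Å image into a Map (data array plus metadata).
aia = sunpy.map.Map(sunpy.data.sample.AIA_171_IMAGE)
print(aia.date, aia.instrument)

aia.peek()   # quick-look plot via matplotlib
```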

Future employers will be interested in your data-handling skills, not whether you learned them as part of a hobby (astronomy), on your own, or in a class. Learning them through a hobby just means you had fun along the way.

I first saw this in a tweet by Scientific Python.

January 30, 2014

100 numpy exercises

Filed under: Numpy,Python,Scientific Computing — Patrick Durusau @ 3:38 pm

100 numpy exercises, a joint effort of the numpy community.

The categories are:

Neophyte
Novice
Apprentice
Journeyman
Craftsman
Artisan
Adept
Expert
Master
Archmaster

Further on Numpy.
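To give a flavor of the neophyte end of the ladder, here are a few typical warm-ups, paraphrased rather than quoted from the exercise set:

```python
import numpy as np

# Create a null vector of size 10 with the fifth value set to 1.
z = np.zeros(10)
z[4] = 1

# Create a 3x3 matrix with values ranging from 0 to 8.
m = np.arange(9).reshape(3, 3)

# Find the indices of the non-zero elements of an array.
nz = np.nonzero([1, 2, 0, 0, 4, 0])

print(z, m, nz, sep="\n")
```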

Enjoy!

I first saw this in a tweet by Gregory Piatetsky.

January 21, 2014

Extracting Insights – FBO.Gov

Filed under: Government Data,Hadoop,NLTK,Pig,Python — Patrick Durusau @ 3:20 pm

Extracting Insights from FBO.Gov data – Part 1

Extracting Insights from FBO.Gov data – Part 2

Extracting Insights from FBO.Gov data – Part 3

Dave Fauth has written a great three part series on extracting “insights” from large amounts of data.

From the third post in the series:

Earlier this year, the Sunlight Foundation filed a lawsuit under the Freedom of Information Act. The lawsuit requested solicitation and award notices from FBO.gov. In November, Sunlight received over a decade’s worth of information and posted the information on-line for public downloading. I want to say a big thanks to Ginger McCall and Kaitlin Devine for the work that went into making this data available.

In the first part of this series, I looked at the data and munged the data into a workable set. Once I had the data in a workable set, I created some heatmap charts of the data looking at agencies and who they awarded contracts to. In part two of this series, I created some bubble charts looking at awards by Agency and also the most popular Awardees.

In the third part of the series, I am going to look at awards by date and then display that information in a calendar view. Then we will look at the types of awards.

For the date analysis, we are going to use all of the data going back to 2000. We have six data files that we will join together, filter on the ‘Notice Type’ field, and then calculate the counts by date for the awards. The goal is to see when awards are being made.
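Dave's pipeline uses Pig and Hadoop; just to make the shape of the date analysis concrete, here is roughly the same roll-up sketched in pandas. The file names, the 'Notice Type' value, and the column names are my assumptions, not his code.

```python
import glob
import pandas as pd

# Read the yearly extract files (names assumed) and stack them into one frame.
frames = [pd.read_csv(path, parse_dates=["Date"]) for path in glob.glob("fbo_*.csv")]
data = pd.concat(frames, ignore_index=True)

# Keep only award notices, then count awards per day -- the input for a calendar view.
awards = data[data["Notice Type"] == "Award Notice"]
counts = awards.groupby(awards["Date"].dt.date).size()
print(counts.sort_values(ascending=False).head())
```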

The most compelling lesson from this series is that data doesn’t always easily give up its secrets.

If you make it to the end of the series, you will find the government, on occasion, does the right thing. I’ll admit it, I was very surprised. 😉

January 14, 2014

Algorithmic Music Discovery at Spotify

Filed under: Algorithms,Machine Learning,Matrix,Music,Music Retrieval,Python — Patrick Durusau @ 3:19 pm

Algorithmic Music Discovery at Spotify by Chris Johnson.

From the description:

In this presentation I introduce various Machine Learning methods that we utilize for music recommendations and discovery at Spotify. Specifically, I focus on Implicit Matrix Factorization for Collaborative Filtering, how to implement a small scale version using python, numpy, and scipy, as well as how to scale up to 20 Million users and 24 Million songs using Hadoop and Spark.

Among a number of interesting points, Chris points out differences between movie and music data.

One difference is that songs are consumed over and over again. Another is that users rate movies but “vote” by their streaming behavior on songs.*

Which leads to Chris’ main point: implicit matrix factorization. Code is available; the source code page points to Collaborative Filtering for Implicit Feedback Datasets by Yifan Hu, Yehuda Koren, and Chris Volinsky.

Scaling that process is represented in blocks for Hadoop and Spark.
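To make the “small scale version using python, numpy, and scipy” concrete, here is a minimal ALS sketch for implicit feedback in the style of Hu, Koren, and Volinsky. The toy play-count matrix and parameter values are mine, not Chris’s code.

```python
import numpy as np

def implicit_als(R, factors=2, alpha=40.0, reg=0.1, iterations=10):
    """Tiny dense ALS for implicit feedback (Hu/Koren/Volinsky style)."""
    n_users, n_items = R.shape
    P = (R > 0).astype(float)                      # binary preference matrix
    C = 1.0 + alpha * R                            # confidence weights
    X = 0.01 * np.random.rand(n_users, factors)    # user factors
    Y = 0.01 * np.random.rand(n_items, factors)    # item factors
    lam = reg * np.eye(factors)

    for _ in range(iterations):
        for u in range(n_users):                   # fix Y, solve each user
            Cu = np.diag(C[u])
            X[u] = np.linalg.solve(Y.T.dot(Cu).dot(Y) + lam,
                                   Y.T.dot(Cu).dot(P[u]))
        for i in range(n_items):                   # fix X, solve each item
            Ci = np.diag(C[:, i])
            Y[i] = np.linalg.solve(X.T.dot(Ci).dot(X) + lam,
                                   X.T.dot(Ci).dot(P[:, i]))
    return X, Y

# Toy play counts: 4 users x 5 songs.
R = np.array([[3, 0, 0, 1, 0],
              [0, 5, 0, 0, 2],
              [1, 0, 4, 0, 0],
              [0, 2, 0, 0, 6]], dtype=float)
X, Y = implicit_als(R)
print(np.round(X.dot(Y.T), 2))   # predicted preference scores
```

The confidence weighting (1 + alpha * count) is what separates the implicit case from ordinary matrix factorization: play counts are evidence of preference, not ratings.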

* I suspect that “behavior” is more reliable than “ratings” from the same user, the reasoning being that ratings are more likely to be subject to social influences. I don’t have any research at my fingertips on that issue. Do you?

January 10, 2014

…Customizable Test Data with Python

Filed under: Data,Python — Patrick Durusau @ 5:15 pm

A Tool to Generate Customizable Test Data with Python by Alec Noller.

From the post:

Sometimes you need a dataset to run some tests – just a bunch of data, anything – and it can be unexpectedly difficult to find something that works. There are some useful and readily-available options out there; for example, Matthew Dubins has worked with the Enron email dataset and a complete list of 9/11 victims.

However, if you have more specific needs, particularly when it comes to format and fitting within the structure of a database, and you want to customize your dataset to test one thing or another in particular, take a look at this Python package called python-testdata used to generate customizable test data. It can be set up to generate names in various forms, companies, addresses, emails, and more. The GitHub repository also includes some help to get started, as well as examples for use cases.
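The factory-style API looks roughly like this, adapted from my recollection of the project's README; treat the factory names as illustrative rather than authoritative, since they may differ between versions.

```python
import testdata

class Users(testdata.DictFactory):
    """Each generated dict is one fake user record (field names are examples)."""
    id = testdata.CountingFactory(10)                 # 10, 11, 12, ...
    firstname = testdata.FakeDataFactory('firstName')
    lastname = testdata.FakeDataFactory('lastName')
    address = testdata.FakeDataFactory('address')
    age = testdata.RandomInteger(10, 30)
    gender = testdata.RandomSelection(['female', 'male'])

for user in Users().generate(10):
    print(user)
```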

I hesitated when I first saw this, given the overabundance of free data.

But with “free” data, if it is large enough, you will have to rely on sampling to gauge the performance of your software.

And introducing the hazards and dangers of strange data may not be acceptable in every case.

xkcd 1313: Something is Wrong on the Internet!

Filed under: Python,Regex — Patrick Durusau @ 3:41 pm

xkcd 1313: Something is Wrong on the Internet!

Serious geekdom here!

An xkcd comic inspires an algorithm that generates a regex to extract winners from U.S. presidential elections. (Applicable to other lists as well.)

Remembering that some U.S. presidents both won and lost races for the presidency.

A very clever piece of work. At the same time, I must have the winner/loser lists in order to generate the regex.
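The core of the trick is a greedy set cover: collect short substrings that match winners but no losers, then OR together the parts that cover the most remaining winners. Here is a stripped-down sketch of that idea, with toy name lists rather than the real election data, and not Norvig's actual code.

```python
import re

def matches(pattern, strings):
    """Return the subset of strings the regex matches."""
    return {s for s in strings if re.search(pattern, s)}

def candidate_parts(winners, losers, max_len=4):
    """All short substrings of winners that match no loser."""
    parts = set()
    for w in winners:
        for i in range(len(w)):
            for j in range(i + 1, min(i + max_len, len(w)) + 1):
                part = re.escape(w[i:j])
                if not matches(part, losers):
                    parts.add(part)
    return parts

def golf(winners, losers):
    """Greedy set cover: OR together parts until every winner is matched."""
    winners, losers = set(winners), set(losers)
    parts = candidate_parts(winners, losers)
    solution, uncovered = [], set(winners)
    while uncovered:
        best = max(parts, key=lambda p: len(matches(p, uncovered)))
        solution.append(best)
        uncovered -= matches(best, uncovered)
    return '|'.join(solution)

winners = ['washington', 'lincoln', 'obama']
losers = ['clinton', 'dole', 'romney']   # toy lists only
print(golf(winners, losers))
```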

So it is a good exercise, but I can’t apply the regex beyond the lists used to generate it.

Yes?

BTW, do make a trip by Regex Golf to try your hand at writing regexes against different lists.

January 5, 2014

Unpublished Data (Meaning What?)

Filed under: NLTK,Python — Patrick Durusau @ 4:33 pm

PLoS Biology Bigrams by Georg.

From the post:

Here I will use the Natural Language Toolkit and a recipe from Python Text Processing with NLTK 2.0 Cookbook to work out the most frequent bigrams in the PLoS Biology articles that I downloaded last year and have described in previous posts here and here.

The amusing twist in this blog post is that the most frequent bigram, after filtering out stopwords, is unpublished data.

Not a trivial data set, some 1,754 articles.

Do you see the flaw in saying that most articles in PLoS Biology use “unpublished” data?

First, without looking at the data, I would ask for the counts of each of the top six bigrams. I suspect that “gene expression” is used frequently relative to the number of articles, but I can’t make that judgment with the information given.

Second, you would need to ask why an article used the bigram “unpublished data.”

If I were writing a paper about papers that used “unpublished data” or more generally about “unpublished data,” I would use the bigram a lot. That would not mean my article was based on “unpublished data.”

NLTK can point you to the articles, but deeper analysis is going to require you.
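For reference, the recipe Georg follows is roughly this, sketched here with NLTK's collocation tools against a stand-in text file rather than his 1,754 PLoS Biology articles:

```python
import nltk
from nltk.corpus import stopwords
from nltk.collocations import BigramCollocationFinder

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

# Stand-in for the downloaded article corpus.
text = open("plos_articles.txt").read().lower()
words = nltk.word_tokenize(text)

finder = BigramCollocationFinder.from_words(words)
stops = set(stopwords.words("english"))
# Drop bigrams containing stopwords or non-alphabetic tokens.
finder.apply_word_filter(lambda w: w in stops or not w.isalpha())

for bigram, count in finder.ngram_fd.most_common(10):
    print(bigram, count)
```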

December 31, 2013

A pandas cookbook

Filed under: Python — Patrick Durusau @ 4:01 pm

A pandas cookbook by Julia Evans.

From the post:

A few people have told me recently that they find the slides for my talks really helpful for getting started with pandas, a Python library for manipulating data. But then they get out of date, and it’s tough to support slides for a talk that I gave a year ago.

So I was procrastinating packing to leave New York yesterday, and I started writing up some examples, with explanations! A lot of them are taken from talks I’ve given, but I also want to give some new examples, like

  • how to deal with timestamps
  • what is a pivot table and why would you ever want one?
  • how to deal with “big” data

I’ve put it in a GitHub repository called pandas-cookbook. It’s along the same lines as the pandas talks I’ve given – take a real dataset or three, play around with it, and learn how to use pandas along the way.

From what I have seen recently, “cookbooks” are going to be a big item in 2014!
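One of the bullet points above, “what is a pivot table and why would you ever want one?”, is a one-liner in pandas. A toy sketch with made-up complaint counts:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Montreal", "Montreal", "Toronto", "Toronto"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "complaints": [120, 95, 150, 110],     # made-up numbers
})

# One row per city, one column per month, values summarized automatically.
print(pd.pivot_table(df, values="complaints", index="city", columns="month"))
```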

PyPi interactive dependency graph

Filed under: Dependency Graphs,Graphs,Python,Visualization — Patrick Durusau @ 3:54 pm

PyPi interactive dependency graph

The graph takes a moment or two to load but is well worth the wait.

Mouse-over for popup labels.

The code is available on GitHub.

I don’t know the use case for displaying all the dependencies (or rather all the identified dependencies) in PyPi.

Or to put it another way, being able to hide some common dependencies by package or even class could prove to be helpful.

Seeing data in the aggregate isn’t as useful as discovering important data by hiding the common, aggregate data.

December 28, 2013

Visualization [Harvard/Python/D3]

Filed under: D3,Graphics,Python,Visualization — Patrick Durusau @ 4:51 pm

Visualization [Harvard/Python/D3]

From the webpage:

The amount and complexity of information produced in science, engineering, business, and everyday human activity is increasing at staggering rates. The goal of this course is to expose you to visual representation methods and techniques that increase the understanding of complex data. Good visualizations not only present a visual interpretation of data, but do so by improving comprehension, communication, and decision making.

In this course you will learn how the human visual system processes and perceives images, good design practices for visualization, tools for visualization of data from a variety of fields, collecting data from web sites with Python, and programming of interactive web-based visualizations using D3.

Twenty-two (22) lectures, nine (9) labs (for some unknown reason, “lab” becomes “section”) and three (3) bonus videos.

Just as a sample, I tried Lab 3 Sketching Workshop I.

I don’t know that I will learn how to draw a straight line but if I don’t, it won’t be the fault of the instructor!

This looks very good.

I first saw this in a tweet by Christophe Viau.

December 21, 2013

Class Scheduling [Tutorial FoundationDB]

Filed under: FoundationDB,Java,Programming,Python,Ruby — Patrick Durusau @ 7:22 pm

Class Scheduling

From the post:

This tutorial provides a walkthrough of designing and building a simple application in Python using FoundationDB. In this tutorial, we use a few simple data modeling techniques. For a more in-depth discussion of data modeling in FoundationDB, see Data Modeling.

The concepts in this tutorial are applicable to all the languages supported by FoundationDB. If you prefer, you can see a version of this tutorial in:

The offering of the same tutorial in different languages looks like a clever idea.

Like using a polyglot edition of the Bible with parallel original text and translations.

In a polyglot, the associations between words in different languages are implied rather than explicit.
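To give a taste of the Python version, here is a heavily abridged sketch in the spirit of the tutorial's class-attendance model. The key layout, function names, and API version number are my assumptions, not the tutorial's exact code.

```python
import fdb

fdb.api_version(300)          # must not exceed your installed client version
db = fdb.open()               # assumes a running cluster and default cluster file

@fdb.transactional
def enroll(tr, student, class_name):
    # Key/value layout loosely echoing the tutorial's ('attends', student, class) tuples.
    tr[fdb.tuple.pack(('attends', student, class_name))] = b''

@fdb.transactional
def classes_of(tr, student):
    prefix = fdb.tuple.pack(('attends', student))
    return [fdb.tuple.unpack(k)[2] for k, v in tr.get_range_startswith(prefix)]

enroll(db, 'alice', 'intro_python')
print(classes_of(db, 'alice'))
```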

December 15, 2013

Data Science

Filed under: CS Lectures,Programming,Python — Patrick Durusau @ 8:49 pm

Data Science

Lectures on data science from the Harvard Extension School.

Twenty-two (22) lectures and ten (10) labs.

The lab sessions are instructor-led coding exercises with good visibility of the terminal window.

Possibly a format to follow in preparing other CS instructional material.

A lecture followed by a typing exercise: entering and understanding the code (especially when typos keep it from working).

I was reminded recently that Hunter Thompson typed novels by Ernest Hemingway and F. Scott Fitzgerald in order to learn their writing styles.

Would the same work for learning programming style? That you would begin to recognize patterns and options?

If nothing else, it gives you some quality time with a debugger. 😉

December 7, 2013

Large-Scale Machine Learning and Graphs

Filed under: GraphChi,GraphLab,Graphs,Python — Patrick Durusau @ 5:10 pm

Large-Scale Machine Learning and Graphs by Carlos Guestrin.

The presentation starts with a history of the evolution of GraphLab, which is interesting in and of itself.

Carlos then goes beyond a history lesson and gives a glimpse of a very exciting future.

Such as: installing GraphLab with Python, using Python for local development, and running the same Python code with GraphLab in the cloud.

Thought that might catch your eye.

Something to remember when people talk about scaling graph analysis.

If you are interested in seeing one possible future of graph processing today, not some day, check out: GraphLab Notebook (Beta).

BTW, Carlos mentions a technique called “think like a vertex,” which involves distributing vertices across machines rather than splitting graphs on edges.

Seems to me that would work to scale the processing of topic maps by splitting topics as well. Once “merging” has occurred on different machines, then “merge” the relevant topics back together across machines.

December 6, 2013

Whoosh

Filed under: Python,Search Engines — Patrick Durusau @ 5:18 pm

Whoosh: Python Search Library

From the webpage:

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites. Every part of how Whoosh works can be extended or replaced to meet your needs exactly.

Some of Whoosh’s features include:

  • Pythonic API.
  • Pure-Python. No compilation or binary packages needed, no mysterious crashes.
  • Fielded indexing and search.
  • Fast indexing and retrieval — faster than any other pure-Python search solution I know of. See Benchmarks.
  • Pluggable scoring algorithm (including BM25F), text analysis, storage, posting format, etc.
  • Powerful query language.
  • Production-quality pure Python spell-checker (as far as I know, the only one).

Whoosh might be useful in the following circumstances:

  • Anywhere a pure-Python solution is desirable to avoid having to build/compile native libraries (or force users to build/compile them).
  • As a research platform (at least for programmers that find Python easier to read and work with than Java 😉)
  • When an easy-to-use Pythonic interface is more important to you than raw speed.
  • If your application can make good use of one deeply integrated search/lookup solution you can rely on just being there rather than having two different search solutions (a simple/slow/homegrown one integrated, an indexed/fast/external binary dependency one as an option).

Whoosh was created and is maintained by Matt Chaput. It was originally created for use in the online help system of Side Effects Software’s 3D animation software Houdini. Side Effects Software Inc. graciously agreed to open-source the code.

Learning more

One of the reasons to use Whoosh made me laugh:

When an easy-to-use Pythonic interface is more important to you than raw speed.

When is raw speed less important than anything? 😉

Seriously, experimentation with search promises to be a fruitful area for the foreseeable future.
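To get that experimentation started, here is a minimal index-and-search sketch following Whoosh's quick start; the directory name and documents are made up.

```python
import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser

# Define a schema: stored title and path, indexed body text.
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT)

os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

writer = ix.writer()
writer.add_document(title=u"First document", path=u"/a",
                    content=u"Whoosh is a pure Python search library.")
writer.add_document(title=u"Second document", path=u"/b",
                    content=u"Topic maps thrive on good search.")
writer.commit()

with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse(u"search")
    for hit in searcher.search(query):
        print(hit["title"], hit["path"])
```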

I first saw this in Nat Torkington’s Four short links: 21 November 2013.

December 3, 2013

Bokeh

Filed under: Graphics,Python,Visualization — Patrick Durusau @ 3:42 pm

Bokeh

From the webpage:

Bokeh is a Python interactive visualization library for large datasets that natively uses the latest web technologies. Its goal is to provide elegant, concise construction of novel graphics in the style of Protovis/D3, while delivering high-performance interactivity over large data to thin clients.

For more information about the goals and direction of the project, please see the Technical Vision.

To get started quickly, follow the Quickstart.

Visit the source repository: https://github.com/ContinuumIO/bokeh

Be sure to follow us on Twitter @bokehplots!

The technical vision makes the case for Bokeh quite well:

Photographers use the Japanese word “bokeh” to describe the blurring of the out-of-focus parts of an image. Its aesthetic quality can greatly enhance a photograph, and photographers artfully use focus to draw attention to subjects of interest. “Good bokeh” contributes visual interest to a photograph and places its subjects in context.

In this vein of focusing on high-impact subjects while always maintaining a relationship to the data background, the Bokeh project attempts to address fundamental challenges of large dataset visualization:

  • How do we look at all the data?
    • What are the best perceptual approaches to honestly and accurately represent the data to domain experts and SMEs so they can apply their intuition to the data?
    • Are there automated approaches to accurately reduce large datasets so that outliers and anomalies are still visible, while we meaningfully represent baselines and backgrounds? How can we do this without “washing away” all the interesting bits during a naive downsampling?
    • If we treat the pixels and topology of pixels on a screen as a bottleneck in the I/O channel between hard drives and an analyst’s visual cortex, what are the best compression techniques at all levels of the data transformation pipeline?
  • How can scientists and data analysts be empowered to use visualization fluidly, not merely as an output facility or one stage of a pipeline, but as an entire mode of engagement with data and models?
    • Are language-based approaches for expressing mathematical modeling and data transformations the best way to compose novel interactive graphics?
    • What data-oriented interactions (besides mere linked brushing/selection) are useful for fluid, visually-enabled analysis?

Not likely any time soon but posting data for scientific research in ways that enable interactive analysis by readers (and snapshotting their results) could take debates over data and analysis to a whole new level.

As opposed to debating dots on a graph not of your own making and where alternative analyses are not available.
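Getting started is lighter weight than the vision statement might suggest. A minimal sketch with the plotting interface (current API, which may differ from the releases available when this was written):

```python
from bokeh.plotting import figure, output_file, show

# Static HTML output; Bokeh renders the plot with its BokehJS client library.
output_file("lines.html")

x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]

p = figure(title="Minimal Bokeh example", x_axis_label="x", y_axis_label="y")
p.line(x, y, line_width=2)
show(p)
```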

November 20, 2013

Storm, Neo4j and Python:…

Filed under: Graphs,Neo4j,Python,Storm — Patrick Durusau @ 4:26 pm

Storm, Neo4j and Python: Real-Time Stream Computation on Graphs by Sonal Raj.

From the webpage:

This page serves as a resource repository for my talk at PyCon India 2013 held at Bangalore, India on 30th August – 1st September, 2013. The talk introduces the basics of the Storm real-time distributed Computation Platform popularised by Twitter, and the Neo4J Graph Database, and goes on to explain how they can be used in conjunction to perform real-time computations on Graph Data with the help of emerging Python libraries – py2neo (for Neo4J) and petrel (for Storm).

Great slides, code skeletons, pointers to references and a live visualization!

See the video at: PyCon India 2013.

Demo gremlins mar the demonstration part but you can see:

A Storm Topology on AWS showing signup locations for people joining based on a sample Social Network data
http://www.enfoss.org/map-aws/storm-aws-visual.html

A quote from the slides that sticks with me:

Process Infinite Streams of data one-tuple-at-a-time.

😉

November 3, 2013

Google’s Python Lessons are Awesome

Filed under: NLTK,Python — Patrick Durusau @ 5:46 pm

Google’s Python Lessons are Awesome by Hartley Brody.

From the post:

Whether you’re just starting to learn Python, or you’ve been working with it for awhile, take note.

The lovably geeky Nick Parlante — a Google employee and CS lecturer at Stanford — has written some awesomely succinct tutorials that not only tell you how you can use Python, but also how you should use Python. This makes them a fantastic resource, regardless of whether you’re just starting, or you’ve been working with Python for awhile.

The course also features six YouTube videos of Nick giving a lesson in front of some new Google employees. These make it feel like he’s actually there teaching you every feature and trick, and I’d highly recommend watching all of them as you go through the lessons. Some of the videos are longish (~50m) so this is something you want to do when you’re sitting down and focused.

And to really get your feet wet, there are also downloadable samples, puzzles and challenges that go along with the lessons, so you can actually practice coding along with the Googlers in his class. They’re all pretty basic — most took me less than 5m — but they’re a great chance to practice what you’ve learned. Plus you get the satisfaction that comes with solving puzzles and successfully moving through the class.

I am studying the NLTK to get ready for a text analysis project. At least to be able to read along. This looks like a great resource to know about.

I also like the idea of samples, puzzles and challenges.

Not that samples, puzzles and challenges would put topic maps over the top, but they would make instruction/self-learning more enjoyable.

October 27, 2013

Lectures on scientific computing with Python

Filed under: Python,Skepticism — Patrick Durusau @ 5:35 pm

Lectures on scientific computing with Python by J.R. Johansson.

From the webpage:

A set of lectures on scientific computing with Python, using IPython notebooks.

Read only versions of the lectures:

To debunk pitches, proposals, articles, demos, etc., you will need to know, among other things, how scientific computing should be done.

Scientific computing is a very large field so take this as a starting point, not a destination.

October 26, 2013

Machine Learning And Analytics…

Filed under: Analytics,Machine Learning,Python — Patrick Durusau @ 4:10 pm

Machine Learning And Analytics Using Python For Beginners by Naveen Venkataraman.

From the post:

Analytics has been a major personal theme in 2013. I’ve recently taken an interest in machine learning after spending some time in analytics consulting. In this post, I’ll share a few tips for folks looking to get started with machine learning and data analytics.

Audience

The audience for this article is people who are looking to understand the basics of machine learning and those who are interested in developing analytics projects using Python. A coding background is not required in order to read this article.

Most resource postings list too many resources to consult.

Naveen lists a handful of resources and why you should use them.

October 3, 2013

Easy k-NN Document Classification with Solr and Python

Filed under: K-Nearest-Neighbors,Python,Solr — Patrick Durusau @ 7:02 pm

Easy k-NN Document Classification with Solr and Python by John Berryman.

From the post:

You’ve got a problem: You have 1 buzzillion documents that must all be classified. Naturally, tagging them by hand is completely infeasible. However you are fortunate enough to have several thousand documents that have already been tagged. So why not…

Build a k-Nearest Neighbors Classifier!

The concept of a k-NN document classifier is actually quite simple. Basically, given a new document, find the k most similar documents within the tagged collection, retrieve the tags from those documents, and declare the input document to have the same tag as that which was most common among the similar documents. Now, taking a page from Taming Text (page 189 to be precise), do you know of any opensource products that are really good at similarity-based document retrieval? That’s right, Solr! Basically, given a new input document, all we have to do is scoop out the “statistically interesting” terms, submit a search composed of these terms, and count the tags that come back. And it even turns out that Solr takes care of identifying the “statistically interesting” terms. All we have to do is submit the document to the Solr MoreLikeThis handler. MoreLikeThis then scans through the document and extracts “Goldilocks” terms – those terms that are not too long, not too short, not too common, and not too rare… they’re all just right.

I don’t know how timely John’s post is for you but it is very timely for me. 😉

I was being asked yesterday about devising a rough cut over a body of texts.

Looking forward to putting this approach through its paces.
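Before putting it through its paces, here is roughly what the Solr side of the loop looks like in Python. The core name, field names, and tag field are assumptions, and the MoreLikeThis handler has to be configured on your Solr instance.

```python
from collections import Counter
import requests

SOLR_MLT_URL = "http://localhost:8983/solr/docs/mlt"   # hypothetical core and handler path

def classify(text, k=10):
    """Ask Solr's MoreLikeThis handler for the k most similar tagged docs,
    then vote on their tags (simple majority)."""
    params = {
        "mlt.fl": "body",       # field to mine for "statistically interesting" terms
        "mlt.mintf": 1,
        "mlt.mindf": 1,
        "fl": "id,tag",
        "rows": k,
        "wt": "json",
        "stream.body": text,    # send the new document as the query content
    }
    resp = requests.get(SOLR_MLT_URL, params=params).json()
    tags = Counter(doc["tag"] for doc in resp["response"]["docs"] if "tag" in doc)
    return tags.most_common(1)[0][0] if tags else None

print(classify("Opening text of an untagged document..."))
```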

August 27, 2013

Astropy: A Community Python Package for Astronomy

Filed under: Astroinformatics,Python — Patrick Durusau @ 7:13 pm

Astropy: A Community Python Package for Astronomy by Bruce Berriman.

From the post:

The rapid adoption of Python by the astronomical community was starting to make it a victim of its own success, with fragmented development of Python packages across different groups. Thus the Astropy project began in 2011, with an ambitious goal to coordinate Python development across various groups and simplify installation and usage for astronomers. These ambitious goals have been met and are summarized in the paper Astropy: A Community Python Package for Astronomy, prepared by the Astropy Collaboration. The Astropy webpage provides download and build instructions for the current release, version 0.2.4, and complete documentation. It is released under a “3-clause” BSD-type license – the package may be used for any purpose, as long as the copyright is acknowledged and warranty disclaimers are given.

Get the paper and the code. Both will repay your study well.
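A small taste of what the package gives you out of the box, its unit-aware arithmetic (a minimal sketch, not from the paper):

```python
from astropy import units as u

# Attach units to plain numbers and convert between them without hand math.
distance = 4.2 * u.lyr                  # light-years
print(distance.to(u.pc))                # roughly 1.29 parsec

speed = 300.0 * u.km / u.s
print(speed.to(u.m / u.s))              # 300000.0 m / s
```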

The only good Python story I know was from a programmer who lamented the ability of Python to scale.

He wrote a sample program in Python for a customer, anticipating they would return for the production version.

But the sample program handled their needs so well, they had no need for the production version.

I am sure Python was due some of the credit but the programmer is a James Clark level programmer so his skills contributed to the result as well.

August 17, 2013

Parallel Astronomical Data Processing with Python:…

Filed under: Astroinformatics,Parallel Programming,Python — Patrick Durusau @ 3:52 pm

Parallel Astronomical Data Processing with Python: Recipes for multicore machines by Bruce Berriman.

From the post:

Most astronomers (myself included) have a high performance compute engine on their desktops. Modern computers now contain multicore processors, whose development was prompted by the need to reduce heat dissipation and power consumption but which give users a powerful processing machine at their fingertips. Singh, Browne and Butler have recently posted a preprint on astro-ph, submitted to Astronomy and Computing, that offers recipes in Python for running data parallel processing on multicore machines. Such machines offer an alternative to grids, clouds and clusters for many tasks, and the authors give examples based on commonly used astronomy toolkits.

The paper restricts itself to the use of CPython’s native multiprocessing module, for two reasons: much astronomical software is written in it, and it places sufficiently strong restrictions on managing threads launched by the OS that it can make parallel jobs run slower than serial jobs (not so for other flavors of Python, though, such as PyPy and Jython). The authors also chose to study data parallel applications, which are common in astronomy, rather than task parallel applications. The heart of the paper is a comparison of three approaches to multiprocessing in Python, with sample code snippets for each:
(…)
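For readers who have not used it, the multiprocessing module the paper builds on is compact. A data-parallel sketch with fake "images" standing in for real frames:

```python
from multiprocessing import Pool

def reduce_frame(frame):
    """Stand-in for a per-image calibration step (just sums pixel values here)."""
    return sum(frame)

if __name__ == "__main__":
    frames = [[i, i + 1, i + 2] for i in range(1000)]   # fake image data
    with Pool(processes=4) as pool:                     # one worker per core
        results = pool.map(reduce_frame, frames)        # data-parallel map
    print(len(results), results[:3])
```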

Bruce’s quick overview will give you the motivation to read this paper.

Astronomical data is easier to process in parallel than some data.

Suggestions on how to transform other data to make it easier to process in parallel?

August 16, 2013

Finding Parties Named in U.S. Law…

Filed under: Law,Natural Language Processing,NLTK,Python — Patrick Durusau @ 4:59 pm

Finding Parties Named in U.S. Law using Python and NLTK by Gary Sieling.

From the post:

U.S. Law periodically names specific institutions; historically it is possible for Congress to write a law naming an individual, although I think that has become less common. I expect the most common entities named in Federal Law to be groups like Congress. It turns out this is true, but the other most common entities are the law itself and bureaucratic functions like archivists.

To get at this information, we need to read the Code XML, and use a natural language processing library to get at the named groups.

NLTK is such an NLP library. It provides interesting features like sentence parsing, part of speech tagging, and named entity recognition. (If interested in the subject see my review of “Natural Language Processing with Python“, a book which covers this library in detail)

I would rather know who paid for particular laws but that requires information external to the Code XML data set. 😉

A very good exercise to become familiar with both NLTK and the Code XML data set.
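The NLTK side of that exercise is compact. A minimal named-entity sketch over a made-up sentence of statutory-sounding text; the model names are for current NLTK releases and may differ in older ones.

```python
import nltk

# One-time downloads for the tokenizer, tagger, and NE chunker models.
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

text = ("The Archivist of the United States shall transmit the report "
        "to the Congress and to the Library of Congress.")

tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
entities = [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees()
            if subtree.label() in ("ORGANIZATION", "GPE", "PERSON")]
print(entities)
```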

August 15, 2013

RE|Parse

Filed under: Parsing,Python — Patrick Durusau @ 6:45 pm

RE|PARSE

From the webpage:

Python library/tools for combining and using Regular Expressions in a maintainable way

This library also allows you to:

  • Maintain a database of Regular Expressions
  • Combine them together using Patterns
  • Search, Parse and Output data matched by combined Regex using Python functions.

If you know Regular Expressions already, this library basically just gives you a way to combine them together and hook them up to some callback functions in Python.
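To make that concrete without leaning on RE|PARSE's own API, which I have not studied, here is the general regex-plus-callback pattern in plain re:

```python
import re

# A tiny registry mapping compiled patterns to callbacks -- the general idea
# RE|PARSE packages up, sketched here with the standard library only.
PATTERNS = [
    (re.compile(r"(\d{4})-(\d{2})-(\d{2})"), lambda m: ("date", m.groups())),
    (re.compile(r"\$(\d+(?:\.\d{2})?)"),     lambda m: ("amount", m.group(1))),
]

def parse(text):
    results = []
    for regex, callback in PATTERNS:
        for match in regex.finditer(text):
            results.append(callback(match))
    return results

print(parse("Invoice 2013-08-15: total $99.95"))
# [('date', ('2013', '08', '15')), ('amount', '99.95')]
```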

This looks like a very useful tool.

August 3, 2013

Unpivoting Data with Excel, Open Refine and Python

Filed under: Data,Excel,Google Refine,Python — Patrick Durusau @ 4:09 pm

Unpivoting Data with Excel, Open Refine and Python by Tariq Khokhar.

From the post:

“How can I unpivot or transpose my tabular data so that there’s only one record per row?”

I see this question a lot and I thought it was worth a quick Friday blog post.

Data often aren’t quite in the format that you want. We usually provide CSV / XLS access to our data in “pivoted” or “normalized” form so they look like this:

Manipulating data is at least as crucial a skill for authoring a topic map as being able to model data.

Here are some quick tips for your toolkit.
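Tariq covers Excel, Open Refine and Python; in pandas the same unpivot is a single call to melt. A sketch with made-up World Bank-style columns:

```python
import pandas as pd

# "Pivoted" table: one row per country, one column per year (made-up numbers).
wide = pd.DataFrame({
    "Country": ["Aruba", "Andorra"],
    "2010": [24.5, 36.4],
    "2011": [24.7, 36.1],
})

# Unpivot: one record per row -- Country, Year, Value.
long = pd.melt(wide, id_vars=["Country"], var_name="Year", value_name="Value")
print(long)
```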

Examining Citations in Federal Law using Python

Filed under: Government,Law - Sources,Python,Topic Maps — Patrick Durusau @ 4:03 pm

Examining Citations in Federal Law using Python by Gary Sieling.

From the post:

Congress frequently passes laws which amend or repeal sections of prior laws; this produces a series of edits to law which programmers will recognize as bearing resemblance to source control history.

In concept this is simple, but in practice this is incredibly complex – for instance like source control, the system must handle renumbering. What we will see below is that while it is possible to get some data about links, it is difficult to resolve what those links point to.

Here is an example paragraph where, rather than amending a law, the citation serves as a justification for why several words are absent in one section:

(…)

There has been some discussion lately about good examples of using topic maps with particular data sets.

Curious how you would solve the problem posed here using a topic map?

For extra credit, how would you map from particular provisions in a bill to the person(s) most likely to benefit from them?

July 23, 2013

fastcluster: Fast Hierarchical, Agglomerative Clustering Routines for R and Python

Filed under: Clustering,Python,R — Patrick Durusau @ 12:47 pm

fastcluster: Fast Hierarchical, Agglomerative Clustering Routines for R and Python by Daniel Müllner.

Abstract:

The fastcluster package is a C++ library for hierarchical, agglomerative clustering. It provides a fast implementation of the most efficient, current algorithms when the input is a dissimilarity index. Moreover, it features memory-saving routines for hierarchical clustering of vector data. It improves both asymptotic time complexity (in most cases) and practical performance (in all cases) compared to the existing implementations in standard software: several R packages, MATLAB, Mathematica, Python with SciPy.

Builds upon the author’s prior work: Modern hierarchical, agglomerative clustering algorithms.

Both papers are worth your time or you can cut to the chase with the packages you will find here.

When you stop to think about it, merging (as in topic maps) is just clustering followed by processing of members of the cluster.

Which should open merging up to the use of any number of clustering algorithms, depending upon what subjects you want to talk about.
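A minimal sketch of "cutting to the chase" with the Python package: fastcluster's linkage is a drop-in for SciPy's, so the cluster labels come back through scipy.cluster.hierarchy (toy data).

```python
import numpy as np
import fastcluster
from scipy.cluster.hierarchy import fcluster

# Toy feature vectors for ten "topics" (made-up data).
X = np.random.rand(10, 4)

# Hierarchical, agglomerative clustering; same call signature as
# scipy.cluster.hierarchy.linkage, just faster.
Z = fastcluster.linkage(X, method="ward")

# Cut the dendrogram into 3 clusters -- candidates for "merging".
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```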

July 20, 2013

Python for Data Analysis: The Landscape of Tutorials

Filed under: Data Analysis,Python — Patrick Durusau @ 12:59 pm

Python for Data Analysis: The Landscape of Tutorials by Abhijit Dasgupta.

From the post:

Python has been one of the premier general scripting languages, and a major web development language. Numerical and data analysis and scientific programming developed through the packages Numpy and Scipy, which, along with the visualization package Matplotlib formed the basis for an open-source alternative to Matlab. Numpy provided array objects, cross-language integration, linear algebra and other functionalities. Scipy adds to this and provides optimization, linear algebra, statistics and basic image analysis capabilities. Matplotlib provides sophisticated 2-D and basic 3-D graphics capabilities with Matlab-like syntax.


Further recent development has resulted in a rather complete stack for data manipulation and analysis, that includes Sympy for symbolic mathematics, pandas for data structures and analysis, and IPython as an enhanced console and HTML notebook that also facilitates parallel computation.

An even richer data analysis ecosystem is quickly evolving in Python, led by Enthought and Continuum Analytics and several other independent and associated efforts. We have described this ecosystem here. [“ecosystem” and “here” are two distinct links.]

(…)

A very impressive listing of tutorials on Python packages for data analysis.
