Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 2, 2011

GPUStats

Filed under: CUDA,Parallel Programming,Statistics — Patrick Durusau @ 6:25 pm

GPUStats

If you need an NVIDIA CUDA interface for statistical calculations, GPUStats may be of assistance.

From the webpage:

gpustats is a PyCUDA-based library implementing functionality similar to that present in scipy.stats. It implements a simple framework for specifying new CUDA kernels and extending existing ones. Here is a (partial) list of target functionality:

  • Probability density functions (pdfs). These are intended to speed up likelihood calculations in particular in Bayesian inference applications, such as in PyMC
  • Random variable generation using CURAND
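gpustats mirrors scipy.stats on the GPU. Since trying it requires CUDA hardware, here is the CPU-side computation it accelerates, a normal log-density in plain NumPy (my own sketch, not gpustats code):

```python
import numpy as np

def norm_logpdf(x, mean=0.0, std=1.0):
    """Log-density of a normal distribution, the kind of likelihood
    kernel gpustats offloads to the GPU via PyCUDA."""
    x = np.asarray(x, dtype=float)
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2

# Summing log-densities gives the log-likelihood used in Bayesian inference.
data = np.array([0.0, 1.0, -1.0])
loglik = norm_logpdf(data, mean=0.0, std=1.0).sum()
```

The GPU win comes from evaluating this density over millions of data points and parameter values at once, exactly the inner loop of a PyMC-style sampler.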

pandas: a Foundational Python Library for Data Analysis

Filed under: Data Analysis,Python — Patrick Durusau @ 6:25 pm

pandas: a Foundational Python Library for Data Analysis and Statistics by Wes McKinney

From the abstract:

In this paper we will discuss pandas, a Python library of rich data structures and tools for working with structured data sets common to statistics, finance, social sciences, and many other fields. The library provides integrated, intuitive routines for performing common data manipulations and analysis on such data sets. It aims to be the foundational layer for the future of statistical computing in Python. It serves as a strong complement to the existing scientific Python stack while implementing and improving upon the kinds of data manipulation tools found in other statistical programming languages such as R. In addition to detailing its design and features of pandas, we will discuss future avenues of work and growth opportunities for statistics and data analysis applications in the Python language.

A quick listing of things pandas does well (from pandas.sourceforge.net)

  • Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
  • Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes (possible to have multiple labels per tick)
  • Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
  • Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

Another data analysis library for your topic maps toolkit.
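A quick sketch of the first few bullets (missing data handling, automatic alignment, groupby), using nothing beyond the basics described above:

```python
import numpy as np
import pandas as pd

# Missing data: NaN is handled natively.
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0], index=["b", "c"])

# Automatic data alignment: labels, not positions, drive the arithmetic.
total = s1 + s2  # "a" has no match in s2, so it becomes NaN

# Split-apply-combine with groupby.
df = pd.DataFrame({"key": ["x", "x", "y"], "val": [1, 2, 30]})
sums = df.groupby("key")["val"].sum()
```

The alignment behavior is the part that takes getting used to if you come from R or raw NumPy: operations match on labels first and only then compute.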

Virtual Machines

Filed under: Virtual Machines — Patrick Durusau @ 6:25 pm

Virtual Machines

From the post:

One of the best resources about virtual machines (both high-level language VMs and system VMs) is Jim Smith’s and Ravi Nair’s book Virtual Machines: Versatile Platforms for Systems and Processes.

What functions would you optimize if you were writing a virtual machine?
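To make the question concrete, here is a toy stack-machine dispatch loop (the opcodes are entirely hypothetical); the dispatch loop itself is usually the first thing VM writers optimize:

```python
def run(program):
    """Execute a tiny stack-machine program: a list of (opcode, arg) pairs.
    This dispatch loop is the hot path a real VM would optimize
    (threaded dispatch, superinstructions, JIT compilation, ...)."""
    stack = []
    for op, arg in program:
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError("unknown opcode: %r" % op)
    return stack[-1]

# (2 + 3) * 4
result = run([("PUSH", 2), ("PUSH", 3), ("ADD", None), ("PUSH", 4), ("MUL", None)])
```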

Graph-Database.org

Filed under: Graph Partitioning,Graphs — Patrick Durusau @ 6:25 pm

Graph-Database.org

Interesting site with a number of presentations/resources on graphs. Worth a visit.

GraphDB in PHP

Filed under: GraphDB,PHP — Patrick Durusau @ 6:25 pm

GraphDB in PHP by Alessandro Nadalin and David Funaro.

Well, at 163 slides you know there is going to be some introductory graph material, but it went by quickly enough. I would have liked to see the presentation that went with the slides, but even without it the slides are fairly interesting.

Good to see PHP libraries because PHP is so widely used as a scripting language. (No offense to others, just an observation. Use whatever is the most comfortable for you.)

Summify’s Technology Examined

Filed under: Summarization,Summify — Patrick Durusau @ 6:24 pm

Summify’s Technology Examined by Phil Whelan.

From the post:

Following on from examining Quora’s technology, I thought I would look at a tech company closer to home. Home being Vancouver, BC. While the tech scene is much smaller here than in the valley, it is here. In fact, Vancouver boasts the largest number of entrepreneurs per capita.

Summify.com is a website that strives to make our lives easier and helps us deal with the information overload we all experience every time we sit down at our computers. The founders of this start-up, Cristian Strat and Mircea Paşoi, seem to have all the right ingredients for success. This is their biggest venture so far, but not their first. They have previously built Infoarena.ro and Balaur.ro, which are both focused on their home country of Romania.

“We’re a team of two Romanian hackers and entrepreneurs, passionate about technology and Internet startups. We’ve interned at Google and Microsoft and we’ve kicked ass in programming contests like the International Olympiad in Informatics and TopCoder.”
– Summify Team. “Our Story”

In this post I will look at the technology infrastructure they have built for Summify.com, the details of which they were kind enough to share with me.

This is from last Spring, so it may be old news, but I thought it was an interesting look “behind the scenes” at an “information overload solution” application.

Curious that the two challenges for Summify were seen as:

  • Crawling a large volume of feeds and web pages
  • Live streaming updates to the website

It may just be me, but I would think the semantics of the feeds would rank pretty high, both in terms of recognizing items of interest in terminology familiar to the user and in new terminology. For example, if I say I want feeds on P2P systems, an information-overload-reducing application should also give me entries on distributed networks.

That’s an easy example but you get the idea. And the system should do that across different interests of users and update its recognition of relevant items to include new terminology as it emerges.
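A toy sketch of what I mean, with an entirely made-up synonym table standing in for learned semantics (a real system would learn these mappings and grow them as new terminology emerges):

```python
# Hypothetical synonym table: each interest maps to the set of terms
# that should count as a match for it.
SYNONYMS = {
    "p2p": {"p2p", "peer-to-peer", "distributed network", "overlay network"},
}

def matches_interest(interest, headline):
    """Return True if any term associated with `interest` appears in the headline."""
    terms = SYNONYMS.get(interest.lower(), {interest.lower()})
    headline = headline.lower()
    return any(term in headline for term in terms)

hit = matches_interest("P2P", "New results on distributed network overlays")
miss = matches_interest("P2P", "Quarterly earnings report released")
```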

BTW, you might want to check out the Summify FAQ on how they determine your interests.

Systems We Make

Filed under: Distributed Systems — Patrick Durusau @ 6:24 pm

Systems We Make: Curating complex distributed systems of our times, by Srihari Srinivasan.

About:

These are indeed great times for Distributed Systems enthusiasts. The boom in the number and variety of systems being built in both academia and the industry has created a strong need to curate interesting creations under one roof.

Systems We Make was conceived to fill this void. Although Systems We Make is still in its infancy I hope to shape it into something more than just a catalog. So stay tuned as we evolve this site and do write to me about how you feel!

Systems We Make may still be in its “infancy” but I am certainly going to both watch this site for news as well as mine the resources it already offers!

I don’t have any predictions for when it will happen but it isn’t hard to foresee a time when “distributed computing” is as archaic as “my computer.” Computing will be a service much like electricity or water, based on a computing fabric, the details of which matter only to those charged with its maintenance.

Reading List for Distributed Systems

Filed under: Distributed Consistency,MapReduce — Patrick Durusau @ 6:24 pm

Reading List for Distributed Systems

From the post:

I quite often get asked by friends, colleagues who are interested in learning about distributed systems saying “Please tell me what are the top papers and books we need to read to learn more about distributed systems”. I used to write one off emails giving a few pointers. Now that, I’ve asked enough I thought it is a worthwhile exercise to put these together in a single post.

Please feel free to comment if you think there are more posts that needs to be added.

Reading list that ranges from Paxos to MapReduce and places in between. Looks like a very good list.

Processing & Twitter

Filed under: Indexing,Processing — Patrick Durusau @ 6:24 pm

Processing & Twitter

From the post:

Since I first released this tutorial in 2009, it has received thousands of views and has hopefully helped some of you get started with building projects incorporating Twitter with Processing. In late 2010, Twitter changed the way that authorization works, so I’ve updated the tutorial to get it inline with the new Twitter API functionality.

Accessing information from the Twitter API with Processing is (reasonably) easy. A few people have sent me e-mails asking how it all works, so I thought I’d write a very quick tutorial to get everyone up on their feet.

We don’t need to know too much about how the Twitter API functions, because someone has put together a very useful Java library to do all of the dirty work for us. It’s called twitter4j, and you can download it here. We’ll be using this in the first step of the building section of this tutorial.

Visualizing Twitter messages with Processing (graphics language) is good practice for any type of streaming data.

I really don’t understand the attraction of word clouds but know that many people like them. What I think would be cool is something that looks like a traditional index for browsing, where the words darken based on their frequency in the stream of material and perhaps even have see-also entries. Imagine that with a feed from CiteSeer or the ACM.
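A rough sketch of the frequency-to-darkness mapping I have in mind (the bucketing scheme is just one arbitrary choice):

```python
from collections import Counter

def index_with_shades(words, shades=4):
    """Map each word to a 'darkness' level (0 = lightest) based on its
    frequency in the stream, like a browsable index whose entries darken
    with use."""
    counts = Counter(words)
    max_count = max(counts.values())
    return {
        w: min(shades - 1, (c * shades - 1) // max_count)
        for w, c in sorted(counts.items())
    }

stream = "graph graph graph index word cloud cloud".split()
shaded = index_with_shades(stream)
```

The see-also entries would come from a synonym or co-occurrence table layered on top of the same counts.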

MarkLogic 5 is Big Data for the Enterprise

Filed under: BigData,Hadoop,MarkLogic — Patrick Durusau @ 6:24 pm

MarkLogic 5 is Big Data for the Enterprise

From the announcement:

SAN CARLOS, Calif. — November 1, 2011 — MarkLogic® Corporation, the company empowering organizations to make high stakes decisions on Big Data in real time, today announced MarkLogic 5, the latest version of its award-winning product designed for Big Data applications across the enterprise. MarkLogic 5 defines Big Data by empowering organizations to build Big Data applications that make information actionable. With MarkLogic 5, organizations get smarter answers faster by analyzing structured, unstructured, and semi-structured data in the same application. This allows a complete view of the health of the enterprise. Key features include the MarkLogic Connector for Hadoop, which marries large-scale batch processing with the real time Big Data applications MarkLogic has been delivering for a decade. MarkLogic 5 is a visionary step forward for organizations who want to manage complex Big Data on an operational database with confidence at scale. MarkLogic 5 is available today.

“Most of the hype around Big Data has focused only on the big or on the analytics,” said Ken Bado, president and CEO, MarkLogic. “For nearly a decade, MarkLogic has been helping its customers build cost effective Big Data applications that create competitive advantage. That means going beyond big and analytics to make information actionable so organizations can create real value for their business. With MarkLogic, multi-billion dollar companies like JP Morgan Chase and LexisNexis have redefined their business models, while organizations like the U.S. Army and the FAA have the real time, mission-critical information they need to get the job done. These aren’t science projects – they’re real organizations using Big Data applications right now.”

“We believe that MarkLogic 5 is well positioned to help solve many of the Big Data challenges that are emerging in the healthcare industry today,” said Jeff Cunningham, CTO at Informatics Corporation of America. “By incorporating MarkLogic 5 into our CareAlign™ Health Information Exchange platform, we have the ability to securely aggregate, manage, share, and analyze large amounts of patient information derived from a wide variety of sources and formats. These capabilities will help doctors, hospitals, and healthcare systems across the country solve many of the care coordination and population health management challenges that exist in healthcare today.”

There is a lot of noise concerning this release and it will take some time to obtain a favorable signal/noise ratio.

You can help contribute to the signal side of that equation:

Available with MarkLogic 5, the new Express license is free for developers looking to check out MarkLogic. It is limited to use on one computer with at most 2 CPUs and can hold up to 40GB of content. It includes options that make sense on a single computer (geospatial, alerting, conversion) and does not include options intended for clusters or enterprise usage (e.g., replication).

November 1, 2011

Parallel approaches in next-generation sequencing analysis pipelines

Filed under: Bioinformatics,Parallel Programming,Parallelism — Patrick Durusau @ 3:34 pm

Parallel approaches in next-generation sequencing analysis pipelines

From the post:

My last post described a distributed exome analysis pipeline implemented on the CloudBioLinux and CloudMan frameworks. This was a practical introduction to running the pipeline on Amazon resources. Here I’ll describe how the pipeline runs in parallel, specifically diagramming the workflow to identify points of parallelization during lane and sample processing.

Incredible innovation in throughput makes parallel processing critical for next-generation sequencing analysis. When a single Hi-Seq run can produce 192 samples (2 flowcells x 8 lanes per flowcell x 12 barcodes per lane), the analysis steps quickly become limited by the number of processing cores available.

The heterogeneity of architectures utilized by researchers is a major challenge in building re-usable systems. A pipeline needs to support powerful multi-core servers, clusters and virtual cloud-based machines. The approach we took is to scale at the level of individual samples, lanes and pipelines, exploiting the embarassingly parallel nature of the computation. An AMQP messaging queue allows for communication between processes, independent of the system architecture. This flexible approach allows the pipeline to serve as a general framework that can be easily adjusted or expanded to incorporate new algorithms and analysis methods.

The message passing based parallelism sounds a lot like Storm doesn’t it? Will message passing be what frees us from the constraints of architecture? Wondering what sort of performance “hit” we will take when not working really close to the metal? But, then the “metal” may become the basis for such message passing systems. Not quite yet but perhaps not so far away either.
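A toy version of the pattern, with Python threads and an in-process queue standing in for AMQP workers (the real pipeline distributes across machines, of course):

```python
import queue
import threading

def process_lane(lane):
    # Stand-in for the real per-lane alignment / variant-calling work.
    return lane * lane

def worker(tasks, results):
    while True:
        lane = tasks.get()
        if lane is None:  # poison pill: shut this worker down
            tasks.task_done()
            break
        results.put((lane, process_lane(lane)))
        tasks.task_done()

tasks, results = queue.Queue(), queue.Queue()
threads = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(4)]
for t in threads:
    t.start()
for lane in range(8):  # 8 lanes per flowcell, as in the post
    tasks.put(lane)
for _ in threads:
    tasks.put(None)
tasks.join()
processed = dict(results.get() for _ in range(8))
```

Swap the in-process queues for an AMQP broker and the same code shape works across a cluster, which is exactly the architecture-independence the post is after.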

Graph and Network Analysis – DERI 2011

Filed under: Graphs,Networks — Patrick Durusau @ 3:34 pm

Graph and Network Analysis – DERI 2011 – Dr. Derek Greene

From the website:

Summer School: July 6, 2011 – July 13, 2011. DERI, NUI Galway

Supporting material for the tutorial “Graph and Network Analysis” by Dr. Derek Greene from the Clique Research Cluster, providing an introduction to social network analysis, with examples using the Python NetworkX library.

Related Resources:

  • NetworkX: Library for network analysis (recommended v1.5) for Python (recommended v2.6.x / 2.7.x)
  • Gephi: Java interactive visualisation platform and toolkit – “Photoshop for graphs”.
  • Graclus: Graph partitioning tool
  • Louvain: Disjoint community finding software
  • CFinder: Overlapping community finding software
  • Moses: Overlapping community finding software
  • GCE: Overlapping community finding software
  • Dynamic community finding software

In case you weren’t able to make it to the Summer School, the next best thing!
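NetworkX is the tool to use, but as a reminder of how simple some of these measures are, here is degree centrality computed by hand on a toy adjacency list (the normalization by n - 1 matches what NetworkX reports):

```python
def degree_centrality(adj):
    """Degree centrality: a node's degree divided by (n - 1),
    the fraction of other nodes it touches."""
    n = len(adj)
    return {node: len(neighbors) / (n - 1) for node, neighbors in adj.items()}

# A toy undirected graph as an adjacency list.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
centrality = degree_centrality(graph)
```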

Dictionary of Algorithms and Data Structures

Filed under: Algorithms — Patrick Durusau @ 3:33 pm

Dictionary of Algorithms and Data Structures

From the webpage:

This web site is hosted in part by the Software and Systems Division, Information Technology Laboratory.

This is a dictionary of algorithms, algorithmic techniques, data structures, archetypal problems, and related definitions. Algorithms include common functions, such as Ackermann’s function. Problems include traveling salesman and Byzantine generals. Some entries have links to implementations and more information. Index pages list entries by area and by type. The two-level index has a total download 1/20 as big as this page.

Don’t use this site to cheat. Teachers, contact us if we can help.

To define or correct terms, please contact Paul E. Black. We do not include algorithms particular to business data processing, communications, operating systems or distributed algorithms, programming languages, AI, graphics, or numerical analysis: it is tough enough covering “general” algorithms and data structures.

I thought I had listed this site but apparently never did. Although only general algorithms, it is a good resource to have on hand.
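Since the dictionary's own description mentions Ackermann's function, here it is as a quick illustration (the standard textbook definition):

```python
def ackermann(m, n):
    """Ackermann's function: the textbook example of a computable function
    that is not primitive recursive. Grows explosively, so keep m small."""
    if m == 0:
        return n + 1
    if n == 0:
        return ackermann(m - 1, 1)
    return ackermann(m - 1, ackermann(m, n - 1))

value = ackermann(3, 3)  # already requires thousands of recursive calls
```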

Facebook100 data and a parser for it

Filed under: Data,Dataset — Patrick Durusau @ 3:33 pm

Facebook100 data and a parser for it

From the post:

A few weeks ago, Mason Porter posted a goldmine of data, the Facebook100 dataset. The dataset contains all of the Facebook friendships at 100 US universities at some time in 2005, as well as a number of node attributes such as dorm, gender, graduation year, and academic major. The data was apparently provided directly by Facebook.

As far as I know, the dataset is unprecedented and has the potential to advance both network methods and insights into the structure of acquaintanceship. Unfortunately, the Facebook Data Team requested that Porter no longer distribute the dataset. It does not include the names of individuals or even of any of the node attributes (they have been given integer ids), but Facebook seems to be concerned. Anonymized network data is after all vulnerable to de-anonymization (for some nice examples of why, see the last 20 minutes of this video lecture from Jon Kleinberg).

It’s a shame that Porter can no longer distribute the data. On the other hand, once a dataset like that has been released, will the internet be able to forget it? After a bit of poking around I found the dataset as a torrent file. In fact, if anyone is seeding the torrent, you can download it by following this link and it appears to be on rapidshare.

Can anyone confirm a location for the Facebook100 data? I get “file removed” from the brave folks at rapidshare and ads to register for various download services (before knowing the file is available) from the torrent site. Thanks!

Lab 49 Blog

Filed under: Artificial Intelligence,Finance Services,Machine Learning — Patrick Durusau @ 3:33 pm

Lab 49 Blog

From the main site:

Lab49 is a technology consulting firm that builds advanced solutions for the financial services industry. Our clients include many of the world’s largest investment banks, hedge funds and exchanges. Lab49 designs and delivers some of the most sophisticated and forward-thinking financial applications in the industry today, and has an impeccable delivery record on mission critical systems.

Lab49 helps clients effect positive change in their markets through technological innovation and a rich fabric of industry best practices and first-hand experience. From next-generation trading platforms to innovative risk aggregation and reporting systems to entirely new investment ventures, we enable our clients to realize new business opportunities and gain competitive advantage.

Lab49 cultivates a collaborative culture that is both innovative and delivery-focused. We value intelligent, experienced, and personable engineering professionals that work with clients as partners. With a proven ability to attract and retain industry-leading engineering talent and to forge and leverage valued partnerships, Lab49 continues to innovate at the vanguard of software and technology.

A very interesting blog sponsored by what appears to be a very interesting company, Lab 49.

Kernel Perceptron in Python

Filed under: Kernel Methods,Perceptron,Python — Patrick Durusau @ 3:33 pm

Kernel Perceptron in Python

From the post:

The Perceptron (Rosenblatt, 1957) is one of the oldest and simplest Machine Learning algorithms. It’s also trivial to kernelize, which makes it an ideal candidate to gain insights on kernel methods.

The original paper by F. Rosenblatt, The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain, Psychological Review, Vol. 65, No. 6, 1958.

Good way to learn more about kernel methods.

I have included a link to the original paper by Rosenblatt.

  1. What do you make of Rosenblatt’s choice to not use symbolic or Boolean logic?
  2. What do you make of the continued efforts (think Cyc/SUMO) to use symbolic or Boolean logic?
  3. Is knowledge/information probabilistic?

There are no certain answers to these questions; I am interested in how you approach discussing them.
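Since the post is about how trivially the perceptron kernelizes, here is a minimal sketch (my own, not the post's code): keep a mistake count per training example instead of a weight vector, and XOR, which no linear perceptron can learn, falls to an RBF kernel.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=2.0):
    # Gaussian (RBF) kernel matrix between the row vectors of X and Z.
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def train_kernel_perceptron(X, y, gamma=2.0, epochs=100):
    """Mistake-driven kernel perceptron: alpha[i] counts mistakes on
    example i, so the decision function is
    sign(sum_i alpha[i] * y[i] * K(x_i, x))."""
    K = rbf_kernel(X, X, gamma)
    alpha = np.zeros(len(X))
    for _ in range(epochs):
        mistakes = 0
        for i in range(len(X)):
            if y[i] * ((alpha * y) @ K[:, i]) <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:  # converged: every example classified correctly
            break
    return alpha, K

# XOR: not linearly separable, but separable in RBF feature space.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0, -1.0])
alpha, K = train_kernel_perceptron(X, y)
preds = np.sign((alpha * y) @ K)
```

Note that the data only ever enters through the kernel matrix K, which is the whole trick: swap the kernel and you swap the feature space.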

aliquote

Filed under: Bioinformatics,Data Mining — Patrick Durusau @ 3:33 pm

aliquote

One of the odder blogs I have encountered, particularly the “bag of tweets” postings.

What appear to be fairly high-grade postings on data and bioinformatics topics. It is one that I will be watching and thought I would pass it along.

Natural Language Processing from Scratch

Filed under: Natural Language Processing,Neural Networks — Patrick Durusau @ 3:32 pm

Natural Language Processing from Scratch

From the post:

Ronan's masterpiece, "Natural Language Processing (Almost) from Scratch", has been published in JMLR. This paper describes how to use a unified neural network architecture to solve a collection of natural language processing tasks with near state-of-the-art accuracies and ridiculously fast processing speed. A couple thousand lines of C code processes english sentence at more than 10000 words per second and outputs part-of-speech tags, named entity tags, chunk boundaries, semantic role labeling tags, and, in the latest version, syntactic parse trees. Download SENNA!

This looks very cool! Check out the paper along with the software!

A Convenient Framework for Efficient Parallel Multipass Algorithms

Filed under: MapReduce,Parallel Programming — Patrick Durusau @ 3:32 pm

A Convenient Framework for Efficient Parallel Multipass Algorithms by Markus Weimer, Sriram Rao, and Martin Zinkevich.

Abstract:

The amount of data available is ever-increasing. At the same time, the available time to learn from the available data is decreasing in many applications, especially on the web. These two trends together with limited improvements in per-cpu speed and hard disk bandwidth lead to the need for parallel machine learning algorithms. Numerous have been proposed in the past (including [1, 3, 4]). Many of them make use of frameworks like MapReduce [2], as it facilitates easy parallelization and provides fault tolerance and data local computation at the framework level. However, MapReduce also introduces some inherent inefficiencies when compared to message passing systems like MPI.

In this paper, we present a computational framework based on Workers and Aggregators for dataparallel computations that retains the simplicity of MapReduce, while offering a significant speedup for a large class of algorithms. We report experiments based on several implementations of Stochastic Gradient Descent (SGD): The well known sequential variant as well as a parallel version inspired by our recent work in [5] which we implemented both in MapReduce and the proposed framework.

The direct passing of messages reminds me of Storm.
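The Workers/Aggregators pattern is easy to sketch. Here is a simulated version (sequential, and batch gradients rather than true SGD, so only the communication pattern matches the paper):

```python
import numpy as np

def worker_gradient(w, X, y):
    # Each worker computes the least-squares gradient on its shard of data.
    return 2 * X.T @ (X @ w - y) / len(y)

def parallel_descent(shards, dim, rounds=200, lr=0.1):
    """Workers/Aggregators sketch: every round, workers send gradients on
    their shards to an aggregator, which averages them and updates the
    shared model (simulated sequentially here)."""
    w = np.zeros(dim)
    for _ in range(rounds):
        grads = [worker_gradient(w, X, y) for X, y in shards]  # workers
        w -= lr * np.mean(grads, axis=0)                       # aggregator
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w
shards = [(X[:50], y[:50]), (X[50:], y[50:])]  # two workers
w = parallel_descent(shards, dim=2)
```

Unlike MapReduce, the workers here are long-lived and keep their shard in memory across rounds, which is where the speedup over per-pass MapReduce jobs comes from.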

Comments?

Structure and Interpretation of Computer Programs (Python)

Filed under: Python,Scheme — Patrick Durusau @ 3:32 pm

Structure and Interpretation of Computer Programs (Python)

The classic Scheme-based text by Abelson and Sussman, taught with Python code. (Lecture notes too.)
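In the SICP spirit, the classic exercise of building pairs out of nothing but closures translates directly to Python:

```python
def cons(a, b):
    """SICP's classic exercise: represent a pair purely as a closure."""
    def pair(select):
        return a if select == 0 else b
    return pair

def car(p):
    # First element of a closure-pair.
    return p(0)

def cdr(p):
    # Second element of a closure-pair.
    return p(1)

p = cons(1, cons(2, 3))
```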
