Machine Learning Etudes in Astrophysics: Selection Functions for Mock Cluster Catalogs

Machine Learning Etudes in Astrophysics: Selection Functions for Mock Cluster Catalogs by Amir Hajian, Marcelo Alvarez, J. Richard Bond.

Abstract:

Making mock simulated catalogs is an important component of astrophysical data analysis. Selection criteria for observed astronomical objects are often too complicated to be derived from first principles. However the existence of an observed group of objects is a well-suited problem for machine learning classification. In this paper we use one-class classifiers to learn the properties of an observed catalog of clusters of galaxies from ROSAT and to pick clusters from mock simulations that resemble the observed ROSAT catalog. We show how this method can be used to study the cross-correlations of thermal Sunya’ev-Zeldovich signals with number density maps of X-ray selected cluster catalogs. The method reduces the bias due to hand-tuning the selection function and is readily scalable to large catalogs with a high-dimensional space of astrophysical features.

From the introduction:

In many cases the number of unknown parameters is so large that explicit rules for deriving the selection function do not exist. A sample of the objects does exist (the very objects in the observed catalog) however, and the observed sample can be used to express the rules for the selection function. This “learning from examples” is the main idea behind classi cation algorithms in machine learning. The problem of selection functions can be re-stated in the statistical machine learning language as: given a set of samples, we would like to detect the soft boundary of that set so as to classify new points as belonging to that set or not. (emphasis added)

Does the sentence:

In many cases the number of unknown parameters is so large that explicit rules for deriving the selection function do not exist.

sound like they could be describing people?

I mention this as a reason why you should be read broadly in machine learning in particular and IR in general.

What if all the known data about known terrorists, sans all the idle speculation by intelligence analysts, were gathered into a data set. Machine learning on that data set could then be tested against a simulation of potential terrorists, to help avoid the biases of intelligence analysts.

Lest the undeserved fixation on Muslims blind security services to other potential threats, such as governments bent on devouring their own populations.

I first saw this in a tweet by Stat.ML.

Comments are closed.