Clustering with phylogenetic tools in astrophysics [IMA]

Phylogenetic approaches are finding more and more applications outside the field of biology. Astrophysics is no exception since an overwhelming amount of multivariate data has appeared in the last twenty years or so. In particular, the diversification of galaxies throughout the evolution of the Universe quite naturally invokes phylogenetic approaches. We have demonstrated that Maximum Parsimony brings useful astrophysical results, and we now proceed toward the analyses of large datasets for galaxies. In this talk I present how we solve the major difficulties for this goal: the choice of the parameters, their discretization, and the analysis of a high number of objects with an unsupervised NP-hard classification technique like cladistics.
1. Introduction
How do galaxies form, and when? How did galaxies evolve and transform themselves to create the diversity we observe? What are the progenitors of present-day galaxies? To answer these big questions, observations throughout the Universe and physical modelling are obvious tools. But between these, there is a key process, without which it would be impossible to extract digestible information from the complexity of these systems: classification.
One century ago, galaxies were discovered by Hubble. From images obtained in the visible range of wavelengths, he synthesised his observations through the usual process: classification. With only one parameter (the shape), which is qualitative and determined by eye, he found four categories: ellipticals, spirals, barred spirals and irregulars. This is the famous Hubble classification. He later hypothesised relationships between these classes, building the Hubble Tuning Fork. The Hubble classification has been refined, notably by de Vaucouleurs, and is still used as the only global classification of galaxies.
Even though the physical relationships proposed by Hubble are no longer retained, the Hubble Tuning Fork is nearly always used to represent the classification of galaxy diversity under its new name, the Hubble sequence (e.g. Delgado-Serrano, 2012). Its success is impressive and can be understood by its simplicity, even its beauty, and by the many correlations found between the morphology of galaxies and their other properties. And one must admit that there is no alternative up to now, even though both the Hubble classification and diagram have been recognised to be unsatisfactory. Among the most obvious flaws of this classification, one must mention its monovariate, qualitative, subjective and old-fashioned nature, as well as the difficulty of characterising the morphology of distant galaxies. The first two significant multivariate studies were by Watanabe et al. (1985) and Whitmore (1984). Since 2005, the number of studies attempting to go beyond the Hubble classification has increased greatly. Why, despite this, are the Hubble classification and its sequence still alive, and why has no alternative yet emerged (Sandage, 2005)? My feeling is that the results of the multivariate analyses are not easily integrated into a century-old practice of modelling the observations. In addition, extragalactic objects like galaxies, stellar clusters or stars do evolve. Astronomy now provides data on very distant objects, raising the question of the relationships between those and our present-day nearby galaxies. Clearly, this is a phylogenetic problem. Astrocladistics aims at exploring the use of phylogenetic tools in astrophysics (Fraix-Burnet et al., 2006a,b). We have proved that Maximum Parsimony (or cladistics) can be applied in astrophysics and provides a new tool for exploring the data (Fraix-Burnet et al., 2009, 2012; Cardone & Fraix-Burnet, 2013).
As far as the classification of galaxies is concerned, a larger number of objects must now be analysed. In this paper, I

Read this paper on arXiv…

D. Fraix-Burnet
Thu, 2 Jun 16

Comments: Proceedings of the 60th World Statistics Congress of the International Statistical Institute, ISI2015, Jul 2015, Rio de Janeiro, Brazil

A Selection of Giant Radio Sources from NVSS [GA]

Results of the application of pattern recognition techniques to the problem of identifying Giant Radio Sources (GRS) from the data in the NVSS catalog are presented and issues affecting the process are explored. Decision-tree pattern recognition software was applied to training set source pairs developed from known NVSS large angular size radio galaxies. The full training set consisted of 51,195 source pairs, 48 of which were known GRS for which each lobe was primarily represented by a single catalog component. The source pairs had a maximum separation of 20 arc minutes and a minimum component area of 1.87 square arc minutes at the 1.4 mJy level. The importance of comparing resulting probability distributions of the training and application sets for cases of unknown class ratio is demonstrated. The probability of correctly ranking a randomly selected (GRS, non-GRS) pair from the best of the tested classifiers was determined to be 97.8 +/- 1.5%. The best classifiers were applied to the over 870,000 candidate pairs from the entire catalog. Images of higher ranked sources were visually screened and a table of over sixteen hundred candidates, including morphological annotation, is presented. These systems include doubles and triples, Wide-Angle Tail (WAT) and Narrow-Angle Tail (NAT), S- or Z-shaped systems, and core-jets and resolved cores. While some resolved lobe systems are recovered with this technique, generally it is expected that such systems would require a different approach.
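The quoted "probability of correctly ranking a randomly selected (GRS, non-GRS) pair" is the quantity usually called the area under the ROC curve. The abstract does not say how the paper estimates it, so the following is only a minimal sketch of the definition, with hypothetical scores that are not taken from the catalog:

```python
def pairwise_ranking_probability(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive
    example receives the higher score; ties count as half a win."""
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical classifier outputs (illustrative values only).
grs_scores = [0.9, 0.8, 0.4]
non_grs_scores = [0.7, 0.3, 0.2, 0.1]
print(pairwise_ranking_probability(grs_scores, non_grs_scores))
```

The double loop is quadratic in the number of examples, which is fine for illustration; for a training set of tens of thousands of pairs a rank-based computation would be preferable.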

Read this paper on arXiv…

D. Proctor
Wed, 23 Mar 16

Comments: 20 pages of text, 6 figures, 22 pages tables, total 55 pages. The stub for Table 6 is followed by the complete machine readable file. To be published in The Astrophysical Journal Supplement

Simple, Fast and Accurate Photometric Estimation of Specific Star Formation Rate [IMA]

Large-scale surveys make huge amounts of photometric data available. Because of the sheer number of objects, spectral data cannot be obtained for all of them. Therefore it is important to devise techniques for reliably estimating physical properties of objects from photometric information alone. These estimates are needed to automatically identify interesting objects worth a follow-up investigation as well as to produce the required data for a statistical analysis of the space covered by a survey. We argue that machine learning techniques are suitable to compute these estimates accurately and efficiently. This study considers the task of estimating the specific star formation rate (sSFR) of galaxies. It is shown that a nearest neighbours algorithm can produce better sSFR estimates than traditional SED fitting. We show that we can obtain accurate estimates of the sSFR even at high redshifts using only broad-band photometry based on the u, g, r, i and z filters from the Sloan Digital Sky Survey (SDSS). We additionally demonstrate that combining magnitudes estimated with different methods from the same photometry can lead to a further improvement in accuracy. The study highlights the general importance of performing proper model selection to improve the results of machine learning systems and how feature selection can provide insights into the predictive relevance of particular input features. Furthermore, the use of massively parallel computation on graphics processing units (GPUs) for handling large amounts of astronomical data is advocated.
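To make the nearest neighbours idea concrete, here is a minimal sketch of k-nearest-neighbour regression. The abstract does not give the paper's distance metric, value of k, or feature set, so the Euclidean distance, `k=3`, the two-colour feature vectors and the target values below are all assumptions chosen for illustration:

```python
import math


def knn_predict(train_x, train_y, query, k=3):
    """Predict a target value as the mean over the k training points
    closest to `query` in Euclidean distance."""
    if k < 1 or k > len(train_x):
        raise ValueError("k must be between 1 and the training set size")
    # Pair each distance with its target, then keep the k smallest distances.
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / k


# Hypothetical training set: (u-g, g-r) colours and log10(sSFR) values.
colours = [(1.1, 0.4), (1.3, 0.5), (1.8, 0.8), (1.9, 0.9)]
log_ssfr = [-9.8, -10.0, -11.5, -11.9]
print(knn_predict(colours, log_ssfr, (1.2, 0.45), k=2))
```

This brute-force version compares the query against every training point. Each comparison is independent of the others, which is what makes the method a natural fit for the massively parallel GPU computation the abstract advocates.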

Read this paper on arXiv…

K. Stensbo-Smidt, F. Gieseke, C. Igel, et al.
Wed, 18 Nov 15

Comments: 10 pages, 12 figures, 1 table. Submitted to MNRAS

A review of learning vector quantization classifiers [CL]

In this work we present a review of the state of the art of Learning Vector Quantization (LVQ) classifiers. A taxonomy is proposed which integrates the most relevant LVQ approaches to date. The main concepts associated with modern LVQ approaches are defined. A comparison is made among eleven LVQ classifiers using one real-world and two artificial datasets.
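For readers new to the family, the sketch below shows the classic LVQ1 update rule: the prototype nearest to a training sample is pulled toward it when their labels agree and pushed away otherwise. The abstract does not list which eleven variants the review compares, so LVQ1 is used here only as a representative starting point, and the learning rate, prototypes and samples are illustrative assumptions:

```python
import math


def nearest_prototype(prototypes, x):
    """Return the index of the prototype closest to x (Euclidean distance)."""
    return min(range(len(prototypes)), key=lambda j: math.dist(prototypes[j], x))


def lvq1_step(prototypes, labels, x, y, lr=0.1):
    """Apply one LVQ1 update in place and return the updated prototype's index.

    The winning prototype moves toward x if its label equals y,
    and away from x otherwise.
    """
    i = nearest_prototype(prototypes, x)
    sign = 1.0 if labels[i] == y else -1.0
    prototypes[i] = [w + sign * lr * (xi - w) for w, xi in zip(prototypes[i], x)]
    return i


# Hypothetical two-class toy problem with one prototype per class.
prototypes = [[0.0, 0.0], [1.0, 1.0]]
labels = ["a", "b"]
samples = [([0.2, 0.1], "a"), ([0.9, 0.8], "b"), ([0.1, 0.3], "a")]
for _ in range(10):
    for x, y in samples:
        lvq1_step(prototypes, labels, x, y)
print(labels[nearest_prototype(prototypes, [0.15, 0.2])])
```

Classification after training is simply a nearest-prototype lookup, as in the last line.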

Read this paper on arXiv…

D. Nova and P. Estevez
Thu, 24 Sep 15

Comments: 14 pages

Machine Learning Model of the Swift/BAT Trigger Algorithm for Long GRB Population Studies [HEAP]

To draw inferences about gamma-ray burst (GRB) source populations based on Swift observations, it is essential to understand the detection efficiency of the Swift Burst Alert Telescope (BAT). This study considers the problem of modeling the Swift/BAT triggering algorithm for long GRBs, a computationally expensive procedure, and models it using machine learning algorithms. A large sample of simulated GRBs from Lien 2014 is used to train various models: random forests, boosted decision trees (with AdaBoost), support vector machines, and artificial neural networks. The best models have accuracies of $\gtrsim97\%$ ($\lesssim 3\%$ error), which is a significant improvement over a cut in GRB flux, which has an accuracy of $89.6\%$ ($10.4\%$ error). These models are then used to measure the detection efficiency of Swift as a function of redshift $z$, which is used to perform Bayesian parameter estimation on the GRB rate distribution. We find a local GRB rate density of $n_0 \sim 0.48^{+0.41}_{-0.23} \ {\rm Gpc}^{-3} {\rm yr}^{-1}$ with power-law indices of $n_1 \sim 1.7^{+0.6}_{-0.5}$ and $n_2 \sim -5.9^{+5.7}_{-0.1}$ for GRBs above and below a break point of $z_1 \sim 6.8^{+2.8}_{-3.2}$. This methodology is able to improve upon earlier studies by more accurately modeling Swift detection and using this for fully Bayesian model fitting. The code used in this analysis is publicly available online (

Read this paper on arXiv…

P. Graff, A. Lien, J. Baker, et al.
Fri, 4 Sep 15

Comments: 16 pages, 18 figures, 5 tables, submitted to ApJ

Distinguishing short and long Fermi GRBs [HEAP]

Two classes of GRBs, short and long, have been firmly established and are usually ascribed to different progenitors, yet these classes overlap for a variety of descriptive parameters. A subsample of 46 long and 22 short $Fermi$ GRBs with estimated Hurst Exponents (HEs), complemented by minimum variability time-scales (MVTS) and durations ($T_{90}$), is used to perform supervised Machine Learning (ML) and Monte Carlo (MC) simulations using a Support Vector Machine (SVM) algorithm. It is found that while $T_{90}$ itself performs very well in distinguishing short and long GRBs, the overall success ratio is higher when the training set is complemented by MVTS and HE. These results may allow the introduction of a new (non-linear) parameter that might provide a less ambiguous classification of GRBs.

Read this paper on arXiv…

M. Tarnopolski
Mon, 20 Jul 15

Comments: 8 pages, 6 figures; resubmitted to MNRAS after addressing referee’s comments

Celeste: Variational inference for a generative model of astronomical images [IMA]

We present a new, fully generative model of optical telescope image sets, along with a variational procedure for inference. Each pixel intensity is treated as a Poisson random variable, with a rate parameter dependent on latent properties of stars and galaxies. Key latent properties are themselves random, with scientific prior distributions constructed from large ancillary data sets. We check our approach on synthetic images. We also run it on images from a major sky survey, where it exceeds the performance of the current state-of-the-art method for locating celestial bodies and measuring their colors.
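The per-pixel Poisson assumption can be illustrated with a short log-likelihood computation. This is only a sketch of that one ingredient, not the Celeste model or its variational procedure: the rate model `expected_rates` (a constant sky background plus a source flux spread over pixels by fixed weights) and all numbers are hypothetical.

```python
import math


def poisson_log_likelihood(counts, rates):
    """Sum of log Poisson probabilities log p(k | rate) over all pixels."""
    if len(counts) != len(rates):
        raise ValueError("counts and rates must have the same length")
    total = 0.0
    for k, lam in zip(counts, rates):
        if lam <= 0:
            raise ValueError("Poisson rates must be positive")
        # log p(k | lam) = k*log(lam) - lam - log(k!), with log(k!) = lgamma(k + 1)
        total += k * math.log(lam) - lam - math.lgamma(k + 1)
    return total


def expected_rates(sky, flux, psf_weights):
    """Hypothetical rate model: sky background plus flux times a per-pixel weight."""
    return [sky + flux * w for w in psf_weights]


# Hypothetical three-pixel image of a point source centred on the middle pixel.
observed = [3, 12, 4]
weights = [0.1, 0.8, 0.1]
for flux in (1.0, 12.0):
    rates = expected_rates(2.0, flux, weights)
    print(flux, poisson_log_likelihood(observed, rates))
```

Comparing the two printed values shows how the likelihood favours the latent flux that better explains the observed counts; inference in the paper operates on a far richer set of latent properties.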

Read this paper on arXiv…

J. Regier, A. Miller, J. McAuliffe, et al.
Thu, 4 Jun 15

Comments: in the Proceedings of the 32nd International Conference on Machine Learning (2015)