# Phylogenetic Tools in Astrophysics [IMA]

Multivariate clustering in astrophysics is a recent development justified by the bigger and bigger surveys of the sky. The phylogenetic approach is probably the most unexpected technique that has appeared for the unsupervised classification of galaxies, stellar populations or globular clusters. On one side, this is a somewhat natural way of classifying astrophysical entities which are all evolving objects. On the other side, several conceptual and practical difficulties arize, such as the hierarchical representation of the astrophysical diversity, the continuous nature of the parameters, and the adequation of the result to the usual practice for the physical interpretation. Most of these have now been solved through the studies of limited samples of stellar clusters and galaxies. Up to now, only the Maximum Parsimony (cladistics) has been used since it is the simplest and most general phylogenetic technique. Probabilistic and network approaches are obvious extensions that should be explored in the future.

D. Fraix-Burnet
Thu, 2 Mar 17
38/44

|

# Generative Adversarial Networks recover features in astrophysical images of galaxies beyond the deconvolution limit [IMA]

Observations of astrophysical objects such as galaxies are limited by various sources of random and systematic noise from the sky background, the optical system of the telescope and the detector used to record the data. Conventional deconvolution techniques are limited in their ability to recover features in imaging data by the Shannon-Nyquist sampling theorem. Here we train a generative adversarial network (GAN) on a sample of $4,550$ images of nearby galaxies at $0.01<z<0.02$ from the Sloan Digital Sky Survey and conduct $10\times$ cross validation to evaluate the results. We present a method using a GAN trained on galaxy images that can recover features from artificially degraded images with worse seeing and higher noise than the original with a performance which far exceeds simple deconvolution. The ability to better recover detailed features such as galaxy morphology from low-signal-to-noise and low angular resolution imaging data significantly increases our ability to study existing data sets of astrophysical objects as well as future observations with observatories such as the Large Synoptic Sky Telescope (LSST) and the Hubble and James Webb space telescopes.

K. Schawinski, C. Zhang, H. Zhang, et. al.
Fri, 3 Feb 17
34/55

Comments: Accepted for publication in MNRAS, for the full code and a virtual machine set up to run it, see this http URL

|

# Correlated signal inference by free energy exploration [CL]

The inference of correlated signal fields with unknown correlation structures is of high scientific and technological relevance, but poses significant conceptual and numerical challenges. To address these, we develop the correlated signal inference (CSI) algorithm within information field theory (IFT) and discuss its numerical implementation. To this end, we introduce the free energy exploration (FrEE) strategy for numerical information field theory (NIFTy) applications. The FrEE strategy is to let the mathematical structure of the inference problem determine the dynamics of the numerical solver. FrEE uses the Gibbs free energy formalism for all involved unknown fields and correlation structures without marginalization of nuisance quantities. It thereby avoids the complexity marginalization often impose to IFT equations. FrEE simultaneously solves for the mean and the uncertainties of signal, nuisance, and auxiliary fields, while exploiting any analytically calculable quantity. Finally, FrEE uses a problem specific and self-tuning exploration strategy to swiftly identify the optimal field estimates as well as their uncertainty maps. For all estimated fields, properly weighted posterior samples drawn from their exact, fully non-Gaussian distributions can be generated. Here, we develop the FrEE strategies for the CSI of a normal, a log-normal, and a Poisson log-normal IFT signal inference problem and demonstrate their performances via their NIFTy implementations.

T. Ensslin and J. Knollmuller
Wed, 28 Dec 16
31/46

Comments: 19 pages, 5 figures, submitted

# Astronomical image reconstruction with convolutional neural networks [CL]

State of the art methods in astronomical image reconstruction rely on the resolution of a regularized or constrained optimization problem. Solving this problem can be computationally intensive and usually leads to a quadratic or at least superlinear complexity w.r.t. the number of pixels in the image. We investigate in this work the use of convolutional neural networks for image reconstruction in astronomy. With neural networks, the computationally intensive tasks is the training step, but the prediction step has a fixed complexity per pixel, i.e. a linear complexity. Numerical experiments show that our approach is both computationally efficient and competitive with other state of the art methods in addition to being interpretable.

R. Flamary
Thu, 15 Dec 16
33/59

# Learning an Astronomical Catalog of the Visible Universe through Scalable Bayesian Inference [CL]

Celeste is a procedure for inferring astronomical catalogs that attains state-of-the-art scientific results. To date, Celeste has been scaled to at most hundreds of megabytes of astronomical images: Bayesian posterior inference is notoriously demanding computationally. In this paper, we report on a scalable, parallel version of Celeste, suitable for learning catalogs from modern large-scale astronomical datasets. Our algorithmic innovations include a fast numerical optimization routine for Bayesian posterior inference and a statistically efficient scheme for decomposing astronomical optimization problems into subproblems.
Our scalable implementation is written entirely in Julia, a new high-level dynamic programming language designed for scientific and numerical computing. We use Julia’s high-level constructs for shared and distributed memory parallelism, and demonstrate effective load balancing and efficient scaling on up to 8192 Xeon cores on the NERSC Cori supercomputer.

J. Regier, K. Pamnany, R. Giordano, et. al.
Fri, 11 Nov 16
11/40

# Enabling Dark Energy Science with Deep Generative Models of Galaxy Images [IMA]

Understanding the nature of dark energy, the mysterious force driving the accelerated expansion of the Universe, is a major challenge of modern cosmology. The next generation of cosmological surveys, specifically designed to address this issue, rely on accurate measurements of the apparent shapes of distant galaxies. However, shape measurement methods suffer from various unavoidable biases and therefore will rely on a precise calibration to meet the accuracy requirements of the science analysis. This calibration process remains an open challenge as it requires large sets of high quality galaxy images. To this end, we study the application of deep conditional generative models in generating realistic galaxy images. In particular we consider variations on conditional variational autoencoder and introduce a new adversarial objective for training of conditional generative networks. Our results suggest a reliable alternative to the acquisition of expensive high quality observations for generating the calibration data needed by the next generation of cosmological surveys.

S. Ravanbakhsh, F. Lanusse, R. Mandelbaum, et. al.
Tue, 20 Sep 16
10/74

|

# Mapping the Similarities of Spectra: Global and Locally-biased Approaches to SDSS Galaxy Data [IMA]

We apply a novel spectral graph technique, that of locally-biased semi-supervised eigenvectors, to study the diversity of galaxies. This technique permits us to characterize empirically the natural variations in observed spectra data, and we illustrate how this approach can be used in an exploratory manner to highlight both large-scale global as well as small-scale local structure in Sloan Digital Sky Survey (SDSS) data. We use this method in a way that simultaneously takes into account the measurements of spectral lines as well as the continuum shape. Unlike Principal Component Analysis, this method does not assume that the Euclidean distance between galaxy spectra is a good global measure of similarity between all spectra, but instead it only assumes that local difference information between similar spectra is reliable. Moreover, unlike other nonlinear dimensionality methods, this method can be used to characterize very finely both small-scale local as well as large-scale global properties of realistic noisy data. The power of the method is demonstrated on the SDSS Main Galaxy Sample by illustrating that the derived embeddings of spectra carry an unprecedented amount of information. By using a straightforward global or unsupervised variant, we observe that the main features correlate strongly with star formation rate and that they clearly separate active galactic nuclei. Computed parameters of the method can be used to describe line strengths and their interdependencies. By using a locally-biased or semi-supervised variant, we are able to focus on typical variations around specific objects of astronomical interest. We present several examples illustrating that this approach can enable new discoveries in the data as well as a detailed understanding of very fine local structure that would otherwise be overwhelmed by large-scale noise and global trends in the data.

D. Lawlor, T. Budavari and M. Mahoney
Wed, 14 Sep 16
19/75

Comments: 34 pages. A modified version of this paper has been accepted to The Astrophysical Journal

|

# Clustering with phylogenetic tools in astrophysics [IMA]

Phylogenetic approaches are finding more and more applications outside the field of biology. Astrophysics is no exception since an overwhelming amount of multivariate data has appeared in the last twenty years or so. In particular, the diversification of galaxies throughout the evolution of the Universe quite naturally invokes phylogenetic approaches. We have demonstrated that Maximum Parsimony brings useful astrophysical results, and we now proceed toward the analyses of large datasets for galaxies. In this talk I present how we solve the major difficulties for this goal: the choice of the parameters, their discretization, and the analysis of a high number of objects with an unsupervised NP-hard classification technique like cladistics. 1. Introduction How do the galaxy form, and when? How did the galaxy evolve and transform themselves to create the diversity we observe? What are the progenitors to present-day galaxies? To answer these big questions, observations throughout the Universe and the physical modelisation are obvious tools. But between these, there is a key process, without which it would be impossible to extract some digestible information from the complexity of these systems. This is classification. One century ago, galaxies were discovered by Hubble. From images obtained in the visible range of wavelengths, he synthetised his observations through the usual process: classification. With only one parameter (the shape) that is qualitative and determined with the eye, he found four categories: ellipticals, spirals, barred spirals and irregulars. This is the famous Hubble classification. He later hypothetized relationships between these classes, building the Hubble Tuning Fork. The Hubble classification has been refined, notably by de Vaucouleurs, and is still used as the only global classification of galaxies. Even though the physical relationships proposed by Hubble are not retained any more, the Hubble Tuning Fork is nearly always used to represent the classification of the galaxy diversity under its new name the Hubble sequence (e.g. Delgado-Serrano, 2012). Its success is impressive and can be understood by its simplicity, even its beauty, and by the many correlations found between the morphology of galaxies and their other properties. And one must admit that there is no alternative up to now, even though both the Hubble classification and diagram have been recognised to be unsatisfactory. Among the most obvious flaws of this classification, one must mention its monovariate, qualitative, subjective and old-fashioned nature, as well as the difficulty to characterise the morphology of distant galaxies. The first two most significant multivariate studies were by Watanabe et al. (1985) and Whitmore (1984). Since the year 2005, the number of studies attempting to go beyond the Hubble classification has increased largely. Why, despite of this, the Hubble classification and its sequence are still alive and no alternative have yet emerged (Sandage, 2005)? My feeling is that the results of the multivariate analyses are not easily integrated into a one-century old practice of modeling the observations. In addition, extragalactic objects like galaxies, stellar clusters or stars do evolve. Astronomy now provides data on very distant objects, raising the question of the relationships between those and our present day nearby galaxies. Clearly, this is a phylogenetic problem. Astrocladistics 1 aims at exploring the use of phylogenetic tools in astrophysics (Fraix-Burnet et al., 2006a,b). We have proved that Maximum Parsimony (or cladistics) can be applied in astrophysics and provides a new exploration tool of the data (Fraix-Burnet et al., 2009, 2012, Cardone \& Fraix-Burnet, 2013). As far as the classification of galaxies is concerned, a larger number of objects must now be analysed. In this paper, I

D. Fraix-Burnet
Thu, 2 Jun 16
56/60

Comments: Proceedings of the 60th World Statistics Congress of the International Statistical Institute, ISI2015, Jul 2015, Rio de Janeiro, Brazil

|

# A Selection of Giant Radio Sources from NVSS [GA]

Results of the application of pattern recognition techniques to the problem of identifying Giant Radio Sources (GRS) from the data in the NVSS catalog are presented and issues affecting the process are explored. Decision-tree pattern recognition software was applied to training set source pairs developed from known NVSS large angular size radio galaxies. The full training set consisted of 51,195 source pairs, 48 of which were known GRS for which each lobe was primarily represented by a single catalog component. The source pairs had a maximum separation of 20 arc minutes and a minimum component area of 1.87 square arc minutes at the 1.4 mJy level. The importance of comparing resulting probability distributions of the training and application sets for cases of unknown class ratio is demonstrated. The probability of correctly ranking a randomly selected (GRS, non-GRS) pair from the best of the tested classifiers was determined to be 97.8 +/- 1.5%. The best classifiers were applied to the over 870,000 candidate pairs from the entire catalog. Images of higher ranked sources were visually screened and a table of over sixteen hundred candidates, including morphological annotation, is presented. These systems include doubles and triples, Wide-Angle Tail (WAT) and Narrow-Angle Tail (NAT), S- or Z-shaped systems, and core-jets and resolved cores. While some resolved lobe systems are recovered with this technique, generally it is expected that such systems would require a different approach.

D. Proctor
Wed, 23 Mar 16
34/73

Comments: 20 pages of text, 6 figures, 22 pages tables, total 55 pages. The stub for Table 6 is followed by the complete machine readable file. To be published in The Astrophysical Journal Supplement

# Simple, Fast and Accurate Photometric Estimation of Specific Star Formation Rate [IMA]

Large-scale surveys make huge amounts of photometric data available. Because of the sheer amount of objects, spectral data cannot be obtained for all of them. Therefore it is important to devise techniques for reliably estimating physical properties of objects from photometric information alone. These estimates are needed to automatically identify interesting objects worth a follow-up investigation as well as to produce the required data for a statistical analysis of the space covered by a survey. We argue that machine learning techniques are suitable to compute these estimates accurately and efficiently. This study considers the task of estimating the specific star formation rate (sSFR) of galaxies. It is shown that a nearest neighbours algorithm can produce better sSFR estimates than traditional SED fitting. We show that we can obtain accurate estimates of the sSFR even at high redshifts using only broad-band photometry based on the u, g, r, i and z filters from Sloan Digital Sky Survey (SDSS). We addtionally demonstrate that combining magnitudes estimated with different methods from the same photometry can lead to a further improvement in accuracy. The study highlights the general importance of performing proper model selection to improve the results of machine learning systems and how feature selection can provide insights into the predictive relevance of particular input features. Furthermore, the use of massively parallel computation on graphics processing units (GPUs) for handling large amounts of astronomical data is advocated.

K. Stensbo-Smidt, F. Gieseke, C. Igel, et. al.
Wed, 18 Nov 15
6/61

Comments: 10 pages, 12 figures, 1 table. Submitted to MNRAS

|

# A review of learning vector quantization classifiers [CL]

In this work we present a review of the state of the art of Learning Vector Quantization (LVQ) classifiers. A taxonomy is proposed which integrates the most relevant LVQ approaches to date. The main concepts associated with modern LVQ approaches are defined. A comparison is made among eleven LVQ classifiers using one real-world and two artificial datasets.

D. Nova and P. Estevez
Thu, 24 Sep 15
53/60

# Machine Learning Model of the Swift/BAT Trigger Algorithm for Long GRB Population Studies [HEAP]

To draw inferences about gamma-ray burst (GRB) source populations based on Swift observations, it is essential to understand the detection efficiency of the Swift burst alert telescope (BAT). This study considers the problem of modeling the Swift/BAT triggering algorithm for long GRBs, a computationally expensive procedure, and models it using machine learning algorithms. A large sample of simulated GRBs from Lien 2014 is used to train various models: random forests, boosted decision trees (with AdaBoost), support vector machines, and artificial neural networks. The best models have accuracies of $\gtrsim97\%$ ($\lesssim 3\%$ error), which is a significant improvement on a cut in GRB flux which has an accuracy of $89.6\%$ ($10.4\%$ error). These models are then used to measure the detection efficiency of Swift as a function of redshift $z$, which is used to perform Bayesian parameter estimation on the GRB rate distribution. We find a local GRB rate density of $n_0 \sim 0.48^{+0.41}_{-0.23} \ {\rm Gpc}^{-3} {\rm yr}^{-1}$ with power-law indices of $n_1 \sim 1.7^{+0.6}_{-0.5}$ and $n_2 \sim -5.9^{+5.7}_{-0.1}$ for GRBs above and below a break point of $z_1 \sim 6.8^{+2.8}_{-3.2}$. This methodology is able to improve upon earlier studies by more accurately modeling Swift detection and using this for fully Bayesian model fitting. The code used in this is analysis is publicly available online (https://github.com/PBGraff/SwiftGRB_PEanalysis).

P. Graff, A. Lien, J. Baker, et. al.
Fri, 4 Sep 15
52/58

Comments: 16 pages, 18 figures, 5 tables, submitted to ApJ

# Distinguishing short and long Fermi GRBs [HEAP]

Two classes of GRBs, short and long, have been determined without any doubts, and are usually ascribed to different progenitors, yet these classes overlap for a variety of descriptive parameters. A subsample of 46 long and 22 short $Fermi$ GRBs with estimated Hurst Exponents (HEs), complemented by minimum variability time-scales (MVTS) and durations ($T_{90}$) is used to perform a supervised Machine Learning (ML) and Monte Carlo (MC) simulation using a Support Vector Machine (SVM) algorithm. It is found that while $T_{90}$ itself performs very well in distinguishing short and long GRBs, the overall success ratio is higher when the training set is complemented by MVTS and HE. These results may allow to introduce a new (non-linear) parameter that might provide less ambiguous classification of GRBs.

M. Tarnopolski
Mon, 20 Jul 15
5/52

# Celeste: Variational inference for a generative model of astronomical images [IMA]

We present a new, fully generative model of optical telescope image sets, along with a variational procedure for inference. Each pixel intensity is treated as a Poisson random variable, with a rate parameter dependent on latent properties of stars and galaxies. Key latent properties are themselves random, with scientific prior distributions constructed from large ancillary data sets. We check our approach on synthetic images. We also run it on images from a major sky survey, where it exceeds the performance of the current state-of-the-art method for locating celestial bodies and measuring their colors.

J. Regier, A. Miller, J. McAuliffe, et. al.
Thu, 4 Jun 15
34/60

Comments: in the Proceedings of the 32nd International Conference on Machine Learning (2015)

|

# Removing systematic errors for exoplanet search via latent causes [CL]

We describe a method for removing the effect of confounders in order to reconstruct a latent quantity of interest. The method, referred to as half-sibling regression, is inspired by recent work in causal inference using additive noise models. We provide a theoretical justification and illustrate the potential of the method in a challenging astronomy application.

B. Scholkopf, D. Hogg, D. Wang, et. al.
Wed, 13 May 15
6/69

Comments: Extended version of a paper appearing in the Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015

# Rotation-invariant convolutional neural networks for galaxy morphology prediction [IMA]

Measuring the morphological parameters of galaxies is a key requirement for studying their formation and evolution. Surveys such as the Sloan Digital Sky Survey (SDSS) have resulted in the availability of very large collections of images, which have permitted population-wide analyses of galaxy morphology. Morphological analysis has traditionally been carried out mostly via visual inspection by trained experts, which is time-consuming and does not scale to large ($\gtrsim10^4$) numbers of images.
Although attempts have been made to build automated classification systems, these have not been able to achieve the desired level of accuracy. The Galaxy Zoo project successfully applied a crowdsourcing strategy, inviting online users to classify images by answering a series of questions. Unfortunately, even this approach does not scale well enough to keep up with the increasing availability of galaxy images.
We present a deep neural network model for galaxy morphology classification which exploits translational and rotational symmetry. It was developed in the context of the Galaxy Challenge, an international competition to build the best model for morphology classification based on annotated images from the Galaxy Zoo project.
For images with high agreement among the Galaxy Zoo participants, our model is able to reproduce their consensus with near-perfect accuracy ($> 99\%$) for most questions. Confident model predictions are highly accurate, which makes the model suitable for filtering large collections of images and forwarding challenging images to experts for manual annotation. This approach greatly reduces the experts’ workload without affecting accuracy. The application of these algorithms to larger sets of training data will be critical for analysing results from future surveys such as the LSST.

S. Dieleman, K. Willett and J. Dambre
Wed, 25 Mar 15
30/38

Comments: Accepted for publication in MNRAS. 20 pages, 14 figures

|

# Bayesian Evidence and Model Selection [CL]

In this paper we review the concept of the Bayesian evidence and its application to model selection. The theory is presented along with a discussion of analytic, approximate and numerical techniques. Application to several practical examples within the context of signal processing are discussed.

K. Knuth, M. Habeck, N. Malakar, et. al.
Thu, 13 Nov 14
38/49

Comments: 39 pages, 8 figures. Submitted to DSP. Features theory, numerical methods and four applications

# Estimating the distribution of Galaxy Morphologies on a continuous space [GA]

The incredible variety of galaxy shapes cannot be summarized by human defined discrete classes of shapes without causing a possibly large loss of information. Dictionary learning and sparse coding allow us to reduce the high dimensional space of shapes into a manageable low dimensional continuous vector space. Statistical inference can be done in the reduced space via probability distribution estimation and manifold estimation.

G. Vinci, P. Freeman, J. Newman, et. al.
Tue, 1 Jul 14
57/70

Comments: 4 pages, 3 figures, Statistical Challenges in 21st Century Cosmology, Proceedings IAU Symposium No. 306, 2014

# Hellinger Distance Trees for Imbalanced Streams [CL]

Classifiers trained on data sets possessing an imbalanced class distribution are known to exhibit poor generalisation performance. This is known as the imbalanced learning problem. The problem becomes particularly acute when we consider incremental classifiers operating on imbalanced data streams, especially when the learning objective is rare class identification. As accuracy may provide a misleading impression of performance on imbalanced data, existing stream classifiers based on accuracy can suffer poor minority class performance on imbalanced streams, with the result being low minority class recall rates. In this paper we address this deficiency by proposing the use of the Hellinger distance measure, as a very fast decision tree split criterion. We demonstrate that by using Hellinger a statistically significant improvement in recall rates on imbalanced data streams can be achieved, with an acceptable increase in the false positive rate.

R. Lyon, J. Brooke, J. Knowles, et. al.
Mon, 12 May 14
22/40

Comments: 6 Pages, 2 figures, to be published in Proceedings 22nd International Conference on Pattern Recognition (ICPR) 2014

# Bayesian Source Separation Applied to Identifying Complex Organic Molecules in Space [IMA]

Emission from a class of benzene-based molecules known as Polycyclic Aromatic Hydrocarbons (PAHs) dominates the infrared spectrum of star-forming regions. The observed emission appears to arise from the combined emission of numerous PAH species, each with its unique spectrum. Linear superposition of the PAH spectra identifies this problem as a source separation problem. It is, however, of a formidable class of source separation problems given that different PAH sources potentially number in the hundreds, even thousands, and there is only one measured spectral signal for a given astrophysical site. Fortunately, the source spectra of the PAHs are known, but the signal is also contaminated by other spectral sources. We describe our ongoing work in developing Bayesian source separation techniques relying on nested sampling in conjunction with an ON/OFF mechanism enabling simultaneous estimation of the probability that a particular PAH species is present and its contribution to the spectrum.

K. Knuth, M. Tse, J. Choinsky, et. al.
Thu, 20 Mar 14
5/51

|

# SOMz: photometric redshift PDFs with self organizing maps and random atlas [IMA]

In this paper we explore the applicability of the unsupervised machine learning technique of Self Organizing Maps (SOM) to estimate galaxy photometric redshift probability density functions (PDFs). This technique takes a spectroscopic training set, and maps the photometric attributes, but not the redshifts, to a two dimensional surface by using a process of competitive learning where neurons compete to more closely resemble the training data multidimensional space. The key feature of a SOM is that it retains the topology of the input set, revealing correlations between the attributes that are not easily identified. We test three different 2D topological mapping: rectangular, hexagonal, and spherical, by using data from the DEEP2 survey. We also explore different implementations and boundary conditions on the map and also introduce the idea of a random atlas where a large number of different maps are created and their individual predictions are aggregated to produce a more robust photometric redshift PDF. We also introduced a new metric, the $I$-score, which efficiently incorporates different metrics, making it easier to compare different results (from different parameters or different photometric redshift codes). We find that by using a spherical topology mapping we obtain a better representation of the underlying multidimensional topology, which provides more accurate results that are comparable to other, state-of-the-art machine learning algorithms. Our results illustrate that unsupervised approaches have great potential for many astronomical problems, and in particular for the computation of photometric redshifts.

Mon, 23 Dec 13
31/48

|

# On the art and theory of self-calilbration [IMA]

Calibration is the process of inferring how much measured data depend on the signal one is interested in. It is essential for any quantitative signal estimation on the basis of the data. Here, we investigate the “art” of self-calibration that augments an external calibration solution using a known reference signal with an internal calibration on the unknown measurement signal itself. Contemporary self-calibration schemes try to find a self-consistent solution for signal and calibration. This can be understood in terms of maximizing their joint probability. Thus, the full uncertainty structure of this probability around its maximum is not taken into account by these schemes. Therefore better schemes — in sense of minimal square error — can be designed that also reflect the uncertainties of signal and calibration reconstructions. We argue that at least the signal uncertainty should not be neglected in typical measurement situations, since the calibration solutions suffer from a systematic bias otherwise, which consequently distorts the signal reconstruction. Furthermore, we argue that non-parametric, signal-to-noise filtered calibration should provide more accurate reconstructions than the common bin averages and provide a new, improved self-calibration scheme. We illustrate our findings with a simplistic numerical example.

Fri, 6 Dec 13
27/55

|

# Automatic Classification of Variable Stars in Catalogs with missing data [IMA]

We present an automatic classification method for astronomical catalogs with missing data. We use Bayesian networks, a probabilistic graphical model, that allows us to perform inference to pre- dict missing values given observed data and dependency relationships between variables. To learn a Bayesian network from incomplete data, we use an iterative algorithm that utilises sampling methods and expectation maximization to estimate the distributions and probabilistic dependencies of variables from data with missing values. To test our model we use three catalogs with missing data (SAGE, 2MASS and UBVI) and one complete catalog (MACHO). We examine how classification accuracy changes when information from missing data catalogs is included, how our method compares to traditional missing data approaches and at what computational cost. Integrating these catalogs with missing data we find that classification of variable objects improves by few percent and by 15% for quasar detection while keeping the computational cost the same.

Date added: Wed, 30 Oct 13

|