Time Series Cube Data Model [CL]


The purpose of this document is to define a data model, and a serialization of it, for expressing generic time series data. Existing IVOA data models are reused as much as possible. The model is also made as generic as possible, so that it is open for extension but closed for modification; this enables maintaining interoperability across different versions of the data model. We define the necessary building blocks for metadata discovery, for the serialization of time series data, and for its interpretation by clients. We present several categories of time series science cases with examples of implementation. We also address the most pressing topics for time series providers, such as tracking the original images behind every individual point of a light curve, or time-derived axes such as frequency for gravitational wave analysis. The main motivation for creating a new model is to provide a unified time series data publishing standard – not only for light curves but also for more generic time series data, e.g., radial velocity curves, power spectra, hardness ratios, provenance linkage, etc. Flexibility is the most crucial part of our model – it does not depend on any physical domain or frame models. While images and spectra are already stable and standardized products, the time-series-related domains have not yet fully evolved, and new ones will likely emerge in the near future. That is why models like the Time Series Cube DM must be kept independent of any underlying physical models. In our opinion, this is the only correct and sustainable way to develop future IVOA standards.

Read this paper on arXiv…

J. Nadvornik, P. Skoda, D. Morris, et al.
Tue, 7 Feb 17

Comments: 27 pages, 17 figures

Photo-z-SQL: integrated, flexible photometric redshift computation in a database [GA]


We present a flexible template-based photometric redshift estimation framework, implemented in C#, that can be seamlessly integrated into a SQL database server and executed on demand in SQL. The database integration eliminates the need to move large photometric datasets outside the database for redshift estimation, and exploits the computational capabilities of the database hardware. The code can perform both maximum likelihood and Bayesian estimation, and can handle inputs with variable photometric filter sets and corresponding broad-band magnitudes. It is possible to take into account the full covariance matrix between filters, and filter zero points can be empirically calibrated using measurements with known redshifts. The list of spectral templates and the prior can be specified flexibly, and the expensive synthetic magnitude computations are done via lazy evaluation coupled with caching of the results. Parallel execution is fully supported. For large upcoming photometric surveys such as the LSST, the ability to perform in-place photo-z calculation would be a significant advantage. Likewise, the efficient handling of variable filter sets is a necessity for heterogeneous databases, for example the Hubble Source Catalog, and for cross-match services such as SkyQuery. We illustrate the performance of our code on two reference photo-z datasets, PHAT and CAPR/CANDELS. The code is available for download at https://github.com/beckrob/Photo-z-SQL.
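The maximum-likelihood branch of template fitting can be sketched compactly: for each template at each redshift, solve for the best-fit amplitude in closed form and keep the grid point with the lowest chi-square. This is a minimal illustration assuming precomputed synthetic fluxes on a template-by-redshift grid; the function name and array layout are ours, not the paper's API, and the real code additionally handles filter covariances, priors and zero-point calibration.

```python
import numpy as np

def photo_z_ml(obs_flux, obs_err, template_fluxes, z_grid):
    """Maximum-likelihood photo-z: minimize chi^2 over a template/redshift grid.

    obs_flux, obs_err -- observed broad-band fluxes and errors, shape (n_bands,)
    template_fluxes   -- synthetic fluxes per (template, redshift, band),
                         shape (n_templates, n_z, n_bands)
    z_grid            -- redshift grid, shape (n_z,)
    """
    w = 1.0 / obs_err**2
    # Closed-form best-fit amplitude per (template, z): a = sum(w f t) / sum(w t^2)
    num = np.sum(w * obs_flux * template_fluxes, axis=-1)
    den = np.sum(w * template_fluxes**2, axis=-1)
    amp = num / den
    # Chi-square at the best amplitude, per (template, z)
    chi2 = np.sum(w * (obs_flux - amp[..., None] * template_fluxes)**2, axis=-1)
    t_best, z_best = np.unravel_index(np.argmin(chi2), chi2.shape)
    return z_grid[z_best], t_best, chi2[t_best, z_best]
```

Because the synthetic `template_fluxes` grid is the expensive part, it is the natural target for the lazy evaluation and result caching the abstract mentions.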

Read this paper on arXiv…

R. Beck, L. Dobos, T. Budavari, et al.
Tue, 8 Nov 16

Comments: 11 pages, 4 figures. Submitted to Astronomy & Computing on 2016 November 04

The Footprint Database and Web Services of the Herschel Space Observatory [IMA]


Data from the Herschel Space Observatory is freely available to the public, but no uniformly processed catalogue of the observations has been published so far. To date, the Herschel Science Archive does not contain the exact sky coverage (footprint) of individual observations and supports searches based on bounding circles only. Drawing on previous experience in implementing footprint databases, we built the Herschel Footprint Database and Web Services for the Herschel Space Observatory to provide efficient search capabilities for typical astronomical queries. The database was designed with the following main goals in mind: (a) provide a unified data model for the meta-data of all instruments and observational modes, (b) quickly find observations covering a selected object and its neighbourhood, (c) quickly find every observation in a larger area of the sky, (d) allow for finding solar system objects crossing observation fields. As a first step, we developed a unified data model of observations of all three Herschel instruments for all pointing and instrument modes. Then, using telescope pointing information and observational meta-data, we compiled a database of footprints. As opposed to methods using pixellation of the sphere, we represent sky coverage in an exact geometric form, allowing for precise area calculations. For easier handling of Herschel observation footprints, which have rather complex shapes, two algorithms were implemented to simplify their outlines. Furthermore, a new visualisation tool to plot footprints in various spherical projections was developed. Indexing of the footprints using a Hierarchical Triangular Mesh makes it possible to quickly find observations based on sky coverage, time and meta-data. The database is accessible via a web site (this http URL) and also as a set of REST web service functions.
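The core of an exact-geometry containment test can be illustrated for the simplest case: a convex spherical polygon, where a point is inside if and only if it lies on the inner side of every edge's great circle. This is a toy sketch under that convexity assumption (the function names are ours); the actual footprints have far more complex outlines and are additionally indexed with a Hierarchical Triangular Mesh for fast candidate selection.

```python
import numpy as np

def radec_to_vec(ra_deg, dec_deg):
    """Convert RA/Dec in degrees to a unit vector on the sphere."""
    ra, dec = np.radians(ra_deg), np.radians(dec_deg)
    return np.array([np.cos(dec) * np.cos(ra),
                     np.cos(dec) * np.sin(ra),
                     np.sin(dec)])

def in_convex_footprint(point, vertices):
    """Exact containment test for a convex spherical polygon.

    point    -- unit vector of the query position
    vertices -- list of unit vectors in counter-clockwise order
    The cross product of two consecutive vertices is the normal of the
    edge's great circle; a positive dot product with the point means the
    point is on the inner side of that edge.
    """
    n = len(vertices)
    for i in range(n):
        edge_normal = np.cross(vertices[i], vertices[(i + 1) % n])
        if np.dot(point, edge_normal) < 0:
            return False
    return True
```

Unlike pixellation schemes, this kind of test introduces no discretization error, which is what makes the precise area calculations mentioned above possible.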

Read this paper on arXiv…

L. Dobos, E. Varga-Verebelyi, E. Verdugo, et al.
Tue, 14 Jun 16

Comments: Accepted for publication in Experimental Astronomy

Real-Time Data Mining of Massive Data Streams from Synoptic Sky Surveys [IMA]


The nature of scientific and technological data collection is evolving rapidly: data volumes and rates grow exponentially, with increasing complexity and information content, and there has been a transition from static data sets to data streams that must be analyzed in real time. Interesting or anomalous phenomena must be quickly characterized and followed up with additional measurements via optimal deployment of limited assets. Modern astronomy presents a variety of such phenomena in the form of transient events in digital synoptic sky surveys, including cosmic explosions (supernovae, gamma-ray bursts), relativistic phenomena (black hole formation, jets), potentially hazardous asteroids, etc. We have been developing a set of machine learning tools to detect, classify and plan a response to transient events for astronomy applications, using the Catalina Real-time Transient Survey (CRTS) as a scientific and methodological testbed. The ability to respond rapidly to the potentially most interesting events is a key bottleneck that limits the scientific returns from the current and anticipated synoptic sky surveys. Similar challenges arise in other contexts, from environmental monitoring using sensor networks to autonomous spacecraft systems. Given the exponential growth of data rates and the need for a time-critical response, a fully automated and robust approach is required. We describe the results obtained to date, and possible future developments.

Read this paper on arXiv…

S. Djorgovski, M. Graham, C. Donalek, et al.
Tue, 19 Jan 16

Comments: 14 pages, an invited paper for a special issue of Future Generation Computer Systems, Elsevier Publ. (2015). This is an expanded version of a paper arXiv:1407.3502 presented at the IEEE e-Science 2014 conf., with some new content

Cross-matching Engine for Incremental Photometric Sky Survey [CL]


For light curve generation, a pre-planned photometric survey is currently required, in which all exposure coordinates are fixed in advance and do not change during the survey. This thesis shows that this is not necessary: light curves can be data-mined from astronomical data that was never intended for this purpose. With this approach, all of the world's photometric surveys can be recycled to generate light curves of the objects they observed.
This thesis mainly addresses the catalogue generation process needed for creating the light curves. In practice, it focuses on one of the most important problems in astroinformatics: clustering data volumes at Big Data scale, where most traditional techniques falter. We consider a wide variety of possible solutions from the viewpoints of performance, scalability, distributability, etc. We define criteria for time and memory complexity, which we evaluate for all of the tested solutions. Furthermore, we establish quality standards that we also take into account when evaluating the results.
We use relational databases as the starting point of our implementation and compare them with the newest technologies potentially applicable to our problem: NoSQL array databases, and offloading the heavy clustering computations to supercomputers using parallelism.
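The clustering step at the heart of catalogue generation is essentially an incremental cross-match: each new detection must be assigned to an existing source or start a new one. A common way to avoid the quadratic all-pairs comparison is spatial cell hashing. The following is a toy sketch of that idea using a flat-sky small-angle approximation (valid for small match radii away from the poles); the names are illustrative and this is not the thesis's actual engine.

```python
import math
from collections import defaultdict

def crossmatch(catalog, detections, radius_deg):
    """Match detections to catalog sources via cell hashing.

    catalog, detections -- lists of (ra, dec) in degrees
    Returns a list of (detection_index, catalog_index or None).
    """
    cell = radius_deg  # cell size = radius, so any match lies in the 3x3 neighbourhood
    grid = defaultdict(list)
    for j, (ra, dec) in enumerate(catalog):
        grid[(int(ra // cell), int(dec // cell))].append(j)
    matches = []
    for i, (ra, dec) in enumerate(detections):
        cx, cy = int(ra // cell), int(dec // cell)
        best, best_d = None, radius_deg
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid[(cx + dx, cy + dy)]:
                    cra, cdec = catalog[j]
                    # Small-angle separation, with the cos(dec) metric correction
                    d = math.hypot((ra - cra) * math.cos(math.radians(dec)),
                                   dec - cdec)
                    if d < best_d:
                        best, best_d = j, d
        matches.append((i, best))
    return matches
```

The hashing keeps each detection's candidate set tiny regardless of catalogue size, which is what makes the approach scale; production systems replace the flat grid with a proper spherical index.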

Read this paper on arXiv…

I. Nadvornik
Thu, 25 Jun 15

Comments: 57 pages, 36 figures

Building an Archive with Saada [IMA]


Saada transforms a set of heterogeneous FITS files or VOTables of various categories (images, tables, spectra, …) into a database without writing code. Databases created with Saada come with a rich Web interface and an Application Programming Interface (API). They support the four most common VO services. Such databases can mix various categories of data in multiple collections. They allow direct access to the original data while providing a homogeneous view, thanks to an internal data model compatible with the characterization axes defined by the VO. The data collections can be bound to each other with persistent links, creating relevant browsing paths and allowing data-mining-oriented queries.

Read this paper on arXiv…

L. Michel, C. Motch, H. Nguyen, et al.
Tue, 2 Sep 14

Comments: 18 pages, 5 figures. Special VO issue

D4M 2.0 Schema: A General Purpose High Performance Schema for the Accumulo Database [CL]


Non-traditional, relaxed-consistency, triple store databases are the backbone of many web companies (e.g., Google Big Table, Amazon Dynamo, and Facebook Cassandra). The Apache Accumulo database is a high performance open source relaxed-consistency database that is widely used for government applications. Obtaining the full benefits of Accumulo requires using novel schemas. The Dynamic Distributed Dimensional Data Model (D4M) [this http URL] provides a uniform mathematical framework based on associative arrays that encompasses both traditional (i.e., SQL) and non-traditional databases. For non-traditional databases, D4M naturally leads to a general-purpose schema that can be used to fully index and rapidly query every unique string in a dataset. The D4M 2.0 Schema has been applied with little or no customization to cyber, bioinformatics, scientific citation, free text, and social media data. The D4M 2.0 Schema is simple, requires minimal parsing, and achieves the highest published Accumulo ingest rates. The benefits of the D4M 2.0 Schema are independent of the D4M interface; any interface to Accumulo can achieve these benefits by using the D4M 2.0 Schema.
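The "index every unique string" idea can be illustrated with an exploded key/value layout: each column/value pair becomes a column key of the form `col|val` with entry 1, and a transpose table makes lookups by any value fast. This is a toy sketch using Python dicts in place of Accumulo tables; the record and column names are illustrative, though the `|` separator is the convention the D4M schema papers describe.

```python
from collections import defaultdict

def explode(records):
    """Build a D4M-style exploded table plus its transpose.

    records -- {row_id: {column: value}}
    Returns (table, transpose):
      table     -- row_id -> {"col|val": 1}
      transpose -- "col|val" -> {row_id: 1}, for fast value lookups
    """
    table = defaultdict(dict)
    transpose = defaultdict(dict)
    for row, cols in records.items():
        for col, val in cols.items():
            key = f"{col}|{val}"
            table[row][key] = 1
            transpose[key][row] = 1
    return table, transpose
```

With the transpose in place, "find every record where src is a" is a single lookup of the key `src|a` rather than a scan, which is why the schema needs minimal parsing at query time.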

Read this paper on arXiv…

J. Kepner, C. Anderson, W. Arcand, et al.
Wed, 16 Jul 14

Comments: 6 pages; IEEE HPEC 2013