The COsmostatistics INitiative ([COIN](https://asaip.psu.edu/organizations/iaa/iaa-working-group-of-cosmostatistics/)), a working group built within the International Astrostatistics Association ([IAA](http://iaa.mi.oa-brera.inaf.it/IAA/home.html)), aims to create a friendly environment where hands-on collaboration between astronomers, cosmologists, statisticians and machine learning experts can flourish. COIN is designed to promote the development of a new family of tools for data exploration in cosmology.
AMADA allows iterative exploration of, and information retrieval from, high-dimensional data sets. It does this by performing a hierarchical clustering analysis for different choices of correlation matrix and a principal component analysis on the original data. AMADA also provides a set of modern visualization and data-mining diagnostics; the user can switch between them using the different tabs.
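The two steps described above can be sketched in a few lines. This is only an illustration on synthetic data, not AMADA's own code: the helper names `cluster_variables` and `pca`, the average-linkage choice, and the `1 - |correlation|` distance are all assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical catalogue: 200 objects, 6 variables driven by 2 latent factors.
latent = rng.normal(size=(200, 2))
data = np.repeat(latent, 3, axis=1) + 0.1 * rng.normal(size=(200, 6))


def cluster_variables(corr, n_groups):
    """Group variables by average-linkage clustering on 1 - |correlation|."""
    dist = 1.0 - np.abs(corr)
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=n_groups, criterion="maxclust")


def pca(values, n_components):
    """Project centred data onto its leading principal components."""
    centered = values - values.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return centered @ vt[:n_components].T, explained[:n_components]


# Repeat the clustering for two different choices of correlation matrix.
pearson = np.corrcoef(data, rowvar=False)
spearman, _ = spearmanr(data)
groups_pearson = cluster_variables(pearson, n_groups=2)
groups_spearman = cluster_variables(spearman, n_groups=2)

scores, explained = pca(data, n_components=2)
print(groups_pearson, groups_spearman, explained.round(3))
```

Comparing the groupings obtained from different correlation matrices is one way to check whether the recovered structure depends on the correlation measure.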
Link to ADS Package Web App
Approximate Bayesian Computation (ABC) enables the statistical analysis of stochastic models for complex physical systems in cases where the true likelihood function is unknown, unavailable, or computationally expensive. ABC relies on the forward simulation of mock data rather than the specification of a likelihood function. The CosmoABC code was originally designed for cosmological parameter inference from galaxy cluster number counts based on Sunyaev-Zel’dovich measurements. Nevertheless, users can easily combine the ABC sampler with their own simulator, and test custom prior distributions, summary statistics and distance functions.
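To make the idea of replacing the likelihood with forward simulation concrete, here is a minimal rejection-ABC sketch. It is the simplest ABC variant and is not meant to reproduce the sampler that CosmoABC implements; the function `abc_rejection`, the toy Poisson simulator, the uniform prior, and all numerical settings are assumptions chosen for illustration. The simulator, prior, summary statistic and distance are passed in as plain functions, mirroring the pluggable pieces mentioned above.

```python
import numpy as np


def abc_rejection(observed, simulator, prior_sampler, summary, distance,
                  n_draws, keep_fraction, rng):
    """Keep the prior draws whose simulated summaries lie closest to the data."""
    obs_summary = summary(observed)
    thetas = np.array([prior_sampler(rng) for _ in range(n_draws)])
    dists = np.array([
        distance(summary(simulator(theta, rng)), obs_summary)
        for theta in thetas
    ])
    threshold = np.quantile(dists, keep_fraction)
    return thetas[dists <= threshold]


rng = np.random.default_rng(1)

# Toy problem (hypothetical): infer the mean of Poisson-distributed counts.
observed = rng.poisson(12.0, size=50)

posterior = abc_rejection(
    observed,
    simulator=lambda theta, r: r.poisson(theta, size=50),  # forward model
    prior_sampler=lambda r: r.uniform(0.0, 30.0),          # flat prior on the rate
    summary=np.mean,                                       # summary statistic
    distance=lambda a, b: abs(a - b),                      # distance function
    n_draws=5000,
    keep_fraction=0.02,
    rng=rng,
)
print(len(posterior), posterior.mean(), posterior.std())
```

The accepted draws approximate the posterior; shrinking `keep_fraction` tightens the approximation at the cost of more simulations per accepted sample.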
Link to ADS Tutorial Package
DRACULA classifies objects using dimensionality reduction and clustering. The code has an easy-to-use interface and can be applied to separate several types of objects. It is based on tools developed in scikit-learn; the Deep Learning functionality additionally requires the H2O package.
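The general pattern of dimensionality reduction followed by clustering can be sketched directly with scikit-learn. This does not use DRACULA's interface; the synthetic features, the choice of PCA and k-means, and the number of components and clusters are all assumptions for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Hypothetical data: 3 object types, 100 objects each, 10 features per object.
centers = np.zeros((3, 10))
centers[0, 0] = 8.0
centers[1, 1] = 8.0
centers[2, 0] = -8.0
features = np.vstack([c + rng.normal(size=(100, 10)) for c in centers])
true_labels = np.repeat([0, 1, 2], 100)

# Step 1: reduce the 10-dimensional feature space to 2 components.
reduced = PCA(n_components=2).fit_transform(features)

# Step 2: cluster the objects in the reduced space.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)

for group in range(3):
    print(group, np.bincount(labels[true_labels == group], minlength=3))
```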
Link to ADS Package
Statistical methods play a central role in fully exploiting astronomical catalogues, and efficient data analysis requires astronomers to go beyond traditional Gaussian-based models. This project illustrates the power of generalized linear models (GLMs) for the astronomical community, from a Bayesian perspective. Applications include modelling star formation activity (logistic regression), globular cluster populations (negative binomial regression), photometric redshifts (gamma regression), exoplanet multiplicity (Poisson regression), and so forth.
Suited to binary or proportional data, also called presence/absence data. Examples include AGN activity, star-galaxy separation, fraction of bars in a galaxy, escape fraction, etc.
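As an illustration of regression for binary data, the sketch below fits a logistic model to simulated 0/1 outcomes. It uses a plain maximum-likelihood fit via Newton-Raphson rather than the Bayesian treatment this project advocates; the function `fit_logistic`, the single predictor, and the coefficient values are assumptions made for the example.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic regression via Newton-Raphson updates."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))            # predicted probabilities
        grad = X1.T @ (y - p)                           # score vector
        hess = X1.T @ (X1 * (p * (1.0 - p))[:, None])   # Fisher information
        beta = beta + np.linalg.solve(hess, grad)
    return beta


rng = np.random.default_rng(3)

# Hypothetical predictor (e.g. a standardized galaxy property) and 0/1 outcome.
x = rng.normal(size=2000)
true_beta = np.array([-0.5, 1.5])
p_true = 1.0 / (1.0 + np.exp(-(true_beta[0] + true_beta[1] * x)))
y = rng.binomial(1, p_true)

beta_hat = fit_logistic(x[:, None], y)
print(beta_hat)
```

The fitted probabilities stay between 0 and 1 by construction, which an ordinary least-squares line through the 0/1 outcomes would not guarantee.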
Link to ADS
Suited to non-negative continuous variables, such as photometric redshifts, star formation rates, and galaxy masses. The method naturally accounts for heteroskedasticity (non-constant variability).
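One way to see where the heteroskedasticity comes from is to write the model down. The block below is a generic gamma GLM in the mean/shape parametrization; the log link and the symbols $\mu_i$, $\phi$, $\mathbf{x}_i$ and $\boldsymbol{\beta}$ are assumptions for illustration, not necessarily the exact form used in the project.

```latex
\begin{align}
  y_i &\sim \mathrm{Gamma}\!\left(\text{shape}=\phi,\ \text{rate}=\phi/\mu_i\right), \\
  \log \mu_i &= \mathbf{x}_i^{\top}\boldsymbol{\beta}, \\
  \mathrm{E}[y_i] &= \mu_i, \qquad \mathrm{Var}[y_i] = \frac{\mu_i^{2}}{\phi}.
\end{align}
```

Because the variance grows with the square of the mean, the scatter is allowed to increase with the predicted value instead of being held constant as in a Gaussian model.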
Link to ADS Tutorial Package Web App
Suited to non-negative discrete variables, such as the number of exoplanets, globular cluster populations, the richness of galaxy clusters, etc.
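For count data, a quick diagnostic that helps decide between a Poisson and a negative binomial model is the variance-to-mean ratio. The simulation below is a sketch with made-up parameters (`mu`, `k`, `n` are assumptions); it draws negative binomial counts as a gamma-Poisson mixture and compares their dispersion with that of Poisson counts of the same mean.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, k, n = 20.0, 2.0, 20000  # mean, NB shape parameter, sample size (all hypothetical)

# Poisson counts: variance equals the mean.
poisson_counts = rng.poisson(mu, size=n)

# Negative binomial counts as a gamma-Poisson mixture:
# rate ~ Gamma(shape=k, scale=mu/k), count ~ Poisson(rate),
# which gives variance mu + mu**2 / k.
rates = rng.gamma(shape=k, scale=mu / k, size=n)
nb_counts = rng.poisson(rates)


def dispersion(counts):
    """Sample variance divided by sample mean (1 for an ideal Poisson sample)."""
    return counts.var(ddof=1) / counts.mean()


print(dispersion(poisson_counts), dispersion(nb_counts), 1.0 + mu / k)
```

A ratio well above 1 signals overdispersion, the situation in which a negative binomial model is usually preferred over a Poisson one.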
Link to ADS
This work employs a Gaussian mixture model (GMM) to jointly analyse two traditional emission-line classification schemes of galaxy ionization sources: the Baldwin-Phillips-Terlevich (BPT) and W_H-alpha vs. [NII]/H-alpha (WHAN) diagrams, using spectroscopic data from the Sloan Digital Sky Survey Data Release 7 and SEAGal/STARLIGHT datasets. We apply a GMM to empirically define classes of galaxies in a three-dimensional space spanned by the log [OIII]/H-beta, log [NII]/H-alpha, and log EW(H-alpha) optical parameters. The best-fit GMM based on several statistical criteria consists of four Gaussian components (GCs), which are capable of explaining up to 97 per cent of the data variance. Using elements of information theory, we compare each GC to its respective astronomical counterpart. GC1 and GC4 are associated with star-forming galaxies, suggesting the need to define a new starburst subgroup. GC2 is associated with BPT's Active Galactic Nuclei (AGN) class and WHAN's weak AGN class. GC3 is associated with BPT's composite class and WHAN's strong AGN class. Conversely, there is no statistical evidence, based on GMMs, for the existence of a Seyfert/LINER dichotomy in our sample. We demonstrate the potential of our methodology to recover and unravel different objects inside the wilderness of astronomical datasets while still conveying physically interpretable results, making it a valuable tool for the maximum exploitation of ever-growing astronomical surveys.
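The model-selection step described above can be sketched with scikit-learn's `GaussianMixture`. The example runs on synthetic three-dimensional points standing in for the three line-ratio axes, and it uses the Bayesian information criterion (BIC) as a single selection criterion; the data, the component means, and the use of BIC alone are assumptions for illustration and do not reproduce the analysis of the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)

# Synthetic stand-in for a 3-D parameter space (values are made up).
means = np.array([
    [-0.5, -0.6, 1.5],
    [0.3, -0.1, 0.5],
    [0.8, 0.2, -0.3],
])
points = np.vstack([m + 0.1 * rng.normal(size=(300, 3)) for m in means])

# Fit mixtures with 1 to 5 components and record the BIC of each fit.
bics = {}
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, covariance_type="full",
                          n_init=5, random_state=0).fit(points)
    bics[k] = gmm.bic(points)

best_k = min(bics, key=bics.get)  # lower BIC is better
best = GaussianMixture(n_components=best_k, covariance_type="full",
                       n_init=5, random_state=0).fit(points)
labels = best.predict(points)                   # hard class assignment
responsibilities = best.predict_proba(points)   # soft membership probabilities
print(best_k, np.bincount(labels))
```

The soft membership probabilities are what make a mixture model useful for objects that sit between classes, since each object receives a probability of belonging to every component rather than a single label.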
Link to ADS Catalogue Tutorial
We developed a hierarchical Bayesian model (HBM) to investigate how the presence of Seyfert activity in galaxies relates to their environment, herein represented by the galaxy cluster mass, M200, and the normalized cluster-centric distance, r/r200. We achieved this by constructing an unbiased sample of galaxies from the Sloan Digital Sky Survey, with morphological classifications provided by the Galaxy Zoo Project. A propensity score matching approach is introduced to control for the effects of confounding variables: stellar mass, galaxy colour, and star formation rate. The connection between Seyfert activity and environmental properties in the de-biased sample is modelled within an HBM framework using logistic regression, a technique suitable for the analysis of binary data (e.g., whether or not a galaxy hosts an AGN). Unlike standard ordinary least squares fitting methods, our methodology naturally allows modelling the probability of Seyfert-AGN activity in galaxies on its natural scale, i.e. as a binary variable. Furthermore, we demonstrate how an HBM can incorporate information on each galaxy morphological type in a unified framework. In elliptical galaxies, our analysis indicates a strong correlation of Seyfert-AGN activity with r/r200, and a weaker correlation with the mass of the host. In spiral galaxies these trends do not appear, suggesting that the link between Seyfert activity and the properties of spiral galaxies is independent of the environment.
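A hierarchical logistic model of the kind described above might be written as follows. This is a sketch only: the index $j[i]$ for the morphological type of galaxy $i$, the use of $\log M_{200}$ as predictor, and the normal hyperpriors with parameters $\mu_m$ and $\sigma_m$ are assumptions made for illustration, and the published model may be parametrized differently.

```latex
\begin{align}
  y_i &\sim \mathrm{Bernoulli}(p_i), \\
  \mathrm{logit}(p_i) &= \beta_{0,j[i]}
      + \beta_{1,j[i]} \log M_{200,i}
      + \beta_{2,j[i]} \left(\frac{r}{r_{200}}\right)_{i}, \\
  \beta_{m,j} &\sim \mathcal{N}\!\left(\mu_m, \sigma_m^{2}\right),
      \qquad m \in \{0, 1, 2\},\quad j \in \{\text{elliptical}, \text{spiral}\}.
\end{align}
```

Here $y_i = 1$ if galaxy $i$ hosts a Seyfert AGN and $0$ otherwise. Sharing the hyperparameters across morphological types is what lets each type have its own coefficients while still being fitted within one framework.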
Link to ADS
We present two galaxy catalogues built to enable a more demanding and realistic test of photo-z methods. Using photometry from the Sloan Digital Sky Survey and spectroscopy from a collection of sources, we constructed datasets which mimic the biases between the underlying probability distributions of the real spectroscopic and photometric samples while also possessing spectroscopic measurements. We demonstrate the potential of these catalogues by submitting them to the scrutiny of different photo-z methods, including machine learning (ML) and template fitting approaches. We were able to recognize the superiority of global models in cases with incomplete coverage in feature space, and the general failure across all types of methods when incomplete coverage is combined with the presence of photometric errors, a data situation which photo-z methods have not been trained to deal with up to now and which must be addressed by future large-scale surveys. Our catalogues represent the first controlled environment allowing a straightforward implementation of such tests.
Link to ADS Catalogues
CRP #1: August/2014 - Lisbon, Portugal
CRP #2: October/2015 - Isle of Wight, UK
CRP #3: August/2016 - Budapest, Hungary
CRP #4: August/2017 - Clermont-Ferrand, France
Alberto Krone-Martins, Arlindo Trindade, Bart Buelens, Chieh-An Lin, Emille Ishida, Eric Feigelson, Fabian Gieseke, Hugo Camacho, Jonny Elliott, Joseph Hilbe, Madhura Killedar, Marcos Vinicius Costa Duarte, Maria Luiza Dantas, Mariana Penna-Lima, Michel Aguena, Michele Sasdelli, Miguel de Val-Borro, Mohammad Hattab, Pierre-Yves Lablanche, Rafael S. de Souza, Ricardo Vilalta, Robert Beck, Rodolfo Smiljanic, Sandro Vitenti, Vinicius C. Busti, Yabebal Fantaye