-
Classification of colorectal primer carcinoma from normal colon with mid-infrared spectra
Authors:
B. Borkovits,
E. Kontsek,
A. Pesti,
P. Gordon,
S. Gergely,
I. Csabai,
A. Kiss,
P. Pollner
Abstract:
In this project, we used formalin-fixed paraffin-embedded (FFPE) tissue samples to measure thousands of spectra per tissue core with Fourier transform mid-infrared spectroscopy using an FT-IR imaging system. The cores were either normal colon (NC) or colorectal primer carcinoma (CRC) tissue. We created a database to manage all the multivariate data obtained from the measurements. Then, we applied classifier algorithms to identify the tissue based on its spectra. For classification, we used random forest, support vector machine, XGBoost, and linear discriminant analysis methods, as well as three deep neural networks. We compared two data manipulation techniques using these models and then applied filtering. In the end, we compared model performances via the sum of ranking differences (SRD).
Submitted 22 March, 2024;
originally announced May 2024.
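The abstract above ranks models via the sum of ranking differences (SRD). The following is a minimal sketch of the generic SRD calculation, not the authors' pipeline: rows are evaluation cases, columns are models, and the reference ranking is assumed here to be the row-wise mean. The score matrix, the function names `ranks` and `srd`, and the choice of reference are all illustrative assumptions, and ties are not handled.

```python
import numpy as np

def ranks(values):
    """Return 1-based ranks of a 1D array (ties are not treated specially)."""
    order = np.argsort(values, kind="stable")
    r = np.empty(len(values), dtype=int)
    r[order] = np.arange(1, len(values) + 1)
    return r

def srd(scores, reference=None):
    """Sum of absolute rank differences between each column and a reference.

    scores: 2D array, rows = evaluation cases, columns = models.
    reference: 1D array of reference values; the row-wise mean is used
    when none is given (an assumption made for this sketch).
    """
    scores = np.asarray(scores, dtype=float)
    if reference is None:
        reference = scores.mean(axis=1)
    ref_rank = ranks(reference)
    return np.array(
        [np.abs(ranks(scores[:, j]) - ref_rank).sum() for j in range(scores.shape[1])]
    )

# Made-up scores for three hypothetical models on four evaluation cases.
toy_scores = np.array([
    [1.0, 1.1, 4.0],
    [2.0, 2.1, 3.0],
    [3.0, 2.9, 2.0],
    [4.0, 4.2, 1.0],
])
srd_values = srd(toy_scores)
print(srd_values)
```

A smaller SRD means that a model orders the cases more like the reference does; in this toy matrix the third column ranks the cases in the opposite order and therefore gets the largest value.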
-
The CMB lensing imprint of cosmic voids detected in the WISE-Pan-STARRS luminous red galaxy catalog
Authors:
G. Camacho-Ciurana,
P. Lee,
N. Arsenov,
A. Kovács,
I. Szapudi,
I. Csabai
Abstract:
The cross-correlation of cosmic voids with the lensing convergence ($κ$) map of the CMB fluctuations offers a powerful tool to refine our understanding of the dark sector in the consensus cosmological model. Our principal aim is to compare the lensing signature of our galaxy data set with simulations based on the concordance model and characterize the results with an $A_κ$ consistency parameter. In particular, our measurements contribute to the understanding of the "lensing-is-low" tension of the $Λ$CDM model. We selected luminous red galaxies from the WISE-Pan-STARRS data set, allowing an extended 14,200 deg$^2$ sky area, which offers a more precise measurement compared to previous studies. We created 2D and 3D void catalogs to cross-correlate their locations with the Planck lensing map and studied their average imprint signal using a stacking methodology. Applying the same procedure, we also generated a mock catalog from the WebSky simulation for comparison. The 2D void analysis revealed good agreement with the standard cosmological model with $A_κ\approx1.06 \pm 0.08$, i.e. $S/N=13.3$, showing a higher $S/N$ than previous studies using voids detected in the Dark Energy Survey data set. The 3D void analysis exhibited a lower $S/N$ and demonstrated worse agreement with our mock catalog than the 2D voids. These deviations might be attributed to limitations in the mock catalog, such as imperfections in the LRG selection, as well as a potential asymmetry between the North and South patches of the WISE-Pan-STARRS data set in terms of data quality. Overall, we present a significant detection of a CMB lensing signal associated with cosmic voids, largely consistent with the concordance model. Future analyses using even larger data sets also hold great promise of further sharpening these results, given their complementary nature to large-scale structure analyses.
Submitted 13 December, 2023;
originally announced December 2023.
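To make the $A_κ$ consistency parameter in the abstract above concrete, here is a minimal sketch of a one-parameter amplitude fit between a measured stacked profile and a simulated template. It assumes independent radial bins with Gaussian errors, which is a simplification; the paper's actual estimator and covariance treatment are not given in the abstract. The toy profile, the error level, and the name `fit_amplitude` are illustrative assumptions.

```python
import numpy as np

def fit_amplitude(data, model, sigma):
    """Best-fit A minimizing sum(((data - A * model) / sigma)**2).

    Assumes uncorrelated bins (diagonal covariance), an assumption of
    this sketch rather than a statement about the published analysis.
    """
    data, model, sigma = (np.asarray(x, dtype=float) for x in (data, model, sigma))
    weight = 1.0 / sigma**2
    norm = np.sum(weight * model**2)
    amp = np.sum(weight * data * model) / norm
    amp_err = 1.0 / np.sqrt(norm)
    return amp, amp_err

# Toy void lensing profile (negative convergence dip), purely illustrative.
radius = np.linspace(0.1, 2.0, 10)
model = -np.exp(-radius**2)
sigma = np.full_like(radius, 0.05)
data = 1.06 * model  # noiseless data scaled by a known amplitude

amp, amp_err = fit_amplitude(data, model, sigma)
snr = amp / amp_err
print(f"A = {amp:.2f} +/- {amp_err:.2f}, S/N = {snr:.1f}")
```

Under this definition the signal-to-noise ratio is simply the amplitude divided by its uncertainty; for the quoted 2D result, $1.06 / 0.08 \approx 13.25$, which is close to the stated $S/N=13.3$.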
-
Complementary Cosmological Simulations
Authors:
Gábor Rácz,
Alina Kiessling,
István Csabai,
István Szapudi
Abstract:
Cosmic variance limits the accuracy of cosmological N-body simulations, introducing bias in statistics such as the power spectrum, halo mass function, or the cosmic shear. We provide new methods to measure and reduce the effect of cosmic variance in existing and new simulations. We ran pairs of simulations using phase-shifted initial conditions with matching amplitudes. We set the initial amplitudes of the Fourier modes to ensure that the average power spectrum of the pair is equal to the cosmic mean power spectrum from linear theory. The average power spectrum of a pair of such simulations remains consistent with the estimated nonlinear spectra of the state-of-the-art methods even at late times. We also show that the effect of cosmic variance on any analysis involving a cosmological simulation can be estimated using the complementary pair of the original simulation. To demonstrate the effectiveness of our novel technique, we simulated a complementary pair of the original Millennium run and quantified the degree to which cosmic variance affected its power spectrum. The average power spectrum of the original and complementary Millennium simulations was able to directly resolve the baryon acoustic oscillation features.
Submitted 17 February, 2023; v1 submitted 26 October, 2022;
originally announced October 2022.
-
Deep Weighted Monte Carlo: A hybrid option pricing framework using neural networks
Authors:
Sándor Kunsági-Máté,
Gábor Fáth,
István Csabai,
Gábor Molnár-Sáska
Abstract:
Recent studies have demonstrated the efficiency of Variational Autoencoders (VAE) to compress high-dimensional implied volatility surfaces into a low dimensional representation. Although this method can be effectively used for pricing vanilla options, it does not provide any explicit information about the dynamics of the underlying asset. In our work we present an effective way to overcome this problem. We use a Weighted Monte Carlo approach to first generate paths from a simple a priori Brownian dynamics, and then calculate path weights to price options correctly. We develop and successfully train a neural network that is able to assign these weights directly from the latent space. Combining the encoder network of the VAE and this new "weight assigner" module, we are able to build a dynamic pricing framework which cleanses the volatility surface from irrelevant noise fluctuations, and then can price not just vanillas, but also exotic options on this idealized vol surface. This pricing method can provide relative value signals for option traders.
Submitted 8 December, 2022; v1 submitted 30 August, 2022;
originally announced August 2022.
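The Weighted Monte Carlo idea in the abstract above (simulate paths from a simple prior, then reweight them when pricing) can be sketched as follows. This is not the authors' framework: the prior is assumed to be a driftless geometric Brownian motion with zero interest rates, and the weights are hand-made placeholders where the paper uses a neural network fed from the VAE latent space. All function names and parameter values are hypothetical.

```python
import numpy as np

def simulate_terminal_prices(s0, sigma, maturity, n_paths, seed=0):
    """Terminal prices under a driftless geometric Brownian prior (zero rates assumed)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_paths)
    return s0 * np.exp(-0.5 * sigma**2 * maturity + sigma * np.sqrt(maturity) * z)

def weighted_price(payoffs, weights):
    """Price as a weighted average of path payoffs; weights are normalized to sum to 1."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return float(np.dot(weights, payoffs))

s_T = simulate_terminal_prices(s0=100.0, sigma=0.2, maturity=1.0, n_paths=50_000)
call_payoff = np.maximum(s_T - 100.0, 0.0)

# Uniform weights reproduce the plain Monte Carlo estimate.
plain_mc = float(call_payoff.mean())
uniform_price = weighted_price(call_payoff, np.ones_like(s_T))

# Placeholder non-uniform weights that favor high terminal prices; in the
# paper the weights come from a trained network, which is not modeled here.
tilted_weights = np.exp(0.01 * (s_T - 100.0))
tilted_price = weighted_price(call_payoff, tilted_weights)

print(plain_mc, uniform_price, tilted_price)
```

Because the same set of paths is reused with different weights, any payoff that can be evaluated on a path, vanilla or exotic, can be priced by changing only the `payoffs` array.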
-
Photometric redshifts for quasars from WISE-PS1-STRM
Authors:
Sándor Kunsági-Máté,
Róbert Beck,
István Szapudi,
István Csabai
Abstract:
Three-dimensional wide-field galaxy surveys are fundamental for cosmological studies. For higher redshifts (z > 1.0), where galaxies are too faint, quasars still trace the large-scale structure of the Universe. Since available telescope time limits spectroscopic surveys, photometric methods are efficient for estimating redshifts for many quasars. Machine learning methods have recently become increasingly successful for quasar photometric redshifts; however, they hinge on the distribution of the training set. Therefore, a rigorous estimation of reliability is critical. We extracted optical and infrared photometric data from the cross-matched catalogue of the WISE All-Sky and PS1 3$π$ DR2 sky surveys. We trained an XGBoost regressor and an artificial neural network on the relation between color indices and spectroscopic redshift. We approximated the effective training set coverage with the K nearest neighbors algorithm. We estimated reliable photometric redshifts of 2,879,298 quasars which overlap with the training set in feature space. We validated the derived redshifts with an independent, clustering-based redshift estimation technique. The final catalog is publicly available.
Submitted 3 June, 2022;
originally announced June 2022.
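One way to read the K nearest neighbors coverage step in the abstract above is as a distance test in color space: an object counts as covered if its k-th nearest training neighbor is close enough. The sketch below illustrates that reading with brute-force distances on synthetic data. The value of `k`, the distance threshold, the three-dimensional color space, and the function name are assumptions of this example and are not taken from the paper.

```python
import numpy as np

def kth_neighbor_distance(train, query, k):
    """Euclidean distance from each query point to its k-th nearest training point."""
    diff = query[:, None, :] - train[None, :, :]
    dist = np.sqrt(np.sum(diff**2, axis=-1))
    return np.sort(dist, axis=1)[:, k - 1]

rng = np.random.default_rng(42)
# Synthetic "color indices" of a training set (3 colors per object).
train_colors = rng.normal(0.0, 1.0, size=(500, 3))
# One query object inside the training distribution, one far outside it.
query_colors = np.array([[0.0, 0.0, 0.0], [8.0, 8.0, 8.0]])

d_k = kth_neighbor_distance(train_colors, query_colors, k=5)
threshold = 1.0  # hypothetical cut on the k-th neighbor distance
covered = d_k < threshold
print(d_k, covered)
```

The brute-force distance matrix is fine for a toy example but scales as the product of training and query sizes; a tree-based or approximate neighbor search would be needed for catalogs with millions of objects.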
-
Decomposition of stellar populations in CosmoDC2 galaxies using SCARLET and Deep Learning
Authors:
Sándor Kunsági-Máté,
István Csabai
Abstract:
We are presenting a novel, Deep Learning based approach to estimate the normalized broadband spectral energy distribution (SED) of different stellar populations in synthetic galaxies. In contrast to the non-parametric multiband source separation algorithm, SCARLET - where the SED and morphology are simultaneously fitted - in our study we provide a morphology-independent, statistical determination of the SEDs, where we only use the color distribution of the galaxy. We developed a neural network (sedNN) that accurately predicts the SEDs of the old, red and young, blue stellar populations of realistic synthetic galaxies from the color distribution of the galaxy-related pixels in simulated broadband images. We trained and tested the network on a subset of the recently published CosmoDC2 simulated galaxy catalog containing about 3,600 galaxies. The model performance was compared to the results of SCARLET, where we found that sedNN can predict the SEDs with 4-5% accuracy on average, which is about two times better than applying SCARLET. We also investigated the effect of this improvement on the flux determination accuracy of the bulge and disk. We found that using more accurate SEDs decreases the error in the flux determination of the components by approximately 30%.
Submitted 29 November, 2021;
originally announced November 2021.
-
Evidence for a high-z ISW signal from supervoids in the distribution of eBOSS quasars
Authors:
A. Kovács,
R. Beck,
A. Smith,
G. Rácz,
I. Csabai,
I. Szapudi
Abstract:
The late-time integrated Sachs-Wolfe (ISW) imprint of $R\gtrsim 100~h^{-1}{\rm Mpc}$ super-structures is sourced by evolving large-scale potentials due to a dominant dark energy component in the $Λ$CDM model. The aspect that makes the ISW effect distinctly interesting is the repeated observation of stronger-than-expected imprints from supervoids at $z\lesssim0.9$. Here we analyze the un-probed key redshift range $0.8<z<2.2$ where the ISW signal is expected to fade in $Λ$CDM, due to a weakening dark energy component, and eventually become consistent with zero in the matter dominated epoch. On the contrary, alternative cosmological models, proposed to explain the excess low-$z$ ISW signals, predicted a sign-change in the ISW effect at $z\approx1.5$ due to the possible growth of large-scale potentials that is absent in the standard model. To discriminate, we estimated the high-$z$ $Λ$CDM ISW signal using the Millennium XXL mock catalogue, and compared it to our measurements from about 800 supervoids identified in the eBOSS DR16 quasar catalogue. At $0.8<z<1.2$, we found an excess ISW signal with $A_\mathrm{ ISW}\approx3.6\pm2.1$ amplitude. The signal is then consistent with the $Λ$CDM expectation ($A_\mathrm{ ISW}=1$) at $1.2<z<1.5$ where the standard and alternative models predict similar amplitudes. Most interestingly, we also detected an opposite-sign ISW signal at $1.5<z<2.2$ that is in $2.7σ$ tension with the $Λ$CDM prediction. Taken at face value, these moderately significant detections of ISW anomalies suggest an alternative growth rate of structure in low-density environments at $\sim100~h^{-1}{\rm Mpc}$ scales.
Submitted 18 April, 2022; v1 submitted 27 July, 2021;
originally announced July 2021.
-
An empirical nonlinear power spectrum overdensity-response
Authors:
Gábor Rácz,
István Szapudi,
István Csabai
Abstract:
Context. The overdensity inside a cosmological sub-volume and the tidal fields from its surroundings affect the matter distribution of the region. The resulting difference between the local and global power spectra is characterized by the response function.
Aims. Our aim is to provide a new, simple, and accurate formula for the power spectrum overdensity response at highly nonlinear scales based on the results of cosmological simulations and paying special attention to the lognormal nature of the density field.
Methods. We measured the dark matter power spectrum amplitude as a function of the overdensity ($δ_W$) in $N$-body simulation subsamples. We show that the response follows a power-law form in terms of $(1+δ_W)$, and we provide a new fit in terms of the variance, $σ(L)$, of a sub-volume of size $L$.
Results. Our fit has a similar accuracy and a comparable complexity to second-order standard perturbation theory on large scales, but it is also valid for nonlinear (smaller) scales, where perturbation theory needs higher-order terms for a comparable precision. Furthermore, we show that the lognormal nature of the overdensity distribution causes a previously unidentified bias: the power spectrum amplitude for a subsample with an average density is typically underestimated by about $-2σ^2$. Although this bias falls to the sub-percent level above characteristic scales of $200~h^{-1}{\rm Mpc}$, taking it into account improves the accuracy of estimating power spectra from zoom-in simulations and smaller high-resolution surveys embedded in larger low-resolution volumes.
Submitted 11 March, 2022; v1 submitted 30 May, 2021;
originally announced May 2021.
-
GPU-Accelerated Hierarchical Bayesian Inference with Application to Modeling Cosmic Populations: CUDAHM
Authors:
János M. Szalai-Gindl,
Thomas J. Loredo,
Brandon C. Kelly,
István Csabai,
Tamás Budavári,
László Dobos
Abstract:
We describe a computational framework for hierarchical Bayesian inference with simple (typically single-plate) parametric graphical models that uses graphics processing units (GPUs) to accelerate computations, enabling deployment on very large datasets. Its C++ implementation, CUDAHM (CUDA for Hierarchical Models) exploits conditional independence between instances of a plate, facilitating massively parallel exploration of the replication parameter space using the single instruction, multiple data architecture of GPUs. It provides support for constructing Metropolis-within-Gibbs samplers that iterate between GPU-accelerated robust adaptive Metropolis sampling of plate-level parameters conditional on upper-level parameters, and Metropolis-Hastings sampling of upper-level parameters on the host processor conditional on the GPU results. CUDAHM is motivated by demographic problems in astronomy, where density estimation and linear and nonlinear regression problems must be addressed for populations of thousands to millions of objects whose features are measured with possibly complex uncertainties. We describe a thinned latent point process framework for modeling such demographic data. We demonstrate accurate GPU-accelerated parametric conditional density deconvolution for simulated populations of up to 300,000 objects in ~1 hour using a single NVIDIA Tesla K40c GPU. Supplementary material provides details about the CUDAHM API and the demonstration problem.
Submitted 17 May, 2021;
originally announced May 2021.
-
The rich still get richer: Empirical comparison of preferential attachment via linking statistics in Bitcoin and Ethereum
Authors:
Dániel Kondor,
Nikola Bulatovic,
József Stéger,
István Csabai,
Gábor Vattay
Abstract:
Bitcoin and Ethereum transactions present one of the largest real-world complex networks that are publicly available for study, including a detailed picture of their time evolution. As such, they have received a considerable amount of attention from the network science community, besides analysis from an economic or cryptography perspective. Among these studies, in an analysis of an early instance of the Bitcoin network, we have shown the clear presence of the preferential attachment, or "rich-get-richer" phenomenon. Now, we revisit this question, using a recent version of the Bitcoin network that has grown almost 100-fold since our original analysis. Furthermore, we carry out a comparison with Ethereum, the second most important cryptocurrency. Our results show that preferential attachment continues to be a key factor in the evolution of both the Bitcoin and Ethereum transaction networks. To facilitate further analysis, we publish a recent version of both transaction networks, and an efficient software implementation that is able to evaluate the linking statistics necessary to learn about preferential attachment on networks with several hundred million edges.
Submitted 23 February, 2021;
originally announced February 2021.
-
The effect of emission lines on the performance of photometric redshift estimation algorithms
Authors:
Géza Csörnyei,
László Dobos,
István Csabai
Abstract:
We investigate the effect of strong emission line galaxies on the performance of empirical photometric redshift estimation methods. In order to artificially control the contribution of photometric error and emission lines to total flux, we develop a PCA-based stochastic mock catalogue generation technique that allows for generating infinite signal-to-noise ratio model spectra with realistic emission lines on top of theoretical stellar continua. Instead of running the computationally expensive stellar population synthesis and nebular emission codes, our algorithm generates realistic spectra with a statistical approach, and - as an alternative to attempting to constrain the priors on input model parameters - works by matching output observational parameters. Hence, it can be used to match the luminosity, colour, emission line and photometric error distribution of any photometric sample with sufficient flux-calibrated spectroscopic follow-up. We test three simple empirical photometric estimation methods and compare the results with and without photometric noise and strong emission lines. While photometric noise clearly dominates the uncertainty of photometric redshift estimates, the key findings are that emission lines play a significant role in resolving colour space degeneracies and good spectroscopic coverage of the entire colour space is necessary to achieve good results with empirical photo-z methods. Template fitting methods, on the other hand, must use a template set with sufficient variation in emission line strengths and ratios, or even better, first estimate the redshift empirically and fit the colours with templates at the best-fit redshift to calculate the K-correction and various physical parameters.
Submitted 27 January, 2021;
originally announced January 2021.
-
The anisotropy of the power spectrum in periodic cosmological simulations
Authors:
Gábor Rácz,
István Szapudi,
István Csabai,
László Dobos
Abstract:
The classical gravitational force on a torus is anisotropic and always lower than Newton's $1/r^2$ law. We demonstrate the effects of periodicity in dark matter only $N$-body simulations of spherical collapse and standard $Λ$CDM initial conditions. Periodic boundary conditions cause an overall negative and anisotropic bias in cosmological simulations of cosmic structure formation. The lower amplitude of power spectra of small periodic simulations is a consequence of the missing large-scale modes and the equally important smaller periodic forces. The effect is most significant when the largest mildly non-linear scales are comparable to the linear size of the simulation box, as often is the case for high-resolution hydrodynamical simulations. Spherical collapse morphs into a shape similar to an octahedron. The anisotropic growth distorts the large-scale $Λ$CDM dark matter structures. We introduce the direction-dependent power spectrum invariant under the octahedral group of the simulation volume and show that the results break spherical symmetry.
Submitted 23 March, 2021; v1 submitted 18 June, 2020;
originally announced June 2020.
-
A common explanation of the Hubble tension and anomalous cold spots in the CMB
Authors:
András Kovács,
Róbert Beck,
István Szapudi,
István Csabai,
Gábor Rácz,
László Dobos
Abstract:
The standard cosmological paradigm narrates a reassuring story of a universe currently dominated by an enigmatic dark energy component. Disquietingly, its universal explaining power has recently been challenged by, above all, the $\sim4σ$ tension in the values of the Hubble constant. Another, less studied anomaly is the repeated observation of integrated Sachs-Wolfe imprints $\sim5\times$ stronger than expected in the $Λ$CDM model from R>100 $Mpc/h$ super-structures. Here we show that the inhomogeneous AvERA model of emerging curvature is capable of telling a plausible albeit radically different story that explains both observational anomalies without dark energy. We demonstrate that while stacked imprints of R>100 $Mpc/h$ supervoids in cosmic microwave background temperature maps can discriminate between the AvERA and $Λ$CDM models, their characteristic differences may remain hidden using alternative void definitions and stacking methodologies. Testing the extremes, we then also show that the CMB Cold Spot can plausibly be explained in the AvERA model as an ISW imprint. The coldest spot in the AvERA map is aligned with multiple low-$z$ supervoids with R>100 $Mpc/h$ and central underdensity $δ_{0}\approx-0.3$, resembling the observed large-scale galaxy density field in the Cold Spot area. We hence conclude that the anomalous imprint of supervoids may well be the canary in the coal mine, and existing observational evidence for dark energy should be re-interpreted to further test alternative models.
Submitted 7 September, 2020; v1 submitted 6 April, 2020;
originally announced April 2020.
-
Kooplex: collaborative data analytics portal for advancing sciences
Authors:
Dávid Visontai,
József Stéger,
János Márk Szalai-Gindl,
László Dobos,
László Oroszlány,
István Ervin Csabai
Abstract:
Research collaborations are continuously emerging catalyzed by online platforms, where people can share their codes, calculations, data and results. These virtual research platforms are innovative, community oriented, flexible and secure as required by modern scientific approaches. A wide range of open source and commercial solutions are available in this field emphasizing the relevant aspects of such a platform differently. In this paper we present our open source and modular platform, KOOPLEX, which combines the key concepts of dynamic collaboration, customizable research environment, data sharing, access to datahubs, reproducible research and reporting. It is easily deployable and scalable to serve more users or access large computational resources.
Submitted 21 November, 2019;
originally announced November 2019.
-
Galaxy shape measurement with convolutional neural networks
Authors:
Dezső Ribli,
László Dobos,
István Csabai
Abstract:
We present our results from training and evaluating a convolutional neural network (CNN) to predict galaxy shapes from wide-field survey images of the first data release of the Dark Energy Survey (DES DR1). We use conventional shape measurements as ground truth from an overlapping, deeper survey with less sky coverage, the Canada-France Hawaii Telescope Lensing Survey (CFHTLenS). We demonstrate that CNN predictions from single band DES images reproduce the results of CFHTLenS at bright magnitudes and show higher correlation with CFHTLenS at fainter magnitudes than maximum likelihood model fitting estimates in the DES Y1 im3shape catalogue. Prediction of shape parameters with a CNN is also extremely fast: it takes only 0.2 milliseconds per galaxy, improving more than 4 orders of magnitude over forward model fitting. The CNN can also accurately predict shapes when using multiple images of the same galaxy, even in different color bands, with no additional computational overhead. The CNN is again more precise for faint objects, and the advantage of the CNN is more pronounced for blue galaxies than red ones when compared to the DES Y1 metacalibration catalogue, which fits a single Gaussian profile using riz band images. We demonstrate that CNN shape predictions within the metacalibration self-calibrating framework yield shear estimates with negligible multiplicative bias, $ m < 10^{-3}$, and no significant PSF leakage. Our proposed setup is applicable to current and next generation weak lensing surveys where higher quality ground truth shapes can be measured in dedicated deep fields.
Submitted 3 June, 2019; v1 submitted 21 February, 2019;
originally announced February 2019.
-
Weak lensing cosmology with convolutional neural networks on noisy data
Authors:
Dezső Ribli,
Bálint Ármin Pataki,
José Manuel Zorrilla Matilla,
Daniel Hsu,
Zoltán Haiman,
István Csabai
Abstract:
Weak gravitational lensing is one of the most promising cosmological probes of the late universe. Several large ongoing (DES, KiDS, HSC) and planned (LSST, EUCLID, WFIRST) astronomical surveys attempt to collect even deeper and larger scale data on weak lensing. Due to gravitational collapse, the distribution of dark matter is non-Gaussian on small scales. However, observations are typically evaluated through the two-point correlation function of galaxy shear, which does not capture non-Gaussian features of the lensing maps. Previous studies attempted to extract non-Gaussian information from weak lensing observations through several higher-order statistics such as the three-point correlation function, peak counts or Minkowski-functionals. Deep convolutional neural networks (CNN) emerged in the field of computer vision with tremendous success, and they offer a new and very promising framework to extract information from 2 or 3-dimensional astronomical data sets, confirmed by recent studies on weak lensing. We show that a CNN is able to yield significantly stricter constraints of ($σ_8, Ω_m$) cosmological parameters than the power spectrum using convergence maps generated by full N-body simulations and ray-tracing, at angular scales and shape noise levels relevant for future observations. In a scenario mimicking LSST or Euclid, the CNN yields 2.4-2.8 times smaller credible contours than the power spectrum, and 3.5-4.2 times smaller at noise levels corresponding to a deep space survey such as WFIRST. We also show that at shape noise levels achievable in future space surveys the CNN yields 1.4-2.1 times smaller contours than peak counts, a higher-order statistic capable of extracting non-Gaussian information from weak lensing maps.
Submitted 10 February, 2019;
originally announced February 2019.
-
StePS: A Multi-GPU Cosmological N-body Code for Compactified Simulations
Authors:
Gábor Rácz,
István Szapudi,
László Dobos,
István Csabai,
Alexander S. Szalay
Abstract:
We present the multi-GPU realization of the StePS (Stereographically Projected Cosmological Simulations) algorithm with MPI-OpenMP-CUDA hybrid parallelization and nearly ideal scale-out to multiple compute nodes. Our new zoom-in cosmological direct N-body simulation method simulates the infinite universe with unprecedented dynamic range for a given amount of memory and, in contrast to traditional periodic simulations, its fundamental geometry and topology match observations. By using a spherical geometry instead of periodic boundary conditions, and gradually decreasing the mass resolution with radius, our code is capable of running simulations with a few gigaparsecs in diameter and with a mass resolution of $\sim 10^{9}M_{\odot}$ in the center in four days on three compute nodes with four GTX 1080Ti GPUs in each. The code can also be used to run extremely fast simulations with reasonable resolution for fitting cosmological parameters. These simulations are useful for prediction needs of large surveys. The StePS code is publicly available for the research community.
Submitted 21 March, 2019; v1 submitted 14 November, 2018;
originally announced November 2018.
-
An improved cosmological parameter inference scheme motivated by deep learning
Authors:
Dezső Ribli,
Bálint Ármin Pataki,
István Csabai
Abstract:
Dark matter cannot be observed directly, but its weak gravitational lensing slightly distorts the apparent shapes of background galaxies, making weak lensing one of the most promising probes of cosmology. Several observational studies have measured the effect, and there are ongoing and planned efforts to provide even larger and higher resolution weak lensing maps. Due to nonlinearities on small scales, the traditional analysis with two-point statistics does not fully capture all the underlying information. Multiple inference methods were proposed to extract more details based on higher order statistics, peak statistics, Minkowski functionals and recently convolutional neural networks (CNN). Here we present an improved convolutional neural network that gives significantly better estimates of the $Ω_m$ and $σ_8$ cosmological parameters from simulated convergence maps than state-of-the-art methods and is also free of systematic bias. We show that the network exploits information in the gradients around peaks, and with this insight, we construct a new, easy-to-understand, and robust peak counting algorithm based on the 'steepness' of peaks, instead of their heights. The proposed scheme is even more accurate than the neural network on high-resolution noiseless maps. With shape noise and lower resolution its relative advantage deteriorates, but it remains more accurate than peak counting.
Submitted 17 December, 2018; v1 submitted 15 June, 2018;
originally announced June 2018.
-
The integrated Sachs-Wolfe effect in the AvERA cosmology
Authors:
Róbert Beck,
István Csabai,
Gábor Rácz,
István Szapudi
Abstract:
The recent AvERA cosmological simulation of Rácz et al. (2017) has a $Λ\mathrm{CDM}$-like expansion history and removes the tension between local and Planck (cosmic microwave background) Hubble constants. We contrast the AvERA prediction of the integrated Sachs-Wolfe (ISW) effect with that of $Λ\mathrm{CDM}$. The linear ISW effect is proportional to the derivative of the growth function, thus it is sensitive to small differences in the expansion histories of the respective models. We create simulated ISW maps tracing the path of light-rays through the Millennium XXL cosmological simulation, and perform theoretical calculations of the ISW power spectrum. AvERA predicts a significantly higher ISW effect than $Λ\mathrm{CDM}$, $A=1.93-5.29$ times larger depending on the $l$ index of the spherical power spectrum, which could be utilized to definitively differentiate the models. We also show that AvERA predicts an opposite-sign ISW effect in the redshift range $z \approx 1.5 - 4.4$, in clear contrast with $Λ\mathrm{CDM}$. Finally, we compare our ISW predictions with previous observations. While at present these cannot distinguish between the two models due to large error bars and a lack of internal consistency suggesting systematics, ISW probes from future surveys will tightly constrain the models.
Submitted 25 January, 2018;
originally announced January 2018.
-
Compactified Cosmological Simulations of the Infinite Universe
Authors:
Gábor Rácz,
István Szapudi,
István Csabai,
László Dobos
Abstract:
We present a novel $N$-body simulation method that compactifies the infinite spatial extent of the Universe into a finite sphere with isotropic boundary conditions to follow the evolution of the large-scale structure. Our approach eliminates the need for periodic boundary conditions, a mere numerical convenience which is not supported by observation and which modifies the law of force on large scales in an unrealistic fashion. We demonstrate that our method outclasses standard simulations executed on workstation-scale hardware in dynamic range, that it is balanced in following a comparable number of high and low $k$ modes, and that its fundamental geometry and topology match observations. Our approach is also capable of simulating an expanding, infinite universe in static coordinates with Newtonian dynamics. The price of these achievements is that most of the simulated volume has smoothly varying mass and spatial resolution, an approximation that carries different systematics than periodic simulations.
Our initial implementation of the method is called StePS, which stands for Stereographically Projected Cosmological Simulations. It uses stereographic projection for space compactification and a naive $\mathcal{O}(N^2)$ force calculation, which nevertheless arrives at a correlation function of the same quality faster than any standard (tree or P$^3$M) algorithm with similar spatial and mass resolution. The $N^2$ force calculation is easy to adapt to modern graphics cards, hence our code can function as a high-speed prediction tool for modern large-scale surveys. To learn about the limits of the respective methods, we compare StePS with GADGET-2 (Springel 2005) running matching initial conditions.
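To make the naive $\mathcal{O}(N^2)$ force calculation mentioned above concrete, here is a minimal Python sketch of direct pairwise summation. It is not the StePS implementation: it uses plain Newtonian gravity in flat space with no stereographic projection or special boundary conditions, and the function name, the Plummer-style `softening` parameter and the unit choice `G=1` are assumptions made for illustration.

```python
import numpy as np

def direct_accelerations(pos, mass, G=1.0, softening=1e-3):
    """Gravitational acceleration on every particle by direct O(N^2) summation.

    pos:  (N, 3) array of positions
    mass: (N,) array of particle masses
    softening: Plummer-style softening length (an assumption of this sketch)
    """
    pos = np.asarray(pos, dtype=float)
    mass = np.asarray(mass, dtype=float)
    # diff[i, j] is the vector pointing from particle i to particle j.
    diff = pos[None, :, :] - pos[:, None, :]
    dist2 = np.sum(diff**2, axis=-1) + softening**2
    inv_d3 = dist2 ** -1.5
    # A particle exerts no force on itself.
    np.fill_diagonal(inv_d3, 0.0)
    # a_i = G * sum_j m_j * (r_j - r_i) / (|r_j - r_i|^2 + eps^2)^(3/2)
    return G * np.einsum("ij,ijk->ik", mass[None, :] * inv_d3, diff)
```

Every particle pair is evaluated independently, which is why this kind of loop maps naturally onto GPU threads; the memory cost of the dense `(N, N, 3)` array in this sketch limits it to small `N`.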
Submitted 15 February, 2018; v1 submitted 14 November, 2017;
originally announced November 2017.
-
Detecting and classifying lesions in mammograms with Deep Learning
Authors:
Dezső Ribli,
Anna Horváth,
Zsuzsa Unger,
Péter Pollner,
István Csabai
Abstract:
In the last two decades Computer Aided Diagnostics (CAD) systems were developed to help radiologists analyze screening mammograms. The benefits of current CAD technologies appear to be contradictory and they should be improved to be ultimately considered useful. Since 2012 deep convolutional neural networks (CNN) have been a tremendous success in image recognition, reaching human performance. These methods have greatly surpassed the traditional approaches, which are similar to currently used CAD solutions. Deep CNNs have the potential to revolutionize medical image analysis. We propose a CAD system based on one of the most successful object detection frameworks, Faster R-CNN. The system detects and classifies malignant or benign lesions on a mammogram without any human intervention. The proposed method sets the state-of-the-art classification performance on the public INbreast database, AUC = 0.95. The approach described here achieved 2nd place in the Digital Mammography DREAM Challenge with AUC = 0.85. When used as a detector, the system reaches high sensitivity with very few false positive marks per image on the INbreast dataset. Source code, the trained model and an OsiriX plugin are available online at https://github.com/riblidezso/frcnn_cad .
Submitted 9 November, 2017; v1 submitted 26 July, 2017;
originally announced July 2017.
-
High Quality Queueing Information from Accelerated Active Network Tomography
Authors:
Tommaso Rizzo,
Jozsef Steger,
Péter Pollner,
Istvan Csabai,
Gabor Vattay
Abstract:
Monitoring network state can be crucial in Future Internet infrastructures. Passive monitoring of all the routers is prohibitively expensive. Storing, accessing and sharing the data is a technological challenge among networks with conflicting economic interests. Active monitoring methods can be attractive alternatives as they are free from most of these issues. Here we demonstrate that it is possible to improve the active network tomography methodology to such an extent that the quality of the extracted link or router level delay is comparable to the passively measurable information. We show that the temporal precision of the measurements and the performance of the data analysis should be simultaneously improved to achieve this goal. In this paper we not only introduce a new efficient message-passing-based algorithm but also show that it is applicable to data collected by the ETOMIC high precision active measurement infrastructure. The measurements are conducted in the GEANT2 high speed academic network connecting the sites, which is an ideal test ground for such Future Internet applications.
Submitted 5 December, 2017; v1 submitted 25 July, 2017;
originally announced July 2017.
-
Relative Rate Reduction Based Control with Adjustable Congestion Level
Authors:
Peter Haga,
Ferenc Toth,
Istvan Csabai,
Gabor Vattay
Abstract:
In the Future Internet it is possible to change elements of congestion control in order to eliminate jitter and batch loss caused by the current control mechanisms based on packet loss events. We investigate the fundamental problem of adjusting sending rates to achieve optimal utilization of the highly variable bandwidth of a network path using accurate packet rate information. This is done by continuously controlling the sending rate with a function of the measured packet rate at the receiver. We propose the relative loss of packet rate between the sender and the receiver (Relative Rate Reduction, RRR) as a new accurate and continuous measure of congestion of a network path, replacing the erratically fluctuating packet loss. We demonstrate that by choosing various RRR-based feedback functions the optimum is reached with an adjustable congestion level. The proposed method guarantees fair bandwidth sharing of competing flows. Finally, we present testbed experiments to demonstrate the performance of the algorithm.
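As a rough illustration of the congestion measure described above, the following Python sketch computes the relative loss of packet rate between sender and receiver. The function name, the packets-per-second unit and the clamping to [0, 1] are assumptions of this sketch; the abstract does not specify the feedback functions, so none is shown.

```python
def relative_rate_reduction(send_rate: float, recv_rate: float) -> float:
    """Relative loss of packet rate between sender and receiver.

    Both rates are in packets per second (an assumed unit). The result is
    0.0 when the receiver sees the full sending rate and approaches 1.0 as
    the path drops more of the traffic.
    """
    if send_rate <= 0:
        raise ValueError("send_rate must be positive")
    rrr = (send_rate - recv_rate) / send_rate
    # Clamping is an assumption of this sketch: measurement noise that makes
    # the receiver rate slightly exceed the sending rate counts as no congestion.
    return min(max(rrr, 0.0), 1.0)
```

For example, a sender emitting 1000 packets per second whose receiver measures 900 packets per second would see an RRR of about 0.1, a continuous quantity rather than a count of individual loss events.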
Submitted 22 July, 2017;
originally announced July 2017.
-
Video Pandemics: Worldwide Viral Spreading of Psy's Gangnam Style Video
Authors:
Zsofia Kallus,
Daniel Kondor,
Jozsef Steger,
Istvan Csabai,
Eszter Bokanyi,
Gabor Vattay
Abstract:
Viral videos can reach global penetration traveling through international channels of communication similarly to real diseases starting from a well-localized source. In past centuries, disease fronts propagated in a concentric spatial fashion from the source of the outbreak via the short range human contact network. The emergence of long-distance air-travel changed these ancient patterns. However, recently, Brockmann and Helbing have shown that concentric propagation waves can be reinstated if propagation time and distance are measured in the flight-time and travel volume weighted underlying air-travel network. Here, we adopt this method for the analysis of viral meme propagation in Twitter messages, and define a similar weighted network distance in the communication network connecting countries and states of the World. We recover a wave-like behavior on average and assess the randomizing effect of non-locality of spreading. We show that similar results can be recovered from Google Trends data as well.
Submitted 14 July, 2017;
originally announced July 2017.
-
Photo-z-SQL: integrated, flexible photometric redshift computation in a database
Authors:
Róbert Beck,
László Dobos,
Tamás Budavári,
Alexander S. Szalay,
István Csabai
Abstract:
We present a flexible template-based photometric redshift estimation framework, implemented in C#, that can be seamlessly integrated into a SQL database (or DB) server and executed on-demand in SQL. The DB integration eliminates the need to move large photometric datasets outside a database for redshift estimation, and utilizes the computational capabilities of DB hardware. The code is able to perform both maximum likelihood and Bayesian estimation, and can handle inputs of variable photometric filter sets and corresponding broad-band magnitudes. It is possible to take into account the full covariance matrix between filters, and filter zero points can be empirically calibrated using measurements with given redshifts. The list of spectral templates and the prior can be specified flexibly, and the expensive synthetic magnitude computations are done via lazy evaluation, coupled with a caching of results. Parallel execution is fully supported. For large upcoming photometric surveys such as the LSST, the ability to perform in-place photo-z calculation would be a significant advantage. Also, the efficient handling of variable filter sets is a necessity for heterogeneous databases, for example the Hubble Source Catalog, and for cross-match services such as SkyQuery. We illustrate the performance of our code on two reference photo-z estimation testing datasets, and provide an analysis of execution time and scalability with respect to different configurations. The code is available for download at https://github.com/beckrob/Photo-z-SQL.
Submitted 20 March, 2017; v1 submitted 4 November, 2016;
originally announced November 2016.
-
Concordance cosmology without dark energy
Authors:
Gábor Rácz,
László Dobos,
Róbert Beck,
István Szapudi,
István Csabai
Abstract:
According to the separate universe conjecture, spherically symmetric sub-regions in an isotropic universe behave like mini-universes with their own cosmological parameters. This is an excellent approximation in both Newtonian and general relativistic theories. We estimate local expansion rates for a large number of such regions, and use a scale parameter calculated from the volume-averaged increments of local scale parameters at each time step in an otherwise standard cosmological $N$-body simulation. The particle mass, corresponding to a coarse graining scale, is an adjustable parameter. This mean field approximation neglects tidal forces and boundary effects, but it is the first step towards a non-perturbative statistical estimation of the effect of non-linear evolution of structure on the expansion rate. Using our algorithm, a simulation with an initial $Ω_m=1$ Einstein-de Sitter setting closely tracks the expansion and structure growth history of the $Λ$CDM cosmology. Due to small but characteristic differences, our model can be distinguished from the $Λ$CDM model by future precision observations. Moreover, our model can resolve the emerging tension between local Hubble constant measurements and the Planck best-fitting cosmology. Further improvements to the simulation are necessary to investigate light propagation and confirm full consistency with cosmic microwave background observations.
Submitted 12 February, 2017; v1 submitted 29 July, 2016;
originally announced July 2016.
-
Race, Religion and the City: Twitter Word Frequency Patterns Reveal Dominant Demographic Dimensions in the United States
Authors:
Eszter Bokányi,
Dániel Kondor,
László Dobos,
Tamás Sebők,
József Stéger,
István Csabai,
Gábor Vattay
Abstract:
Recently, numerous approaches have emerged in the social sciences to exploit the opportunities made possible by the vast amounts of data generated by online social networks (OSNs). Having access to information about users on such a scale opens up a range of possibilities, all without the limitations associated with often slow and expensive paper-based polls. A question that remains to be satisfactorily addressed, however, is how demography is represented in OSN content. Here, we study language use in the US using a corpus of text compiled from over half a billion geo-tagged messages from the online microblogging platform Twitter. Our intention is to reveal the most important spatial patterns in language use in an unsupervised manner and relate them to demographics. Our approach is based on Latent Semantic Analysis (LSA) augmented with the Robust Principal Component Analysis (RPCA) methodology. We find spatially correlated patterns that can be interpreted based on the words associated with them. The main language features can be related to slang use, urbanization, travel, religion and ethnicity, the patterns of which are shown to correlate plausibly with traditional census data. Our findings thus validate the concept of demography being represented in OSN language use and show that the traits observed are inherently present in the word frequencies without any previous assumptions about the dataset. Thus, they could form the basis of further research focusing on the evaluation of demographic data estimation from other big data sources, or on the dynamical processes that result in the patterns found here.
Submitted 11 May, 2016; v1 submitted 10 May, 2016;
originally announced May 2016.
-
Photometric redshifts for the SDSS Data Release 12
Authors:
Róbert Beck,
László Dobos,
Tamás Budavári,
Alexander S. Szalay,
István Csabai
Abstract:
We present the methodology and data behind the photometric redshift database of the Sloan Digital Sky Survey Data Release 12 (SDSS DR12). We adopt a hybrid technique, empirically estimating the redshift via local regression on a spectroscopic training set, then fitting a spectrum template to obtain K-corrections and absolute magnitudes. The SDSS spectroscopic catalog was augmented with data from other, publicly available spectroscopic surveys to mitigate target selection effects. The training set is comprised of $1,976,978$ galaxies, and extends up to redshift $z\approx 0.8$, with a useful coverage of up to $z\approx 0.6$. We provide photometric redshifts and realistic error estimates for the $208,474,076$ galaxies of the SDSS primary photometric catalog. We achieve an average bias of $\overline{Δz_{\mathrm{norm}}} = 5.84 \times 10^{-5}$, a standard deviation of $σ\left(Δz_{\mathrm{norm}}\right)=0.0205$, and a $3σ$ outlier rate of $P_o=4.11\%$ when cross-validating on our training set. The published redshift error estimates and photometric error classes enable the selection of galaxies with high quality photometric redshifts. We also provide a supplementary error map that allows additional, sophisticated filtering of the data.
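The three quality figures quoted above (average bias, standard deviation and outlier rate of $Δz_{\mathrm{norm}}$) can be sketched in Python as follows. The abstract does not define $Δz_{\mathrm{norm}}$; this sketch assumes the common convention $Δz_{\mathrm{norm}} = (z_{\mathrm{phot}} - z_{\mathrm{spec}})/(1 + z_{\mathrm{spec}})$ and a simple, non-iterative $3σ$ cut, so the paper's exact procedure may differ. The function name and arguments are hypothetical.

```python
import numpy as np

def photoz_metrics(z_phot, z_spec, n_sigma=3.0):
    """Bias, scatter and outlier rate of normalized photo-z errors.

    Assumes dz_norm = (z_phot - z_spec) / (1 + z_spec), a common convention
    that is not spelled out in the abstract.
    """
    z_phot = np.asarray(z_phot, dtype=float)
    z_spec = np.asarray(z_spec, dtype=float)
    dz_norm = (z_phot - z_spec) / (1.0 + z_spec)
    bias = dz_norm.mean()
    sigma = dz_norm.std()
    # Fraction of objects whose error exceeds n_sigma standard deviations
    # (single pass, no iterative clipping; an assumption of this sketch).
    outlier_rate = np.mean(np.abs(dz_norm) > n_sigma * sigma)
    return bias, sigma, outlier_rate
```

In a cross-validation setting, `z_phot` would hold the estimates for held-out training galaxies and `z_spec` their spectroscopic redshifts.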
Submitted 21 April, 2016; v1 submitted 31 March, 2016;
originally announced March 2016.
-
Searching for electromagnetic counterpart of LIGO gravitational waves in the Fermi GBM data with ADWO
Authors:
Z. Bagoly,
D. Szécsi,
L. G. Balázs,
I. Csabai,
I. Horváth,
L. Dobos,
J. Lichtenberger,
L. V. Tóth
Abstract:
The Fermi collaboration identified a possible electromagnetic counterpart of the gravitational wave event of September 14, 2015. Our goal is to provide an unsupervised data analysis algorithm to identify similar events in Fermi's Gamma-ray Burst Monitor CTTE data stream. We are looking for signals that are typically weak. Therefore, they can only be found by a careful analysis of count rates of all detectors and energy channels simultaneously. Our Automatized Detector Weight Optimization (ADWO) method consists of a search for the signal, and a test of its significance. We developed ADWO, a virtual detector analysis tool for multi-channel multi-detector signals, and performed successful searches for short transients in the data-streams. We have identified GRB150522B, as well as possible electromagnetic candidates of the transients GW150914 and LVT151012. ADWO is an independently developed, unsupervised data analysis tool that only relies on the raw data of the Fermi satellite. It can therefore provide a strong, independent test to any electromagnetic signal accompanying future gravitational wave observations.
Submitted 24 August, 2016; v1 submitted 21 March, 2016;
originally announced March 2016.
-
Quantifying correlations between galaxy emission lines and stellar continua
Authors:
Róbert Beck,
László Dobos,
Ching-Wa Yip,
Alexander S. Szalay,
István Csabai
Abstract:
We analyse the correlations between continuum properties and emission line equivalent widths of star-forming and active galaxies from the Sloan Digital Sky Survey. Since upcoming large sky surveys will make broad-band observations only, including strong emission lines in theoretical modelling of spectra will be essential to estimate physical properties of photometric galaxies. We show that emission line equivalent widths can be fairly well reconstructed from the stellar continuum using local multiple linear regression in the continuum principal component analysis (PCA) space. Line reconstruction is good for star-forming galaxies and reasonable for galaxies with active nuclei. We propose a practical method to combine stellar population synthesis models with empirical modelling of emission lines. The technique will help generate more accurate model spectra and mock catalogues of galaxies to fit observations of the new surveys. More accurate modelling of emission lines is also expected to improve template-based photometric redshift estimation methods. We also show that, by combining PCA coefficients from the pure continuum and the emission lines, automatic distinction between hosts of weak active galactic nuclei (AGNs) and quiescent star-forming galaxies can be made. The classification method is based on a training set consisting of high-confidence starburst galaxies and AGNs, and allows a separation of active and star-forming galaxies similar to that given by the empirical curve found by Kauffmann et al. We demonstrate the use of three important machine learning algorithms in the paper: k-nearest neighbour finding, k-means clustering and support vector machines.
Submitted 11 January, 2016;
originally announced January 2016.
-
Environment Assisted Quantum Transport in Organic Molecules
Authors:
Gabor Vattay,
Istvan Csabai
Abstract:
One of the new discoveries in quantum biology is the role of Environment Assisted Quantum Transport (ENAQT) in excitonic transport processes. In disordered quantum systems transport is most efficient when the environment just destroys the quantum interferences responsible for localization, but the coupling does not yet drive the system to fully classical thermal diffusion. This poised realm between the pure quantum and the semi-classical domains has not been considered in other biological transport processes, such as charge transport through organic molecules. Binding in receptor-ligand complexes is assumed to be static, as electrons are assumed to be unable to cross the ligand molecule. We show that ENAQT makes cross-ligand transport possible and efficient between certain atoms, opening the way for the reorganization of the charge distribution on the receptor when the ligand molecule docks. This new effect can potentially change our understanding of how receptors work. We demonstrate room temperature ENAQT on the caffeine molecule.
Submitted 28 February, 2015;
originally announced March 2015.
-
Quantum Criticality at the Origin of Life
Authors:
Gabor Vattay,
Dennis Salahub,
Istvan Csabai,
Ali Nassimi,
Stuart A. Kaufmann
Abstract:
Why life persists at the edge of chaos is a question at the very heart of evolution. Here we show that molecules taking part in biochemical processes from small molecules to proteins are critical quantum mechanically. Electronic Hamiltonians of biomolecules are tuned exactly to the critical point of the metal-insulator transition separating the Anderson localized insulator phase from the conducting disordered metal phase. Using tools from Random Matrix Theory we confirm that the energy level statistics of these biomolecules show the universal transitional distribution of the metal-insulator critical point and the wave functions are multifractals in accordance with the theory of Anderson transitions. The findings point to the existence of a universal mechanism of charge transport in living matter. The revealed bio-conductor material is neither a metal nor an insulator but a new quantum critical material which can exist only in highly evolved systems and has unique material properties.
Submitted 3 March, 2015; v1 submitted 24 February, 2015;
originally announced February 2015.
-
Inferring the interplay of network structure and market effects in Bitcoin
Authors:
Dániel Kondor,
István Csabai,
János Szüle,
Márton Pósfai,
Gábor Vattay
Abstract:
A main focus in economics research is understanding the time series of prices of goods and assets. While statistical models using only the properties of the time series itself have been successful in many aspects, we expect to gain a better understanding of the phenomena involved if we can model the underlying system of interacting agents. In this article, we consider the history of Bitcoin, a novel digital currency system, for which the complete list of transactions is available for analysis. Using this dataset, we reconstruct the transaction network between users and analyze changes in the structure of the subgraph induced by the most active users. Our approach is based on the unsupervised identification of important features of the time variation of the network. Applying the widely used method of Principal Component Analysis to the matrix constructed from snapshots of the network at different times, we are able to show how structural changes in the network accompany significant changes in the exchange price of bitcoins.
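The idea of applying Principal Component Analysis to a matrix built from network snapshots can be sketched in Python as below. This is only an illustration under stated assumptions: each snapshot is taken to be a weighted adjacency matrix of the same fixed set of users, each snapshot is flattened into one row, and the function name and parameters are hypothetical rather than taken from the paper.

```python
import numpy as np

def snapshot_pca(snapshots, n_components=2):
    """PCA over time-ordered network snapshots.

    snapshots: array of shape (T, n, n), one adjacency matrix per time step
               (the flattening into rows is an assumption of this sketch).
    Returns the time series of component scores, shape (T, n_components),
    and the fraction of variance explained by each returned component.
    """
    X = np.asarray(snapshots, dtype=float)
    T = X.shape[0]
    X = X.reshape(T, -1)           # one row per snapshot
    X = X - X.mean(axis=0)         # center each matrix element over time
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = S**2 / np.sum(S**2)
    return scores, explained[:n_components]
```

The columns of `scores` are time series that could then be compared with an external signal such as the exchange price; the snapshots must not all be identical, since the variance would then be zero.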
Submitted 12 December, 2014;
originally announced December 2014.
-
Efficient classification of billions of points into complex geographic regions using hierarchical triangular mesh
Authors:
Dániel Kondor,
László Dobos,
István Csabai,
András Bodor,
Gábor Vattay,
Tamás Budavári,
Alexander S. Szalay
Abstract:
We present a case study about the spatial indexing and regional classification of billions of geographic coordinates from geo-tagged social network data using Hierarchical Triangular Mesh (HTM) implemented for Microsoft SQL Server. Due to the lack of certain features of the HTM library, we use it in conjunction with the GIS functions of SQL Server to significantly increase the efficiency of pre-filtering of spatial filter and join queries. For example, we implemented a new algorithm to compute the HTM tessellation of complex geographic regions and precomputed the intersections of HTM triangles and geographic regions for faster false-positive filtering. With full control over the index structure, HTM-based pre-filtering of simple containment searches outperforms SQL Server spatial indices by a factor of ten and HTM-based spatial joins run about a hundred times faster.
Submitted 2 October, 2014;
originally announced October 2014.
-
Objective Identification of Informative Wavelength Regions in Galaxy Spectra
Authors:
Ching-Wa Yip,
Michael Mahoney,
Alex Szalay,
Istvan Csabai,
Tamas Budavari,
Rosemary Wyse,
Laszlo Dobos
Abstract:
Understanding the diversity in spectra is the key to determining the physical parameters of galaxies. The optical spectra of galaxies are highly convoluted with continuum and lines which are potentially sensitive to different physical parameters. Defining the wavelength regions of interest is therefore an important question. In this work, we identify informative wavelength regions in a single-burst stellar populations model by using the CUR Matrix Decomposition. Simulating the Lick/IDS spectrograph configuration, we recover the widely used Dn(4000), Hbeta, and HdeltaA to be most informative. Simulating the SDSS spectrograph configuration with a wavelength range 3450-8350 Angstrom and a model-limited spectral resolution of 3 Angstrom, the most informative regions are: first region, the 4000 Angstrom break and the Hdelta line; second region, the Fe-like indices; third region, the Hbeta line; fourth region, the G band and the Hgamma line. A Principal Component Analysis on the first region shows that the first eigenspectrum primarily traces the stellar age, the second eigenspectrum is related to the age-metallicity degeneracy, and the third eigenspectrum shows an anti-correlation between the strengths of the Balmer and the Ca K and H absorptions. The regions can be used to determine the stellar age and metallicity in early-type galaxies which have solar abundance ratios, no dust, and a single-burst star formation history. The region identification method can be applied to any set of spectra of the user's interest, so that we eliminate the need for a common, fixed-resolution index system. We discuss future directions in extending the current analysis to late-type galaxies.
Submitted 1 February, 2014; v1 submitted 2 December, 2013;
originally announced December 2013.
-
Regional properties of global communication as reflected in aggregated Twitter data
Authors:
Zsofia Kallus,
Norbert Barankai,
Daniel Kondor,
Laszlo Dobos,
Tamas Hanyecz,
Janos Szule,
Jozsef Steger,
Tamas Sebok,
Gabor Vattay,
Istvan Csabai
Abstract:
Twitter is a popular public conversation platform with a worldwide audience and diverse forms of connections between users. In this paper we introduce the concept of aggregated regional Twitter networks in order to characterize communication between geopolitical regions. We present the study of a follower and a mention graph created from an extensive data set collected during the second half of 2012. With a k-shell decomposition the global core-periphery structure is revealed, and by means of a modified Regional-SIR model we also consider basic information spreading properties.
Submitted 6 November, 2013;
originally announced November 2013.
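As an illustration of the k-shell decomposition named in the abstract above, the following is a minimal sketch, not the authors' implementation. It assumes an unweighted, undirected graph given as a list of edge pairs; the function name `k_shell_decomposition` and the toy graph in the usage note are hypothetical, and the aggregated regional networks of the paper may well be weighted or directed.

```python
from collections import defaultdict


def k_shell_decomposition(edges):
    """Return {node: shell index} for an undirected graph given as edge pairs."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)

    degree = {node: len(adj) for node, adj in neighbors.items()}
    remaining = set(degree)
    shell = {}
    k = 0
    while remaining:
        # Collect every node whose current degree is at most k.
        peel = [n for n in remaining if degree[n] <= k]
        if not peel:
            k += 1  # nothing left to peel at this level, move to the next shell
            continue
        while peel:
            node = peel.pop()
            if node not in remaining:
                continue  # already removed via an earlier entry in the stack
            remaining.remove(node)
            shell[node] = k
            # Removing a node lowers its neighbours' degrees, which may
            # pull them into the same shell.
            for nb in neighbors[node]:
                if nb in remaining:
                    degree[nb] -= 1
                    if degree[nb] <= k:
                        peel.append(nb)
    return shell
```

For a triangle `a-b-c` with a chain `a-d-e` attached, `k_shell_decomposition([("a","b"),("b","c"),("c","a"),("a","d"),("d","e")])` places `d` and `e` in shell 1 (the periphery) and `a`, `b`, `c` in shell 2 (the core).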
-
Using Robust PCA to estimate regional characteristics of language use from geo-tagged Twitter messages
Authors:
Dániel Kondor,
István Csabai,
László Dobos,
János Szüle,
Norbert Barankai,
Tamás Hanyecz,
Tamás Sebők,
Zsófia Kallus,
Gábor Vattay
Abstract:
Principal component analysis (PCA) and related techniques have been successfully employed in natural language processing. Text mining applications in the age of the online social media (OSM) face new challenges due to properties specific to these use cases (e.g. spelling issues specific to texts posted by users, the presence of spammers and bots, service announcements, etc.). In this paper, we employ a Robust PCA technique to separate typical outliers and highly localized topics from the low-dimensional structure present in language use in online social networks. Our focus is on identifying geospatial features among the messages posted by the users of the Twitter microblogging service. Using a dataset which consists of over 200 million geolocated tweets collected over the course of a year, we investigate whether the information present in word usage frequencies can be used to identify regional features of language use and topics of interest. Using the PCA pursuit method, we are able to identify important low-dimensional features, which constitute smoothly varying functions of the geographic location.
Submitted 5 November, 2013;
originally announced November 2013.
-
A multi-terabyte relational database for geo-tagged social network data
Authors:
László Dobos,
János Szüle,
Tamás Bodnár,
Tamás Hanyecz,
Tamás Sebők,
Dániel Kondor,
Zsófia Kallus,
József Stéger,
István Csabai,
Gábor Vattay
Abstract:
Despite their relatively low sampling factor, the freely available, randomly sampled status streams of Twitter are very useful sources of geographically embedded social network data. To statistically analyze the information Twitter provides via these streams, we have collected a year's worth of data and built a multi-terabyte relational database from it. The database is designed for fast data loading and to support a wide range of studies focusing on the statistics and geographic features of social networks, as well as on the linguistic analysis of tweets. In this paper we present the method of data collection, the database design, the data loading procedure and special treatment of geo-tagged and multi-lingual data. We also provide some SQL recipes for computing network statistics.
Submitted 5 November, 2013; v1 submitted 4 November, 2013;
originally announced November 2013.
-
Refined position angle measurements for galaxies of the SDSS Stripe 82 co-added dataset
Authors:
József Varga,
István Csabai,
László Dobos
Abstract:
Position angle measurements of Sloan Digital Sky Survey (SDSS) galaxies, as measured by the surface brightness profile fitting code of the SDSS photometric pipeline (Lupton 2001), are known to be strongly biased, especially in the case of almost face-on and highly inclined galaxies. To address this issue we developed a reliable algorithm which determines position angles by means of isophote fitting. In this paper we present our algorithm and a catalogue of position angles for 26397 SDSS galaxies taken from the deep co-added Stripe 82 (equatorial stripe) images.
Submitted 22 October, 2013;
originally announced October 2013.
-
Measuring the dimension of partially embedded networks
Authors:
Dániel Kondor,
Péter Mátray,
István Csabai,
Gábor Vattay
Abstract:
Scaling phenomena have been intensively studied during the past decade in the context of complex networks. As part of these works, novel methods have recently appeared to measure the dimension of abstract and spatially embedded networks. In this paper we propose a new dimension measurement method for networks, which does not require global knowledge of the embedding of the nodes; instead, it exploits link-wise information (link lengths, link delays or other physical quantities). Our method can be regarded as a generalization of the spectral dimension that grasps the network's large-scale structure through local observations made by a random walker while traversing the links. We apply the presented method to synthetic and real-world networks, including road maps, the Internet infrastructure and the Gowalla geosocial network. We analyze the theoretically and empirically designated case when the length distribution of the links has the form P(r) ~ 1/r. We show that while previous dimension concepts are not applicable in this case, the new dimension measure still exhibits scaling with two distinct scaling regimes. Our observations suggest that the link length distribution is not sufficient in itself to entirely control the dimensionality of complex networks, and we show that the proposed measure provides information that complements other known measures.
Submitted 28 August, 2013;
originally announced August 2013.
-
Do the rich get richer? An empirical analysis of the BitCoin transaction network
Authors:
Dániel Kondor,
Márton Pósfai,
István Csabai,
Gábor Vattay
Abstract:
The possibility to analyze everyday monetary transactions is limited by the scarcity of available data, as this kind of information is usually considered highly sensitive. Present econophysics models are usually employed on presumed random networks of interacting agents, and only macroscopic properties (e.g. the resulting wealth distribution) are compared to real-world data. In this paper, we analyze BitCoin, which is a novel digital currency system, where the complete list of transactions is publicly available. Using this dataset, we reconstruct the network of transactions, and extract the time and amount of each payment. We analyze the structure of the transaction network by measuring network characteristics over time, such as the degree distribution, degree correlations and clustering. We find that linear preferential attachment drives the growth of the network. We also study the dynamics taking place on the transaction network, i.e. the flow of money. We measure temporal patterns and the wealth accumulation. Investigating the microscopic statistics of money movement, we find that sublinear preferential attachment governs the evolution of the wealth distribution. We report a scaling relation between the degree and wealth associated to individual nodes.
Submitted 31 March, 2014; v1 submitted 18 August, 2013;
originally announced August 2013.
-
Graywulf: A platform for federated scientific databases and services
Authors:
László Dobos,
Alexander S. Szalay,
Tamás Budavári,
István Csabai,
Nolan Li
Abstract:
Many fields of science rely on relational database management systems to analyze, publish and share data. Since RDBMS are originally designed for, and their development directions are primarily driven by, business use cases they often lack features very important for scientific applications. Horizontal scalability is probably the most important missing feature which makes it challenging to adapt traditional relational database systems to the ever growing data sizes. Due to the limited support of array data types and metadata management, successful application of RDBMS in science usually requires the development of custom extensions. While some of these extensions are specific to the field of science, the majority of them could easily be generalized and reused in other disciplines. With the Graywulf project we intend to target several goals. We are building a generic platform that offers reusable components for efficient storage, transformation, statistical analysis and presentation of scientific data stored in Microsoft SQL Server. Graywulf also addresses the distributed computational issues arising from current RDBMS technologies. The current version supports load balancing of simple queries and parallel execution of partitioned queries over a set of mirrored databases. Uniform user access to the data is provided through a web based query interface and a data surface for software clients. Queries are formulated in a slightly modified syntax of SQL that offers a transparent view of the distributed data. The software library consists of several components that can be reused to develop complex scientific data warehouses: a system registry, administration tools to manage entire database server clusters, a sophisticated workflow execution framework, and a SQL parser library.
Submitted 6 August, 2013;
originally announced August 2013.
-
Photo-Met: a non-parametric method for estimating stellar metallicity from photometric observations
Authors:
Gyöngyi Kerekes,
István Csabai,
László Dobos,
Márton Trencséni
Abstract:
Getting spectra at good signal-to-noise ratios takes orders of magnitude more time than photometric observations. Building on the technique developed for photometric redshift estimation of galaxies, we develop and demonstrate a non-parametric photometric method for estimating the chemical composition of galactic stars. We investigate the efficiency of our method using spectroscopically determined stellar metallicities from SDSS DR7. The technique is generic in the sense that it is not restricted to certain stellar types or stellar parameter ranges and makes it possible to obtain metallicities and error estimates for a much larger sample than spectroscopic surveys would allow. We find that our method performs well, especially for brighter stars and higher metallicities and, in contrast to many other techniques, we are able to reliably estimate the error of the predicted metallicities.
Submitted 9 May, 2013;
originally announced May 2013.
-
Plane-Sweep Incremental Algorithm: Computing Delaunay Tessellations of Large Datasets
Authors:
Márton Trencséni,
István Csabai
Abstract:
We present the plane-sweep incremental algorithm, a hybrid approach for computing Delaunay tessellations of large point sets whose size exceeds the computer's main memory. This approach unites the simplicity of the incremental algorithms with the comparatively low memory requirements of plane-sweep approaches. The procedure is to first sort the point set along the first principal component and then to sequentially insert the points into the tessellation, essentially simulating a sweeping plane. The part of the tessellation that has been passed by the sweeping plane can be evicted from memory and written to disk, limiting the memory requirement of the program to the "thickness" of the data set along its first principal component. We implemented the algorithm and used it to compute the Delaunay tessellation and Voronoi partition of the Sloan Digital Sky Survey magnitude space consisting of 287 million points.
Submitted 12 October, 2012;
originally announced October 2012.
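The first step described in the abstract above, sorting the point set along its first principal component, can be sketched as follows. This is only an illustrative sketch under simplifying assumptions: it handles 2-D points in memory (the paper targets a much larger, higher-dimensional data set), it uses a closed-form principal axis for the 2x2 scatter matrix, and the names `first_principal_axis` and `sweep_order` are hypothetical. The incremental Delaunay insertion and the eviction of finished simplices to disk are not shown.

```python
import math


def first_principal_axis(points):
    """Unit vector along the direction of largest variance of 2-D points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Entries of the (unnormalized) 2x2 scatter matrix.
    sxx = sum((p[0] - mean_x) ** 2 for p in points)
    syy = sum((p[1] - mean_y) ** 2 for p in points)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points)
    # Closed-form orientation of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return math.cos(theta), math.sin(theta)


def sweep_order(points):
    """Return the points sorted by their projection onto the first principal axis."""
    ax, ay = first_principal_axis(points)
    return sorted(points, key=lambda p: p[0] * ax + p[1] * ay)
```

Inserting points in the order returned by `sweep_order` mimics a plane sweeping along the longest extent of the data, which is what keeps the "active" part of the tessellation thin. The sign of the axis is arbitrary, so the sweep may run in either direction.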
-
Strong random correlations in networks of heterogeneous agents
Authors:
Imre Kondor,
István Csabai,
Gábor Papp,
Enys Mones,
Gábor Czimbalmos,
Máté Csaba Sándor
Abstract:
Correlations and other collective phenomena in a schematic model of heterogeneous binary agents (individual spin-glass samples) are considered on the complete graph and also on 2d and 3d regular lattices. The system's stochastic dynamics is studied by numerical simulations. The dynamics is so slow that one can meaningfully speak of quasi-equilibrium states. Performing measurements of correlations in such a quasi-equilibrium state we find that they are random both as to their sign and absolute value, but on average they fall off very slowly with distance in all instances that we have studied. This means that the system is essentially non-local, small changes at one end may have a strong impact at the other. Correlations and other local quantities are extremely sensitive to the boundary conditions all across the system, although this sensitivity disappears upon averaging over the samples or partially averaging over the agents. The strong, random correlations tend to organize a large fraction of the agents into strongly correlated clusters that act together. If we think about this model as a distant metaphor of economic agents or bank networks, the systemic risk implications of this tendency are clear: any impact on even a single strongly correlated agent will spread, in an unforeseeable manner, to the whole system via the strong random correlations.
Submitted 24 February, 2014; v1 submitted 11 October, 2012;
originally announced October 2012.
-
Spatial Indexing of Large Multidimensional Databases
Authors:
István Csabai,
Márton Trencséni,
Géza Herczegh,
László Dobos,
Péter Józsa,
Norbert Purger,
Tamás Budavári,
Alexander Szalay
Abstract:
Scientific endeavors such as large astronomical surveys generate databases on the terabyte scale. These usually multidimensional databases must be visualized and mined in order to find interesting objects or to extract meaningful and qualitatively new relationships. Many statistical algorithms required for these tasks run reasonably fast when operating on small sets of in-memory data, but take noticeable performance hits when operating on large databases that do not fit into memory. We utilize new software technologies to develop and evaluate fast multidimensional indexing schemes that inherently follow the underlying, highly non-uniform distribution of the data: they are layered uniform grid indices, hierarchical binary space partitioning, and sampled flat Voronoi tessellation of the data. Our working database is the 5-dimensional magnitude space of the Sloan Digital Sky Survey with more than 270 million data points, where we show that these techniques can dramatically speed up data mining operations such as finding similar objects by example, classifying objects or comparing extensive simulation sets with observations. We are also developing tools to interact with the multidimensional database and visualize the data at multiple resolutions in an adaptive manner.
Submitted 28 September, 2012;
originally announced September 2012.
-
Revealing a strongly reddened, faint active galactic nucleus population by stacking deep co-added images
Authors:
József Varga,
István Csabai,
László Dobos
Abstract:
More than half of the sources identified by recent radio sky surveys have not been detected by wide-field optical surveys. We present a study based on our co-added image stacking technique, in which our aim is to detect the optical emission from unresolved, isolated radio sources of the Very Large Array (VLA) Faint Images of the Radio Sky at Twenty-cm (FIRST) survey that have no identified optical counterparts in the Sloan Digital Sky Survey (SDSS) Stripe 82 co-added data set. From the FIRST catalogue, 2116 such radio point sources were selected, and cut-out images, centred on the FIRST coordinates, were generated from the Stripe 82 images. The already co-added cut-outs were stacked once again to obtain images of high signal-to-noise ratio, in the hope that optical emission from the radio sources would become detectable. Multiple stacks were generated, based on the radio luminosity of the point sources. The resulting stacked images show central peaks similar to point sources. The peaks have very red colours with steep optical spectral energy distributions. We have found that the optical spectral index alpha_nu falls in the range -2.9 < alpha_nu < -2.2, depending only weakly on the radio flux. The total integration times of the stacks are between 270 and 300 h, and the corresponding 5 sigma detection limit is estimated to be about m_r = 26.6 mag. We argue that the detected light is mainly from the central regions of dust-reddened Type 1 active galactic nuclei. Dust-reddened quasars might represent an early phase of quasar evolution, and thus they can also give us an insight into the formation of massive galaxies. The data used in the paper are available on-line at http://www.vo.elte.hu/doublestacking.
Submitted 27 September, 2012;
originally announced September 2012.
-
SkyQuery: An Implementation of a Parallel Probabilistic Join Engine for Cross-Identification of Multiple Astronomical Databases
Authors:
László Dobos,
Tamás Budavári,
Nolan Li,
Alexander S. Szalay,
István Csabai
Abstract:
Multi-wavelength astronomical studies require cross-identification of detections of the same celestial objects in multiple catalogs based on spherical coordinates and other properties. Because of the large data volumes and spherical geometry, the symmetric N-way association of astronomical detections is a computationally intensive problem, even when sophisticated indexing schemes are used to exclude obviously false candidates. Legacy astronomical catalogs already contain detections of more than a hundred million objects while the ongoing and future surveys will produce catalogs of billions of objects with multiple detections of each at different times. The varying statistical error of position measurements, moving and extended objects, and other physical properties make it necessary to perform the cross-identification using a mathematically correct, proper Bayesian probabilistic algorithm, capable of including various priors. One time, pair-wise cross-identification of these large catalogs is not sufficient for many astronomical scenarios. Consequently, a novel system is necessary that can cross-identify multiple catalogs on-demand, efficiently and reliably. In this paper, we present our solution based on a cluster of commodity servers and ordinary relational databases. The cross-identification problems are formulated in a language based on SQL, but extended with special clauses. These special queries are partitioned spatially by coordinate ranges and compiled into a complex workflow of ordinary SQL queries. Workflows are then executed in a parallel framework using a cluster of servers hosting identical mirrors of the same data sets.
Submitted 21 June, 2012;
originally announced June 2012.
-
A High Resolution Atlas of Composite SDSS Galaxy Spectra
Authors:
László Dobos,
István Csabai,
Ching-Wa Yip,
Tamás Budavári,
Vivienne Wild,
Alexander S. Szalay
Abstract:
In this work we present an atlas of composite spectra of galaxies based on the data of the Sloan Digital Sky Survey Data Release 7 (SDSS DR7). Galaxies are classified by colour, nuclear activity and star-formation activity to calculate average spectra of high signal-to-noise ratio and resolution (S/N = 132 - 4760 at Δλ = 1 Å), using an algorithm that is robust against outliers. Besides composite spectra, we also compute the first five principal components of the distributions in each galaxy class to characterize the nature of variations of individual spectra around the averages. The continua of the composite spectra are fitted with BC03 stellar population synthesis models to extend the wavelength coverage beyond the coverage of the SDSS spectrographs. Common derived parameters of the composites are also calculated: integrated colours in the most popular filter systems, line strength measurements, and continuum absorption indices (including Lick indices). These derived parameters are compared with the distributions of parameters of individual galaxies, and it is shown through many examples that the composites of the atlas cover much of the parameter space spanned by SDSS galaxies. By co-adding thousands of spectra, a total integration time of several months can be reached, which results in extremely low noise composites. The variations in redshift not only allow for extending the spectral coverage bluewards to the original wavelength limit of the SDSS spectrographs, but also make higher spectral resolution achievable. The composite spectrum atlas is available online at http://www.vo.elte.hu/compositeatlas.
Submitted 4 November, 2011;
originally announced November 2011.
-
Correlations between Nebular Emission and the Continuum Spectral Shape in SDSS Galaxies
Authors:
Zsuzsanna Győry,
Alexander S. Szalay,
Tamás Budavári,
István Csabai,
Stéphane Charlot
Abstract:
We present a statistical study of the correlations and dimensionality of emission lines carried out on a sample of over 40,000 Sloan Digital Sky Survey (SDSS) galaxies. Using principal component analysis, we found that the equivalent widths of the 11 strongest lines can be well represented using three parameters. We also explore correlations of the emission pattern with the eigenspace representation of the continuum spectrum. The observed relations are used to provide an empirical prescription for expectation values and variances of emission-line strengths as a function of spectral shape. We show that this estimation of emission lines has a sufficient accuracy to make it suitable for photometric applications. The method has already proved useful in SDSS photometric redshift estimation.
Submitted 9 October, 2011;
originally announced October 2011.