Research

Research Overview

Many areas of science, engineering, and industry are already being revolutionized by the adoption of tools and techniques from data science. However, a rigorous analysis of existing approaches together with the development of new ideas is necessary to a) ensure the optimal use of available computational and statistical resources and b) develop a principled and systematic approach to the relevant problems rather than relying on a collection of ad hoc solutions. In particular, there are many interrelated questions that arise in a typical data science project.

First is the acquisition of relevant data: Can data be collected interactively and might this reduce the costs of data acquisition? Is the data noisy and how might this impact the results?
Second is the processing of data: If the data cannot fit in the memory of a single machine, how can we minimize the communication costs within a cluster of machines? When are approximate answers sufficient and how does the required accuracy trade off with the computational resources available?
Third is the predictive value of the available data: Can the uncertainty of the final results be quantified? How can the modeling assumptions used by our algorithms be efficiently evaluated?

Our main goal is to develop an understanding of the fundamental mathematical and computational issues underlying the aforementioned questions. Ultimately, this will enable practitioners to make more informed decisions when investing time and money across the life cycle of their data science project. Specific research goals explored in this project include:

Understanding the trade-off between rounds of interactive data acquisition and statistical and computational efficiency.
Minimizing query complexity in interactive unsupervised learning problems.
Understanding space/sample complexity trade-offs when processing stochastic data.
Developing fine-grained approximation algorithms relevant to core data science tasks.
Using coding theory to enable communication-efficient distributed machine learning.
Designing variational inference methods with statistical guarantees given limited resources.
Developing a principled approach to exploiting trade-offs between bias, model complexity, and computational budget.

Publications

Sample Complexity of Probability Divergences under Group Symmetry
ICML 2023 (Z. Chen, M. A. Katsoulakis, L. Rey-Bellet, W. Zhu)
Identification of significant gene expression changes in multiple perturbation experiments using knockoffs
Briefings in Bioinformatics 2023 (T. Zhao, G. Zhu, H. V. Dubey, P. Flaherty)
Function-space regularized Rényi divergences
ICLR 2023 (J. Birrell, P. Dupuis, M. A. Katsoulakis, Y. Pantazis, L. Rey-Bellet)
Model Uncertainty and Correctability for Directed Graphical Models
SIAM/ASA Journal on Uncertainty Quantification 10 (4),1461-1512, (2022) (P. Birmpa, J. Feng, M. A. Katsoulakis, L. Rey-Bellet)
Structure-preserving GANs
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:1982-2020, (2022). (J. Birrell, M. A. Katsoulakis, L. Rey-Bellet, W. Zhu)
Cumulant GAN
IEEE Transactions on Neural Networks and Learning Systems, (2022), doi: 10.1109/TNNLS.2022.3161127 (Y. Pantazis, D. Paul, M. Fasoulakis, Y. Stylianou, M. A. Katsoulakis)
Optimizing variational representations of divergences and accelerating their statistical estimation
IEEE Transactions on Information Theory, (2022), doi: 10.1109/TIT.2022.3160659 (J. Birrell, M. A. Katsoulakis, Y. Pantazis)
Uncertainty quantification for Markov Random Fields
SIAM/ASA J. Uncertainty Quantification, 9(4), 1457–1498, (2021). https://doi.org/10.1137/20M1374614 (P. Birmpa, M. A. Katsoulakis)
(f, Γ)-Divergences: Interpolating between f-Divergences and Integral Probability Metrics
Journal of Machine Learning Research, 23 (39), 1-70, (2022) (J. Birrell, P. Dupuis, M. A. Katsoulakis, Y. Pantazis, L. Rey-Bellet)
Uncertainty Quantification and Error Propagation in the Enthalpy and Entropy of Surface Reactions Arising from a Single DFT Functional
J. Phys. Chem, (Gerhard Wittreich, Geun Ho Gu, Daniel Robinson, Markos Katsoulakis, Dionisios Vlachos)
Probabilistic Group Testing with a Linear Number of Tests
IEEE International Symposium on Information Theory (ISIT), 2021 (Larkin Flodin, Arya Mazumdar)
Variational Representations and Neural Network Estimation of Rényi Divergences
SIAM Journal on Mathematics of Data Science (J. Birrell, P. Dupuis, M. A. Katsoulakis, L. Rey-Bellet and J. Wang)
Mutual Information for Explainable Deep Learning of Multiscale Systems
Journal of Computational Physics, Volume 444, 2021, 110551. (Søren Taverniers, Eric J. Hall, Markos A. Katsoulakis, Daniel M. Tartakovsky)
Trace Reconstruction: Generalized and Parameterized
IEEE Transactions on Information Theory (A. Krishnamurthy, A. Mazumdar, A. McGregor, S. Pal)
PredictRoute: A Network Path Prediction Toolkit.
Sigmetrics 2021 (R. Singh, D. Tench, A. McGregor, P. Gill)
Correlation Clustering in Data Streams
Algorithmica (K. Ahn, G. Cormode, S. Guha, A. McGregor, A. Wirth)
Exact and Approximate Hierarchical Clustering Using A*
UAI 2021 (Craig S. Greenberg, Sebastian Macaluso, Nicholas Monath, Avinava Dubey, Patrick Flaherty, Manzil Zaheer, Amr Ahmed, Kyle Cranmer and Andrew McCallum)
Doubly Non-Central Beta Matrix Factorization for DNA Methylation Data
UAI 2021 (Aaron Schein, Anjali Nagulpally, Hanna Wallach, and Patrick Flaherty)
GINNs: Graph-Informed Neural Networks for Multiscale Physics
Journal of Computational Physics, 2021 (E. J. Hall, S. Taverniers, M. A. Katsoulakis and D. Tartakovsky)
Efficient and Effective ER with Progressive Blocking
VLDB Journal, 2021. (S. Galhotra, D. Firmani, B. Saha, D. Srivastava)
Cluster Trellis: Data Structures & Algorithms for Exact Inference in Hierarchical Clustering
AISTATS 2021 (C. Greenberg, S. Macaluso, N. Monath, J. Lee, P. Flaherty, K. Cranmer, A. McGregor, A. McCallum)
Intervention Efficient Algorithms for Approximate Learning of Causal Graphs
ALT 2021 (R. Addanki, A. McGregor, C. Musco)
Diverse Data Selection under Fairness Constraints
ICDT 2021 (Z. Moumoulidou, A. McGregor, A. Meliou)
Maximum Coverage in the Data Stream Model: Parameterized and Generalized
ICDT 2021 (A. McGregor, D. Tench, H. Vu)
Semisupervised Clustering by Queries and Locally Encodable Source Coding
IEEE Transactions on Information Theory (TIT), vol. 67, no. 2, 2021. (A. Mazumdar, S. Pal)
vqSGD: Vector Quantized Stochastic Gradient Descent
AISTATS 2021. (V. Gandikota, D. Kane, R. Maity, A. Mazumdar)
Recovery of Sparse Linear Classifiers from Mixture of Responses
NeurIPS 2020. (V. Gandikota, A. Mazumdar, S. Pal)
A Workload-Adaptive Mechanism for Linear Queries Under Local Differential Privacy
VLDB 2020. (R. McKenna, R. Maity, A. Mazumdar, G. Miklau)
Recovery of Sparse Signals from a Mixture of Linear Samples
ICML 2020. (A. Mazumdar, S. Pal)
Explainable and trustworthy artificial intelligence for correctable modeling in chemical sciences
Science Advances (J. Feng, J. L. Lansford, M. A. Katsoulakis, D. G. Vlachos)
Sublinear-Time Algorithms for Computing & Embedding Gap Edit Distance
FOCS 2020 (T. Kociumaka, B. Saha)
High Dimensional Discrete Integration over the Hypergrid
UAI 2020 (R. Maity, A. Mazumdar, S. Pal)
Efficient Intervention Design for Causal Discovery with Latents
ICML 2020 (R. Addanki, S. Kasiviswanathan, A. McGregor, C. Musco)
Quantification of Model Uncertainty on Path-Space via Goal-Oriented Relative Entropy
ESAIM: M2AN, 55 1 (2021) 131-169 (Jeremiah Birrell, Markos A. Katsoulakis and Luc Rey-Bellet)
Does Preprocessing help in Fast Sequence Comparisons?
STOC 2020 (E. Goldenberg, A. Rubinstein, B. Saha)
Reliable Distributed Clustering with Redundant Data Assignment
ISIT 2020 (V. Gandikota, A. Mazumdar, A. S. Rawat)
Triangle and Four Cycle Counting in the Data Stream Model
PODS 2020 (A. McGregor, S. Vorotnikova)
Algebraic and Analytic Approaches for Parameter Learning in Mixture Models
ALT 2020 (A. Krishnamurthy, A. Mazumdar, A. McGregor, S. Pal)
Data-driven Uncertainty Quantification in Systematically Coarse-grained Models
Soft Materials, 18:2-3, 348-368. (T. Jin, A. Chazirakis, E. Kalligiannaki, V. Harmandaris and M. A. Katsoulakis)
Vertex Ordering Problems in Directed Graph Streams.
SODA 2020 (A. Chakrabarti, P. Ghosh, A. McGregor, S. Vorotnikova)
Sample Complexity of Learning Mixtures of Sparse Linear Regressions
NeurIPS 2019 (A. Krishnamurthy, A. Mazumdar, A. McGregor, S. Pal)

Preprints

Optimizing variational representations of divergences and accelerating their statistical estimation
ArXiv 2020, Under revision at IEEE Trans Info Theory (J. Birrell, M. A. Katsoulakis, Y. Pantazis)
Cumulant GAN
ArXiv 2020, Under revision at IEEE Trans NN and Learning Syst. (Y. Pantazis, D. Paul, M. Fasoulakis, Y. Stylianou, M. A. Katsoulakis)
MAP Clustering under the Gaussian Mixture Model via Mixed Integer Nonlinear Optimization
ArXiv 2020 (P. Flaherty, P. Wiratchotisatian, J. Lee, Z. Tang, A. Trapp)
Distributionally Robust Variance Minimization: Tight Variance Bounds over f-Divergence Neighborhoods
ArXiv 2020 (J. Birrell)
Distributional Robustness and Uncertainty Quantification for Rare Events
ArXiv 2019 (J. Birrell, P. Dupuis, M. Katsoulakis, L. Rey-Bellet, J. Wang)
Model Uncertainty and Correctability for Directed Graphical Models
(Panagiota Birmpa, Jinchao Feng, Markos Katsoulakis, Luc Rey-Bellet)