#### Research Overview

Many areas of science, engineering, and industry are already being revolutionized by the adoption of tools and techniques from data science. However, a rigorous analysis of existing approaches, together with the development of new ideas, is necessary to (a) ensure the optimal use of available computational and statistical resources and (b) develop a principled and systematic approach to the relevant problems rather than relying on a collection of ad hoc solutions. In particular, many interrelated questions arise in a typical data science project:

- First is the acquisition of relevant data: Can data be collected interactively, and might this reduce the costs of data acquisition? Is the data noisy, and how might this impact the results?
- Second is the processing of data: If the data cannot fit in the memory of a single machine, how can we minimize the communication costs within a cluster of machines? When are approximate answers sufficient, and how does the required accuracy trade off against the computational resources available?
- Third is the predictive value of the available data: Can the uncertainty of the final results be quantified? How can the modeling assumptions used by our algorithms be efficiently evaluated?
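As a toy illustration of the trade-off between memory and accuracy when data cannot be stored in full, the classical Misra-Gries summary estimates item frequencies in one pass using only `k` counters (a standard textbook sketch, not taken from any of the papers listed below): with a stream of length `n`, every estimate undercounts by at most `n/k`, so more memory buys more accuracy.

```python
def misra_gries(stream, k):
    """Approximate item frequencies in one pass using at most k counters.

    Each returned count undercounts the true frequency by at most n/k,
    where n is the stream length; items with frequency > n/k survive.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Counters are full: decrement all, dropping any that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# A stream dominated by "a" and "b", plus a few rare items.
stream = ["a"] * 50 + ["b"] * 30 + list("cdefghij")
summary = misra_gries(stream, k=3)  # only 3 counters, not one per item
```

Here the frequent items `"a"` and `"b"` are retained with slightly deflated counts, while the rare items are discarded, using memory independent of the number of distinct items.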

Specific research directions include:

- Understanding the trade-off between rounds of interactive data acquisition and statistical and computational efficiency.
- Minimizing query complexity in interactive unsupervised learning problems.
- Understanding space/sample complexity trade-offs when processing stochastic data.
- Developing fine-grained approximation algorithms relevant to core data science tasks.
- Using coding theory to enable communication-efficient distributed machine learning.
- Designing variational inference methods with statistical guarantees given limited resources.
- Developing a principled approach to exploiting trade-offs between bias, model complexity, and computational budget.
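To make the communication-efficiency direction concrete, the sketch below shows the simplest form of gradient compression for distributed learning: each worker transmits one sign bit per coordinate plus a single scale, instead of full-precision floats. This is only an illustrative toy in the spirit of quantized-SGD schemes such as vqSGD (the function names and the specific scheme here are ours, not from the cited paper):

```python
def quantize(grad):
    """Compress a gradient to per-coordinate signs plus one L1-based scale.

    Communication cost drops from one float per coordinate to one bit
    per coordinate plus a single shared float.
    """
    scale = sum(abs(g) for g in grad) / len(grad)
    signs = [1 if g >= 0 else -1 for g in grad]
    return scale, signs

def dequantize(scale, signs):
    """Reconstruct a sign-preserving estimate of the gradient."""
    return [scale * s for s in signs]

grad = [0.5, -1.25, 0.75, -0.1]
scale, signs = quantize(grad)       # what a worker would transmit
estimate = dequantize(scale, signs)  # what the server would reconstruct
```

The reconstruction keeps the sign of every coordinate while collapsing magnitudes to a common scale; practical schemes refine this with vector quantization and unbiasedness guarantees.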

#### Publications

- Exact and Approximate Hierarchical Clustering Using A*
  UAI 2021 (Craig S. Greenberg, Sebastian Macaluso, Nicholas Monath, Avinava Dubey, Patrick Flaherty, Manzil Zaheer, Amr Ahmed, Kyle Cranmer and Andrew McCallum)
- Doubly Non-Central Beta Matrix Factorization for DNA Methylation Data
  UAI 2021 (Aaron Schein, Anjali Nagulpally, Hanna Wallach, and Patrick Flaherty)
- GINNs: Graph-Informed Neural Networks for Multiscale Physics
  Journal of Computational Physics, 2021 (E. J. Hall, S. Taverniers, M. A. Katsoulakis and D. Tartakovsky)
- Efficient and Effective ER with Progressive Blocking
  VLDB Journal, 2021 (S. Galhotra, D. Firmani, B. Saha, D. Srivastava)
- Cluster Trellis: Data Structures & Algorithms for Exact Inference in Hierarchical Clustering
  AISTATS 2021 (C. Greenberg, S. Macaluso, N. Monath, J. Lee, P. Flaherty, K. Cranmer, A. McGregor, A. McCallum)
- Intervention Efficient Algorithms for Approximate Learning of Causal Graphs
  ALT 2021 (R. Addanki, A. McGregor, C. Musco)
- Diverse Data Selection under Fairness Constraints
  ICDT 2021 (Z. Moumoulidou, A. McGregor, A. Meliou)
- Maximum Coverage in the Data Stream Model: Parameterized and Generalized
  ICDT 2021 (A. McGregor, D. Tench, H. Vu)
- Semisupervised Clustering by Queries and Locally Encodable Source Coding
  IEEE Transactions on Information Theory (TIT), vol. 67, no. 2, 2021 (A. Mazumdar, S. Pal)
- vqSGD: Vector Quantized Stochastic Gradient Descent
  AISTATS 2021 (V. Gandikota, D. Kane, R. Maity, A. Mazumdar)
- Recovery of Sparse Linear Classifiers from Mixture of Responses
  NeurIPS 2020 (V. Gandikota, A. Mazumdar, S. Pal)
- A Workload-Adaptive Mechanism for Linear Queries Under Local Differential Privacy
  VLDB 2020 (R. McKenna, R. Maity, A. Mazumdar, G. Miklau)
- Recovery of Sparse Signals from a Mixture of Linear Samples
  ICML 2020 (A. Mazumdar, S. Pal)
- Explainable and trustworthy artificial intelligence for correctable modeling in chemical sciences
  Science Advances (J. Feng, J. L. Lansford, M. A. Katsoulakis, M. G. Vlachos)
- Sublinear-Time Algorithms for Computing & Embedding Gap Edit Distance
  FOCS 2020 (T. Kociumaka, B. Saha)
- High Dimensional Discrete Integration over the Hypergrid
  UAI 2020 (R. Maity, A. Mazumdar, S. Pal)
- Efficient Intervention Design for Causal Discovery with Latents
  ICML 2020 (with R. Addanki, S. Kasiviswanathan, C. Musco)
- Quantification of Model Uncertainty on Path-Space via Goal-Oriented Relative Entropy
  ESAIM: M2AN, forthcoming (J. Birrell, M. A. Katsoulakis, L. Rey-Bellet)
- Does Preprocessing help in Fast Sequence Comparisons?
  STOC 2020 (E. Goldenberg, A. Rubinstein, B. Saha)
- Reliable Distributed Clustering with Redundant Data Assignment
  ISIT 2020 (V. Gandikota, A. Mazumdar, A. S. Rawat)
- Triangle and Four Cycle Counting in the Data Stream Model
  PODS 2020 (A. McGregor, S. Vorotnikova)
- Algebraic and Analytic Approaches for Parameter Learning in Mixture Models
  ALT 2020 (with A. Krishnamurthy, A. Mazumdar, A. McGregor, S. Pal)
- Data-driven Uncertainty Quantification in Systematically Coarse-grained Models
  Soft Materials, 18:2-3, 348-368 (T. Jin, A. Chazirakis, E. Kalligiannaki, V. Harmandaris and M. A. Katsoulakis)
- Vertex Ordering Problems in Directed Graph Streams
  SODA 2020 (A. Chakrabarti, P. Ghosh, A. McGregor, S. Vorotnikova)
- Sample Complexity of Learning Mixtures of Sparse Linear Regressions
  NeurIPS 2019 (A. Krishnamurthy, A. Mazumdar, A. McGregor, S. Pal)

#### Preprints

- Optimizing variational representations of divergences and accelerating their statistical estimation
  arXiv 2020 (J. Birrell, M. A. Katsoulakis, Y. Pantazis)
- Cumulant GAN
  arXiv 2020 (Y. Pantazis, D. Paul, M. Fassoulakis, Y. Stylianou, M. A. Katsoulakis)
- (f, Γ)-Divergences: Interpolating between f-Divergences and Integral Probability Metrics
  arXiv 2020 (J. Birrell, P. Dupuis, M. A. Katsoulakis, Y. Pantazis, L. Rey-Bellet)
- Uncertainty Quantification for Markov Random Fields
  arXiv 2020 (P. Birmpa, M. A. Katsoulakis)
- MAP Clustering under the Gaussian Mixture Model via Mixed Integer Nonlinear Optimization
  arXiv 2020 (P. Flaherty, P. Wiratchotisatian, J. Lee, Z. Tang, A. Trapp)
- Mutual Information for Explainable Deep Learning of Multiscale Systems
  arXiv 2020 (E. J. Hall, S. Taverniers, M. A. Katsoulakis, D. M. Tartakovsky)
- Distributionally Robust Variance Minimization: Tight Variance Bounds over f-Divergence Neighborhoods
  arXiv 2020 (J. Birrell)
- A Variational Formula for Rényi Divergences
  arXiv 2020 (J. Birrell, P. Dupuis, M. A. Katsoulakis, L. Rey-Bellet and J. Wang)
- Distributional Robustness and Uncertainty Quantification for Rare Events
  arXiv 2019 (J. Birrell, P. Dupuis, M. Katsoulakis, L. Rey-Bellet, J. Wang)