#### Research Overview

Many areas of science, engineering, and industry are already being revolutionized by the adoption of tools and techniques from data science. However, a rigorous analysis of existing approaches together with the development of new ideas is necessary to a) ensure the optimal use of available computational and statistical resources and b) develop a principled and systematic approach to the relevant problems rather than relying on a collection of ad hoc solutions. In particular, there are many interrelated questions that arise in a typical data science project.- First is the acquisition of relevant data: Can data be collected interactively and might this reduce the costs of data acquisition? Is the data noisy and how might this impact the results?
- Second is the processing of data: If the data cannot fit in the memory of a single machine, how can we minimize the communication costs within a cluster of machines? When are approximate answers sufficient and how does the required accuracy trade off with the computational resources available?
- Third is the prediction value of the available data: Can the uncertainty of the final results be quantified? How can the modeling assumptions used by our algorithms be efficiently evaluated?

- Understanding the trade-off between rounds of interactive data acquisition and statistical and computational efficiency.
- Minimizing query complexity in interactive unsupervised learning problems.
- Understanding space/sample complexity trade-offs when processing stochastic data.
- Developing fine-grained approximation algorithms relevant to core data science tasks.
- Using coding theory to enable communication-efficient distributed machine learning.
- Designing variational inference methods with statistical guarantees given limited resources.
- Developing a principled approach to exploiting trade-offs between bias, model complexity, and computational budget.

#### Publications

- Explainable and trustworthy artificial intelligence for correctable modeling in chemical sciences

Science Advances (J. Feng, J. L. Lansford, M. A. Katsoulakis, M. G. Vlachos) - Distributionally Robust Variance Minimization: Tight Variance Bounds over f-Divergence Neighborhoods

ArXiv 2020 (J. Birrell) - Sublinear-Time Algorithms for Computing & Embedding Gap Edit Distance

FOCS 2020 (T. Kociumaka, B. Saha) - GINNs: Graph-Informed Neural Networks for Multiscale Physics

ArXiv 2020 (E. J. Hall, S. Taverniers, M. A. Katsoulakis and D. Tartakovsky) - A Variational Formula for Renyi Divergences

ArXiv 2020 (J. Birrell, P. Dupuis, M. A. Katsoulakis, L. Rey-Bellet and J. Wang) - Recovery of Sparse Signals from a Mixture of Linear Samples

ICML 2020 (A. Mazumdar, S, Pal) - High Dimensional Discrete Integration over the Hypergrid

UAI 2020 (R. Maity, A. Mazumdar, S. Pal) - Efficient Intervention Design for Causal Discovery with Latents

ICML 2020 (with R. Addanki, S. Kasiviswanathan, C. Musco) - Optimizing variational representations of divergences and accelerating their statistical estimation

ArXiv 2020 (J. Birrell, M. A. Katsoulakis, Y. Pantazis) - Cumulant GAN

ArXiv 2020 (Y. Pantazis, D. Paul, M. Fassoulakis, Y. Stylianou, M. A. Katsoulakis) - Quantification of Model Uncertainty on Path-Space via Goal-Oriented Relative Entropy

ArXiv 2020 (J. Birrell, M. A. Katsoulakis, L. Rey-Bellet) - Does Preprocessing help in Fast Sequence Comparisons?

STOC 2020 (E. Goldenberg, A. Rubinstein, B. Saha) - Reliable Distributed Clustering with Redundant Data Assignment

ISIT 2020 (V. Gandikota, A. Mazumdar, A. S. Rawat) - vqSGD: Vector Quantized Stochastic Gradient Descent

ArXiv 2019 (V. Gandikota, D. Kane, R. K. Maity, A. Mazumdar) - MAP Clustering under the Gaussian Mixture Model via Mixed Integer Nonlinear Optimization

ArXiv 2020 (P. Flaherty, P. Wiratchotisatian, J. Lee, Z. Tang, A. Trapp) - Triangle and Four Cycle Counting in the Data Stream Model

PODS 2020 (A. McGregor, S. Vorotnikova) - Compact Representation of Uncertainty in Hierarchical Clustering

ArXiv 2020 (C. Greenberg, S. Macaluso, N. Monath, J. Lee, P. Flaherty, K. Cranmer, A. McGregor, A. McCallum) - Algebraic and Analytic Approaches for Parameter Learning in Mixture Models

ALT 2020 (with A. Krishnamurthy, A. Mazumdar, A. McGregor, S. Pal) - Data-driven Uncertainty Quantification in Systematically Coarse-grained Models.

Soft Materials (to appear). (T. Jin, A. Chazirakis, E. Kalligiannaki, V. Harmandaris and M. A. Katsoulakis) - Distributional Robustness and Uncertainty Quantification for Rare Events

ArXiv 2019 (J. Birrell, P. Dupuis, M. Katsoulakis, L. Rey-Bellet, J. Wang) - Vertex Ordering Problems in Directed Graph Streams.

SODA 2020 (A. Chakrabarti, P. Ghosh, A. McGregor, S. Vorotnikova) - Sample Complexity of Learning Mixtures of Sparse Linear Regressions

NeurIPS 2020 (A. Krishnamurthy, A. Mazumdar, A. McGregor, S. Pal)