TRIPODS Institute for Theoretical Foundations of Data Science

Research Overview

Many areas of science, engineering, and industry are already being revolutionized by the adoption of tools and techniques from data science. However, a rigorous analysis of existing approaches together with the development of new ideas is necessary to a) ensure the optimal use of available computational and statistical resources and b) develop a principled and systematic approach to the relevant problems rather than relying on a collection of ad hoc solutions. In particular, there are many interrelated questions that arise in a typical data science project.
  • First is the acquisition of relevant data: Can data be collected interactively and might this reduce the costs of data acquisition? Is the data noisy and how might this impact the results?
  • Second is the processing of data: If the data cannot fit in the memory of a single machine, how can we minimize the communication costs within a cluster of machines? When are approximate answers sufficient and how does the required accuracy trade off with the computational resources available?
  • Third is the predictive value of the available data: Can the uncertainty of the final results be quantified? How can the modeling assumptions used by our algorithms be efficiently evaluated?
Our main goal is to develop an understanding of the fundamental mathematical and computational issues underlying these questions. Ultimately, this will enable practitioners to make more informed decisions when investing time and money across the life cycle of their data science projects. Specific research goals explored in this project include:

  1. Understanding the trade-off between rounds of interactive data acquisition and statistical and computational efficiency.
  2. Minimizing query complexity in interactive unsupervised learning problems.
  3. Understanding space/sample complexity trade-offs when processing stochastic data.
  4. Developing fine-grained approximation algorithms relevant to core data science tasks.
  5. Using coding theory to enable communication-efficient distributed machine learning.
  6. Designing variational inference methods with statistical guarantees given limited resources.
  7. Developing a principled approach to exploiting trade-offs between bias, model complexity, and computational budget.

Publications

  1. Sample Complexity of Probability Divergences under Group Symmetry
    ICML 2023 (Z. Chen, M. A. Katsoulakis, L. Rey-Bellet, W. Zhu)
  2. Identification of significant gene expression changes in multiple perturbation experiments using knockoffs
    Briefings in Bioinformatics 2023 (T. Zhao, G. Zhu, H. V. Dubey, P. Flaherty)
  3. Function-space regularized Rényi divergences
    ICLR 2023 (J. Birrell, P. Dupuis, M. A. Katsoulakis, Y. Pantazis, L. Rey-Bellet)
  4. Model Uncertainty and Correctability for Directed Graphical Models
    SIAM/ASA Journal on Uncertainty Quantification, 10(4), 1461-1512, (2022) (P. Birmpa, J. Feng, M. A. Katsoulakis, L. Rey-Bellet)
  5. Structure-preserving GANs
    Proceedings of the 39th International Conference on Machine Learning, PMLR 162:1982-2020, (2022). (J. Birrell, M. A. Katsoulakis, L. Rey-Bellet, W. Zhu)
  6. Cumulant GAN
    IEEE Transactions on Neural Networks and Learning Systems, (2022), doi: 10.1109/TNNLS.2022.3161127 (Y. Pantazis, D. Paul, M. Fasoulakis, Y. Stylianou, M. A. Katsoulakis)
  7. Optimizing variational representations of divergences and accelerating their statistical estimation
    IEEE Transactions on Information Theory, (2022), doi: 10.1109/TIT.2022.3160659 (J. Birrell, M. A. Katsoulakis, Y. Pantazis)
  8. Uncertainty quantification for Markov Random Fields
    SIAM/ASA J. Uncertainty Quantification, 9(4), 1457–1498, (2021). (P. Birmpa, M. A. Katsoulakis)
  9. (f, Γ)-Divergences: Interpolating between f-Divergences and Integral Probability Metrics
    Journal of Machine Learning Research, 23 (39), 1-70, (2022) (J. Birrell, P. Dupuis, M. A. Katsoulakis, Y. Pantazis, L. Rey-Bellet)
  10. Uncertainty Quantification and Error Propagation in the Enthalpy and Entropy of Surface Reactions Arising from a Single DFT Functional
    J. Phys. Chem. (Gerhard Wittreich, Geun Ho Gu, Daniel Robinson, Markos Katsoulakis, Dionisios Vlachos)
  11. Probabilistic Group Testing with a Linear Number of Tests
    IEEE International Symposium on Information Theory (ISIT), 2021 (Larkin Flodin, Arya Mazumdar)
  12. Variational Representations and Neural Network Estimation of Rényi Divergences
    SIAM Journal on Mathematics of Data Science (J. Birrell, P. Dupuis, M. A. Katsoulakis, L. Rey-Bellet and J. Wang)
  13. Mutual Information for Explainable Deep Learning of Multiscale Systems
    Journal of Computational Physics, Volume 444, 110551, (2021). (Søren Taverniers, Eric J. Hall, Markos A. Katsoulakis, Daniel M. Tartakovsky)
  14. Trace Reconstruction: Generalized and Parameterized
    IEEE Transactions on Information Theory (A. Krishnamurthy, A. Mazumdar, A. McGregor, S. Pal)
  15. PredictRoute: A Network Path Prediction Toolkit.
    Sigmetrics 2021 (R. Singh, D. Tench, A. McGregor, P. Gill)
  16. Correlation Clustering in Data Streams
    Algorithmica (K. Ahn, G. Cormode, S. Guha, A. McGregor, A. Wirth)
  17. Exact and Approximate Hierarchical Clustering Using A*
    UAI 2021 (Craig S. Greenberg, Sebastian Macaluso, Nicholas Monath, Avinava Dubey, Patrick Flaherty, Manzil Zaheer, Amr Ahmed, Kyle Cranmer and Andrew McCallum)
  18. Doubly Non-Central Beta Matrix Factorization for DNA Methylation Data
    UAI 2021 (Aaron Schein, Anjali Nagulpally, Hanna Wallach, and Patrick Flaherty)
  19. GINNs: Graph-Informed Neural Networks for Multiscale Physics
    Journal of Computational Physics, 2021 (E. J. Hall, S. Taverniers, M. A. Katsoulakis and D. Tartakovsky)
  20. Efficient and Effective ER with Progressive Blocking
    VLDB Journal, 2021. (S. Galhotra, D. Firmani, B. Saha, D. Srivastava)
  21. Cluster Trellis: Data Structures & Algorithms for Exact Inference in Hierarchical Clustering
    AISTATS 2021 (C. Greenberg, S. Macaluso, N. Monath, J. Lee, P. Flaherty, K. Cranmer, A. McGregor, A. McCallum)
  22. Intervention Efficient Algorithms for Approximate Learning of Causal Graphs
    ALT 2021 (R. Addanki, A. McGregor, C. Musco)
  23. Diverse Data Selection under Fairness Constraints
    ICDT 2021 (Z. Moumoulidou, A. McGregor, A. Meliou)
  24. Maximum Coverage in the Data Stream Model: Parameterized and Generalized
    ICDT 2021 (A. McGregor, D. Tench, H. Vu)
  25. Semisupervised Clustering by Queries and Locally Encodable Source Coding
    IEEE Transactions on Information Theory (TIT), vol. 67, no. 2, 2021. (A. Mazumdar, S. Pal)
  26. vqSGD: Vector Quantized Stochastic Gradient Descent
    AISTATS 2021. (V. Gandikota, D. Kane, R. Maity, A. Mazumdar)
  27. Recovery of Sparse Linear Classifiers from Mixture of Responses
    NeurIPS 2020. (V. Gandikota, A. Mazumdar, S. Pal)
  28. A Workload-Adaptive Mechanism for Linear Queries Under Local Differential Privacy
    VLDB 2020. (R. McKenna, R. Maity, A. Mazumdar, G. Miklau)
  29. Recovery of Sparse Signals from a Mixture of Linear Samples
    ICML 2020. (A. Mazumdar, S. Pal)
  30. Explainable and trustworthy artificial intelligence for correctable modeling in chemical sciences
    Science Advances (J. Feng, J. L. Lansford, M. A. Katsoulakis, M. G. Vlachos)
  31. Sublinear-Time Algorithms for Computing & Embedding Gap Edit Distance
    FOCS 2020 (T. Kociumaka, B. Saha)
  32. High Dimensional Discrete Integration over the Hypergrid
    UAI 2020 (R. Maity, A. Mazumdar, S. Pal)
  33. Efficient Intervention Design for Causal Discovery with Latents
    ICML 2020 (R. Addanki, S. Kasiviswanathan, C. Musco)
  34. Quantification of Model Uncertainty on Path-Space via Goal-Oriented Relative Entropy
    ESAIM: M2AN, 55 1 (2021) 131-169 (Jeremiah Birrell, Markos A. Katsoulakis and Luc Rey-Bellet)
  35. Does Preprocessing help in Fast Sequence Comparisons?
    STOC 2020 (E. Goldenberg, A. Rubinstein, B. Saha)
  36. Reliable Distributed Clustering with Redundant Data Assignment
    ISIT 2020 (V. Gandikota, A. Mazumdar, A. S. Rawat)
  37. Triangle and Four Cycle Counting in the Data Stream Model
    PODS 2020 (A. McGregor, S. Vorotnikova)
  38. Algebraic and Analytic Approaches for Parameter Learning in Mixture Models
    ALT 2020 (A. Krishnamurthy, A. Mazumdar, A. McGregor, S. Pal)
  39. Data-driven Uncertainty Quantification in Systematically Coarse-grained Models
    Soft Materials, 18:2-3, 348-368. (T. Jin, A. Chazirakis, E. Kalligiannaki, V. Harmandaris and M. A. Katsoulakis)
  40. Vertex Ordering Problems in Directed Graph Streams.
    SODA 2020 (A. Chakrabarti, P. Ghosh, A. McGregor, S. Vorotnikova)
  41. Sample Complexity of Learning Mixtures of Sparse Linear Regressions
    NeurIPS 2019 (A. Krishnamurthy, A. Mazumdar, A. McGregor, S. Pal)

Preprints

  1. MAP Clustering under the Gaussian Mixture Model via Mixed Integer Nonlinear Optimization
    ArXiv 2020 (P. Flaherty, P. Wiratchotisatian, J. Lee, Z. Tang, A. Trapp)
  2. Distributionally Robust Variance Minimization: Tight Variance Bounds over f-Divergence Neighborhoods
    ArXiv 2020 (J. Birrell)
  3. Distributional Robustness and Uncertainty Quantification for Rare Events
    ArXiv 2019 (J. Birrell, P. Dupuis, M. Katsoulakis, L. Rey-Bellet, J. Wang)