TRIPODS Institute for Theoretical Foundations of Data Science

Research Overview

Many areas of science, engineering, and industry are already being revolutionized by the adoption of tools and techniques from data science. However, a rigorous analysis of existing approaches together with the development of new ideas is necessary to a) ensure the optimal use of available computational and statistical resources and b) develop a principled and systematic approach to the relevant problems rather than relying on a collection of ad hoc solutions. In particular, there are many interrelated questions that arise in a typical data science project.
  • First is the acquisition of relevant data: Can data be collected interactively and might this reduce the costs of data acquisition? Is the data noisy and how might this impact the results?
  • Second is the processing of data: If the data cannot fit in the memory of a single machine, how can we minimize the communication costs within a cluster of machines? When are approximate answers sufficient and how does the required accuracy trade off with the computational resources available?
  • Third is the prediction value of the available data: Can the uncertainty of the final results be quantified? How can the modeling assumptions used by our algorithms be efficiently evaluated?
Our main goal of developing an understanding of the fundamental mathematical and computational issues underlying the aforementioned questions. Ultimately, this will enable practitioners to make more informed decisions when investing time and money across the life cycle of their data science project. Specific research goals explored in this project include:

  1. Understanding the trade-off between rounds of interactive data acquisition and statistical and computational efficiency.
  2. Minimizing query complexity in interactive unsupervised learning problems.
  3. Understanding space/sample complexity trade-offs when processing stochastic data.
  4. Developing fine-grained approximation algorithms relevant to core data science tasks.
  5. Using coding theory to enable communication-efficient distributed machine learning.
  6. Designing variational inference methods with statistical guarantees given limited resources.
  7. Developing a principled approach to exploiting trade-offs between bias, model complexity, and computational budget.


  1. Explainable and trustworthy artificial intelligence for correctable modeling in chemical sciences
    Science Advances (J. Feng, J. L. Lansford, M. A. Katsoulakis, M. G. Vlachos)
  2. Distributionally Robust Variance Minimization: Tight Variance Bounds over f-Divergence Neighborhoods
    ArXiv 2020 (J. Birrell)
  3. Sublinear-Time Algorithms for Computing & Embedding Gap Edit Distance
    FOCS 2020 (T. Kociumaka, B. Saha)
  4. GINNs: Graph-Informed Neural Networks for Multiscale Physics
    ArXiv 2020 (E. J. Hall, S. Taverniers, M. A. Katsoulakis and D. Tartakovsky)
  5. A Variational Formula for Renyi Divergences
    ArXiv 2020 (J. Birrell, P. Dupuis, M. A. Katsoulakis, L. Rey-Bellet and J. Wang)
  6. Recovery of Sparse Signals from a Mixture of Linear Samples
    ICML 2020 (A. Mazumdar, S, Pal)
  7. High Dimensional Discrete Integration over the Hypergrid
    UAI 2020 (R. Maity, A. Mazumdar, S. Pal)
  8. Efficient Intervention Design for Causal Discovery with Latents
    ICML 2020 (with R. Addanki, S. Kasiviswanathan, C. Musco)
  9. Optimizing variational representations of divergences and accelerating their statistical estimation
    ArXiv 2020 (J. Birrell, M. A. Katsoulakis, Y. Pantazis)
  10. Cumulant GAN
    ArXiv 2020 (Y. Pantazis, D. Paul, M. Fassoulakis, Y. Stylianou, M. A. Katsoulakis)
  11. Quantification of Model Uncertainty on Path-Space via Goal-Oriented Relative Entropy
    ArXiv 2020 (J. Birrell, M. A. Katsoulakis, L. Rey-Bellet)
  12. Does Preprocessing help in Fast Sequence Comparisons?
    STOC 2020 (E. Goldenberg, A. Rubinstein, B. Saha)
  13. Reliable Distributed Clustering with Redundant Data Assignment
    ISIT 2020 (V. Gandikota, A. Mazumdar, A. S. Rawat)
  14. vqSGD: Vector Quantized Stochastic Gradient Descent
    ArXiv 2019 (V. Gandikota, D. Kane, R. K. Maity, A. Mazumdar)
  15. MAP Clustering under the Gaussian Mixture Model via Mixed Integer Nonlinear Optimization
    ArXiv 2020 (P. Flaherty, P. Wiratchotisatian, J. Lee, Z. Tang, A. Trapp)
  16. Triangle and Four Cycle Counting in the Data Stream Model
    PODS 2020 (A. McGregor, S. Vorotnikova)
  17. Compact Representation of Uncertainty in Hierarchical Clustering
    ArXiv 2020 (C. Greenberg, S. Macaluso, N. Monath, J. Lee, P. Flaherty, K. Cranmer, A. McGregor, A. McCallum)
  18. Algebraic and Analytic Approaches for Parameter Learning in Mixture Models
    ALT 2020 (with A. Krishnamurthy, A. Mazumdar, A. McGregor, S. Pal)
  19. Data-driven Uncertainty Quantification in Systematically Coarse-grained Models.
    Soft Materials (to appear). (T. Jin, A. Chazirakis, E. Kalligiannaki, V. Harmandaris and M. A. Katsoulakis)
  20. Distributional Robustness and Uncertainty Quantification for Rare Events
    ArXiv 2019 (J. Birrell, P. Dupuis, M. Katsoulakis, L. Rey-Bellet, J. Wang)
  21. Vertex Ordering Problems in Directed Graph Streams.
    SODA 2020 (A. Chakrabarti, P. Ghosh, A. McGregor, S. Vorotnikova)
  22. Sample Complexity of Learning Mixtures of Sparse Linear Regressions
    NeurIPS 2020 (A. Krishnamurthy, A. Mazumdar, A. McGregor, S. Pal)