TRIPODS Institute for Theoretical Foundations of Data Science

Research Overview

Many areas of science, engineering, and industry are already being revolutionized by the adoption of tools and techniques from data science. However, a rigorous analysis of existing approaches together with the development of new ideas is necessary to a) ensure the optimal use of available computational and statistical resources and b) develop a principled and systematic approach to the relevant problems rather than relying on a collection of ad hoc solutions. In particular, there are many interrelated questions that arise in a typical data science project.
  • First is the acquisition of relevant data: Can data be collected interactively and might this reduce the costs of data acquisition? Is the data noisy and how might this impact the results?
  • Second is the processing of data: If the data cannot fit in the memory of a single machine, how can we minimize the communication costs within a cluster of machines? When are approximate answers sufficient and how does the required accuracy trade off with the computational resources available?
  • Third is the prediction value of the available data: Can the uncertainty of the final results be quantified? How can the modeling assumptions used by our algorithms be efficiently evaluated?
Our main goal of developing an understanding of the fundamental mathematical and computational issues underlying the aforementioned questions. Ultimately, this will enable practitioners to make more informed decisions when investing time and money across the life cycle of their data science project. Specific research goals explored in this project include:

  1. Understanding the trade-off between rounds of interactive data acquisition and statistical and computational efficiency.
  2. Minimizing query complexity in interactive unsupervised learning problems.
  3. Understanding space/sample complexity trade-offs when processing stochastic data.
  4. Developing fine-grained approximation algorithms relevant to core data science tasks.
  5. Using coding theory to enable communication-efficient distributed machine learning.
  6. Designing variational inference methods with statistical guarantees given limited resources.
  7. Developing a principled approach to exploiting trade-offs between bias, model complexity, and computational budget.

Publications

  1. GINNs: Graph-Informed Neural Networks for Multiscale Physics
    Journal of Computational Physics, 2021 (E. J. Hall, S. Taverniers, M. A. Katsoulakis and D. Tartakovsky)
  2. Efficient and Effective ER with Progressive Blocking
    VLDB Journal, 2021. (S. Galhotra, D. Firmani, B. Saha, D. Srivastava)
  3. Cluster Trellis: Data Structures & Algorithms for Exact Inference in Hierarchical Clustering
    AISTATS 2021 (C. Greenberg, S. Macaluso, N. Monath, J. Lee, P. Flaherty, K. Cranmer, A. McGregor, A. McCallum)
  4. Intervention Efficient Algorithms for Approximate Learning of Causal Graphs
    ALT 2021 (R. Addanki, A. McGregor, C. Musco)
  5. Diverse Data Selection under Fairness Constraints
    ICDT 2021 (Z. Moumoulidou, A. McGregor, A. Meliou)
  6. Maximum Coverage in the Data Stream Model: Parameterized and Generalized
    ICDT 2021 (A. McGregor, D. Tench, H. Vu)
  7. Semisupervised Clustering by Queries and Locally Encodable Source Coding
    IEEE Transactions on Information Theory (TIT), vol. 67, no. 2, 2021. (A. Mazumdar, S. Pal)
  8. vqSGD: Vector Quantized Stochastic Gradient Descent
    AISTATS 2021. (V. Gandikota, D. Kane, R. Maity, A. Mazumdar)
  9. Recovery of Sparse Linear Classifiers from Mixture of Responses
    NeurIPS 2020. (V. Gandikota, A. Mazumdar, S. Pal)
  10. A Workload-Adaptive Mechanism for Linear Queries Under Local Differential Privacy
    VLDB 2020. (R. McKenna, R. Maity, A. Mazumdar, G. Miklau)
  11. Recovery of Sparse Signals from a Mixture of Linear Samples
    ICML 2020. (A. Mazumdar, S. Pal)
  12. Explainable and trustworthy artificial intelligence for correctable modeling in chemical sciences
    Science Advances (J. Feng, J. L. Lansford, M. A. Katsoulakis, M. G. Vlachos)
  13. Sublinear-Time Algorithms for Computing & Embedding Gap Edit Distance
    FOCS 2020 (T. Kociumaka, B. Saha)
  14. High Dimensional Discrete Integration over the Hypergrid
    UAI 2020 (R. Maity, A. Mazumdar, S. Pal)
  15. Efficient Intervention Design for Causal Discovery with Latents
    ICML 2020 (with R. Addanki, S. Kasiviswanathan, C. Musco)
  16. Quantification of Model Uncertainty on Path-Space via Goal-Oriented Relative Entropy
    ESAIM: M2AN, Forthcoming article (J. Birrell, M. A. Katsoulakis, L. Rey-Bellet)
  17. Does Preprocessing help in Fast Sequence Comparisons?
    STOC 2020 (E. Goldenberg, A. Rubinstein, B. Saha)
  18. Reliable Distributed Clustering with Redundant Data Assignment
    ISIT 2020 (V. Gandikota, A. Mazumdar, A. S. Rawat)
  19. Triangle and Four Cycle Counting in the Data Stream Model
    PODS 2020 (A. McGregor, S. Vorotnikova)
  20. Algebraic and Analytic Approaches for Parameter Learning in Mixture Models
    ALT 2020 (with A. Krishnamurthy, A. Mazumdar, A. McGregor, S. Pal)
  21. Data-driven Uncertainty Quantification in Systematically Coarse-grained Models
    Soft Materials, 18:2-3, 348-368. (T. Jin, A. Chazirakis, E. Kalligiannaki, V. Harmandaris and M. A. Katsoulakis)
  22. Vertex Ordering Problems in Directed Graph Streams.
    SODA 2020 (A. Chakrabarti, P. Ghosh, A. McGregor, S. Vorotnikova)
  23. Sample Complexity of Learning Mixtures of Sparse Linear Regressions
    NeurIPS 2019 (A. Krishnamurthy, A. Mazumdar, A. McGregor, S. Pal)

Preprints

  1. Optimizing variational representations of divergences and accelerating their statistical estimation
    ArXiv 2020 (J. Birrell, M. A. Katsoulakis, Y. Pantazis)
  2. Cumulant GAN
    ArXiv 2020 (Y. Pantazis, D. Paul, M. Fassoulakis, Y. Stylianou, M. A. Katsoulakis)
  3. (f , Γ)-Divergences: Interpolating between f -Divergences and Integral Probability Metrics,
    ArXiv 2020 (J. Birrell, P. Dupuis, M. A. Katsoulakis, Y. Pantazis, L. Rey-Bellet)
  4. Uncertainty quantification for Markov Random Fields
    ArXiv 2020 (P. Birmpa, M. A. Katsoulakis)
  5. MAP Clustering under the Gaussian Mixture Model via Mixed Integer Nonlinear Optimization
    ArXiv 2020 (P. Flaherty, P. Wiratchotisatian, J. Lee, Z. Tang, A. Trapp)
  6. Mutual Information for Explainable Deep Learning of Multiscale Systems
    ArXiv 2020 (E. J. Hall, S. Taverniers, M. A. Katsoulakis, D. M. Tartakovsky)
  7. Distributionally Robust Variance Minimization: Tight Variance Bounds over f-Divergence Neighborhoods
    ArXiv 2020 (J. Birrell)
  8. A Variational Formula for Renyi Divergences
    ArXiv 2020 (J. Birrell, P. Dupuis, M. A. Katsoulakis, L. Rey-Bellet and J. Wang)
  9. Distributional Robustness and Uncertainty Quantification for Rare Events
    ArXiv 2019 (J. Birrell, P. Dupuis, M. Katsoulakis, L. Rey-Bellet, J. Wang)