Semi-Supervised Learning: The Provable Benefits of Unlabeled Data for Sparse Gaussian Classification

  • Tuesday, 12 March 2024, 10:00 - 11:00
  • Mathematikon, SR 8
  • Boaz Nadler (Weizmann Institute of Science, Israel)
  • Address

    Mathematikon
    Im Neuenheimer Feld 205
    Seminar Room 8 (4th floor)

The premise of semi-supervised learning (SSL) is that combining labeled and unlabeled data enables learning significantly more accurate models. Despite empirical successes, the theoretical understanding of SSL is still far from complete. In this talk, we consider SSL for high-dimensional sparse Gaussian classification. A key challenge here is feature selection: detecting the few variables that are informative for the classification problem.
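
Below is a minimal, self-contained sketch of this problem setting, assuming the standard two-class model x | y ~ N(y·μ, I_d) with a k-sparse mean vector μ; the dimension, sparsity, signal strength, and sample sizes are illustrative placeholders, not values from the talk.

```python
# A minimal sketch of the sparse Gaussian classification setting, assuming the
# standard two-class model x | y ~ N(y * mu, I_d) with a k-sparse mean vector
# mu.  All dimensions, sparsity levels, and sample sizes are illustrative
# placeholders, not values from the talk.
import numpy as np

rng = np.random.default_rng(0)

d, k = 1000, 10               # ambient dimension and sparsity (illustrative)
n_labeled, n_unlabeled = 50, 5000
signal = 0.5                  # per-coordinate signal strength (illustrative)

# k-sparse mean vector: only the first k coordinates carry class information
mu = np.zeros(d)
mu[:k] = signal

def sample(n, rng, with_labels=True):
    """Draw n points from the mixture 0.5*N(mu, I_d) + 0.5*N(-mu, I_d)."""
    y = rng.choice([-1.0, 1.0], size=n)
    x = y[:, None] * mu + rng.standard_normal((n, d))
    return (x, y) if with_labels else x

X_lab, y_lab = sample(n_labeled, rng)                  # labeled data (x, y)
X_unlab = sample(n_unlabeled, rng, with_labels=False)  # unlabeled data (x only)
```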

For this SSL setting, we derive information-theoretic lower bounds as well as computational lower bounds based on the low-degree likelihood ratio framework. Our key contribution is the identification of a regime in the problem parameters (dimension, sparsity, number of labeled and unlabeled samples) in which a polynomial-time SSL algorithm that we propose succeeds, whereas any computationally efficient supervised or unsupervised scheme, using only the labeled or only the unlabeled data respectively, would fail. This result highlights the provable benefits of combining labeled and unlabeled data for feature selection in high dimensions.
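
As a purely hypothetical illustration (not the algorithm analyzed in the talk), the sketch below shows one way labeled and unlabeled statistics can be combined for feature selection under the model above: class-signed averages of the labeled samples estimate μ_j, while the unlabeled samples flag informative coordinates through inflated second moments, since E[x_j²] = 1 + μ_j² under the mixture. The particular scores, weighting, and top-k rule are arbitrary choices made only for illustration.

```python
# Hypothetical feature-selection heuristic combining labeled and unlabeled
# statistics under the sparse Gaussian mixture model; an illustration only,
# not the algorithm proposed in the talk.
import numpy as np

rng = np.random.default_rng(0)
d, k, signal = 1000, 10, 0.5                  # illustrative placeholders
n_labeled, n_unlabeled = 50, 5000

mu = np.zeros(d)
mu[:k] = signal                               # k-sparse mean vector

y_lab = rng.choice([-1.0, 1.0], size=n_labeled)
X_lab = y_lab[:, None] * mu + rng.standard_normal((n_labeled, d))
y_hidden = rng.choice([-1.0, 1.0], size=n_unlabeled)      # labels never used
X_unlab = y_hidden[:, None] * mu + rng.standard_normal((n_unlabeled, d))

# Labeled data: class-signed averages are unbiased estimates of mu_j.
labeled_score = (y_lab[:, None] * X_lab).mean(axis=0)

# Unlabeled data: E[x_j^2] = 1 + mu_j^2 under the mixture, so coordinates with
# inflated empirical second moments are likely informative, even without labels.
unlabeled_score = (X_unlab ** 2).mean(axis=0) - 1.0

# Combine the two scores and keep the top-k coordinates; the weighting and the
# top-k rule are arbitrary choices made purely for illustration.
combined = labeled_score ** 2 + unlabeled_score
selected = np.sort(np.argsort(combined)[-k:])
print(selected)   # with enough signal this typically recovers indices 0..k-1
```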