TY - JOUR
T1 - Dataset Similarity to Assess Semisupervised Learning Under Distribution Mismatch Between the Labeled and Unlabeled Datasets
AU - Calderon-Ramirez, Saul
AU - Oala, Luis
AU - Torrents-Barrena, Jordina
AU - Yang, Shengxiang
AU - Elizondo, David
AU - Moemeni, Armaghan
AU - Colreavy-Donnelly, Simon
AU - Samek, Wojciech
AU - Molina-Cabello, Miguel A.
AU - Lopez-Rubio, Ezequiel
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2023/4/1
Y1 - 2023/4/1
AB - Semisupervised deep learning (SSDL) is a popular strategy for leveraging unlabeled data in machine learning when labeled data are not readily available. In real-world scenarios, several unlabeled data sources are usually available, with varying degrees of distribution mismatch with respect to the labeled dataset. This raises the question of which unlabeled dataset to choose for good SSDL outcomes. Oftentimes, semantic heuristics are used to match unlabeled data with labeled data, but a quantitative and systematic approach to this selection problem would be preferable. In this work, we first test the SSDL MixMatch algorithm under various distribution mismatch configurations to study the impact on SSDL accuracy. We then propose a quantitative unlabeled dataset selection heuristic based on dataset dissimilarity measures, designed to systematically assess how distribution mismatch between the labeled and unlabeled datasets affects MixMatch performance. We refer to the proposed measures, which compare labeled and unlabeled datasets, as deep dataset dissimilarity measures (DeDiMs). They use the feature space of a generic Wide-ResNet, can be applied prior to learning, are quick to evaluate, and are model agnostic. The strong correlation in our tests between MixMatch accuracy and the proposed DeDiMs suggests that this approach is a good fit for quantitatively ranking different unlabeled datasets prior to SSDL training.
KW - Dataset similarity
KW - MixMatch
KW - deep learning
KW - distribution mismatch
KW - out of distribution data
KW - semisupervised deep learning
UR - http://www.scopus.com/inward/record.url?scp=85151583071&partnerID=8YFLogxK
U2 - 10.1109/TAI.2022.3168804
DO - 10.1109/TAI.2022.3168804
M3 - Article
AN - SCOPUS:85151583071
SN - 2691-4581
VL - 4
SP - 282
EP - 291
JO - IEEE Transactions on Artificial Intelligence
JF - IEEE Transactions on Artificial Intelligence
IS - 2
ER -
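
Editorial sketch of the idea summarized in the abstract above: a DeDiM-style score compares a labeled and an unlabeled dataset in the feature space of a generic pretrained Wide-ResNet, before any SSDL training. The Python below is a minimal illustration, not the authors' exact method; it assumes torchvision's Wide ResNet-50-2 as the generic feature extractor and substitutes one simple measure (cosine distance between mean feature vectors) for the paper's suite of dissimilarity measures. The helper names (make_feature_extractor, extract_features, dedim_cosine) are hypothetical.

    # Illustrative sketch only: rank candidate unlabeled datasets by
    # feature-space dissimilarity to the labeled set, using a fixed
    # pretrained Wide-ResNet (assumption: Wide ResNet-50-2 from torchvision).
    import torch
    import torch.nn.functional as F
    import torchvision.models as models

    def make_feature_extractor():
        # ImageNet-pretrained Wide ResNet-50-2 with the classifier head
        # removed, so the network outputs 2048-d penultimate features.
        net = models.wide_resnet50_2(weights=models.Wide_ResNet50_2_Weights.DEFAULT)
        net.fc = torch.nn.Identity()
        net.eval()
        return net

    @torch.no_grad()
    def extract_features(net, loader, device="cpu"):
        # Collect penultimate-layer features for every batch in the loader.
        net = net.to(device)
        feats = [net(x.to(device)) for x, *_ in loader]
        return torch.cat(feats)

    def dedim_cosine(f_labeled, f_unlabeled):
        # One simple dissimilarity (a stand-in for the paper's measures):
        # cosine distance between the two datasets' mean feature vectors.
        mu_l = F.normalize(f_labeled.mean(dim=0), dim=0)
        mu_u = F.normalize(f_unlabeled.mean(dim=0), dim=0)
        return 1.0 - torch.dot(mu_l, mu_u).item()

Under these assumptions, candidate unlabeled datasets would be scored against the labeled set and ranked in ascending order of dissimilarity before running MixMatch, mirroring the selection heuristic described in the abstract.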