Data Quality Metrics for Unlabelled Datasets

Catalina Diaz, Saul Calderon-Ramirez, Luis Diego Mora Aguilar

Research output: Chapter in book/report/conference proceeding; Conference contribution; peer-reviewed

2 Citations (Scopus)

Abstract

Deep learning models usually need large amounts of labeled data, which becomes a concern in real-world applications because labeling a dataset is costly in time, money, and other resources. The Semi-supervised Learning Model (SSLM) approach addresses this by training on both labeled and unlabeled datasets, a practice that helps improve the overall performance of the models. The unlabeled datasets may include out-of-distribution as well as in-distribution data points, which may affect the model's accuracy and future predictions. This investigation proposes a metric for estimating how much the unlabeled dataset can affect the accuracy of the SSLM. It also aims to show that data quality metrics are a topic that needs further research, especially since the future of deep learning models targets real-world applications such as healthcare. Data quality metrics have normally been applied to structured data, but they can also be applied to unstructured data (the datasets used to train deep learning models). The method employed in this research takes the Mahalanobis distance as the basis for generating a trend and then a metric. The approach follows what is demonstrated and proposed in [1], but uses covariance matrices to compare the labeled and unlabeled datasets. The experiments show that the Mahalanobis distance produces results consistent with the proposed method while achieving a 99% lower processing time. Using Pearson's correlation, the result was a strong negative correlation with the MixMatch results reported in [1].
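As an illustration of the kind of computation the abstract describes, the following is a minimal sketch, not the authors' implementation. It assumes each dataset has already been reduced to a matrix of feature vectors (one row per sample), scores the unlabeled set by its mean Mahalanobis distance to the labeled distribution, and provides a Pearson correlation helper. The function names `mean_mahalanobis` and `pearson`, the use of a pseudo-inverse, and the choice to aggregate by the mean are all assumptions made for this example; the paper's exact formulation is not given in the abstract.

```python
import numpy as np


def mean_mahalanobis(labeled: np.ndarray, unlabeled: np.ndarray) -> float:
    """Mean Mahalanobis distance from unlabeled samples to the labeled data.

    Both arrays have shape (n_samples, n_features). Hypothetical helper:
    the aggregation by mean is an assumption of this sketch.
    """
    mu = labeled.mean(axis=0)
    cov = np.cov(labeled, rowvar=False)
    # Assumption: a pseudo-inverse is used so a singular covariance
    # matrix does not raise an error.
    cov_inv = np.linalg.pinv(cov)
    diff = unlabeled - mu
    # Squared distance per sample: diff_i^T @ cov_inv @ diff_i
    squared = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return float(np.sqrt(np.clip(squared, 0.0, None)).mean())


def pearson(x, y) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(x, y)[0, 1])


if __name__ == "__main__":
    # Synthetic features only; these are not data from the paper.
    rng = np.random.default_rng(0)
    labeled = rng.normal(size=(500, 4))
    in_dist = rng.normal(size=(500, 4))            # same distribution
    out_dist = rng.normal(loc=5.0, size=(500, 4))  # shifted distribution

    print("in-distribution score: ", mean_mahalanobis(labeled, in_dist))
    print("out-of-distribution score:", mean_mahalanobis(labeled, out_dist))
```

Under these assumptions, an unlabeled set drawn from a shifted distribution receives a larger score than one drawn from the labeled distribution. A score of this kind could then be compared against SSLM accuracy figures with `pearson`, where a strongly negative coefficient would indicate that accuracy drops as the distance grows.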

Original language: English
Title of host publication: 2022 IEEE 4th International Conference on BioInspired Processing, BIP 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (electronic): 9781665470483
DOI
Status: Published - 2022
Event: 4th IEEE International Conference on BioInspired Processing, BIP 2022 - Cartago, Costa Rica
Duration: 15 Nov 2022 to 17 Nov 2022

Publication series

Name: 2022 IEEE 4th International Conference on BioInspired Processing, BIP 2022

Conference

Conference: 4th IEEE International Conference on BioInspired Processing, BIP 2022
Country/Territory: Costa Rica
City: Cartago
Period: 15/11/22 to 17/11/22
