Using Cluster Analysis to Assess the Impact of Dataset Heterogeneity on Deep Convolutional Network Accuracy: A First Glance

Mauro Mendez, Saul Calderon, Pascal N. Tyrrell

Research output: Chapter in book/report/conference proceeding › Conference contribution › Peer-reviewed

8 Citations (Scopus)

Abstract

In this paper we perform cluster analysis using Fuzzy K-means over the image-based features of two models to assess how dataset heterogeneity impacts model accuracy. A highly heterogeneous dataset is associated with sparse data samples, which usually degrades overall model generalization and accuracy on test samples. We propose measuring the Coefficient of Variation (CV) in the resulting clusters to estimate data heterogeneity, as a metric for predicting model generalization and test accuracy. We show that highly heterogeneous datasets are common when the number of samples is insufficient, which yields a high CV. In our experiments with two different models and datasets, higher CV values decreased model test accuracy considerably. We tested ResNet 18 on binary classification of teeth x-ray scans and VGG16 on age regression from hand x-ray scans. The results suggest that cluster analysis can be used to identify the influence of heterogeneity on CNN test accuracy. Based on our experiments, we recommend a CV < 5% to obtain satisfactory model test accuracy.
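The idea of clustering feature vectors and then measuring a per-cluster CV can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not state which quantity the CV is computed over, so the code assumes it is the distance from each sample to its assigned cluster center. The function names (`fuzzy_c_means`, `coefficient_of_variation`, `per_cluster_cv`), the fuzzifier `m=2.0`, and the synthetic `features` array standing in for CNN features are all hypothetical.

```python
import numpy as np


def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=200, tol=1e-6, seed=0):
    """Minimal fuzzy c-means (fuzzy K-means). Returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row sums to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Centers are membership-weighted means of the samples.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Distance from every sample to every center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)  # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U


def coefficient_of_variation(values):
    """CV in percent: 100 * (population standard deviation / mean)."""
    values = np.asarray(values, dtype=float)
    return 100.0 * values.std() / values.mean()


def per_cluster_cv(X, centers, U):
    """ASSUMPTION: CV of sample-to-center distances, per hard-assigned cluster."""
    labels = U.argmax(axis=1)
    dist = np.linalg.norm(X - centers[labels], axis=1)
    return {int(k): coefficient_of_variation(dist[labels == k])
            for k in np.unique(labels)}


# Synthetic stand-in for CNN feature vectors (hypothetical data).
rng = np.random.default_rng(42)
features = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 8)),
    rng.normal(6.0, 1.0, size=(100, 8)),
])
centers, U = fuzzy_c_means(features, n_clusters=2)
cv_by_cluster = per_cluster_cv(features, centers, U)
print(cv_by_cluster)
```

In practice the `features` array would be replaced by activations extracted from the trained network. Whether the 5% threshold from the abstract applies to a CV defined this way depends on the paper's actual definition.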

Original language: English
Title of host publication: High Performance Computing - 6th Latin American Conference, CARLA 2019, Revised Selected Papers
Editors: Juan Luis Crespo-Mariño, Esteban Meneses-Rojas
Publisher: Springer
Pages: 307-319
Number of pages: 13
ISBN (print): 9783030410049
Status: Published - 2020
Event: 6th Latin American High Performance Computing Conference, CARLA 2019 - Turrialba, Costa Rica
Duration: 25 Sep 2019 to 27 Sep 2019

Publication series

Name: Communications in Computer and Information Science
Volume: 1087 CCIS
ISSN (print): 1865-0929
ISSN (electronic): 1865-0937

Conference

Conference: 6th Latin American High Performance Computing Conference, CARLA 2019
Country/Territory: Costa Rica
City: Turrialba
Period: 25/09/19 to 27/09/19
