TY - GEN
T1 - Using Cluster Analysis to Assess the Impact of Dataset Heterogeneity on Deep Convolutional Network Accuracy
T2 - 6th Latin American High Performance Computing Conference, CARLA 2019
AU - Mendez, Mauro
AU - Calderon, Saul
AU - Tyrrell, Pascal N.
N1 - Publisher Copyright:
© 2020, Springer Nature Switzerland AG.
PY - 2020
Y1 - 2020
N2 - In this paper we performed cluster analysis using Fuzzy K-means over the image-based features of two models, to assess how dataset heterogeneity impacts model accuracy. A highly heterogeneous dataset is linked with sparse data samples, which usually impacts the overall model generalization and accuracy on test samples. We propose to measure the Coefficient of Variation (CV) in the resulting clusters, to estimate data heterogeneity as a metric for predicting model generalization and test accuracy. We show that highly heterogeneous datasets are common when the number of samples is insufficient, thus yielding a high CV. In our experiments with two different models and datasets, higher CV values decreased model test accuracy considerably. We tested ResNet 18 on binary classification of teeth x-ray scans, and VGG16 on age regression from hand x-ray scans. The results suggest that cluster analysis can be used to identify the influence of heterogeneity on CNN model testing accuracy. According to our experiments, we consider that a CV <5% is recommended to yield a satisfactory model test accuracy.
KW - Cluster analysis
KW - Convolutional Neural Network
KW - Heterogeneity
KW - Small dataset
KW - Transfer learning
UR - http://www.scopus.com/inward/record.url?scp=85081181370&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-41005-6_21
DO - 10.1007/978-3-030-41005-6_21
M3 - Conference contribution
AN - SCOPUS:85081181370
SN - 9783030410049
T3 - Communications in Computer and Information Science
SP - 307
EP - 319
BT - High Performance Computing - 6th Latin American Conference, CARLA 2019, Revised Selected Papers
A2 - Crespo-Mariño, Juan Luis
A2 - Meneses-Rojas, Esteban
PB - Springer
Y2 - 25 September 2019 through 27 September 2019
ER -