Using Cluster Analysis to Assess the Impact of Dataset Heterogeneity on Deep Convolutional Network Accuracy: A First Glance

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

8 Scopus citations

Abstract

In this paper, we perform cluster analysis with Fuzzy K-means over the image-based features of two models to assess how dataset heterogeneity affects model accuracy. A highly heterogeneous dataset is associated with sparse data samples, which usually degrades overall model generalization and accuracy on test samples. We propose measuring the Coefficient of Variation (CV) in the resulting clusters to estimate data heterogeneity, as a metric for predicting model generalization and test accuracy. We show that highly heterogeneous datasets are common when the number of samples is insufficient, which yields a high CV. In our experiments with two different models and datasets, higher CV values decreased model test accuracy considerably. We tested ResNet 18 on binary classification of teeth x-ray scans and VGG16 on age regression from hand x-ray scans. The results suggest that cluster analysis can be used to identify the influence of heterogeneity on CNN test accuracy. Based on our experiments, we recommend a CV < 5% to obtain satisfactory model test accuracy.
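
The abstract does not spell out how the CV is computed from the clusters, so the following is only a minimal sketch under stated assumptions: it uses the usual definition CV = 100 × standard deviation / mean, applies it to the distances between each sample and its cluster centroid, and assigns each sample to the cluster with the highest fuzzy membership. The function names, the fuzzifier `m = 2`, and the synthetic feature matrix are all hypothetical and are not taken from the paper.

```python
import numpy as np


def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Plain fuzzy c-means (fuzzy K-means). Returns (centroids, memberships)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row is normalised to sum to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Centroids are membership-weighted means of the samples.
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]
        # Distance from every sample to every centroid, floored to avoid 0-division.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)
        # Standard membership update: u_ik proportional to d_ik^(-2/(m-1)).
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centroids, U


def coefficient_of_variation(values):
    """CV in percent: 100 * population standard deviation / mean."""
    values = np.asarray(values, dtype=float)
    return 100.0 * values.std() / values.mean()


def cluster_cv(X, centroids, U):
    """Assumed heterogeneity measure: CV of sample-to-centroid distances per cluster."""
    labels = U.argmax(axis=1)  # harden the fuzzy memberships
    cvs = {}
    for k in range(centroids.shape[0]):
        members = X[labels == k]
        if len(members) < 2:
            continue  # CV is not informative for fewer than two samples
        dist = np.linalg.norm(members - centroids[k], axis=1)
        cvs[k] = coefficient_of_variation(dist)
    return cvs


# Synthetic stand-in for CNN feature vectors (two blobs in 8 dimensions).
rng = np.random.default_rng(42)
features = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 8)),
    rng.normal(6.0, 1.0, size=(100, 8)),
])
centroids, memberships = fuzzy_c_means(features, n_clusters=2)
per_cluster_cv = cluster_cv(features, centroids, memberships)
```

In practice the rows of `features` would be activations taken from a late layer of the trained network (for example, the penultimate layer of ResNet 18 or VGG16), and the resulting values could then be compared against the 5% guideline stated above. Whether that threshold applies to this particular distance-based CV is an open assumption of the sketch.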

Original language: English
Title of host publication: High Performance Computing - 6th Latin American Conference, CARLA 2019, Revised Selected Papers
Editors: Juan Luis Crespo-Mariño, Esteban Meneses-Rojas
Publisher: Springer
Pages: 307-319
Number of pages: 13
ISBN (Print): 9783030410049
State: Published - 2020
Event: 6th Latin American High Performance Computing Conference, CARLA 2019 - Turrialba, Costa Rica
Duration: 25 Sep 2019 → 27 Sep 2019

Publication series

Name: Communications in Computer and Information Science
Volume: 1087 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 6th Latin American High Performance Computing Conference, CARLA 2019
Country/Territory: Costa Rica
City: Turrialba
Period: 25/09/19 → 27/09/19

Keywords

  • Cluster analysis
  • Convolutional Neural Network
  • Heterogeneity
  • Small dataset
  • Transfer learning
