A novel centroid initialization in missing value imputation towards mixed datasets

Titin Siswantining, Taufik Anwar, Devvi Sarwinda, Herley Shaori Al-Ash

Abstract


Many databases, especially medical ones, contain missing values. Statistical and data mining approaches often require complete data and do not perform adequately when the data contain missing values. Several techniques have been developed to handle missing values, one of which is deleting the records that contain them. However, this approach discards a lot of information when many values are missing. This study uses an imputation approach (filling in the missing attributes) based on clustering. One of the most common clustering approaches is K-Means Clustering. In K-Means clustering, each centroid is computed from the observations closest to it. In this study, we propose updating the centroid value based on the harmonic average of the distances across all observations per centroid. This method is known as K-Harmonic Means Clustering (KHM). We propose a new program approach for mixed datasets under three scenarios with 10%, 20%, and 30% missing values. In experiments on data sets containing missing values, a small proportion of missing values (10%) combined with a small number of clusters K gives a smaller RMSE than the other scenarios.
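The centroid update and the evaluation metric mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard K-Harmonic Means update from the clustering literature, applied to numeric features only; it is not the authors' program for mixed datasets. The function names `khm_step` and `rmse` and the default exponent `p=3.0` are hypothetical choices made for illustration.

```python
import numpy as np

def khm_step(X, C, p=3.0, eps=1e-8):
    """One K-Harmonic Means centroid update (numeric features only).

    X: (n_points, n_features) data, C: (k, n_features) current centroids.
    p is the distance exponent; its default here is an assumption.
    """
    # Pairwise Euclidean distances, shape (n_points, k)
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    d = np.maximum(d, eps)  # guard against division by zero
    # Weight of point i for centroid j:
    #   d_ij^-(p+2) / (sum_l d_il^-p)^2
    # Every point contributes to every centroid, unlike hard K-Means.
    w = d ** (-(p + 2)) / np.sum(d ** (-p), axis=1, keepdims=True) ** 2
    # New centroids are weighted means of all observations
    return (w.T @ X) / w.sum(axis=0)[:, None]

def rmse(true_values, imputed_values):
    """Root mean squared error between true and imputed values."""
    diff = np.asarray(true_values, float) - np.asarray(imputed_values, float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy data: two well-separated groups of four points each
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
C = np.array([[2.0, 2.0], [8.0, 8.0]])  # deliberately rough start
for _ in range(50):
    C = khm_step(X, C)
```

In an imputation setting, a missing numeric attribute of an observation could then be filled from the corresponding coordinate of its nearest centroid, and `rmse` would compare the filled values against the held-out true values.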



Published: 2021-02-12

How to Cite this Article:

Titin Siswantining, Taufik Anwar, Devvi Sarwinda, Herley Shaori Al-Ash, A novel centroid initialization in missing value imputation towards mixed datasets, Commun. Math. Biol. Neurosci., 2021 (2021), Article ID 11

Copyright © 2021 Titin Siswantining, Taufik Anwar, Devvi Sarwinda, Herley Shaori Al-Ash. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Commun. Math. Biol. Neurosci.

ISSN 2052-2541
