Principal component analysis implementation on machine learning in diabetes classification

Michael Tantowen, Krisna Putra, Mahmud Isnan, Bens Pardamean

Abstract


Diabetes mellitus, a global health burden linked to increased cancer risk, can be identified through variables such as BMI, age, blood sugar, and HbA1c. This study explored several machine learning techniques for diabetes prediction, emphasizing the role of dimensionality reduction and feature selection in enhancing model accuracy. Our aim was to compare the performance of multiple machine learning algorithms on the original data against their performance on the same data after a sampling method or principal component analysis (PCA) was applied. The study used Kaggle's "Diabetes Prediction Dataset", which has 100,000 entries, eight features, and one target variable indicating diabetes. For the experiment, three datasets were prepared: 1) the whole dataset, 2) a dataset containing males only, and 3) a dataset containing females only. Each dataset was used to train multiple machine learning models: K-Nearest Neighbor (KNN), Decision Tree (DT), Support Vector Machine (SVM), XGBoost (XGB), and Random Forest (RF). The findings revealed that XGB outperformed the other models on the imbalanced dataset, with an F1-score of 80.87. In gender-based diabetes classification, RF performed best for males, with an F1-score of 80.34, while XGB performed best for females, with an F1-score of 81.9.
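The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn, replaces the Kaggle data with a synthetic imbalanced dataset of eight features, and picks a hypothetical number of principal components (`n_components=5`), since the abstract does not state one. XGBoost is left out to keep the example to a single library, and the helper names `make_models` and `evaluate` are invented for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in for the real dataset
# (8 features, roughly 9:1 class imbalance).
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=5, n_redundant=2,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)


def make_models():
    """Return fresh, unfitted classifiers keyed by the abstract's abbreviations."""
    return {
        "KNN": KNeighborsClassifier(),
        "DT": DecisionTreeClassifier(random_state=0),
        "SVM": SVC(),
        "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    }


def evaluate(use_pca, n_components=5):
    """Fit each model with or without a PCA step and return test-set F1-scores."""
    scores = {}
    for name, model in make_models().items():
        steps = [StandardScaler()]  # PCA is scale-sensitive, so standardize first
        if use_pca:
            # Assumption: n_components is hypothetical, not taken from the paper.
            steps.append(PCA(n_components=n_components))
        steps.append(model)
        pipeline = make_pipeline(*steps)
        pipeline.fit(X_train, y_train)
        scores[name] = f1_score(y_test, pipeline.predict(X_test))
    return scores


baseline = evaluate(use_pca=False)
with_pca = evaluate(use_pca=True)
for name in baseline:
    print(f"{name}: F1 without PCA = {baseline[name]:.3f}, "
          f"with PCA = {with_pca[name]:.3f}")
```

Wrapping the scaler, the optional PCA step, and the classifier in one pipeline means the scaling and projection are fitted on the training split only, so no information from the test split leaks into the transformation. The gender-specific experiments would repeat the same procedure on the male-only and female-only subsets.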


Published: 2024-04-08

How to Cite this Article:

Michael Tantowen, Krisna Putra, Mahmud Isnan, Bens Pardamean, Principal component analysis implementation on machine learning in diabetes classification, Commun. Math. Biol. Neurosci., 2024 (2024), Article ID 45

Copyright © 2024 Michael Tantowen, Krisna Putra, Mahmud Isnan, Bens Pardamean. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Commun. Math. Biol. Neurosci.

ISSN 2052-2541

Editorial Office: office@scik.org

 
