Water potability classification using machine learning: a case study on handling incomplete data

Hayyun Lisdiana, Karli Eka Setiawan

Abstract


Water is essential to life on Earth, particularly as drinking water. Despite the abundance of water in the Earth's ecosystem, contaminated water remains a significant global problem, one that extends beyond natural contamination to include industrial wastewater. In this study, we investigated the potential of decision tree-based machine learning models, including decision trees, ensemble boosting, ensemble bagging, and random forests, to predict water potability from specific parameters. We used publicly available data from the "Water Quality and Potability" Kaggle dataset. Because some parameters in the dataset have a high number of missing values, we first converted the continuous values into discrete (categorical) values and then filled the gaps with a general label, "unknown," instead of imputing mean or median values as other studies had done. Initially, random forest achieved the highest accuracy; our analysis showed that the sulfate parameter confused the machine learning models because of its many missing values. We therefore excluded the sulfate data from the dataset, after which all decision tree-based machine learning models achieved 100% accuracy, precision, recall, and F1-score when evaluated on the test dataset.
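The missing-value strategy described in the abstract (discretize a continuous parameter, then label the gaps "unknown") can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the binning scheme, so three equal-width bins with the labels "low", "medium", and "high" are assumed here, and the helper name `discretize_with_unknown` and the sample sulfate values are hypothetical rather than taken from the dataset.

```python
import numpy as np
import pandas as pd


def discretize_with_unknown(series, n_bins=3, labels=("low", "medium", "high")):
    """Bin a continuous column and label missing entries as "unknown".

    Assumption: equal-width bins; the study's actual binning is not
    stated in the abstract.
    """
    # pd.cut leaves NaN inputs as NaN in the categorical result.
    binned = pd.cut(series, bins=n_bins, labels=list(labels))
    # "unknown" must be registered as a category before it can be filled in.
    return binned.cat.add_categories("unknown").fillna("unknown")


# Hypothetical sulfate readings (mg/L) with two missing values.
sulfate = pd.Series([330.0, np.nan, 250.0, 410.0, np.nan], name="Sulfate")
result = discretize_with_unknown(sulfate)
print(result.tolist())
```

Treating "unknown" as its own category lets a tree-based model split on the absence of a measurement, whereas mean or median imputation would make missing rows indistinguishable from typical ones.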



Published: 2026-03-26

How to Cite this Article:

Hayyun Lisdiana, Karli Eka Setiawan, Water potability classification using machine learning: a case study on handling incomplete data, Commun. Math. Biol. Neurosci., 2026 (2026), Article ID 22

Copyright © 2026 Hayyun Lisdiana, Karli Eka Setiawan. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Commun. Math. Biol. Neurosci.

ISSN 2052-2541
