Water potability classification using machine learning: a case study on handling incomplete data
Abstract
Water is essential to life on Earth, particularly as drinking water. Despite the abundance of water in the Earth's ecosystem, contaminated water remains a significant global problem, one that extends beyond natural contamination to include industrial wastewater. In this study, we investigate the potential of decision tree-based machine learning models (decision trees, ensemble boosting, ensemble bagging, and random forests) to predict water potability from specific water quality parameters. We used publicly available data from the "Water Quality and Potability" Kaggle dataset. Because some parameters in the dataset have a large number of missing values, we first converted the continuous values into discrete (categorical) values and then filled the gaps with a general label, "unknown," rather than imputing mean or median values as other studies have done. Initially, random forest achieved the highest accuracy; our analysis showed that the sulfate parameter confused the models because of its many missing values. We therefore excluded sulfate from the dataset, after which all decision tree-based models achieved 100% accuracy, precision, recall, and F1-score on the test dataset.
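The missing-value handling described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the continuous values were binned, so the equal-width binning, the number of bins, the function name `discretize_with_unknown`, and the sample sulfate readings below are all assumptions made for illustration and are not taken from the study.

```python
import math
from typing import List, Optional, Sequence


def discretize_with_unknown(
    values: Sequence[Optional[float]],
    n_bins: int = 3,
    unknown_label: str = "unknown",
) -> List[str]:
    """Map each observed value to an equal-width bin label and
    each missing value (None or NaN) to `unknown_label`."""
    observed = [v for v in values if v is not None and not math.isnan(v)]
    if not observed:
        return [unknown_label] * len(values)

    lo, hi = min(observed), max(observed)
    # Fall back to a width of 1.0 when all observed values are equal.
    width = (hi - lo) / n_bins or 1.0

    labels = []
    for v in values:
        if v is None or math.isnan(v):
            # Missing readings get their own category instead of a mean/median.
            labels.append(unknown_label)
        else:
            # Clamp so the maximum value lands in the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
            labels.append(f"bin_{idx}")
    return labels


# Hypothetical sulfate readings with gaps (illustrative, not from the dataset).
sulfate = [250.0, None, 400.0, float("nan"), 310.0, 330.0]
print(discretize_with_unknown(sulfate))
```

Treating "unknown" as a category of its own lets a tree-based model split on whether a value is missing, which mean or median imputation would hide.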
Commun. Math. Biol. Neurosci.
ISSN 2052-2541
Copyright ©2025 CMBN
Communications in Mathematical Biology and Neuroscience