Classification of essential and non-essential genes in human genome sequence data using ensemble machine learning

Sri Karnila, Favorisen Rosyking Lumbanraja, Akmal Junaidi, Warsono -

Abstract


DNA (deoxyribonucleic acid), RNA (ribonucleic acid), and proteins are the basic biochemical molecules of cellular organization. DNA is the primary store of genetic information, which is encoded in genes as a chemical code; humans are estimated to have 20,000 to 30,000 genes. Essential genes are fundamental to many biological mechanisms: when they are disrupted or removed, the result may be genetic defects, mutations, or, in extreme cases, death of the organism. Identifying these genes experimentally is resource-intensive and often inefficient. Computational methods, especially machine learning, offer a more efficient alternative. This research explores the ability of machine learning techniques to build classification models for human gene sequence data. Two datasets, Cellular Essential Gene (CEG) and Organism Essential Gene (OEG), were analyzed, with each gene labeled as essential or non-essential. The study proceeded in phases: data acquisition, cleaning, feature engineering, splitting of each dataset into subsets for model training and evaluation, and model construction. The algorithms applied were Decision Tree, Support Vector Machine (SVM), and the ensemble methods Random Forest, Extreme Gradient Boosting, and Adaptive Boosting. The best overall results were achieved by the SVM model with 5-mer features, which reached a sensitivity of 0.81 and ROC AUC of 0.81 on the CEG dataset, and a PR AUC of 0.46, sensitivity of 0.69, and ROC AUC of 0.80 on the OEG dataset. The ROC AUC of about 0.8 on both datasets indicates that these models distinguish essential from non-essential genes effectively. These machine learning-based classification models could be valuable tools in healthcare, contributing to a deeper understanding of normal gene function in organisms.
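As a rough illustration of the 5-mer features and the sensitivity metric mentioned in the abstract, the sketch below counts overlapping k-mers in a sequence and computes sensitivity from predicted labels. It is not the authors' pipeline. The DNA alphabet `ACGT`, the normalization to frequencies, the label encoding (1 = essential), and the function names `kmer_frequencies` and `sensitivity` are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=5, alphabet="ACGT"):
    """Return normalized k-mer frequencies in fixed lexicographic order.

    Assumption: sequences are DNA over the alphabet ACGT; windows that
    contain any other character (for example N) are ignored.
    """
    sequence = sequence.upper()
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


def sensitivity(y_true, y_pred, positive=1):
    """Sensitivity (recall): true positives divided by all actual positives.

    Assumption: the positive label (1) marks an essential gene.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical toy sequence; a real gene is far longer.
features = kmer_frequencies("ACGTACGTAC", k=5)
print(len(features))  # 4**5 = 1024 features per gene
```

A feature vector of this kind, one per gene, is what a classifier such as an SVM would take as input. With k = 5 over a four-letter alphabet the vector has 1024 entries regardless of gene length.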

Published: 2026-01-23

How to Cite this Article:

Sri Karnila, Favorisen Rosyking Lumbanraja, Akmal Junaidi, Warsono -, Classification of essential and non-essential genes in human genome sequence data using ensemble machine learning, Commun. Math. Biol. Neurosci., 2026 (2026), Article ID 8

Copyright © 2026 Sri Karnila, Favorisen Rosyking Lumbanraja, Akmal Junaidi, Warsono -. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Commun. Math. Biol. Neurosci.

ISSN 2052-2541
