Optimizing breast cancer classification using SMOTE, Boruta, and XGBoost

(1) * Cicin Hardiyanti P (Informatics Department, Universitas Alma Ata, Indonesia)
*corresponding author

Abstract


Breast cancer remains one of the leading causes of death among women worldwide. This study aims to develop a breast cancer classification framework based on clinical data by integrating the Synthetic Minority Oversampling Technique (SMOTE), the Boruta feature selection algorithm, and the XGBoost classifier. The proposed approach is evaluated on the Wisconsin Breast Cancer Diagnostic (WBCD) dataset, which consists of 569 samples and 30 numerical features. SMOTE addresses class imbalance, Boruta selects the most relevant diagnostic features, and XGBoost serves as the main classifier because of its robustness on tabular and imbalanced data. The model is validated with Repeated Stratified K-Fold Cross-Validation using 30 repetitions to obtain statistically stable estimates. The resulting model achieves an average accuracy of 0.9608 ± 0.0274, precision of 0.9465 ± 0.0481, recall of 0.9512 ± 0.0524, and F1-score of 0.9475 ± 0.0374. The ROC-AUC reaches 0.9926 ± 0.0094, the PR-AUC is 0.9906 ± 0.0113, and the Matthews Correlation Coefficient (MCC) is 0.9179 ± 0.0575, indicating balanced performance across both classes. Clinically, the model can support early diagnosis: Boruta discards irrelevant diagnostic attributes and retains only 10 key features without compromising accuracy, yielding a lightweight yet reliable diagnostic tool. Limitations include the relatively small dataset and the absence of hyperparameter tuning. Future research should explore larger datasets, advanced ensemble methods, and interpretability techniques such as SHAP or LIME to improve clinical transparency and adoption.
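The abstract names SMOTE without describing how it creates samples. As an illustration only, the sketch below shows the core interpolation idea commonly associated with SMOTE: pick a minority-class point, pick one of its k nearest minority neighbours, and place a synthetic point at a random position on the segment between them. It is not the author's implementation. The function name `smote_sketch`, the toy coordinates, and the choices `k=2` and `seed=0` are assumptions made for this example.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Simplified illustration of the SMOTE idea; hypothetical helper, not a library API.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (the base itself excluded)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # relative position on the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy 2-D minority class (made-up values, not taken from the WBCD dataset).
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(minority, n_new=6, k=2)
for point in new_points:
    print(point)
```

Because each synthetic point is a convex combination of two real minority samples, it always lies inside the range spanned by the minority class. In a cross-validated setting, oversampling of this kind is normally applied to the training folds only, so that no synthetic points derived from test samples leak into evaluation.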

Keywords


Breast cancer; SMOTE; Boruta; XGBoost; Imbalanced data handling

   

DOI

https://doi.org/10.31763/sitech.v6i1.2109
      





Copyright (c) 2025 Cicin Hardiyanti P

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

___________________________________________________________
Science in Information Technology Letters
ISSN 2722-4139
Published by Association for Scientific Computing Electrical and Engineering (ASCEE)
W : http://pubs2.ascee.org/index.php/sitech
E : sitech@ascee.org, andri@ascee.org, andri.pranolo.id@ieee.org
