Analysis to Predict Diabetes Using Data Mining
Abstract
Data mining extracts patterns and useful insights from large datasets, drawing on artificial intelligence and advanced data analysis techniques across many domains. Diabetes, a metabolic disorder characterized by elevated blood glucose levels, poses significant health risks, including cardiovascular and renal complications if left untreated. Data mining supports the study and prediction of diabetes by identifying high-risk populations, which enables early intervention through lifestyle modification and timely initiation of treatment.
By analyzing comprehensive datasets that cover diabetes-related factors such as weight, blood pressure, blood glucose levels, and genetic predisposition, data mining constructs predictive models that assess risk and support targeted interventions. In a study of 768 cases (268 positive and 500 negative), five classifiers were compared: Logistic Regression achieved 70% accuracy, with a recall of 57% and an F1 score of 0.63; Naive Bayes (GaussianNB) achieved 68% accuracy, with a recall of 54% and an F1 score of 0.61; Decision Tree Classifier achieved 66% accuracy, with a recall of 62% and an F1 score of 0.64; Random Forest achieved 70% accuracy, with a recall of 59% and an F1 score of 0.64; and XGBClassifier achieved 66% accuracy, with a recall of 58% and an F1 score of 0.62.
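As context for the figures above, the sketch below shows how accuracy, precision, recall, and F1 follow from binary confusion-matrix counts, and what accuracy a classifier would reach by always predicting the larger class. The class counts (268 positive, 500 negative) come from the study; the function names and the confusion counts in `example` are hypothetical and chosen only for illustration. The study's evaluation split is not described here, so the baseline on its actual test set may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Metrics:
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Metrics(accuracy, precision, recall, f1)


def majority_baseline_accuracy(n_positive: int, n_negative: int) -> float:
    """Accuracy obtained by always predicting the larger class."""
    return max(n_positive, n_negative) / (n_positive + n_negative)


# Class counts taken from the study: 268 positive, 500 negative.
baseline = majority_baseline_accuracy(268, 500)

# Hypothetical confusion counts for a held-out split (illustration only,
# not results from the study).
example = classification_metrics(tp=30, fp=10, fn=20, tn=40)

print(f"majority-class baseline accuracy: {baseline:.3f}")
print(example)
```

On the full dataset the majority-class baseline is 500/768, roughly 65%, which is why recall and F1 are reported alongside accuracy: accuracy alone says little about how well the positive class is detected.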
The analysis underscores a trade-off between precision and recall, particularly when classifying high-risk diabetes cases. Favoring precision reduces false positives but may lower recall, so that some true positive cases are missed. Conversely, favoring recall may increase false positives. Balancing these metrics is critical for effective diabetes prediction and tailored healthcare strategies. Overall, the study highlights the role of data mining in diabetes research, particularly its impact on predictive modeling and healthcare decision making.
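The precision-recall trade-off described above can be illustrated by moving the decision threshold applied to a classifier's predicted probabilities. The following is a minimal sketch, not the study's procedure: the scores, labels, and the helper `precision_recall_at` are hypothetical and exist only to show the effect.

```python
from typing import Sequence, Tuple


def precision_recall_at(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when scores >= threshold are called positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical predicted probabilities and true labels (1 = diabetic).
scores = [0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

# Sweep from a permissive threshold to a strict one.
results = {t: precision_recall_at(scores, labels, t) for t in (0.25, 0.5, 0.8)}

for t, (p, r) in results.items():
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, raising the threshold from 0.25 to 0.8 lifts precision from about 0.57 to 1.0 while recall falls from 1.0 to 0.5. In a screening setting, where a missed case is usually the costlier error, a lower threshold may be preferable.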