Conference paper, Open Access

DIAGNOSIS OF DIABETES DISEASE USING MACHINE LEARNING METHODS IN AN IMBALANCED DIABETES DATASET

   İsmail Buğra Bölükbaşı; Betül Yağmahan

The number of people with diabetes has been rising steadily in recent years. Diabetes is a serious
disease that can cause lasting damage to the body, and even death, if precautions are not taken. Early
and accurate detection is therefore becoming increasingly important in medicine, and the number of
studies in the literature that use machine learning methods to diagnose diabetes has grown
significantly.
In this study, type 2 diabetes was classified using different data preprocessing and machine learning
methods on real-world data obtained from a public hospital in Turkey. Four classification models were
used: logistic regression, Naive Bayes, C4.5, and Random Forest. The input variables were the patient's
age and gender together with complete blood count, biochemistry, and hormone test results; the output
variable was the diagnosis made by specialist doctors. In total, 43 variables were studied. Examination
of the dataset revealed an imbalance between the classes of the target variable. When classes are
imbalanced, classification models can assign instances to the wrong class. To address the class
imbalance in the dataset, three resampling methods were used: random undersampling (RUS), random
oversampling (ROS), and the synthetic minority oversampling technique (SMOTE).
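The three resampling ideas can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the toy data, the neighbour count `k`, and the balancing target (the smallest class size for RUS, the largest for ROS and SMOTE) are all assumptions made for illustration, and the SMOTE variant is simplified to numeric features only.

```python
import random
from collections import Counter


def group_by_class(X, y):
    """Map each class label to the list of feature rows carrying it."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return groups


def random_undersample(X, y, seed=0):
    """RUS: keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    groups = group_by_class(X, y)
    target = min(len(rows) for rows in groups.values())
    pairs = [(row, label) for label, rows in groups.items()
             for row in rng.sample(rows, target)]
    return [p[0] for p in pairs], [p[1] for p in pairs]


def random_oversample(X, y, seed=0):
    """ROS: duplicate random rows of each class up to the largest class size."""
    rng = random.Random(seed)
    groups = group_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = list(X), list(y)
    for label, rows in groups.items():
        for _ in range(target - len(rows)):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def smote(X, y, k=3, seed=0):
    """Simplified SMOTE: add synthetic rows on the segment between a row and
    one of its k nearest same-class neighbours.
    Assumes numeric features and at least two rows in every class."""
    rng = random.Random(seed)
    groups = group_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = list(X), list(y)
    for label, rows in groups.items():
        for _ in range(target - len(rows)):
            base = rng.choice(rows)
            others = [r for r in rows if r is not base]
            # Sort same-class rows by squared Euclidean distance to the base row.
            others.sort(key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)))
            neighbour = rng.choice(others[:k])
            gap = rng.random()  # interpolation factor in [0, 1)
            X_out.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: 7 majority rows (class 0) and 3 minority rows (class 1).
X = [[float(i), float(i % 3)] for i in range(10)]
y = [0] * 7 + [1] * 3

for name, resample in [("RUS", random_undersample),
                       ("ROS", random_oversample),
                       ("SMOTE", smote)]:
    _, y_res = resample(X, y)
    print(name, dict(Counter(y_res)))
```

On this toy data RUS leaves three rows per class, while ROS and SMOTE leave seven per class. A dataset with categorical inputs such as gender would need a SMOTE variant that handles non-numeric features, which this sketch does not attempt.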
The performance of the four machine learning methods was compared on each of four training datasets:
the original, the randomly undersampled, the randomly oversampled, and the SMOTE-resampled training
dataset. In total, 16 scenarios were studied.
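The count of 16 follows from crossing four models with four training datasets. A small sketch of that grid is given here; the variable names and label strings are illustrative assumptions, not identifiers from the study.

```python
from itertools import product

# Labels taken from the abstract; the variable names are hypothetical.
MODELS = ["Logistic Regression", "Naive Bayes", "C4.5", "Random Forest"]
TRAINING_SETS = ["original", "RUS", "ROS", "SMOTE"]

# Every model is paired with every version of the training dataset.
scenarios = [
    {"model": model, "training_set": training_set}
    for model, training_set in product(MODELS, TRAINING_SETS)
]

for number, scenario in enumerate(scenarios, start=1):
    print(number, scenario["model"], "+", scenario["training_set"])
```

The abstract describes all four variants as training datasets, so the resampling applies to training data in each scenario.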
Analysis of all scenarios identified the four best-performing combinations: Naive Bayes with the
original training dataset, Random Forest with the randomly undersampled training dataset, Random
Forest with the SMOTE-resampled training dataset, and C4.5 with the randomly oversampled training
dataset. Of these four, Random Forest with the randomly undersampled training dataset ranked first.
