Machine Learning–Based Risk Factor Analysis and Prediction Model Construction for the Occurrence of Chronic Heart Failure: Health Ecologic Study

Introduction

Heart failure is a complex clinical syndrome in which ventricular filling or ejection capacity is compromised by any structural or functional abnormality of the heart []. Chronic heart failure (CHF) is a severe manifestation or advanced stage of various cardiovascular diseases. It features high mortality and rehospitalization rates, thus constituting the ultimate battlefield for cardiovascular disease prevention and control []. According to the World Health Organization, the global prevalence of heart failure in adults is 1% to 2%, with 25,000-30,000 new cases expected each year []. The multiple physical and psychological symptoms endured by patients with CHF not only intensify the burden on patients and caregivers but also diminish their quality of life [,]. Therefore, it is increasingly important to identify risk factors both in people who have not developed CHF and in patients with CHF, to prevent the occurrence of CHF in a timely manner, and to strengthen the health management of patients with CHF.

Preventive management in current guidelines and practice is mostly based on comprehensive cardiovascular interventions []. However, within the framework of the goals of “health for all” and “personalized medicine,” early and precise risk stratification takes precedence in cardiovascular prevention and control []. Risk factors for heart failure are the primary concern in suspected diagnoses, and the control of risk factors is the top priority in the primary prevention of patients with heart failure. Diagnosis, treatment, and management are based on this. Risk assessment, as the initiating link of precise health management of CHF, particularly needs to recognize the significance of risk factor identification in the face of the disease characteristics of CHF, such as complex etiology, severe condition, rapid progression, and poor prognosis. With the in-depth research of precision medicine and the change in residents’ living behaviors, the update of important risk factors and cardiovascular markers is a crucial element in the development of risk assessment programs and the improvement of assessment tools.

The accuracy and comprehensiveness of identifying the risk factors of CHF, as well as the advancement and scientific nature of the modeling techniques, are the keys to guaranteeing the smooth implementation of risk assessment. As for the risk factors, they mainly include personal factors (such as gender, age, weight, etc) [] and disease-related factors (such as ejection fraction, biomarkers, myocardial imaging, cardiac ultrasound parameters, etc). For instance, Salvioni et al [] developed a method for integrating Metabolic Exercise Test data combined with Cardiac and Kidney Indexes (MECKI) scores from 2715 patients. It was demonstrated that the MECKI score is a highly efficient tool for facilitating risk stratification and therapeutic decision-making for patients with heart failure. Klimczak-Tomaniak et al [] conducted repeated measurements of 92 biomarkers that optimally predict adverse clinical events in heart failure and can be used for dynamic risk assessment in clinical practice. However, these risk factor identifications focus only on individual and disease treatment–related factors, ignoring broader social factors that affect health. In terms of modeling techniques, traditional modeling methods such as Cox and logistic regressions cannot well deal with the complex relationships that may exist between variables, and researchers have begun to introduce digital technology into model construction to obtain a large amount of data information in the database after multidimensional interactions. Angraal et al [] used a random forest (RF) model to predict mortality and hospitalization rates for heart failure in outpatients with preserved ejection fraction. Wang et al [] used a machine learning model to accurately predict the risk of heart failure in older patients with prediabetes or diabetes using data from the National Health and Nutrition Examination Survey. 
Although risk assessment models based on big data and artificial intelligence exist, the mainstay of current risk assessment approaches remains the risk factor itself. Moreover, most of the models are predicated on the patient’s medical record information from the initial diagnosis and hospitalization, intended for medical and nursing professional evaluations, as well as checks of objective data such as those from laboratories, and a dynamic prediction model that changes with the condition has not been established.

In summary, clinical risk factors for CHF are incompletely covered, and existing risk prediction models have low accuracy. An accurate and convenient heart failure risk prediction model is therefore an important tool for identifying risk factors precisely. Making full use of big data resources to refine the risk factors and complete the digital modeling is an important guarantee of the effectiveness of the final risk assessment.

Methods

Data Source

This study applied to the Jackson Heart Study (JHS) database through the National Heart Lung and Blood Institute. The database contains data on 3883 individuals. The participants were African American adults aged 35-84 years. Data were collected at baseline (V1: 2000-2004), first follow-up (V2: 2005-2008), and second follow-up (V3: 2009-2013). There were 489 patients with CHF at baseline and 3394 without CHF. The first follow-up V2 (2005-2008) lost 777 cases, leaving 3106 cases; the second follow-up V3 (2009-2013) lost 462 cases, leaving 2644 cases.

Study Population

Participants who did not have heart failure at baseline (V1) of the JHS database were selected for this study (Textbox 1).

Textbox 1. Inclusion and exclusion criteria.

Inclusion criteria

Those who did not have chronic heart failure at baseline

Those who participated in the first and second follow-up visits, or those who had electronic medical records

Exclusion criteria

Patients who met the Framingham heart failure diagnostic criteria []

Study Variables

The inclusion of study variables was carried out at 5 levels of health ecology. A total of 53 variables were finally included. Among them, the individual trait level, including general information, biological indicators, disease history, family history, and symptoms, had a total of 28 variables; the individual behavioral trait level, including diet, exercise, sleep, and psychology, had a total of 12 variables; the interpersonal relationship level, including family relationship, social relationship, neighborhood relationship, had a total of 4 variables; the work and life level, including working conditions, living conditions, access to health care, had a total of 8 variables; and the macropolicy level had only 1 variable, being health insurance policy. All the variables can be seen in Table S1 in .

Statistical Analyses

The study encompassed several crucial steps. Data preprocessing involved 2 operations. First, the data were checked for missing values, and categorical data were assigned codes where necessary. Second, variables with more than 30% missing values were removed, and variables with less than 30% missing values were imputed using Monte Carlo multiple imputation.
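The 30% missingness rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `drop_sparse_variables`, the toy data, and the choice to keep a variable with exactly 30% missing values are all assumptions, and the imputation step is not shown.

```python
from typing import Dict, List, Optional


def drop_sparse_variables(
    columns: Dict[str, List[Optional[float]]],
    max_missing: float = 0.30,
) -> Dict[str, List[Optional[float]]]:
    """Keep only variables whose share of missing values is at most max_missing."""
    kept = {}
    for name, values in columns.items():
        missing_share = sum(v is None for v in values) / len(values)
        if missing_share <= max_missing:  # assumption: exactly 30% is kept
            kept[name] = values
    return kept


# Hypothetical toy data; None marks a missing value.
toy = {
    "age": [54, 67, None, 61, 58],           # 20% missing, kept
    "loneliness": [None, None, 2, None, 1],  # 60% missing, dropped
}
filtered = drop_sparse_variables(toy)
print(sorted(filtered))
```

Variables that pass this filter would then go on to the imputation step.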

In the data analysis phase, SPSS 22.0 (IBM Corp) was used for statistical analysis. Measurement data that followed a normal distribution were compared between groups using the independent samples t test (2-tailed), which compares the means of two independent groups. Measurement data that did not follow a normal distribution were expressed as median (IQR) and compared using the Mann-Whitney U rank sum test, a nonparametric test for comparing two independent groups. Count data were described by frequency and percentage and compared using the chi-square test, which determines whether there is a significant association between two categorical variables.
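As a worked example of the chi-square test on count data, the sketch below computes the Pearson statistic for a 2x2 table using the alcohol counts reported in Table 1. The function name is hypothetical, and the absence of a continuity correction is an assumption about how the reported statistic was obtained.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Alcohol counts from Table 1.
# Rows: alcohol yes / no. Columns: without / with heart failure.
chi2 = chi_square_2x2(1114, 76, 1183, 167)
print(round(chi2, 1))
```

With 1 degree of freedom, the result is close to the value of 26.2 reported for alcohol in Table 1.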

Next, in the delineation of the dataset step, this study used randomized splitting to divide the dataset into a training set (70%), a test set (15%), and a validation set (15%).
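A minimal sketch of such a random 70/15/15 split is shown below. The function name, the fixed seed, and the rounding rule are assumptions; the source does not state how the split was implemented.

```python
import random
from typing import List, Sequence, Tuple


def split_70_15_15(
    ids: Sequence[int], seed: int = 0
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle participant ids and cut them into training, test, and validation sets."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    n_train = round(0.70 * len(shuffled))
    n_test = round(0.15 * len(shuffled))
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever remains
    return train, test, validation


train, test, validation = split_70_15_15(range(2553))
print(len(train), len(test), len(validation))
```

For 2553 participants, this rounding gives set sizes of 1787, 383, and 383.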

Before feature selection, the data were standardized. Feature selection was then carried out in PyCharm (JetBrains Corp) using two methods: principal component analysis (PCA) and RF. PCA is a statistical procedure that uses an orthogonal transformation to convert a set of correlated variables into a set of uncorrelated variables known as principal components []. RF is an ensemble learning method for classification, regression, and other tasks that constructs a multitude of decision trees during training and outputs the class that is the mode of the classes of the individual trees []. The two feature selection methods were compared, and feature importance was ranked and visualized.
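The source does not say which standardization was used. A common choice is the z-score, sketched below under that assumption; the function name and the example ages are illustrative only.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def standardize(values: Sequence[float]) -> List[float]:
    """Rescale a variable to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # constant variable: no spread to rescale
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


ages = [45, 54, 63]  # illustrative values only
z = standardize(ages)
print([round(v, 3) for v in z])
```

Standardizing first matters for PCA in particular, because otherwise variables measured on larger scales dominate the principal components.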

To handle class imbalance, this study used 5 methods: oversampling, undersampling, Adaptive Synthetic Sampling (ADASYN), Synthetic Minority Over-sampling Technique (SMOTE), and SMOTE combined with Edited Nearest Neighbors (SMOTE-ENN). Models constructed using the original dataset were compared with models constructed from datasets processed by each of these 5 methods.
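Of the 5 methods, random oversampling is the simplest to illustrate. The sketch below duplicates minority-class rows at random until the classes are the same size. It is not SMOTE, ADASYN, or SMOTE-ENN, which create or remove samples using nearest neighbors, and the function name and toy data are assumptions.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def random_oversample(
    rows: Sequence[list], labels: Sequence[int], seed: int = 0
) -> Tuple[List[list], List[int]]:
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data: 8 participants without the outcome, 2 with it.
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Resampling of this kind is normally applied to the training set only, so that the test and validation sets keep the original class distribution.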

Finally, in the model construction stage, 7 models were constructed in this study. They were decision tree, RF, support vector machine, extreme gradient boosting, adaptive boosting (AdaBoost), naive Bayes model, multilayer perceptron, and bootstrap forest. The decision tree was a flowchart-like structure where each internal node represented a test on an attribute, each branch represented the outcome of a test, and each leaf node represented a class or a value []. The RF was an ensemble learning method as previously mentioned []. The support vector machine was a supervised learning algorithm that analyzed data for classification and regression analysis []. Extreme gradient boosting was an efficient implementation of gradient boosting for large datasets []. AdaBoost was an iterative algorithm that combined multiple weak classifiers to form a strong classifier []. The naive Bayes model was a probabilistic classifier based on applying Bayes’ theorem with strong independence assumptions between the features []. The multilayer perceptron was a type of artificial neural network with multiple layers of neurons []. The bootstrap forest was an ensemble method that combined multiple bootstrap samples to build a forest of trees. The accuracy, precision, sensitivity, F1-score, and area under the curve (AUC) of each model were compared, and receiver operating characteristic (ROC) curves of each model were drawn. After selecting the optimal model, a 10-fold cross-validation was performed using the training and validation set. In order to construct a better model, we used hyperparameter optimization to find the best combination of parameters that makes the model perform best on the training set and achieve better results on the test set.
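The comparison metrics named above can be computed from predicted labels and scores as sketched below. This is a minimal illustration with hypothetical function names and toy data. AUC is computed here through its rank interpretation, the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, which may differ from the implementation the authors used.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, sensitivity (recall), and F1-score for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
    }


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison; ties count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy predictions, thresholded at 0.5.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.1, 0.1]
y_pred = [int(s >= 0.5) for s in scores]
metrics = classification_metrics(y_true, y_pred)
print(metrics, round(auc(y_true, scores), 3))
```

On imbalanced data, accuracy can be high for a model that rarely predicts the minority class, which is one reason to report sensitivity, F1-score, and AUC alongside it.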

Ethical Considerations

The study was approved by the Ethics Committee of Zhongda Hospital, Southeast University (2022ZDSYLL401-Y01).

Results

Patient Characteristics

Among the 3883 individuals in the JHS, 489 had CHF at baseline and 3394 did not. After excluding participants who were lost to follow-up and had no electronic medical records, 2553 participants without CHF at baseline remained. The mean (SD) age of this population was 57.84 (13.45) years, with 1590 female participants (62.3%) and 963 male participants (37.7%). The screening process is shown in Figure 1. The dataset (n=2553) was randomly divided into three parts: 70% (n=1787) for model training, 15% (n=383) for model testing, and 15% (n=383) for model validation. More than 30% of the data were missing for the predictors left ventricular diastolic diameter, left ventricular systolic diameter, left ventricular mass, depression, vitamin D2, vitamin D3 derivatives, shortness of breath, walking 100 m with wheezing, and loneliness, so these predictors were not included in the analysis.


Figure 1. Study flowchart. CHF: chronic heart failure; JHS: Jackson Heart Study; V2: first follow-up; V3: second follow-up.

Univariate Analysis of Variables Influencing the Occurrence of CHF

Alcohol, smoking, health insurance, income, occupation, education, minutes of walking or running, nocturnal sleep dyspnea, ever had swelling of feet or ankles, chest ever sound wheezy without cold, left ventricular regional wall motion, cardiac disease, age, BMI, systolic blood pressure, glycosylated hemoglobin, triglycerides, ultrasensitive C-reactive protein, glomerular filtration rate, heart rate, stress, regional poverty population ratio, the status of favorite food stores within 3 km, status of sports facilities within 3 km, and the ratio of early maximal ventricular filling velocity to atrial maximal ventricular filling velocity differed significantly between people with and without heart failure (P<.05). A one-way analysis of the baseline characteristics of the included participants is shown in Tables 1-3.

Table 1. One-way analysis of the baseline characteristics (part 1).

| Variables | People without heart failure (n=2307), n (%) | People with heart failure (n=246), n (%) | Chi-square (df) | P |
| --- | --- | --- | --- | --- |
| Gender | | | 0.3 (1) | .56 |
| Woman | 1441 (56.4) | 149 (5.84) | | |
| Man | 866 (33.92) | 97 (3.8) | | |
| Alcohol | | | 26.2 (1) | <.001 |
| Yes | 1114 (43.63) | 76 (2.98) | | |
| No | 1183 (46.34) | 167 (6.54) | | |
| Smoking | | | 11.1 (1) | <.001 |
| Yes | 680 (26.64) | 98 (3.84) | | |
| No | 1621 (63.49) | 148 (5.8) | | |
| Health insurance | | | 9.5 (1) | .002 |
| Yes | 1972 (77.24) | 192 (7.52) | | |
| No | 335 (13.12) | 54 (2.12) | | |
| Threats or harassment | | | 12.1 (6) | .06 |
| Several times a day | 15 (0.59) | 3 (0.12) | | |
| Almost every day | 15 (0.59) | 3 (0.12) | | |
| At least once a week | 45 (1.76) | 3 (0.12) | | |
| A few times a month | 42 (1.65) | 4 (0.16) | | |
| A few times a year | 102 (4) | 3 (0.12) | | |
| Less than a few times a year | 351 (13.75) | 27 (1.06) | | |
| Never | 1704 (66.75) | 196 (7.68) | | |
| Income | | | 47.5 (3) | <.001 |
| Poor | 198 (7.76) | 42 (1.65) | | |
| Lower-middle | 431 (16.88) | 71 (2.78) | | |
| Upper-middle | 632 (24.76) | 56 (2.19) | | |
| Affluent | 707 (27.69) | 39 (1.53) | | |
| Occupation | | | 31.6 (10) | <.001 |
| Management or professional | 903 (35.37) | 71 (2.78) | | |
| Service | 515 (20.17) | 81 (3.17) | | |
| Sales | 441 (17.27) | 30 (1.18) | | |
| Farming | 2 (0.08) | 0 (0) | | |
| Construction | 119 (4.66) | 19 (0.74) | | |
| Production | 311 (12.18) | 44 (1.72) | | |
| Military | 3 (0.12) | 0 (0) | | |
| Sick | 2 (0.08) | 0 (0) | | |
| Unemployed | 2 (0.08) | 1 (0.04) | | |
| Retired | 3 (0.12) | 0 (0) | | |
| Student | 2 (0.08) | 0 (0) | | |
| Education | | | 100.5 (2) | <.001 |
| Less than high school | 300 (11.75) | 91 (3.56) | | |
| High-school graduate or General Educational Development | 459 (17.98) | 45 (1.76) | | |
| Attended vocational school, trade school, or college | 1543 (60.43) | 110 (4.31) | | |

Table 2. One-way analysis of the baseline characteristics (part 2).

| Variables | People without heart failure (n=2307), n (%) | People with heart failure (n=246), n (%) | Chi-square (df) | P |
| --- | --- | --- | --- | --- |
| Stress living in neighborhood | | | 2.1 (3) | .56 |
| Not stressful | 1708 (66.90) | 188 (7.36) | | |
| Mildly stressful | 348 (13.63) | 31 (1.21) | | |
| Moderately stressful | 148 (5.80) | 12 (0.59) | | |
| Very stressful | 91 (3.56) | 11 (0.43) | | |
| Minutes of walking or running | | | 10.1 (4) | .04 |
| Less than 5 minutes | 1069 (41.87) | 133 (5.21) | | |
| At least 5 but less than 15 minutes | 405 (15.87) | 44 (1.72) | | |
| At least 15 but less than 30 minutes | 331 (12.97) | 22 (0.86) | | |
| At least 30 but less than 45 minutes | 200 (7.83) | 24 (0.94) | | |
| At least 45 minutes | 298 (11.67) | 23 (0.90) | | |
| Place they usually go to for health care | | | 11.4 (8) | .18 |
| Walk-in clinic | 180 (7.05) | 11 (0.43) | | |
| Health Maintenance Organization clinic | 16 (0.63) | 0 (0) | | |
| Hospital clinic | 242 (9.48) | 30 (1.18) | | |
| Neighborhood health center | 115 (4.5) | 18 (0.71) | | |
| Hospital emergency room | 66 (2.59) | 10 (0.39) | | |
| Public health department clinic | 23 (0.9) | 2 (0.08) | | |
| Company or industry clinic | 63 (2.47) | 4 (0.16) | | |
| Doctor’s office | 1334 (52.25) | 155 (6.07) | | |
| Other | 12 (0.47) | 0 (0) | | |
| Difficulty in obtaining health service | | | 6.2 (3) | .10 |
| Very hard | 85 (3.33) | 13 (0.51) | | |
| Fairly hard | 114 (4.47) | 20 (0.78) | | |
| Not too hard | 395 (15.47) | 42 (1.65) | | |
| Not hard at all | 1694 (66.35) | 171 (6.7) | | |
| Satisfied with doctor | | | 1.7 (4) | .79 |
| Very satisfied | 1524 (59.69) | 173 (6.78) | | |
| Somewhat satisfied | 595 (23.31) | 60 (2.35) | | |
| Somewhat dissatisfied | 56 (2.19) | 5 (0.2) | | |
| Very dissatisfied | 21 (0.82) | 2 (0.08) | | |
| Not sure | 38 (1.49) | 3 (0.12) | | |
| Ever awakened by trouble breathing | | | 102.4 (1) | <.001 |
| Yes | 78 (3.06) | 44 (1.72) | | |
| No | 2204 (86.33) | 200 (7.83) | | |
| Rate your sleep quality overall | | | 7.5 (4) | .68 |
| Excellent | 232 (9.09) | 23 (0.9) | | |
| Fair | 514 (20.13) | 52 (2.04) | | |
| Good | 796 (31.18) | 77 (3.02) | | |
| Poor | 190 (7.44) | 31 (1.21) | | |
| Very good | 562 (22.01) | 52 (2.04) | | |
| Ever had swelling of feet or ankles | | | 27.1 (1) | <.001 |
| Yes | 956 (37.45) | 144 (5.64) | | |
| No | 1328 (52.02) | 99 (3.88) | | |
| Chest ever sound wheezy without cold | | | 21.8 (1) | <.001 |
| Yes | 156 (6.11) | 37 (1.45) | | |
| No | 2131 (83.47) | 207 (8.11) | | |
| Marriage | | | 37.5 (4) | <.001 |
| Divorced | 342 (13.4) | 31 (1.21) | | |
| Married | 1356 (53.11) | 118 (4.62) | | |
| Unmarried | 258 (10.11) | 26 (1.02) | | |
| Separated | 92 (3.6) | 11 (0.43) | | |
| Widowed | 249 (9.75) | 59 (2.31) | | |
| LV^a regional wall motion | | | 29.7 (3) | <.001 |
| Abnormal | 7 (2.74) | 4 (0.16) | | |
| Border | 7 (2.74) | 6 (0.24) | | |
| Normal | 2179 (85.35) | 222 (8.7) | | |
| Can’t assess | 13 (0.51) | 1 (0.04) | | |
| Cardiac disease | | | 66.6 (1) | <.001 |
| Yes | 168 (6.58) | 56 (2.19) | | |
| No | 2139 (83.78) | 190 (7.44) | | |
| Family history of cardiovascular disease | | | 0.0 (1) | .87 |
| Yes | 748 (29.30) | 81 (3.17) | | |
| No | 1559 (61.07) | 165 (6.46) | | |

^a LV: left ventricle.

Table 3. One-way analysis of the baseline characteristics (part 3).

| Variables | People without heart failure, median (IQR) | People with heart failure, median (IQR) | Mann-Whitney U test | P |
| --- | --- | --- | --- | --- |
| Age (years) | 54 (45-63) | 67 (58.75-73) | -11.9 | |
| BMI (kg/m²) | 30.29 (26.74-35) | 31.58 (28.06-42.61) | -3.59 | <.001 |
| Systolic blood pressure (mm Hg) | 123.83 (114.66-134.83) | 132.08 (122.91-145.38) | -7.83 | <.001 |
| Glycosylated hemoglobin (%) | 5.6 (5.3-6.1) | 6 (5.5-7.3) | -7.39 | <.001 |
| Low-density lipoprotein (mg/dL) | 126 (102-149) | 120.5 (99.75-147.25)^c | -1.16 | .25 |
| High-density lipoprotein (mg/dL) | 50 (42-60) | 49 (40-58)^c | -1.66 | .10 |
| Fasting triglycerides (mg/dL) | 87 (63-124) | 98 (74-141)^c | -4.35 | <.001 |
| Fasting total cholesterol (mg/dL) | 197 (173-223) | 193 (170-220)^c | -0.97 | .33 |
| Ultrasensitive C-reactive protein (mg/dL) | 0.25 (0.1-0.54) | 0.31 (0.14-0.65) | -2.73 | .01 |
| Glomerular filtration rate (mL/min) | 87.09 (76.97-97.78) | 80.16 (64.43-91.9) | -6.65 | <.001 |
| Ejection fraction (%) | 65 (55-65) | 65 (55-65) | -1.18 | .24 |
| Heart rate (beats/min) | 63 (56-70) | 66 (57-73) | -3.37 | <.001 |
| Vitamin D3 (ng/mL) | 12.2 (8.7-16.9) | 11.7 (8.75-15.45) | -1.29 | .20 |
| Dark-colored green vegetables (ng/mL) | 0.26 (0.15-0.42) | 0.24 (0.15-0.39) | -1.14 | .25 |
| Stress | 4 (2-8) | 4 (1-7) | -2.69 | .01 |
| Egg (ng/mL) | 0.32 (0.91-0.66) | 0.32 (0.84-0.8) | -0.07 | .94 |
| Fish (ng/mL) | 0.86 (0-0.17) | 0.86 (0-0.2) | -1.53 | .13 |
| Proportion of the population living in poverty in the area (%) | 0.22 (0.1-0.32) | 0.31 (0.21-0.36) | -5.24 | <.001 |
| Status of favorite food stores within 3 km | 0.24 (0.06-0.44) | 0.34 (0.16-0.49) | -4.43 | <.001 |
| Status of sports facilities within 3 km | 0.37 (0.18-0.63) | 0.4 (0.23-0.71) | -2.24 | .03 |
| Peak early diastolic velocity of mitral annulus (cm/s) | 0.83 (0.71-0.96) | 0.8 (0.67-0.97) | -1.10 | .27 |
| Ratio of early maximal ventricular filling velocity to atrial maximal ventricular filling velocity | 1.06 (0.87-1.27) | 0.92 (0.76-1.35) | -6.37 | <.001 |
| Hours of actual sleep at night | 6 (6-7) | 6 (5-8) | -1.04 | .30 |

Machine Learning Analysis of the Occurrence of CHF

Feature Selection

This study made use of PCA and the RF method for feature selection. The model was trained using root mean square error (RMSE) as a criterion. The principle is that each step includes an additional feature with the highest variance as a basis for classification []. The number of selected features was used as the horizontal axis and the predicted RMSE score for each fitted model was used as the vertical axis.
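For reference, RMSE is the square root of the mean squared difference between observed outcomes and predictions. The sketch below uses hypothetical predicted probabilities against 0/1 outcomes; the source does not specify whether probabilities or class labels were scored.

```python
from math import sqrt
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical outcomes (0 = no CHF, 1 = CHF) and predicted probabilities.
observed = [0, 0, 1, 1]
predicted = [0.1, 0.2, 0.7, 0.6]
print(round(rmse(observed, predicted), 3))
```

A lower RMSE indicates predictions closer to the observed outcomes, which is why the curves in this section are read as "lower is better".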

PCA offers the advantage of transforming a set of correlated variables into a set of uncorrelated principal components, thereby reducing dimensionality and enhancing interpretability. After feature selection by PCA, a total of 15 features were incorporated. The initial eigenvalues, percentage of variance, and cumulative percentage of these features are presented in Table 4. RMSE results are presented in Figure 2.
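The relationship between an eigenvalue and its percentage of variance can be sketched as follows. When PCA is run on standardized variables, the total variance equals the number of variables, so each percentage is the eigenvalue divided by that number. The value 44 used here is an inference (53 candidate variables minus the 9 excluded for missing data), not a figure stated in the source, and the function name is hypothetical.

```python
from itertools import accumulate
from typing import List, Sequence

# Assumption: 53 candidate variables minus 9 dropped for >30% missing values.
N_VARIABLES = 44


def variance_percentages(
    eigenvalues: Sequence[float], n_variables: int = N_VARIABLES
) -> List[float]:
    """Percentage of total variance explained by each component (standardized inputs)."""
    return [100 * ev / n_variables for ev in eigenvalues]


eigenvalues = [3.27, 2.47, 2.21]  # first three initial eigenvalues from Table 4
percentages = variance_percentages(eigenvalues)
cumulative = list(accumulate(percentages))
for ev, pct, cum in zip(eigenvalues, percentages, cumulative):
    print(f"{ev:.2f}  {pct:.2f}%  {cum:.2f}%")
```

Under this assumption, the computed percentages agree with the reported ones to within the rounding of the published eigenvalues.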

Table 4. Features selected by principal component analysis.

| Feature | Initial eigenvalue | Percentage of variance (%) | Accumulation (%) |
| --- | --- | --- | --- |
| Age (years) | 3.27 | 7.42 | 7.42 |
| Gender | 2.47 | 5.61 | 13.03 |
| Fasting total cholesterol (mg/dL) | 2.21 | 5.02 | 18.04 |
| High-density lipoprotein (mg/dL) | 1.95 | 4.44 | 22.48 |
| Status of favorite food stores within 3 km | 1.78 | 4.03 | 26.52 |
| Income | 1.72 | 3.90 | 30.41 |
| Peak early diastolic velocity of mitral annulus | 1.48 | 3.36 | 33.77 |
| LV^a regional wall motion | 1.36 | 3.09 | 36.86 |
| Smoke | 1.26 | 2.85 | 39.71 |
| Dark-colored green vegetables | 1.22 | 2.78 | 42.49 |
| Minutes of walking or running | 1.17 | 2.67 | 45.16 |
| Cardiac disease | 1.11 | 2.52 | 47.68 |
| Threats or harassment | 1.08 | 2.45 | 50.13 |
| Rate your sleep quality overall | 1.05 | 2.4 | 52.54 |
| Systolic blood pressure (mm Hg) | 1.03 | 2.34 | 54.88 |

^a LV: left ventricle.


Figure 2. Root mean square error (RMSE) results of the principal component analysis.

The RF method, on the other hand, is known for its robustness and ability to handle high-dimensional data. After feature selection by RF, a total of 21 features were included. The importance of these features is shown in Table 5. RMSE results are presented in Figure 3.

Table 5. Characteristic importance of features selected by random forest.

| Feature | Characteristic importance |
| --- | --- |
| Age | 0.07 |
| Glomerular filtration rate | 0.07 |
| Glycosylated hemoglobin | 0.06 |
| Systolic blood pressure | 0.05 |
| BMI | 0.04 |
| Ratio of early maximal ventricular filling velocity to atrial maximal ventricular filling velocity | 0.04 |
| Eggs | 0.04 |
| Dark-colored green vegetables | 0.04 |
| Heart rate | 0.03 |
| Peak early diastolic velocity of mitral annulus | 0.03 |
| Fasting total cholesterol | 0.03 |
| Vitamin D3 | 0.03 |
| Proportion of the population living in poverty in the area | 0.03 |
| Ultrasensitive C-reactive protein | 0.03 |
| Status of sports facilities within 3 km | 0.03 |
| Low-density lipoprotein | 0.03 |
| Ever awakened by trouble breathing | 0.03 |
| Status of favorite food stores within 3 km | 0.03 |
| Fasting triglycerides | 0.03 |
| High-density lipoprotein | 0.03 |
| Ejection fraction | 0.03 |

Figure 3. Root mean square error (RMSE) results of the random forest method.

To compare the feature selection methods, this study used 10-fold cross-validation. The mean RMSE was calculated and plotted as shown in Figure 4. Feature selection by RF outperformed feature selection by PCA: the mean RMSE was 0.30 for RF and 0.31 for both PCA and the original data. This indicates that the RF method provided more accurate and reliable feature selection results.
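A minimal sketch of how 10-fold cross-validation partitions the data is given below. The helper name `k_fold_indices`, the seed, and the sample size of 50 are illustrative assumptions. In each round a model would be fitted on the 9 training folds and scored (for example, by RMSE) on the held-out fold, and the 10 scores would then be averaged.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n: int, k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training indices, held-out indices) for each of k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal parts
    for i in range(k):
        held_out = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, held_out


splits = list(k_fold_indices(50))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Because every observation is held out exactly once, the averaged score uses all of the data for evaluation without ever scoring a model on rows it was trained on.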


Figure 4. Root mean square error (RMSE) results of the original data and different feature selection methods. PCA: principal component analysis; RF: random forest.

Data Balance

The performance of the machine learning algorithms under the different data balancing techniques is presented in Table 6. The results indicated that, compared with the other data balancing strategies used in this study, SMOTE-ENN consistently performed best across the evaluated models in terms of accuracy.

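Table 6. Comparison of imbalanced data handling techniques across each machine learning algorithm.

| Algorithm | Performance metric | Unbalanced data | Undersampling | Oversampling | ADASYN^a | SMOTE^b | SMOTE-ENN^c |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Decision tree | Accuracy | 88.51% | 56.06% | 55.43% | 59.30% | 61.86% | 68.14% |
| Decision tree | AUC^d | 0.66 | 0.56 | 0.55 | 0.60 | 0.62 | 0.71 |
| Random forest | Accuracy | 91.38% | 68.18% | 54.43% | 55.06% | 67.86% | 69.33% |
| Random forest | AUC | 0.81 | 0.83 | 0.77 | 0.80 | 0.80 | 0.85 |
| XGBoost^e | Accuracy | 90.86% | 69.70% | 59.71% | 59.44% | 70.57% | 68.68% |
| XGBoost^e | AUC | 0.80 | 0.76 | 0.77 | 0.81 | 0.79 | 0.81 |
| AdaBoost^f | Accuracy | 87.73% | 71.21% | 67.43% | 76.76% | 71.71% | 75.30% |
| AdaBoost^f | AUC | 0.67 | 0.83 | | | | |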


JMIR MEDICAL INFORMATICS
