Milestone Project 1: Cleveland Heart Disease Dataset

Hello, I'm JayaPrakash, and this is part of the milestone project for my online course on Machine Learning. The dataset is a modified version of "The Cleveland Heart Disease Dataset", which many people use to learn the various classifiers in machine learning - https://www.kaggle.com/datasets/cherngs/heart-disease-cleveland-uci.

The methodology followed is explained below:

  1. Data description
  2. Exploratory Data Analysis
  3. Selecting the features of the classifier
  4. Selecting the classifier
  5. Evaluating the classifier
  6. Tuning the classifier
  7. Saving the model

The tools used in this project are:

  1. Jupyter Notebook
  2. Python
  3. Pandas
  4. Matplotlib
  5. Numpy
  6. Scikit-learn

1. Data Description

The dataset contains a total of 297 entries having 14 attributes. The attributes and their descriptions are as follows:

  1. age: age in years
  2. sex: sex (1 = male; 0 = female)
  3. cp: chest pain type (0 = typical angina; 1 = atypical angina; 2 = non-anginal pain; 3 = asymptomatic)
  4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
  5. chol: serum cholesterol in mg/dl
  6. fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
  7. restecg: resting electrocardiographic results (0 = normal; 1 = having ST-T wave abnormality, i.e. T wave inversions and/or ST elevation or depression of > 0.05 mV; 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
  8. thalach: maximum heart rate achieved
  9. exang: exercise induced angina (1 = yes; 0 = no)
  10. oldpeak: ST depression induced by exercise relative to rest
  11. slope: the slope of the peak exercise ST segment (0 = upsloping; 1 = flat; 2 = downsloping)
  12. ca: number of major vessels (0-3) colored by fluoroscopy
  13. thal: 0 = normal; 1 = fixed defect; 2 = reversible defect
  14. condition: 0 = no disease, 1 = disease (The Target)

The success criterion for the classifier is set at 95% accuracy in predicting whether a given patient has heart disease, based on the features above.
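As a minimal sketch of how the 95% criterion could be checked, the snippet below compares predicted labels with true labels using scikit-learn's `accuracy_score`. The helper name `meets_target` and the label lists are hypothetical and only illustrate the check; they are not results from this project.

```python
from sklearn.metrics import accuracy_score

TARGET_ACCURACY = 0.95  # the threshold stated in this section


def meets_target(y_true, y_pred, target=TARGET_ACCURACY):
    """Return (accuracy, passed) for predicted vs. true labels."""
    acc = accuracy_score(y_true, y_pred)
    return acc, bool(acc >= target)


# Hypothetical labels for illustration only (not project results).
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0, 0, 0]

acc, passed = meets_target(y_true, y_pred)
print(f"accuracy = {acc:.2f}, meets target: {passed}")
```

In practice, `y_true` and `y_pred` would come from a held-out test split rather than from the training data.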

Since there is no missing data and all the columns are numerical, we can proceed directly to fitting classifiers.
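The claim about missing data and column types can be verified with pandas. The sketch below uses a tiny stand-in frame with made-up values; on the real data one would load the CSV first. The file name in the comment and the helper name `check_ready` are assumptions.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_ready(df):
    """Return (total missing cells, list of non-numeric column names)."""
    missing = int(df.isna().sum().sum())
    non_numeric = [c for c in df.columns if not is_numeric_dtype(df[c])]
    return missing, non_numeric


# On the real dataset (file name is an assumption):
# df = pd.read_csv("heart_cleveland_upload.csv")

# Stand-in rows with made-up values, for illustration only.
sample = pd.DataFrame(
    {
        "age": [63, 45, 58],
        "sex": [1, 0, 1],
        "chol": [233, 250, 204],
        "condition": [0, 1, 0],
    }
)

missing, non_numeric = check_ready(sample)
print(f"missing cells: {missing}, non-numeric columns: {non_numeric}")
```

If both results are empty (zero missing cells and no non-numeric columns), the frame can be passed to a scikit-learn estimator without further preparation.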

Usually, we must handle missing values and encode categorical data before we can fit the model/classifier.
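For a dataset that does need this preparation, one common approach is median imputation for numeric columns and one-hot encoding for categorical ones. The sketch below is not needed for this dataset; the frame `raw`, its values, and the helper `clean_and_encode` are hypothetical.

```python
import pandas as pd

# Made-up rows with one missing value and one text column.
raw = pd.DataFrame(
    {
        "chol": [233.0, None, 204.0, 250.0],
        "cp": ["typical", "asymptomatic", "non-anginal", "asymptomatic"],
    }
)


def clean_and_encode(df, numeric_cols, categorical_cols):
    """Fill numeric gaps with the column median, then one-hot encode."""
    out = df.copy()
    for col in numeric_cols:
        out[col] = out[col].fillna(out[col].median())
    return pd.get_dummies(out, columns=categorical_cols)


clean = clean_and_encode(raw, numeric_cols=["chol"], categorical_cols=["cp"])
print(clean)
```

Median imputation is only one option; dropping incomplete rows or using scikit-learn's `SimpleImputer` inside a pipeline are alternatives.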

2. Exploratory Data Analysis

We can visualise the various attributes of the data using the matplotlib library.
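One simple way to do this is a histogram per attribute. The sketch below assumes the data is already in a pandas DataFrame; the helper `plot_histograms` and the stand-in values are hypothetical, and the `Agg` backend line is only needed when running outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd


def plot_histograms(df, columns, bins=10):
    """Draw one histogram per column, side by side, and return the figure."""
    fig, axes = plt.subplots(
        1, len(columns), figsize=(4 * len(columns), 3), squeeze=False
    )
    for ax, col in zip(axes[0], columns):
        ax.hist(df[col], bins=bins)
        ax.set_title(col)
        ax.set_xlabel(col)
        ax.set_ylabel("count")
    fig.tight_layout()
    return fig


# Stand-in rows with made-up values, for illustration only.
sample = pd.DataFrame(
    {"age": [63, 45, 58, 51, 67], "chol": [233, 250, 204, 222, 286]}
)
fig = plot_histograms(sample, ["age", "chol"])
```

In a notebook, the figure is displayed inline; elsewhere, `fig.savefig(...)` can write it to disk.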

3. Selecting Features for the Classifier

Since the number of potential features is low (13), we do not need to reduce it further. Usually, we would reduce the feature list to speed up model training and testing.
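For datasets where reduction is worthwhile, univariate selection is one option. The sketch below uses scikit-learn's `SelectKBest` on synthetic data shaped like this dataset (297 rows, 13 features); the synthetic data and the choice of `k=5` are assumptions for illustration, and this step is not applied in the project.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in with the same shape as the feature matrix.
X, y = make_classification(
    n_samples=297, n_features=13, n_informative=5, random_state=0
)

# Keep the 5 features with the highest ANOVA F-score against the target.
selector = SelectKBest(score_func=f_classif, k=5)
X_reduced = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)

print("reduced shape:", X_reduced.shape)
print("kept feature indices:", kept)
```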

4. Selecting a Classifier

Based on the scikit-learn guide for choosing a model, https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html, we opt for the following models

Tuning the hyperparameters of Logistic Regression and Random Forest Model
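A minimal sketch of tuning both models with `GridSearchCV` is given here. The synthetic data, the train/test split, and the parameter grids are assumptions chosen for illustration; they are not the grids or results used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in shaped like the dataset (297 rows, 13 features).
X, y = make_classification(
    n_samples=297, n_features=13, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Hypothetical search grids; real grids depend on the data.
candidates = {
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.01, 0.1, 1, 10]},
    ),
    "random_forest": (
        RandomForestClassifier(random_state=42),
        {"n_estimators": [50, 100], "max_depth": [None, 5]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    # 5-fold cross-validated search on the training split only.
    search = GridSearchCV(model, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        "test_accuracy": search.score(X_test, y_test),
    }
    print(name, results[name])
```

Keeping the test split out of the search means the reported test accuracy is not inflated by the tuning itself.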

5. Evaluating the Classifier

Now that we have a prototype model, we need to evaluate its predictive capacity. For that, we use the following metrics: