Predicting Medical Insurance Charge Using Machine Learning

Adakole Emmanuel Audu · Published in CodeX · 6 min read · Mar 27, 2021


Background:

Over the years, most people have become aware of the necessity of medical insurance for themselves and their families. Depending on the medical care covered, an insurance company collects yearly premiums, but it is very difficult to estimate medical expenses because of the varying health conditions of payees and other factors contributing to their health.

Some conditions are, however, more prevalent for certain segments of the population. An example of this is throat cancer which is more likely among smokers than non-smokers, and heart diseases such as cardiomyopathy may be more likely among the obese.

As a result, insurers invest a great deal of time and money to develop models that accurately predict medical expenses.

Objectives:

The objective of this project is as follows:

  1. To create a regression model to predict the medical expenses of payees for insurance companies

Dataset:
The dataset for this project consists of 7 columns: age, sex, bmi (body mass index), children, smoker, region and charges. The columns are described below:

  - age: age of the payee
  - sex: gender of the payee (male or female)
  - bmi: body mass index
  - children: number of children/dependents covered by the insurance
  - smoker: whether the payee smokes (yes or no)
  - region: the payee's residential area
  - charges: the medical costs billed to the payee (the label we want to predict)

Data Pre-processing:
Before we can apply our regression algorithms on our dataset, we first need to pre-process it and make sure it is in a format where our algorithms will produce the best result.

First, we check for null values using the isnull and sum functions, as shown in the code snippet below.
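The original snippet isn't reproduced here; the check can be sketched as follows, using a tiny stand-in frame in place of the real CSV (the file name insurance.csv is an assumption):

```python
import pandas as pd

# Tiny stand-in for the insurance data; in the project this would be
# something like: dataset = pd.read_csv("insurance.csv")
dataset = pd.DataFrame({
    "age": [19, 33, 45],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 22.7, 30.1],
    "children": [0, 1, 2],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
    "charges": [16884.92, 1725.55, 8240.59],
})

# isnull() marks missing cells; sum() counts them per column.
null_counts = dataset.isnull().sum()
print(null_counts)
```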

Luckily for us, this returns zero for every column, so we can move on to handling our categorical data.

Handle Categorical Data with Dummy Data:

Categorical data requires special handling in linear regression; unlike continuous data, it cannot be fed directly into the regression algorithm. There are various ways to handle categorical data; here, we create a function that uses the pandas get_dummies function. It creates a separate column for each categorical value, i.e. for a gender column with only male and female values, it would create a single column such as sex_male that receives either a 1 or a 0.

The code below describes how we can achieve this.
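The original snippet isn't shown here; based on the description that follows, the function might look roughly like this (the exact signature is an assumption):

```python
import pandas as pd

def handle_cat_data(dataset, cat_features, drop_first=True):
    """Replace each categorical column with get_dummies indicator columns."""
    for col in cat_features:
        # One indicator column per category value, prefixed with the
        # original column name, e.g. sex -> sex_male.
        dummies = pd.get_dummies(dataset[col], prefix=col, drop_first=drop_first)
        dataset = pd.concat([dataset.drop(columns=col), dummies], axis=1)
    return dataset

df = pd.DataFrame({"sex": ["male", "female"], "smoker": ["yes", "no"]})
encoded = handle_cat_data(df, ["sex", "smoker"])
print(encoded.columns.tolist())  # ['sex_male', 'smoker_yes']
```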

The handle_cat_data function accepts the categorical features and the entire dataset as parameters. It loops through each of these columns in the dataset, using the pandas get_dummies function to convert each value of the respective categorical column into its own column, e.g. sex_male or sex_female. It also allows us to pass drop_first as an argument, which tells get_dummies to drop the first value; i.e. if a payee in this dataset is not female, they are definitely male.

Now that we have handled our categorical data, it’s time to standardise or normalise our large numerical values.

Standardize to Handle Large Numerical Values:

Having large values like age alongside smaller values like children is a bad idea before regression or classification, because the larger values can dominate the resulting model and have an unfair effect purely due to their scale. It is therefore a good idea to normalise the values (rescale them to lie between 0 and 1) or standardise them (rescale them to zero mean and unit variance), which ultimately reduces the difference in scale. We can standardize with the code snippet shown below:
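The original snippet isn't included here; a z-score standardisation helper could be sketched like this (the function name is an assumption):

```python
import pandas as pd

def standardize(dataset, num_features):
    """Rescale the given numeric columns to zero mean and unit variance."""
    for col in num_features:
        dataset[col] = (dataset[col] - dataset[col].mean()) / dataset[col].std()
    return dataset

df = pd.DataFrame({"age": [19, 33, 45, 60], "children": [0, 1, 2, 3]})
scaled = standardize(df, ["age", "children"])
print(scaled.round(2))
```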

Most regression algorithms built into scikit-learn have built-in normalisation options, so in most cases we would not need to use the function shown above.

Randomly Split The Dataset into 80% Training & 20% Testing Data:

To avoid overfitting, i.e. a model that appears perfectly accurate only on the data it has already seen, it is necessary to split our dataset into training and testing sets: we use the training set to fit the model and the testing set to measure its accuracy. The code snippet below shows how it's done.
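The snippet itself isn't reproduced; following the description below, it might look roughly like this (the stand-in data and the random_state are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small stand-in for the insurance data.
dataset = pd.DataFrame({
    "age": [19, 33, 45, 60, 25, 52, 38, 47, 29, 61],
    "smoker": ["yes", "no", "no", "yes", "no", "no", "yes", "no", "no", "yes"],
    "charges": [16885, 1726, 8241, 28101, 2721, 9800, 21984, 8600, 3756, 30166],
})

# X holds every column except the label; y holds only the label.
X = dataset.drop(columns="charges")
y = dataset["charges"]

# One-hot encode the remaining categorical columns.
X = pd.get_dummies(X, drop_first=True)

# 80% training, 20% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 8 2
```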

In the code snippet above, we select all the columns except the label column and assign them to X, and assign only the label to y. Afterwards, we one-hot encode the dataset to further handle the categorical data; one-hot encoding ensures our estimator doesn't mistake the dummy values for ordinal numbers, where comparisons such as >, < or = would be meaningless. Finally, we use scikit-learn's train_test_split function to split our dataset into 80% training and 20% testing data.

Apply Regression Algorithms:

Here we create a function that runs the different algorithms and outputs the results for each, with the option of saving the model or retrieving a saved model.
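The snippet itself isn't reproduced; a simplified sketch consistent with that description follows (the joblib saving mechanism and the exact parameter names are assumptions, and the plotting step is reduced to a comment):

```python
import joblib
from sklearn.linear_model import LinearRegression, Ridge

def run_reg_models(estimators, names, X_train, X_test, y_train, y_test,
                   save=False, save_index=0):
    """Fit each estimator, report its R^2 scores, optionally save one model."""
    results = {}
    for name, est in zip(names, estimators):
        est.fit(X_train, y_train)
        train_score = est.score(X_train, y_train)
        test_score = est.score(X_test, y_test)
        print(f"{name}: train R^2 = {train_score:.3f}, test R^2 = {test_score:.3f}")
        results[name] = (train_score, test_score)
        # The full version also calls a visualize helper here to compare
        # actual vs. predicted charges in a grouped bar chart.
    if save:
        joblib.dump(estimators[save_index], f"{names[save_index]}.joblib")
    return results
```

It would be called with parallel lists of estimators and names, e.g. `run_reg_models([LinearRegression(), Ridge()], ["linear", "ridge"], ...)`.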

The custom run_reg_models function takes an array of estimators and their names, as shown above. It also takes the dataset, a flag indicating whether to save, and the index of the estimator to save. It fits each model, outputs its results and then visualizes them, comparing the actual test values and the predicted values in a grouped bar chart.

Results:

The most interesting thing to notice from the regression is that our training dataset gives a slightly lower accuracy score than our testing dataset. This is strange, as we would expect the model to score higher on the data it was trained on. My guess is that it is a result of the training dataset being larger, leaving more room for errors; if both were the same size, they might give similar accuracy and error rates.

The difference in performance between the estimators wasn't large: linear regression, ridge regression and lasso regression all gave the same accuracy of 76%, while elastic net regression and orthogonal matching pursuit CV gave different results, with elastic net regression giving the lowest. Interestingly, when tuning the hyperparameters for elastic net regression, normalising the dataset reduced its performance even further, to an accuracy of about 0.02. Tuning the hyperparameters for linear and ridge regression made little or no difference.

Something else that was interesting is that feature selection made no significant positive difference to the performance of our estimators; I tested this with the RFE feature selection method across all the selected estimators. The next phase is to visualise the results of our estimators on the testing data.
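As a sketch of what such a check could look like with scikit-learn's RFE, here is a small example on synthetic stand-in data (the feature counts are illustrative, not the project's):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: six features, only the first three carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=100)

# RFE repeatedly fits the estimator and prunes the weakest feature.
selector = RFE(LinearRegression(), n_features_to_select=3)
selector.fit(X, y)
print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # rank 1 marks a selected feature
```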

Visualise Models with Training & Testing Data:

The custom made function run_reg_models already calls the visualize function after it’s done with the estimation. The code snippet below shows how the visualise function works:

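The snippet isn't reproduced here; a minimal version of such a visualise helper, assuming matplotlib and a grouped bar layout, might look like this:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt

def visualize(y_test, y_pred, model_name, n=10):
    """Grouped bar chart comparing actual and predicted charges."""
    idx = np.arange(n)
    width = 0.35
    fig, ax = plt.subplots()
    ax.bar(idx - width / 2, np.asarray(y_test)[:n], width, label="Actual")
    ax.bar(idx + width / 2, np.asarray(y_pred)[:n], width, label="Predicted")
    ax.set_title(f"Actual vs. predicted charges ({model_name})")
    ax.set_xlabel("Test sample")
    ax.set_ylabel("Charges")
    ax.legend()
    return fig, ax
```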

The histogram plots below describe the differences in performance for all the resulting models.

Conclusion on Selected Estimator:

My final selected estimator is linear regression, which is the simplest of all the estimators. It involves finding the best line that fits the correlation between the attributes and the label. I chose it as it produced the highest score, alongside ridge regression, in this investigation.
