Predicting Students' Decision To Go Into University

Adakole Emmanuel Audu
Published in CodeX
5 min read · Jul 16, 2020


Introduction

Educational Data Mining is the analysis of students’ data with machine learning algorithms in order to infer information to further aid students with their education. It is a field with enormous potential benefits and it has received a lot of interest from researchers and some educational institutions over the years.

Objective

One very common application of machine learning to educational data is predicting students' grades (I will cover this in a different article). In this project, though, our objective is clearly stated as follows:

  • To predict a student's final decision to further their education by enrolling in a university.

With this information, school administrators can work out the right approach to take for each group of students: either encouraging students by explaining the importance of tertiary education, or showing them viable alternatives and how to pursue them.

Dataset

To create our classification model, we need to feed our algorithms with data. In this project we will make use of a publicly available Portuguese school dataset. It can be found at this URL: https://github.com/Emmanuel96/Higher_edu_dec_prediction/blob/master/Dataset/student.csv

The figure below explains the features present in the dataset.

Dataset features briefly explained

Implementation

To achieve the objective explained earlier, we will use three different classification algorithms (Multi-Layer Perceptron classifier, K-Nearest Neighbours and the AdaBoost ensemble) to classify students according to their decision.

Once our environment is set up, we first import pandas and the other libraries we need as follows:

import pandas as pd
import numpy as np
import seaborn as sns

Please note that you would have to install these libraries either with pip or conda, if you do not already have them installed.
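If you need to install them, a typical command for a pip-based setup would be:

pip install pandas numpy seaborn scikit-learn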

Next, we import the necessary scikit-learn libraries:

# sklearn imports
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics
# classifiers
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier

Classification

Machine learning algorithms mostly belong to either the category of supervised or unsupervised learning. Classification is a type of supervised learning which we used to classify a set of data into two or more classes.

To achieve this in our case, we first read our CSV file, create a DataFrame from it, and drop all null values as follows:

data = pd.read_csv(
    r'C:/Users/Emmanuel/Documents/projects/Python/Students Data Analysis/Dataset/student.csv')
student_data = pd.DataFrame(data)
# drop all null data
student_data.dropna(inplace=True)
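
At this point it can be useful to take a quick look at the data (this inspection step is an addition, not part of the original walkthrough):

# quick sanity check of the loaded data
print(student_data.shape)   # number of rows and columns
print(student_data.head())  # first few rows with their feature values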

Handle Categorical Values

Classification algorithms don't work well with categorical (text) values, so we need to convert them to binary indicator values. In this case we use the pandas get_dummies function inside a custom helper function, as follows (a small illustration of the resulting encoding appears after the snippet):

# function to convert categorical data to binary data
def handle_cat_data(cat_feats, data):
    for f in cat_feats:
        to_add = pd.get_dummies(data[f], prefix=f, drop_first=True)
        merged_list = data.join(
            to_add, how='left', lsuffix='_left', rsuffix='_right')
        data = merged_list
    # then drop the original categorical features
    data.drop(cat_feats, axis=1, inplace=True)
    return data

# array of categorical features; 'higher' is left out because it is our target column
cat_data = ['school', 'sex', 'address', 'famsize', 'Mjob', 'Fjob', 'reason',
            'guardian', 'schoolsup', 'activities', 'nursery', 'fatherd',
            'Pstatus', 'internet', 'romantic', 'famrel', 'freetime',
            'goout', 'Dalc', 'Walc', 'health', 'Medu', 'famsup']
student_data = handle_cat_data(cat_data, student_data)
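
To make the encoding concrete, here is a small illustration (the demo DataFrame below is made up purely for demonstration and is not part of the dataset):

# illustrative example: get_dummies turns each categorical column into binary
# indicator columns, dropping the first category of each with drop_first=True
demo = pd.DataFrame({'sex': ['F', 'M', 'F'], 'internet': ['yes', 'no', 'yes']})
print(pd.get_dummies(demo, drop_first=True))
# the result has the columns sex_M and internet_yes

With drop_first=True, one category per feature is dropped, so a feature with n categories becomes n-1 indicator columns.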

Split Dataset To Test and Train Data

We cannot use the same data for both training and testing because our models become familiar with the training data; to measure accuracy and robustness fairly, we need a test set that the model has never seen.

We therefore split the dataset into 80% for training and 20% for testing with the train_test_split method. The higher column is our target, and all remaining columns are used as features to train the model. We achieve this with the code snippet below:

# divide data into training (80%) and testing (20%) data
X_train, X_test, y_train, y_test = train_test_split(
    student_data.drop('higher', axis=1), student_data.higher,
    test_size=0.2, stratify=student_data.higher)
# classifier names for the run_models function
classifier_names = ["K Nearest Neighbour",
                    "Neural Networks", "Adaboost Classifier"]
# classifiers
classifiers = [
    KNeighborsClassifier(),
    MLPClassifier(),
    AdaBoostClassifier(),
]
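
The run_models helper itself is not listed in the article. Below is a minimal sketch of what such a function might look like, assuming it simply trains each classifier and reports the two metrics used in the evaluation later on (accuracy and training time in seconds):

import time
from sklearn.metrics import accuracy_score

# hypothetical helper: the original run_models is not shown in the article,
# so this version trains each classifier and reports accuracy and training time
def run_models(names, models, X_train, X_test, y_train, y_test):
    for name, model in zip(names, models):
        start = time.time()
        model.fit(X_train, y_train)           # train on the training split
        elapsed = time.time() - start         # training time in seconds
        y_pred = model.predict(X_test)        # predict the unseen test split
        acc = accuracy_score(y_test, y_pred)  # fraction of correct predictions
        print(f"{name}: accuracy = {acc:.2%}, training time = {elapsed:.2f}s")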

Finally, we run our models by calling the run_models function as shown below:

# run classification model
run_models(classifier_names, classifiers, X_train, X_test, y_train, y_test)

Results And Model Evaluation

The table below shows the accuracy of the different algorithms used. The values you get may vary slightly depending on factors such as your machine and the random train/test split.

Accuracy Of Classification Models

To evaluate a model we could use different metrics, but in this case we chose accuracy and the time taken, in seconds, to train the model.

MLP is a neural network classifier; it usually requires a large dataset and can be very complex. Here we used a simple version with scikit-learn's defaults: a maximum of 200 iterations and a single hidden layer of 100 neurons. As expected, because of its complexity, it took the longest time to train. Its accuracy was lower than I expected, but with fine tuning it should improve.

The last classification algorithm used here is AdaBoost. It is an ensemble classifier, i.e. it combines several weaker classifiers into one stronger one. In most cases ensembles give better results, but not in ours: here it fell a bit short of the KNN classifier. I am confident that with hyperparameter fine tuning it would achieve better accuracy.

We can easily see that K Nearest Neighbour gave the best accuracy, 82.76%; this is very interesting because it is relatively the simplest algorithm used here, and it also took the shortest time to run.

Future Work

Although we would still need to test our model with a larger dataset, we can already use it to predict students' decisions. To make it more robust, we will need to fine tune the hyperparameters and see what difference this makes to the accuracies; the sketch below shows one way to approach this.
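
As an illustration of what that tuning could look like (this is not from the original project), here is a minimal grid-search sketch for the KNN classifier using scikit-learn's GridSearchCV; the parameter grid is only an example:

from sklearn.model_selection import GridSearchCV

# example parameter grid for the KNN classifier; these values are purely
# illustrative and not the settings used in the article
param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11],     # number of neighbours to consider
    'weights': ['uniform', 'distance'],  # equal vs distance-weighted votes
}
grid = GridSearchCV(KNeighborsClassifier(), param_grid,
                    cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print("best parameters:", grid.best_params_)
print("cross-validated accuracy:", grid.best_score_)

The same pattern applies to MLPClassifier (e.g. hidden_layer_sizes, max_iter) and AdaBoostClassifier (e.g. n_estimators, learning_rate).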

Thank you for reading my article and please feel free to reach out to me if you have any questions or comments.
