My Titanic Machine Learning Model (Kaggle Competition)
Kaggle runs a machine learning competition in which you predict whether each of 418 Titanic passengers survived, based on a set of passenger characteristics including their ticket class, name, sex, age, fare, cabin, where they embarked from and whether they travelled with family.
Kaggle provides a data set of 891 passengers which includes whether the passenger survived the Titanic’s fateful voyage or not (as well as the characteristics mentioned above).
The aim of the competition is to train your machine learning model on the 891-passenger data set and use it to predict the survival outcomes for the 418 passengers for whom this is not provided. You upload your submission for the 418 passengers to Kaggle, which tells you the percentage of outcomes you predicted correctly and where you placed on the public leaderboard.
If you don’t want to read my methodology and are just interested in the outcome: I used a RandomForestClassifier model to predict the survival outcomes and got 78% correct, which isn’t bad considering random guessing should give you around 50%. However, I just missed the top half of the leaderboard, which means there is more work to do…
My methodology
1. Data exploration — given the data set was provided and already in a reasonable state, I used Python’s Seaborn library to investigate the data (a short exploration sketch follows the feature list below)
2. Data Preparation / Feature Engineering — knowing that machine learning models struggle with categorical data, it was important to encode the data so that the model can learn effectively (a sketch of these transformations also follows the list):
- Male and female encoded to 0 and 1
- Created a title column: 0 if the title is Mr., Miss. or Mrs., and 1 for any other, more unusual title
- Cabin transformed to 0 if known or 1 if unknown
- Embarked data is One Hot Encoded to 0s and 1s
- Fare is numerical and left as is, however any blanks are filled with the average fare
- Age is numerical and left as is, however blanks were a little more tricky. Filling the Age column with the mean would change the distribution too much, so I used the MICE method from Python’s Impyute module, which imputes the age based on the other characteristics while keeping within the same distribution curve. You can read more about it here.
- SibSp — number of siblings / spouses aboard was left as-is
- Parch — number of parents / children aboard was left as-is
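To give a flavour of the exploration in step 1, here is a minimal Seaborn sketch. The column names follow the standard Kaggle Titanic CSVs; the specific plots are illustrative rather than my exact code:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the 891-passenger training set (standard Kaggle file name)
train = pd.read_csv("train.csv")

# Survival counts split by sex — highlights the strong Sex/Survived relationship
sns.countplot(data=train, x="Sex", hue="Survived")
plt.show()

# Age distribution, separated by survival outcome
sns.histplot(data=train, x="Age", hue="Survived", bins=30, kde=True)
plt.show()

# Fare against ticket class
sns.boxplot(data=train, x="Pclass", y="Fare")
plt.show()
```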
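And here, as a rough sketch rather than the exact code from my notebook, is how the step 2 transformations could be written with pandas. It assumes Impyute exposes its MICE implementation as impyute.imputation.cs.mice and accepts a plain numeric array; the title regex and the column list passed to MICE are illustrative choices:

```python
import pandas as pd
from impyute.imputation.cs import mice  # assumed import path for Impyute's MICE


def prepare(df):
    """Apply the encodings described above to a raw Kaggle Titanic DataFrame."""
    df = df.copy()

    # Sex: male/female encoded to 0/1
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

    # Title: 0 for the common titles (Mr, Miss, Mrs), 1 for anything more unusual
    titles = df["Name"].str.extract(r",\s*([^\.]+)\.", expand=False).str.strip()
    df["Title"] = (~titles.isin(["Mr", "Miss", "Mrs"])).astype(int)

    # Cabin: 0 if known, 1 if unknown
    df["Cabin"] = df["Cabin"].isna().astype(int)

    # Embarked: one-hot encoded into 0/1 columns
    df = pd.get_dummies(df, columns=["Embarked"], prefix="Embarked")

    # Fare: numeric, blanks filled with the average fare
    df["Fare"] = df["Fare"].fillna(df["Fare"].mean())

    # Age: MICE imputation keeps the filled values on the existing distribution
    numeric_cols = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]
    imputed = mice(df[numeric_cols].to_numpy().astype(float))
    df["Age"] = imputed[:, numeric_cols.index("Age")]

    return df


train = prepare(pd.read_csv("train.csv"))
test = prepare(pd.read_csv("test.csv"))
```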
3. Model creation and hyperparameter optimisation — a Random Forest classifier creates a set of decision trees, each built from a randomly selected subset of the training data, to predict the survival outcome. The model is trained on the 891-passenger data set and its features. It predicts the survival of a passenger through decisions made on features such as Sex, Cabin, Fare, Age, etc.
It then aggregates the votes from the different decision trees to decide the final outcome.
The model is then used to predict the survival outcomes for the 418-passenger data set — it does so by taking the majority vote of all the trees in the forest.
There were two model parameters which I optimised by running multiple models (sketched in the code below):
n_estimators represents the number of trees in the forest — usually, the more trees the better the model learns the data, however adding a lot of trees can slow down the training process considerably. I found the performance of the model peaked at 32 trees.
max_depth represents the maximum depth of each tree in the forest. The deeper the tree, the more splits it has and the more information it captures about the data. The model starts to overfit at large depth values, so I used 5 as the max_depth.
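A minimal sketch of this step with scikit-learn, assuming the train and test DataFrames were prepared as in the earlier sketch. The feature list and the parameter values swept are illustrative rather than my exact grid:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Cabin", "Title",
            "Embarked_C", "Embarked_Q", "Embarked_S"]
X_train = train[features]
y_train = train["Survived"]

# Sweep the two parameters: number of trees and tree depth (illustrative grid)
best_score, best_params = 0.0, None
for n_estimators in [8, 16, 32, 64, 128]:
    for max_depth in [3, 5, 7, 9]:
        model = RandomForestClassifier(n_estimators=n_estimators,
                                       max_depth=max_depth,
                                       random_state=0)
        score = cross_val_score(model, X_train, y_train, cv=5).mean()
        if score > best_score:
            best_score, best_params = score, (n_estimators, max_depth)

# Fit the final model with the chosen parameters (32 trees, depth 5 in my case)
model = RandomForestClassifier(n_estimators=32, max_depth=5, random_state=0)
model.fit(X_train, y_train)
print("Training accuracy:", model.score(X_train, y_train))

# Predict the 418 unseen passengers and write a Kaggle submission file
predictions = model.predict(test[features])
submission = pd.DataFrame({"PassengerId": test["PassengerId"],
                           "Survived": predictions})
submission.to_csv("submission.csv", index=False)
```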
In conclusion, the RandomForestClassifier achieved an 86% success rate on the training data, while my submission to Kaggle scored 78%, so there is still some work to do! Any comments or feedback on my approach are welcome.
My code for the above is available on my GitHub.