Titanic Data

Analyzing survivors of the Titanic tragedy.

Posted by Amer Shalan on October 26, 2016

We take a look at the Titanic dataset.

What variables contributed to an individual's survival?

Did different ages or sexes have a higher rate of survival?

Cleaning the data

We had two missing ages. We can replace them with the median age.

# Replacing missing ages with median
df['Age'] = df['Age'].fillna(df['Age'].median())

The Cabin column had very few values filled in, so we drop it.

# Delete Cabin
del df['Cabin']

We drop the remaining rows that have NaNs.

# Drop remaining NaNs
df.dropna(inplace=True)

Plotting the data

I started by plotting all passengers' ages against their survival.

Age Category
0-5 Baby
6-15 Child
16-30 Adult
31-40 Mid
41-60 Old
61-80 Very Old
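
The binning in the table above can be sketched in a few lines. This is a dependency-free illustration rather than the code used for the post: the function name `age_category` is hypothetical, and it assumes each upper bound is inclusive (an age of 5 is a Baby, anything above 5 and up to 15 is a Child). With pandas, `pd.cut` with the same edges and labels would do the same job on a whole column.

```python
from bisect import bisect_left

# Upper edge of each age bin from the table, assumed to be inclusive.
UPPER_EDGES = [5, 15, 30, 40, 60, 80]
LABELS = ["Baby", "Child", "Adult", "Mid", "Old", "Very Old"]


def age_category(age):
    """Return the table's category for an age between 0 and 80."""
    if not 0 <= age <= UPPER_EDGES[-1]:
        raise ValueError(f"age {age} is outside the 0-80 range of the table")
    # bisect_left finds the first upper edge that is >= age.
    return LABELS[bisect_left(UPPER_EDGES, age)]


for age in (0.42, 5, 6, 29, 62):
    print(age, age_category(age))
```

Fractional ages (infants are recorded as fractions of a year) fall into the bin whose upper edge they do not exceed.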

I then plotted survival for each passenger class.

This is a little difficult to read. Why don't we put it into a stacked bar graph to see the weight of each class and the number of passengers in that class who survived?
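
The counts behind a stacked bar chart of class versus survival can be sketched as follows. The rows here are made up for illustration and the variable names are hypothetical; only the shape of the tabulation matters. With pandas, `pd.crosstab(df['Pclass'], df['Survived'])` would give the same table, and `.plot(kind='bar', stacked=True)` would draw it.

```python
from collections import Counter

# Hypothetical (Pclass, Survived) pairs standing in for the real rows.
passengers = [(1, 1), (1, 1), (1, 0), (2, 1), (2, 0), (3, 0), (3, 0), (3, 1)]

counts = Counter(passengers)

# One bar per class, split into a "died" segment and a "survived" segment.
stacked = {
    pclass: {"died": counts[(pclass, 0)], "survived": counts[(pclass, 1)]}
    for pclass in sorted({p for p, _ in passengers})
}

for pclass, segments in stacked.items():
    total = segments["died"] + segments["survived"]
    rate = segments["survived"] / total
    print(f"class {pclass}: {segments}, survival rate {rate:.2f}")
```

The height of each bar is the class size, and the split inside the bar shows how many of those passengers survived.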

Data Wrangling

To figure out which features I should use, I normalized the numeric columns and ran a recursive feature elimination (RFE) model.

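import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Numeric columns only: leave out the target, identifiers and text columns
nc = [x for x in df.columns if x not in ['Survived', 'Sex', 'Embarked', 'index', 'PassengerId', 'Name', 'Ticket']]

# Center each column on its mean and scale by its range
df_norm = (df[nc] - df[nc].mean()) / (df[nc].max() - df[nc].min())

# Eliminate one feature per step with a linear SVC until one is left
svc = SVC(kernel="linear", C=1)
rfe = RFE(estimator=svc, n_features_to_select=1, step=1)
rfe.fit(df_norm, df.Survived)
ranking = rfe.ranking_

# Note: the 'coef' column holds the RFE rank (1 = last feature remaining), not a coefficient
coeffs = pd.DataFrame(ranking, index=df[nc].columns, columns=['coef'])
coeffs.sort_values('coef')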
Pclass 1
Fare 2
Age 3
SibSp 4
Parch 5
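
To make the normalization step concrete, here is a small worked example of the same formula, (x - mean) / (max - min), on a plain list. The helper name `mean_normalize` and the sample values are assumptions for illustration, not part of the original analysis.

```python
def mean_normalize(values):
    """Center on the mean and scale by the range (max - min)."""
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    if spread == 0:
        raise ValueError("cannot scale a constant column")
    return [(v - mean) / spread for v in values]


fares = [10.0, 20.0, 60.0]  # made-up values, not taken from the dataset
print(mean_normalize(fares))
```

The sample has a mean of 30 and a range of 50, so the values map to -0.4, -0.2 and 0.6. Every normalized column ends up centered on zero with a total spread of 1, which keeps a wide-ranging feature such as Fare from dominating the linear SVC.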

I dropped the Fare column, since Pclass captures similar information and shows a higher correlation with survival.

I created dummy variables for:

  • Sex
  • Age (binned with the table above)
  • Embarked
  • Pclass
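
As a rough sketch of what dummy encoding does: each category becomes its own 0/1 column, and one level is left out as the reference. The helper `one_hot` and the sample values are hypothetical; treating `S` as the reference level is an assumption, chosen because only Embark_C and Embark_Q appear among the coefficients later in the post. With pandas, `pd.get_dummies` is the usual tool for this.

```python
def one_hot(values, prefix, reference):
    """Build 0/1 columns for every level except the reference level."""
    levels = sorted(set(values) - {reference})
    return {
        f"{prefix}_{level}": [int(v == level) for v in values]
        for level in levels
    }


embarked = ["S", "C", "Q", "S"]  # made-up sample, not rows from the dataset
print(one_hot(embarked, "Embark", reference="S"))
```

A row with zeros in every Embark column belongs to the reference level, so the columns only make sense as a group.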


I ran RFECV (recursive feature elimination with cross-validation) to find the number of features that gives the highest score. The best result came with 7 features.

So I ran a logistic regression and got these absolute coefficients:

Index  Coefficient  Feature
2      2.600351     male
5      2.032660     Age_Baby
10     1.482869     Class_1
11     0.908808     Class_2
3      0.585350     Embark_C
0      0.447184     SibSp
6      0.365025     Age_Child
9      0.180293     Age_Old
4      0.161709     Embark_Q
1      0.072917     Parch
8      0.049452     Age_Mid
7      0.044070     Age_Adult

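Four of the five smallest coefficients belong to dummy variables. A dummy column is one part of a larger categorical encoding and cannot be removed on its own, so the only variable we can drop from our model is Parch.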

With the new model, we end up with this confusion matrix:

Actual           Predicted Survived  Predicted did_not_survive
Survived         82                  30
Did not Survive  26                  156
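
The usual summary metrics can be read straight off this matrix. The sketch below assumes rows are actual outcomes, columns are predictions, and "Survived" is the positive class; the function name `summarize` is hypothetical.

```python
def summarize(tp, fn, fp, tn):
    """Accuracy, precision and recall from the four confusion-matrix cells."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Cells taken from the matrix above: first row is TP, FN; second row is FP, TN.
logreg = summarize(tp=82, fn=30, fp=26, tn=156)
for name, value in logreg.items():
    print(f"{name}: {value:.3f}")
```

Under those assumptions, 238 of the 294 test passengers are classified correctly, which is an accuracy of about 81%.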

With an ROC curve:

When we tune the model with a grid search, we end up with this confusion matrix:

Actual           Predicted Survived  Predicted did_not_survive
Survived         80                  32
Did not Survive  25                  157

We perform a grid search for the same classification problem as above, but use KNeighborsClassifier as our estimator. The best parameters end up being {'n_neighbors': 3, 'weights': 'uniform'}.

We then fit a new kNN model with the optimal parameters found by the grid search and end up with this confusion matrix:

Actual           Predicted Survived  Predicted did_not_survive
Survived         75                  37
Did not Survive  30                  152
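
For readers who want to reproduce this kind of search, here is a minimal sketch with scikit-learn's `GridSearchCV`. It runs on synthetic data from `make_classification` rather than the Titanic features, and the candidate values in `param_grid`, the 5-fold setting and the split size are assumptions, since the post does not list the grid that was searched.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the prepared features and the Survived labels.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# Illustrative candidates only; not the grid used in the post.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)

# By default GridSearchCV refits the best model on the whole training split.
y_pred = search.best_estimator_.predict(X_test)

# labels=[1, 0] puts the positive class first, matching the tables in the post.
matrix = confusion_matrix(y_test, y_pred, labels=[1, 0])
print(matrix)
```

The best parameters and the matrix depend on the data, so the numbers from this sketch will not match the ones reported above.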

This turns out to be worse than our original logistic regression model: 227 of 294 passengers classified correctly (about 77%) versus 238 of 294 (about 81%).

Perhaps it's better to keep our model linear.

Hope you enjoy my findings :)