Titanic. Embarkation
In previous posts we started an exploratory data analysis of the Titanic dataset. We have already gone through the list of features and checked whether age is a useful one.
This time let's talk about the embarked feature. It records the port where a passenger boarded the ship. We can think about this feature in two modes: wearing an ML engineer hat, or wearing an analyst hat. For an ML engineer, a strong connection between a feature and the target is enough to include the feature in the dataset and train a model. An analyst is a slightly more curious creature. They would ask more questions. What is the nature of the connection between embarkation and survival? How does this feature interact with other features?
Let's check the picture. I'm quite sure that it is not totally accurate, but I hope it is correct in the most important aspect: the order of ports. Before I checked this order, I had a very naive hypothesis that passengers who embarked earlier had a better chance to leave the ship, so their survival rate would be higher. But that can't be the case for two reasons. First of all, the dataset only contains records of passengers who were on board at the moment of the disaster. Second, the data points the other way: passengers who boarded at the first stop had the lowest survival rate.
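To compare the ports yourself, a quick groupby is enough. Here is a minimal sketch, assuming the Kaggle column names Embarked and Survived. The toy_survival rows and the survival_rate_by_port helper are made up for illustration and are not the real data; swap in your own DataFrame loaded from the dataset.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT real Titanic records.
toy_survival = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "Q", "Q"],
    "Survived": [0, 0, 1, 0, 1, 1, 0, 1],
})

def survival_rate_by_port(df: pd.DataFrame) -> pd.DataFrame:
    """Passenger count and mean survival per embarkation port.

    Hypothetical helper; assumes columns named Embarked and Survived.
    """
    return (
        df.dropna(subset=["Embarked"])      # skip rows with an unknown port
          .groupby("Embarked")["Survived"]
          .agg(passengers="count", survival_rate="mean")
          .reindex(["S", "C", "Q"])         # sailing order of the stops
    )

rates = survival_rate_by_port(toy_survival)
print(rates)
```

Keeping the passenger count next to the rate helps to avoid reading too much into a port with only a handful of rows.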
The real reason seems to be a little more subtle. In "S" (Southampton) there were a lot of crew and third-class passengers, and later we will see that it was bad to be a third-class man on the Titanic (oops, spoiler). Therefore, embarked could be a proxy feature for economic status.
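One way to probe the proxy idea is to look at the class mix per port. This is a rough sketch, assuming the Kaggle column names Embarked and Pclass. The toy_class rows and the class_mix_by_port helper are invented for illustration, so the numbers mean nothing until you run it on the real dataset.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT real Titanic records.
toy_class = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "Q", "Q"],
    "Pclass":   [3, 3, 2, 1, 1, 1, 3, 3],
})

def class_mix_by_port(df: pd.DataFrame) -> pd.DataFrame:
    """Share of each ticket class among passengers from each port.

    Hypothetical helper; assumes columns named Embarked and Pclass.
    normalize="index" makes every row (port) sum to 1.
    """
    return pd.crosstab(df["Embarked"], df["Pclass"], normalize="index")

mix = class_mix_by_port(toy_class)
print(mix)
```

If the rows look very different from each other on the real data, then at least part of the embarked signal is really a class signal.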
