Titanic. Age.

We started talking about the Titanic dataset. Let's discuss the Age factor. We saw that Sex, Pclass and Embarked are strong features, and we could study them directly because they are categorical with low cardinality. Age is different: it is a continuous value from 0 to 80 with missing values. I tried the following approaches to figure out its signal:

๐Ÿ‘ ROC with age as the score and survival as the target: 47% ROC AUC, no clean picture although some age ranges seem to carry signal
๐Ÿ‘ Train CatBoost with age as the only feature, survival as the target and logloss as the loss function. Not bad on train, but random on eval
๐Ÿ‘ In the spirit of Bayesian analysis, compare the age distribution of survivors and not-survivors. It seems the most promising exercise. There is a 0-10 range with higher density in the survival distribution, while 18-28 is more prominent among those who died. Naive model which just gives 1 to the 0-10 range and so on gives 57% ROC AUC, it seems decent
๐Ÿ‘ Age field is filled for 80.1% of cases
๐Ÿ‘ Survival rate is 40.6% (slightly more) for passengers with known age and 29.4% (significantly less) for passengers with missing age

All in all, the age feature seems weak but useful.
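
For completeness, the fill rate and the survival split by age availability can be computed with a few lines. This is a sketch on toy rows where a missing age is represented as `None`. The function name and the row layout are made up for illustration.

```python
from typing import Dict, Optional, Sequence, Tuple

Row = Tuple[Optional[float], int]  # (age or None if missing, survived 0/1)


def age_missingness_summary(rows: Sequence[Row]) -> Dict[str, float]:
    """Share of rows with a known age, and survival rate with and without it."""
    known = [y for a, y in rows if a is not None]
    missing = [y for a, y in rows if a is None]
    if not known or not missing:
        raise ValueError("need rows both with and without age")
    return {
        "fill_rate": len(known) / len(rows),
        "survival_known": sum(known) / len(known),
        "survival_missing": sum(missing) / len(missing),
    }


# Toy rows, NOT the real Titanic data
toy = [(22, 0), (38, 1), (None, 0), (4, 1), (None, 1), (30, 0), (None, 0), (54, 0)]
summary = age_missingness_summary(toy)
for name, value in summary.items():
    print(f"{name}: {value:.1%}")
```

A gap between the two survival rates suggests that an "age is missing" indicator may be worth keeping as its own feature, whatever is done with the age value itself.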