Titanic Survival Dataset

Exploratory Data Analysis

By: Michael Kam

Overview

Check the subslide down below by pressing the down arrow on the keyboard or the navigation panel.

Missing Values

Relationship between Ticket Class, Sex, Port Embarked, and Survival

Ticket Fare Distribution based on Ticket Class, Sex, and Survival

Correlation of Numerical Variables

Logistic Regression

Based on Wikipedia, logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist.

Logistic regression model is commonly used for classification problem in machine learning. Press the arrow down to see the formula.

The following is the formula of the logistic regression:

$$\textit{l} = \log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_m x_m$$

where $\textit{l}$ is the log odds model and $\beta_i$ is the parameter or the coefficient of independent variable $x_i$.

Next Steps

  • Impute or drop missing values if necessary
  • Build a baseline model with Logistic Regression
  • Use p-value forward selection to build the final model
  • Evaluate the confusion matrix
  • Compare with XGBoost model