Data Analysis using Pandas

A Statistical analysis of the survivability of Titanic passengers.

(15 points total)

NAME = "Justin Gabb"

For this project, we will analyze the open dataset with data on the passengers aboard the Titanic.

The data file for this project can be downloaded from Kaggle website: https://www.kaggle.com/c/titanic/data, file train.csv. It is also attached to the assignment page. The definition of all variables can be found on the same Kaggle page, in the Data Dictionary section.

Read the data from the file into pandas DataFrame. Analyze, clean and transform the data to answer the following question:

What categories of passengers were most likely to survive the Titanic disaster?

Question 1.

*The answer to the main question- What categories of passengers were most likely to survive the Titanic disaster?_(2 points)

Answer: The categories of passengers to most likely survive are 1st class females with a sibling or spouse, zero children and in the age category 25-64 as these categories have the highest combined means and summed survival, the mean is 100 and the total survived is 24.

  • The detailed explanation of the logic of the analysis (2 points)

Answer: The answer is derived from 96 possible combinations of 5 categories taking the highest mean and then the highest sum of survived passengers. All Categories were pre analysed by survival probability and for correlation, certain ones were excluded as they did not provide valuable insight into survival rates. Categories were divided by age groups, sex, class, children, and spouses&siblings and then survival averaged and the highest average sum was located using nlargest function in python.

Question 2. (3 points)

  • What other attributes did you use for the analysis? Explain how you used them and why you decided to use them.

Answer: The attributes that were used and then grouped by([ "Pclass", "IsFemale", "Age_Groups","Parch","SibSp"]) and then ["Survived"] was aggregated by (["mean", "count", "sum"]). Which was then sorted by nlargest mean and then sum. These attributes were used because they showed the highest correlation or meaningfulness to survival. Age categories were also used ['0-14','15-24', '25-64', '>65'] because that allowed for binning of survival based on age which allowed for more granular probabilities.

  • Provide a complete list of all attributes used.

Answer: ('Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Age_Groups', 'IsFemale').

Question 3. (3 points)

  • Did you engineer any attributes (created new attributes)? If yes, explain the rationale and how the new attributes were used in the analysis?

Answer: Age bins were created based on statistics Canada & International norms for age groupings to calculate a more granular survival probability based on age. [children ='0-14', youth='15-24', adult='25-64', senior='>65']. The following attributes, 'SibSp' and 'Parch', were engineered from a scale of 1-6 & 1-8 to either true of false as that provided easier calculations of survival probabilities without too many combinations which were creating a way too of a granular calculation of probabilties and became too hard to disguish survival groups. Also there was no correlation in higher survival and a higher amount of children or sibssp there was just a correlation if that variable was yes or no. Sex attribute was engineered to IsFemale True or False 1 or 0 as its easier to peform calculations on when you have matching data types and goes well with the other engineered attibutes of 1 and 0's. I also engineered embarked code to C:1, S:2, Q:3 for better statisitcal calcuations but this category did not score well in the correlation matrix and did not have high survival ratios when calculated manually so was excluded in final analysis.

  • If you have excluded any attributes from the analysis, provide an explanation why you believe they can be excluded.

Answer: The following attributes were excluded: 'Ticket', 'Fare', 'Cabin', 'Embarked'. Because there was not a strong enough correlation with survival and no meaningful survival averages could be calculated for large similiar groups. For the Cabin attribute: 687 / 891 records are missing and cabin assumptions based on names is just too fuzzy of a logic-for example do two adult brothers with the same last name stay in the same room or seperate?, so it wasnt very useful. Embarked's correlation from a sociological perspective doesnt make any sense and when engineered to more useful data type it didnt score high in the correlation matrix and survival rates based on embarked locations are low. Ticket number just doesnt make any sense either as there is no correlating with survival as each ticket number is unique to each passenger and cannot be grouped into anything meaningful. Fare is best represented by passenger class as the higher the fare the higher the class which can be proven in a simple bar chart, so passenger class has already binned Fare abstractly.

Question 4. (5 points)

  • How did you treat missing values for those attributes that you included in the analysis (for example, age attribute)? Provide a detailed explanation in the comments.

Answer: Missing age attributes were grouped into two groups, by age 0-14 as children can be distiguished by the salutation in their names 'Master', and the rest just adults 15-100 as there is no way to distinguish between adult age groups. the mean child (age<=14) age was used to fill in NAN's where 'Master' was found and the median was used to fill in the adult age group NAN's as there were outliers effecting the mean. Any other NANs were filled in with zero values.

Random interesting fact: There was one 80 year old man that did survive and he is an extreme anomaly defying all odds!

In [1]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set()
In [2]:
t_surv = pd.read_csv('C:/Python development/U of T Data Analysis 1/Assignments/Assignment 2/train.csv', index_col=0, sep=',')
In [3]:
#general overview of data structure, basic dataframe statistics, and distribution of numerical fields
t_surv.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB
In [4]:
t_surv.describe()
Out[4]:
Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
In [5]:
t_surv.isnull().sum()
Out[5]:
Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64
In [6]:
t_surv.nunique()
Out[6]:
Survived      2
Pclass        3
Name        891
Sex           2
Age          88
SibSp         7
Parch         7
Ticket      681
Fare        248
Cabin       147
Embarked      3
dtype: int64
In [7]:
#histograms show outliers on Age old end and SibSp and Parch
t_surv.hist(bins=50, figsize=(12,8));
In [8]:
#Lets see some basic survival stats
t_surv.Survived.aggregate(['mean','count','sum'])
Out[8]:
mean       0.383838
count    891.000000
sum      342.000000
Name: Survived, dtype: float64
In [9]:
sex_survival = t_surv.groupby(['Sex'])['Survived'].aggregate(['mean','count','sum'])
sex_survival
Out[9]:
mean count sum
Sex
female 0.742038 314 233
male 0.188908 577 109

pre analysis of which features will provide the most valid correlations to high survival.¶

In [10]:
#correlation with having kids but no correlation with more kids and higher survival, survival doesnt increase with the amount of kids.
sex_parch_panalysis = t_surv.groupby(['Sex', 'Parch'])['Survived'].aggregate(['mean','count','sum'])
sex_parch_panalysis['sum'].plot(kind='bar',figsize=(10,5));
In [11]:
#there is a correlation with having sibs/sp and higher survival, however survival doesnt increase with a higher amount 
#of sibs / sp
sex_sibsp_panalysis = t_surv.groupby(['Sex', 'SibSp'])['Survived'].aggregate(['mean','count','sum'])
sex_sibsp_panalysis['sum'].plot(kind='bar',figsize=(10,5));
In [12]:
#Sex and fare is too messy to derive anything meaninful about survival- is very long.
sex_fare_panalysis = t_surv.groupby(['Sex', 'Fare'])['Survived'].aggregate(['mean','count','sum'])
In [13]:
#Engineer embarked code to C:1, S:2, Q:3 for better statisitcal calcuations
t_surv['embarked_code'] = t_surv['Embarked'].replace({'C':1,'S':2,'Q':3})
embarked_panalysis = t_surv.groupby(['embarked_code'])['Survived'].aggregate(['mean','count','sum'])
embarked_panalysis
Out[13]:
mean count sum
embarked_code
1.0 0.553571 168 93
2.0 0.336957 644 217
3.0 0.389610 77 30
In [14]:
#FARE is better represented by Pclass as higher fares are represented by higher class, so its ok to exclude 'fare'
#from in depth analysis
fare_class_panalysis = t_surv.groupby(['Pclass'])['Fare'].mean()
fare_class_panalysis.plot(kind='bar',figsize=(10,5));
sns.lmplot(x='Age', y='Fare', hue='Pclass', data=t_surv, fit_reg=False)
Out[14]:
<seaborn.axisgrid.FacetGrid at 0x1770042d308>
In [16]:
#sex and class however shows meaningful correlation to survival
sex_pclass_panalysis = t_surv.groupby(['Sex', 'Pclass'])['Survived'].aggregate(['mean','count','sum'])
sex_pclass_panalysis['mean'].plot(kind='bar',figsize=(10,5));
sex_pclass_panalysis
Out[16]:
mean count sum
Sex Pclass
female 1 0.968085 94 91
2 0.921053 76 70
3 0.500000 144 72
male 1 0.368852 122 45
2 0.157407 108 17
3 0.135447 347 47
In [17]:
#feature engineering and impute missing values. Children under 14 have their mean, and adults will be using median to 
#eliminate outliers effect on the average age as there is one 80 year old -that survived! Surprisingly.
#replace number of kids with either has kids or not as amount of kids has no effect on survival see above graphs
t_surv['Parch'] = t_surv['Parch'].replace([1,2,3,4,5,6], 1)

#replace number of sibs with either has sibs/Sp or not as amount has no effect on survival see above graphs
t_surv['SibSp'] = t_surv['SibSp'].replace([1,2,3,4,5,6,7,8], 1)

#Convert sex column to isfemale true or false 1 or 0 as its easier to calculate stats
t_surv['IsFemale'] = (t_surv['Sex'] =='female').astype(int)
In [18]:
t_surv.head(10)
Out[18]:
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked embarked_code IsFemale
PassengerId
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 2.0 0
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 1.0 1
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 2.0 1
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 2.0 1
5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 2.0 0
6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q 3.0 0
7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S 2.0 0
8 0 3 Palsson, Master. Gosta Leonard male 2.0 1 1 349909 21.0750 NaN S 2.0 0
9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 1 347742 11.1333 NaN S 2.0 1
10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C 1.0 1
In [19]:
#Impute childrens age and adult ages seperately as childrens missing age valuescan be derived from the word 'master'
#in their names.

#impute childrens age
children = t_surv[(t_surv["Age"] <= 14)]
average_child_age = children.Age.mean()

#target only children missing ages using name 'master'
missingchildren = t_surv[t_surv['Name'].str.contains('Master') & t_surv['Age'].isnull()]
index1 = missingchildren.index
t_surv.loc[index1,'Age'] = average_child_age

#impute adults age using median as there are age outliers skewing age average
not_child = t_surv[t_surv.Age >= 15]
average_age = not_child.Age.median()
t_surv['Age'] = t_surv['Age'].fillna(average_age)
missing_values2 = t_surv.isnull().sum()
missing_values2
Out[19]:
Survived           0
Pclass             0
Name               0
Sex                0
Age                0
SibSp              0
Parch              0
Ticket             0
Fare               0
Cabin            687
Embarked           2
embarked_code      2
IsFemale           0
dtype: int64
In [20]:
#enginner feature on age bins, statistics canada shows age categories as children 0-14, 
#youth 15-24, adult 25-64, elderly>65- created corresponding age bins
bins = [0, 15, 25, 65, 100]
bin_labels = ['0-14','15-24', '25-64', '>65']
t_surv["Age_Groups"] = pd.cut(t_surv["Age"], bins, right=False, labels = bin_labels)

#lets get rid of those unattractive nans
t_surv = t_surv.replace(np.nan,0)
In [21]:
#can now calculate a correlation matrix for fun
correlation_matrix = t_surv.corr(method="spearman")
correlation_matrix
plt.figure(figsize=(10,5))
ax = sns.heatmap(correlation_matrix, cmap="seismic", annot=True, linewidths=.5, cbar_kws={"label": "Spearman's ρ"}, vmin=-1, vmax=1);
ax.set_title("Heat map");
In [22]:
#calc survival averages based on every possible relevant combination
prob_matrix = t_surv.groupby([ "Pclass", "IsFemale", "Age_Groups","Parch","SibSp"])["Survived"].agg(["mean", "count", "sum"])
prob_matrix = prob_matrix.replace(np.nan,0)
In [ ]:
#lets see if the kids are ok
children = t_surv[(t_surv["Age"] <= 14)]
gen_child_surv = children["Survived"].agg(["mean", "count", "sum"])
children_survival = children.groupby(["Pclass", "IsFemale"])["Survived"].agg(["mean", "count", "sum"])
In [31]:
gen_child_surv
Out[31]:
mean      0.580247
count    81.000000
sum      47.000000
Name: Survived, dtype: float64
In [32]:
children_survival
Out[32]:
mean count sum
Pclass IsFemale
1 0 1.000000 3 3
1 0.500000 2 1
2 0 1.000000 9 9
1 1.000000 10 10
3 0 0.354839 31 11
1 0.500000 26 13
In [24]:
top_10_survival = prob_matrix.nlargest(10,['mean', 'sum'])

top_10_survival
Out[24]:
mean count sum
Pclass IsFemale Age_Groups Parch SibSp
1 1 25-64 0 1 1.0 24 24.0
15-24 0 0 1.0 8 8.0
25-64 1 0 1.0 8 8.0
15-24 1 0 1.0 7 7.0
2 0 0-14 1 1 1.0 7 7.0
1 1 15-24 0 1 1.0 6 6.0
2 1 0-14 1 1 1.0 5 5.0
15-24 1 1 1.0 5 5.0
25-64 1 0 1.0 5 5.0
1 1 15-24 1 1 1.0 4 4.0
In [27]:
plt.figure(figsize=(15,10))
ax = top_10_survival['sum'].plot(kind='bar', rot=0, figsize=(15, 10));
ax.set_ylabel("Highest Survival");