Data Analysis using Pandas

A Statistical analysis of the survivability of Titanic passengers.

(15 points total)

NAME = "Justin Gabb"

For this project, we will analyze the open dataset with data on the passengers aboard the Titanic.

The data file for this project can be downloaded from Kaggle website: https://www.kaggle.com/c/titanic/data, file train.csv. It is also attached to the assignment page. The definition of all variables can be found on the same Kaggle page, in the Data Dictionary section.

Read the data from the file into pandas DataFrame. Analyze, clean and transform the data to answer the following question:

What categories of passengers were most likely to survive the Titanic disaster?

Question 1.

*The answer to the main question- What categories of passengers were most likely to survive the Titanic disaster?_(2 points)

Answer: The categories of passengers to most likely survive are 1st class females with a sibling or spouse, zero children and in the age category 25-64 as these categories have the highest combined means and summed survival, the mean is 100 and the total survived is 24.

The detailed explanation of the logic of the analysis (2 points)

Answer: The answer is derived from 96 possible combinations of 5 categories taking the highest mean and then the highest sum of survived passengers. All Categories were pre analysed by survival probability and for correlation, certain ones were excluded as they did not provide valuable insight into survival rates. Categories were divided by age groups, sex, class, children, and spouses&siblings and then survival averaged and the highest average sum was located using nlargest function in python.

Question 2. (3 points)

What other attributes did you use for the analysis? Explain how you used them and why you decided to use them.

Answer: The attributes that were used and then grouped by([ "Pclass", "IsFemale", "Age_Groups","Parch","SibSp"]) and then ["Survived"] was aggregated by (["mean", "count", "sum"]). Which was then sorted by nlargest mean and then sum. These attributes were used because they showed the highest correlation or meaningfulness to survival. Age categories were also used ['0-14','15-24', '25-64', '>65'] because that allowed for binning of survival based on age which allowed for more granular probabilities.

Provide a complete list of all attributes used.

Answer: ('Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Age_Groups', 'IsFemale').

Question 3. (3 points)

Did you engineer any attributes (created new attributes)? If yes, explain the rationale and how the new attributes were used in the analysis?

Answer: Age bins were created based on statistics Canada & International norms for age groupings to calculate a more granular survival probability based on age. [children ='0-14', youth='15-24', adult='25-64', senior='>65']. The following attributes, 'SibSp' and 'Parch', were engineered from a scale of 1-6 & 1-8 to either true of false as that provided easier calculations of survival probabilities without too many combinations which were creating a way too of a granular calculation of probabilties and became too hard to disguish survival groups. Also there was no correlation in higher survival and a higher amount of children or sibssp there was just a correlation if that variable was yes or no. Sex attribute was engineered to IsFemale True or False 1 or 0 as its easier to peform calculations on when you have matching data types and goes well with the other engineered attibutes of 1 and 0's. I also engineered embarked code to C:1, S:2, Q:3 for better statisitcal calcuations but this category did not score well in the correlation matrix and did not have high survival ratios when calculated manually so was excluded in final analysis.

If you have excluded any attributes from the analysis, provide an explanation why you believe they can be excluded.

Answer: The following attributes were excluded: 'Ticket', 'Fare', 'Cabin', 'Embarked'. Because there was not a strong enough correlation with survival and no meaningful survival averages could be calculated for large similiar groups. For the Cabin attribute: 687 / 891 records are missing and cabin assumptions based on names is just too fuzzy of a logic-for example do two adult brothers with the same last name stay in the same room or seperate?, so it wasnt very useful. Embarked's correlation from a sociological perspective doesnt make any sense and when engineered to more useful data type it didnt score high in the correlation matrix and survival rates based on embarked locations are low. Ticket number just doesnt make any sense either as there is no correlating with survival as each ticket number is unique to each passenger and cannot be grouped into anything meaningful. Fare is best represented by passenger class as the higher the fare the higher the class which can be proven in a simple bar chart, so passenger class has already binned Fare abstractly.

Question 4. (5 points)

How did you treat missing values for those attributes that you included in the analysis (for example, age attribute)? Provide a detailed explanation in the comments.

Answer: Missing age attributes were grouped into two groups, by age 0-14 as children can be distiguished by the salutation in their names 'Master', and the rest just adults 15-100 as there is no way to distinguish between adult age groups. the mean child (age<=14) age was used to fill in NAN's where 'Master' was found and the median was used to fill in the adult age group NAN's as there were outliers effecting the mean. Any other NANs were filled in with zero values.

Random interesting fact: There was one 80 year old man that did survive and he is an extreme anomaly defying all odds!

In [1]:

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set()

In [2]:

t_surv = pd.read_csv('C:/Python development/U of T Data Analysis 1/Assignments/Assignment 2/train.csv', index_col=0, sep=',')

In [3]:

#general overview of data structure, basic dataframe statistics, and distribution of numerical fields
t_surv.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB

In [4]:

t_surv.describe()

Out[4]:

	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

In [5]:

t_surv.isnull().sum()

Out[5]:

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [6]:

t_surv.nunique()

Out[6]:

Survived      2
Pclass        3
Name        891
Sex           2
Age          88
SibSp         7
Parch         7
Ticket      681
Fare        248
Cabin       147
Embarked      3
dtype: int64

In [7]:

#histograms show outliers on Age old end and SibSp and Parch
t_surv.hist(bins=50, figsize=(12,8));

In [8]:

#Lets see some basic survival stats
t_surv.Survived.aggregate(['mean','count','sum'])

Out[8]:

mean       0.383838
count    891.000000
sum      342.000000
Name: Survived, dtype: float64

In [9]:

sex_survival = t_surv.groupby(['Sex'])['Survived'].aggregate(['mean','count','sum'])
sex_survival

Out[9]:

	mean	count	sum
Sex
female	0.742038	314	233
male	0.188908	577	109

pre analysis of which features will provide the most valid correlations to high survival.¶

In [10]:

#correlation with having kids but no correlation with more kids and higher survival, survival doesnt increase with the amount of kids.
sex_parch_panalysis = t_surv.groupby(['Sex', 'Parch'])['Survived'].aggregate(['mean','count','sum'])
sex_parch_panalysis['sum'].plot(kind='bar',figsize=(10,5));

In [11]:

#there is a correlation with having sibs/sp and higher survival, however survival doesnt increase with a higher amount 
#of sibs / sp
sex_sibsp_panalysis = t_surv.groupby(['Sex', 'SibSp'])['Survived'].aggregate(['mean','count','sum'])
sex_sibsp_panalysis['sum'].plot(kind='bar',figsize=(10,5));

In [12]:

#Sex and fare is too messy to derive anything meaninful about survival- is very long.
sex_fare_panalysis = t_surv.groupby(['Sex', 'Fare'])['Survived'].aggregate(['mean','count','sum'])

In [13]:

#Engineer embarked code to C:1, S:2, Q:3 for better statisitcal calcuations
t_surv['embarked_code'] = t_surv['Embarked'].replace({'C':1,'S':2,'Q':3})
embarked_panalysis = t_surv.groupby(['embarked_code'])['Survived'].aggregate(['mean','count','sum'])
embarked_panalysis

Out[13]:

	mean	count	sum
embarked_code
1.0	0.553571	168	93
2.0	0.336957	644	217
3.0	0.389610	77	30

In [14]:

#FARE is better represented by Pclass as higher fares are represented by higher class, so its ok to exclude 'fare'
#from in depth analysis
fare_class_panalysis = t_surv.groupby(['Pclass'])['Fare'].mean()
fare_class_panalysis.plot(kind='bar',figsize=(10,5));
sns.lmplot(x='Age', y='Fare', hue='Pclass', data=t_surv, fit_reg=False)

Out[14]:

<seaborn.axisgrid.FacetGrid at 0x1770042d308>

In [16]:

#sex and class however shows meaningful correlation to survival
sex_pclass_panalysis = t_surv.groupby(['Sex', 'Pclass'])['Survived'].aggregate(['mean','count','sum'])
sex_pclass_panalysis['mean'].plot(kind='bar',figsize=(10,5));
sex_pclass_panalysis

Out[16]:

		mean	count	sum
Sex	Pclass
female	1	0.968085	94	91
	2	0.921053	76	70
	3	0.500000	144	72
male	1	0.368852	122	45
	2	0.157407	108	17
	3	0.135447	347	47

In [17]:

#feature engineering and impute missing values. Children under 14 have their mean, and adults will be using median to 
#eliminate outliers effect on the average age as there is one 80 year old -that survived! Surprisingly.
#replace number of kids with either has kids or not as amount of kids has no effect on survival see above graphs
t_surv['Parch'] = t_surv['Parch'].replace([1,2,3,4,5,6], 1)

#replace number of sibs with either has sibs/Sp or not as amount has no effect on survival see above graphs
t_surv['SibSp'] = t_surv['SibSp'].replace([1,2,3,4,5,6,7,8], 1)

#Convert sex column to isfemale true or false 1 or 0 as its easier to calculate stats
t_surv['IsFemale'] = (t_surv['Sex'] =='female').astype(int)

In [18]:

t_surv.head(10)

Out[18]:

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	embarked_code	IsFemale
PassengerId
1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S	2.0	0
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C	1.0	1
3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S	2.0	1
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S	2.0	1
5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S	2.0	0
6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q	3.0	0
7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S	2.0	0
8	0	3	Palsson, Master. Gosta Leonard	male	2.0	1	1	349909	21.0750	NaN	S	2.0	0
9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	1	347742	11.1333	NaN	S	2.0	1
10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C	1.0	1

In [19]:

#Impute childrens age and adult ages seperately as childrens missing age valuescan be derived from the word 'master'
#in their names.

#impute childrens age
children = t_surv[(t_surv["Age"] <= 14)]
average_child_age = children.Age.mean()

#target only children missing ages using name 'master'
missingchildren = t_surv[t_surv['Name'].str.contains('Master') & t_surv['Age'].isnull()]
index1 = missingchildren.index
t_surv.loc[index1,'Age'] = average_child_age

#impute adults age using median as there are age outliers skewing age average
not_child = t_surv[t_surv.Age >= 15]
average_age = not_child.Age.median()
t_surv['Age'] = t_surv['Age'].fillna(average_age)
missing_values2 = t_surv.isnull().sum()
missing_values2

Out[19]:

Survived           0
Pclass             0
Name               0
Sex                0
Age                0
SibSp              0
Parch              0
Ticket             0
Fare               0
Cabin            687
Embarked           2
embarked_code      2
IsFemale           0
dtype: int64

In [20]:

#enginner feature on age bins, statistics canada shows age categories as children 0-14, 
#youth 15-24, adult 25-64, elderly>65- created corresponding age bins
bins = [0, 15, 25, 65, 100]
bin_labels = ['0-14','15-24', '25-64', '>65']
t_surv["Age_Groups"] = pd.cut(t_surv["Age"], bins, right=False, labels = bin_labels)

#lets get rid of those unattractive nans
t_surv = t_surv.replace(np.nan,0)

In [21]:

#can now calculate a correlation matrix for fun
correlation_matrix = t_surv.corr(method="spearman")
correlation_matrix
plt.figure(figsize=(10,5))
ax = sns.heatmap(correlation_matrix, cmap="seismic", annot=True, linewidths=.5, cbar_kws={"label": "Spearman's ρ"}, vmin=-1, vmax=1);
ax.set_title("Heat map");

In [22]:

#calc survival averages based on every possible relevant combination
prob_matrix = t_surv.groupby([ "Pclass", "IsFemale", "Age_Groups","Parch","SibSp"])["Survived"].agg(["mean", "count", "sum"])
prob_matrix = prob_matrix.replace(np.nan,0)

In [ ]:

#lets see if the kids are ok
children = t_surv[(t_surv["Age"] <= 14)]
gen_child_surv = children["Survived"].agg(["mean", "count", "sum"])
children_survival = children.groupby(["Pclass", "IsFemale"])["Survived"].agg(["mean", "count", "sum"])

In [31]:

gen_child_surv

Out[31]:

mean      0.580247
count    81.000000
sum      47.000000
Name: Survived, dtype: float64

In [32]:

children_survival

Out[32]:

		mean	count	sum
Pclass	IsFemale
1	0	1.000000	3	3
1	1	0.500000	2	1
2	0	1.000000	9	9
2	1	1.000000	10	10
3	0	0.354839	31	11
3	1	0.500000	26	13

In [24]:

top_10_survival = prob_matrix.nlargest(10,['mean', 'sum'])

top_10_survival

Out[24]:

					mean	count	sum
Pclass	IsFemale	Age_Groups	Parch	SibSp
1	1	25-64	0	1	1.0	24	24.0
		15-24	0	0	1.0	8	8.0
		25-64	1	0	1.0	8	8.0
		15-24	1	0	1.0	7	7.0
2	0	0-14	1	1	1.0	7	7.0
1	1	15-24	0	1	1.0	6	6.0
2	1	0-14	1	1	1.0	5	5.0
		15-24	1	1	1.0	5	5.0
		25-64	1	0	1.0	5	5.0
1	1	15-24	1	1	1.0	4	4.0

In [27]:

plt.figure(figsize=(15,10))
ax = top_10_survival['sum'].plot(kind='bar', rot=0, figsize=(15, 10));
ax.set_ylabel("Highest Survival");