(15 points total)
NAME = "Justin Gabb"
For this project, we will analyze the open dataset with data on the passengers aboard the Titanic.
The data file for this project can be downloaded from Kaggle website: https://www.kaggle.com/c/titanic/data, file train.csv. It is also attached to the assignment page. The definition of all variables can be found on the same Kaggle page, in the Data Dictionary section.
Read the data from the file into pandas DataFrame. Analyze, clean and transform the data to answer the following question:
What categories of passengers were most likely to survive the Titanic disaster?
*The answer to the main question- What categories of passengers were most likely to survive the Titanic disaster?_(2 points)
Answer: The categories of passengers to most likely survive are 1st class females with a sibling or spouse, zero children and in the age category 25-64 as these categories have the highest combined means and summed survival, the mean is 100 and the total survived is 24.
Answer: The answer is derived from 96 possible combinations of 5 categories taking the highest mean and then the highest sum of survived passengers. All Categories were pre analysed by survival probability and for correlation, certain ones were excluded as they did not provide valuable insight into survival rates. Categories were divided by age groups, sex, class, children, and spouses&siblings and then survival averaged and the highest average sum was located using nlargest function in python.
Question 2. (3 points)
Answer: The attributes that were used and then grouped by([ "Pclass", "IsFemale", "Age_Groups","Parch","SibSp"]) and then ["Survived"] was aggregated by (["mean", "count", "sum"]). Which was then sorted by nlargest mean and then sum. These attributes were used because they showed the highest correlation or meaningfulness to survival. Age categories were also used ['0-14','15-24', '25-64', '>65'] because that allowed for binning of survival based on age which allowed for more granular probabilities.
Answer: ('Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Age_Groups', 'IsFemale').
Question 3. (3 points)
Answer: Age bins were created based on statistics Canada & International norms for age groupings to calculate a more granular survival probability based on age. [children ='0-14', youth='15-24', adult='25-64', senior='>65']. The following attributes, 'SibSp' and 'Parch', were engineered from a scale of 1-6 & 1-8 to either true of false as that provided easier calculations of survival probabilities without too many combinations which were creating a way too of a granular calculation of probabilties and became too hard to disguish survival groups. Also there was no correlation in higher survival and a higher amount of children or sibssp there was just a correlation if that variable was yes or no. Sex attribute was engineered to IsFemale True or False 1 or 0 as its easier to peform calculations on when you have matching data types and goes well with the other engineered attibutes of 1 and 0's. I also engineered embarked code to C:1, S:2, Q:3 for better statisitcal calcuations but this category did not score well in the correlation matrix and did not have high survival ratios when calculated manually so was excluded in final analysis.
Answer: The following attributes were excluded: 'Ticket', 'Fare', 'Cabin', 'Embarked'. Because there was not a strong enough correlation with survival and no meaningful survival averages could be calculated for large similiar groups. For the Cabin attribute: 687 / 891 records are missing and cabin assumptions based on names is just too fuzzy of a logic-for example do two adult brothers with the same last name stay in the same room or seperate?, so it wasnt very useful. Embarked's correlation from a sociological perspective doesnt make any sense and when engineered to more useful data type it didnt score high in the correlation matrix and survival rates based on embarked locations are low. Ticket number just doesnt make any sense either as there is no correlating with survival as each ticket number is unique to each passenger and cannot be grouped into anything meaningful. Fare is best represented by passenger class as the higher the fare the higher the class which can be proven in a simple bar chart, so passenger class has already binned Fare abstractly.
Question 4. (5 points)
age attribute)? Provide a detailed explanation in the comments.Answer: Missing age attributes were grouped into two groups, by age 0-14 as children can be distiguished by the salutation in their names 'Master', and the rest just adults 15-100 as there is no way to distinguish between adult age groups. the mean child (age<=14) age was used to fill in NAN's where 'Master' was found and the median was used to fill in the adult age group NAN's as there were outliers effecting the mean. Any other NANs were filled in with zero values.
Random interesting fact: There was one 80 year old man that did survive and he is an extreme anomaly defying all odds!
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()
t_surv = pd.read_csv('C:/Python development/U of T Data Analysis 1/Assignments/Assignment 2/train.csv', index_col=0, sep=',')
#general overview of data structure, basic dataframe statistics, and distribution of numerical fields
t_surv.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 891 entries, 1 to 891 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Survived 891 non-null int64 1 Pclass 891 non-null int64 2 Name 891 non-null object 3 Sex 891 non-null object 4 Age 714 non-null float64 5 SibSp 891 non-null int64 6 Parch 891 non-null int64 7 Ticket 891 non-null object 8 Fare 891 non-null float64 9 Cabin 204 non-null object 10 Embarked 889 non-null object dtypes: float64(2), int64(4), object(5) memory usage: 83.5+ KB
t_surv.describe()
| Survived | Pclass | Age | SibSp | Parch | Fare | |
|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
t_surv.isnull().sum()
Survived 0 Pclass 0 Name 0 Sex 0 Age 177 SibSp 0 Parch 0 Ticket 0 Fare 0 Cabin 687 Embarked 2 dtype: int64
t_surv.nunique()
Survived 2 Pclass 3 Name 891 Sex 2 Age 88 SibSp 7 Parch 7 Ticket 681 Fare 248 Cabin 147 Embarked 3 dtype: int64
#histograms show outliers on Age old end and SibSp and Parch
t_surv.hist(bins=50, figsize=(12,8));
#Lets see some basic survival stats
t_surv.Survived.aggregate(['mean','count','sum'])
mean 0.383838 count 891.000000 sum 342.000000 Name: Survived, dtype: float64
sex_survival = t_surv.groupby(['Sex'])['Survived'].aggregate(['mean','count','sum'])
sex_survival
| mean | count | sum | |
|---|---|---|---|
| Sex | |||
| female | 0.742038 | 314 | 233 |
| male | 0.188908 | 577 | 109 |
#correlation with having kids but no correlation with more kids and higher survival, survival doesnt increase with the amount of kids.
sex_parch_panalysis = t_surv.groupby(['Sex', 'Parch'])['Survived'].aggregate(['mean','count','sum'])
sex_parch_panalysis['sum'].plot(kind='bar',figsize=(10,5));
#there is a correlation with having sibs/sp and higher survival, however survival doesnt increase with a higher amount
#of sibs / sp
sex_sibsp_panalysis = t_surv.groupby(['Sex', 'SibSp'])['Survived'].aggregate(['mean','count','sum'])
sex_sibsp_panalysis['sum'].plot(kind='bar',figsize=(10,5));
#Sex and fare is too messy to derive anything meaninful about survival- is very long.
sex_fare_panalysis = t_surv.groupby(['Sex', 'Fare'])['Survived'].aggregate(['mean','count','sum'])
#Engineer embarked code to C:1, S:2, Q:3 for better statisitcal calcuations
t_surv['embarked_code'] = t_surv['Embarked'].replace({'C':1,'S':2,'Q':3})
embarked_panalysis = t_surv.groupby(['embarked_code'])['Survived'].aggregate(['mean','count','sum'])
embarked_panalysis
| mean | count | sum | |
|---|---|---|---|
| embarked_code | |||
| 1.0 | 0.553571 | 168 | 93 |
| 2.0 | 0.336957 | 644 | 217 |
| 3.0 | 0.389610 | 77 | 30 |
#FARE is better represented by Pclass as higher fares are represented by higher class, so its ok to exclude 'fare'
#from in depth analysis
fare_class_panalysis = t_surv.groupby(['Pclass'])['Fare'].mean()
fare_class_panalysis.plot(kind='bar',figsize=(10,5));
sns.lmplot(x='Age', y='Fare', hue='Pclass', data=t_surv, fit_reg=False)
<seaborn.axisgrid.FacetGrid at 0x1770042d308>
#sex and class however shows meaningful correlation to survival
sex_pclass_panalysis = t_surv.groupby(['Sex', 'Pclass'])['Survived'].aggregate(['mean','count','sum'])
sex_pclass_panalysis['mean'].plot(kind='bar',figsize=(10,5));
sex_pclass_panalysis
| mean | count | sum | ||
|---|---|---|---|---|
| Sex | Pclass | |||
| female | 1 | 0.968085 | 94 | 91 |
| 2 | 0.921053 | 76 | 70 | |
| 3 | 0.500000 | 144 | 72 | |
| male | 1 | 0.368852 | 122 | 45 |
| 2 | 0.157407 | 108 | 17 | |
| 3 | 0.135447 | 347 | 47 |
#feature engineering and impute missing values. Children under 14 have their mean, and adults will be using median to
#eliminate outliers effect on the average age as there is one 80 year old -that survived! Surprisingly.
#replace number of kids with either has kids or not as amount of kids has no effect on survival see above graphs
t_surv['Parch'] = t_surv['Parch'].replace([1,2,3,4,5,6], 1)
#replace number of sibs with either has sibs/Sp or not as amount has no effect on survival see above graphs
t_surv['SibSp'] = t_surv['SibSp'].replace([1,2,3,4,5,6,7,8], 1)
#Convert sex column to isfemale true or false 1 or 0 as its easier to calculate stats
t_surv['IsFemale'] = (t_surv['Sex'] =='female').astype(int)
t_surv.head(10)
| Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | embarked_code | IsFemale | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PassengerId | |||||||||||||
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 2.0 | 0 |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1.0 | 1 |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 2.0 | 1 |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 2.0 | 1 |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 2.0 | 0 |
| 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q | 3.0 | 0 |
| 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | 2.0 | 0 |
| 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 1 | 1 | 349909 | 21.0750 | NaN | S | 2.0 | 0 |
| 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 1 | 347742 | 11.1333 | NaN | S | 2.0 | 1 |
| 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C | 1.0 | 1 |
#Impute childrens age and adult ages seperately as childrens missing age valuescan be derived from the word 'master'
#in their names.
#impute childrens age
children = t_surv[(t_surv["Age"] <= 14)]
average_child_age = children.Age.mean()
#target only children missing ages using name 'master'
missingchildren = t_surv[t_surv['Name'].str.contains('Master') & t_surv['Age'].isnull()]
index1 = missingchildren.index
t_surv.loc[index1,'Age'] = average_child_age
#impute adults age using median as there are age outliers skewing age average
not_child = t_surv[t_surv.Age >= 15]
average_age = not_child.Age.median()
t_surv['Age'] = t_surv['Age'].fillna(average_age)
missing_values2 = t_surv.isnull().sum()
missing_values2
Survived 0 Pclass 0 Name 0 Sex 0 Age 0 SibSp 0 Parch 0 Ticket 0 Fare 0 Cabin 687 Embarked 2 embarked_code 2 IsFemale 0 dtype: int64
#enginner feature on age bins, statistics canada shows age categories as children 0-14,
#youth 15-24, adult 25-64, elderly>65- created corresponding age bins
bins = [0, 15, 25, 65, 100]
bin_labels = ['0-14','15-24', '25-64', '>65']
t_surv["Age_Groups"] = pd.cut(t_surv["Age"], bins, right=False, labels = bin_labels)
#lets get rid of those unattractive nans
t_surv = t_surv.replace(np.nan,0)
#can now calculate a correlation matrix for fun
correlation_matrix = t_surv.corr(method="spearman")
correlation_matrix
plt.figure(figsize=(10,5))
ax = sns.heatmap(correlation_matrix, cmap="seismic", annot=True, linewidths=.5, cbar_kws={"label": "Spearman's ρ"}, vmin=-1, vmax=1);
ax.set_title("Heat map");
#calc survival averages based on every possible relevant combination
prob_matrix = t_surv.groupby([ "Pclass", "IsFemale", "Age_Groups","Parch","SibSp"])["Survived"].agg(["mean", "count", "sum"])
prob_matrix = prob_matrix.replace(np.nan,0)
#lets see if the kids are ok
children = t_surv[(t_surv["Age"] <= 14)]
gen_child_surv = children["Survived"].agg(["mean", "count", "sum"])
children_survival = children.groupby(["Pclass", "IsFemale"])["Survived"].agg(["mean", "count", "sum"])
gen_child_surv
mean 0.580247 count 81.000000 sum 47.000000 Name: Survived, dtype: float64
children_survival
| mean | count | sum | ||
|---|---|---|---|---|
| Pclass | IsFemale | |||
| 1 | 0 | 1.000000 | 3 | 3 |
| 1 | 0.500000 | 2 | 1 | |
| 2 | 0 | 1.000000 | 9 | 9 |
| 1 | 1.000000 | 10 | 10 | |
| 3 | 0 | 0.354839 | 31 | 11 |
| 1 | 0.500000 | 26 | 13 |
top_10_survival = prob_matrix.nlargest(10,['mean', 'sum'])
top_10_survival
| mean | count | sum | |||||
|---|---|---|---|---|---|---|---|
| Pclass | IsFemale | Age_Groups | Parch | SibSp | |||
| 1 | 1 | 25-64 | 0 | 1 | 1.0 | 24 | 24.0 |
| 15-24 | 0 | 0 | 1.0 | 8 | 8.0 | ||
| 25-64 | 1 | 0 | 1.0 | 8 | 8.0 | ||
| 15-24 | 1 | 0 | 1.0 | 7 | 7.0 | ||
| 2 | 0 | 0-14 | 1 | 1 | 1.0 | 7 | 7.0 |
| 1 | 1 | 15-24 | 0 | 1 | 1.0 | 6 | 6.0 |
| 2 | 1 | 0-14 | 1 | 1 | 1.0 | 5 | 5.0 |
| 15-24 | 1 | 1 | 1.0 | 5 | 5.0 | ||
| 25-64 | 1 | 0 | 1.0 | 5 | 5.0 | ||
| 1 | 1 | 15-24 | 1 | 1 | 1.0 | 4 | 4.0 |
plt.figure(figsize=(15,10))
ax = top_10_survival['sum'].plot(kind='bar', rot=0, figsize=(15, 10));
ax.set_ylabel("Highest Survival");