After extensive testing, StandardScaler() was chosen as the data transformer because it produced the highest score for both RandomForest and KNeighbors. It was compared against the other data transformers in the table below. To keep the project streamlined, only the final version is presented.

    Pipeline1       MinMaxScaler()   StandardScaler()   RobustScaler()   MaxAbsScaler()
    KNeighbors      80.26            82.65              78.15            80.58
    RandomForest    81.06            99.89              81.08            81.22

The highest-scoring hyperparameters are:

    RandomForestClassifier: {'classifier_rfc__n_estimators': 50}
    KNeighborsClassifier: {'classifier_knc__n_neighbors': 20}

The RandomForestClassifier performs best in the cross-validation scores:

    Random Forest Classifier 1 using StandardScaler mean accuracy: 82.495 % std: 0.005 %
    KNeighbors Classifier 1 using StandardScaler mean accuracy: 81.46 % std: 0.003 %

I also tried splitting the features into groups and applying different data transformers to each group to see whether the scores would increase, but they did not. One group held the columns I judged to be sparse-matrix-like values and the other held the regular numeric columns, as listed here. In the end, StandardScaler() applied to all of the data in a single group yielded the highest scores.

    sparse_data = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
    numeric_data = ['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE',
                    'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6',
                    'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

The correlation matrix does show some high correlations between the bill amount variables. However, this is a credit bureau history dataset and the goal is to predict a financial outcome (next-payment default), so payment amounts and bill amounts are the heart of the dataset and crucial to the predictions. I believe removing them would not make sense.

I also tried removing outliers to improve the score, but the script below produced no increase in the scores.
    from scipy import stats

    # Columns checked for outliers: credit limit, bill amounts and payment amounts.
    outlier_columns = ['LIMIT_BAL',
                       'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6',
                       'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

    print(credit_data.shape)

    # z-scores are computed once, per column (axis=0), on the unfiltered data.
    z_scores = stats.zscore(credit_data[outlier_columns].to_numpy(), axis=0)

    # Keep only the rows whose absolute z-score is at most 3 in every checked column.
    credit_data = credit_data.loc[(abs(z_scores) <= 3).all(axis=1)]

    print(credit_data.shape)

The end...
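To make the filtering rule in the outlier script concrete, here is a minimal standard-library sketch of the same idea on a handful of made-up rows. The helper names `zscores` and `drop_outlier_rows` and the toy values are hypothetical and not part of the project. The sketch assumes the population standard deviation (ddof=0), which is the default used by `scipy.stats.zscore`.

```python
from statistics import fmean, pstdev


def zscores(values):
    """Population z-scores (ddof=0) for a list of numbers."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]


def drop_outlier_rows(rows, columns, threshold=3.0):
    """Keep rows whose |z-score| is <= threshold in every listed column.

    rows is a list of dicts, used here as a stand-in for DataFrame rows.
    """
    # z-scores are computed once, on the unfiltered data, for each column.
    z_by_column = {c: zscores([r[c] for r in rows]) for c in columns}
    return [
        row
        for i, row in enumerate(rows)
        if all(abs(z_by_column[c][i]) <= threshold for c in columns)
    ]


# Toy data (illustrative only): 19 ordinary rows plus one extreme credit limit.
rows = [
    {"LIMIT_BAL": 50_000 + 1_000 * i, "PAY_AMT1": 1_000 + 100 * (i % 5)}
    for i in range(19)
]
rows.append({"LIMIT_BAL": 1_000_000, "PAY_AMT1": 1_200})

kept = drop_outlier_rows(rows, ["LIMIT_BAL", "PAY_AMT1"])
print(len(rows), "->", len(kept))  # 20 -> 19
```

Because every z-score is computed before any row is dropped, the order of the per-column filters does not matter: a row is removed as soon as any one of the checked columns exceeds the threshold.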