In this assignment, you will do the following:
Explore a dataset and carry out clustering using the k-means algorithm
Identify the optimum number of clusters for a given dataset
In this assignment, you will study the electricity demand of clients in Portugal during 2013 and 2014. You have been provided with the data file, which you should download when you download this assignment file.
The data available contains 370 time series, corresponding to the electric demand for 370 clients, between 2011 and 2014.
In this guided exercise, you will use clustering techniques to understand the typical usage behaviour during 2013-2014.
The dataset and a background reference on electric demand are publicly available:
Data: https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014#
Electric Demand: http://www.think-energy.net/KWvsKWH.htm
We will start by exploring the dataset and continue on to the assignment. Treat this as a working notebook; you will add your work to the same notebook.
In this assignment we will use the sklearn package for k-means. Please refer to the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
The sklearn package for k-means is one of the many clustering algorithms found in the module "sklearn.cluster". These come with a variety of functions that you can call by importing the package.
For example
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans
Start by downloading the data to a local directory and modify the "pathToFile" and "fileName" variables, if needed. The data file has been provided with this assignment. It is also available at the links provided above.
Assignment #2 - Clustering using KMeans and Segmentation Analysis
Justin Gabb
June 20 2021
pathToFile = "C:\\Users\\Z33\\Desktop\\Data Science\\U of T - Data Science\\Data Science 4 - Machine Learning\\Assignment 2 - Clustering\\"
#pathToFile = "C:\\Users\\Lenovo\\Desktop\\Data Science\\U of T - Data Science\\Data Science 4 - Machine Learning\\Assignment 2 - Clustering\\"
fileName = 'LD2011_2014.txt'
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import random
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
random.seed(42)
# Replace "," by ".", otherwise the numbers will be in the form 2,3445 instead of 2.3445
import fileinput
with fileinput.FileInput(pathToFile + fileName, inplace=True, backup='.bak') as file:
    for line in file:
        print(line.replace(",", "."), end='')
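As an aside, the comma-to-dot rewrite can also be avoided entirely, since `pandas.read_csv` accepts a `decimal` parameter for European-style decimal commas. A minimal sketch (the two-row sample here is illustrative, not taken from the real file):

```python
import io
import pandas as pd

# Sample with European decimal commas and semicolon separators,
# mimicking the layout of LD2011_2014.txt
sample = "ts;MT_001;MT_002\n2011-01-01 00:15:00;2,3445;0,5\n"

# decimal="," parses "2,3445" directly as the float 2.3445
df = pd.read_csv(io.StringIO(sample), sep=";", decimal=",", index_col=0)
print(df["MT_001"].iloc[0])  # 2.3445
```

This skips the in-place file rewrite (and the `.bak` backup) at the cost of slightly slower parsing.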
# Create dataframe
import pandas as pd
data = pd.read_csv(pathToFile+fileName, sep=";", index_col=0)
data.head(2)
| MT_001 | MT_002 | MT_003 | MT_004 | MT_005 | MT_006 | MT_007 | MT_008 | MT_009 | MT_010 | ... | MT_361 | MT_362 | MT_363 | MT_364 | MT_365 | MT_366 | MT_367 | MT_368 | MT_369 | MT_370 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2011-01-01 00:15:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2011-01-01 00:30:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 rows × 370 columns
data.tail(2)
| MT_001 | MT_002 | MT_003 | MT_004 | MT_005 | MT_006 | MT_007 | MT_008 | MT_009 | MT_010 | ... | MT_361 | MT_362 | MT_363 | MT_364 | MT_365 | MT_366 | MT_367 | MT_368 | MT_369 | MT_370 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2014-12-31 23:45:00 | 1.269036 | 21.337127 | 1.737619 | 166.666667 | 85.365854 | 285.714286 | 10.17524 | 225.589226 | 64.685315 | 72.043011 | ... | 246.252677 | 28000.0 | 1443.037975 | 909.090909 | 26.075619 | 4.095963 | 664.618086 | 146.911519 | 646.627566 | 6540.540541 |
| 2015-01-01 00:00:00 | 2.538071 | 19.914651 | 1.737619 | 178.861789 | 84.146341 | 279.761905 | 10.17524 | 249.158249 | 62.937063 | 69.892473 | ... | 188.436831 | 27800.0 | 1409.282700 | 954.545455 | 27.379400 | 4.095963 | 628.621598 | 131.886477 | 673.020528 | 7135.135135 |
2 rows × 370 columns
data.shape
(140256, 370)
Since the frequency is 15 minutes, each day provides $24\times 4 = 96$ datapoints. Multiplying by 365 days per year over 4 years, plus one extra day for Feb 29, 2012, gives $96 \times 365 \times 4 + 96 = 140256$, as observed in data.shape
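The row-count arithmetic above can be checked directly:

```python
# Sanity-check the row count: 96 readings per day (15-minute frequency),
# 4 years of 365 days, plus one leap day (Feb 29, 2012)
readings_per_day = 24 * 4            # 96
total = readings_per_day * (365 * 4 + 1)
print(total)  # 140256, matching data.shape[0]
```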
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 140256 entries, 2011-01-01 00:15:00 to 2015-01-01 00:00:00
Columns: 370 entries, MT_001 to MT_370
dtypes: float64(370)
memory usage: 397.0+ MB
data.describe()
| MT_001 | MT_002 | MT_003 | MT_004 | MT_005 | MT_006 | MT_007 | MT_008 | MT_009 | MT_010 | ... | MT_361 | MT_362 | MT_363 | MT_364 | MT_365 | MT_366 | MT_367 | MT_368 | MT_369 | MT_370 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | ... | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 |
| mean | 3.970785 | 20.768480 | 2.918308 | 82.184490 | 37.240309 | 141.227385 | 4.521338 | 191.401476 | 39.975354 | 42.205152 | ... | 218.213701 | 37607.987537 | 1887.427366 | 2940.031734 | 65.413150 | 9.269709 | 424.262904 | 94.704717 | 625.251734 | 8722.355145 |
| std | 5.983965 | 13.272415 | 11.014456 | 58.248392 | 26.461327 | 98.439984 | 6.485684 | 121.981187 | 29.814595 | 33.401251 | ... | 204.833532 | 38691.954832 | 1801.486488 | 2732.251967 | 65.007818 | 10.016782 | 274.337122 | 80.297301 | 380.656042 | 9195.155777 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 2.844950 | 0.000000 | 36.585366 | 15.853659 | 71.428571 | 0.565291 | 111.111111 | 13.986014 | 9.677419 | ... | 5.710207 | 0.000000 | 0.000000 | 0.000000 | 13.037810 | 0.000000 | 0.000000 | 30.050083 | 83.944282 | 0.000000 |
| 50% | 1.269036 | 24.893314 | 1.737619 | 87.398374 | 39.024390 | 157.738095 | 2.826456 | 222.222222 | 40.209790 | 40.860215 | ... | 131.334761 | 24100.000000 | 1050.632911 | 2136.363636 | 31.290743 | 7.021650 | 525.899912 | 76.794658 | 758.064516 | 0.000000 |
| 75% | 2.538071 | 29.871977 | 1.737619 | 115.853659 | 54.878049 | 205.357143 | 4.522329 | 279.461279 | 57.692308 | 61.290323 | ... | 403.283369 | 54800.000000 | 3312.236287 | 5363.636364 | 108.213820 | 11.702750 | 627.743635 | 151.919866 | 875.366569 | 17783.783784 |
| max | 48.223350 | 115.220484 | 151.172893 | 321.138211 | 150.000000 | 535.714286 | 44.657999 | 552.188552 | 157.342657 | 198.924731 | ... | 852.962170 | 192800.000000 | 7751.054852 | 12386.363636 | 335.071708 | 60.269163 | 1138.718174 | 362.270451 | 1549.120235 | 30918.918919 |
8 rows × 370 columns
data_example = data.loc['2012-01-01 00:15:00':'2012-01-03 00:00:00'][['MT_001','MT_002']]
data_example.plot()
plt.show()
We focus on 2013 and 2014 because these are the years with the fewest clients having zero demand.
data2011 = data.loc['2011-01-01 00:15:00':'2012-01-01 00:00:00']
data2012 = data.loc['2012-01-01 00:15:00':'2013-01-01 00:00:00']
data2013 = data.loc['2013-01-01 00:15:00':'2014-01-01 00:00:00']
data2014 = data.loc['2014-01-01 00:15:00':'2015-01-01 00:00:00']
# Check number of days
print(data2011.shape[0]/96)
print(data2012.shape[0]/96)
print(data2013.shape[0]/96)
print(data2014.shape[0]/96)
365.0
366.0
365.0
365.0
# See number of clients with 0 demand per year
print(sum(data2011.mean()==0))
print(sum(data2012.mean()==0))
print(sum(data2013.mean()==0))
print(sum(data2014.mean()==0))
210
37
21
1
clients = data2011.columns
clients_no_demand = clients[data2013.mean()==0] # clients with 0 demand
data_13_14 = pd.concat([data2013, data2014]) # stack 2013 and 2014 (DataFrame.append was removed in pandas 2.0)
data_13_14 = data_13_14.drop(clients_no_demand, axis=1) # drop clients with 0 demand
print(data_13_14.shape)
print(sum(data_13_14.mean()==0)) # check that there are no clients with 0 demand
(70080, 349)
0
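The concatenate-then-filter steps above can be sketched on tiny toy frames (the column names here are illustrative): stack the two periods, then keep only columns whose mean demand is nonzero.

```python
import pandas as pd

# Toy stand-ins for data2013 / data2014: client "x" has zero demand throughout
a = pd.DataFrame({"x": [0.0, 0.0], "y": [1.0, 2.0]})
b = pd.DataFrame({"x": [0.0, 0.0], "y": [3.0, 4.0]})

both = pd.concat([a, b])                  # stack the two periods row-wise
kept = both.loc[:, both.mean() != 0]      # drop clients with zero mean demand
print(list(kept.columns))  # ['y']
```

Using a boolean column mask in `.loc` is equivalent to the `drop(..., axis=1)` call above, just expressed as "keep" rather than "remove".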
data = data_13_14.copy() # weekdays weekends, data2011, data2012, data2013, data2014
data['hour'] = data.index.map(lambda x: x[11:])
data.head(3)
| MT_001 | MT_002 | MT_003 | MT_004 | MT_005 | MT_006 | MT_007 | MT_008 | MT_009 | MT_010 | ... | MT_362 | MT_363 | MT_364 | MT_365 | MT_366 | MT_367 | MT_368 | MT_369 | MT_370 | hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013-01-01 00:15:00 | 2.538071 | 22.759602 | 2.606429 | 138.211382 | 63.414634 | 255.952381 | 4.522329 | 239.057239 | 57.692308 | 78.494624 | ... | 22300.0 | 886.075949 | 1000.000000 | 16.949153 | 6.436513 | 616.330114 | 76.794658 | 731.671554 | 8086.486486 | 00:15:00 |
| 2013-01-01 00:30:00 | 1.269036 | 22.759602 | 2.606429 | 138.211382 | 63.414634 | 264.880952 | 5.652911 | 228.956229 | 57.692308 | 76.344086 | ... | 21000.0 | 864.978903 | 909.090909 | 18.252934 | 3.510825 | 564.530290 | 76.794658 | 727.272727 | 8086.486486 | 00:30:00 |
| 2013-01-01 00:45:00 | 2.538071 | 22.759602 | 2.606429 | 134.146341 | 60.975610 | 250.000000 | 5.652911 | 239.057239 | 54.195804 | 76.344086 | ... | 18200.0 | 860.759494 | 840.909091 | 16.949153 | 5.851375 | 590.869183 | 68.447412 | 730.205279 | 7848.648649 | 00:45:00 |
3 rows × 350 columns
datagrouped = data.groupby("hour")
average_curves = datagrouped.agg("mean")
average_curves.shape
(96, 349)
average_curves_norm = average_curves/(average_curves.mean())
average_curves_norm[['MT_001','MT_002','MT_369','MT_370']].plot()
plt.show()
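The division `average_curves / average_curves.mean()` rescales each client's column by its own mean, so every normalized curve averages to 1 and clustering compares the shape of the daily profile rather than its magnitude. A tiny illustration with made-up numbers:

```python
import pandas as pd

# Two clients with very different magnitudes but the same shape
df = pd.DataFrame({"a": [2.0, 4.0, 6.0], "b": [10.0, 20.0, 30.0]})

# Dividing by the column means puts both on a common mean-1 scale
norm = df / df.mean()
print(norm["a"].tolist())  # [0.5, 1.0, 1.5]
print(norm["b"].tolist())  # [0.5, 1.0, 1.5]
```

After normalization the two clients are identical, which is exactly what lets k-means group clients by usage pattern instead of total consumption.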
Q1: (7 marks)
a. Determine a convenient number of clusters, and justify your choice. Make use of sklearn's k-means package for this. You may refer to the module to figure out how to come up with the optimal number of clusters.
b. Make a plot for each cluster, that includes:
- The number of clients in the cluster (you can put this in the title of the plot)
- All the curves in the cluster
- The curve corresponding to the center of the cluster (make this curve thicker to distinguish it from the individual curves). The center is also sometimes referred to as "centroid".
You may use 2 separate plots for each cluster if you prefer (one for the individual curves, one for the centroid).
X = average_curves_norm.copy() # we call this the normalized curve
X = np.array(X.T) # transpose so each row is one client's 96-point curve
Answer: The optimal number of clusters is 3, according to two different elbow graphs that were plotted using: 1) `inertia_` from sklearn, 2) a manual function computing the total squared clustering error.
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn import metrics
def getInertia(X, kmeans):
    inertia = 0
    for j in range(len(X)):
        inertia += np.linalg.norm(X[j] - kmeans.cluster_centers_[kmeans.labels_[j]]) ** 2
    return inertia
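As a sanity check (on hypothetical random data, not the demand curves), the manual inertia above should agree with sklearn's own `inertia_` attribute, since both sum the squared Euclidean distance from each point to its assigned center:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data standing in for X; any point cloud works for this check
rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 4))
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(pts)

# Same computation as getInertia: squared distance to assigned center, summed
manual = sum(
    np.linalg.norm(pts[j] - km.cluster_centers_[km.labels_[j]]) ** 2
    for j in range(len(pts))
)
print(np.isclose(manual, km.inertia_))  # True
```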
test_clusters = np.array([2, 3, 4, 5, 6, 7, 8])
inertia_list = []
inertia_list2 = []
sil_scores_list = []
sse = []
for n_clusters in test_clusters:
    kmeans = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=300, random_state=42)
    kmeans_model = kmeans.fit(X)
    labels = kmeans_model.labels_
    msl = metrics.silhouette_score(X, labels, metric='euclidean')
    sil_scores_list.append(msl)
    inertia_list.append(kmeans_model.inertia_)  # sklearn's own inertia
    inertia_list2.append(getInertia(X, kmeans_model))
    print("The silhouette score for {0} clusters is {1}".format(n_clusters, msl))
    # Manual calculation of the sum of squared errors; use the full
    # 96-dimensional distance to the centroid, not just the first two coordinates
    y_kmeans = kmeans_model.predict(X)
    centroids = kmeans_model.cluster_centers_
    curr_sse = 0
    for i in range(len(X)):
        curr_sse += np.linalg.norm(X[i] - centroids[y_kmeans[i]]) ** 2
    sse.append(curr_sse)
The silhouette score for 2 clusters is 0.37191838565388774
The silhouette score for 3 clusters is 0.42244692513476645
The silhouette score for 4 clusters is 0.4296355372224772
The silhouette score for 5 clusters is 0.43445138922045784
The silhouette score for 6 clusters is 0.42838790129722165
The silhouette score for 7 clusters is 0.39635138162651284
The silhouette score for 8 clusters is 0.2795711708141336
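The pick-k-by-silhouette pattern used above can be isolated in a small, self-contained sketch on synthetic data (the blobs and parameters here are purely illustrative): the score peaks at the true number of well-separated groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two tight, well-separated blobs: silhouette should peak at k = 2
rng = np.random.default_rng(42)
pts = np.vstack([rng.normal(0, 0.3, (40, 2)),
                 rng.normal(5, 0.3, (40, 2))])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(pts)
    scores[k] = silhouette_score(pts, labels)

best_k = max(scores, key=scores.get)  # k with the highest average silhouette
print(best_k)  # 2
```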
plt.figure(figsize=(10, 10))
plt.plot(test_clusters, inertia_list2)  # x-axis must match the tested k values (2..8)
plt.title('Elbow Method - 1')
plt.xlabel('Number of clusters')
plt.ylabel('Squared Errors')
plt.show()
plt.figure(figsize=(10, 10))
plt.plot(test_clusters, sse)  # x-axis must match the tested k values (2..8)
plt.title('Elbow Method - 2')
plt.xlabel('Number of clusters')
plt.ylabel('Squared cluster errors')
plt.show()
kmeans = KMeans(n_clusters = 3, random_state=42)
model = kmeans.fit(X)
y_predict = model.predict(X)
label = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
u_labels = np.unique(label)
u_labels
array([0, 1, 2])
print("This is the shape of X:", len(X))
print("This is the shape of the averages from original data, but normalised", average_curves_norm.shape)
print("This is the shape of the averages from original data, but not normalised", average_curves.shape)
print("The shape of the labels_ is:", model.labels_.shape)
print("The shape of the predict is:", label.shape)
print("The shape of the y_predict is:", y_predict.shape)
This is the shape of X: 349
This is the shape of the averages from original data, but normalised (96, 349)
This is the shape of the averages from original data, but not normalised (96, 349)
The shape of the labels_ is: (349,)
The shape of the predict is: (349,)
The shape of the y_predict is: (349,)
from scipy import stats
mapping = {}
for class_id in np.unique(label):
    mode, _ = stats.mode(label[label == class_id])
    mapping[mode[0]] = class_id
mapping
{0: 0, 1: 1, 2: 2}
y_pred = np.array([mapping[cluster_id] for cluster_id in label])
plt.figure(figsize=(15, 15))
plt.plot(X[y_pred==0, 2], X[y_pred==0, 3], "yo", label="Cluster 1")
plt.plot(X[y_pred==1, 2], X[y_pred==1, 3], "bs", label="Cluster 2")
plt.plot(X[y_pred==2, 2], X[y_pred==2, 3], "g^", label="Cluster 3")
plt.xlabel("", fontsize=14)
plt.ylabel("", fontsize=14)
plt.legend(loc="upper left", fontsize=12)
plt.show()
np.unique(y_predict, return_counts=True)
(array([0, 1, 2]), array([113, 201, 35], dtype=int64))
x_frame = pd.DataFrame(X)
x_frame['cluster'] = y_predict
#x_frame.head(20)
group1 = x_frame.loc[x_frame['cluster']== 0]
group2 = x_frame.loc[x_frame['cluster']== 1]
group3 = x_frame.loc[x_frame['cluster']== 2]
cent_frame = pd.DataFrame(centroids)
cent1 = cent_frame.loc[0]
cent2 = cent_frame.loc[1]
cent3 = cent_frame.loc[2]
cent2
0 0.677584
1 0.658765
2 0.640987
3 0.630940
4 0.620766
...
91 0.867243
92 0.835176
93 0.781209
94 0.732462
95 0.697766
Name: 1, Length: 96, dtype: float64
group1.shape
group1
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.964792 | 0.947378 | 0.960104 | 0.959434 | 0.956085 | 0.972494 | 0.970485 | 0.983211 | 0.965127 | 0.965127 | ... | 1.049852 | 1.022727 | 0.973164 | 0.994931 | 0.996941 | 1.024736 | 1.031768 | 1.006652 | 1.001629 | 0 |
| 1 | 0.926674 | 0.919117 | 0.880744 | 0.877243 | 0.841989 | 0.806701 | 0.786076 | 0.778034 | 0.759108 | 0.763579 | ... | 1.049731 | 1.005708 | 1.005500 | 0.979363 | 0.955618 | 0.957767 | 0.954925 | 0.923346 | 0.926570 | 0 |
| 2 | 0.991800 | 0.977796 | 0.976579 | 0.971708 | 0.953443 | 0.949181 | 0.954660 | 0.947354 | 0.918739 | 0.925436 | ... | 1.101391 | 1.092867 | 1.081908 | 1.069122 | 1.038072 | 1.009456 | 1.003976 | 0.998497 | 0.993017 | 0 |
| 3 | 1.140106 | 1.099159 | 1.061641 | 1.030153 | 0.975761 | 0.920906 | 0.897345 | 0.875850 | 0.855450 | 0.836217 | ... | 1.495640 | 1.462960 | 1.422622 | 1.389845 | 1.347294 | 1.293558 | 1.238946 | 1.209744 | 1.175290 | 0 |
| 4 | 1.186498 | 1.131703 | 1.090238 | 1.053960 | 1.024347 | 0.992994 | 0.962757 | 0.934785 | 0.918994 | 0.903958 | ... | 1.376127 | 1.382463 | 1.388340 | 1.385878 | 1.369364 | 1.352292 | 1.324550 | 1.272842 | 1.231541 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 344 | 1.358632 | 1.365299 | 1.312627 | 1.245171 | 1.292298 | 1.246425 | 1.221937 | 1.160817 | 1.096199 | 0.922938 | ... | 1.352362 | 1.356124 | 1.300020 | 1.358302 | 1.417178 | 1.406155 | 1.393548 | 1.360414 | 1.324442 | 0 |
| 345 | 1.215272 | 1.184899 | 1.177781 | 1.156699 | 1.149629 | 1.139513 | 1.125489 | 1.092430 | 1.069030 | 1.043176 | ... | 1.014598 | 1.021865 | 1.182977 | 1.273484 | 1.265394 | 1.268087 | 1.252350 | 1.247264 | 1.227273 | 0 |
| 346 | 1.170441 | 1.223818 | 1.211343 | 1.189829 | 1.167326 | 1.105127 | 0.993529 | 0.914299 | 0.842847 | 0.789775 | ... | 1.280347 | 1.264828 | 1.242055 | 1.223890 | 1.204952 | 1.186229 | 1.174311 | 1.153788 | 1.136380 | 0 |
| 347 | 0.974307 | 0.969266 | 0.947969 | 0.937889 | 0.931284 | 0.959546 | 0.929746 | 0.903852 | 0.920795 | 0.933955 | ... | 1.020555 | 1.020748 | 1.032500 | 1.023658 | 1.005845 | 1.000981 | 1.008670 | 1.003680 | 0.974611 | 0 |
| 348 | 0.992967 | 1.000548 | 1.008515 | 1.007613 | 1.006770 | 1.005071 | 1.006327 | 1.005100 | 1.004361 | 1.003668 | ... | 1.023395 | 1.024567 | 1.019028 | 1.016896 | 1.013001 | 1.004462 | 0.991657 | 0.993947 | 0.987431 | 0 |
113 rows × 97 columns
dat_plot1 = group1.drop('cluster', axis = 1)
dat_plot1.T.plot(figsize=(15, 15),legend = False)
cent1.plot(linewidth=5, color='red')
plt.title('There are 113 customers in the first cluster')
plt.show()
dat_plot2 = group2.drop('cluster', axis = 1)
dat_plot2.T.plot(figsize=(15, 15),legend = False)
cent2.plot(linewidth=5, color='red')
plt.title('There are 201 customers in the second cluster')
plt.show()
dat_plot3 = group3.drop('cluster', axis = 1)
dat_plot3.T.plot(figsize=(15, 15),legend = False)
cent3.plot(linewidth=5, color='red')
plt.title('There are 35 customers in the third cluster')
plt.show()
Q2: (8 marks)
In this exercise you work with the daily curves of a single client. First, create a list of arrays, each array containing a curve for one day. You may use X from the cells above (X = average_curves_norm.copy()). The list contains 730 arrays, one for each day of 2013 and 2014.
a. Determine the optimal value of k (the number of clusters). This time you may also perform silhouette analysis as described in the module. Carrying out the silhouette analysis is left as an exercise. What do you understand about the clusters?
b. Based on your results from both methods, what do you understand? Interpret the clusters, perhaps from different timeline perspectives such as weeks or months.
Q2. b. HINT: Pick the data pertaining to only one user/client. You then have all the data for each day over two years: 730 items in the list. Perform the clustering exercise. Look at the data to interpret how the clusters are formed. Do the clusters have anything to do with usage by month, by weekday, or by weekend? Look at the two clusters and analyze by day/hour/time.
According to the silhouette analysis, the optimal number of clusters is 2. The inertia graph leads to the same conclusion: 2 is the optimal number of clusters. See below for the silhouette analysis graphs.
client = 'MT_022'
oneClient = data_13_14[client]
X = []  # a list of arrays, each array being the curve for one day
for J in range(2 * 365):
    X.extend([np.array(oneClient[J * 96:(J + 1) * 96])])  # divide by np.mean(...) here to normalize each day
len(X)
730
print("The length of one customer", len(oneClient))
print("The shape of customer file before conversion", oneClient.shape)
print("The length of X after conversion", len(X))
The length of one customer 70080
The shape of customer file before conversion (70080,)
The length of X after conversion 730
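Since each day contributes exactly 96 readings, the slicing loop above can equivalently be written as a single `reshape`; a sketch on synthetic data:

```python
import numpy as np

# A stand-in for one client's series: 730 days of 96 readings each
series = np.arange(730 * 96, dtype=float)

# One row per day, one column per 15-minute slot
daily = series.reshape(730, 96)
print(daily.shape)  # (730, 96)
```

Row `daily[J]` is identical to the slice `series[J*96:(J+1)*96]`, so the result matches the loop-built list element for element.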
range_n_clusters = [2, 3, 4, 5, 6, 7, 8, 9, 10]
inertia_list = []
sil_scores_list = []
cent_list = []
for n_clusters in range_n_clusters:
    kmeans = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=300, random_state=42)
    kmeans_model = kmeans.fit(X)
    labels = kmeans_model.labels_
    msl = metrics.silhouette_score(X, labels, metric='euclidean')
    sil_scores_list.append(msl)
    inertia_list.append(kmeans_model.inertia_)  # assign inertia of cluster to list
    # inertia_list2.append(getInertia(X, kmeans_model))
    print("The silhouette score for {0} clusters is {1}".format(n_clusters, msl))
    y_kmeans = kmeans_model.predict(X)
    centroids = kmeans_model.cluster_centers_
    cent_list.append(centroids)

###############
# Plot the silhouette analysis graphs
X = np.array(X)
for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot.
    # The silhouette coefficient can range from -1 to 1, but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')
    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1, s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')
    plt.show()
The silhouette score for 2 clusters is 0.5011374240143703
The silhouette score for 3 clusters is 0.2821977416156188
The silhouette score for 4 clusters is 0.28049361743450146
The silhouette score for 5 clusters is 0.21796684044318868
The silhouette score for 6 clusters is 0.194488579370401
The silhouette score for 7 clusters is 0.1881151943265972
The silhouette score for 8 clusters is 0.18308478210564866
The silhouette score for 9 clusters is 0.1831194843057909
The silhouette score for 10 clusters is 0.1730434531067609
For n_clusters = 2 The average silhouette_score is : 0.5011374240143703
For n_clusters = 3 The average silhouette_score is : 0.2821977416156188
For n_clusters = 4 The average silhouette_score is : 0.2809290661743727
For n_clusters = 5 The average silhouette_score is : 0.21806182288969875
For n_clusters = 6 The average silhouette_score is : 0.1946808916331403
For n_clusters = 7 The average silhouette_score is : 0.1904034398151954
For n_clusters = 8 The average silhouette_score is : 0.18458539094881474
For n_clusters = 9 The average silhouette_score is : 0.18527992724357673
For n_clusters = 10 The average silhouette_score is : 0.17492561208432203
plt.figure(figsize=(5, 5))
plt.plot(range_n_clusters, inertia_list)  # x-axis must match the tested k values (2..10)
plt.title('Inertia Values')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

plt.figure(figsize=(5, 5))
plt.plot(range_n_clusters, sil_scores_list)  # x-axis must match the tested k values (2..10)
plt.title('Silhouette Scores')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette')
plt.show()
Cluster 1: After merging dates with cluster numbers, a simple visual inspection indicates that cluster 1 is driven by weekends and holidays. A detailed analysis confirms this: 77% of cluster 1's data points fall on a Saturday, a Sunday, or a holiday, and the occasional weekday in this cluster often turns out to be a Portuguese holiday (there were 7 national holidays falling on weekdays in Portugal in 2013).

Cluster 2: Visual inspection indicates that cluster 2 is weekday-based (Monday to Friday). A more detailed analysis confirms this, with 90% of cluster 2's data points falling on a Monday through Friday. Monthly analysis did not show any discernible pattern between the two clusters. See below for the cluster date analysis.
from datetime import date, timedelta

d1 = date(2013, 1, 1)    # start date
d2 = date(2014, 12, 31)  # end date
delta = d2 - d1          # timedelta
daysyear = []
D = {0: 'mon', 1: 'tue', 2: 'wed', 3: 'thu', 4: 'fri', 5: 'sat', 6: 'sun'}
for i in range(delta.days + 1):
    day = d1 + timedelta(days=i)
    daysyear.append(D[day.weekday()] + "-" + str(day))
Fit and predict KMeans using the optimal k as determined previously, k = 2.
k = 2
kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, random_state=42)
kmeans_model = kmeans.fit(X)
labels = kmeans_model.labels_
y_kmeans = kmeans_model.predict(X)
Analyse some of the dataset dimensions for merging:
len(daysyear)
730
len(X)
730
len(y_kmeans)
730
date_an = pd.DataFrame(daysyear, columns=['date'])
date_an.head(5)
| date | |
|---|---|
| 0 | tue-2013-01-01 |
| 1 | wed-2013-01-02 |
| 2 | thu-2013-01-03 |
| 3 | fri-2013-01-04 |
| 4 | sat-2013-01-05 |
customer1 = pd.DataFrame(y_kmeans, columns=['cluster'])
customer1.head(5)
| cluster | |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 0 |
# Analyze daily / weekly counts per cluster
date_analysis = pd.concat([date_an, customer1], axis=1)
date1 = date_analysis.loc[date_analysis['cluster'] == 0]
date2 = date_analysis.loc[date_analysis['cluster'] == 1]

days = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
date1_analysis = {d + "1": date1.date.str.count(d).sum() for d in days}
date2_analysis = {d + "2": date2.date.str.count(d).sum() for d in days}

# Monthly counts per cluster (count occurrences of "-01-", "-02-", ... in the date strings)
monthly_analysis1 = {"month%d" % m: date1.date.str.count("-%02d-" % m).sum() for m in range(1, 13)}
monthly_analysis2 = {"month%d_2" % m: date2.date.str.count("-%02d-" % m).sum() for m in range(1, 13)}

# Extract the analysis
print("These are days of the week counts from the first cluster", date1_analysis)
print("These are days of the week counts from the second cluster", date2_analysis)
print("The length of cluster 1:", len(date1))
print("The length of cluster 2:", len(date2))
weekend1 = date1_analysis["sat1"] + date1_analysis["sun1"]
weekday2 = sum(date2_analysis[d + "2"] for d in ["mon", "tue", "wed", "thu", "fri"])
# There were 7 weekday national holidays in Portugal in 2013
print("The percentage of data points that fall on the weekend and holidays for cluster 1:",
      (weekend1 + 7) / len(date1) * 100)
print("The percentage of data points that fall during the week for cluster 2:",
      weekday2 / len(date2) * 100)
print("These are the data points for cluster 1 monthly counts", monthly_analysis1)
print("These are the data points for cluster 2 monthly counts", monthly_analysis2)
These are days of the week counts from the first cluster {'mon1': 12, 'tue1': 12, 'wed1': 12, 'thu1': 10, 'fri1': 11, 'sat1': 58, 'sun1': 104}
These are days of the week counts from the second cluster {'mon2': 92, 'tue2': 93, 'wed2': 93, 'thu2': 94, 'fri2': 93, 'sat2': 46, 'sun2': 0}
The length of cluster 1: 219
The length of cluster 2: 511
The percentage of data points that fall on the weekend and holidays for cluster 1: 77.1689497716895
The percentage of data points that fall during the week for cluster 2: 90.99804305283757
These are the data points for cluster 1 monthly counts {'month1': 15, 'month2': 14, 'month3': 22, 'month4': 19, 'month5': 21, 'month6': 18, 'month7': 12, 'month8': 47, 'month9': 13, 'month10': 10, 'month11': 9, 'month12': 19}
These are the data points for cluster 2 monthly counts {'month1_2': 47, 'month2_2': 42, 'month3_2': 40, 'month4_2': 41, 'month5_2': 41, 'month6_2': 42, 'month7_2': 50, 'month8_2': 15, 'month9_2': 47, 'month10_2': 52, 'month11_2': 51, 'month12_2': 43}
monthly_analysis1
{'month1': 15, 'month2': 14, 'month3': 22, 'month4': 19, 'month5': 21, 'month6': 18, 'month7': 12, 'month8': 47, 'month9': 13, 'month10': 10, 'month11': 9, 'month12': 19}
monthly_analysis2
{'month1_2': 47, 'month2_2': 42, 'month3_2': 40, 'month4_2': 41, 'month5_2': 41, 'month6_2': 42, 'month7_2': 50, 'month8_2': 15, 'month9_2': 47, 'month10_2': 52, 'month11_2': 51, 'month12_2': 43}
from itertools import compress
L = [1, 2, 3, 4]
B = [False, True, True, False]
list(compress(L, B))
Do the clusters have anything to do with usage by month, by weekday, or by weekend? Look at the two clusters and analyze by day/hour/time. We need to compare cluster numbers to dates rather than raw data points, then count by month, day, weekend, or hour.
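An alternative to the `str.count` bookkeeping above, assuming a proper `DatetimeIndex`, is to group the cluster labels by `day_name()`; the dates and labels below are purely illustrative:

```python
import pandas as pd

# Seven consecutive days and hypothetical cluster assignments for them
dates = pd.date_range("2013-01-01", periods=7, freq="D")
labels = [1, 1, 1, 1, 0, 0, 1]  # made-up cluster labels, one per day

# Count how many days of each weekday name landed in cluster 1
counts = pd.Series(labels, index=dates).groupby(dates.day_name()).sum()
print(counts["Friday"])  # 1
```

With real data, replacing `labels` by `y_kmeans` and `dates` by the 730-day index would give the same weekday breakdown as the dictionary-based counts, without string matching.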
# Sketch: select the dates belonging to one cluster with itertools.compress;
# bool_list stands for a boolean mask such as (y_kmeans == 0)
date_analysis = list(compress(daysyear, bool_list))
Continue with your analysis here:
Please see above for the daily, weekly, and monthly analysis of the clustering.
