In this assignment, you will do the following:
Explore a dataset and carry out clustering using the k-means algorithm
Identify the optimum number of clusters for a given dataset
In this assignment, you will study the electricity demand of clients in Portugal during 2013 and 2014. You have been provided with the data file, which you should download when you download this assignment file.
The data available contains 370 time series, corresponding to the electric demand for 370 clients, between 2011 and 2014.
In this guided exercise, you will use clustering techniques to understand the typical usage behaviour during 2013-2014.
The dataset and a background reference on electric demand are publicly available:
Data: https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014#
Electric Demand: http://www.think-energy.net/KWvsKWH.htm
We will start by exploring the dataset and continue on to the assignment. Treat this as a working notebook; you will add your work to the same notebook.
In this assignment we will use the sklearn package for k-means. Please refer to the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
The sklearn package for k-means is one of the many clustering algorithms found in the module "sklearn.cluster". These come with a variety of functions that you can call by importing the package.
For example
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans
Start by downloading the data to a local directory and modify the "pathToFile" and "fileName" variables, if needed. The data file has been provided with this assignment. It is also available at the links provided above.
Assignment #2 - Clustering using KMeans and Segmentation Analysis
Justin Gabb
June 20 2021
pathToFile = "C:\\Users\\Z33\\Desktop\\Data Science\\U of T - Data Science\\Data Science 4 - Machine Learning\\Assignment 2 - Clustering\\"
#pathToFile = "C:\\Users\\Lenovo\\Desktop\\Data Science\\U of T - Data Science\\Data Science 4 - Machine Learning\\Assignment 2 - Clustering\\"
fileName = 'LD2011_2014.txt'
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import random
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
random.seed(42)
# Replace "," by ".", otherwise the numbers will be in the form 2,3445 instead of 2.3445
import fileinput
with fileinput.FileInput(pathToFile + fileName, inplace=True, backup='.bak') as file:
    for line in file:
        print(line.replace(",", "."), end='')
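As an aside, the comma-to-dot rewrite can also be avoided entirely, since `pandas.read_csv` accepts a `decimal` parameter for European-style decimal commas. A minimal sketch (the two-row sample here is illustrative, not taken from the real file):

```python
import io
import pandas as pd

# Sample with European decimal commas and semicolon separators,
# mimicking the layout of LD2011_2014.txt
sample = "ts;MT_001;MT_002\n2011-01-01 00:15:00;2,3445;0,5\n"

# decimal="," parses "2,3445" directly as the float 2.3445
df = pd.read_csv(io.StringIO(sample), sep=";", decimal=",", index_col=0)
print(df["MT_001"].iloc[0])  # 2.3445
```

This skips the in-place file rewrite (and the `.bak` backup) at the cost of slightly slower parsing.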
# Create dataframe
import pandas as pd
data = pd.read_csv(pathToFile+fileName, sep=";", index_col=0)
data.head(2)
| MT_001 | MT_002 | MT_003 | MT_004 | MT_005 | MT_006 | MT_007 | MT_008 | MT_009 | MT_010 | ... | MT_361 | MT_362 | MT_363 | MT_364 | MT_365 | MT_366 | MT_367 | MT_368 | MT_369 | MT_370 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2011-01-01 00:15:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2011-01-01 00:30:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 rows × 370 columns
data.tail(2)
| MT_001 | MT_002 | MT_003 | MT_004 | MT_005 | MT_006 | MT_007 | MT_008 | MT_009 | MT_010 | ... | MT_361 | MT_362 | MT_363 | MT_364 | MT_365 | MT_366 | MT_367 | MT_368 | MT_369 | MT_370 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2014-12-31 23:45:00 | 1.269036 | 21.337127 | 1.737619 | 166.666667 | 85.365854 | 285.714286 | 10.17524 | 225.589226 | 64.685315 | 72.043011 | ... | 246.252677 | 28000.0 | 1443.037975 | 909.090909 | 26.075619 | 4.095963 | 664.618086 | 146.911519 | 646.627566 | 6540.540541 |
| 2015-01-01 00:00:00 | 2.538071 | 19.914651 | 1.737619 | 178.861789 | 84.146341 | 279.761905 | 10.17524 | 249.158249 | 62.937063 | 69.892473 | ... | 188.436831 | 27800.0 | 1409.282700 | 954.545455 | 27.379400 | 4.095963 | 628.621598 | 131.886477 | 673.020528 | 7135.135135 |
2 rows × 370 columns
data.shape
(140256, 370)
Since the frequency is 15 minutes, each day provides $24\times 4 = 96$ datapoints. Multiplying by 365 days per year over 4 years, plus one extra day for Feb 29, 2012, gives $96 \times 365 \times 4 + 96 = 140256$, as observed in data.shape
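The row-count arithmetic above can be checked directly:

```python
# Sanity-check the row count: 96 readings per day (15-minute frequency),
# 4 years of 365 days, plus one leap day (Feb 29, 2012)
readings_per_day = 24 * 4            # 96
total = readings_per_day * (365 * 4 + 1)
print(total)  # 140256, matching data.shape[0]
```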
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 140256 entries, 2011-01-01 00:15:00 to 2015-01-01 00:00:00
Columns: 370 entries, MT_001 to MT_370
dtypes: float64(370)
memory usage: 397.0+ MB
data.describe()
| MT_001 | MT_002 | MT_003 | MT_004 | MT_005 | MT_006 | MT_007 | MT_008 | MT_009 | MT_010 | ... | MT_361 | MT_362 | MT_363 | MT_364 | MT_365 | MT_366 | MT_367 | MT_368 | MT_369 | MT_370 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | ... | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 | 140256.000000 |
| mean | 3.970785 | 20.768480 | 2.918308 | 82.184490 | 37.240309 | 141.227385 | 4.521338 | 191.401476 | 39.975354 | 42.205152 | ... | 218.213701 | 37607.987537 | 1887.427366 | 2940.031734 | 65.413150 | 9.269709 | 424.262904 | 94.704717 | 625.251734 | 8722.355145 |
| std | 5.983965 | 13.272415 | 11.014456 | 58.248392 | 26.461327 | 98.439984 | 6.485684 | 121.981187 | 29.814595 | 33.401251 | ... | 204.833532 | 38691.954832 | 1801.486488 | 2732.251967 | 65.007818 | 10.016782 | 274.337122 | 80.297301 | 380.656042 | 9195.155777 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 2.844950 | 0.000000 | 36.585366 | 15.853659 | 71.428571 | 0.565291 | 111.111111 | 13.986014 | 9.677419 | ... | 5.710207 | 0.000000 | 0.000000 | 0.000000 | 13.037810 | 0.000000 | 0.000000 | 30.050083 | 83.944282 | 0.000000 |
| 50% | 1.269036 | 24.893314 | 1.737619 | 87.398374 | 39.024390 | 157.738095 | 2.826456 | 222.222222 | 40.209790 | 40.860215 | ... | 131.334761 | 24100.000000 | 1050.632911 | 2136.363636 | 31.290743 | 7.021650 | 525.899912 | 76.794658 | 758.064516 | 0.000000 |
| 75% | 2.538071 | 29.871977 | 1.737619 | 115.853659 | 54.878049 | 205.357143 | 4.522329 | 279.461279 | 57.692308 | 61.290323 | ... | 403.283369 | 54800.000000 | 3312.236287 | 5363.636364 | 108.213820 | 11.702750 | 627.743635 | 151.919866 | 875.366569 | 17783.783784 |
| max | 48.223350 | 115.220484 | 151.172893 | 321.138211 | 150.000000 | 535.714286 | 44.657999 | 552.188552 | 157.342657 | 198.924731 | ... | 852.962170 | 192800.000000 | 7751.054852 | 12386.363636 | 335.071708 | 60.269163 | 1138.718174 | 362.270451 | 1549.120235 | 30918.918919 |
8 rows × 370 columns
data_example = data.loc['2012-01-01 00:15:00':'2012-01-03 00:00:00'][['MT_001','MT_002']]
data_example.plot()
plt.show()
We focus on 2013 and 2014 because these are the years with the fewest clients having zero demand.
data2011 = data.loc['2011-01-01 00:15:00':'2012-01-01 00:00:00']
data2012 = data.loc['2012-01-01 00:15:00':'2013-01-01 00:00:00']
data2013 = data.loc['2013-01-01 00:15:00':'2014-01-01 00:00:00']
data2014 = data.loc['2014-01-01 00:15:00':'2015-01-01 00:00:00']
# Check number of days
print(data2011.shape[0]/96)
print(data2012.shape[0]/96)
print(data2013.shape[0]/96)
print(data2014.shape[0]/96)
365.0
366.0
365.0
365.0
# See number of clients with 0 demand per year
print(sum(data2011.mean()==0))
print(sum(data2012.mean()==0))
print(sum(data2013.mean()==0))
print(sum(data2014.mean()==0))
210
37
21
1
clients = data2011.columns
clients_no_demand = clients[data2013.mean()==0] # clients with 0 demand
data_13_14 = pd.concat([data2013, data2014]) # stack 2013 and 2014 (DataFrame.append was removed in pandas 2.0)
data_13_14 = data_13_14.drop(clients_no_demand, axis=1) # drop clients with 0 demand
print(data_13_14.shape)
print(sum(data_13_14.mean()==0)) # check that there are no clients with 0 demand
(70080, 349)
0
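The concatenate-then-filter steps above can be sketched on tiny toy frames (the column names here are illustrative): stack the two periods, then keep only columns whose mean demand is nonzero.

```python
import pandas as pd

# Toy stand-ins for data2013 / data2014: client "x" has zero demand throughout
a = pd.DataFrame({"x": [0.0, 0.0], "y": [1.0, 2.0]})
b = pd.DataFrame({"x": [0.0, 0.0], "y": [3.0, 4.0]})

both = pd.concat([a, b])                  # stack the two periods row-wise
kept = both.loc[:, both.mean() != 0]      # drop clients with zero mean demand
print(list(kept.columns))  # ['y']
```

Using a boolean column mask in `.loc` is equivalent to the `drop(..., axis=1)` call above, just expressed as "keep" rather than "remove".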
data = data_13_14.copy() # weekdays weekends, data2011, data2012, data2013, data2014
data['hour'] = data.index.map(lambda x: x[11:])
data.head(3)
| MT_001 | MT_002 | MT_003 | MT_004 | MT_005 | MT_006 | MT_007 | MT_008 | MT_009 | MT_010 | ... | MT_362 | MT_363 | MT_364 | MT_365 | MT_366 | MT_367 | MT_368 | MT_369 | MT_370 | hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013-01-01 00:15:00 | 2.538071 | 22.759602 | 2.606429 | 138.211382 | 63.414634 | 255.952381 | 4.522329 | 239.057239 | 57.692308 | 78.494624 | ... | 22300.0 | 886.075949 | 1000.000000 | 16.949153 | 6.436513 | 616.330114 | 76.794658 | 731.671554 | 8086.486486 | 00:15:00 |
| 2013-01-01 00:30:00 | 1.269036 | 22.759602 | 2.606429 | 138.211382 | 63.414634 | 264.880952 | 5.652911 | 228.956229 | 57.692308 | 76.344086 | ... | 21000.0 | 864.978903 | 909.090909 | 18.252934 | 3.510825 | 564.530290 | 76.794658 | 727.272727 | 8086.486486 | 00:30:00 |
| 2013-01-01 00:45:00 | 2.538071 | 22.759602 | 2.606429 | 134.146341 | 60.975610 | 250.000000 | 5.652911 | 239.057239 | 54.195804 | 76.344086 | ... | 18200.0 | 860.759494 | 840.909091 | 16.949153 | 5.851375 | 590.869183 | 68.447412 | 730.205279 | 7848.648649 | 00:45:00 |
3 rows × 350 columns
datagrouped = data.groupby("hour")
average_curves = datagrouped.agg("mean")
average_curves.shape
(96, 349)
average_curves_norm = average_curves/(average_curves.mean())
average_curves_norm[['MT_001','MT_002','MT_369','MT_370']].plot()
plt.show()
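The division `average_curves / average_curves.mean()` rescales each client's column by its own mean, so every normalized curve averages to 1 and clustering compares the shape of the daily profile rather than its magnitude. A tiny illustration with made-up numbers:

```python
import pandas as pd

# Two clients with very different magnitudes but the same shape
df = pd.DataFrame({"a": [2.0, 4.0, 6.0], "b": [10.0, 20.0, 30.0]})

# Dividing by the column means puts both on a common mean-1 scale
norm = df / df.mean()
print(norm["a"].tolist())  # [0.5, 1.0, 1.5]
print(norm["b"].tolist())  # [0.5, 1.0, 1.5]
```

After normalization the two clients are identical, which is exactly what lets k-means group clients by usage pattern instead of total consumption.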
Q1: (7 marks)
a. Determine a convenient number of clusters, and justify your choice. Make use of sklearn's k-means package for this. You may refer to the module to figure out how to come up with the optimal number of clusters.
b. Make a plot for each cluster, that includes:
- The number of clients in the cluster (you can put this in the title of the plot)
- All the curves in the cluster
- The curve corresponding to the center of the cluster (make this curve thicker to distinguish it from the individual curves). The center is also sometimes referred to as "centroid".
You may use 2 separate plots for each cluster if you prefer (one for the individual curves, one for the centroid).
X = average_curves_norm.copy() # we call this the normalized curve
X = np.array(X.T) # transpose so each row is one client's 96-point curve
Answer: The optimal number of clusters is 3, according to two different elbow graphs that were plotted using: 1) `inertia_` from sklearn, 2) a manual function computing the total squared clustering error.
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn import metrics
def getInertia(X, kmeans):
    inertia = 0
    for j in range(len(X)):
        inertia += np.linalg.norm(X[j] - kmeans.cluster_centers_[kmeans.labels_[j]]) ** 2
    return inertia
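As a sanity check (on hypothetical random data, not the demand curves), the manual inertia above should agree with sklearn's own `inertia_` attribute, since both sum the squared Euclidean distance from each point to its assigned center:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data standing in for X; any point cloud works for this check
rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 4))
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(pts)

# Same computation as getInertia: squared distance to assigned center, summed
manual = sum(
    np.linalg.norm(pts[j] - km.cluster_centers_[km.labels_[j]]) ** 2
    for j in range(len(pts))
)
print(np.isclose(manual, km.inertia_))  # True
```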
test_clusters = np.array([2, 3, 4, 5, 6, 7, 8])
inertia_list = []
inertia_list2 = []
sil_scores_list = []
sse = []
for n_clusters in test_clusters:
    kmeans = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=300, random_state=42)
    kmeans_model = kmeans.fit(X)
    labels = kmeans_model.labels_
    msl = metrics.silhouette_score(X, labels, metric='euclidean')
    sil_scores_list.append(msl)
    inertia_list.append(kmeans_model.inertia_)  # sklearn's own inertia
    inertia_list2.append(getInertia(X, kmeans_model))
    print("The silhouette score for {0} clusters is {1}".format(n_clusters, msl))
    # Manual calculation of the sum of squared errors; use the full
    # 96-dimensional distance to the centroid, not just the first two coordinates
    y_kmeans = kmeans_model.predict(X)
    centroids = kmeans_model.cluster_centers_
    curr_sse = 0
    for i in range(len(X)):
        curr_sse += np.linalg.norm(X[i] - centroids[y_kmeans[i]]) ** 2
    sse.append(curr_sse)
The silhouette score for 2 clusters is 0.37191838565388774
The silhouette score for 3 clusters is 0.42244692513476645
The silhouette score for 4 clusters is 0.4296355372224772
The silhouette score for 5 clusters is 0.43445138922045784
The silhouette score for 6 clusters is 0.42838790129722165
The silhouette score for 7 clusters is 0.39635138162651284
The silhouette score for 8 clusters is 0.2795711708141336
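The pick-k-by-silhouette pattern used above can be isolated in a small, self-contained sketch on synthetic data (the blobs and parameters here are purely illustrative): the score peaks at the true number of well-separated groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two tight, well-separated blobs: silhouette should peak at k = 2
rng = np.random.default_rng(42)
pts = np.vstack([rng.normal(0, 0.3, (40, 2)),
                 rng.normal(5, 0.3, (40, 2))])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(pts)
    scores[k] = silhouette_score(pts, labels)

best_k = max(scores, key=scores.get)  # k with the highest average silhouette
print(best_k)  # 2
```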
plt.figure(figsize=(10, 10))
plt.plot(test_clusters, inertia_list2)  # x-axis must match the tested k values (2..8)
plt.title('Elbow Method - 1')
plt.xlabel('Number of clusters')
plt.ylabel('Squared Errors')
plt.show()
plt.figure(figsize=(10, 10))
plt.plot(test_clusters, sse)  # x-axis must match the tested k values (2..8)
plt.title('Elbow Method - 2')
plt.xlabel('Number of clusters')
plt.ylabel('Squared cluster errors')
plt.show()
kmeans = KMeans(n_clusters = 3, random_state=42)
model = kmeans.fit(X)
y_predict = model.predict(X)
label = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
u_labels = np.unique(label)
u_labels
array([0, 1, 2])
print("This is the shape of X:", len(X))
print("This is the shape of the averages from original data, but normalised", average_curves_norm.shape)
print("This is the shape of the averages from original data, but not normalised", average_curves.shape)
print("The shape of the labels_ is:", model.labels_.shape)
print("The shape of the predict is:", label.shape)
print("The shape of the y_predict is:", y_predict.shape)
This is the shape of X: 349
This is the shape of the averages from original data, but normalised (96, 349)
This is the shape of the averages from original data, but not normalised (96, 349)
The shape of the labels_ is: (349,)
The shape of the predict is: (349,)
The shape of the y_predict is: (349,)
from scipy import stats
mapping = {}
for class_id in np.unique(label):
    mode, _ = stats.mode(label[label == class_id])
    mapping[mode[0]] = class_id
mapping
{0: 0, 1: 1, 2: 2}
y_pred = np.array([mapping[cluster_id] for cluster_id in label])
plt.figure(figsize=(15, 15))
plt.plot(X[y_pred==0, 2], X[y_pred==0, 3], "yo", label="Cluster 1")
plt.plot(X[y_pred==1, 2], X[y_pred==1, 3], "bs", label="Cluster 2")
plt.plot(X[y_pred==2, 2], X[y_pred==2, 3], "g^", label="Cluster 3")
plt.xlabel("", fontsize=14)
plt.ylabel("", fontsize=14)
plt.legend(loc="upper left", fontsize=12)
plt.show()
np.unique(y_predict, return_counts=True)
(array([0, 1, 2]), array([113, 201, 35], dtype=int64))
x_frame = pd.DataFrame(X)
x_frame['cluster'] = y_predict
#x_frame.head(20)
group1 = x_frame.loc[x_frame['cluster']== 0]
group2 = x_frame.loc[x_frame['cluster']== 1]
group3 = x_frame.loc[x_frame['cluster']== 2]
cent_frame = pd.DataFrame(centroids)
cent1 = cent_frame.loc[0]
cent2 = cent_frame.loc[1]
cent3 = cent_frame.loc[2]
cent2
0 0.677584
1 0.658765
2 0.640987
3 0.630940
4 0.620766
...
91 0.867243
92 0.835176
93 0.781209
94 0.732462
95 0.697766
Name: 1, Length: 96, dtype: float64
group1.shape
group1
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.964792 | 0.947378 | 0.960104 | 0.959434 | 0.956085 | 0.972494 | 0.970485 | 0.983211 | 0.965127 | 0.965127 | ... | 1.049852 | 1.022727 | 0.973164 | 0.994931 | 0.996941 | 1.024736 | 1.031768 | 1.006652 | 1.001629 | 0 |
| 1 | 0.926674 | 0.919117 | 0.880744 | 0.877243 | 0.841989 | 0.806701 | 0.786076 | 0.778034 | 0.759108 | 0.763579 | ... | 1.049731 | 1.005708 | 1.005500 | 0.979363 | 0.955618 | 0.957767 | 0.954925 | 0.923346 | 0.926570 | 0 |
| 2 | 0.991800 | 0.977796 | 0.976579 | 0.971708 | 0.953443 | 0.949181 | 0.954660 | 0.947354 | 0.918739 | 0.925436 | ... | 1.101391 | 1.092867 | 1.081908 | 1.069122 | 1.038072 | 1.009456 | 1.003976 | 0.998497 | 0.993017 | 0 |
| 3 | 1.140106 | 1.099159 | 1.061641 | 1.030153 | 0.975761 | 0.920906 | 0.897345 | 0.875850 | 0.855450 | 0.836217 | ... | 1.495640 | 1.462960 | 1.422622 | 1.389845 | 1.347294 | 1.293558 | 1.238946 | 1.209744 | 1.175290 | 0 |
| 4 | 1.186498 | 1.131703 | 1.090238 | 1.053960 | 1.024347 | 0.992994 | 0.962757 | 0.934785 | 0.918994 | 0.903958 | ... | 1.376127 | 1.382463 | 1.388340 | 1.385878 | 1.369364 | 1.352292 | 1.324550 | 1.272842 | 1.231541 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 344 | 1.358632 | 1.365299 | 1.312627 | 1.245171 | 1.292298 | 1.246425 | 1.221937 | 1.160817 | 1.096199 | 0.922938 | ... | 1.352362 | 1.356124 | 1.300020 | 1.358302 | 1.417178 | 1.406155 | 1.393548 | 1.360414 | 1.324442 | 0 |
| 345 | 1.215272 | 1.184899 | 1.177781 | 1.156699 | 1.149629 | 1.139513 | 1.125489 | 1.092430 | 1.069030 | 1.043176 | ... | 1.014598 | 1.021865 | 1.182977 | 1.273484 | 1.265394 | 1.268087 | 1.252350 | 1.247264 | 1.227273 | 0 |
| 346 | 1.170441 | 1.223818 | 1.211343 | 1.189829 | 1.167326 | 1.105127 | 0.993529 | 0.914299 | 0.842847 | 0.789775 | ... | 1.280347 | 1.264828 | 1.242055 | 1.223890 | 1.204952 | 1.186229 | 1.174311 | 1.153788 | 1.136380 | 0 |
| 347 | 0.974307 | 0.969266 | 0.947969 | 0.937889 | 0.931284 | 0.959546 | 0.929746 | 0.903852 | 0.920795 | 0.933955 | ... | 1.020555 | 1.020748 | 1.032500 | 1.023658 | 1.005845 | 1.000981 | 1.008670 | 1.003680 | 0.974611 | 0 |
| 348 | 0.992967 | 1.000548 | 1.008515 | 1.007613 | 1.006770 | 1.005071 | 1.006327 | 1.005100 | 1.004361 | 1.003668 | ... | 1.023395 | 1.024567 | 1.019028 | 1.016896 | 1.013001 | 1.004462 | 0.991657 | 0.993947 | 0.987431 | 0 |
113 rows × 97 columns
dat_plot1 = group1.drop('cluster', axis = 1)
dat_plot1.T.plot(figsize=(15, 15),legend = False)
cent1.plot(linewidth=5, color='red')
plt.title('There are 113 customers in the first cluster')
plt.show()
dat_plot2 = group2.drop('cluster', axis = 1)
dat_plot2.T.plot(figsize=(15, 15),legend = False)
cent2.plot(linewidth=5, color='red')
plt.title('There are 201 customers in the second cluster')
plt.show()
dat_plot3 = group3.drop('cluster', axis = 1)
dat_plot3.T.plot(figsize=(15, 15),legend = False)
cent3.plot(linewidth=5, color='red')
plt.title('There are 35 customers in the third cluster')
plt.show()
Q2: (8 marks)
In this exercise you work with the daily curves of a single client. First, create a list of arrays, each array containing a curve for one day. You may use X from the cells above (X = average_curves_norm.copy()). The list contains 730 arrays, one for each day of 2013 and 2014.
a. Determine the optimal value of k (the number of clusters). This time you may also perform silhouette analysis as described in the module. Carrying out the silhouette analysis is left as an exercise. What do you understand about the clusters?
b. Based on your results from both methods, what do you understand? Interpret the clusters, perhaps from different timeline perspectives such as weeks or months.
Q2. b. HINT: Pick the data pertaining to only one user/client. You then have all the data for each day over two years: 730 items in the list. Perform the clustering exercise. Look at the data to interpret how the clusters are formed. Do the clusters have anything to do with usage by month, by weekday, or by weekend? Look at the two clusters and analyze by day/hour/time.
According to the silhouette analysis, the optimal number of clusters is 2. The inertia graph leads to the same conclusion: 2 is the optimal number of clusters. See below for the silhouette analysis graphs.
client = 'MT_022'
oneClient = data_13_14[client]
X = []  # a list of arrays, each array being the curve for one day
for J in range(2 * 365):
    X.extend([np.array(oneClient[J * 96:(J + 1) * 96])])  # divide by np.mean(...) here to normalize each day
len(X)
730
print("The length of one customer", len(oneClient))
print("The shape of customer file before conversion", oneClient.shape)
print("The length of X after conversion", len(X))
The length of one customer 70080
The shape of customer file before conversion (70080,)
The length of X after conversion 730
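Since each day contributes exactly 96 readings, the slicing loop above can equivalently be written as a single `reshape`; a sketch on synthetic data:

```python
import numpy as np

# A stand-in for one client's series: 730 days of 96 readings each
series = np.arange(730 * 96, dtype=float)

# One row per day, one column per 15-minute slot
daily = series.reshape(730, 96)
print(daily.shape)  # (730, 96)
```

Row `daily[J]` is identical to the slice `series[J*96:(J+1)*96]`, so the result matches the loop-built list element for element.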
range_n_clusters = [2, 3, 4, 5, 6, 7, 8, 9, 10]
inertia_list = []
sil_scores_list = []
cent_list = []
for n_clusters in range_n_clusters:
    kmeans = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=300, random_state=42)
    kmeans_model = kmeans.fit(X)
    labels = kmeans_model.labels_
    msl = metrics.silhouette_score(X, labels, metric='euclidean')
    sil_scores_list.append(msl)
    inertia_list.append(kmeans_model.inertia_)  # assign inertia of cluster to list
    # inertia_list2.append(getInertia(X, kmeans_model))
    print("The silhouette score for {0} clusters is {1}".format(n_clusters, msl))
    y_kmeans = kmeans_model.predict(X)
    centroids = kmeans_model.cluster_centers_
    cent_list.append(centroids)

###############
# Plot the silhouette analysis graphs
X = np.array(X)
for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot.
    # The silhouette coefficient can range from -1 to 1, but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')
    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1, s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')
    plt.show()
The silhouette score for 2 clusters is 0.5011374240143703
The silhouette score for 3 clusters is 0.2821977416156188
The silhouette score for 4 clusters is 0.28049361743450146
The silhouette score for 5 clusters is 0.21796684044318868
The silhouette score for 6 clusters is 0.194488579370401
The silhouette score for 7 clusters is 0.1881151943265972
The silhouette score for 8 clusters is 0.18308478210564866
The silhouette score for 9 clusters is 0.1831194843057909
The silhouette score for 10 clusters is 0.1730434531067609
For n_clusters = 2 The average silhouette_score is : 0.5011374240143703
For n_clusters = 3 The average silhouette_score is : 0.2821977416156188
For n_clusters = 4 The average silhouette_score is : 0.2809290661743727
For n_clusters = 5 The average silhouette_score is : 0.21806182288969875
For n_clusters = 6 The average silhouette_score is : 0.1946808916331403
For n_clusters = 7 The average silhouette_score is : 0.1904034398151954
For n_clusters = 8 The average silhouette_score is : 0.18458539094881474
For n_clusters = 9 The average silhouette_score is : 0.18527992724357673
For n_clusters = 10 The average silhouette_score is : 0.17492561208432203
plt.figure(figsize=(5, 5))
plt.plot(range_n_clusters, inertia_list)  # x-axis must match the tested k values (2..10)
plt.title('Inertia Values')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

plt.figure(figsize=(5, 5))
plt.plot(range_n_clusters, sil_scores_list)  # x-axis must match the tested k values (2..10)
plt.title('Silhouette Scores')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette')
plt.show()
Cluster 1: After merging dates with cluster numbers, a simple visual inspection indicates that cluster 1 is driven by weekends and holidays. A detailed analysis confirms this: 77% of cluster 1's data points fall on a Saturday, a Sunday, or a holiday, and the occasional weekday in this cluster often turns out to be a Portuguese holiday (there were 7 national holidays falling on weekdays in Portugal in 2013).

Cluster 2: Visual inspection indicates that cluster 2 is weekday-based (Monday to Friday). A more detailed analysis confirms this, with 90% of cluster 2's data points falling on a Monday through Friday. Monthly analysis did not show any discernible pattern between the two clusters. See below for the cluster date analysis.
from datetime import date, timedelta

d1 = date(2013, 1, 1)    # start date
d2 = date(2014, 12, 31)  # end date
delta = d2 - d1          # timedelta
daysyear = []
D = {0: 'mon', 1: 'tue', 2: 'wed', 3: 'thu', 4: 'fri', 5: 'sat', 6: 'sun'}
for i in range(delta.days + 1):
    day = d1 + timedelta(days=i)
    daysyear.append(D[day.weekday()] + "-" + str(day))
Fit and predict KMeans using the optimal k as determined previously, k = 2.
k = 2
kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, random_state=42)
kmeans_model = kmeans.fit(X)
labels = kmeans_model.labels_
y_kmeans = kmeans_model.predict(X)
Analyse some of the dataset dimensions for merging:
len(daysyear)
730
len(X)
730
len(y_kmeans)
730
date_an = pd.DataFrame(daysyear, columns=['date'])
date_an.head(5)
| date | |
|---|---|
| 0 | tue-2013-01-01 |
| 1 | wed-2013-01-02 |
| 2 | thu-2013-01-03 |
| 3 | fri-2013-01-04 |
| 4 | sat-2013-01-05 |
customer1 = pd.DataFrame(y_kmeans, columns=['cluster'])
customer1.head(5)
| cluster | |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 0 |
# Analyze daily / weekly counts per cluster
date_analysis = pd.concat([date_an, customer1], axis=1)
date1 = date_analysis.loc[date_analysis['cluster'] == 0]
date2 = date_analysis.loc[date_analysis['cluster'] == 1]

days = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
date1_analysis = {d + "1": date1.date.str.count(d).sum() for d in days}
date2_analysis = {d + "2": date2.date.str.count(d).sum() for d in days}

# Monthly counts per cluster (count occurrences of "-01-", "-02-", ... in the date strings)
monthly_analysis1 = {"month%d" % m: date1.date.str.count("-%02d-" % m).sum() for m in range(1, 13)}
monthly_analysis2 = {"month%d_2" % m: date2.date.str.count("-%02d-" % m).sum() for m in range(1, 13)}

# Extract the analysis
print("These are days of the week counts from the first cluster", date1_analysis)
print("These are days of the week counts from the second cluster", date2_analysis)
print("The length of cluster 1:", len(date1))
print("The length of cluster 2:", len(date2))
weekend1 = date1_analysis["sat1"] + date1_analysis["sun1"]
weekday2 = sum(date2_analysis[d + "2"] for d in ["mon", "tue", "wed", "thu", "fri"])
# There were 7 weekday national holidays in Portugal in 2013
print("The percentage of data points that fall on the weekend and holidays for cluster 1:",
      (weekend1 + 7) / len(date1) * 100)
print("The percentage of data points that fall during the week for cluster 2:",
      weekday2 / len(date2) * 100)
print("These are the data points for cluster 1 monthly counts", monthly_analysis1)
print("These are the data points for cluster 2 monthly counts", monthly_analysis2)
These are days of the week counts from the first cluster {'mon1': 12, 'tue1': 12, 'wed1': 12, 'thu1': 10, 'fri1': 11, 'sat1': 58, 'sun1': 104}
These are days of the week counts from the second cluster {'mon2': 92, 'tue2': 93, 'wed2': 93, 'thu2': 94, 'fri2': 93, 'sat2': 46, 'sun2': 0}
The length of cluster 1: 219
The length of cluster 2: 511
The percentage of data points that fall on the weekend and holidays for cluster 1: 77.1689497716895
The percentage of data points that fall during the week for cluster 2: 90.99804305283757
These are the data points for cluster 1 monthly counts {'month1': 15, 'month2': 14, 'month3': 22, 'month4': 19, 'month5': 21, 'month6': 18, 'month7': 12, 'month8': 47, 'month9': 13, 'month10': 10, 'month11': 9, 'month12': 19}
These are the data points for cluster 2 monthly counts {'month1_2': 47, 'month2_2': 42, 'month3_2': 40, 'month4_2': 41, 'month5_2': 41, 'month6_2': 42, 'month7_2': 50, 'month8_2': 15, 'month9_2': 47, 'month10_2': 52, 'month11_2': 51, 'month12_2': 43}
monthly_analysis1
{'month1': 15, 'month2': 14, 'month3': 22, 'month4': 19, 'month5': 21, 'month6': 18, 'month7': 12, 'month8': 47, 'month9': 13, 'month10': 10, 'month11': 9, 'month12': 19}
monthly_analysis2
{'month1_2': 47, 'month2_2': 42, 'month3_2': 40, 'month4_2': 41, 'month5_2': 41, 'month6_2': 42, 'month7_2': 50, 'month8_2': 15, 'month9_2': 47, 'month10_2': 52, 'month11_2': 51, 'month12_2': 43}
from itertools import compress
L = [1, 2, 3, 4]
B = [False, True, True, False]
list(compress(L, B))
Do the clusters have anything to do with usage by month, by weekday, or by weekend? Look at the two clusters and analyze by day/hour/time. We need to compare cluster numbers to dates rather than raw data points, then count by month, day, weekend, or hour.
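An alternative to the `str.count` bookkeeping above, assuming a proper `DatetimeIndex`, is to group the cluster labels by `day_name()`; the dates and labels below are purely illustrative:

```python
import pandas as pd

# Seven consecutive days and hypothetical cluster assignments for them
dates = pd.date_range("2013-01-01", periods=7, freq="D")
labels = [1, 1, 1, 1, 0, 0, 1]  # made-up cluster labels, one per day

# Count how many days of each weekday name landed in cluster 1
counts = pd.Series(labels, index=dates).groupby(dates.day_name()).sum()
print(counts["Friday"])  # 1
```

With real data, replacing `labels` by `y_kmeans` and `dates` by the 730-day index would give the same weekday breakdown as the dictionary-based counts, without string matching.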
# Sketch: select the dates belonging to one cluster with itertools.compress;
# bool_list stands for a boolean mask such as (y_kmeans == 0)
date_analysis = list(compress(daysyear, bool_list))
Continue with your analysis here:
Please see above for the daily, weekly, and monthly analysis of the clustering.
