# Clustering

There are billions of stars in the galaxy and we are in the way of finding new constellations but how can we find them as there are no labels, well clustering is the solution.

In order to solve unsupervised problems in machine learning, we use clustering algorithms. We can classify clustering as :

• Partitioned-based clustering
• K-Means
• K-Medians or Fuzzy c-Means
• (used for medium and large-sized databases)
• Hierarchical clustering
• Agglomerative
• Divisive algorithms.
• (used for small size datasets.)
• Density-based clustering algorithms
• DB scan algorithm
• (when there is noise in data/spatial clustering)

The main objective of the clustering algorithm is to form clusters in such a way that similar samples go into a cluster, and dissimilar samples fall into different clusters. In order to do this, it will minimize the “intra-cluster” distances and maximize the “inter-cluster” distances.

#### Partitioned-based clustering

There are basically following types of partitioned based clustering and use according to the following conditions,

• If your distance is squared Euclidean distance, use k-means
• If your distance is Taxicab metric/Manhattan, use k-medians
• If you have any other distance, use k-medoids

Algorithm:

```Initialize k means with random values

For a given number of iterations:
Iterate through items:
Find the mean closest to the item
Assign item to mean
Update mean```

Let us take an example:

```import random
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets.samples_generator import make_blobs
%matplotlib inline
```
```import pandas as pd<br>

Now, standardize the data,

```from sklearn.preprocessing import StandardScaler
X = df.values[:,1:]
X = np.nan_to_num(X)
Clus_dataSet = StandardScaler().fit_transform(X)
Clus_dataSet```
```clusterNum = 3
k_means = KMeans(init = "k-means++", n_clusters = clusterNum, n_init = 12)<br>
k_means.fit(X)
labels = k_means.labels_```
```area = np.pi * ( X[:, 1])**2
plt.scatter(X[:, 0], X[:, 3], s=area, c=labels.astype(np.float), alpha=0.5)<br>
plt.xlabel('Age', fontsize=18)
plt.ylabel('Income', fontsize=16)```
`plt.show()`
```area = np.pi * ( X[:, 1])**2
plt.scatter(X[:, 1], X[:, 3], s=area, c=labels.astype(np.float), alpha=0.5)<br>
plt.xlabel('Education', fontsize=18)
plt.ylabel('Income', fontsize=16)
plt.show()```
```from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(1, figsize=(8, 6))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
plt.cla()
ax.set_xlabel('Education')
ax.set_ylabel('Age')
ax.set_zlabel('Income')
ax.scatter(X[:, 1], X[:, 0], X[:, 3], c= labels.astype(np.float))```

Thus we find three categories of the cluster:

1-AFFLUENT, EDUCATED AND OLD AGED
2-MIDDLE AGED AND MIDDLE INCOME
3-YOUNG AND LOW INCOME

#### Hierarchical-based clustering

Hierarchical based clustering is divided into two categories:

• Agglomerative: This is a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
• Divisive: This is a “top-down” approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

In agglomerative clustering we merge clusters. But, how can we calculate the distance between clusters when there are multiple points in each cluster? We can use different criteria to find the closest clusters and merge them.

There are different ways to calculate the distances of the cluster:

• The minimum distance between clusters.
• The maximum distance between clusters.
• The average distance between clusters.
• Distance between cluster centroids.

In Hierarchical clustering, we don’t need a number of clusters to be defined so it’s easy to implement. but it is slow for large datasets and it always generates the same cluster.

#### Density-based clustering algorithms

Density-based clustering algorithms locate regions of high density that are separated from one another by regions of low density. Density-based clustering algorithms are proper for arbitrary shape clusters. It works on two parameters, 1-radius and 2-minimum points

```from sklearn.cluster import DBSCAN
import sklearn.utils
from sklearn.preprocessing import StandardScaler
sklearn.utils.check_random_state(1000)```
```Clus_dataSet = df[['lat','long']]
Clus_dataSet = np.nan_to_num(Clus_dataSet)
Clus_dataSet = StandardScaler().fit_transform(Clus_dataSet)```
```db = DBSCAN(eps=0.15, min_samples=10).fit(Clus_dataSet)