Density Based Clustering : DBSCAN


DBSCAN stands for density-based spatial clustering of applications with noise. The main benefits of DBSCAN are that it does not require the user to set the number of clusters a priori and that it can capture clusters of complex shapes. It can also identify points that are not part of any cluster.
DBSCAN is somewhat slower than agglomerative clustering and k-means, but still scales to relatively large datasets. The way DBSCAN works is by identifying points that are in “crowded” regions of the feature space, where many data points are close together. These regions are referred to as dense regions in feature space. The idea behind DBSCAN is that clusters form dense regions of data, separated by regions that are relatively empty.

Points that are within a dense region are called core samples (or core points), and they are defined as follows. DBSCAN has two parameters, min_samples and eps. If there are at least min_samples data points within a distance of eps of a given data point, that point is classified as a core sample. Core samples that are closer to each other than the distance eps are put into the same cluster by DBSCAN.

The algorithm works by picking an arbitrary point to start with. It then finds all points within distance eps of that point. If there are fewer than min_samples such points, the point is labeled as noise, meaning that it doesn’t belong to any cluster (a point labeled as noise at this stage can still be picked up later as a neighbor of a core sample). If there are at least min_samples points within a distance of eps, the point is labeled a core sample and assigned a new cluster label. Then, all neighbors (within eps) of the point are visited. If they have not been assigned a cluster yet, they are assigned the new cluster label that was just created. If they are core samples, their neighbors are visited in turn, and so on. The cluster grows until there are no more core samples within distance eps of the cluster. Then another point that hasn’t yet been visited is picked, and the same procedure is repeated.
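The procedure described above can be sketched in a few lines of plain Python. This is only a minimal illustration and not an optimized or reference implementation: the function names dbscan and region_query are made up for this example, Euclidean distance is assumed, the neighbor count is assumed to include the point itself, and noise is marked with the label -1.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # assumed convention: -1 marks noise


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already visited
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may still be claimed later by a core sample
            continue
        # points[i] is a core sample: start a new cluster and grow it
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # former noise becomes a boundary point
                continue
            if labels[j] is not None:
                continue  # already belongs to a cluster
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core sample: keep growing
    return labels


# Two small dense groups and one isolated point (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_samples=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, -1]
```

Each call to region_query scans every point, so this sketch takes quadratic time. Practical implementations typically use a spatial index to find neighbors faster.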

In the end, there are three kinds of points: core points, points that are within distance eps of core points (called boundary points), and noise. When running the DBSCAN algorithm on a particular dataset multiple times, the clustering of the core points is always the same, and the same points will always be labeled as noise. However, a boundary point might be a neighbor of core samples of more than one cluster. Therefore, the cluster membership of boundary points depends on the order in which points are visited. Usually, there are only a few boundary points, and this slight dependence on the order of points is not important.
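The three kinds of points can be determined from neighbor counts alone, without running the clustering itself. The sketch below does this under the same assumptions as the definitions above (Euclidean distance, and a neighbor count that includes the point itself). The function name classify_points and the toy data are invented for illustration, and the data is deliberately contrived so that one boundary point sits between two clusters.

```python
from math import dist  # Euclidean distance, Python 3.8+


def classify_points(points, eps, min_samples):
    """Return 'core', 'boundary', or 'noise' for each point."""
    n = len(points)
    # neighbors[i] holds the indices within eps of point i (including i)
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]
    kinds = []
    for i in range(n):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            kinds.append("boundary")  # not core, but within eps of a core point
        else:
            kinds.append("noise")
    return kinds


# Points on a line: two groups, one point between them, one far away.
points = [(0, 0), (1, 0), (2, 0), (6, 0), (10, 0), (11, 0), (12, 0), (30, 0)]
eps, min_samples = 4.5, 4
kinds = classify_points(points, eps, min_samples)

# Core points within eps of the middle point (index 3)
middle = 3
core_nb = [j for j, q in enumerate(points)
           if kinds[j] == "core" and dist(points[middle], q) <= eps]
print(kinds)
print(core_nb)  # [2, 4]
```

Here the two core points (indices 2 and 4) are farther apart than eps, so they belong to different clusters. The point at index 3 is within eps of both, so the cluster it ends up in depends on which core point reaches it first. The kinds themselves do not depend on the visiting order.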
