DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. The main benefits of DBSCAN are that it does not require the user to set the number of clusters a priori, it can capture clusters of complex shapes, and it can identify points that are not part of any cluster.
DBSCAN is somewhat slower than agglomerative clustering and k-means, but still scales to relatively large datasets.

DBSCAN works by identifying points that are in “crowded” regions of the feature space, where many data points are close together. These regions are referred to as dense regions in feature space. The idea behind DBSCAN is that clusters form dense regions of data, separated by regions that are relatively empty.
Points that are within a dense region are called core samples (or core points), and they are defined as follows. There are two parameters in DBSCAN: min_samples and eps. If there are at least min_samples data points within a distance of eps of a given data point, that data point is classified as a core sample. Core samples that are closer to each other than the distance eps are put into the same cluster by DBSCAN.

The algorithm works by picking an arbitrary point to start with.
It then finds all points within distance eps of that point. If there are fewer than min_samples points within distance eps, this point is labeled as noise, meaning that it doesn’t belong to any cluster. If there are at least min_samples points within a distance of eps, the point is labeled a core sample and assigned a new cluster label. Then, all neighbors (within eps) of the point are visited. If they have not been assigned a cluster yet (this includes points that were earlier labeled as noise), they are assigned the new cluster label that was just created. If they are core samples, their neighbors are visited in turn, and so on. The cluster grows until there are no more core samples within distance eps of the cluster. Then another point that hasn’t yet been visited is picked, and the same procedure is repeated.
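The procedure described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used by any particular library: the function names dbscan and region_query are made up for this example, distances are Euclidean, noise is marked with -1, and the neighbor count is assumed to include the point itself.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # label for points that belong to no cluster (a convention chosen here)


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, 2, ... for clusters, -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may still be claimed later as a boundary point
            continue
        # points[i] is a core sample: start a new cluster and grow it
        cluster += 1
        labels[i] = cluster
        to_visit = [j for j in neighbors if j != i]
        while to_visit:
            j = to_visit.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # boundary point: near a core sample, not core itself
                continue
            if labels[j] is not None:
                continue  # already belongs to a cluster
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                to_visit.extend(j_neighbors)  # j is a core sample too, keep growing
    return labels


# Two small clumps and one isolated point
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
print(dbscan(points, eps=1.5, min_samples=3))
# prints [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Each neighbor search here scans every point, so the sketch takes quadratic time. That is fine for a toy example, but a practical implementation would typically use a spatial index for the neighbor lookups.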
In the end, there are three kinds of points: core points, points that are within distance eps of core points (called boundary points), and noise. When running the DBSCAN algorithm on a particular dataset multiple times, the clustering of the core points is always the same, and the same points will always be labeled as noise. However, a boundary point might be a neighbor of core samples of more than one cluster. Therefore, the cluster membership of boundary points depends on the order in which points are visited. Usually, there are only a few boundary points, and this slight dependence on the order of points is not important.
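To make the three kinds of points concrete, here is a small sketch that classifies each point as core, boundary, or noise. The function name classify_points is hypothetical, distances are Euclidean, and the neighbor count is assumed to include the point itself. The toy data places one point between two clumps, so that it is within eps of a core sample from each of them.

```python
from math import dist  # Euclidean distance, Python 3.8+


def classify_points(points, eps, min_samples):
    """Return 'core', 'boundary' or 'noise' for every point."""
    # For each point, the indices of all points within eps (including itself)
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_samples for n in neighbors]
    kinds = []
    for i, n in enumerate(neighbors):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in n):
            kinds.append("boundary")  # not dense enough itself, but near a core sample
        else:
            kinds.append("noise")
    return kinds


points = [(0, 0), (0, 1), (1, 0), (1, 1),   # left clump
          (2.5, 0),                         # point between the clumps
          (4, 0), (4, 1), (5, 0), (5, 1),   # right clump
          (10, 10)]                         # isolated point
print(classify_points(points, eps=1.6, min_samples=4))
# prints ['core', 'core', 'core', 'core', 'boundary',
#         'core', 'core', 'core', 'core', 'noise']  (on a single line)
```

The point (2.5, 0) is within eps of the core sample (1, 0) in the left clump and of the core sample (4, 0) in the right clump. Which cluster it ends up in therefore depends on which clump is expanded first, while the core and noise classifications stay the same regardless of order.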