In statistics and data mining, kmedians clustering^{[1]}^{[2]} is a cluster analysis algorithm. It is a variation of kmeans clustering where instead of calculating the mean for each cluster to determine its centroid, one instead calculates the median. This has the effect of minimizing error over all clusters with respect to the 1norm distance metric, as opposed to the square of the 2norm distance metric (which kmeans does.)
This relates directly to the kmedian problem which is the problem of finding k centers such that the clusters formed by them are the most compact. Formally, given a set of data points x, the k centers c_{i} are to be chosen so as to minimize the sum of the distances from each x to the nearest c_{i}.
The criterion function formulated in this way is sometimes a better criterion than that used in the kmeans clustering algorithm, in which the sum of the squared distances is used. The sum of distances is widely used in applications such as facility location.
The proposed algorithm uses Lloydstyle iteration which alternates between an expectation (E) and maximization (M) step, making this an Expectation–maximization algorithm. In the E step, all objects are assigned to their nearest median. In the M step, the medians are recomputed by using the median in each single dimension.
Medians and medoids
The median is computed in each single dimension in the Manhattandistance formulation of the kmedians problem, so the individual attributes will come from the dataset. This makes the algorithm more reliable for discrete or even binary data sets. In contrast, the use of means or Euclideandistance medians will not necessarily yield individual attributes from the dataset. Even with the Manhattandistance formulation, the individual attributes may come from different instances in the dataset; thus, the resulting median may not be a member of the input dataset.
This algorithm is often confused with the kmedoids algorithm. However, a medoid has to be an actual instance from the dataset, while for the multivariate Manhattandistance median this only holds for single attribute values. The actual median can thus be a combination of multiple instances. For example, given the vectors (0,1), (1,0) and (2,2), the Manhattandistance median is (1,1), which does not exist in the original data, and thus cannot be a medoid.
Software

ELKI includes various kmeans variants, including kmedians.

FORTRAN kmedians

GNU R includes kmedians in the "flexclust" package.

Stata kmedians
See also
References

^ A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. PrenticeHall, 1988.

^ P. S. Bradley, O. L. Mangasarian, and W. N. Street, "Clustering via Concave Minimization," in Advances in Neural Information Processing Systems, vol. 9, M. C. Mozer, M. I. Jordan, and T. Petsche, Eds. Cambridge, MA: MIT Press, 1997, pp. 368–374.
This article was sourced from Creative Commons AttributionShareAlike License; additional terms may apply. World Heritage Encyclopedia content is assembled from numerous content providers, Open Access Publishing, and in compliance with The Fair Access to Science and Technology Research Act (FASTR), Wikimedia Foundation, Inc., Public Library of Science, The Encyclopedia of Life, Open Book Publishers (OBP), PubMed, U.S. National Library of Medicine, National Center for Biotechnology Information, U.S. National Library of Medicine, National Institutes of Health (NIH), U.S. Department of Health & Human Services, and USA.gov, which sources content from all federal, state, local, tribal, and territorial government publication portals (.gov, .mil, .edu). Funding for USA.gov and content contributors is made possible from the U.S. Congress, EGovernment Act of 2002.
Crowd sourced content that is contributed to World Heritage Encyclopedia is peer reviewed and edited by our editorial staff to ensure quality scholarly research articles.
By using this site, you agree to the Terms of Use and Privacy Policy. World Heritage Encyclopedia™ is a registered trademark of the World Public Library Association, a nonprofit organization.