1 Introduction

K-means clustering is a form of unsupervised machine learning. It can be used to separate unlabeled data into groups. A marketing team could use K-means on customer data to perform customer segmentation. However, in this example I use K-means clustering to perform image compression: I group pixels in an image by their similarity in color in order to reduce the total number of colors within that image.

1.1 How it Works

  1. K-means clustering begins by randomly assigning n datapoints to be cluster centroids (where n is the desired number of clusters or groups to separate).
  2. Each datapoint is assigned to the cluster with the closest cluster centroid.
  3. The average value of each cluster is calculated and becomes the cluster center.
  4. Repeat steps 2 and 3 until convergence.

You can download my code at github to follow along. This guide is written with Python in mind, but I also provide a brief R walkthrough in the final section of this post.

2 Python K-means Walkthrough

2.1 Setup

I will be performing image compression on this image:

First, I load all relevant packages:

from skimage import io
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as image

2.2 Load Image

img = io.imread('bird_small.png')
img_r = (img / 255.0).reshape(-1,3)

I reshape the image into a 16384x3 array so that each row represents a pixel and the three columns represent the Red, Green, and Blue values.

2.3 Run K-means

#Fit K-means on resized image. n_clusters is the desired number of colors 
k_colors = KMeans(n_clusters=128).fit(img_r)
#Assign colors to pixels based on their cluster center
#Each row in k_colors.cluster_centers_ represents the RGB value of a cluster centroid
#k_colors.labels_ contains the cluster that a pixel is assigned to
#The following assigns every pixel the color of the centroid it is assigned to
img128=k_colors.cluster_centers_[k_colors.labels_]
#Reshape the image back to 128x128x3 to save
img128=np.reshape(img128, (img.shape))
#Save image
image.imsave('img128.png',img128)

2.4 Results

The following is a shiny web app I created in R to visualize how the image changes as the number of clusters fed to the K-means algorithm changes:

3 R Code

The following code performs the same K-means image compression in R. It only differs from the Python code in syntax:

library(png)
library(reshape)
library(stats)

img <- readPNG("bird_small.png")
origdim <- dim(img)

dim(img) <- c(dim(img)[1]*dim(img)[2],3)

k_colors = kmeans(img,centers=128,iter.max=100)
img128 <- k_colors$centers[k_colors$cluster,]
dim(img128) <- origdim
writePNG(img128,"img128.png")