Kernel density estimation
In statistics, kernel density estimation (KDE) is a nonparametric way to estimate the probability density function of a random variable. Kernel density estimation is a fundamental data smoothing problem in which inferences about the population are made based on a finite data sample. In some fields such as signal processing and econometrics it is also termed the Parzen–Rosenblatt window method, after Emanuel Parzen and Murray Rosenblatt, who are usually credited with independently creating it in its current form.^{[1]}^{[2]}
Definition
Let (x_{1}, x_{2}, …, x_{n}) be an independent and identically distributed sample drawn from some distribution with an unknown density ƒ. We are interested in estimating the shape of this function ƒ. Its kernel density estimator is
 \hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^n K_h (x - x_i) \quad = \frac{1}{nh} \sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big),
where K(•) is the kernel — a nonnegative function that integrates to one and has mean zero — and h > 0 is a smoothing parameter called the bandwidth. A kernel with subscript h is called the scaled kernel and is defined as K_{h}(x) = (1/h) K(x/h). Intuitively one wants to choose h as small as the data allow; however, there is always a trade-off between the bias of the estimator and its variance. The choice of bandwidth is discussed in more detail below.
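The definition translates directly into code. A minimal sketch in Python with NumPy, using a normal kernel; the data values and the bandwidth h = 1.5 here are illustrative choices, not prescribed by the method:

```python
import numpy as np

def gaussian_kernel(u):
    # Standard normal density: nonnegative, integrates to one, mean zero.
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(grid, data, h):
    # f_hat_h(x) = (1/(n*h)) * sum_i K((x - x_i) / h)
    u = (grid[:, None] - data[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (len(data) * h)

data = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])
grid = np.linspace(-10.0, 14.0, 4801)
fhat = kde(grid, data, h=1.5)
```

Because each scaled kernel integrates to one, the resulting estimate is itself a probability density: nonnegative and integrating to one.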
A range of kernel functions are commonly used: uniform, triangular, biweight, triweight, Epanechnikov, normal, and others. The Epanechnikov kernel is optimal in a mean squared error sense,^{[3]} though the loss of efficiency is small for the other kernels listed,^{[4]} and because of its convenient mathematical properties the normal kernel K(x) = ϕ(x), where ϕ is the standard normal density function, is often used.
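As an illustration of a non-normal kernel, the Epanechnikov kernel can be written as a plain function and its defining properties checked numerically; the grid and tolerances below are arbitrary choices for this sketch:

```python
import numpy as np

def epanechnikov(u):
    # K(u) = 3/4 * (1 - u^2) for |u| <= 1, zero outside.
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

u = np.linspace(-1.5, 1.5, 300001)
du = u[1] - u[0]
total = (epanechnikov(u) * du).sum()      # integrates to one
mean = (u * epanechnikov(u) * du).sum()   # symmetric, so mean zero
```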
The construction of a kernel density estimate finds interpretations in fields outside of density estimation. For example, in thermodynamics, it is equivalent to the amount of heat generated when heat kernels (the fundamental solution to the heat equation) are placed at each data point location x_{i}. Similar methods are used to construct discrete Laplace operators on point clouds for manifold learning.
Kernel density estimates are closely related to histograms, but can be endowed with properties such as smoothness or continuity by using a suitable kernel. To see this, we compare the construction of histogram and kernel density estimators, using these 6 data points: x_{1} = −2.1, x_{2} = −1.3, x_{3} = −0.4, x_{4} = 1.9, x_{5} = 5.1, x_{6} = 6.2. For the histogram, first the horizontal axis is divided into subintervals or bins which cover the range of the data. In this case, we have 6 bins each of width 2. Whenever a data point falls inside this interval, we place a box of height 1/12. If more than one data point falls inside the same bin, we stack the boxes on top of each other.
For the kernel density estimate, we place a normal kernel with variance 2.25 (indicated by the red dashed lines) on each of the data points x_{i}. The kernels are summed to make the kernel density estimate (solid blue curve). The smoothness of the kernel density estimate is evident compared to the discreteness of the histogram, as kernel density estimates converge faster to the true underlying density for continuous random variables.^{[5]}
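The comparison above can be sketched in Python with NumPy, assuming bin edges at −4, −2, …, 8 so that six bins of width 2 cover the data:

```python
import numpy as np

x = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])

# Histogram: 6 bins of width 2; each point contributes a box of
# height 1/(n * width) = 1/12 to its bin.
edges = np.arange(-4.0, 9.0, 2.0)             # [-4, -2, 0, 2, 4, 6, 8]
heights, _ = np.histogram(x, bins=edges, density=True)

# KDE: one normal kernel of variance 2.25 (h = 1.5) per point, summed.
h = 1.5
grid = np.linspace(-10.0, 14.0, 4801)
fhat = np.exp(-0.5 * ((grid[:, None] - x) / h)**2).sum(axis=1) \
       / (len(x) * h * np.sqrt(2 * np.pi))
```

The bin with two data points (−1.3 and −0.4) gets height 2/12; both estimates integrate to one, but only the kernel estimate is smooth.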
Bandwidth selection
The bandwidth of the kernel is a free parameter which exhibits a strong influence on the resulting estimate. To illustrate its effect, we take a simulated random sample from the standard normal distribution (plotted as the blue spikes in the rug plot on the horizontal axis). The grey curve is the true density (a normal density with mean 0 and variance 1). In comparison, the red curve is undersmoothed since it contains too many spurious data artifacts arising from using a bandwidth h = 0.05, which is too small. The green curve is oversmoothed since using the bandwidth h = 2 obscures much of the underlying structure. The black curve with a bandwidth of h = 0.337 is considered to be optimally smoothed since its density estimate is close to the true density.
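The effect described above can be reproduced numerically. A sketch, assuming a sample of size 100 (the original sample size is not stated) and measuring the integrated squared error against the true standard normal density for the three bandwidths:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(100)                  # sample from N(0, 1)
grid = np.linspace(-5.0, 5.0, 2001)
dx = grid[1] - grid[0]
true_density = np.exp(-0.5 * grid**2) / np.sqrt(2 * np.pi)

def kde(grid, data, h):
    u = (grid[:, None] - data) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

# Integrated squared error for under-, well-, and over-smoothed estimates.
ise = {h: ((kde(grid, x, h) - true_density)**2 * dx).sum()
       for h in (0.05, 0.337, 2.0)}
```

The intermediate bandwidth yields the smallest error: the small bandwidth pays in variance, the large one in bias.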
The most common optimality criterion used to select this parameter is the expected L_{2} risk function, also termed the mean integrated squared error:
 \operatorname{MISE} (h) = E \int (\hat{f}_h(x) - f(x))^2 \, dx.
Under weak assumptions on ƒ and K,^{[1]}^{[2]} MISE(h) = AMISE(h) + o(1/(nh) + h^{4}), where o is the little-o notation. The AMISE is the asymptotic MISE, which consists of the two leading terms
 \operatorname{AMISE}(h) = \frac{R(K)}{nh} + \frac{1}{4} m_2(K)^2 h^4 R(f'')
where R(g) = \int g(x)^2 \, dx for a function g, m_2(K) = \int x^2 K(x) \, dx, and ƒ'' is the second derivative of ƒ. The AMISE is minimized by setting its derivative with respect to h to zero:
 \frac{\partial}{\partial h} \operatorname{AMISE}(h) = -\frac{R(K)}{nh^2} + m_2(K)^2 h^3 R(f'') = 0
or
 h_{\operatorname{AMISE}} = \frac{ R(K)^{1/5}}{m_2(K)^{2/5}R(f'')^{1/5} n^{1/5}}.
Neither the AMISE nor the h_{AMISE} formula can be used directly, since they involve the unknown density function ƒ or its second derivative ƒ'', so a variety of automatic, data-based methods have been developed for selecting the bandwidth. Many review studies have compared their effectiveness,^{[6]}^{[7]}^{[8]}^{[9]}^{[10]} with the general consensus that the plug-in selectors^{[11]} and cross-validation selectors^{[12]}^{[13]}^{[14]} are the most useful over a wide range of data sets.
Substituting any bandwidth h which has the same asymptotic order n^{−1/5} as h_{AMISE} into the AMISE gives that AMISE(h) = O(n^{−4/5}), where O is the big o notation. It can be shown that, under weak assumptions, there cannot exist a nonparametric estimator that converges at a faster rate than the kernel estimator.^{[15]} Note that the n^{−4/5} rate is slower than the typical n^{−1} convergence rate of parametric methods.
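The n^{−4/5} rate can be checked by substituting h_{AMISE} back into the AMISE formula. A sketch using the Gaussian-kernel and standard-normal reference constants (the closed-form values of R(K), m_2(K), and R(ƒ'') below hold for that special case):

```python
import numpy as np

# Normal-reference constants for a Gaussian kernel K and f = N(0, 1):
RK = 1.0 / (2.0 * np.sqrt(np.pi))     # R(K) = integral of K(x)^2
m2 = 1.0                              # m_2(K) = integral of x^2 K(x)
Rf2 = 3.0 / (8.0 * np.sqrt(np.pi))    # R(f'') = integral of f''(x)^2

def amise(h, n):
    return RK / (n * h) + 0.25 * m2**2 * h**4 * Rf2

def h_amise(n):
    return (RK / (m2**2 * Rf2 * n))**0.2

# AMISE at the optimal bandwidth scales as n^{-4/5}: rescaling by n^{4/5}
# gives the same constant for every sample size.
scaled = [amise(h_amise(n), n) * n**0.8 for n in (100, 1000, 10000)]
```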
If the bandwidth is not held fixed, but is varied depending upon the location of either the estimate (balloon estimator) or the samples (pointwise estimator), this produces a particularly powerful method termed adaptive or variable bandwidth kernel density estimation.
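One simple variable-bandwidth scheme of the balloon type chooses, at each evaluation point, the distance to the k-th nearest sample as the bandwidth. A hypothetical sketch; the function name balloon_kde and the choice of k are illustrative, not a standard API:

```python
import numpy as np

def balloon_kde(x0, data, k=10):
    # Balloon estimator: the bandwidth at the evaluation point x0 is the
    # distance to its k-th nearest sample point (one simple adaptive rule).
    h = np.sort(np.abs(data - x0))[k - 1]
    u = (x0 - data) / h
    return np.exp(-0.5 * u**2).sum() / (len(data) * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(7)
data = rng.standard_normal(500)
f0 = balloon_kde(0.0, data, k=25)   # true density at 0 is 1/sqrt(2*pi) ~ 0.40
```

The bandwidth then shrinks automatically where data are dense and widens in sparse regions.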
Practical estimation of the bandwidth
If Gaussian basis functions are used to approximate univariate data, and the underlying density being estimated is Gaussian, then it can be shown that the optimal choice for h is^{[16]}
 h = \left(\frac{4\hat{\sigma}^5}{3n}\right)^{\frac{1}{5}} \approx 1.06 \hat{\sigma} n^{-1/5},
where \hat{\sigma} is the standard deviation of the samples. This approximation is termed the normal distribution approximation, Gaussian approximation, or Silverman's rule of thumb.
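The rule of thumb follows from plugging the normal-reference constants into h_{AMISE}; the sketch below checks that algebra numerically (sigma and n are arbitrary illustrative values):

```python
import numpy as np

# For a Gaussian kernel and a Gaussian target density N(0, sigma^2):
#   R(K) = 1/(2*sqrt(pi)),  m2(K) = 1,  R(f'') = 3/(8*sqrt(pi)*sigma^5).
# Plugging these into h_AMISE recovers Silverman's rule exactly.
sigma, n = 1.3, 200
RK = 1.0 / (2.0 * np.sqrt(np.pi))
Rf2 = 3.0 / (8.0 * np.sqrt(np.pi) * sigma**5)

h_amise = (RK / (Rf2 * n))**0.2
h_silverman = (4.0 * sigma**5 / (3.0 * n))**0.2   # ~ 1.06 * sigma * n**-0.2
```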
Relation to the characteristic function density estimator
Given the sample (x_{1}, x_{2}, …, x_{n}), it is natural to estimate the characteristic function φ(t) = E[e^{itX}] as
 \hat\varphi(t) = \frac{1}{n} \sum_{j=1}^n e^{itx_j}
Knowing the characteristic function, it is possible to find the corresponding probability density function through the Fourier transform formula. One difficulty with applying this inversion formula is that it leads to a diverging integral, since the estimate \scriptstyle\hat\varphi(t) is unreliable for large t’s. To circumvent this problem, the estimator \scriptstyle\hat\varphi(t) is multiplied by a damping function ψ_{h}(t) = ψ(ht), which is equal to 1 at the origin and then falls to 0 at infinity. The “bandwidth parameter” h controls how fast we try to dampen the function \scriptstyle\hat\varphi(t). In particular when h is small, then ψ_{h}(t) will be approximately one for a large range of t’s, which means that \scriptstyle\hat\varphi(t) remains practically unaltered in the most important region of t’s.
The most common choice for the function ψ is either the uniform function ψ(t) = 1{−1 ≤ t ≤ 1}, which effectively means truncating the interval of integration in the inversion formula to [−1/h, 1/h], or the Gaussian function ψ(t) = e^{−πt^{2}}. Once the function ψ has been chosen, the inversion formula may be applied, and the density estimator will be
 \begin{align} \hat{f}(x) &= \frac{1}{2\pi} \int_{-\infty}^{+\infty} \hat\varphi(t)\psi_h(t) e^{-itx}\,dt = \frac{1}{2\pi} \int_{-\infty}^{+\infty} \frac{1}{n} \sum_{j=1}^n e^{it(x_j-x)} \psi(ht) \, dt \\ &= \frac{1}{nh} \sum_{j=1}^n \frac{1}{2\pi} \int_{-\infty}^{+\infty} e^{-i(ht)\frac{x-x_j}{h}} \psi(ht) \, d(ht) = \frac{1}{nh} \sum_{j=1}^n K\Big(\frac{x-x_j}{h}\Big), \end{align}
where K is the Fourier transform of the damping function ψ. Thus the kernel density estimator coincides with the characteristic function density estimator.
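The equivalence can be verified numerically: damping the empirical characteristic function with a Gaussian ψ and inverting the Fourier integral by brute force reproduces the ordinary Gaussian KDE, since the Fourier transform of a Gaussian ψ is the normal density. A sketch; the t-grid limits and the bandwidth are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(200)
h = 0.4

# Empirical characteristic function on a fine t-grid, with Gaussian
# damping psi(ht) = exp(-(ht)^2 / 2).
t = np.linspace(-50.0, 50.0, 10001)
dt = t[1] - t[0]
phi_hat = np.exp(1j * t[:, None] * x).mean(axis=1)   # (1/n) sum e^{i t x_j}
psi = np.exp(-0.5 * (h * t)**2)

# Numerical Fourier inversion of the damped estimate.
grid = np.linspace(-4.0, 4.0, 81)
integrand = phi_hat[:, None] * psi[:, None] * np.exp(-1j * np.outer(t, grid))
f_cf = (integrand.sum(axis=0) * dt).real / (2.0 * np.pi)

# Direct Gaussian KDE with the same bandwidth.
u = (grid[:, None] - x) / h
f_kde = np.exp(-0.5 * u**2).sum(axis=1) / (len(x) * h * np.sqrt(2.0 * np.pi))
```

The two estimates agree up to numerical integration error.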
Statistical implementation
A non-exhaustive list of software implementations of kernel density estimators includes:
- In Analytica release 4.4, the Smoothing option for PDF results uses KDE, and from expressions it is available via the built-in Pdf function.
- In C/C++, FIGTree is a library that can be used to compute kernel density estimates using normal kernels. A MATLAB interface is available.
- In C++, libagf is a library for variable kernel density estimation.
- In CrimeStat, kernel density estimation is implemented using five different kernel functions: normal, uniform, quartic, negative exponential, and triangular. Both single- and dual-kernel density estimate routines are available. Kernel density estimation is also used in interpolating a Head Bang routine, in estimating a two-dimensional journey-to-crime density function, and in estimating a three-dimensional Bayesian journey-to-crime estimate.
- In ELKI, kernel density functions can be found in the package de.lmu.ifi.dbs.elki.math.statistics.kernelfunctions.
- In ESRI products, kernel density mapping is managed out of the Spatial Analyst toolbox and uses the quartic (biweight) kernel.
- In Excel, the Royal Society of Chemistry has created an add-in to run kernel density estimation based on their Analytical Methods Committee Technical Brief 4.
- In gnuplot, kernel density estimation is implemented by the smooth kdensity option; the datafile can contain a weight and bandwidth for each point, or the bandwidth can be set automatically^{[17]} according to "Silverman's rule of thumb" (see above).
- In Haskell, kernel density is implemented in the statistics package.
- In Java, the Weka machine learning package provides weka.estimators.KernelEstimator, among others.
- In JavaScript, the visualization package D3.js offers a KDE package in its science.stats package.
- In JMP, the Fit Y by X platform can be used to estimate univariate and bivariate kernel densities.
- In Julia, kernel density estimation is implemented in the KernelDensity.jl package.
- In MATLAB, kernel density estimation is implemented through the ksdensity function (Statistics Toolbox). This function does not provide an automatic data-driven bandwidth but uses a rule of thumb, which is optimal only when the target density is normal. A free MATLAB software package which implements an automatic bandwidth selection method^{[18]} is available from the MATLAB Central File Exchange for one-dimensional and two-dimensional data.
- In Mathematica, numeric kernel density estimation is implemented by the function SmoothKernelDistribution, and symbolic estimation is implemented using the function KernelMixtureDistribution, both of which provide data-driven bandwidths.
- In Minitab, the Royal Society of Chemistry has created a macro to run kernel density estimation based on their Analytical Methods Committee Technical Brief 4.
- In the NAG Library, kernel density estimation is implemented via the g10ba routine (available in both the Fortran^{[19]} and the C^{[20]} versions of the Library).
- In Octave, kernel density estimation is implemented by the kernel_density option (econometrics package).
- In Origin, a 2D kernel density plot can be made from its user interface, and two functions, Ksdensity for 1D and Ks2density for 2D, can be used from its LabTalk, Python, or C code.
- In Perl, an implementation can be found in the Statistics::KernelEstimation module.
- In Python, many implementations exist: SciPy (scipy.stats.gaussian_kde), Statsmodels (KDEUnivariate and KDEMultivariate), and scikit-learn (KernelDensity) (see comparison^{[21]}).
- In R, it is implemented through the density function and the bkde function in the KernSmooth library (both included in the base distribution), the kde function in the ks library, the dkden and dbckden functions in the evmix library (the latter for boundary-corrected kernel density estimation for bounded support), the npudens function in the np library (numeric and categorical data), and the sm.density function in the sm library. For an implementation that does not require installing any packages or libraries, see the kde.R function.
- In SAS, proc kde can be used to estimate univariate and bivariate kernel densities.
- In Stata, it is implemented through kdensity; for example, histogram x, kdensity. Alternatively, the free Stata module KDENS allows a user to estimate 1D or 2D density functions.
Examples
Example in MATLAB/Octave
For this example, the data are a synthetic sample of 50 points drawn from the standard normal distribution and 50 points from a normal distribution with mean 3.5 and variance 1. The automatic bandwidth selection and density estimation with normal kernels are carried out by kde.m. This function implements an automatic bandwidth selector that does not rely on the commonly used Gaussian plug-in rule-of-thumb heuristic.^{[18]}
randn('seed',8192);
x = [randn(50,1); randn(50,1)+3.5];
[h, fhat, xgrid] = kde(x, 401);
figure;
hold on;
plot(xgrid, fhat, 'linewidth', 2, 'color', 'black');
plot(x, zeros(100,1), 'b+');
xlabel('x')
ylabel('Density function')
hold off;
Example in R
This example is based on the Old Faithful geyser, a tourist attraction located in Yellowstone National Park. This famous dataset, included in the base distribution of R, contains 272 records with two variables: eruption duration and waiting time until the next eruption, both in minutes. We analyse the waiting times using the ks library since it has a wide range of visualisation options. The bandwidth function is hpi, which in turn calls the dpik function in the KernSmooth library: these functions implement the plug-in selector.^{[11]} The kernel density estimate using the normal kernel is computed using kde, which calls bkde from KernSmooth. The plot function allows the addition of the data points as a rug plot on the horizontal axis. The bimodal structure in the density estimate of the waiting times is clearly seen, in contrast to the rug plot where this structure is not apparent.
library(KernSmooth)
attach(faithful)
fhat <- bkde(x=waiting)
plot(fhat, xlab="x", ylab="Density function")
Example in Python
To demonstrate how kernel density estimation is performed in Python, we simulate some data from a mixture of normals, where 50 observations are generated from a normal distribution with mean zero and standard deviation 3 and another 50 from a normal with mean 4 and standard deviation 1.
import numpy as np
x1 = np.random.normal(0, 3, 50)
x2 = np.random.normal(4, 1, 50)
x = np.r_[x1, x2]
The gaussian_kde function from the SciPy package implements a kernel density estimate using Gaussian kernels and includes automatic determination of the bandwidth. By default, gaussian_kde uses Scott's rule to select the appropriate bandwidth.^{[22]}
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

density = gaussian_kde(x)
xgrid = np.linspace(x.min(), x.max(), 100)
plt.hist(x, bins=8, density=True)
plt.plot(xgrid, density(xgrid), 'r')
plt.show()
The plot shows a histogram of the simulated data, along with a red curve showing the Gaussian KDE.
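The default bandwidth can be overridden through the bw_method argument of gaussian_kde, which accepts 'scott', 'silverman', or a scalar that is used directly as the bandwidth factor. A short sketch:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x = np.r_[rng.normal(0, 3, 50), rng.normal(4, 1, 50)]

kde_scott = gaussian_kde(x)                         # default: Scott's rule
kde_silv = gaussian_kde(x, bw_method='silverman')   # Silverman's rule
kde_fixed = gaussian_kde(x, bw_method=0.2)          # scalar used as the factor

# For 1-D data, Scott's factor is n**(-1/5); each estimate integrates to one.
grid = np.linspace(-15.0, 15.0, 1001)
mass = kde_scott(grid).sum() * (grid[1] - grid[0])
```

A smaller factor follows the sample more closely; a larger one smooths more aggressively.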
See also
 Kernel (statistics)
 Kernel smoothing
 Kernel regression
 Density estimation (with presentation of other examples)
 Mean shift
 Scale space: the triplets {(x, h, \hat{f}_h(x)) : all x, h > 0} form a scale space representation of the data.
 Multivariate kernel density estimation
 Variable kernel density estimation
References
 ^ ^{a} ^{b} Rosenblatt, M. (1956). "Remarks on some nonparametric estimates of a density function". Annals of Mathematical Statistics 27 (3): 832–837.
 ^ ^{a} ^{b} Parzen, E. (1962). "On estimation of a probability density function and mode". Annals of Mathematical Statistics 33 (3): 1065–1076.
 ^ Epanechnikov, V.A. (1969). "Nonparametric estimation of a multivariate probability density". Theory of Probability and its Applications 14: 153–158.
 ^ Wand, M.P; Jones, M.C. (1995). Kernel Smoothing. London: Chapman & Hall/CRC.
 ^ Scott, D. (1979). "On optimal and databased histograms". Biometrika 66 (3): 605–610.
 ^ Park, B.U.; Marron, J.S. (1990). "Comparison of datadriven bandwidth selectors". Journal of the American Statistical Association 85 (409): 66–72.
 ^ Park, B.U.; Turlach, B.A. (1992). "Practical performance of several data driven bandwidth selectors (with discussion)". Computational Statistics 7: 251–270.
 ^ Cao, R.; Cuevas, A.; Manteiga, W. G. (1994). "A comparative study of several smoothing methods in density estimation". Computational Statistics and Data Analysis 17 (2): 153–176.
 ^ Jones, M.C.; Marron, J.S.; Sheather, S. J. (1996). "A brief survey of bandwidth selection for density estimation". Journal of the American Statistical Association 91 (433): 401–407.
 ^ Sheather, S.J. (1992). "The performance of six popular bandwidth selection methods on some real data sets (with discussion)". Computational Statistics 7: 225–250, 271–281.
 ^ ^{a} ^{b} Sheather, S.J.; Jones, M.C. (1991). "A reliable databased bandwidth selection method for kernel density estimation". Journal of the Royal Statistical Society, Series B 53 (3): 683–690.
 ^ Rudemo, M. (1982). "Empirical choice of histograms and kernel density estimators". Scandinavian Journal of Statistics 9 (2): 65–78.
 ^ Bowman, A.W. (1984). "An alternative method of crossvalidation for the smoothing of density estimates". Biometrika 71 (2): 353–360.
 ^ Hall, P.; Marron, J.S.; Park, B.U. (1992). "Smoothed crossvalidation". Probability Theory and Related Fields 92: 1–20.
 ^ Wahba, G. (1975). "Optimal convergence properties of variable knot, kernel, and orthogonal series methods for density estimation".
 ^ Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. London: Chapman & Hall/CRC.
 ^ Janert, Philipp K (2009). Gnuplot in action : understanding data with graphs. Connecticut, USA: Manning Publications. See section 13.2.2 entitled Kernel density estimates.
 ^ ^{a} ^{b} Botev, Z.I.; Grotowski, J.F.; Kroese, D.P. (2010). "Kernel density estimation via diffusion". Annals of Statistics 38 (5): 2916–2957.
 ^ The Numerical Algorithms Group. "NAG Library Routine Document: nagf_smooth_kerndens_gauss (g10baf)". NAG Library Manual, Mark 23. Retrieved 20120216.
 ^ The Numerical Algorithms Group. "NAG Library Routine Document: nag_kernel_density_estim (g10bac)". NAG Library Manual, Mark 9. Retrieved 20120216.
 ^ Vanderplas, Jake (20131201). "Kernel Density Estimation in Python". Retrieved 20140312.
 ^ scipy.stats.gaussian_kde, SciPy.org
External links
 Introduction to kernel density estimation A short tutorial which motivates kernel density estimators as an improvement over histograms.
 Kernel Bandwidth Optimization A free online tool that instantly generates an optimized kernel density estimate of your data.
 Free Online Software (Calculator) computes the Kernel Density Estimation for any data series according to the following Kernels: Gaussian, Epanechnikov, Rectangular, Triangular, Biweight, Cosine, and Optcosine.
 Kernel Density Estimation Applet An online interactive example of kernel density estimation. Requires .NET 3.0 or later.