deephyper.skopt.space.gaussian_kde#
- class deephyper.skopt.space.gaussian_kde(dataset, bw_method=None, weights=None)[source]#
Bases:
object
Representation of a kernel-density estimate using Gaussian kernels.
Kernel density estimation is a way to estimate the probability density function (PDF) of a random variable in a non-parametric way. gaussian_kde works for both uni-variate and multi-variate data. It includes automatic bandwidth determination. The estimation works best for a unimodal distribution; bimodal or multi-modal distributions tend to be oversmoothed.
- Parameters:
dataset (array_like) – Datapoints to estimate from. In case of univariate data this is a 1-D array, otherwise a 2-D array with shape (# of dims, # of data).
bw_method (str, scalar or callable, optional) – The method used to calculate the estimator bandwidth. This can be ‘scott’, ‘silverman’, a scalar constant or a callable. If a scalar, this will be used directly as kde.factor. If a callable, it should take a gaussian_kde instance as only parameter and return a scalar. If None (default), ‘scott’ is used. See Notes for more details.
weights (array_like, optional) – Weights of datapoints. This must be a 1-D array with one weight per datapoint (the same shape as dataset for univariate data). If None (default), the samples are assumed to be equally weighted.
- dataset#
The dataset with which gaussian_kde was initialized.
- Type:
ndarray
- factor#
The bandwidth factor, obtained from kde.covariance_factor. The square of kde.factor multiplies the covariance matrix of the data in the kde estimation.
- Type:
- covariance#
The covariance matrix of dataset, scaled by the calculated bandwidth (kde.factor).
- Type:
ndarray
- inv_cov#
The inverse of covariance.
- Type:
ndarray
- __call__()#
- covariance_factor()#
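As a rough illustration of how the attributes above relate to one another, the following sketch builds an estimator on synthetic data and checks that covariance is the data covariance multiplied by the square of factor. It imports gaussian_kde from scipy.stats, as the examples on this page do; that deephyper.skopt.space.gaussian_kde behaves identically is an assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(size=(2, 200))  # shape (# of dims, # of data)

kde = stats.gaussian_kde(data)

# The kernel covariance is the data covariance scaled by factor**2.
scaled_cov = np.cov(data) * kde.factor**2
cov_matches = np.allclose(kde.covariance, scaled_cov)

# inv_cov is the matrix inverse of covariance.
inv_matches = np.allclose(kde.inv_cov @ kde.covariance, np.eye(2))
```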
Notes
Bandwidth selection strongly influences the estimate obtained from the KDE (much more so than the actual shape of the kernel). Bandwidth selection can be done by a “rule of thumb”, by cross-validation, by “plug-in methods” or by other means; see [3], [4] for reviews. gaussian_kde uses a rule of thumb; the default is Scott’s Rule.
Scott’s Rule [1], implemented as scotts_factor, is:
n**(-1./(d+4)),
with n the number of data points and d the number of dimensions. In the case of unequally weighted points, scotts_factor becomes:
neff**(-1./(d+4)),
with neff the effective number of datapoints. Silverman’s Rule [2], implemented as silverman_factor, is:
(n * (d + 2) / 4.)**(-1. / (d + 4)),
or in the case of unequally weighted points:
(neff * (d + 2) / 4.)**(-1. / (d + 4)).
Good general descriptions of kernel density estimation can be found in [1] and [2]; the mathematics for this multi-dimensional implementation can be found in [1].
With a set of weighted samples, the effective number of datapoints neff is defined by:
neff = sum(weights)^2 / sum(weights^2)
as detailed in [5].
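The rules above can be written out directly in NumPy. This is a minimal sketch of the formulas as stated in these Notes, not the library's own code; the helper names effective_n, scott_factor and silverman_factor are hypothetical and chosen only for illustration.

```python
import numpy as np

def effective_n(weights):
    """Effective number of datapoints: sum(w)^2 / sum(w^2)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

def scott_factor(n, d):
    """Scott's Rule for n (or neff) points in d dimensions."""
    return n ** (-1.0 / (d + 4))

def silverman_factor(n, d):
    """Silverman's Rule for n (or neff) points in d dimensions."""
    return (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))

neff_equal = effective_n([1, 1, 1, 1])  # equal weights: neff equals n
neff_skewed = effective_n([1, 1, 2])    # unequal weights: neff is below n
```

With equal weights the effective count equals the number of points, so the weighted and unweighted forms of each rule agree in that case.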
gaussian_kde does not currently support data that lies in a lower-dimensional subspace of the space in which it is expressed. For such data, consider performing principal component analysis / dimensionality reduction and using gaussian_kde with the transformed data.
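One possible way to follow this advice is sketched below: 2-D points that all lie on a line are projected onto their principal components with an SVD, and only the directions with non-negligible variance are kept. The tolerance 1e-10 is an illustrative choice, and gaussian_kde is imported from scipy.stats on the assumption that the deephyper class behaves the same way.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
t = rng.normal(size=300)
data = np.vstack([t, 2.0 * t])  # 2-D points that all lie on a line

# Principal component analysis via SVD of the centred data.
centred = data - data.mean(axis=1, keepdims=True)
u, s, _ = np.linalg.svd(centred, full_matrices=False)

# Keep only directions with non-negligible variance (tolerance is illustrative).
keep = s > 1e-10 * s.max()
reduced = u[:, keep].T @ centred  # shape (# of kept dims, # of data)

# The reduced data spans its whole space, so the estimator can be built.
kde = stats.gaussian_kde(reduced)
```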
References
Examples
Generate some random two-dimensional data:
>>> import numpy as np
>>> from scipy import stats
>>> def measure(n):
...     "Measurement model, return two coupled measurements."
...     m1 = np.random.normal(size=n)
...     m2 = np.random.normal(scale=0.5, size=n)
...     return m1+m2, m1-m2

>>> m1, m2 = measure(2000)
>>> xmin = m1.min()
>>> xmax = m1.max()
>>> ymin = m2.min()
>>> ymax = m2.max()
Perform a kernel density estimate on the data:
>>> X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
>>> positions = np.vstack([X.ravel(), Y.ravel()])
>>> values = np.vstack([m1, m2])
>>> kernel = stats.gaussian_kde(values)
>>> Z = np.reshape(kernel(positions).T, X.shape)
Plot the results:
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots()
>>> ax.imshow(np.rot90(Z), cmap=plt.cm.gist_earth_r,
...           extent=[xmin, xmax, ymin, ymax])
>>> ax.plot(m1, m2, 'k.', markersize=2)
>>> ax.set_xlim([xmin, xmax])
>>> ax.set_ylim([ymin, ymax])
>>> plt.show()
Methods
covariance_factor – Computes the coefficient (kde.factor) that multiplies the data covariance matrix to obtain the kernel covariance matrix.
evaluate – Evaluate the estimated pdf on a set of points.
integrate_box – Computes the integral of a pdf over a rectangular interval.
integrate_box_1d – Computes the integral of a 1D pdf between two bounds.
integrate_gaussian – Multiply estimated density by a multivariate Gaussian and integrate over the whole space.
integrate_kde – Computes the integral of the product of this kernel density estimate with another.
logpdf – Evaluate the log of the estimated pdf on a provided set of points.
marginal – Return a marginal KDE distribution.
pdf – Evaluate the estimated pdf on a provided set of points.
resample – Randomly sample a dataset from the estimated pdf.
scotts_factor – Computes the coefficient (kde.factor) that multiplies the data covariance matrix to obtain the kernel covariance matrix.
set_bandwidth – Compute the estimator bandwidth with given method.
silverman_factor – Compute the Silverman factor.
Attributes
- __call__(points)#
Evaluate the estimated pdf on a set of points.
- Parameters:
points ((# of dimensions, # of points)-array) – Alternatively, a (# of dimensions,) vector can be passed in and treated as a single point.
- Returns:
values – The values at each point.
- Return type:
(# of points,)-array
- Raises:
ValueError – If the dimensionality of the input points is different from the dimensionality of the KDE.
- covariance_factor()#
Computes the coefficient (kde.factor) that multiplies the data covariance matrix to obtain the kernel covariance matrix. The default is scotts_factor. A subclass can override this method to provide a different method, or set it through a call to kde.set_bandwidth.
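The two routes mentioned here, subclassing and calling set_bandwidth, can be compared in a small sketch. FixedFactorKDE is a hypothetical subclass written for illustration, and gaussian_kde is imported from scipy.stats on the assumption that the deephyper class mirrors it.

```python
import numpy as np
from scipy import stats

class FixedFactorKDE(stats.gaussian_kde):
    """Hypothetical subclass that always uses a constant bandwidth factor."""

    def covariance_factor(self):
        return 0.25

x = np.array([-7.0, -5.0, 1.0, 4.0, 5.0])

# Route 1: the subclass supplies its own covariance_factor.
fixed = FixedFactorKDE(x)

# Route 2: a plain instance is switched to the same constant factor.
default = stats.gaussian_kde(x)
default.set_bandwidth(bw_method=0.25)
```

Both estimators should then report the same factor and produce the same density values.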
- evaluate(points)[source]#
Evaluate the estimated pdf on a set of points.
- Parameters:
points ((# of dimensions, # of points)-array) – Alternatively, a (# of dimensions,) vector can be passed in and treated as a single point.
- Returns:
values – The values at each point.
- Return type:
(# of points,)-array
- Raises:
ValueError – If the dimensionality of the input points is different from the dimensionality of the KDE.
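The accepted input shapes can be illustrated with a short sketch on a 2-D estimator. It uses scipy.stats.gaussian_kde, as the examples on this page do; identical behaviour of the deephyper class is assumed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
kde = stats.gaussian_kde(rng.normal(size=(2, 100)))  # 2-D KDE

many = kde.evaluate(np.zeros((2, 5)))  # five points, result has shape (5,)
single = kde.evaluate(np.zeros(2))     # one point given as a (2,) vector

# Points with the wrong dimensionality are rejected.
try:
    kde.evaluate(np.zeros((3, 5)))
    raised = False
except ValueError:
    raised = True
```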
- integrate_box(low_bounds, high_bounds, maxpts=None)[source]#
Computes the integral of a pdf over a rectangular interval.
- Parameters:
low_bounds (array_like) – A 1-D array containing the lower bounds of integration.
high_bounds (array_like) – A 1-D array containing the upper bounds of integration.
maxpts (int, optional) – The maximum number of points to use for integration.
- Returns:
value – The result of the integral.
- Return type:
scalar
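A minimal usage sketch for a 2-D estimator follows, using scipy.stats.gaussian_kde on the assumption that the deephyper class behaves the same way. The bounds are arbitrary illustrative values; a box much larger than the data range should contain nearly all of the probability mass.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
kde = stats.gaussian_kde(rng.normal(size=(2, 200)))

# Probability mass inside the unit square centred on the origin.
p_box = kde.integrate_box(np.array([-0.5, -0.5]), np.array([0.5, 0.5]))

# A box far larger than the data range captures almost all of the mass.
p_all = kde.integrate_box(np.array([-50.0, -50.0]), np.array([50.0, 50.0]))
```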
- integrate_box_1d(low, high)[source]#
Computes the integral of a 1D pdf between two bounds.
- Parameters:
low (scalar) – Lower bound of integration.
high (scalar) – Upper bound of integration.
- Returns:
value – The result of the integral.
- Return type:
scalar
- Raises:
ValueError – If the KDE is over more than one dimension.
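As a sanity check, the result of integrate_box_1d can be compared with a numerical integral of the estimated pdf over the same interval. The sketch below uses scipy.stats.gaussian_kde (assumed equivalent to the deephyper class) and a hand-written trapezoidal rule.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
kde = stats.gaussian_kde(rng.normal(size=500))

p = kde.integrate_box_1d(-1.0, 1.0)

# Cross-check with the trapezoidal rule applied to the estimated pdf.
xs = np.linspace(-1.0, 1.0, 2001)
dx = xs[1] - xs[0]
ys = kde(xs)
p_numeric = np.sum((ys[:-1] + ys[1:]) / 2.0) * dx

# Integrating over the whole real line gives the total mass.
total = kde.integrate_box_1d(-np.inf, np.inf)
```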
- integrate_gaussian(mean, cov)[source]#
Multiply estimated density by a multivariate Gaussian and integrate over the whole space.
- Parameters:
mean (array_like) – A 1-D array, specifying the mean of the Gaussian.
cov (array_like) – A 2-D array, specifying the covariance matrix of the Gaussian.
- Returns:
result – The value of the integral.
- Return type:
scalar
- Raises:
ValueError – If the mean or covariance of the input Gaussian differs from the KDE’s dimensionality.
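In one dimension the integral has a simple closed form: each kernel contributes a normal density whose variance is the kernel variance plus cov, evaluated at the datapoint, and the contributions are averaged over the (equally weighted) datapoints. The sketch below checks this with scipy.stats.gaussian_kde, which is assumed to match the deephyper class; the data, mean and cov values are arbitrary.

```python
import numpy as np
from scipy import stats

x = np.array([-1.0, 0.0, 2.0])
kde = stats.gaussian_kde(x)

mean = np.array([0.5])
cov = np.array([[1.5]])
value = kde.integrate_gaussian(mean, cov)

# Closed form for equally weighted 1-D data: average of normal densities
# with variance (kernel variance + cov), evaluated at each datapoint.
total_var = kde.covariance[0, 0] + cov[0, 0]
expected = np.mean(stats.norm.pdf(x, loc=mean[0], scale=np.sqrt(total_var)))
```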
- integrate_kde(other)[source]#
Computes the integral of the product of this kernel density estimate with another.
- Parameters:
other (gaussian_kde instance) – The other kde.
- Returns:
value – The result of the integral.
- Return type:
scalar
- Raises:
ValueError – If the KDEs have different dimensionality.
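The following sketch computes the overlap integral of two 1-D estimators and cross-checks it against a numerical integral of the product of their pdfs on a wide grid. It relies on scipy.stats.gaussian_kde, assumed equivalent to the deephyper class; the grid limits are illustrative and chosen to extend well beyond the data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
kde_a = stats.gaussian_kde(rng.normal(loc=0.0, size=200))
kde_b = stats.gaussian_kde(rng.normal(loc=1.0, size=300))

overlap = kde_a.integrate_kde(kde_b)

# Cross-check by summing the product of the two pdfs on a wide, fine grid.
xs = np.linspace(-15.0, 15.0, 6001)
dx = xs[1] - xs[0]
overlap_numeric = np.sum(kde_a(xs) * kde_b(xs)) * dx
```

The integral is symmetric, so swapping the roles of the two estimators should give the same value.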
- marginal(dimensions)[source]#
Return a marginal KDE distribution.
- Parameters:
dimensions (int or 1-d array_like) – The dimensions of the multivariate distribution corresponding with the marginal variables, that is, the indices of the dimensions that are being retained. The other dimensions are marginalized out.
- Returns:
marginal_kde – An object representing the marginal distribution.
- Return type:
Notes
Added in version 1.10.0.
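A short sketch of marginalizing a 3-D estimator down to two retained dimensions follows. It uses scipy.stats.gaussian_kde from a SciPy release that includes marginal; that the deephyper class exposes the same behaviour is an assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
data = rng.normal(size=(3, 250))
kde = stats.gaussian_kde(data)

# Keep dimensions 0 and 2; dimension 1 is marginalized out.
marg = kde.marginal([0, 2])
```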
- pdf(x)[source]#
Evaluate the estimated pdf on a provided set of points.
Notes
This is an alias for gaussian_kde.evaluate. See the evaluate docstring for more details.
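The alias relationship, and the consistency of logpdf with pdf, can be checked in a few lines. The sketch uses scipy.stats.gaussian_kde on the assumption that the deephyper class matches it.

```python
import numpy as np
from scipy import stats

kde = stats.gaussian_kde(np.array([-2.0, 0.0, 0.5, 3.0]))
xs = np.linspace(-3.0, 4.0, 8)

# pdf and evaluate return the same values.
same = np.allclose(kde.pdf(xs), kde.evaluate(xs))

# logpdf agrees with the logarithm of pdf.
log_consistent = np.allclose(kde.logpdf(xs), np.log(kde.pdf(xs)))
```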
- resample(size=None, seed=None)[source]#
Randomly sample a dataset from the estimated pdf.
- Parameters:
size (int, optional) – The number of samples to draw. If not provided, then the size is the same as the effective number of samples in the underlying dataset.
seed ({None, int, numpy.random.Generator, numpy.random.RandomState}, optional) – If seed is None (or np.random), the numpy.random.RandomState singleton is used. If seed is an int, a new RandomState instance is used, seeded with seed. If seed is already a Generator or RandomState instance then that instance is used.
- Returns:
resample – The sampled dataset.
- Return type:
(self.d, size) ndarray
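The sketch below draws a small sample from a 2-D estimator and shows that passing the same integer seed twice reproduces the draw. It uses scipy.stats.gaussian_kde, assumed to behave like the deephyper class; the seed value is arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
kde = stats.gaussian_kde(rng.normal(size=(2, 100)))

a = kde.resample(size=5, seed=123)
b = kde.resample(size=5, seed=123)  # same integer seed, same draws
```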
- scotts_factor()[source]#
Computes the coefficient (kde.factor) that multiplies the data covariance matrix to obtain the kernel covariance matrix. The default is scotts_factor. A subclass can override this method to provide a different method, or set it through a call to kde.set_bandwidth.
- set_bandwidth(bw_method=None)[source]#
Compute the estimator bandwidth with given method.
The new bandwidth calculated after a call to set_bandwidth is used for subsequent evaluations of the estimated density.
- Parameters:
bw_method (str, scalar or callable, optional) – The method used to calculate the estimator bandwidth. This can be ‘scott’, ‘silverman’, a scalar constant or a callable. If a scalar, this will be used directly as kde.factor. If a callable, it should take a gaussian_kde instance as only parameter and return a scalar. If None (default), nothing happens; the current kde.covariance_factor method is kept.
Notes
Added in version 0.11.
Examples
>>> import numpy as np
>>> import scipy.stats as stats
>>> x1 = np.array([-7, -5, 1, 4, 5.])
>>> kde = stats.gaussian_kde(x1)
>>> xs = np.linspace(-10, 10, num=50)
>>> y1 = kde(xs)
>>> kde.set_bandwidth(bw_method='silverman')
>>> y2 = kde(xs)
>>> kde.set_bandwidth(bw_method=kde.factor / 3.)
>>> y3 = kde(xs)

>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots()
>>> ax.plot(x1, np.full(x1.shape, 1 / (4. * x1.size)), 'bo',
...         label='Data points (rescaled)')
>>> ax.plot(xs, y1, label='Scott (default)')
>>> ax.plot(xs, y2, label='Silverman')
>>> ax.plot(xs, y3, label='Const (1/3 * Silverman)')
>>> ax.legend()
>>> plt.show()