Conditional Density Estimation

Conditional density estimation is similar to kernel density estimation. In kernel density estimation we are given some data \(X\) and we would like to estimate the density of that data with a function \(f(X)\). In conditional density estimation we are also given some data \(X\), but we would like to estimate the density of some unknown \(y\) given \(X\) with a function \(f(y|X)\). This project implements conditional density estimation using neural networks as described by Ambrogioni et al. [AGGM17].

The density estimate in this model is given by:

\[p(y|X) = \sum_{i=1}^{K} \phi_{i} \mathcal{N}(y|\mu_{i}, \sigma_{i}^{2}, X)\]

Here \(K\) is the number of Normal distributions in the mixture, \(\phi_{i}\) is the weight of the \(i\)th distribution, \(\mu_{i}\) is its mean, and \(\sigma_{i}^{2}\) is its variance.
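To make the formula concrete, here is a minimal sketch that evaluates the mixture density for one value of \(y\) in plain Python. The helper names `normal_pdf` and `mixture_density` and the numbers below are made up for illustration; they are not part of this project. In the real model the weights and variances come from the network for a given \(X\).

```python
import math


def normal_pdf(y, mu, var):
    """Density of a Normal distribution with mean mu and variance var at y."""
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_density(y, weights, centers, variances):
    """Weighted sum of K Normal densities, one per kernel center."""
    return sum(
        w * normal_pdf(y, mu, var)
        for w, mu, var in zip(weights, centers, variances)
    )


# Illustrative values for K = 3 (assumed, not taken from the project).
weights = [0.2, 0.5, 0.3]      # phi_i, must sum to 1
centers = [-1.0, 0.0, 2.0]     # mu_i, fixed in advance
variances = [0.5, 1.0, 0.25]   # sigma_i^2, must be positive

density = mixture_density(0.0, weights, centers, variances)
```

Because the weights sum to 1 and each kernel is a proper density, the mixture also integrates to 1 over \(y\).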

\(K\) and \(\mu\) are hyperparameters of the model and must be given in advance. The project provides two methods for generating the kernel centers (\(\mu\)) for a given value of \(K\) using the function make_centers(). The first method splits the target region into uniform intervals and uses the split points as the mean values (i.e., kernel centers). The alternate (and default) method uses the Jenks algorithm to find a partition that fits the distribution of the data more closely.
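The uniform method can be sketched in a few lines. This is not the project's make_centers() function: `uniform_centers` is a hypothetical helper, and it assumes that the split points include both ends of the observed target range.

```python
def uniform_centers(targets, k):
    """Return k evenly spaced kernel centers spanning the range of targets.

    Assumption: the endpoints of the target range count as centers.
    """
    lo, hi = min(targets), max(targets)
    if k == 1:
        # A single kernel is placed in the middle of the range.
        return [(lo + hi) / 2.0]
    step = (hi - lo) / (k - 1)
    return [lo + i * step for i in range(k)]


# Illustrative targets spanning 0 to 4, with K = 5.
centers = uniform_centers([0.0, 1.5, 4.0, 2.2, 3.1], 5)
```

Evenly spaced centers ignore where the targets actually cluster, which is the motivation for a data-driven partition such as Jenks.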

The weights and variances will be learned by the network using the log loss:

\[\mathcal{L}(y|X) = -\log(p(y|X))\]

This is implemented in the loss() function.
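As a rough illustration of the quantity being minimized, the sketch below computes the negative log-likelihood of a single observation in plain Python. It is not the project's loss() function, which operates on network outputs; `negative_log_likelihood` is a hypothetical name, and the log-sum-exp step is my own choice to avoid underflow when \(y\) is far from every center. It assumes all weights and variances are strictly positive.

```python
import math


def negative_log_likelihood(y, weights, centers, variances):
    """-log p(y|X) for one observation under the kernel mixture."""
    # Log of each weighted kernel density: log(phi_i) + log N(y | mu_i, sigma_i^2).
    log_terms = [
        math.log(w) - 0.5 * math.log(2.0 * math.pi * var) - 0.5 * (y - mu) ** 2 / var
        for w, mu, var in zip(weights, centers, variances)
    ]
    # Log-sum-exp: subtract the largest term before exponentiating.
    largest = max(log_terms)
    log_density = largest + math.log(sum(math.exp(t - largest) for t in log_terms))
    return -log_density


# Illustrative values for K = 2 (assumed, not taken from the project).
near = negative_log_likelihood(0.1, [0.5, 0.5], [0.0, 3.0], [1.0, 1.0])
far = negative_log_likelihood(10.0, [0.5, 0.5], [0.0, 3.0], [1.0, 1.0])
```

An observation close to a kernel center gives a small loss, and one far from all centers gives a large loss.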

Network Architecture

There are many ways a network could be constructed given this formulation of the problem, but any architecture must have the following properties. First, it must take an arbitrary vector of inputs (as any network must). Second, the output of the network must have exactly \(2 \cdot K\) elements. The first \(K\) elements represent the weights of the kernels and the last \(K\) represent their variances.

_images/kmn_with_bandwidths.svg

Fig. 1 Suggested model architecture.

For reasons explained in [NiWe95], I suggest an architecture similar to the one given in Fig. 1. The input vector is shown at the bottom in yellow. The hidden layers are then split into two paths. On the left, in orange, are the layers for learning the kernel weights. This path can be arbitrarily deep, but its final layer must use a softmax activation to ensure that the weights sum to 1.

The right side of the network, in blue, contains the layers for learning the variance values. Its first hidden layer uses both the input data and the first hidden layer from the weights side. This provides some additional flexibility in learning the variances. This side can also be arbitrarily deep, but the final layer must use an activation that guarantees values greater than 0. There are several possibilities, but a softplus activation seems to work well in practice.

The outputs of these two sides are then concatenated into a final output layer that can be passed to the loss() function.
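The two-path layout described above could be written with the Keras functional API roughly as follows. This is a sketch rather than the project's own model code: `build_kmn`, the layer names, the single hidden layer per path, the ReLU activations, and the layer width are all assumptions made for illustration.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers


def build_kmn(n_features, n_kernels, hidden_units=32):
    """Two-path network whose output holds K weights followed by K variances."""
    inputs = keras.Input(shape=(n_features,), name="inputs")

    # Weights path: softmax makes the K kernel weights sum to 1.
    w_hidden = layers.Dense(hidden_units, activation="relu", name="weights_hidden")(inputs)
    weights = layers.Dense(n_kernels, activation="softmax", name="kernel_weights")(w_hidden)

    # Variance path: its first hidden layer sees the inputs and the first
    # hidden layer of the weights path; softplus keeps the outputs positive.
    v_inputs = layers.Concatenate(name="variance_inputs")([inputs, w_hidden])
    v_hidden = layers.Dense(hidden_units, activation="relu", name="variance_hidden")(v_inputs)
    variances = layers.Dense(n_kernels, activation="softplus", name="kernel_variances")(v_hidden)

    # Final output: 2 * K elements, weights first and variances second.
    outputs = layers.Concatenate(name="kmn_output")([weights, variances])
    return keras.Model(inputs=inputs, outputs=outputs, name="kmn_sketch")


# Illustrative sizes: 4 input features and K = 5 kernels.
model = build_kmn(n_features=4, n_kernels=5)
```

Either path can be made deeper by adding more Dense layers before its final layer, as long as the softmax and softplus output activations stay in place.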

For a complete example, continue to the demo application.

Alternative Implementations

An alternate implementation is described in this blog post. That implementation seemed overly complex for my purposes and relies on the Bayesian neural network toolkit Edward, which led me to develop this implementation using standard TensorFlow and Keras functionality.

References

AGGM17

Luca Ambrogioni, Umut Güçlü, Marcel A. J. van Gerven, Eric Maris. 2017. The Kernel Mixture Network: A Nonparametric Method for Conditional Density Estimation of Continuous Random Variables. arXiv:1705.07111. Retrieved from https://arxiv.org/abs/1705.07111

NiWe95

David A. Nix and Andreas S. Weigend. 1994. Learning Local Error Bars for Nonlinear Regression. In Advances in Neural Information Processing Systems 7, pages 489–496.