A Contrastive Divergence for Combining Variational Inference and MCMC We develop a method to combine Markov chain Monte Carlo (MCMC) and variational inference (VI), leveraging the advantages of both inference approaches. Specifically, we improve the variational distribution by running a few MCMC steps. To make inference tractable, we introduce the variational contrastive divergence (VCD), a new divergence that replaces the standard Kullback-Leibler (KL) divergence used in VI. The VCD captures a notion of discrepancy between the initial variational distribution and its improved version (obtained after running the MCMC steps), and it converges asymptotically to the symmetrized KL divergence between the variational distribution and the posterior of interest. The VCD objective can be optimized efficiently with respect to the variational parameters via stochastic optimization. We show experimentally that optimizing the VCD leads to better predictive performance on two latent variable models: logistic matrix factorization and variational autoencoders (VAEs). Calibrated Approximate Bayesian Inference We give a general purpose computational framework for estimating the bias in coverage resulting from making approximations in Bayesian inference. Coverage is the probability credible sets cover true parameter values. We show how to estimate the actual coverage an approximation scheme achieves when the ideal observation model and the prior can be simulated, but have been replaced, in the Monte Carlo, with approximations as they are intractable. Coverage estimation procedures given in Lee et al. (2018) work well on simple problems, but are biased, and do not scale well, as those authors note. For example, the methods of Lee et al. (2018) fail for calibration of an approximate completely collapsed MCMC algorithm for partition structure in a Dirichlet process for clustering group labels in a hierarchical model. By exploiting the symmetry of the coverage error under permutation of low level group labels and smoothing with Bayesian Additive Regression Trees, we are able to show that the original approximate inference had poor coverage and should not be trusted. Moment-Based Variational Inference for Markov Jump Processes We propose moment-based variational inference as a flexible framework for approximate smoothing of latent Markov jump processes. The main ingredient of our approach is to partition the set of all transitions of the latent process into classes. This allows to express the Kullback-Leibler divergence from the approximate to the posterior process in terms of a set of moment functions that arise naturally from the chosen partition. To illustrate possible choices of the partition, we consider special classes of jump processes that frequently occur in applications. We then extend the results to latent parameter inference and demonstrate the method on several examples. Understanding MCMC Dynamics as Flows on the Wasserstein Space It is known that the Langevin dynamics used in MCMC is the gradient flow of the KL divergence on the Wasserstein space, which helps convergence analysis and inspires recent particle-based variational inference methods (ParVIs). But no more MCMC dynamics is understood in this way. In this work, by developing novel concepts, we propose a theoretical framework that recognizes a general MCMC dynamics as the fiber-gradient Hamiltonian flow on the Wasserstein space of a fiber-Riemannian Poisson manifold. The conservation + convergence'' structure of the flow gives a clear picture on the behavior of general MCMC dynamics. We analyse existing MCMC instances under the framework. The framework also enables ParVI simulation of MCMC dynamics, which enriches the ParVI family with more efficient dynamics, and also adapts ParVI advantages to MCMCs. We develop two ParVI methods for a particular MCMC dynamics and demonstrate the benefits in experiments. LR-GLM: High-Dimensional Bayesian Inference Using Low-Rank Data Approximations Due to the ease of modern data collection, practitioners often face a large collection of covariates and the need to understand their relation to some response. Generalized linear models (GLMs) offer a particularly interpretable framework for this analysis. In the high-dimensional case without an overwhelming amount of data per parameter, we expect uncertainty to be non-trivial; a Bayesian approach allows coherent quantification of this uncertainty. Unfortunately existing methods for Bayesian inference in GLMs require running times roughly cubic in parameter dimension, thus limiting their applicability in increasingly widespread settings with tens of thousands of parameters. We propose to reduce time and memory costs with a low-rank approximation of the data. We show that our method, which we call LR-GLM, still provides a full Bayesian posterior approximation and admits running time reduced by a full factor of the parameter dimension. We theoretically establish the quality of our approximation via interpretable error bounds and show how the choice of rank allows a tunable computational-statistical trade-off. Experiments support our theory and demonstrate the efficacy of LR-GLM in on real, large-scale datasets. Amortized Monte Carlo Integration Current approaches to amortizing Bayesian inference focus solely on approximating the posterior distribution. Typically, this approximation is, in turn, used to calculate expectations for one or more target functions---a computational pipeline which is inefficient when the target function(s) are known upfront. In this paper, we address this inefficiency by introducing AMCI, a method for amortizing Monte Carlo integration directly. AMCI operates similarly to amortized inference but produces three distinct amortized proposals, each tailored to a different component of the overall expectation calculation. At run-time, samples are produced separately from each amortized proposal, before being combined to an overall estimate of the expectation. We show that while existing approaches are fundamentally limited in the level of accuracy they can achieve, AMCI can theoretically produce arbitrarily small errors for any integrable target function using only a single sample from each proposal at run-time. Furthermore, AMCI allows not only for amortizing over datasets but also amortizing over target functions. Stein Point Markov Chain Monte Carlo An important task in machine learning and statistics is the approximation of a probability measure by an empirical measure supported on a discrete point set. Stein Points are a class of algorithms for this task, which proceed by sequentially minimising a Stein discrepancy between the empirical measure and the target and, hence, require the solution of a non-convex optimisation problem to obtain each new point. This paper removes the need to solve this optimisation problem by, instead, selecting each new point based on a Markov chain sample path. This significantly reduces the computational cost of Stein Points and leads to a suite of algorithms that are straightforward to implement. The new algorithms are illustrated on a set of challenging Bayesian inference problems, and rigorous theoretical guarantees of consistency are established. Fast and Simple Natural-Gradient Variational Inference with Mixture of Exponential-family Approximations Natural-gradient methods enable fast and simple algorithms for variational inference, but due to computational difficulties, their use is mostly limited to minimal exponential-family (EF) approximations. In this paper, we extend the application of natural-gradient methods to estimate structured approximations such as mixture of EF distribution. Such approximations can fit complex, multimodal posterior distributions and are generally more accurate than unimodal EF approximations. By using a minimal conditional-EF representation of such approximations, we derive simple natural-gradient updates. Our empirical results demonstrate a faster convergence of our natural-gradient method compared to black-box gradient-based methods. Our work expands the scope of natural gradients for Bayesian inference and makes them more widely applicable than before. Particle Flow Bayes' Rule We present a particle flow realization of Bayes' rule, where an ODE-based neural operator is used to transport particles from a prior to its posterior after a new observation. We prove that such an ODE operator exists and its neural parameterization can be trained in a meta-learning framework, allowing this operator to reason about the effect of an individual observation on the posterior, and thus generalize across different priors, observations and to online Bayesian inference. We demonstrated the generalization ability of our particle flow Bayes operator in several canonical and high dimensional examples. Correlated Variational Auto-Encoders Variational Auto-Encoders (VAEs) are capable of learning latent representations for high dimensional data. However, due to the i.i.d. assumption, VAEs only optimize the singleton variational distributions and fail to account for the correlations between data points, which might be crucial for learning latent representations from dataset where a priori we know correlations exist. We propose Correlated Variational Auto-Encoders (CVAEs) that can take the correlation structure into consideration when learning latent representations with VAEs. CVAEs apply a prior based on the correlation structure. To address the intractability introduced by the correlated prior, we develop an approximation by average of a set of tractable lower bounds over all maximal acyclic subgraphs of the undirected correlation graph. Experimental results on matching and link prediction on public benchmark rating datasets and spectral clustering on a synthetic dataset show the effectiveness of the proposed method over baseline algorithms.