
Counterfactual Mean Embedding: A Kernel Method for Nonparametric Causal Inference
Authors:
Krikamol Muandet,
Motonobu Kanagawa,
Sorawit Saengkyongam,
Sanparith Marukatat
Abstract:
This paper introduces a novel Hilbert space representation of a counterfactual distribution, called the counterfactual mean embedding (CME), with applications in nonparametric causal inference. Counterfactual prediction has become a ubiquitous tool in machine learning applications, such as online advertisement, recommendation systems, and medical diagnosis, whose performance relies on certain interventions. To infer the outcomes of such interventions, we propose to embed the associated counterfactual distribution into a reproducing kernel Hilbert space (RKHS) endowed with a positive definite kernel. Under appropriate assumptions, the CME allows us to perform causal inference over the entire landscape of the counterfactual distribution. The CME can be estimated consistently from observational data without requiring any parametric assumption about the underlying distributions. We also derive a rate of convergence which depends on the smoothness of the conditional mean and the Radon-Nikodym derivative of the underlying marginal distributions. Our framework can deal not only with real-valued outcomes, but potentially also with more complex and structured outcomes such as images, sequences, and graphs. Lastly, our experimental results on off-policy evaluation tasks demonstrate the advantages of the proposed estimator.
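To make the construction concrete, here is a minimal, hypothetical sketch (not the paper's exact estimator or data): conditional-mean-embedding weights are learned from logged covariate/outcome pairs and then averaged over covariates from the target regime, giving RKHS coefficients for the counterfactual outcome distribution. The kernel, bandwidth, regularization, and toy data below are all arbitrary illustrative choices.

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    # Gaussian RBF Gram matrix between the rows of A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
n, m, lam = 200, 150, 1e-2

# Logged (observational) regime: covariates X with outcomes Y = X + noise
X = rng.normal(size=(n, 1))
Y = X + 0.1 * rng.normal(size=(n, 1))

# Covariates drawn from the target (counterfactual) regime
X_t = rng.normal(loc=1.0, size=(m, 1))

K = rbf(X, X)
# Conditional-mean-embedding weights, averaged over the target covariates;
# the counterfactual embedding is represented as sum_i beta_i * l(., y_i)
W = np.linalg.solve(K + n * lam * np.eye(n), rbf(X, X_t))
beta = W.mean(axis=1)

# Plug-in estimate of the counterfactual expectation E[Y'] (close to 1 for
# this toy data, up to regularization-induced shrinkage)
est = float(beta @ Y.ravel())
```

Pairing the weights `beta` with evaluations of any RKHS function of the outcome yields plug-in estimates of other counterfactual quantities in the same way.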
Submitted 22 May, 2018; originally announced May 2018.

Eigendecompositions of Transfer Operators in Reproducing Kernel Hilbert Spaces
Authors:
Stefan Klus,
Ingmar Schuster,
Krikamol Muandet
Abstract:
Transfer operators such as the Perron-Frobenius or Koopman operator play an important role in the global analysis of complex dynamical systems. The eigenfunctions of these operators can be used to detect metastable sets, to project the dynamics onto the dominant slow processes, or to separate superimposed signals. We extend transfer operator theory to reproducing kernel Hilbert spaces and show that these operators are related to Hilbert space representations of conditional distributions, known as conditional mean embeddings in the machine learning community. Moreover, numerical methods to compute empirical estimates of these embeddings are akin to data-driven methods for the approximation of transfer operators such as extended dynamic mode decomposition and its variants. One main benefit of the presented kernel-based approaches is that these methods can be applied to any domain where a similarity measure given by a kernel is available. We illustrate the results with the aid of guiding examples and highlight potential applications in molecular dynamics as well as video and text data analysis.
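As a small illustration in the spirit of kernel-based EDMD (a sketch, not the paper's exact method; dynamics, kernel, and regularization are toy choices): for snapshot pairs $(x_i, y_i)$ one forms Gram matrices $G_{ij}=k(x_i,x_j)$ and $A_{ij}=k(y_i,x_j)$ and solves a regularized eigenvalue problem, whose leading eigenvalues approximate those of the underlying dynamics.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
J = np.diag([0.9, 0.5])                         # true linear dynamics x_{t+1} = J x_t

X = rng.normal(size=(n, 2))                     # snapshots x_i
Y = X @ J.T + 0.01 * rng.normal(size=(n, 2))    # one-step images y_i

k = lambda A, B: A @ B.T                        # linear kernel (any kernel works)
G, A = k(X, X), k(Y, X)                         # G_ij = k(x_i,x_j), A_ij = k(y_i,x_j)

# Empirical (Koopman) eigenvalues from the regularized eigenproblem
eigvals = np.linalg.eigvals(np.linalg.solve(G + 1e-6 * np.eye(n), A))
top = np.sort(eigvals.real)[::-1][:2]           # should be close to 0.9 and 0.5
```

The same code runs unchanged with a nonlinear kernel, which is the main practical appeal of the kernel formulation.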
Submitted 16 May, 2018; v1 submitted 5 December, 2017; originally announced December 2017.

Design and Analysis of the NIPS 2016 Review Process
Authors:
Nihar B. Shah,
Behzad Tabibian,
Krikamol Muandet,
Isabelle Guyon,
Ulrike von Luxburg
Abstract:
Neural Information Processing Systems (NIPS) is a top-tier annual conference in machine learning. The 2016 edition of the conference comprised more than 2,400 paper submissions, 3,000 reviewers, and 8,000 attendees. This represents a growth of nearly 40% in terms of submissions, 96% in terms of reviewers, and over 100% in terms of attendees as compared to the previous year. The massive scale as well as rapid growth of the conference calls for a thorough quality assessment of the peer-review process and novel means of improvement. In this paper, we analyze several aspects of the data collected during the review process, including an experiment investigating the efficacy of collecting ordinal rankings from reviewers. Our goal is to check the soundness of the review process, and provide insights that may be useful in the design of the review process of subsequent conferences.
Submitted 23 April, 2018; v1 submitted 31 August, 2017; originally announced August 2017.

Kernel Mean Embedding of Distributions: A Review and Beyond
Authors:
Krikamol Muandet,
Kenji Fukumizu,
Bharath Sriperumbudur,
Bernhard Schölkopf
Abstract:
A Hilbert space embedding of a distribution (in short, a kernel mean embedding) has recently emerged as a powerful tool for machine learning and inference. The basic idea behind this framework is to map distributions into a reproducing kernel Hilbert space (RKHS) in which the whole arsenal of kernel methods can be extended to probability measures. It can be viewed as a generalization of the original "feature map" common to support vector machines (SVMs) and other kernel methods. While initially closely associated with the latter, it has meanwhile found application in fields ranging from kernel machines and probabilistic modeling to statistical inference, causal discovery, and deep learning. The goal of this survey is to give a comprehensive review of existing work and recent advances in this research area, and to discuss the most challenging issues and open problems that could lead to new research directions. The survey begins with a brief introduction to the RKHS and positive definite kernels, which form the backbone of this survey, followed by a thorough discussion of the Hilbert space embedding of marginal distributions, theoretical guarantees, and a review of its applications. The embedding of distributions enables us to apply RKHS methods to probability measures, which prompts a wide range of applications such as kernel two-sample testing, independence testing, and learning on distributional data. Next, we discuss the Hilbert space embedding for conditional distributions, give theoretical insights, and review some applications. The conditional mean embedding enables us to perform the sum rule, product rule, and Bayes' rule (which are ubiquitous in graphical models, probabilistic inference, and reinforcement learning) in a nonparametric way. We then discuss relationships between this framework and other related areas. Lastly, we give some suggestions on future research directions.
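The two core operations above, embedding a distribution and comparing two embeddings, can be sketched in a few lines: the empirical embedding is $\hat{\mu}_P=\frac{1}{n}\sum_i k(\cdot,x_i)$, and the squared RKHS distance between two embeddings (the squared MMD used in two-sample testing) expands into Gram-matrix averages. Data and bandwidth below are toy choices.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    # Gaussian RBF Gram matrix between the rows of A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(X, Y, gamma=0.5):
    # Squared MMD = ||mu_P - mu_Q||_H^2, expanded into Gram-matrix means
    return rbf(X, X, gamma).mean() - 2 * rbf(X, Y, gamma).mean() + rbf(Y, Y, gamma).mean()

rng = np.random.default_rng(0)
P  = rng.normal(0.0, 1.0, size=(500, 1))
Q1 = rng.normal(0.0, 1.0, size=(500, 1))   # same distribution as P
Q2 = rng.normal(2.0, 1.0, size=(500, 1))   # shifted distribution

same, diff = mmd2(P, Q1), mmd2(P, Q2)      # small for Q1, clearly nonzero for Q2
```

With a characteristic kernel the population MMD vanishes only when the two distributions coincide, which is what makes this statistic usable for two-sample testing.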
Submitted 25 January, 2017; v1 submitted 31 May, 2016; originally announced May 2016.

Minimax Estimation of Kernel Mean Embeddings
Authors:
Ilya Tolstikhin,
Bharath Sriperumbudur,
Krikamol Muandet
Abstract:
In this paper, we study the minimax estimation of the Bochner integral $$\mu_k(P):=\int_{\mathcal{X}} k(\cdot,x)\,dP(x),$$ also called the kernel mean embedding, based on random samples drawn i.i.d.~from $P$, where $k:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$ is a positive definite kernel. Various estimators (including the empirical estimator), $\hat{\theta}_n$, of $\mu_k(P)$ are studied in the literature, all of which satisfy $\|\hat{\theta}_n-\mu_k(P)\|_{\mathcal{H}_k}=O_P(n^{-1/2})$, with $\mathcal{H}_k$ being the reproducing kernel Hilbert space induced by $k$. The main contribution of the paper is to show that the above rate of $n^{-1/2}$ is minimax in the $\|\cdot\|_{\mathcal{H}_k}$ and $\|\cdot\|_{L^2(\mathbb{R}^d)}$ norms over the class of discrete measures and the class of measures that have an infinitely differentiable density, with $k$ being a continuous translation-invariant kernel on $\mathbb{R}^d$. The interesting aspect of this result is that the minimax rate is independent of the smoothness of the kernel and the density of $P$ (if it exists). This result has practical consequences in statistical applications, as the mean embedding has been widely employed in nonparametric hypothesis testing, density estimation, causal inference, and feature selection, through its relation to energy distance (and distance covariance).
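The $O_P(n^{-1/2})$ behavior of the empirical estimator can be checked numerically in a setting where $\mu_k(P)$ is available in closed form. The sketch below assumes a Gaussian $P=N(0,1)$ and a Gaussian kernel (arbitrary choices), for which $\langle k(\cdot,x),\mu_k(P)\rangle$ and $\|\mu_k(P)\|^2$ have closed forms, so $\|\hat{\mu}_n-\mu_k(P)\|_{\mathcal{H}_k}^2$ is exactly computable from the sample.

```python
import numpy as np

gamma = 0.5
rng = np.random.default_rng(0)

def sq_error(n):
    # ||mu_hat_n - mu_k(P)||_H^2 for P = N(0,1) and k(x,y) = exp(-gamma (x-y)^2),
    # using the closed-form embedding of a Gaussian under a Gaussian kernel
    x = rng.normal(size=n)
    K = np.exp(-gamma * (x[:, None] - x[None, :]) ** 2)
    cross = np.exp(-gamma * x**2 / (1 + 2 * gamma)) / np.sqrt(1 + 2 * gamma)
    return K.mean() - 2 * cross.mean() + 1 / np.sqrt(1 + 4 * gamma)

# Averaged over replicates; since the RKHS error decays like O_P(n^{-1/2}),
# the squared error should shrink roughly 16-fold from n=50 to n=800
errs = {n: np.mean([sq_error(n) for _ in range(20)]) for n in (50, 800)}
```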
Submitted 31 July, 2017; v1 submitted 13 February, 2016; originally announced February 2016.

Towards a Learning Theory of Cause-Effect Inference
Authors:
David Lopez-Paz,
Krikamol Muandet,
Bernhard Schölkopf,
Ilya Tolstikhin
Abstract:
We pose causal inference as the problem of learning to classify probability distributions. In particular, we assume access to a collection $\{(S_i,l_i)\}_{i=1}^n$, where each $S_i$ is a sample drawn from the probability distribution of $X_i \times Y_i$, and $l_i$ is a binary label indicating whether "$X_i \to Y_i$" or "$X_i \leftarrow Y_i$". Given these data, we build a causal inference rule in two steps. First, we featurize each $S_i$ using the kernel mean embedding associated with some characteristic kernel. Second, we train a binary classifier on such embeddings to distinguish between causal directions. We present generalization bounds showing the statistical consistency and learning rates of the proposed approach, and provide a simple implementation that achieves state-of-the-art cause-effect inference. Furthermore, we extend our ideas to infer causal relationships between more than two variables.
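A hypothetical end-to-end sketch of the two-step rule: random Fourier features stand in for the exact kernel mean embedding, and plain logistic regression stands in for the paper's classifier; the mechanism, sample sizes, and hyperparameters are all toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 50  # number of random Fourier features approximating a Gaussian kernel

W = rng.normal(size=(2, D))
b = rng.uniform(0, 2 * np.pi, size=D)

def featurize(S):
    # Empirical kernel mean embedding of a sample S (m x 2) via random features
    return np.sqrt(2.0 / D) * np.cos(S @ W + b).mean(axis=0)

def sample_pair(causal):
    # Toy mechanism: cause x, effect y = x^2 + noise; the label encodes direction
    x = rng.normal(size=100)
    y = x**2 + 0.1 * rng.normal(size=100)
    return featurize(np.column_stack([x, y] if causal else [y, x]))

feats = np.array([sample_pair(i % 2 == 0) for i in range(100)])
labels = (np.arange(100) % 2 == 0).astype(float)

# Standardize embeddings, then fit plain logistic regression by gradient descent
feats = (feats - feats.mean(0)) / (feats.std(0) + 1e-9)
w = np.zeros(D)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(feats @ w)))
    w -= 0.5 * feats.T @ (p - labels) / len(labels)

acc = ((feats @ w > 0) == (labels == 1)).mean()  # training accuracy
```

The asymmetry of the joint distribution under the two orderings is exactly what the classifier on mean embeddings exploits.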
Submitted 18 May, 2015; v1 submitted 9 February, 2015; originally announced February 2015.

Computing Functions of Random Variables via Reproducing Kernel Hilbert Space Representations
Authors:
Bernhard Schölkopf,
Krikamol Muandet,
Kenji Fukumizu,
Jonas Peters
Abstract:
We describe a method to perform functional operations on probability distributions of random variables. The method uses reproducing kernel Hilbert space representations of probability distributions, and it is applicable to all operations which can be applied to points drawn from the respective distributions. We refer to our approach as {\em kernel probabilistic programming}. We illustrate it on synthetic data, and show how it can be used for nonparametric structural equation models, with an application to causal inference.
Submitted 27 January, 2015; originally announced January 2015.

Kernel Mean Estimation via Spectral Filtering
Authors:
Krikamol Muandet,
Bharath Sriperumbudur,
Bernhard Schölkopf
Abstract:
The problem of estimating the kernel mean in a reproducing kernel Hilbert space (RKHS) is central to kernel methods in that it is used by classical approaches (e.g., when centering a kernel PCA matrix), and it also forms the core inference step of modern kernel methods (e.g., kernel-based nonparametric tests) that rely on embedding probability distributions in RKHSs. Muandet et al. (2014) have shown that shrinkage can help in constructing "better" estimators of the kernel mean than the empirical estimator. The present paper studies the consistency and admissibility of the estimators in Muandet et al. (2014), and proposes a wider class of shrinkage estimators that improve upon the empirical estimator by considering appropriate basis functions. Using the kernel PCA basis, we show that some of these estimators can be constructed using spectral filtering algorithms, which are shown to be consistent under some technical assumptions. Our theoretical analysis also reveals a fundamental connection to the kernel-based supervised learning framework. The proposed estimators are simple to implement and perform well in practice.
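A minimal sketch of one member of this class, a Tikhonov (ridge-type) spectral filter applied to the empirical kernel mean, under arbitrary toy choices of kernel and regularization. The filtered coefficients shrink the empirical embedding along the kernel PCA directions, so the resulting estimator has a strictly smaller RKHS norm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 50, 0.1
X = rng.normal(size=(n, 5))

# Gaussian kernel Gram matrix
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * d2)

ones = np.full(n, 1.0 / n)                 # coefficients of the empirical mean
# Tikhonov-filtered ("shrunk") coefficients: beta = (K + n*lam*I)^{-1} K (1/n);
# spectrally, each kernel PCA coefficient is scaled by g_i / (g_i + n*lam)
beta = np.linalg.solve(K + n * lam * np.eye(n), K @ ones)

emp_norm2 = ones @ K @ ones                # ||empirical mean embedding||_H^2
shr_norm2 = beta @ K @ beta                # ||filtered estimator||_H^2 (smaller)
```

Other filter functions (truncation, Landweber iteration, etc.) plug into the same template by replacing how the Gram eigenvalues are damped.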
Submitted 4 November, 2014; originally announced November 2014.

The Randomized Causation Coefficient
Authors:
David Lopez-Paz,
Krikamol Muandet,
Benjamin Recht
Abstract:
We are interested in learning causal relationships between pairs of random variables, purely from observational data. To effectively address this task, the state-of-the-art relies on strong assumptions about the mechanisms mapping causes to effects, such as invertibility or the existence of additive noise, which only hold in limited situations. In contrast, this short paper proposes to learn how to perform causal inference directly from data, without the need for feature engineering. In particular, we pose causality as a kernel mean embedding classification problem, where inputs are samples from arbitrary probability distributions on pairs of random variables, and labels are types of causal relationships. We validate the performance of our method on synthetic and real-world data against the state-of-the-art. Moreover, we submitted our algorithm to ChaLearn's "Fast Causation Coefficient Challenge" competition, where we won the fastest-code prize and ranked third on the overall leaderboard.
Submitted 15 September, 2014; originally announced September 2014.

One-Class Support Measure Machines for Group Anomaly Detection
Authors:
Krikamol Muandet,
Bernhard Schoelkopf
Abstract:
We propose one-class support measure machines (OCSMMs) for group anomaly detection, which aims at recognizing anomalous aggregate behaviors of data points. The OCSMMs generalize the well-known one-class support vector machines (OCSVMs) to a space of probability measures. By formulating the problem as quantile estimation on distributions, we can establish an interesting connection to the OCSVMs and variable kernel density estimators (VKDEs) over the input space on which the distributions are defined, bridging the gap between large-margin methods and kernel density estimators. In particular, we show that various types of VKDEs can be considered solutions to a class of regularization problems studied in this paper. Experiments on the Sloan Digital Sky Survey dataset and a High Energy Particle Physics dataset demonstrate the benefits of the proposed framework in real-world applications.
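To illustrate the underlying representation (not the OCSMM optimization itself): the kernel between two distributions reduces to the inner product of their empirical mean embeddings, i.e., an average of pairwise kernel evaluations, and even a simple distance-to-mean score on these embeddings flags a group whose aggregate behavior deviates. All data and parameters below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# 9 normal groups plus one group whose *aggregate* spread is anomalous
groups = [rng.normal(0, 1, size=(80, 2)) for _ in range(9)]
groups.append(rng.normal(0, 3, size=(80, 2)))

def k_mean(A, B, gamma=0.5):
    # Inner product of the two empirical mean embeddings:
    # <mu_A, mu_B>_H = mean over a in A, b in B of k(a, b)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2).mean()

m = len(groups)
G = np.array([[k_mean(groups[i], groups[j]) for j in range(m)] for i in range(m)])

# Squared RKHS distance of each group's embedding to the average embedding
scores = np.diag(G) - 2 * G.mean(axis=1) + G.mean()   # largest for the odd group
```

The group-level Gram matrix `G` is exactly the object a kernel machine on distributions (such as an OCSMM) would consume.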
Submitted 9 August, 2014; originally announced August 2014.

Kernel Mean Shrinkage Estimators
Authors:
Krikamol Muandet,
Bharath Sriperumbudur,
Kenji Fukumizu,
Arthur Gretton,
Bernhard Schölkopf
Abstract:
A mean function in a reproducing kernel Hilbert space (RKHS), or a kernel mean, is central to kernel methods in that it is used by many classical algorithms such as kernel principal component analysis, and it also forms the core inference step of modern kernel methods that rely on embedding probability distributions in RKHSs. Given a finite sample, an empirical average has been used commonly as a standard estimator of the true kernel mean. Despite a widespread use of this estimator, we show that it can be improved thanks to the well-known Stein phenomenon. We propose a new family of estimators called kernel mean shrinkage estimators (KMSEs), which benefit from both theoretical justifications and good empirical performance. The results demonstrate that the proposed estimators outperform the standard one, especially in a "large d, small n" paradigm.
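A minimal numerical sketch of the Stein-type effect in a "large d, small n" regime. It assumes a Gaussian distribution and a Gaussian kernel so that the true embedding norm is available in closed form, and uses simple scaling toward zero as the shrinkage, a special case rather than the paper's full family.

```python
import numpy as np

gamma, d, n, alpha = 0.5, 10, 20, 0.5
rng = np.random.default_rng(0)
norm_mu2 = (1 + 4 * gamma) ** (-d / 2)   # ||mu_k(P)||_H^2 for P = N(0, I_d)

def risks():
    # Squared RKHS errors of the empirical and the shrunk estimator, both
    # computable in closed form for a Gaussian P under a Gaussian product kernel
    X = rng.normal(size=(n, d))
    K = np.exp(-gamma * ((X[:, None] - X[None, :]) ** 2).sum(-1))
    h = np.prod(np.exp(-gamma * X**2 / (1 + 2 * gamma)) / np.sqrt(1 + 2 * gamma),
                axis=1)                                # h_i = <k(., x_i), mu>
    a, b = K.mean(), h.mean()
    emp = a - 2 * b + norm_mu2                         # ||mu_hat - mu||^2
    s = 1 - alpha                                      # shrink toward zero
    shr = s**2 * a - 2 * s * b + norm_mu2              # ||s * mu_hat - mu||^2
    return emp, shr

vals = np.array([risks() for _ in range(50)]).mean(axis=0)
# In this regime the shrunk estimator has visibly lower average risk
```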
Submitted 25 February, 2016; v1 submitted 21 May, 2014; originally announced May 2014.

Kernel Mean Estimation and Stein's Effect
Authors:
Krikamol Muandet,
Kenji Fukumizu,
Bharath Sriperumbudur,
Arthur Gretton,
Bernhard Schölkopf
Abstract:
A mean function in a reproducing kernel Hilbert space, or a kernel mean, is an important part of many applications ranging from kernel principal component analysis to Hilbert-space embedding of distributions. Given finite samples, an empirical average is the standard estimate for the true kernel mean. We show that this estimator can be improved via a well-known statistical result called Stein's phenomenon. Our theoretical analysis reveals the existence of a wide class of estimators that are better than the standard one. Focusing on a subset of this class, we propose efficient shrinkage estimators for the kernel mean. Empirical evaluations on several benchmark applications clearly demonstrate that the proposed estimators outperform the standard kernel mean estimator.
Submitted 6 June, 2013; v1 submitted 4 June, 2013; originally announced June 2013.

One-Class Support Measure Machines for Group Anomaly Detection
Authors:
Krikamol Muandet,
Bernhard Schölkopf
Abstract:
We propose one-class support measure machines (OCSMMs) for group anomaly detection, which aims at recognizing anomalous aggregate behaviors of data points. The OCSMMs generalize the well-known one-class support vector machines (OCSVMs) to a space of probability measures. By formulating the problem as quantile estimation on distributions, we can establish an interesting connection to the OCSVMs and variable kernel density estimators (VKDEs) over the input space on which the distributions are defined, bridging the gap between large-margin methods and kernel density estimators. In particular, we show that various types of VKDEs can be considered solutions to a class of regularization problems studied in this paper. Experiments on the Sloan Digital Sky Survey dataset and a High Energy Particle Physics dataset demonstrate the benefits of the proposed framework in real-world applications.
Submitted 1 June, 2013; v1 submitted 1 March, 2013; originally announced March 2013.

Domain Generalization via Invariant Feature Representation
Authors:
Krikamol Muandet,
David Balduzzi,
Bernhard Schölkopf
Abstract:
This paper investigates domain generalization: how can knowledge acquired from an arbitrary number of related domains be applied to previously unseen domains? We propose Domain-Invariant Component Analysis (DICA), a kernel-based optimization algorithm that learns an invariant transformation by minimizing the dissimilarity across domains, whilst preserving the functional relationship between input and output variables. A learning-theoretic analysis shows that reducing dissimilarity improves the expected generalization ability of classifiers on new domains, motivating the proposed algorithm. Experimental results on synthetic and real-world datasets demonstrate that DICA successfully learns invariant features and improves classifier performance in practice.
Submitted 10 January, 2013; originally announced January 2013.

Hilbert Space Embedding for Dirichlet Process Mixtures
Authors:
Krikamol Muandet
Abstract:
This paper proposes a Hilbert space embedding for Dirichlet process mixture models via Sethuraman's stick-breaking construction. Although Bayesian nonparametrics offers a powerful approach to constructing a prior that avoids the need to specify the model size/complexity explicitly, exact inference is often intractable. On the other hand, frequentist approaches such as kernel machines, which suffer from model selection/comparison problems, often benefit from efficient learning algorithms. This paper discusses the possibility of combining the best of both worlds, using the Dirichlet process mixture model as a case study.
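Sethuraman's stick-breaking construction mentioned above can be sampled directly; a truncated sketch (the truncation level and base measure are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, K = 2.0, 100          # concentration parameter, truncation level

# Stick-breaking: v_k ~ Beta(1, alpha), pi_k = v_k * prod_{j<k} (1 - v_j)
v = rng.beta(1.0, alpha, size=K)
pi = v * np.cumprod(np.concatenate([[1.0], 1.0 - v[:-1]]))

# Atoms drawn from the base measure (here N(0,1)); the pairs (pi_k, atoms_k)
# form a truncated draw from DP(alpha, N(0,1))
atoms = rng.normal(size=K)
```

Embedding such a random discrete measure then amounts to forming the weighted kernel mean $\sum_k \pi_k\, k(\cdot,\text{atom}_k)$.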
Submitted 16 October, 2012; originally announced October 2012.

Learning from Distributions via Support Measure Machines
Authors:
Krikamol Muandet,
Kenji Fukumizu,
Francesco Dinuzzo,
Bernhard Schölkopf
Abstract:
This paper presents a kernel-based discriminative learning framework on probability measures. Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in a straightforward fashion. To accomplish this, we construct a generalization of the support vector machine (SVM) called a support measure machine (SMM). Our analysis of SMMs provides several insights into their relationship to traditional SVMs. Based on these insights, we propose a flexible SVM (Flex-SVM) that places a different kernel function on each training example. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our proposed framework.
Submitted 12 January, 2013; v1 submitted 29 February, 2012; originally announced February 2012.