On Learning Domain-Invariant Representations for Transfer Learning with Multiple Sources
Introduction The central question of transfer learning from one domain to another domain is how coul(...)
23/11/2021 Machine Learning
Huy Nguyen (Research Resident)
Optimal Transport (OT) has recently become a useful tool in many machine learning and deep learning problems, including Wasserstein GAN [Arjovsky et al., 2017], multilevel clustering [Ho et al., 2017], domain adaptation [Courty et al., 2017], etc. Its goal is to find a minimal cost of moving masses between supports of two probability distributions. However, when there are outliers in the input datasets, two marginal constraints still force the transport plan to move masses to the outliers, which makes the transport cost unexpectedly high.
To deal with this issue, we consider a robust variant of the standard OT, named Robust Optimal Transport (ROT). Additionally, we also investigate in this work the robust barycenter problem where we seek a probability distribution that minimizes its robust optimal cost to a given set of two or more probability distributions.
To address the sensitivity to outliers of optimal transport (OT), [Esteban et al., 2008] proposed a trimmed version of OT. In particular, they search for truncated probability distributions such that the transport cost between them is minimized. However, their trimmed optimal transport is non-trivial to compute, which hinders its usage in practical applications. Another line of works utilized robust optimal transport whose main idea is to relax both marginal constraints by some particular divergences. [Balaji et al., 2020] proposed using Chi-squared divergence, but this leads to plenty of challenges during the optimization. Meanwhile, [Mukherjee et al., 2021] penalized the two marginal constraints by Total Variation, which makes it exceptionally difficult to compute the complexity of their proposed algorithm.
In this work, we still leverage the approach of robust optimal transport (ROT) into making optimal transport robust to outliers. Nevertheless, we decide to relax two marginal constraints by another penalty function, namely Kullback-Leibler (KL) divergence. By doing so, we are able to utilize the Sinkhorn scheme which not only facilitates our optimization problems but also has theoretical grounds for computational complexity.
We will approximate ROT by adding an entropic regularization term to the objective function and refer this problem to as entropic ROT. Interestingly, there is a connection between the optimal solutions of entropic ROT and unbalanced optimal transport (UOT) in entropic formulation [Pham et al., 2020]. More specifically, normalizing the minimizer of entropic UOT will lead to that of entropic ROT. Based on that result, we derive an algorithm called Robust-Sinkhorn which returns an epsilon-approximation (see Definition 1 in our paper) of the optimizer of entropic ROT in time O(n^{2}/epsilon), which is better than that of standard OT, at O(n^{2}/epsilon^{2}) where n is the number of supports of the probability distributions while epsilon denotes the tolerance.
In this section, we consider the barycenter problem corresponding to robust optimal transport. Given m possibly corrupted probability distributions P_{1}, …, P_{m}, we wish to seek the barycenter P of uncontaminated probability distributions which are approximations of the aforementioned m distributions. As the desired barycenter P is a clean probability distribution, it does not make sense to use robust optimal transport mentioned in Section 3, which is essential when both given distributions are noisy, to compute the transport cost from P to P_{i} in this case. Instead, we propose another robust variant of optimal transport called Robust Semi-constrained Optimal Transport (RSOT) where we relax only one marginal constraint and still keep the other one. Therefore, we refer this barycenter problem to as Robust Semi-constrained Barycenter Problem (RSBP) in which we find a weighted Fréchet mean of m given probability distributions over RSOT divergence.
To tackle this problem, we still utilize the entropic regularization approach for the sake of computing. However, when setting the gradient of the dual form of this problem to zero, we obtain an intractable system of equations. Inspired by the connection between ROT and UOT in Section 3, we also find out a similar relationship between entropic RSBP and its version without norm constraint. Thus, we derive an algorithm called RobustIBP based on the Iterative Bregman Projection (IBP) algorithm to solve the entropic RSBP without norm constraint first, and then add a normalizing step at the end. Consequently,
In this section, we firstly compare the marginals induced by using different variants of optimal transport in the presence of corrupted distributions. Let a and b be two 1-D Gaussian distributions containing outliers (10%) from other Gaussians. It is obvious from Figure 2 that all the variants approximate the corrupted distributions well with a proper choice of corresponding hyperparameters. In addition, while those with Chi-squared divergence and Total Variation have various behaviors when their hyperparameters go to infinity, the behavior of ROT with KL divergence remains stable.
Figure 2: Comparison among robust optimal transport (ours, KL divergence), partial optimal transport [Figalli et al., 2010] and robust formulations in [Mukherjee et al., 2021] (using Total Variation) and [Balaji et al., 2020] (using Chi-squared divergence) in that order from the first row to the fourth row, with different hyperparameter settings.
Next, we demonstrate the robustness of our Robust Optimal Transport in Color Transfer problem. Here, the optimal transport problem is conducted between the RGB histograms of source image and target image which are corrupted by replacing pixels at random positions by green (or red) pixels (see Figure 3).
Figure 3: The first row, from left to right, consists of source image, target image and the three last ones that are source images with each pixel replaced by its mapped value via standard OT, Relaxed-OT [Rabin et al., 2014] and Robust OT (ours), respectively. The second row comprises corresponding (RGB) histograms of images on the first rows.
It can be seen from Figure 3 that the transferred color histogram induced by the OT solution still contains noisy values. Meanwhile, although there are almost no noisy values in the histogram induced by Relaxed-OT, the structure of its image is significantly changed. By contrast, the transferred histogram resulted from ROT is not only clean as expected but it also preserves the original structure pretty well.
Huy Nguyen (Research Resident)