On the Information Bottleneck Problems: An Information Theoretic Perspective

semanticscholar(2020)

Abstract
This paper focuses on variants of the bottleneck problem taking an information theoretic perspective. The intimate connections of this setting to remote source-coding, information combining, common reconstruction, the Wyner-Ahlswede-Körner problem, the efficiency of investment information, CEO source coding under the logarithmic-loss distortion measure, and others are highlighted. We discuss the distributed information bottleneck problem with emphasis on the Gaussian model. For this model, the optimal tradeoffs between relevance (i.e., information) and complexity (i.e., rates) in the discrete and vector Gaussian frameworks are determined.

I. STATISTICAL INFERENCE

Let a measurable variable $X \in \mathcal{X}$ and a target variable $Y \in \mathcal{Y}$ with unknown joint distribution $P_{X,Y}$ be given. In the classic problem of statistical learning, one wishes to infer an accurate predictor of the target variable $Y \in \mathcal{Y}$ based on observed realizations of $X \in \mathcal{X}$. That is, for a given class $\mathcal{F}$ of admissible predictors $\psi : \mathcal{X} \to \hat{\mathcal{Y}}$ and a loss function $\ell : \mathcal{Y} \times \hat{\mathcal{Y}} \to \mathbb{R}$ that measures discrepancies between true values and their estimated fits, one aims at finding the mapping $\psi \in \mathcal{F}$ that minimizes the expected (population) risk

$$C_{P_{X,Y}}(\psi, \ell) = \mathbb{E}_{P_{X,Y}}\big[\ell(Y, \psi(X))\big]. \tag{1}$$

An abstract inference model is shown in Figure 1.

Fig. 1. An abstract inference model for learning.

The choice of a "good" loss function $\ell(\cdot)$ is often controversial in statistical learning theory. There is, however, numerical evidence that models trained to minimize the error's entropy often outperform ones trained using other criteria, such as mean-square error (MSE) and higher-order statistics [1], [2]. This corresponds to choosing the loss function given by the logarithmic loss, which is defined as

$$\ell_{\log}(y, \hat{y}) := \log \frac{1}{\hat{y}(y)} \tag{2}$$

for $y \in \mathcal{Y}$, where $\hat{y} \in \mathcal{P}(\mathcal{Y})$ designates a probability distribution on $\mathcal{Y}$ and $\hat{y}(y)$ is the value of that distribution evaluated at the outcome $y \in \mathcal{Y}$. Although a complete and rigorous justification of the usage of the logarithmic loss as a distortion measure in learning is still awaited, a partial explanation appeared recently in [3], where Painsky and Wornell show that, for binary classification problems, minimizing the logarithmic loss actually minimizes an upper bound to any choice of loss function that is smooth, proper (i.e., unbiased and Fisher consistent) and convex. Along the same line of work, the authors of [4] show that, under some natural data processing property, Shannon's mutual information uniquely quantifies the reduction of prediction risk due to side information. Perhaps this partially justifies why the logarithmic-loss fidelity measure is widely used in learning theory and has already been adopted in many practical algorithms, such as the infomax criterion [5]. The logarithmic loss measure also plays a central role in the theory of prediction [6, Ch. 9], where it is often referred to as the self-information loss function, as well as in Bayesian modeling [7], where priors are usually designed so as to maximize the mutual information between the parameter to be estimated and the observations.

Let, for every $x \in \mathcal{X}$, $\psi(x) = Q(\cdot \mid x) \in \mathcal{P}(\mathcal{Y})$. It is easy to see that

$$\mathbb{E}_{P_{X,Y}}\big[\ell_{\log}(Y, Q)\big] = \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} P_{X,Y}(x, y) \log \frac{1}{Q(y \mid x)} \tag{3a}$$
$$= H(Y \mid X) + D\big(P_{Y \mid X} \,\|\, Q\big) \tag{3b}$$
$$\geq H(Y \mid X) \tag{3c}$$

with equality iff $\psi(X) = P_{Y \mid X}$. That is,

$$\min_{\psi} C_{P_{X,Y}}(\psi, \ell_{\log}) = H(Y \mid X). \tag{4}$$
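To make the identity (3a)–(4) concrete, the following is a minimal numerical sketch (not from the paper): it fixes a small, made-up joint distribution $P_{X,Y}$ and checks that the expected logarithmic loss of the posterior predictor $\psi(x) = P_{Y\mid X}(\cdot\mid x)$ equals $H(Y\mid X)$, while any mismatched predictor $Q$ incurs the additional divergence term in (3b). The array values and helper names (`expected_log_loss`, `conditional_entropy`) are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

# Toy joint distribution P_{X,Y} over X = {0,1,2}, Y = {0,1} (hypothetical numbers).
P_xy = np.array([[0.30, 0.10],
                 [0.05, 0.25],
                 [0.20, 0.10]])
P_x = P_xy.sum(axis=1)                 # marginal P_X
P_y_given_x = P_xy / P_x[:, None]      # true posterior P_{Y|X}

def expected_log_loss(Q_y_given_x):
    """Population risk E_{P_{X,Y}}[log 1/Q(y|x)] under logarithmic loss, eq. (3a)."""
    return -np.sum(P_xy * np.log(Q_y_given_x))

def conditional_entropy(P_joint, P_cond):
    """H(Y|X) = -sum_{x,y} P(x,y) log P(y|x), in nats."""
    return -np.sum(P_joint * np.log(P_cond))

H_Y_given_X = conditional_entropy(P_xy, P_y_given_x)

# Risk of the optimal predictor psi(x) = P_{Y|X}(.|x) equals H(Y|X), eq. (4).
print(expected_log_loss(P_y_given_x), H_Y_given_X)     # identical values

# A mismatched predictor Q pays the extra KL term in (3b), hence a larger risk.
Q_mismatched = np.full((3, 2), 0.5)
print(expected_log_loss(Q_mismatched) >= H_Y_given_X)  # True
```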
If the joint distribution $P_{X,Y}$ is unknown, which is most often the case in practice, the population risk given by (1) cannot be computed directly; in the standard approach, one usually resorts to choosing the predictor with minimal risk on a training dataset consisting of $n$ labeled samples $\{(x_i, y_i)\}_{i=1}^{n}$ drawn independently from the unknown joint distribution $P_{X,Y}$. In this case, it is important to restrict the set $\mathcal{F}$ of admissible predictors to a low-complexity class to prevent overfitting. One way to reduce the model's complexity is to restrict the range of the prediction function, as shown in Figure 2. Here, the stochastic mapping $\phi : \mathcal{X} \to \mathcal{U}$ is a compressor with

$$\|\phi\| \leq R \tag{5}$$

for some prescribed "input-complexity" value $R$. Let $U = \phi(X)$. The expected logarithmic loss is now given by

$$C_{P_{X,Y}}(\phi, \psi; \ell_{\log}) = \mathbb{E}_{P_{X,Y}}\big[\ell_{\log}(Y, \psi(U))\big] \tag{6}$$

and takes its minimum value with the choice $\psi(U) = P_{Y \mid U}$:

$$\min_{\psi} C_{P_{X,Y}}(\phi, \psi; \ell_{\log}) = H(Y \mid U). \tag{7}$$

International Zurich Seminar on Information and Communication (IZS), February 26 – 28, 2020
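The following continues the toy example above as a hedged sketch of the compression step (again, not from the paper): a hypothetical deterministic compressor $\phi$ maps the three values of $X$ onto two values of $U = \phi(X)$, after which the minimal achievable log-loss rises from $H(Y\mid X)$ in (4) to $H(Y\mid U)$ in (7). The mapping and distribution values are made up for illustration only.

```python
import numpy as np

# Same made-up joint P_{X,Y} as in the previous sketch; X = {0,1,2}, Y = {0,1}.
P_xy = np.array([[0.30, 0.10],
                 [0.05, 0.25],
                 [0.20, 0.10]])

# A deterministic compressor phi: X -> U with |U| = 2 < |X| = 3, i.e. a coarse
# quantization of X playing the role of the range-restricted description U = phi(X).
phi = {0: 0, 1: 1, 2: 0}

# Induced joint distribution P_{U,Y}.
P_uy = np.zeros((2, 2))
for x in range(3):
    P_uy[phi[x]] += P_xy[x]

def conditional_entropy(P_joint):
    """H(second | first) for a joint distribution given as a 2-D array, in nats."""
    P_first = P_joint.sum(axis=1, keepdims=True)
    return -np.sum(P_joint * np.log(P_joint / P_first))

# Compressing X into U can only lose predictive information about Y:
# H(Y|U) >= H(Y|X), so the minimal log-loss (7) is never below that in (4).
print(conditional_entropy(P_xy), conditional_entropy(P_uy))
```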