Differentiable Unconstrained Minimization

$$\min_{x} \; f(x) \quad \text{subject to} \quad x \in \mathbb{R}^{n}$$

where $f$ is differentiable.

Convergence Rate

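How to compute it:

Suppose the objective function can be converted into a sequence $r_{k}$ (for example, the error at iteration $k$), and compute the limit

$$q = \lim_{k \to \infty} \frac{r_{k+1}}{r_{k}}$$

The convergence rate is $q$.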

Convergence Type

$q = 0$: superlinear convergence
$0 < q < 1$: linear convergence
$q = 1$: sublinear convergence
$q > 1$: no convergence
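
The classification above can be tried numerically. The following is a minimal sketch, assuming the sequence is positive and long enough that the last ratio approximates the limit; the helper names `estimate_rate` and `classify` and the tolerance `tol` are illustrative choices, not part of the notes.

```python
import numpy as np

def estimate_rate(r):
    """Approximate q = lim r_{k+1} / r_k by the last available ratio."""
    r = np.asarray(r, dtype=float)
    ratios = r[1:] / r[:-1]
    return ratios[-1]

def classify(q, tol=1e-3):
    """Map an estimated ratio q to a convergence type (tol is an assumed threshold)."""
    if q > 1 + tol:
        return "non-convergence"
    if q > 1 - tol:
        return "sublinear"
    if q > tol:
        return "linear"
    return "superlinear"

# Three hypothetical error sequences.
sublinear_seq = 1.0 / np.arange(1, 5001)                  # r_k = 1/k, ratio tends to 1
linear_seq = 0.5 ** np.arange(1, 40)                      # ratio is 0.5
superlinear_seq = np.array([0.5 ** (2 ** j) for j in range(1, 8)])  # r_{k+1} = r_k^2

for name, seq in [("1/k", sublinear_seq), ("0.5^k", linear_seq), ("0.5^(2^k)", superlinear_seq)]:
    print(name, classify(estimate_rate(seq)))
```

A finite sequence can only suggest the limit, so a ratio close to 1 is reported as sublinear only up to the chosen tolerance.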

Convergence Rate of Quadratic Minimization

Let $\lambda_{1}(Q)$ be the largest eigenvalue of $Q$ and $\lambda_{n}(Q)$ the smallest. The condition number is then:
$r = \frac{\lambda_{1}(Q)}{\lambda_{n}(Q)} = \frac{\max_{x} \lambda_{1}(\nabla^{2} f(x))}{\min_{x} \lambda_{n}(\nabla^{2} f(x))}$
With the constant step size $\eta_{t} \equiv \eta = \frac{2}{\lambda_{1}(Q) + \lambda_{n}(Q)}$, we have
$\|x^{t} - x^{*}\|_{2} \leq \left(\frac{\lambda_{1}(Q) - \lambda_{n}(Q)}{\lambda_{1}(Q) + \lambda_{n}(Q)}\right)^{t} \|x^{0} - x^{*}\|_{2} =: \varepsilon$
Convergence Analysis: accuracy can be measured by any of the following criteria.
1. $\|x^{t} - x^{*}\|_{2} \leq \varepsilon$
2. $|f(x^{t}) - f(x^{*})| \leq \varepsilon$
3. $\|\nabla f(x^{t})\|_{2} \leq \varepsilon$
Convergence Rate (number of iterations $T$ needed to reach accuracy $\varepsilon$):

sublinear: $T > \frac{1}{\varepsilon^{k}}$, i.e. $O(\frac{1}{\varepsilon^{k}})$

linear: $T > \log \frac{1}{\varepsilon}$, i.e. $O(\log \frac{1}{\varepsilon})$

quadratic (superlinear): $T > \log(\log \frac{1}{\varepsilon})$, i.e. $O(\log(\log \frac{1}{\varepsilon}))$

Iteration Function:

1. sublinear: $\|x^{t} - x^{*}\|_{2} \leq \frac{1}{t^{1/k}} \|x^{0} - x^{*}\|_{2}$
2. linear: $\|x^{t} - x^{*}\|_{2} \leq q \|x^{t-1} - x^{*}\|_{2} \Rightarrow \|x^{t} - x^{*}\|_{2} \leq q^{t} \|x^{0} - x^{*}\|_{2}$
3. quadratic: $\|x^{t} - x^{*}\|_{2} \leq q \|x^{t-1} - x^{*}\|_{2}^{2}$

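Iterative descent algorithm

Starting from $x^{0}$, construct a sequence $\{x^{t}\}$ such that $f(x^{t+1}) < f(x^{t})$ for $t = 0, 1, \dots$

A descent direction $d$ at $x$ satisfies:

$$f'(x; d) := \lim_{\tau \downarrow 0} \frac{f(x + \tau d) - f(x)}{\tau} = \nabla f(x)^{\top} d < 0$$

Each iteration updates $x^{t+1} = x^{t} + \eta d^{t}$, where $d^{t}$ is a descent direction at $x^{t}$ and $\eta > 0$ is the step size. Gradient descent uses $d^{t} = -\nabla f(x^{t})$, i.e. $x^{t+1} = x^{t} - \eta \nabla f(x^{t})$.

In machine learning, $f$ is usually the loss function, $x$ the parameters of the loss function, and $\eta$ the learning rate.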

Note

Steepest Descent

The direction that decreases the objective function fastest:

$$\arg\min_{d : \|d\|_{2} \leq 1} f'(x; d) = \arg\min_{d : \|d\|_{2} \leq 1} \nabla f(x)^{\top} d = -\frac{\nabla f(x)}{\|\nabla f(x)\|_{2}}$$

The minimum value is $-\|\nabla f(x)\|_{2}$, which follows from the Cauchy-Schwarz inequality:
$-\|\nabla f(x)\|_{2} \cdot \|d\|_{2} \leq \langle \nabla f(x), d \rangle \leq \|\nabla f(x)\|_{2} \cdot \|d\|_{2}$

Quadratic Minimization

$$\min_{x} \; f(x) := \frac{1}{2} (x - x^{*})^{\top} Q (x - x^{*}), \qquad Q \succ 0$$

The gradient of this function is $\nabla f(x) = Q(x - x^{*})$.

Parameter update:

$$x^{t+1} = x^{t} - \eta_{t} \nabla f(x^{t}) = (I - \eta_{t} Q) x^{t} + \eta_{t} Q x^{*}$$

Step size ($\eta$) rule:

$$\begin{aligned}
& x^{t+1} - x^{*} = (I - \eta_{t} Q)(x^{t} - x^{*}) \\
\Rightarrow\; & \|x^{t+1} - x^{*}\|_{2} \leq \|I - \eta_{t} Q\| \cdot \|x^{t} - x^{*}\|_{2} \\
\Rightarrow\; & \|I - \eta Q\| = \frac{\lambda_{1}(Q) - \lambda_{n}(Q)}{\lambda_{1}(Q) + \lambda_{n}(Q)} \quad \text{for} \quad \eta \equiv \eta_{t} = \frac{2}{\lambda_{1}(Q) + \lambda_{n}(Q)}
\end{aligned}$$
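
The contraction argument above can be checked numerically. This is a minimal sketch on a randomly generated positive definite $Q$ (the matrix, the seed and the iteration count are assumptions made for illustration); it runs gradient descent with $\eta = \frac{2}{\lambda_{1}(Q) + \lambda_{n}(Q)}$ and records the error next to the bound $\rho^{t} \|x^{0} - x^{*}\|_{2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)          # symmetric positive definite (assumed test matrix)
x_star = rng.standard_normal(n)

eigvals = np.linalg.eigvalsh(Q)      # eigenvalues in ascending order
lam_n, lam_1 = eigvals[0], eigvals[-1]
eta = 2.0 / (lam_1 + lam_n)          # constant step size from the rule above
rho = (lam_1 - lam_n) / (lam_1 + lam_n)

x = rng.standard_normal(n)
err0 = np.linalg.norm(x - x_star)
errors = [err0]
for t in range(50):
    grad = Q @ (x - x_star)          # gradient of the quadratic
    x = x - eta * grad
    errors.append(np.linalg.norm(x - x_star))

bounds = [rho ** t * err0 for t in range(len(errors))]
print("spectral norm of I - eta*Q:", np.linalg.norm(np.eye(n) - eta * Q, 2))
print("contraction factor rho:    ", rho)
```

The two printed numbers should agree up to rounding, since the eigenvalues of $I - \eta Q$ range from $-\rho$ to $\rho$ for this step size.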

Exact Line Search

$\eta_{t} = \arg\min_{\eta \geq 0} f(x^{t} - \eta \nabla f(x^{t}))$

Let $g^{t} = \nabla f(x^{t}) = Q(x^{t} - x^{*})$; then $\eta_{t} = \frac{(g^{t})^{\top} g^{t}}{(g^{t})^{\top} Q g^{t}}$

Kantorovich's inequality:

$$\frac{\|y\|_{2}^{4}}{(y^{\top} Q y)(y^{\top} Q^{-1} y)} \geq \frac{4 \lambda_{1}(Q) \lambda_{n}(Q)}{(\lambda_{1}(Q) + \lambda_{n}(Q))^{2}}$$
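
Both formulas can be sanity-checked numerically. The sketch below assumes a random positive definite $Q$; the helper name `exact_step` is illustrative. It compares the closed-form step size with a brute-force scan over $\eta$, and evaluates the left-hand side of Kantorovich's inequality on random vectors.

```python
import numpy as np

def exact_step(Q, g):
    """Closed-form exact line-search step along -g for the quadratic objective."""
    return (g @ g) / (g @ Q @ g)

rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n))
Q = M @ M.T + np.eye(n)              # symmetric positive definite (assumed test matrix)
x_star = rng.standard_normal(n)
f = lambda x: 0.5 * (x - x_star) @ Q @ (x - x_star)

x = rng.standard_normal(n)
g = Q @ (x - x_star)
eta = exact_step(Q, g)

# Brute-force scan over step sizes in [0, 2*eta] for comparison.
grid = np.linspace(0.0, 2.0 * eta, 2001)
best_on_grid = grid[np.argmin([f(x - s * g) for s in grid])]

# Kantorovich's inequality evaluated on random vectors y.
lam = np.linalg.eigvalsh(Q)
lower = 4 * lam[-1] * lam[0] / (lam[-1] + lam[0]) ** 2
Q_inv = np.linalg.inv(Q)
ratios = []
for _ in range(1000):
    y = rng.standard_normal(n)
    ratios.append(np.linalg.norm(y) ** 4 / ((y @ Q @ y) * (y @ Q_inv @ y)))

print("closed form:", eta, "grid search:", best_on_grid)
print("smallest ratio:", min(ratios), "lower bound:", lower)
```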

Smooth problem

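Definition of $\mu$-strongly convex and $L$-smooth:

$$0 \preceq \mu I \preceq \nabla^{2} f(x) \preceq L I$$

Or, for the quadratic case where $\nabla^{2} f(x) = Q$:

$$\lambda_{n}(Q) I \preceq Q \preceq \lambda_{1}(Q) I$$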

Theorem

For a $\mu$-strongly convex and $L$-smooth problem:

step size: $\eta = \frac{2}{\mu + L}$ (vs. $\frac{2}{\lambda_{1} + \lambda_{n}}$ in the quadratic case)

contraction rate: $\frac{\kappa - 1}{\kappa + 1}, \kappa = \frac{L}{\mu}$ (vs. $\frac{\lambda_{1} - \lambda_{n}}{\lambda_{1} + \lambda_{n}}$ in the quadratic case)

iteration complexity: $O\left(\frac{\log \frac{1}{\varepsilon}}{\log \frac{\kappa + 1}{\kappa - 1}}\right)$

$$f(x^{t}) - f(x^{*}) \leq \frac{L}{2} \left(\frac{\kappa - 1}{\kappa + 1}\right)^{2t} \|x^{0} - x^{*}\|_{2}^{2}$$

Backtracking Line Search

Armijo condition:

$$f(x^{t} - \eta \nabla f(x^{t})) < f(x^{t}) - \alpha \eta \|\nabla f(x^{t})\|_{2}^{2}, \quad 0 < \alpha < 1$$

Algorithm:

initialize $\eta = 1$, $0 < \alpha < \frac{1}{2}$, $0 < \beta < 1$
while $f(x^{t} - \eta \nabla f(x^{t})) > f(x^{t}) - \alpha \eta \|\nabla f(x^{t})\|_{2}^{2}$, do:
- $\eta \leftarrow \beta \eta$

Upper bound from $L$-smoothness: $f(x^{t} - \eta \nabla f(x^{t})) \leq f(x^{t}) - \eta \|\nabla f(x^{t})\|_{2}^{2} + \frac{L \eta^{2}}{2} \|\nabla f(x^{t})\|_{2}^{2}$
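
The backtracking loop can be sketched as follows. This is a minimal illustration on a hypothetical strongly convex quadratic; the function name `backtracking_step`, the values $\alpha = 0.3$, $\beta = 0.5$ and the test problem are assumptions. The loop keeps shrinking $\eta$ while the Armijo condition is violated.

```python
import numpy as np

def backtracking_step(f, grad_f, x, alpha=0.3, beta=0.5, eta0=1.0):
    """Shrink eta until the Armijo condition holds, then take the gradient step."""
    g = grad_f(x)
    g_sq = g @ g
    eta = eta0
    # Continue while the sufficient-decrease (Armijo) condition is violated.
    while f(x - eta * g) > f(x) - alpha * eta * g_sq:
        eta *= beta
    return eta, x - eta * g

# Hypothetical test problem: f(x) = 0.5 x^T Q x - b^T x with mu = 1, L = 10.
Q = np.diag([1.0, 4.0, 10.0])
b = np.array([1.0, -2.0, 3.0])
f = lambda x: 0.5 * x @ Q @ x - b @ x
grad_f = lambda x: Q @ x - b

x = np.zeros(3)
values = [f(x)]
for t in range(300):
    eta, x = backtracking_step(f, grad_f, x)
    values.append(f(x))

x_opt = np.linalg.solve(Q, b)
f_opt = f(x_opt)
print("distance to minimizer:", np.linalg.norm(x - x_opt))
```

No knowledge of $L$ is needed to run the loop; the upper bound above only explains why it terminates once $\eta$ is small enough.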

Theorem

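$f(x^{t}) - f(x^{*}) \leq \left(1 - \min\left\{2 \mu \alpha, \frac{2 \beta \alpha \mu}{L}\right\}\right)^{t} (f(x^{0}) - f(x^{*}))$

Convergence (constant step size $\eta = \frac{2}{\mu + L}$):

$$\|x^{t} - x^{*}\|_{2} \leq \frac{\kappa - 1}{\kappa + 1} \|x^{t-1} - x^{*}\|_{2}$$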

Regularity Condition

$\mu$-strongly convex + $L$-smooth

$$\eta \equiv \eta_{t} = \frac{1}{L}$$

$$\|x^{t} - x^{*}\|_{2} \leq \left(1 - \frac{\mu}{L}\right)^{t} \|x^{0} - x^{*}\|_{2}$$

Polyak-Lojasiewicz Condition

$$\|\nabla f(x)\|_{2}^{2} \geq 2 \mu (f(x) - f(x^{*}))$$

$$f(x^{t}) - f(x^{*}) \leq \left(1 - \frac{\mu}{L}\right)^{t} (f(x^{0}) - f(x^{*}))$$

Over-parameterized linear regression

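Over-parameterized: model dimension > sample size ($n > m$)

Define

$$f(x) = \frac{1}{2} \sum_{i=1}^{m} (a_{i}^{\top} x - y_{i})^{2} = \frac{1}{2} \|Ax - y\|_{2}^{2}$$

Then

$$\nabla f(x) = A^{\top}(Ax - y) = 0 \Leftrightarrow A^{\top} A x = A^{\top} y$$

which gives $x = (A^{\top} A)^{-1} A^{\top} y$ when $A^{\top} A$ is invertible. In the over-parameterized case $A^{\top} A$ is singular, and the minimizers are the solutions of $Ax = y$.

$$\nabla^{2} f(x) = \sum_{i=1}^{m} a_{i} a_{i}^{\top} = A^{\top} A$$

If $A = [a_{1}, \dots, a_{m}]^{\top} \in \mathbb{R}^{m \times n}$ has rank $m$ and the step size is $\eta_{t} \equiv \eta = \frac{1}{\lambda_{\max}(AA^{\top})}$, then:

$$f(x^{t}) - f(x^{*}) \leq \left(1 - \frac{\lambda_{\min}(AA^{\top})}{\lambda_{\max}(AA^{\top})}\right)^{t} (f(x^{0}) - f(x^{*})), \quad \forall t$$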
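
The linear rate can be observed on synthetic data. This is a minimal sketch assuming a random Gaussian $A$ with $m = 5$ samples and $n = 20$ parameters (so $A$ has rank $m$ with probability one); the sizes and seed are illustrative. Because $Ax = y$ is solvable, $f(x^{*}) = 0$ and the objective value itself is the optimality gap.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 20                          # fewer samples than parameters
A = rng.standard_normal((m, n))       # assumed synthetic data, rank m almost surely
y = rng.standard_normal(m)

f = lambda x: 0.5 * np.sum((A @ x - y) ** 2)
gram_eigs = np.linalg.eigvalsh(A @ A.T)   # eigenvalues of the m x m matrix A A^T
lam_min, lam_max = gram_eigs[0], gram_eigs[-1]
eta = 1.0 / lam_max
rate = 1.0 - lam_min / lam_max

x = np.zeros(n)
gaps = [f(x)]                         # f(x*) = 0, so f(x) is the gap
for t in range(200):
    x = x - eta * (A.T @ (A @ x - y))
    gaps.append(f(x))

print("final objective:", gaps[-1])
print("theoretical bound:", rate ** 200 * gaps[0])
```

Note that $A^{\top} A$ has $n - m$ zero eigenvalues here, so $f$ is not strongly convex; the rate comes from the Polyak-Lojasiewicz condition with $\mu = \lambda_{\min}(AA^{\top})$.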

Convex and Smooth Problem

L-smooth

majorization-minimization

Theorem

If $f$ is convex and $L$-smooth and $\eta = \frac{1}{L}$, then:
1. $f(x^{t+1}) \leq f(x^{t}) - \frac{1}{2L} \|\nabla f(x^{t})\|_{2}^{2}$
2. $\|x^{t} - x^{*}\|_{2}^{2} \leq \|x^{t-1} - x^{*}\|_{2}^{2} - \frac{1}{L^{2}} \|\nabla f(x^{t-1})\|_{2}^{2}$
3. $f(x^{t}) - f(x^{*}) \leq \frac{2L \|x^{0} - x^{*}\|_{2}^{2}}{t}$

Non-convex Problem

Theorem

for general $L$-smooth $f$: $\min_{0 \leq k < t} \|\nabla f(x^{k})\|_{2} \leq \sqrt{\frac{2L(f(x^{0}) - f(x^{*}))}{t}}$

for convex $f$: $\min_{\frac{t}{2} \leq k < t} \|\nabla f(x^{k})\|_{2} \leq \frac{4L \|x^{0} - x^{*}\|_{2}}{t}$
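
The general bound can be illustrated on a simple non-convex function. The sketch below assumes the test function $f(x) = \sum_{i} \cos(x_{i})$, whose second derivatives are bounded by 1 (so $L = 1$) and whose minimum value is $-n$; it also assumes the step size $\eta = \frac{1}{L}$. It records the gradient norms along the gradient descent path.

```python
import numpy as np

# Assumed test function: non-convex, L-smooth with L = 1, minimum value -n.
f = lambda x: np.sum(np.cos(x))
grad_f = lambda x: -np.sin(x)
L = 1.0

rng = np.random.default_rng(0)
n = 10
x = rng.uniform(-1.0, 1.0, size=n)
f0 = f(x)
f_star = -float(n)

grad_norms = []
for t in range(100):
    g = grad_f(x)
    grad_norms.append(np.linalg.norm(g))
    x = x - g / L                     # gradient step with eta = 1/L

# Smallest gradient norm seen in the first t iterations vs. the bound.
for t in (1, 10, 100):
    print(t, min(grad_norms[:t]), np.sqrt(2 * L * (f0 - f_star) / t))
```

The guarantee concerns only the smallest gradient norm seen so far, that is, convergence to a stationary point rather than to a global minimizer.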

Gradient methods for Constrained Problems

Frank-Wolfe algorithm

$y^{t} := \arg\min_{x \in C} \langle \nabla f(x^{t}), x \rangle$
$x^{t+1} = (1 - \eta_{t}) x^{t} + \eta_{t} y^{t}$

Here $y^{t}$ minimizes the linear approximation $f(x^{t}) + \langle \nabla f(x^{t}), x - x^{t} \rangle$ over the convex set $C$. The step size can be chosen by a line search (as in Exact Line Search) or by the fixed schedule $\eta_{t} = \frac{2}{t + 2}$.

A non-convex example:

$$\text{minimize} \; -x^{\top} Q x \quad \text{subject to} \quad \|x\|_{2} \leq 1$$

Then:

$$\begin{aligned}
y^{t} &= \arg\min_{x : \|x\|_{2} \leq 1} \langle \nabla f(x^{t}), x \rangle = -\frac{\nabla f(x^{t})}{\|\nabla f(x^{t})\|_{2}} = \frac{Q x^{t}}{\|Q x^{t}\|_{2}} \\
\Rightarrow\; x^{t+1} &= (1 - \eta_{t}) x^{t} + \eta_{t} \frac{Q x^{t}}{\|Q x^{t}\|_{2}}
\end{aligned}$$

Convergence

Theorem

Suppose $f$ is convex and $L$-smooth, and $\eta_{t} = \frac{2}{t + 2}$. Then:
$f(x^{t}) - f(x^{*}) \leq \frac{2 L d_{C}^{2}}{t + 2}$
where $d_{C} = \sup_{x, y \in C} \|x - y\|_{2}$ is the diameter of $C$.
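
The bound can be checked on a small convex problem. This is a minimal sketch assuming the test objective $f(x) = \frac{1}{2}\|x - b\|_{2}^{2}$ over the unit $\ell_{2}$ ball, so that $L = 1$ and $d_{C} = 2$; the helper name `lmo_l2_ball` and the vector `b` are illustrative. The linear minimization step over the ball has the closed form $-\nabla f / \|\nabla f\|_{2}$.

```python
import numpy as np

def lmo_l2_ball(g):
    """argmin of <g, y> over ||y||_2 <= 1, i.e. -g/||g||_2 (the origin if g = 0)."""
    norm = np.linalg.norm(g)
    return -g / norm if norm > 0 else np.zeros_like(g)

# Assumed test problem: b lies outside the unit ball, so the constraint is active.
b = np.array([3.0, 4.0])
f = lambda x: 0.5 * np.sum((x - b) ** 2)
grad_f = lambda x: x - b
L = 1.0                               # the Hessian is the identity
d_C = 2.0                             # diameter of the unit ball
x_opt = b / np.linalg.norm(b)         # minimizer: projection of b onto the ball

x = np.array([1.0, 0.0])              # feasible starting point
gaps = [f(x) - f(x_opt)]
for t in range(200):
    y = lmo_l2_ball(grad_f(x))        # linear minimization step
    eta = 2.0 / (t + 2)
    x = (1 - eta) * x + eta * y       # convex combination stays inside the ball
    gaps.append(f(x) - f(x_opt))

print("final gap:", gaps[-1], "bound:", 2 * L * d_C ** 2 / (200 + 2))
```

Each iterate is a convex combination of feasible points, so no projection is ever needed.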

For a compact constraint set, $\varepsilon$-accuracy is reached within $O(\frac{1}{\varepsilon})$ iterations.

Example

Define the ball $B(a, r) := \{y \mid \|y - a\|_{2} \leq r\}$. A set $C$ is $\mu$-strongly convex if, for all $\lambda \in [0, 1]$ and all $x, z \in C$:
$B\left(\lambda x + (1 - \lambda) z, \frac{\mu}{2} \lambda (1 - \lambda) \|x - z\|_{2}^{2}\right) \subseteq C$

Theorem

Suppose $f$ is convex and $L$-smooth, $C$ is $\mu$-strongly convex, and $\|\nabla f(x)\|_{2} \geq c > 0$ for all $x \in C$. Then Frank-Wolfe converges linearly.

Projected Gradient Method

Map a point that lies outside the set back into the set.

Definition

Euclidean projection (quadratic minimization):
$P_{C}(x) := \arg\min_{z \in C} \|x - z\|_{2}^{2}$

Iterate:

$$x^{t+1} = P_{C}(x^{t} - \eta_{t} \nabla f(x^{t}))$$

Theorem

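Suppose the set $C$ is closed and convex. Then:
$(x - P_{C}(x))^{\top} (z - P_{C}(x)) \leq 0, \forall z \in C$

Applying this with $x = x^{t} - \eta_{t} \nabla f(x^{t})$ and $z = x^{t}$ gives $-\nabla f(x^{t})^{\top} (x^{t+1} - x^{t}) \geq 0$, i.e. $x^{t+1} - x^{t}$ is positively correlated with the steepest descent direction.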
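
Both statements can be checked numerically. The sketch below assumes $C$ is the unit $\ell_{2}$ ball and $f(x) = \frac{1}{2}\|x - b\|_{2}^{2}$ with $b$ outside the ball; the helper name `project_l2_ball`, the step size and the test data are illustrative. It runs the projected gradient iteration, records $-\nabla f(x^{t})^{\top}(x^{t+1} - x^{t})$, and evaluates the projection inequality on random feasible points.

```python
import numpy as np

def project_l2_ball(x, radius=1.0):
    """Euclidean projection onto {z : ||z||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

# Assumed test problem: f(x) = 0.5 ||x - b||^2 with ||b||_2 = 3 > 1.
b = np.array([2.0, 1.0, -2.0])
grad_f = lambda x: x - b
eta = 0.5

x = np.array([0.0, 1.0, 0.0])         # feasible starting point
inner_products = []
for t in range(100):
    g = grad_f(x)
    x_next = project_l2_ball(x - eta * g)
    inner_products.append(-g @ (x_next - x))   # correlation with the descent direction
    x = x_next

# Projection inequality (x - P(x))^T (z - P(x)) <= 0, with x = b and random z in C.
rng = np.random.default_rng(0)
p = project_l2_ball(b)
products = []
for _ in range(1000):
    z = project_l2_ball(rng.standard_normal(3))
    products.append((b - p) @ (z - p))

print("limit point:", x, "expected:", b / np.linalg.norm(b))
```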

Strongly Convex

Theorem

Suppose $x^{*} \in \mathrm{int}(C)$, and $f$ is $\mu$-strongly convex and $L$-smooth. Let $\eta_{t} = \frac{2}{\mu + L}, \kappa = \frac{L}{\mu}$. Then
$\|x^{t} - x^{*}\|_{2} \leq \left(\frac{\kappa - 1}{\kappa + 1}\right)^{t} \|x^{0} - x^{*}\|_{2}$

For other cases, see Smooth problem.
