Recall from Section 2.4 that calculating derivatives is the crucial step in all the optimization algorithms that we will use to train deep networks. While the calculations are straightforward, working them out by hand can be tedious and error-prone, and this problem only grows as our models become more complex.
Fortunately, all modern deep learning frameworks take this work off our plates by offering automatic differentiation (often shortened to autograd). As we pass data through each successive function, the framework builds a computational graph that tracks how each value depends on others. To calculate derivatives, automatic differentiation works backwards through this graph, applying the chain rule. The computational algorithm for applying the chain rule in this fashion is called backpropagation.
While autograd libraries have become a hot topic over the past decade, they have a long history. In fact, the earliest references to autograd date back over half a century (Wengert, 1964). The core ideas behind modern backpropagation date to a PhD thesis from 1980 (Speelpenning, 1980) and were further developed in the late 1980s (Griewank, 1989). While backpropagation has become the default method for computing gradients, it is not the only option. For instance, the Julia programming language employs forward propagation (Revels et al., 2016). Before exploring methods, let's first master the autograd package.
import tensorflow as tf
2.5.1. A Simple Function
Suppose that we are interested in differentiating the function $y = 2\mathbf{x}^{\top}\mathbf{x}$ with respect to the column vector $\mathbf{x}$. To start, we assign x an initial value.
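As a minimal sketch (assuming PyTorch as the framework for this walkthrough, matching the requires_grad code below), we might initialize x like this:

import torch

# A sketch: create the vector x = [0., 1., 2., 3.] used throughout this section
x = torch.arange(4.0)
x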
Before we calculate the gradient of y with respect to x, we need a place to store it. In general, we avoid allocating new memory every time we take a derivative because deep learning requires successively computing derivatives with respect to the same parameters thousands or millions of times, and we might risk running out of memory. Note that the gradient of a scalar-valued function with respect to a vector x is itself vector-valued and has the same shape as x.
# Can also create x = torch.arange(4.0, requires_grad=True)
x.requires_grad_(True)
x.grad # The gradient is None by default
We now calculate our function of x and assign the result to y.
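Continuing the PyTorch sketch above, one way to express this computation is with a dot product (torch.dot is an assumption here; any equivalent formulation of $2\mathbf{x}^{\top}\mathbf{x}$ would do):

# A sketch: y = 2 * (x dot x), which evaluates to 28 for x = [0., 1., 2., 3.]
y = 2 * torch.dot(x, x)
y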
tensor(28., grad_fn=<MulBackward0>)
We can now take the gradient of y with respect to x by calling its backward method. Next, we can access the gradient via x's grad attribute. Since $y = 2\mathbf{x}^{\top}\mathbf{x}$, the gradient with respect to $\mathbf{x}$ should be $4\mathbf{x}$.
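A minimal sketch of this step under the same PyTorch setup:

# Run backpropagation starting from y, then read the gradient accumulated on x
y.backward()
x.grad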
tensor([ 0., 4., 8., 12.])
In JAX, by contrast, we take the gradient of y with respect to x by passing y through the grad transform. Because grad operates on functions, y is defined here as a Python function of x rather than as a stored value.
from jax import grad
import jax.numpy as jnp

# Define x and y in JAX terms; here y is a Python function of x
x = jnp.arange(4.0)
y = lambda x: 2 * jnp.dot(x, x)
# The `grad` transform returns a Python function that
# computes the gradient of the original function
x_grad = grad(y)(x)
x_grad
Array([ 0., 4., 8., 12.], dtype=float32)
In TensorFlow, we can calculate the gradient of y with respect to x by recording the computation on a GradientTape and then calling the tape's gradient method.
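A sketch of this approach, assuming the standard tf.GradientTape API and the tensorflow import from earlier:

# Record the computation of y on a gradient tape, then ask the tape for dy/dx
x = tf.Variable(tf.range(4, dtype=tf.float32))
with tf.GradientTape() as t:
    y = 2 * tf.tensordot(x, x, axes=1)
x_grad = t.gradient(y, x)
x_grad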
<tf.Tensor: shape=(4,), dtype=float32, numpy=array([ 0., 4., 8., 12.], dtype=float32)>