
news 2025/10/14 12:44:24

I've been reading the Llama model code in the transformers library and wrote these notes along the way, both to reinforce what I learned and to have something to come back to. I'm a beginner, so please point out any mistakes.

Layer normalization: LlamaRMSNorm

transformers defines the LlamaRMSNorm class as follows:

```python
import math
from typing import Optional, Tuple

import torch
import torch.nn.functional as F
from torch import nn


class LlamaRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        """
        LlamaRMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        # Compute in float32 for numerical stability, then cast back.
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)
```

This is RMS (Root Mean Square) normalization. The RMS is computed as

$$
\mathrm{RMS}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}
$$

and the RMSNorm output, matching the code above, is

$$
\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\mathrm{RMS}(x)^2 + \epsilon}} * W
$$

The small constant $\epsilon$ keeps the denominator away from zero, which keeps the computation numerically stable.

Rotary position embedding: LlamaRotaryEmbedding

Absolute position encodings are cheap to compute but perform less well. Relative position encodings give the translation invariance along the sequence dimension that NLP tasks call for, but are less efficient to compute. Rotary position embedding (RoPE) is applied like an absolute position encoding yet achieves the effect of a relative one.

transformers defines the LlamaRotaryEmbedding class, which implements the rotary embedding, as follows:

```python
class LlamaRotaryEmbedding(nn.Module):
    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
        super().__init__()

        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        self.base = base
        inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

        # Build here to make `torch.jit.trace` work.
        self._set_cos_sin_cache(
            seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
        )

    def _set_cos_sin_cache(self, seq_len, device, dtype):
        self.max_seq_len_cached = seq_len
        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)

        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        # Different from paper, but it uses a different permutation in order to obtain the same calculation
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
        self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)

    def forward(self, x, seq_len=None):
        # x: [bs, num_attention_heads, seq_len, head_size]
        if seq_len > self.max_seq_len_cached:
            self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)

        return (
            self.cos_cached[:seq_len].to(dtype=x.dtype),
            self.sin_cached[:seq_len].to(dtype=x.dtype),
        )
```

The variables have the following meaning:

- dim: the dimension that is rotated (LlamaAttention, shown later, passes head_dim here)
- max_position_embeddings: the maximum encoded length, 2048 by default
- base: the base, 10000 by default

The line

inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))

implements the formula

$$
\mathrm{inv\_freq}_i = \frac{1}{\mathrm{base}^{2i/\mathrm{dim}}}, \quad i = 0, 1, \dots, \mathrm{dim}/2 - 1
$$

In the code above, t has shape [max_position_embeddings] and inv_freq has shape [dim/2]. After torch.einsum("i,j->ij", t, self.inv_freq) the result has shape [max_position_embeddings, dim/2], and after emb = torch.cat((freqs, freqs), dim=-1) the shape becomes [max_position_embeddings, dim].

In two dimensions the rotation matrix is

$$
R(k) = \begin{pmatrix} \cos k\theta & -\sin k\theta \\ \sin k\theta & \cos k\theta \end{pmatrix}
$$

The rotary position embedding is computed as follows, where $\theta_i$ is the $i$-th entry of inv_freq:

$$
R(k)x =
\begin{pmatrix} \cos k\theta_0 \\ \cos k\theta_0 \\ \cos k\theta_1 \\ \cos k\theta_1 \\ \vdots \\ \cos k\theta_{d/2-1} \\ \cos k\theta_{d/2-1} \end{pmatrix}
\circ
\begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_{d-2} \\ x_{d-1} \end{pmatrix}
+
\begin{pmatrix} \sin k\theta_0 \\ \sin k\theta_0 \\ \sin k\theta_1 \\ \sin k\theta_1 \\ \vdots \\ \sin k\theta_{d/2-1} \\ \sin k\theta_{d/2-1} \end{pmatrix}
\circ
\begin{pmatrix} -x_1 \\ x_0 \\ -x_3 \\ x_2 \\ \vdots \\ -x_{d-1} \\ x_{d-2} \end{pmatrix}
$$

When using an LLM we often want to extend the context length, for example to support multi-turn conversations. There are several ways to do this.

Extrapolation. Keep the current formula and simply compute encodings for longer positions. This is simple, but it suffers from correlation decay: if the training corpus is around 2k tokens long, the model learns how correlations between tokens behave within roughly 2k tokens. Applying the same rule directly to a 5k context may let the correlation decay to zero somewhere in the middle, so the model can no longer capture the relationship between two tokens.

Linear interpolation. Linear interpolation changes the encoding formula so that distances between tokens are shrunk proportionally. For example, when a 2k context is extended to 32k (a factor of 16), two tokens that are 16 positions apart are encoded as if they were 1 position apart. This can change the decay behavior at short distances considerably, but it keeps positions inside the range over which the model learned its decay behavior, so without fine-tuning it generally works better than direct extrapolation.

transformers implements linear interpolation as follows:

```python
class LlamaLinearScalingRotaryEmbedding(LlamaRotaryEmbedding):
    """LlamaRotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""

    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
        self.scaling_factor = scaling_factor
        super().__init__(dim, max_position_embeddings, base, device)

    def _set_cos_sin_cache(self, seq_len, device, dtype):
        self.max_seq_len_cached = seq_len
        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
        t = t / self.scaling_factor

        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        # Different from paper, but it uses a different permutation in order to obtain the same calculation
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
        self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
```

The line t = t / self.scaling_factor divides the positions by a scaling factor, which produces the linear scaling.

Dynamic NTK scaling. Extrapolation handles correlations between distant tokens poorly, while linear interpolation changes the correlations between nearby tokens considerably, so the two can be combined. To behave like extrapolation at short distances and like interpolation at long distances, a frequency factor tied to the position index is adjusted dynamically (in the code below it depends on the sequence length).

transformers implements dynamic NTK scaling as follows:

```python
class LlamaDynamicNTKScalingRotaryEmbedding(LlamaRotaryEmbedding):
    """LlamaRotaryEmbedding extended with Dynamic NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla"""

    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
        self.scaling_factor = scaling_factor
        super().__init__(dim, max_position_embeddings, base, device)

    def _set_cos_sin_cache(self, seq_len, device, dtype):
        self.max_seq_len_cached = seq_len

        if seq_len > self.max_position_embeddings:
            # Enlarge the base only when the sequence exceeds the trained length.
            base = self.base * (
                (self.scaling_factor * seq_len / self.max_position_embeddings) - (self.scaling_factor - 1)
            ) ** (self.dim / (self.dim - 2))
            inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
            self.register_buffer("inv_freq", inv_freq, persistent=False)

        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)

        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        # Different from paper, but it uses a different permutation in order to obtain the same calculation
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
        self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
```

When the length exceeds max_position_embeddings, base is updated as

$$
\mathrm{base}' = \mathrm{base} \cdot \left(\mathrm{factor} \cdot \frac{\mathrm{seq\_len}}{\mathrm{max\_len}} - (\mathrm{factor} - 1)\right)^{\frac{\mathrm{dim}}{\mathrm{dim}-2}}
$$

If seq_len > max_position_embeddings and factor > 1, the term in parentheses equals 1 + factor * (seq_len / max_len - 1), which is greater than 1, so base grows. Since base > 1, a larger base makes the inv_freq values smaller, which stretches the behavior learned at short distances over longer distances.

The code that actually applies the position encoding is:

```python
def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)


# Copied from transformers.models.gpt_neox.modeling_gpt_neox.apply_rotary_pos_emb
def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
    cos = cos[position_ids].unsqueeze(1)  # [seq_len, dim] -> [batch_size, 1, seq_len, head_dim]
    sin = sin[position_ids].unsqueeze(1)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```

rotate_half() splits the input x into two halves along the last dimension, negates the second half, and concatenates them as (-x2, x1).

In Llama, the rotary position embedding of Q is therefore computed as

$$
R(k)Q =
\begin{pmatrix} \cos k\theta_0 \\ \cos k\theta_1 \\ \vdots \\ \cos k\theta_{d/2-1} \\ \cos k\theta_0 \\ \cos k\theta_1 \\ \vdots \\ \cos k\theta_{d/2-1} \end{pmatrix}
\circ
\begin{pmatrix} q_0 \\ q_1 \\ \vdots \\ q_{d/2-1} \\ q_{d/2} \\ q_{d/2+1} \\ \vdots \\ q_{d-1} \end{pmatrix}
+
\begin{pmatrix} \sin k\theta_0 \\ \sin k\theta_1 \\ \vdots \\ \sin k\theta_{d/2-1} \\ \sin k\theta_0 \\ \sin k\theta_1 \\ \vdots \\ \sin k\theta_{d/2-1} \end{pmatrix}
\circ
\begin{pmatrix} -q_{d/2} \\ -q_{d/2+1} \\ \vdots \\ -q_{d-1} \\ q_0 \\ q_1 \\ \vdots \\ q_{d/2-1} \end{pmatrix}
$$

Position information is added only to Q and K.

Feed-forward network: LlamaMLP

transformers defines the feed-forward network as follows:

```python
class LlamaMLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        if self.config.pretraining_tp > 1:
            # Emulate tensor parallelism: split the weights and apply them slice by slice.
            slice = self.intermediate_size // self.config.pretraining_tp
            gate_proj_slices = self.gate_proj.weight.split(slice, dim=0)
            up_proj_slices = self.up_proj.weight.split(slice, dim=0)
            down_proj_slices = self.down_proj.weight.split(slice, dim=1)

            gate_proj = torch.cat(
                [F.linear(x, gate_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1
            )
            up_proj = torch.cat([F.linear(x, up_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1)

            intermediate_states = (self.act_fn(gate_proj) * up_proj).split(slice, dim=2)
            down_proj = [
                F.linear(intermediate_states[i], down_proj_slices[i]) for i in range(self.config.pretraining_tp)
            ]
            down_proj = sum(down_proj)
        else:
            down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))

        return down_proj
```

__init__() defines hidden_size and intermediate_size, which control the size of the module, together with three linear layers:

- gate_proj: projects hidden_size to intermediate_size
- up_proj: projects hidden_size to intermediate_size
- down_proj: projects intermediate_size to hidden_size

The input is first mapped by up_proj from hidden_size to intermediate_size and then mapped back by down_proj from intermediate_size to hidden_size. gate_proj together with the activation function acts as a gate on the up projection.

forward() picks its execution strategy based on config.pretraining_tp. When pretraining_tp > 1, the weights of the three linear layers are split into slices and each slice is applied separately. The gate and up sub-projections are concatenated, and the down sub-projections are summed.

Multi-head attention: LlamaAttention

transformers defines multi-head attention as follows:

```python
class LlamaAttention(nn.Module):
    """Multi-headed attention from 'Attention Is All You Need' paper"""

    def __init__(self, config: LlamaConfig):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = self.hidden_size // self.num_heads
        self.num_key_value_heads = config.num_key_value_heads
        self.num_key_value_groups = self.num_heads // self.num_key_value_heads
        self.max_position_embeddings = config.max_position_embeddings
        self.rope_theta = config.rope_theta

        if (self.head_dim * self.num_heads) != self.hidden_size:
            raise ValueError(
                f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
                f" and `num_heads`: {self.num_heads})."
            )
        self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias)
        self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
        self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
        self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias)
        self._init_rope()

    def _init_rope(self):
        if self.config.rope_scaling is None:
            self.rotary_emb = LlamaRotaryEmbedding(
                self.head_dim,
                max_position_embeddings=self.max_position_embeddings,
                base=self.rope_theta,
            )
        else:
            scaling_type = self.config.rope_scaling["type"]
            scaling_factor = self.config.rope_scaling["factor"]
            if scaling_type == "linear":
                self.rotary_emb = LlamaLinearScalingRotaryEmbedding(
                    self.head_dim,
                    max_position_embeddings=self.max_position_embeddings,
                    scaling_factor=scaling_factor,
                    base=self.rope_theta,
                )
            elif scaling_type == "dynamic":
                self.rotary_emb = LlamaDynamicNTKScalingRotaryEmbedding(
                    self.head_dim,
                    max_position_embeddings=self.max_position_embeddings,
                    scaling_factor=scaling_factor,
                    base=self.rope_theta,
                )
            else:
                raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
```

The main attributes defined here are:

- hidden_size: the size of the hidden layer
- num_heads: the number of attention heads
- head_dim: the dimension of each attention head, computed as hidden_size // num_heads
- num_key_value_heads: the number of key/value attention heads
- num_key_value_groups: the number of query heads per key/value head, computed as num_heads // num_key_value_heads
- rope_theta: the RoPE base described earlier

In addition, four linear layers are defined to compute the query (Q), key (K), value (V) and output (O). Note that the number of key/value heads may differ from the number of query heads: it can be a fraction of the number of query heads, which reduces the parameter count.

The attention computation is:

```python
# LlamaAttention.forward
def forward(
    self,
    hidden_states: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    past_key_value: Optional[Tuple[torch.Tensor]] = None,
    output_attentions: bool = False,
    use_cache: bool = False,
    padding_mask: Optional[torch.LongTensor] = None,
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
    bsz, q_len, _ = hidden_states.size()

    if self.config.pretraining_tp > 1:
        key_value_slicing = (self.num_key_value_heads * self.head_dim) // self.config.pretraining_tp
        query_slices = self.q_proj.weight.split(
            (self.num_heads * self.head_dim) // self.config.pretraining_tp, dim=0
        )
        key_slices = self.k_proj.weight.split(key_value_slicing, dim=0)
        value_slices = self.v_proj.weight.split(key_value_slicing, dim=0)

        query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.config.pretraining_tp)]
        query_states = torch.cat(query_states, dim=-1)

        key_states = [F.linear(hidden_states, key_slices[i]) for i in range(self.config.pretraining_tp)]
        key_states = torch.cat(key_states, dim=-1)

        value_states = [F.linear(hidden_states, value_slices[i]) for i in range(self.config.pretraining_tp)]
        value_states = torch.cat(value_states, dim=-1)

    else:
        query_states = self.q_proj(hidden_states)
        key_states = self.k_proj(hidden_states)
        value_states = self.v_proj(hidden_states)

    # [bsz, q_len, heads * head_dim] -> [bsz, heads, q_len, head_dim]
    query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
    key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
    value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)

    kv_seq_len = key_states.shape[-2]
    if past_key_value is not None:
        kv_seq_len += past_key_value[0].shape[-2]
    cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)

    if past_key_value is not None:
        # reuse k, v, self_attention
        key_states = torch.cat([past_key_value[0], key_states], dim=2)
        value_states = torch.cat([past_key_value[1], value_states], dim=2)

    past_key_value = (key_states, value_states) if use_cache else None

    key_states = repeat_kv(key_states, self.num_key_value_groups)
    value_states = repeat_kv(value_states, self.num_key_value_groups)

    attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)

    if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
        raise ValueError(
            f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
            f" {attn_weights.size()}"
        )

    if attention_mask is not None:
        if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
            raise ValueError(
                f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
            )
        attn_weights = attn_weights + attention_mask

    # upcast attention to fp32
    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
    attn_output = torch.matmul(attn_weights, value_states)

    if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
        raise ValueError(
            f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
            f" {attn_output.size()}"
        )

    attn_output = attn_output.transpose(1, 2).contiguous()
    attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)

    if self.config.pretraining_tp > 1:
        attn_output = attn_output.split(self.hidden_size // self.config.pretraining_tp, dim=2)
        o_proj_slices = self.o_proj.weight.split(self.hidden_size // self.config.pretraining_tp, dim=1)
        attn_output = sum([F.linear(attn_output[i], o_proj_slices[i]) for i in range(self.config.pretraining_tp)])
    else:
        attn_output = self.o_proj(attn_output)

    if not output_attentions:
        attn_weights = None

    return attn_output, attn_weights, past_key_value
```

The attention is essentially the same as in "Attention Is All You Need":

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

In Llama, every attention computation applies the position encoding (RoPE) to Q and K. Because K and V can have only a fraction of the heads that Q has, they are first repeated before the attention is computed:

```python
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
```

The attention itself is computed by:

```python
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
attn_weights = attn_weights + attention_mask  # optional
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
attn_output = torch.matmul(attn_weights, value_states)
```

Finally attn_output goes through the o_proj linear layer and is returned. As in the feed-forward network, if config.pretraining_tp is greater than 1, the projection weights are split into slices and applied block by block.

Decoder layer: LlamaDecoderLayer

transformers defines the decoder layer as follows:

```python
class LlamaDecoderLayer(nn.Module):
    def __init__(self, config: LlamaConfig):
        super().__init__()
        self.hidden_size = config.hidden_size
        self.self_attn = (
            LlamaAttention(config=config)
            if not getattr(config, "_flash_attn_2_enabled", False)
            else LlamaFlashAttention2(config=config)
        )
        self.mlp = LlamaMLP(config)
        self.input_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.post_attention_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
```

A decoder layer consists of an attention layer, an MLP and two LlamaRMSNorm layers. The forward pass is:

```python
# LlamaDecoderLayer.forward
def forward(
    self,
    hidden_states: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    past_key_value: Optional[Tuple[torch.Tensor]] = None,
    output_attentions: Optional[bool] = False,
    use_cache: Optional[bool] = False,
    padding_mask: Optional[torch.LongTensor] = None,
) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
    """
    Args:
        hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
        attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
            `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
        output_attentions (`bool`, *optional*):
            Whether or not to return the attentions tensors of all attention layers. See `attentions` under
            returned tensors for more detail.
        use_cache (`bool`, *optional*):
            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
            (see `past_key_values`).
        past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
    """

    residual = hidden_states

    hidden_states = self.input_layernorm(hidden_states)

    # Self Attention
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
        hidden_states=hidden_states,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_value=past_key_value,
        output_attentions=output_attentions,
        use_cache=use_cache,
        padding_mask=padding_mask,
    )
    hidden_states = residual + hidden_states

    # Fully Connected
    residual = hidden_states
    hidden_states = self.post_attention_layernorm(hidden_states)
    hidden_states = self.mlp(hidden_states)
    hidden_states = residual + hidden_states

    outputs = (hidden_states,)

    if output_attentions:
        outputs += (self_attn_weights,)

    if use_cache:
        outputs += (present_key_value,)

    return outputs
```

Inside the decoder layer, the input hidden_states goes through the following steps:

1. Normalize with input_layernorm.
2. Compute self-attention.
3. Add a residual connection.
4. Normalize with post_attention_layernorm.
5. Pass through mlp and add a residual connection with the result of step 3.

Model: LlamaModel

transformers defines the model as follows:

```python
class LlamaModel(LlamaPreTrainedModel):
    """
    Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`LlamaDecoderLayer`]

    Args:
        config: LlamaConfig
    """

    def __init__(self, config: LlamaConfig):
        super().__init__(config)
        self.padding_idx = config.pad_token_id
        self.vocab_size = config.vocab_size

        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
        self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
        self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

        self.gradient_checkpointing = False
        # Initialize weights and apply final processing
        self.post_init()
```

The Llama model is a stack of decoder layers. Setting gradient_checkpointing=True during the forward pass saves GPU memory, but it cannot be combined with use_cache=True because the two options are incompatible:

```python
if self.gradient_checkpointing and self.training:
    if use_cache:
        logger.warning_once(
            "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
        )
        use_cache = False
```

The forward pass defines a custom forward function:

```python
def create_custom_forward(module):
    def custom_forward(*inputs):
        # None for past_key_value
        return module(*inputs, past_key_value, output_attentions, padding_mask=padding_mask)

    return custom_forward
```

When gradient_checkpointing=True, torch.utils.checkpoint.checkpoint() runs the wrapped module without keeping its intermediate activations and recomputes them during the backward pass, trading extra compute for lower memory use:

```python
layer_outputs = torch.utils.checkpoint.checkpoint(
    create_custom_forward(decoder_layer), hidden_states, attention_mask, position_ids
)
```

decoder_layer in this code is the decoder layer described above. After all decoder layers, the output goes through an RMSNorm layer to produce the final result.

Language model: LlamaForCausalLM

transformers defines the language model as follows:

```python
class LlamaForCausalLM(LlamaPreTrainedModel):
    _tied_weights_keys = ["lm_head.weight"]

    def __init__(self, config):
        super().__init__(config)
        self.model = LlamaModel(config)
        self.vocab_size = config.vocab_size
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

        # Initialize weights and apply final processing
        self.post_init()
```

In essence this adds an lm_head on top of the LlamaModel described above to produce the output. The core of the forward pass is:

```python
# decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
outputs = self.model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    position_ids=position_ids,
    past_key_values=past_key_values,
    inputs_embeds=inputs_embeds,
    use_cache=use_cache,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)

hidden_states = outputs[0]
if self.config.pretraining_tp > 1:
    lm_head_slices = self.lm_head.weight.split(self.vocab_size // self.config.pretraining_tp, dim=0)
    logits = [F.linear(hidden_states, lm_head_slices[i]) for i in range(self.config.pretraining_tp)]
    logits = torch.cat(logits, dim=-1)
else:
    logits = self.lm_head(hidden_states)
logits = logits.float()
```

If labels are passed in, the cross-entropy loss is computed automatically.

Classification model: LlamaForSequenceClassification

The classification model is likewise a LlamaModel plus a linear layer named score. When computing the loss, it picks the loss function according to the problem type:

```python
if self.config.problem_type == "regression":
    loss_fct = MSELoss()
    if self.num_labels == 1:
        loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
    else:
        loss = loss_fct(pooled_logits, labels)
elif self.config.problem_type == "single_label_classification":
    loss_fct = CrossEntropyLoss()
    loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
elif self.config.problem_type == "multi_label_classification":
    loss_fct = BCEWithLogitsLoss()
    loss = loss_fct(pooled_logits, labels)
```

Summary

Taking LlamaModel as an example, the data flows as follows. If the input is input_ids, inputs_embeds is computed first and used as hidden_states, which then passes through a stack of LlamaDecoderLayer modules followed by a final LlamaRMSNorm.

Inside LlamaDecoderLayer:

1. Record the original input, then apply LlamaRMSNorm to the input hidden_states.
2. Apply LlamaAttention to the result of step 1.
3. Add the result of step 2 to the original input (residual connection) and record the result.
4. Apply LlamaRMSNorm to the result of step 3.
5. Apply LlamaMLP to the result of step 4.
6. Add the result of step 5 to the result of step 3 (residual connection) and output it.

Inside LlamaAttention:

1. Project the input hidden_states to Q, K and V.
2. Apply the rotary position embedding to Q and K.
3. Compute self-attention according to the formula.
4. Pass the attention output through a linear layer and output it.

Inside LlamaMLP:

1. Pass the original input through up_proj to get the up projection.
2. Pass the original input through gate_proj and the activation function to get the gated projection.
3. Multiply the up projection from step 1 and the gated projection from step 2 element-wise.
4. Pass the result of step 3 through down_proj and output the result.
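As a quick numerical check of the RMSNorm formula from the LlamaRMSNorm section, here is a minimal pure-Python sketch. It is not the transformers implementation: the function name `rms_norm` and the sample vector are made up for illustration, and it works on a plain list instead of a tensor.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Hypothetical helper: x / sqrt(mean(x^2) + eps), scaled element-wise by weight."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


x = [1.0, -2.0, 3.0, -4.0]
y = rms_norm(x, weight=[1.0] * len(x))

# With unit weights, the root mean square of the output should be close to 1.
rms_y = math.sqrt(sum(v * v for v in y) / len(y))
print(y)
print(rms_y)
```

Unlike standard layer normalization, nothing is subtracted from the input here: only the scale is normalized, so the signs and relative proportions of the entries are preserved.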
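The claim in the rotary embedding section that RoPE is applied per position yet behaves like a relative encoding can be checked with a small sketch. This assumes the half-split layout used by rotate_half() and the default base of 10000. The helpers `rope` and `dot`, and the sample vectors, are illustrative names of my own and operate on plain lists, not tensors.

```python
import math


def rotate_half(x):
    """List version of rotate_half: (x1, x2) -> (-x2, x1)."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def rope(x, position, base=10000.0):
    """Hypothetical helper: apply the rotary embedding to one vector at one position."""
    dim = len(x)
    inv_freq = [1.0 / base ** (2 * i / dim) for i in range(dim // 2)]
    # Duplicate the angles, mirroring torch.cat((freqs, freqs), dim=-1).
    angles = [position * f for f in inv_freq] * 2
    rotated = rotate_half(x)
    return [v * math.cos(a) + r * math.sin(a) for v, r, a in zip(x, rotated, angles)]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q = [0.3, -1.2, 0.7, 2.0]
k = [1.5, 0.4, -0.6, 0.9]

score_a = dot(rope(q, 5), rope(k, 2))      # positions 5 and 2, offset 3
score_b = dot(rope(q, 105), rope(k, 102))  # positions 105 and 102, offset 3
score_c = dot(rope(q, 5), rope(k, 4))      # positions 5 and 4, offset 1
print(score_a, score_b, score_c)
```

The first two scores agree up to floating-point error because each pair (x_i, x_{i+d/2}) is rotated by an angle proportional to the position, so the dot product only sees the difference of the two angles. The third score uses a different offset and comes out different.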
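To see what the dynamic NTK base update does to the frequencies, the formula can be evaluated directly. This is only a sketch of the arithmetic under assumed values (dim 128, trained length 2048, factor 2.0, sequence length 4096). `dynamic_ntk_base` and `rope_inv_freq` are hypothetical helper names, not functions from transformers.

```python
def dynamic_ntk_base(base, dim, seq_len, max_len, factor):
    """Hypothetical helper: the base update from the dynamic NTK formula."""
    if seq_len <= max_len:
        return base  # within the trained length nothing changes
    scale = factor * seq_len / max_len - (factor - 1)
    return base * scale ** (dim / (dim - 2))


def rope_inv_freq(base, dim):
    """inv_freq_i = 1 / base^(2i/dim) for i = 0 .. dim/2 - 1."""
    return [1.0 / base ** (2 * i / dim) for i in range(dim // 2)]


dim, max_len, factor = 128, 2048, 2.0
same_base = dynamic_ntk_base(10000.0, dim, 2048, max_len, factor)
new_base = dynamic_ntk_base(10000.0, dim, 4096, max_len, factor)

old_freq = rope_inv_freq(10000.0, dim)
new_freq = rope_inv_freq(new_base, dim)
print(same_base, new_base)
print(old_freq[0], new_freq[0])    # highest frequency
print(old_freq[-1], new_freq[-1])  # lowest frequency
```

With these values the term in parentheses is 3. The highest frequency (i = 0) is untouched, while the exponent dim / (dim - 2) makes the lowest frequency shrink by exactly that factor of 3. In between, the change grows smoothly, which matches the idea of extrapolating at short range and interpolating at long range.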
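The repeat step before the attention computation is easier to picture with a toy example. The sketch below assumes the copies of each key/value head are laid out next to each other (repeat-interleave order), and uses strings in place of tensors. `repeat_kv_lists` is a stand-in I made up for illustration, not the library's repeat_kv.

```python
def repeat_kv_lists(kv_heads, n_rep):
    """Hypothetical stand-in: repeat each key/value head n_rep times, copies adjacent."""
    return [head for head in kv_heads for _ in range(n_rep)]


num_heads = 8            # query heads
num_key_value_heads = 2  # key/value heads
num_key_value_groups = num_heads // num_key_value_heads

kv_heads = ["kv0", "kv1"]
expanded = repeat_kv_lists(kv_heads, num_key_value_groups)

# After the repeat there is one key/value head per query head.
for query_head, kv_head in enumerate(expanded):
    print(query_head, kv_head)
```

Under this layout, query head h reads the key/value head with index h // num_key_value_groups, so here query heads 0 to 3 share kv0 and query heads 4 to 7 share kv1.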
