## Fine-Tuning BERT Models

In Chapter 1, Getting Started with the Model Architecture of the Transformer, we defined the building blocks of the architecture of the original Transformer. Think of the original Transformer as a model built with LEGO® bricks. The construction set contains bricks such as encoders, decoders, embedding layers, positional encoding methods, multi-head attention layers, masked multi-head attention layers, post-layer normalization, feed-forward sub-layers, and linear output layers. The bricks come in various sizes and forms. You can spend hours building all sorts of models using the same building kit! Some constructions will only require some of the bricks. Other constructions will add a new piece, just like when we obtain additional bricks for a model built using LEGO® components.
BERT added a new piece to the Transformer building kit: a bidirectional multi-head attention sub-layer. When we humans have trouble understanding a sentence, we do not just look at the past words. BERT, like us, looks at all the words in the same sentence at the same time.

In this chapter, we will first explore the architecture of Bidirectional Encoder Representations from Transformers (BERT). BERT uses only the encoder blocks of the Transformer, in a novel way, and does not use the decoder stack.
Then we will fine-tune a pretrained BERT model. The BERT model we will fine-tune was trained by a third party and uploaded to Hugging Face. Transformers can be pretrained; a pretrained BERT, for example, can then be fine-tuned on several NLP tasks. We will go through this fascinating experience of downstream Transformer usage using Hugging Face modules.
This chapter covers the following topics:

• Bidirectional Encoder Representations from Transformers (BERT)
• The architecture of BERT
• The two-step BERT framework
• Preparing the pretraining environment
• Defining pretraining encoder layers
• Defining fine-tuning
• Downstream multitasking
• Building a fine-tuning BERT model
• Loading the acceptability judgment dataset
• Creating attention masks
• BERT model configuration
• Measuring the performance of the fine-tuned model

Our first step will be to explore the background of the Transformer.

## The encoder stack

The first building block we will take from the original Transformer model is an encoder layer. The encoder layer as described in Chapter 1, Getting Started with the Model Architecture of the Transformer, is shown in Figure 2.1:

The BERT model does not use decoder layers: it has an encoder stack but no decoder stack. The masked tokens (the hidden tokens that the model must predict) are handled in the attention layers of the encoder, as we will see when we zoom into a BERT encoder layer in the following sections.
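The bidirectional attention described in the introduction can be contrasted with the masked (causal) attention of a decoder. The following is a minimal sketch, not BERT's actual implementation: the function name `visibility_mask` and the example sentence are hypothetical, and the matrix only shows which positions may attend to which. This visibility pattern is separate from the masked-token prediction mentioned above.

```python
from typing import List


def visibility_mask(seq_len: int, bidirectional: bool) -> List[List[int]]:
    """Return a seq_len x seq_len matrix of 0/1 flags.

    Entry [i][j] is 1 when position i may attend to position j.
    (Hypothetical helper for illustration only.)
    """
    return [
        [1 if bidirectional or j <= i else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Illustrative sentence; any token list works.
tokens = ["the", "cat", "sat", "down"]

# Decoder-style: each token sees itself and the tokens before it.
causal = visibility_mask(len(tokens), bidirectional=False)
# Encoder-style (as in BERT): each token sees every token in the sentence.
full = visibility_mask(len(tokens), bidirectional=True)

for name, mask in (("causal", causal), ("bidirectional", full)):
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>5}: {row}")
```

In the causal case the matrix is lower triangular, so a token never sees the words that follow it. In the bidirectional case every entry is 1, which is what lets an encoder-only model use both left and right context.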

The original Transformer contains a stack of $N=6$ layers. Its model dimension is $d_{\text {model }}=512$, and it uses $A=8$ attention heads. The dimension of each head of the original Transformer is:
$$d_{k}=\frac{d_{\text {model }}}{A}=\frac{512}{8}=64$$
BERT encoder layers are larger than those of the original Transformer model.
Two BERT models can be built with the encoder layers:

• $\text{BERT}_{\text{BASE}}$, which contains a stack of $N=12$ encoder layers. $d_{\text {model }}=768$, which can also be expressed as $H=768$, as in the BERT paper. A multi-head attention sub-layer contains $A=12$ heads. The dimension of each head $z_{A}$ remains 64, as in the original Transformer model:
$$d_{k}=\frac{d_{\text {model }}}{A}=\frac{768}{12}=64$$
The output of each multi-head attention sub-layer before concatenation will be the output of the 12 heads:
$$\text { output_multi-head_attention }=\left\{z_{0}, z_{1}, z_{2}, \ldots, z_{11}\right\}$$

• $\text{BERT}_{\text{LARGE}}$, which contains a stack of $N=24$ encoder layers. $d_{\text {model }}=1024$. A multi-head attention sub-layer contains $A=16$ heads. The dimension of each head $z_{A}$ also remains 64, as in the original Transformer model:
$$d_{k}=\frac{d_{\text {model }}}{A}=\frac{1024}{16}=64$$
The output of each multi-head attention sub-layer before concatenation will be the output of the 16 heads:
$$\text { output_multi-head_attention }=\left\{z_{0}, z_{1}, z_{2}, \ldots, z_{15}\right\}$$
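The head-dimension arithmetic above can be checked with a few lines of Python. This is a small sketch under stated assumptions: the `EncoderConfig` class and its field names are hypothetical and are not taken from any library, and the layer counts for the two BERT models ($N=12$ and $N=24$) are the values reported in the BERT paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Hypothetical container for the sizes discussed in the text."""

    name: str
    num_layers: int  # N
    d_model: int     # written H in the BERT paper
    num_heads: int   # A

    @property
    def head_dim(self) -> int:
        # d_k = d_model / A; the split across heads must be exact.
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        return self.d_model // self.num_heads


CONFIGS = (
    EncoderConfig("original Transformer", num_layers=6, d_model=512, num_heads=8),
    EncoderConfig("BERT_BASE", num_layers=12, d_model=768, num_heads=12),
    EncoderConfig("BERT_LARGE", num_layers=24, d_model=1024, num_heads=16),
)

for cfg in CONFIGS:
    # Concatenating A heads of size d_k restores the model dimension.
    assert cfg.num_heads * cfg.head_dim == cfg.d_model
    print(
        f"{cfg.name}: N={cfg.num_layers}, d_model={cfg.d_model}, "
        f"A={cfg.num_heads}, d_k={cfg.head_dim}"
    )
```

The larger BERT models grow by adding layers and heads, while the size of each individual head stays the same as in the original Transformer.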


