## Preparing the pretraining input environment

The BERT model has no decoder stack of layers. As such, it does not have a masked multi-head attention sub-layer. The BERT authors go further and argue that a masked multi-head attention layer, which hides the rest of the sequence, impedes the attention process.
A masked multi-head attention layer masks all of the tokens that are beyond the present position. For example, take the following sentence:
The cat sat on it because it was a nice rug.
If we have just reached the first word “it,” the input the model sees could be:
The cat sat on it <masked sequence>
The motivation of this approach is to prevent the model from seeing the output it is supposed to predict. This left-to-right approach produces relatively good results.
However, the model cannot learn much this way. To know what “it” refers to, we need to see the whole sentence to reach the word “rug” and figure out that “it” was the rug.
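The left-to-right restriction described above can be sketched in a few lines. This is only an illustration: the whitespace split stands in for a real tokenizer, and `causal_mask` and `visible_tokens` are hypothetical helper names, not part of any library.

```python
# Illustrative only: a whitespace split, not BERT's actual tokenizer.
tokens = "The cat sat on it because it was a nice rug .".split()

def causal_mask(n):
    """Row i marks which positions token i may attend to (1 = visible)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def visible_tokens(tokens, position):
    """Tokens a left-to-right model can see when it is at `position`."""
    mask_row = causal_mask(len(tokens))[position]
    return [tok for tok, keep in zip(tokens, mask_row) if keep]

first_it = tokens.index("it")          # position of the first "it"
seen = visible_tokens(tokens, first_it)
print(seen)                            # everything up to and including "it"
print("rug" in seen)                   # "rug" comes later, so it is hidden
```

At the first "it," the word "rug" is not among the visible tokens, which is why the model has no way to resolve the reference at that point.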
The authors of BERT came up with an idea. Why not pretrain the model to make predictions using a different approach?

Masked language modeling does not require training a model with a sequence of visible words followed by a masked sequence to predict.

BERT introduces bidirectional analysis of a sentence: a randomly chosen word is masked, and the model uses the context on both sides of it to predict that word.

A potential input sequence could be:
“The cat sat on it because it was a nice rug.”
A decoder would mask the rest of the sequence once the model reached the first word “it”:
“The cat sat on it <masked sequence>.”
But the BERT encoder masks a random token to make a prediction:
“The cat sat on it [MASK] it was a nice rug.”
The multi-head attention sub-layer can now see the whole sequence, run the self-attention process, and predict the masked token.
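By contrast with left-to-right masking, the encoder-style input hides one token and leaves every other position visible. The sketch below illustrates this under simple assumptions: whitespace tokenization, and the hypothetical helpers `mask_one_token` and `bidirectional_mask`, which are not library functions.

```python
# Illustrative only: a whitespace split, not BERT's actual tokenizer.
tokens = "The cat sat on it because it was a nice rug .".split()

def mask_one_token(tokens, position, mask_token="[MASK]"):
    """Return a copy of `tokens` with one position hidden, plus the label to predict."""
    masked = list(tokens)
    label = masked[position]
    masked[position] = mask_token
    return masked, label

def bidirectional_mask(n):
    """Every position may attend to every other position (1 = visible)."""
    return [[1] * n for _ in range(n)]

position = tokens.index("because")
masked, label = mask_one_token(tokens, position)
attention = bidirectional_mask(len(masked))

print(" ".join(masked))   # the sentence with "because" replaced by [MASK]
print(label)              # the token the model must recover
```

Because the attention pattern is all ones, the position holding `[MASK]` can draw on "rug" to its right as well as "cat" to its left when predicting the hidden token.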
The input tokens were masked in a tricky way, which forces the model to train longer but produces better results. In the original BERT recipe, about $15\%$ of the token positions are selected for prediction, and each selected token is handled with one of three methods:

• Surprise the model by leaving the selected token unchanged $10\%$ of the time; for example:
“The cat sat on it [because] it was a nice rug.”
• Surprise the model by replacing the selected token with a random token $10\%$ of the time; for example:
“The cat sat on it [often] it was a nice rug.”
• Replace the selected token with a [MASK] token $80\%$ of the time; for example:
“The cat sat on it [MASK] it was a nice rug.”
The authors’ bold approach avoids overfitting and forces the model to train efficiently.
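The three masking methods can be sketched as a small function. This is a minimal illustration, not the authors' implementation: it assumes the original BERT recipe in which about 15% of positions are selected and the 80/10/10 split applies to those selected positions, and the names `mask_tokens`, `select_prob`, and the toy `vocab` are hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Apply an 80/10/10 masking scheme and return (inputs, labels).

    labels[i] holds the original token where a prediction is required,
    and None everywhere else.
    """
    if rng is None:
        rng = random.Random()
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                       # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

tokens = "The cat sat on it because it was a nice rug .".split()
vocab = sorted(set(tokens) | {"often", "dog", "blue"})  # toy vocabulary
inputs, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(inputs)
print(labels)
```

Note that the unchanged case still records a label, so the model is asked to predict a token even when the input looks untouched. That is what keeps it from relying on `[MASK]` as the only signal that a prediction is needed.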
BERT was also trained to perform next sentence prediction.


