## Next sentence prediction

The second task used to pretrain BERT is Next Sentence Prediction (NSP). The input contains two sentences, and the model predicts whether the second sentence follows the first.

Two new tokens were added:

• [CLS] is a binary classification token added to the beginning of the first sequence to predict whether the second sequence follows the first. A positive sample is usually a pair of consecutive sentences taken from a dataset. A negative sample is created using sequences from different documents.
• [SEP] is a separation token that signals the end of a sequence.

For example, the input sentences taken out of a book could be:

"The cat slept on the rug. It likes sleeping all day."

These two sentences would become one complete input sequence:

[CLS] the cat slept on the rug [SEP] it likes sleep ##ing all day [SEP]

This approach requires additional encoding information to distinguish sequence A from sequence B.
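The packing step and the extra segment information can be sketched in a few lines of plain Python. This is a minimal illustration, not BERT's actual data pipeline: the function names `pack_pair` and `sample_nsp_pair` are hypothetical, the sentences are assumed to be tokenized already, and the 50/50 split between positive and negative samples follows the original BERT paper rather than anything stated in this section.

```python
import random

CLS, SEP = "[CLS]", "[SEP]"

def pack_pair(tokens_a, tokens_b):
    """Join two tokenized sentences into one BERT-style input sequence."""
    tokens = [CLS] + tokens_a + [SEP] + tokens_b + [SEP]
    # Segment 0 covers [CLS], sentence A, and its [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

def sample_nsp_pair(documents, rng):
    """Pick (sentence_a, sentence_b, is_next) from a list of documents.

    Assumes at least two documents, each holding at least two sentences.
    """
    doc_index = rng.randrange(len(documents))
    doc = documents[doc_index]
    i = rng.randrange(len(doc) - 1)
    if rng.random() < 0.5:
        # Positive sample: two consecutive sentences from the same document.
        return doc[i], doc[i + 1], True
    # Negative sample: the second sentence comes from a different document.
    other = rng.choice([d for j, d in enumerate(documents) if j != doc_index])
    return doc[i], rng.choice(other), False

tokens_a = ["the", "cat", "slept", "on", "the", "rug"]
tokens_b = ["it", "likes", "sleep", "##ing", "all", "day"]
tokens, segment_ids = pack_pair(tokens_a, tokens_b)
print(" ".join(tokens))
print(segment_ids)
```

The segment IDs are what later lets the model tell sequence A from sequence B, since the token sequence alone is a single flat list.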
If we put the whole embedding process together, we obtain:

The input embeddings are obtained by summing the token embeddings, the segment (sentence, phrase, word) embeddings, and the positional encoding embeddings.
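As a rough sketch of that sum, the snippet below adds three lookup vectors element by element for each position. The table sizes, the random initial values, and the names `make_table` and `input_embeddings` are assumptions made for illustration; a real model uses trained tables and a much larger dimension.

```python
import random

def make_table(rows, dim, rng):
    """Build a toy embedding table: one random vector of size dim per row."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]

def input_embeddings(token_ids, segment_ids, token_table, segment_table, position_table):
    """Sum token, segment, and position vectors for every input position."""
    embedded = []
    for position, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vector = [
            t + s + p
            for t, s, p in zip(token_table[tok], segment_table[seg], position_table[position])
        ]
        embedded.append(vector)
    return embedded

rng = random.Random(0)
dim = 4                                   # toy size, chosen only for readability
token_table = make_table(10, dim, rng)    # hypothetical vocabulary of 10 tokens
segment_table = make_table(2, dim, rng)   # segment A and segment B
position_table = make_table(8, dim, rng)  # up to 8 positions

token_ids = [1, 5, 2]
segment_ids = [0, 0, 1]
embedded = input_embeddings(token_ids, segment_ids, token_table, segment_table, position_table)
print(len(embedded), len(embedded[0]))
```

Because the three vectors are summed rather than concatenated, the result keeps the same dimension as each individual embedding.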
The input embedding and positional encoding sub-layer of a BERT model can be summed up as follows:

• A sequence of words is broken down into WordPiece tokens.
• A [MASK] token randomly replaces some of the initial word tokens for masked language modeling training.
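The masking step can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `mask_tokens` and the `mask_prob` parameter are hypothetical, and every selected token is replaced by [MASK]. The original BERT paper selects about 15% of the tokens and sometimes keeps the selected token or swaps in a random one instead, which this sketch leaves out.

```python
import random

def mask_tokens(tokens, mask_prob, rng, special=("[CLS]", "[SEP]")):
    """Randomly replace tokens with [MASK]; return masked tokens and labels.

    labels[i] holds the original token where a mask was applied, else None.
    """
    masked, labels = [], []
    for token in tokens:
        if token not in special and rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(token)   # the model is trained to recover this token
        else:
            masked.append(token)
            labels.append(None)    # position is not part of the MLM loss
    return masked, labels

tokens = ["[CLS]", "the", "cat", "slept", "on", "the", "rug", "[SEP]"]
masked, labels = mask_tokens(tokens, mask_prob=0.15, rng=random.Random(0))
print(masked)
```

Special tokens are skipped so that [CLS] and [SEP] keep their structural role in every training sample.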

## Pretraining and fine-tuning a BERT model

BERT is a two-step framework. The first step is pretraining, and the second is fine-tuning, as shown in Figure 2.4.

Training a transformer model can take hours, if not days. It takes quite some time to engineer the architecture and parameters, and to select the proper datasets to train a transformer model.

Pretraining is the first step of the BERT framework that can be broken down into two sub-steps:

• Defining the model’s architecture: number of layers, number of heads, dimensions, and the other building blocks of the model
• Training the model on Masked Language Modeling (MLM) and NSP tasks
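The first sub-step, defining the architecture, amounts to fixing a handful of hyperparameters. The dataclass below is a hypothetical container, not a class from any library; its default values are the published BERT BASE settings (12 layers, 12 attention heads, hidden size 768), with the maximum sequence length and vocabulary size taken from the released English uncased model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BertConfigSketch:
    """Hypothetical holder for the main architecture choices of a BERT model."""
    num_layers: int = 12       # encoder layers
    num_heads: int = 12        # attention heads per layer
    hidden_size: int = 768     # model dimension
    max_positions: int = 512   # longest input sequence
    vocab_size: int = 30522    # WordPiece vocabulary size

    def head_size(self):
        """Dimension handled by each attention head."""
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads

base = BertConfigSketch()
large = BertConfigSketch(num_layers=24, num_heads=16, hidden_size=1024)
print(base.head_size(), large.head_size())
```

Both configurations give each head the same dimension, so the larger model grows by adding layers and heads rather than by widening individual heads.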

The second step of the BERT framework is fine-tuning, which can also be broken down into two sub-steps:

• Initializing the downstream model chosen with the trained parameters of the pretrained BERT model
• Fine-tuning the parameters for specific downstream tasks such as Recognizing Textual Entailment (RTE), Question Answering (SQuAD v1.1, SQuAD v2.0), and Situations With Adversarial Generations (SWAG)
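The initialization sub-step can be pictured as copying the pretrained parameters and attaching a fresh task-specific layer. The sketch below is deliberately schematic: parameters are stored as a plain dictionary of flat float lists, and the names `init_downstream`, `classifier.weight`, and `classifier.bias` are hypothetical rather than taken from any library.

```python
import random

def init_downstream(pretrained, num_labels, hidden_size, rng):
    """Copy pretrained parameters and add a new, randomly initialized classifier."""
    # Copy every pretrained parameter so fine-tuning does not alter the original.
    params = {name: list(values) for name, values in pretrained.items()}
    # The task head has no pretrained counterpart, so it starts from small random values.
    params["classifier.weight"] = [
        [rng.gauss(0.0, 0.02) for _ in range(hidden_size)] for _ in range(num_labels)
    ]
    params["classifier.bias"] = [0.0] * num_labels
    return params

# Toy stand-in for the parameters of a pretrained encoder.
pretrained = {"encoder.layer0": [0.1, -0.2, 0.3, 0.4]}
params = init_downstream(pretrained, num_labels=2, hidden_size=4, rng=random.Random(0))
print(sorted(params))
```

During fine-tuning, both the copied parameters and the new head are then updated on the downstream task's labeled data.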

In this section, we covered the information we need to fine-tune a BERT model. In the following chapters, we will explore the topics we brought up in this section in more depth:

• In Chapter 3, Pretraining a RoBERTa Model from Scratch, we will pretrain a BERT-like model from scratch in 15 steps. We will even compile our own data, train a tokenizer, and then train the model. The goal of this chapter is to first go through the specific building blocks of BERT and then fine-tune an existing model.
• In Chapter 4, Downstream NLP Tasks with Transformers, we will go through many downstream NLP tasks, exploring GLUE, SQuAD v1.1, SQuAD, SWAG, BLEU, and several other NLP evaluation datasets. We will run several downstream transformer models to illustrate key tasks. The goal of this chapter is to fine-tune a downstream model.
• In Chapter 6, Text Generation with OpenAI GPT-2 and GPT-3 Models, we will explore the architecture and usage of OpenAI GPT, GPT-2, and GPT-3 transformers. BERT BASE was configured to be close to OpenAI GPT to show that it produced better performance. However, OpenAI transformers keep evolving too! We will see how.

In this chapter, the BERT model we will fine-tune will be trained on The Corpus of Linguistic Acceptability (CoLA). The downstream task is based on Neural Network Acceptability Judgments by Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman.


