## Computer Systems Architecture | Experimental Results

For our experiments we used a benchmark suite of synthetic taskgraphs [14] with 36000 performance-optimal schedules. The suite can be categorized by the number of PUs (2, 4, 8, 16 and 32), the number of tasks (7-12, 13-18 and 19-24), the edge density and length, and the node and edge weights. The schedules were generated with a PDS (Pruned Depth-first Search) algorithm. To find optimal solutions in acceptable time, the search space is reduced by pruning selected paths in the search tree. Because the scheduling problem is NP-hard, no optimal schedule could be found for some taskgraphs even after weeks of computation, and these taskgraphs are excluded from this study. As shown in Sect. 5.2, our system model closely reflects the real system in terms of energy consumption, so we used it to simulate nearly 34500 of the given schedules using the RUPS system. We evaluate the trade-off between $PE$, $FT$ and $E$ in four scenarios, one for each of the four strategies from Sect. 3. These scenarios reflect system setups in which one of the three criteria is inherently dominant. This choice yields a wide range of experiments that covers the extreme corner cases and everything between them. We used the following scenarios in our simulation. The turbo frequency is not considered, to avoid throttling effects:
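The PDS idea can be illustrated with a generic branch-and-bound scheduler. The following is a minimal sketch and not the PDS implementation used to generate the benchmark. It assumes homogeneous PUs, positive integer node weights and zero communication costs, so the benchmark's edge weights are ignored. The function name `optimal_makespan` is hypothetical. Each search step appends one ready task to one PU, and a branch is pruned as soon as a lower bound on its makespan reaches the best complete schedule found so far.

```python
from typing import Dict, List, Tuple


def optimal_makespan(weights: Dict[str, int],
                     edges: List[Tuple[str, str]],
                     num_pus: int) -> int:
    """Minimum makespan of a task graph (DAG) via pruned depth-first search.

    Simplifying assumptions (not taken from the study): homogeneous PUs,
    positive integer task weights, zero communication cost on edges.
    """
    preds: Dict[str, List[str]] = {t: [] for t in weights}
    succs: Dict[str, List[str]] = {t: [] for t in weights}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)

    # Bottom level: longest path from a task to an exit task, including itself.
    blevel: Dict[str, int] = {}

    def bl(t: str) -> int:
        if t not in blevel:
            blevel[t] = weights[t] + max((bl(s) for s in succs[t]), default=0)
        return blevel[t]

    total_work = sum(weights.values())
    best = total_work            # running everything on one PU is always feasible
    finish: Dict[str, int] = {}  # finish time of each scheduled task
    pu_free = [0] * num_pus      # time at which each PU becomes idle

    def ready_tasks() -> List[str]:
        return [t for t in weights
                if t not in finish and all(p in finish for p in preds[t])]

    def data_ready(t: str) -> int:
        return max((finish[p] for p in preds[t]), default=0)

    def lower_bound(current: int) -> float:
        lb = max(current, total_work / num_pus)
        for t in ready_tasks():
            lb = max(lb, data_ready(t) + bl(t))  # remaining critical path
        return lb

    def dfs(current: int) -> None:
        nonlocal best
        if len(finish) == len(weights):
            best = min(best, current)
            return
        if lower_bound(current) >= best:
            return  # prune: this branch cannot beat the best schedule found
        for t in ready_tasks():
            tried_empty_pu = False
            for pu in range(num_pus):
                if pu_free[pu] == 0:      # empty PUs are interchangeable
                    if tried_empty_pu:
                        continue
                    tried_empty_pu = True
                start = max(pu_free[pu], data_ready(t))
                previous = pu_free[pu]
                pu_free[pu] = finish[t] = start + weights[t]
                dfs(max(current, finish[t]))
                del finish[t]
                pu_free[pu] = previous

    dfs(0)
    return best
```

A tighter lower bound prunes more branches. Trying only one empty PU per step removes symmetric copies of the same schedule without losing optimality.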
(A) Strategy 1: Use only DDs and start with the highest supported frequency ($3.5\,\mathrm{GHz}$). This scenario focuses on $PE$.
(B) Strategy 2: Use Ds and DDs and start with the highest supported frequency ($3.5\,\mathrm{GHz}$). This scenario mainly targets $PE$, but also $FT$.
(C) Strategy 3: Create the schedules with a simple list scheduler that uses half of the PUs for the original tasks and the other half for the Ds, and start with the highest supported frequency ($3.5\,\mathrm{GHz}$). Here the focus is on $FT$.
(D) Strategy 4: Select a lower frequency for the original tasks, starting at frequency level 7 ($2.3\,\mathrm{GHz}$). This scenario focuses on $E$.
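Strategy C's schedule construction could look roughly like the sketch below. It is hypothetical because the source gives neither the list scheduler's priority function nor the placement rule for a D task. The sketch therefore makes several assumptions: bottom-level priority, earliest-start PU selection within the first half of the PUs, and that each D (read here as a duplicate of an original task) mirrors its original's time slot on the partner PU in the second half. It also assumes homogeneous PUs, zero communication costs and an acyclic graph. The name `half_and_half_list_schedule` is illustrative.

```python
from typing import Dict, List, Tuple

Slot = Tuple[int, int, int]  # (pu, start, end)


def half_and_half_list_schedule(weights: Dict[str, int],
                                edges: List[Tuple[str, str]],
                                num_pus: int) -> Tuple[Dict[str, Slot], Dict[str, Slot]]:
    """Hypothetical strategy-C style schedule: originals on PUs 0..p/2-1,
    a mirrored copy (D) of each original on PU i + p/2."""
    if num_pus < 2 or num_pus % 2:
        raise ValueError("num_pus must be even and at least 2")
    half = num_pus // 2
    preds: Dict[str, List[str]] = {t: [] for t in weights}
    succs: Dict[str, List[str]] = {t: [] for t in weights}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)

    blevel: Dict[str, int] = {}

    def bl(t: str) -> int:
        # Priority: longest path from t to an exit task, including t itself.
        if t not in blevel:
            blevel[t] = weights[t] + max((bl(s) for s in succs[t]), default=0)
        return blevel[t]

    pu_free = [0] * half
    originals: Dict[str, Slot] = {}
    while len(originals) < len(weights):
        ready = [t for t in weights
                 if t not in originals and all(p in originals for p in preds[t])]
        task = min(ready, key=lambda t: (-bl(t), t))  # highest priority, then by name
        data_ready = max((originals[p][2] for p in preds[task]), default=0)
        # Pick the PU in the first half where the task can start earliest.
        pu = min(range(half), key=lambda i: (max(pu_free[i], data_ready), i))
        start = max(pu_free[pu], data_ready)
        originals[task] = (pu, start, start + weights[task])
        pu_free[pu] = start + weights[task]

    # Each duplicate replays its original's time slot on the partner PU.
    duplicates = {t: (pu + half, s, e) for t, (pu, s, e) in originals.items()}
    return originals, duplicates
```

Mirroring keeps the sketch simple. The trade-off is visible in the structure: only half of the PUs contribute to the makespan of the original tasks, which matches the scenario's focus on $FT$ over $PE$.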

To visualize the trade-off between $PE$, $FT$ and $E$, the results of the four strategies are expressed relative to the following estimated upper and lower bounds for each criterion (see Table 3). Here $m$ is the makespan in cycles, $m_{seq}$ is the makespan when all tasks run in sequence, and $m_{ft}$ is the makespan in case of a failure. $p_{\max} \in PU$ is the maximum number of PUs used, and $f_{\text{highest}}$ and $f_{\text{lowest}}$ are the highest and lowest frequency, respectively.
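This excerpt does not reproduce Table 3 or the exact normalization. The following is therefore only an assumed linear mapping: a measured value is placed between its estimated lower and upper bound, with 0 at the lower bound and 1 at the upper bound. The helper `relative_position` is hypothetical.

```python
def relative_position(value: float, lower: float, upper: float) -> float:
    """Map value onto [0, 1] relative to estimated bounds (0 = lower, 1 = upper).

    Assumption: a linear min-max mapping; the normalization actually used
    with Table 3 is not given in this excerpt.
    """
    if upper <= lower:
        raise ValueError("upper bound must exceed lower bound")
    ratio = (value - lower) / (upper - lower)
    return min(1.0, max(0.0, ratio))  # clip estimates that fall outside the bounds
```

For example, a value of 150 between bounds of 100 and 300 maps to 0.25. Whether a higher position is better depends on the criterion. For a makespan, for instance, closer to the lower bound is better.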

Altera provides a softcore, the Nios II [5], for Altera FPGAs. The Nios II RISC architecture implements a 32-bit instruction set similar to the MIPS instruction set architecture. Although Nios II represents a different design point from Lipsi, it is interesting to note that Nios II can be customized to meet the application requirements. Three different cores are available [5]. The Fast core is optimized for high performance, the Standard core balances performance and size, and the Economy core is optimized for smallest size. The smallest core can be implemented in fewer than 700 logic elements (LEs). It is a sequential implementation, and each instruction takes at least 6 clock cycles. Lipsi, by contrast, is a smaller (8-bit) accumulator-based architecture, and most of its instructions execute in two clock cycles.
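The following rough calculation uses only the cycle counts quoted above. It assumes that every Lipsi instruction takes two cycles, although the text says this holds only for most instructions. Under that assumption, executing $N$ instructions at the same clock frequency costs:

$$\frac{\text{cycles}_{\text{Economy}}}{\text{cycles}_{\text{Lipsi}}} \ge \frac{6N}{2N} = 3$$

This compares instruction counts, not work done. A 32-bit Nios II instruction generally accomplishes more than an 8-bit accumulator instruction, so the ratio should not be read as a speed comparison of the two cores.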

PicoBlaze is an 8-bit microcontroller for Xilinx FPGAs [6]. The processor is highly optimized for low resource usage. This optimization results in restrictions such as a maximum program size of 1024 instructions and 64 bytes of data memory. The benefit of this minimalist design is a processor that can be implemented with one on-chip memory block and 96 logic slices in a Spartan-3 FPGA. PicoBlaze provides sixteen 8-bit registers and executes one instruction in two clock cycles. The interface to I/O devices is minimalistic in the positive sense: connecting simple I/O devices to the processor is straightforward and very efficient.
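The PicoBlaze limits quoted above translate directly into address widths, because indexing $n$ locations needs $\lceil \log_2 n \rceil$ bits. The short sketch below uses a hypothetical helper, `address_bits`, to compute the minimum width for each limit. It says nothing about PicoBlaze's actual instruction encoding.

```python
def address_bits(entries: int) -> int:
    """Minimum number of address bits needed to index `entries` locations."""
    if entries < 1:
        raise ValueError("need at least one entry")
    return (entries - 1).bit_length()


# Limits quoted in the text for PicoBlaze
for name, entries in [("program memory (instructions)", 1024),
                      ("data memory (bytes)", 64),
                      ("registers", 16)]:
    print(f"{name}: {entries} entries -> {address_bits(entries)}-bit address")
```

This yields 10, 6 and 4 bits, respectively. These small address fields are consistent with fitting the processor into very few resources.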

Like PicoBlaze, Lipsi aims to provide a small processor for utility functions. Lipsi is optimized to balance resource usage between on-chip memory and logic cells. As a result, its LE count is slightly lower than that of PicoBlaze. PicoBlaze is coded at a very low level of abstraction using Xilinx primitive components such as LUT4 or MUXCY. The design is therefore optimized for Xilinx FPGAs and is practically not portable. Lipsi, in contrast, is written in vendor-agnostic Chisel and compiles unmodified for both Altera and Xilinx devices.

The SpartanMC is a small microcontroller optimized for FPGA technology [7]. One interesting feature is that both the instruction width and the data width are 18 bits. The rationale is that current FPGAs contain on-chip memory blocks that are 18 bits wide, with the extra bits originally intended for parity protection. The processor is a 16-register RISC architecture with two-operand instructions and is implemented in a three-stage pipeline. To avoid data forwarding within the register file, the instruction fetch and the write-back stage are split into two phases, as in the original MIPS pipeline [8]. This decision slightly complicates the design, because two phase-shifted clocks are needed. We assume that this phase splitting also limits the maximum clock frequency. Because on-chip memories are large relative to what a register file needs, SpartanMC uses the spare capacity for a sliding register window that speeds up function calls. SpartanMC performs comparably to the 32-bit RISC processors LEONII [9] and MicroBlaze [10] on the Dhrystone benchmark.
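The toy model below makes the sliding-register-window idea concrete. All of its parameters are assumptions and do not describe SpartanMC's actual configuration:

- 16 visible registers.
- A window that slides by 8 registers per call. The caller's upper eight registers become the callee's lower eight, which can pass arguments.
- A 256-entry backing memory.
- Values masked to the 18-bit data width mentioned above.

Overflow simply raises an error instead of spilling to main memory. The class name `SlidingRegisterWindow` is illustrative.

```python
class SlidingRegisterWindow:
    """Toy sliding register window; all sizes are illustrative assumptions."""

    DATA_MASK = (1 << 18) - 1  # keep values within an 18-bit data word

    def __init__(self, physical_size: int = 256, visible: int = 16, slide: int = 8) -> None:
        self.mem = [0] * physical_size  # large on-chip register memory
        self.visible = visible          # architectural registers r0..r15
        self.slide = slide              # registers the window moves per call
        self.base = 0                   # physical index of the current r0

    def _index(self, reg: int) -> int:
        if not 0 <= reg < self.visible:
            raise IndexError(f"r{reg} is not an architectural register")
        return self.base + reg

    def read(self, reg: int) -> int:
        return self.mem[self._index(reg)]

    def write(self, reg: int, value: int) -> None:
        self.mem[self._index(reg)] = value & self.DATA_MASK

    def call(self) -> None:
        # With the defaults, the callee's r0..r7 alias the caller's r8..r15.
        if self.base + self.slide + self.visible > len(self.mem):
            raise OverflowError("window overflow: registers would have to be spilled")
        self.base += self.slide

    def ret(self) -> None:
        if self.base < self.slide:
            raise RuntimeError("return without a matching call")
        self.base -= self.slide
```

Because a call only moves `base`, no registers are copied. In this model, the cost of deep call chains appears only when the window runs out of backing memory.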


