# SPipe: Hybrid GPU and CPU Pipeline for Training LLMs under Memory Pressure

Junyeol Ryu\*† (University of Wisconsin-Madison, junyeol.ryu@wisc.edu); Jinpyo Kim (Seoul National University, jinpyo@aces.snu.ac.kr); Yujin Jeong\*† (Samsung Research, yujin@aces.snu.ac.kr); Heehoon Kim<sup>†</sup> (Moreh Inc., heehoon.kim@moreh.io); Daeyoung Park (Seoul National University, daeyoung@aces.snu.ac.kr); Jaejin Lee (Seoul National University, jaejin@snu.ac.kr)

https://champ.snu.ac.kr

#### **ABSTRACT**

Training large language models (LLMs) with limited computing resources is challenging because of their immense memory space requirements. In this paper, we focus on scenarios where the aggregate GPU memory is insufficient to store all model states, and explore pipeline parallelism and offloading across all system resources to train the model. In this context, SPipe presents a hybrid GPU and CPU pipelining mechanism that consists of two pipelines: a GPU pipeline to reduce the bubbles in conventional pipeline parallelism and a GPU-CPU pipeline to alleviate data transfer overhead and CPU bottlenecks in offloading data and computation. We evaluate SPipe for training LLMs of various sizes with diverse configurations in practice. The results indicate that SPipe outperforms the state of the art by 1.26×.

## 1 INTRODUCTION

Large language models (LLMs) [4, 16, 41, 48, 58] have scaled dramatically to trillions of parameters and are very successful in various downstream tasks. However, such an overwhelming number of parameters requires a large memory space during training. State-of-the-art models, such as LLaMA [28, 50, 51] and OPT [56], require a memory footprint on a terabyte scale. They are typically trained with a supercomputer-scale cluster in a data center [13, 42, 49]. The cost and resources required to train an LLM pose a serious challenge for many academic institutions and startups because they typically rely on small GPU clusters or small-scale cloud services.
For example, a 0.1 trillion parameter model requires 1.83 terabytes to store its states during training [43], which far exceeds the aggregate GPU memory of a small GPU cluster with a few nodes. Thus, developing a technique that efficiently trains large models with limited resources can significantly broaden the accessibility of LLMs. A practical approach to mitigating the memory requirement is scale-out techniques, such as model parallelism [13, 42, 49], to distribute model training across multiple GPUs. Among others, pipeline parallelism [8, 13] partitions the model into different stages and assigns the stages to GPUs. A mini-batch is divided into smaller micro-batches and executed across the pipeline stages. It requires only peer-to-peer communication to transfer activations between GPUs, thereby minimizing communication overhead. However, it also introduces inefficiencies due to GPU idle times, referred to as pipeline bubbles. It may lead to significant system under-utilization and necessitate sophisticated pipeline scheduling to reduce them [21, 24, 31–33]. Another widely-used approach to alleviating the memory requirement is offloading [2, 11, 12, 15, 20, 23, 35, 46, 47]. The memory capacity is extended to non-GPU memory (e.g., the CPU main memory) to allow larger model training. Only the minimum amount of data required for the current operation is fetched and placed in the GPU memory (e.g., layer parameters are fetched on demand just before the computation for the layer). After performing the operation, the fetched data are freed, and newly generated data by the operation are offloaded to the non-GPU memory (e.g., gradients of the layer are stored in the CPU memory after the layer's backward pass). Recent approaches even offload some computational tasks (e.g., optimizer steps) to the CPU to further exploit heterogeneous resources [9, 27, 46]. However, these approaches introduce data transfer overhead between the GPU and CPU. 
In addition, the low computational capacity of the CPU may become a performance bottleneck.

To this end, this paper proposes SPipe, a hybrid GPU-CPU pipeline for training LLMs under memory pressure. We specifically focus on scenarios where the aggregate GPU memory is insufficient to store all model states, and explore the use of model parallelism and offloading together across all system resources to train the model *by any means*. In this context, SPipe offers an efficient solution through two pipelines: the *GPU pipeline* and the *GPU-CPU pipeline*. The GPU pipeline reduces the bubbles introduced in conventional pipeline parallelism. The GPU-CPU pipeline hides the data transfer overhead between the CPU and GPUs and alleviates the performance bottleneck caused by the slower CPU when offloading data and computation.

SPipe's GPU pipeline presents a *decoupled pass assignment*, which assigns the forward and backward passes of the same stage to different GPUs for better pipeline scheduling. This mechanism is facilitated by storing the model parameters in the CPU's shared memory (shmem) and exploiting activation recomputation [6, 14, 18, 19]. Moreover, SPipe introduces fine-grained stage partitioning to further eliminate the bubbles caused by the gap in execution time between the forward and backward passes, and optimizes the communication schedule for activation checkpoints to hide the additional communication overhead between the GPUs.

<sup>\*</sup>Equal contribution. <sup>†</sup>Work done while at Seoul National University.

SPipe's GPU-CPU pipeline presents an asynchronous CPU optimizer, which executes the optimizer steps in parallel with the GPU pipeline, thereby overlapping and hiding the CPU optimizer overhead. This mechanism is enabled by bypassing optimizer synchronization [10, 39, 40] and shifting the numerical validation to a post-step process while guaranteeing correctness through a rollback mechanism.

Overall, this paper makes the following contributions:

- We propose SPipe, a hybrid GPU-CPU pipelining mechanism that efficiently leverages offloading and achieves high utilization of both GPUs and the CPU when training LLMs with insufficient aggregate GPU memory.
- We compare SPipe against state-of-the-art offloading-based LLM training mechanisms (Mobius [9], Megatron [33], and DeepSpeed [30]) on multi-node clusters by training LLaMA-2 models [51]. SPipe outperforms these methods by 1.26×, 1.31×, and 4.13× on average, respectively.
- We will make SPipe publicly available after publication to foster research and expand the accessibility of LLMs.

#### 2 BACKGROUND AND RELATED WORK

This section introduces pipeline parallelism and its techniques to train large models under GPU memory pressure.

# 2.1 Pipeline Parallelism

Pipeline parallelism is a type of model parallelism [13, 42, 49] that trains large models on multiple GPUs. It partitions a model into sequential groups of layers called *stages* and assigns the stages to GPUs. It divides a mini-batch into smaller micro-batches and executes them in a pipelined manner across these stages.

Suppose a model is partitioned into $I$ stages, and a mini-batch is divided into $J$ micro-batches. We denote the $i$th stage as $S_i$ and the $j$th micro-batch as $m_j$. $f_i^j$ and $b_i^j$ denote the $i$th stage's forward and backward passes on the $j$th micro-batch, respectively. For convenience, we denote the set of forward passes $f_i^j$ for all $m_j$ as $f_i$. Similarly, $b_i$ denotes the set of backward passes $b_i^j$ for all $m_j$.

Many prior studies have extensively optimized pipeline scheduling [1, 8, 9, 13, 21, 24, 31–33, 44, 57]. Figure 1(a) illustrates an AFAB (all forward, all backward) schedule of GPipe [13] with four GPUs, four stages, and four micro-batches. It first pipelines the forward passes of all micro-batches, followed by the backward passes of all micro-batches.
DAPPLE [8] in Figure 1(b) presents a 1F1B (one forward, one backward) schedule, where each GPU alternates between forward and backward passes of different micro-batches.

A pipeline schedule often leaves GPUs idle. We call these idle periods *bubbles*. Minimizing the share of time lost to bubbles in the pipeline, or the *bubble ratio*, is critical for pipelining efficiency. One can simply inject more micro-batches into the pipeline to reduce the bubble ratio. However, increasing the number of micro-batches (e.g., to more than 4× the number of GPUs, as suggested in [13]) may introduce inefficiency for two reasons. One is that models typically have a practical upper limit on the mini-batch size, beyond which convergence is negatively affected [3, 7, 53–55]. The other is that increasing the micro-batch count reduces the micro-batch size for a given mini-batch size, compromising GPU computational efficiency [21].

Figure 1: Different pipeline schedules with four GPUs.

**Coupled pass assignment.** On the other hand, reducing the bubbles themselves is challenging. In both GPipe and DAPPLE, the same GPU is responsible for both the forward and backward passes $f_i^j$ and $b_i^j$ of the same stage $S_i$ for micro-batch $m_j$. This creates bubbles at the beginning of the backward pass, as the backward pass proceeds through the stages in the reverse order of the forward pass. We define the assignment of the same GPU to both $f_i^j$ and $b_i^j$ as *coupled pass assignment*. Almost all existing pipelines adhere to the coupled pass assignment for three key reasons. First, both $f_i^j$ and $b_i^j$ use $S_i$'s parameters $\Psi_i$. Second, $b_i^j$ reuses the activations $a_i^j$ generated by $f_i^j$. Finally and more critically, $\Psi_i$ and $a_i^j$ are stored in the GPU memory.

This paper reexamines the coupled pass assignment in offloading scenarios, where pipeline parallelism uses non-GPU memory to alleviate GPU memory pressure. It focuses on opportunities to reduce bubbles, improving the bubble ratio to increase performance.
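
To make the notion of bubble ratio concrete, the following is a minimal sketch that simulates an AFAB schedule with coupled pass assignment and measures the fraction of time each GPU is idle. It is an illustration under simplifying assumptions, not a model of any specific system: one stage per GPU, identical per-stage forward and backward times (the hypothetical parameters `t_fwd` and `t_bwd`), zero communication cost, backward passes processed in the same micro-batch order as forward passes, and no optimizer time. The function name `afab_bubble_ratio` is likewise hypothetical.

```python
def afab_bubble_ratio(num_stages, num_microbatches, t_fwd=1.0, t_bwd=2.0):
    """Idle fraction per GPU for an all-forward-all-backward schedule.

    Assumes one stage per GPU, uniform pass times, and free communication.
    """
    p, m = num_stages, num_microbatches

    # fwd_end[i][j]: time at which stage i finishes the forward pass of micro-batch j.
    fwd_end = [[0.0] * m for _ in range(p)]
    for i in range(p):
        for j in range(m):
            input_ready = fwd_end[i - 1][j] if i > 0 else 0.0  # activations from stage i-1
            gpu_free = fwd_end[i][j - 1] if j > 0 else 0.0     # previous micro-batch done
            fwd_end[i][j] = max(input_ready, gpu_free) + t_fwd

    # bwd_end[i][j]: time at which stage i finishes the backward pass of micro-batch j.
    bwd_end = [[0.0] * m for _ in range(p)]
    for i in reversed(range(p)):
        for j in range(m):
            grad_ready = bwd_end[i + 1][j] if i < p - 1 else 0.0  # gradients from stage i+1
            # AFAB: a GPU starts its backward passes only after all its forward passes.
            gpu_free = bwd_end[i][j - 1] if j > 0 else fwd_end[i][m - 1]
            bwd_end[i][j] = max(grad_ready, gpu_free) + t_bwd

    makespan = bwd_end[0][m - 1]       # stage 0 finishes last
    busy = m * (t_fwd + t_bwd)         # useful work per GPU
    return 1.0 - busy / makespan


if __name__ == "__main__":
    for micro_batches in (4, 8, 16):
        print(micro_batches, afab_bubble_ratio(4, micro_batches))
```

Under these assumptions, the simulated value agrees with the familiar closed form $(p-1)/(m+p-1)$ for $p$ stages and $m$ micro-batches, which is $3/7$ for four stages and four micro-batches. The ratio shrinks as more micro-batches are injected, which is the trade-off discussed above. This count covers only forward and backward bubbles, so it is not directly comparable to ratios that also include CPU optimizer time.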
# 2.2 Pipeline Parallelism with Offloading

As model sizes continue to grow, the increased memory space requirement results in GPU memory pressure. As a remedy, pipeline parallelism can leverage memory-efficient techniques, such as offloading and activation recomputation.

Offloading [2, 11, 12, 15, 20, 23, 35, 46, 47] is a technique to use non-GPU memory (e.g., the CPU main memory) to store model states (e.g., parameters, gradients, and optimizer states) and residual states (e.g., activations) during training.

Activation recomputation [6, 14, 18, 19] reduces activation memory usage by recomputing the activations in the backward pass instead of storing them in the forward pass and keeping them until the backward pass. Only a subset of activations, or *checkpoint*, is stored in the forward pass and used to recompute all activations before gradient computation in the backward pass. Large models such as Turing-NLG 17.2B and GPT-3 175B were trained using it [43].

Figure 2: Mobius pipeline schedule with four GPUs. Two colors distinguish each forward/backward pass to indicate that it belongs to a different stage assigned to the same GPU. For example, $GPU_0$'s forward passes on stages $S_0$ and $S_4$ are colored blue and sky blue. $p_i$ and $g_i$ denote the CPU-to-GPU parameter transfer and the GPU-to-CPU gradient transfer of stage $S_i$, respectively. $o_i$ denotes the optimizer step of stage $S_i$ executed on the CPU.

Mobius [9] is a state-of-the-art pipelining mechanism that uses offloading. It introduces an *interleaved* AFAB schedule, in which the stages of GPipe are further subdivided into smaller stages to reduce the memory space requirements of each stage. Along with assigning multiple stages per GPU, all data are stored in the CPU memory, with only the minimum amount of data required for the current stage fetched and placed in the GPU memory.
Their key idea for minimizing the overhead of accessing non-GPU memory is to prefetch the data required for the next stage in an overlapped manner with the computation of the current stage. In addition, it exploits activation recomputation when training large models to reduce the data transfer overhead. Consider Figure 2, which illustrates the pipeline schedule of Mobius with 4 GPUs, 8 stages, and 4 micro-batches. Stage $(S_0, S_4)$ , $(S_1, S_5)$ , $(S_2, S_6)$ , and $(S_3, S_7)$ are mapped to $GPU_0$ , $GPU_1$ , $GPU_2$ , and $GPU_3$ , respectively. Mobius first pipelines the forward passes of all micro-batches $(f_{0-3})$ for the first stage in each GPU $(S_{0-3})$ , followed by that $(f_{4-7})$ for the second stage in each GPU $(S_{4-7})$ . Then, it pipelines the backward passes of all micro-batches $(b_{4-7})$ for the second stage in each GPU $(S_{4-7})$ , followed by that $(b_{0-3})$ for the first stage in each GPU $(S_{0-3})$ . Mobius stores all stages in the CPU memory. Hence, it transfers a copy of stage's parameters from the CPU memory to GPU memory before executing it, and frees this copy after finishing the stages' execution on all micro-batches. We denote the CPU to GPU transfer of stage $S_i$ 's parameters copy as $p_i$ . Similarly, it transfers a stage's gradients, accumulated across all micro-batches, from the GPU memory to CPU memory after finishing the stage's backward passes on all micro-batches. We denote the GPU to CPU transfer of $S_i$ 's gradients as $g_i$ . We assume training with activation recomputation, so activations are not transferred between GPU and CPU in Figure 2. Optimizer steps, with $o_i$ denoting that for the $S_i$ 's parameters in the CPU memory, are processed by the CPU after the pipeline flush, as explained in detail in Section 2.3. **Decoupled pass assignment.** Mobius also adheres to the coupled pass assignment (Section 2.1). In Mobius, both $f_i^j$ and $b_i^j$ use $S_i$ 's parameters $\Psi_i$ . 
However, with activation recomputation, $b_i^j$ does not reuse the activations $a_i^j$ generated by $f_i^j$ but instead recomputes them during the backward pass. More critically, $\Psi_i$ and $a_i^j$ (if it exists) are not stored in the GPU memory but in the CPU memory. In a nutshell, pipeline parallelism with offloading indicates that the forward and backward passes for the same stage no longer need to be assigned to the same GPU but can be *decoupled*. Based on this observation, we investigate a new mechanism to improve the bubble ratio for pipeline parallelism with offloading. # 2.3 Hybrid GPU-CPU Training Pipeline parallelism with offloading stores optimizer states in non-GPU memory, along with parameters and gradients. Mobius leverages a CPU-based optimizer, similar to DeepSpeed CPU Adam [46], to update parameters directly on the CPU. Such a mechanism to exploit both GPUs and the CPU is called hybrid GPU-CPU training [26, 27]. Executing optimizer steps on the CPU is crucial when training under GPU memory pressure, as optimizer states are often significantly larger than other model states [22, 27, 43]. For instance, in mixed-precision training [29] with Adam [17], the memory space required for optimizer states is ×8 that of the parameters [43]. Consider the green squares of Figure 2, which illustrates Mobius's optimizer steps on the CPU. It introduces inefficiencies because they begin synchronously across all GPUs and do not overlap with the forward and backward pass execution on the GPU. Such inefficiencies arise because conventional mixed-precision training requires synchronization of overflow in the gradients before the optimizer step. Figure 3 describes its detailed mechanism. The FP16 gradients transferred from the GPU to the CPU are first converted into FP32 (Line 1), unscaled, and checked for overflow (Line 2). The results of each stage's gradients are synchronized across all stages (Line 3). 
If overflow is detected at any stage, an invalid loss scale was used during that training iteration. Thus, the optimizer skips the parameter update. Otherwise, all gradients are used to update the parameters, ensuring numerical stability (Lines 4-7). Thus, the explicit synchronization of overflow prevents Mobius from processing the optimizer steps of different stages asynchronously.

Unfortunately, the optimizer steps can consume a substantial amount of time on the CPU. Figure 4 compares the time taken by the GPU's computation and the CPU's optimizer steps during Mobius's training, with a breakdown into the functions in Figure 3. Although these functions primarily rely on element-wise operations with low computational intensity, they can add significant idle times to GPUs when not overlapped with the GPU's computation. As a result, GPUs and CPUs cannot achieve high utilization simultaneously, limiting the benefits of hybrid GPU-CPU training.

Figure 3: Optimizer step of mixed-precision training.

Figure 4: Optimizer step breakdown.

Building on this insight, we further explore pipelining the optimizer steps on the CPU with the execution on the GPU, thereby improving the utilization of both and mitigating the overhead of the CPU optimizer steps.

## 2.4 Related Work

**Different pipeline schedules.** A key objective of pipeline schedules is to reduce the bubble ratio. PipeDream [31] skips periodic pipeline flushes and injects more micro-batches into the pipeline to achieve an almost zero bubble ratio. However, this requires updating the parameters after each micro-batch's backward pass and storing additional versions to ensure parameter consistency between the forward and backward passes of the same micro-batch. Bidirectional pipelines, such as Chimera [21] and MixPipe [57], operate two pipelines in opposite directions to reduce bubbles, but this requires duplicating parameters across each pair of GPUs.
Interleaved-stage approaches, like Megatron [33] and Hanayo [24], partition a model to assign multiple stages per GPU, reducing the bubble time while increasing the amount of communication. The recent ZBPP [39, 40] splits the backward pass into two parts, activation gradient and parameter gradient computation, to fill the bubbles at the cost of higher activation memory usage.

**Leveraging heterogeneous devices.** Existing proposals support offloading model states [38, 46] and residual states [2, 11, 20, 23, 45, 47] to non-GPU memory only for training with a single GPU. Among these, ZeRO-Offload [46] offloads optimizer states to the CPU memory and executes optimizer steps on the CPU. ZeRO-Infinity [43] and Mobius [9] extend this to support fully sharded data parallelism [42] and pipeline parallelism [13], respectively. ZeRO-Offload++ [52] and Deep Optimizer States [27] perform optimizer steps on both the GPU and CPU.

# 3 THE DESIGN OF SPIPE

This section describes the pipelining mechanism of SPipe. It consists of two pipelines: a *GPU pipeline* and a *GPU-CPU pipeline*. The GPU pipeline's key idea is to assign the forward and backward passes of the same stage to different GPUs (*decoupled pass assignment*) for better pipeline scheduling. The GPU-CPU pipeline assigns the optimizer steps to the CPU and executes them in parallel with the GPU pipeline for better utilization of heterogeneous resources.

Figure 5: SPipe GPU pipeline optimizations: (a) pass assignment in an ordinary pipeline, (b) decoupled pass assignment, and (c) fine-grained backward stage partitioning. For simplicity, we omit the data transfer time between GPUs. Arrows depict dependence between the passes.

## 3.1 GPU pipeline

In ordinary pipelining mechanisms, a micro-batch's forward and backward passes are assigned together to the same GPU.
Figure 5(a) gives the pipeline diagram of a typical training pipeline with two model stages, $S_0$ and $S_1$, where the forward passes ($f_0$ and $f_1$) and backward passes ($b_0$ and $b_1$) of $S_0$ and $S_1$ are mapped to two different GPUs, $GPU_0$ and $GPU_1$, respectively. Suppose that we have two micro-batches in a mini-batch. At time $t_1$, for micro-batch $m_0$, $GPU_0$ waits for $GPU_1$ to finish $b_1^0$ because the gradients computed by $b_1^0$ are necessary to proceed with $b_0^0$. Unfortunately, $GPU_1$ executes $f_1$ for all micro-batches first, executes $b_1^0$, and finally transfers the gradients $GPU_0$ waits for at time $t_2$. As a result, $GPU_0$ remains idle from $t_1$ to $t_2$.

**Decoupled pass assignment.** SPipe's GPU pipeline is based on the observation that the pipeline bubbles can be reduced if a single micro-batch's forward and backward passes for the same stage are assigned to different GPUs. Figure 5(b) is an example of a decoupled pass assignment. While $f_0$ and $f_1$ are mapped to $GPU_0$ and $GPU_1$, respectively, $b_0$ and $b_1$ are mapped to $GPU_1$ and $GPU_0$, respectively. As $GPU_0$ is assigned $b_1$ instead of $b_0$, $b_1^0$ can start immediately on $GPU_0$ at $t_2$. We see fewer bubbles in Figure 5(b) than in the ordinary pipeline in Figure 5(a).

**Fetching parameters from non-GPU memory.** With the decoupled approach, each GPU has to store all parameters for the stages assigned to it. For example, both $GPU_0$ and $GPU_1$ have to store the parameters $\Psi_0$ and $\Psi_1$ of both stages $S_0$ and $S_1$ in their memory. However, when parameters are offloaded to non-GPU memory (e.g., the CPU memory), which is a common setting when training large models under GPU memory pressure, the same parameters can be fetched to different GPUs without permanently storing them redundantly in different GPUs' memory.
Moreover, SPipe mitigates the overhead of fetching parameters from non-GPU memory by prefetching in an overlapped manner with GPU computation. SPipe makes a GPU only keep its memory spaces for the current computation and the parameters being prefetched instead of storing all parameters for the stages assigned to the GPU. For example, consider Figure 5(b). At $t_0$ , $GPU_0$ starts to fetch $\Psi_1$ required for $b_1$ . Similarly, $GPU_1$ starts to fetch $\Psi_0$ at $t_1$ required for $b_0$ . $GPU_0$ also frees $\Psi_0$ as soon as $f_0$ finishes at $t_2$ . **Activation recomputation.** A problem with the decoupled pass assignment in Figure 5(b) is that for all micro-batches, all activations of $f_1$ generated at $GPU_1$ have to be transferred to $GPU_0$ to perform $b_1$ and vice versa for $f_0$ . To solve this problem, we adopt activation recomputation [6, 14, 18, 19]. Activation recomputation is a common setting when training large models under GPU memory pressure. For example, at time $t_2$ in Figure 5(b), activation recomputation allows only the activation checkpoint of $S_1$ to be transferred from $GPU_1$ to $GPU_0$ to perform $b_1^0$ instead of all activation tensors of $S_1$ including intermediate activation tensors generated from performing $f_1^0$ . Fine-grained backward stage partitioning. However, as shown in Figure 5(b), when the execution times of the forward and backward passes differ, it incurs pipeline bubbles. Based on this observation, we decompose backward stages into finer granularity to minimize the bubbles by balancing the execution times. Specifically, the backward pass $b_i$ for model stage $S_i$ can be decomposed into d backward passes, as shown in Equation 1 so that each $b_{i,k}$ computes the backward pass of $|S_i|/d$ transformer blocks: $$b_i = b_{i,0} \circ b_{i,1} \circ \cdots \circ b_{i,d-1}. 
\tag{1}$$ As $b_{i,k}$ only requires a portion of $S_i$ 's parameters for its computation (i.e., 1/d of $\Psi_i$ ), we denote such portion as $\Psi_{i,k}$ . Figure 5(c) is the result of decomposing the backward stage in Figure 5(b) into two fine-grained backward stages. Similar to Figure 5(b), $f_0$ and $f_1$ are mapped to $GPU_0$ and $GPU_1$ , respectively. However, $b_{1,1}$ and $b_{0,1}$ are mapped to $GPU_0$ , and $b_{1,0}$ and $b_{0,0}$ are mapped to $GPU_1$ . At $t_3$ , $GPU_1$ can start $b_{1,0}^0$ immediately, reducing bubbles. Note that $GPU_0$ starts to fetch $\Psi_{1,1}$ and $\Psi_{0,1}$ at $t_0$ and $t_2$ , respectively. $GPU_1$ starts to fetch $\Psi_{1,0}$ and $\Psi_{0,0}$ at $t_1$ and $t_3$ , respectively. Another effect of finer-grained backward stages is that it reduces the GPU memory usage by 1/d at the cost of increasing the number of activation checkpoint transfers by up to $\times d$ . However, the number of activation checkpoint transfers does not strictly scale by factors of d. This is because, while the forward pass $f_i$ on $S_i$ generates activation checkpoints for $S_{i,0},\cdots,S_{i,d-1}$ required for the backward passes $b_{i,0},\cdots,b_{i,d-1}$ , those activation checkpoints for the backward passes mapped to the same GPU as $f_i$ do not need to be transferred. For example, Figure 5(c) requires the same number of activation checkpoint transfers as in Figure 5(b) because while $b_{1,1}$ and $b_{0,0}$ require activation checkpoint transfers from another GPU, $b_{1,0}$ and $b_{0,1}$ do not. Compared to the reduced bubbles and GPU memory savings, this results in marginal communication cost, which will be further optimized next. Figure 6: Asynchronous communication of checkpoints generated by the forward pass of stage $S_0$ for micro-batch $m_0$ from Figure 5(c). 
Red and orange circles are the checkpoints generated by the forward pass of stage $S_0$ for micro-batch $m_0$ , required by the backward pass of fine-grained stages $S_{0,0}$ and $S_{0,1}$ for $m_0$ , respectively. Only the red circle is sent asynchronously from $GPU_0$ to $GPU_1$ at $t_0$ and used at $t_3$ . Asynchronous checkpoint communication. We denote the activation checkpoint required for the recomputation during $b_{i,k}^j$ as $c_{i,k}^j$ . Checkpoints $c_{i,0}^j,\cdots,c_{i,d-1}^j$ are generated during $f_i^j$ . Each backward stage requires a checkpoint, while not all of them should be sent from other GPUs. For example, Figure 6 focuses on the relationship between $f_0^0$ and $b_{0,0}^0$ , $b_{0,1}^0$ of Figure 5(c). $GPU_0$ performs $f_0^0$ , which is micro-batch $m_0$ 's forward pass on $S_0$ . $f_0^0$ generates checkpoints $c_{0,0}^0$ and $c_{0,1}^0$ , which are used during the recomputation of $b_{0,0}^0$ and $b_{0,1}^0$ , respectively. $c_{0,0}^0$ and $c_{0,1}^0$ are depicted as red and orange circles, respectively. As $b_{0,1}^0$ is also assigned to the same $GPU_0$ , $c_{0,1}^0$ needs to be saved only on the memory of $GPU_0$ until it is used at time $t_2$ . However, $c_{0,0}^0$ should be sent from $GPU_0$ to $GPU_1$ before it is used at time $t_3$ . A naïve way to transfer $c_{0,0}^0$ from $GPU_0$ to $GPU_1$ would be to send and receive at time $t_3$ immediately before it is used for recomputation. In such a case, checkpoint communication lies on the critical path of the GPU pipeline along with the activation and gradient communication, adding a significant communication overhead. Instead, SPipe transfers $c_{0,0}^0$ as soon as possible (at $t_0$ ) after $c_{0,0}^0$ is ready, overlapping its transfer with independent computations. This mechanism is enabled through asynchronous communication, which allows data transfer between GPUs to be initiated without waiting for completion. 
Hence, $GPU_1$ can use $c_{0,0}^0$ during $b_{0,0}^0$ without waiting, as $t_3-t_0$ provides sufficient time for the checkpoint to arrive.

# 3.2 GPU-CPU pipeline

As explained in Section 2.3, in hybrid GPU-CPU training [26, 27], CPU optimizer steps do not overlap with GPU computations, limiting its benefits. Figure 7(a) illustrates such a case on top of SPipe's optimized pipeline with decoupled pass assignment and fine-grained backward stage partitioning, using the same setting as Section 3.1. At time $t_0$, $GPU_0$ finishes $b_{1,1}$ for all micro-batches and offloads the accumulated gradients to the CPU memory. The same process occurs for $b_{0,1}$ on $GPU_0$ at $t_2$, and for $b_{1,0}$ and $b_{0,0}$ on $GPU_1$ at $t_1$ and $t_3$, respectively. Then, all offloaded gradients are validated for numerical stability, and the results are synchronized at $t_3$. If no overflows are found, the CPU proceeds by updating the parameters of all stages. The GPUs remain idle until all optimizer steps are complete at $t_4$ to use the updated parameters for the next iteration.

Figure 7: SPipe GPU-CPU pipeline optimization. $o_{i,k}$ denotes the optimizer step of stage $S_{i,k}$. For simplicity, we omit the data transfer time between GPUs and between the GPU and CPU. Arrows depict dependence between the passes.

**Asynchronous CPU optimizer.** SPipe's optimizer shifts numerical validations to a post-step process. Hence, a stage's CPU optimizer step can proceed as soon as its GPU backward passes are complete and the gradients are offloaded. SPipe overlaps the optimizer step of a stage on the CPU with the subsequent stage's backward passes on the GPU to reduce GPU idle times. At the same time, correctness is ensured through a rollback mechanism following the post-step synchronization.

Consider Figure 7(b), which illustrates the CPU optimizer steps of SPipe.
While $GPU_0$ offloads the gradients of $b_{1,1}$ at $t_0$ similar to Figure 7(a), the CPU immediately executes stage $S_{1,1}$ 's optimizer step $o_{1,1}$ at $t_0$ as it bypasses synchronization. The similar process is repeated for stage $S_{1,0}$ , $S_{0,1}$ , and $S_{0,0}$ at $t_1$ , $t_2$ , and $t_3$ , respectively, as if the stages' backward passes on the GPUs and the optimizer step on the CPU were pipelined. At $t_4$ , when all parameter updates are complete, the gradients are finally checked for overflows, and the results are synchronized. If any overflow is detected, SPipe performs a rollback of the updated parameters of all stages. Bypassing and rollback mechanism. SPipe pipelines the CPU optimizer steps altogether with the GPU's computation by shifting the numerical validations after the parameter updates. Each stage performs its own local validation (i.e., checking gradient overflows) without waiting for the synchronization of results across all stages. Each stage executes its optimizer step based on its own validation results. Synchronization finally occurs when all stages have completed their optimizer steps. If any stage fails its local validation, parameter updates of all stages are rolled back. Optimizers, such as Adam [17] and AdamW [25], facilitate rollback without additional memory overhead because their parameter update steps are arithmetically reversible. While these rollbacks introduce some overhead compared to conventional pre-step validation, invalidations are rare during training and, therefore, have minimal impact on the overall training time [39, 40]. The CPU optimizer pipelining in Figure 7(b) shows an ideal scenario where the optimizer of each stage starts after the preceding stage has been completed. In practice, the optimizer for a subsequent stage may start before the previous stage has finished. Thus, each optimizer stage is processed in parallel on the CPU using multi-threading. 
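
The bypass-and-rollback flow described above can be sketched in a few lines. This is a minimal illustration, not SPipe's implementation: it uses plain Python lists instead of tensors, processes stages sequentially rather than in parallel threads, omits loss-scale unscaling, and assumes that a stage whose local validation fails skips its own update. All names (`AdamState`, `adam_step`, `adam_rollback`, `run_iteration`) and hyperparameter defaults are hypothetical. The reversal recovers the previous state only up to floating-point rounding, and it needs the gradients to still be available at rollback time.

```python
import math
from dataclasses import dataclass, field


@dataclass
class AdamState:
    params: list
    m: list = field(default_factory=list)  # first-moment estimates
    v: list = field(default_factory=list)  # second-moment estimates
    step: int = 0

    def __post_init__(self):
        if not self.m:
            self.m = [0.0] * len(self.params)
        if not self.v:
            self.v = [0.0] * len(self.params)


def has_overflow(grads):
    """Local validation: True if any gradient is inf or NaN."""
    return any(not math.isfinite(g) for g in grads)


def adam_step(state, grads, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    state.step += 1
    c1, c2 = 1 - b1 ** state.step, 1 - b2 ** state.step  # bias corrections
    for i, g in enumerate(grads):
        state.m[i] = b1 * state.m[i] + (1 - b1) * g
        state.v[i] = b2 * state.v[i] + (1 - b2) * g * g
        state.params[i] -= lr * (state.m[i] / c1) / (math.sqrt(state.v[i] / c2) + eps)


def adam_rollback(state, grads, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Undo the most recent adam_step, given the same gradients."""
    c1, c2 = 1 - b1 ** state.step, 1 - b2 ** state.step
    for i, g in enumerate(grads):
        # Restore the parameter first, while m and v still hold post-step values.
        state.params[i] += lr * (state.m[i] / c1) / (math.sqrt(state.v[i] / c2) + eps)
        state.m[i] = (state.m[i] - (1 - b1) * g) / b1
        state.v[i] = max(0.0, (state.v[i] - (1 - b2) * g * g) / b2)  # guard rounding
    state.step -= 1


def run_iteration(stages, grads_per_stage):
    """Update each stage as its gradients arrive; validate globally afterwards."""
    updated, any_overflow = [], False
    for state, grads in zip(stages, grads_per_stage):
        if has_overflow(grads):      # local check only, no cross-stage wait
            any_overflow = True
            continue
        adam_step(state, grads)
        updated.append((state, grads))
    if any_overflow:                 # post-step synchronization result
        for state, grads in updated:
            adam_rollback(state, grads)
    return not any_overflow
```

In this sketch, `run_iteration` returns `False` when any stage reports an overflow, in which case every stage that already applied its update is restored to its pre-step state, so the iteration behaves as if the update had been skipped.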
Furthermore, for efficient pipelining, maximizing the overlap between the CPU optimizer step and the backward passes is crucial. Thus, selecting an appropriate number of micro-batches is also important to ensure that each stage's backward passes can sufficiently hide the CPU optimizer step latency. # 3.3 SPipe Overall Figure 8 illustrates the overall pipeline of SPipe with 4 GPUs, 4 micro-batches, 8 forward stages, and 16 backward stages. Comparing SPipe and Mobius in Figure 8 and Figure 2 with identical training settings, their bubble ratios are 25% and 47%, respectively, showing the benefit of SPipe. There are two major reasons for this: (1) pipeline bubbles at the beginning of the backward passes are eliminated, and (2) CPU optimizer steps are overlapped with GPU computation. These improvements stem from two key insights of SPipe on hybrid GPU-CPU pipelining: (1) the coupled pass assignment is unnecessary when pipelining with offloading and activation recomputation, and (2) CPU optimizer steps can operate in parallel by bypassing numerical validation. ## 4 IMPLEMENTATION SPipe is built on top of Megatron-LM [34] by modifying its pipeline schedule, implementing offloading, and integrating CPU optimizer. Pipeline schedule. SPipe implements its pipeline schedule using separate CUDA streams for CPU-to-GPU data transfer, GPU computation, and GPU-to-CPU data transfer. Prefetching the next stage, performing forward/backward computations of the current stage, and offloading the gradients of the previous stage are thus processed in parallel. They are synchronized using CUDA events. Activation checkpoints are communicated asynchronously using P2P operations in PyTorch Distributed. A checkpoint communication schedule is built during initialization to ensure that sender and receiver GPUs call isend and irecv at matching timesteps, with the receiver later synchronizing on the returned handle. This schedule is cached and reused across all training iterations. **Offloading**. 
We allocate POSIX shared memory [37] for each node and adjust the tensor pointers to reference this shared memory. The offloaded parameters are physically shared and virtually mapped to the GPU processes assigned. Specifically, a GPU process responsible for the backward pass $b_i$ of stage $S_i$ allocates space for its parameters $\Psi_i$ in the shared memory, while another GPU process requiring $\Psi_i$ for its forward pass $f_i$ sets its pointers to the corresponding region in the shared memory. For multi-node training, Remote Direct Memory Access (RDMA) is used to fetch parameters stored in the shared memory of a remote node.

Figure 8: SPipe pipeline schedule with four GPUs. $b_{i,k}^j$ denotes the finer-grained $(di+k)$th backward stage's backward pass on the $j$th micro-batch. Note that a backward pass has an identical workload to a forward pass. Forward and backward passes are distinguished by 2 and 4 different colors, respectively, to indicate that each belongs to a different forward/backward stage assigned to the same GPU. For example, $GPU_0$'s forward passes on forward stages $S_0$ and $S_4$ are colored blue and sky blue, and its backward passes on backward stages $S_{7,1}$, $S_{5,1}$, $S_{3,1}$, and $S_{1,1}$ are colored light yellow, yellow, orange, and brown. $p_i$ (or $p_{i,k}$) denotes the CPU-to-GPU parameter transfer of forward stage $S_i$ (or backward stage $S_{i,k}$), and $g_{i,k}$ denotes the GPU-to-CPU gradient transfer of backward stage $S_{i,k}$. $o_{i,k}$ denotes the optimizer step of backward stage $S_{i,k}$ executed on the CPU.

**CPU optimizer**. We assign a separate CPU optimizer for each stage using C++ threading [5]. We pass a CUDA event that records the corresponding gradient offloading operation from the PyTorch main thread to the CPU optimizer thread. To ensure correctness, the optimizer thread waits for the event to complete before beginning the asynchronous optimizer step on the CPU.
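The handshake between the main thread and a per-stage optimizer thread can be sketched in plain Python. This is only an illustration under stated assumptions: `threading.Event` stands in for the CUDA event that records gradient offloading, a plain SGD update replaces Adam, and the names `optimizer_worker`, `offload_gradients`, and `run_stage` are hypothetical rather than taken from SPipe's C++ code.

```python
import threading


def optimizer_worker(params, cpu_grads, grads_ready, lr):
    """Per-stage CPU optimizer thread (hypothetical stand-in for a C++ thread)."""
    # Block until the main thread signals that gradient offloading has finished.
    grads_ready.wait()
    # Plain SGD update used as a placeholder for the CPU Adam step.
    for i, g in enumerate(cpu_grads):
        params[i] -= lr * g


def offload_gradients(cpu_grads, gpu_grads, grads_ready):
    """Main-thread side: copy gradients to the CPU buffer, then signal completion."""
    cpu_grads[:] = gpu_grads  # stands in for the asynchronous GPU-to-CPU copy
    grads_ready.set()         # stands in for the recorded CUDA event completing


def run_stage(params, gpu_grads, lr=0.5):
    cpu_grads = [0.0] * len(gpu_grads)
    grads_ready = threading.Event()
    worker = threading.Thread(
        target=optimizer_worker, args=(params, cpu_grads, grads_ready, lr)
    )
    worker.start()  # the optimizer thread starts early and waits on the event
    offload_gradients(cpu_grads, gpu_grads, grads_ready)
    worker.join()
    return params


if __name__ == "__main__":
    print(run_stage([5.0, 10.0], [10.0, 20.0]))
```

Starting the worker before the copy mirrors the intent of the design: the optimizer thread is already waiting when the gradients arrive, so the update never reads a partially written buffer.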
Each CPU optimizer thread uses a CPU-based Adam [17] implementation of DeepSpeed [30], and we modify it to include the post-step validation and rollback mechanism.

## 5 EVALUATION

In this section, we evaluate SPipe against existing approaches to train LLMs under memory pressure. We further examine the effectiveness of our optimizations and analyze the overheads of SPipe, providing insights into its trade-offs and performance benefits.

# 5.1 Evaluation Environment

System configurations. Table 1 describes the two hardware systems used in the evaluation: Cluster V100 and Cluster RTX 3090. Both clusters have eight nodes. Each node of Cluster V100 has four NVIDIA V100 32GB GPUs, an AMD EPYC 7452 CPU, 512GB DRAM, and an InfiniBand HDR NIC. Each node of Cluster RTX 3090 has four NVIDIA RTX 3090 24GB GPUs, an AMD EPYC 7502 CPU, 512GB DRAM, and an InfiniBand HDR NIC. Experiments are conducted on Cluster V100 unless stated otherwise.

Table 1: Node configuration of two eight-node clusters.

| Cluster | Cluster V100 | Cluster RTX 3090 |
|---------|--------------|------------------|
| M/B | ASRock ROMED8-2T | Supermicro H12DSG-O-CPU |
| CPU | 1 x AMD 32-core EPYC 7452 | 1 x AMD 32-core EPYC 7502 |
| DRAM | 8 x DDR4-2666 64GB | 8 x DDR4-3200 64GB |
| GPU | 4 x NVIDIA V100 32GB | 4 x NVIDIA RTX 3090 24GB |
| PCIe | 16 x Gen3 lanes per GPU | 16 x Gen4 lanes per GPU |
| NIC | 1 x Mellanox ConnectX-6 InfiniBand HDR | 1 x Mellanox ConnectX-6 InfiniBand HDR |
| S/W | PyTorch 2.4.1 + CUDA 12.4 | PyTorch 2.4.1 + CUDA 12.4 |

*Workloads*. We use LLaMA-2-based language models [51] of eight sizes: 10B, 19B, 30B, 40B, 52B, 69B, 88B, and 110B. Table 2 summarizes their configurations. Models are trained using a varying number of nodes to reflect differences in their size: the (10B, 19B), (30B, 40B), (52B, 69B), and (88B, 110B) models are evaluated using 1, 2, 4, and 8 nodes, respectively. The larger model in each pair represents the maximum model size trainable under our setup, which uses full model state offloading [43], mixed-precision training [29], and the Adam optimizer [17]. We train on OpenWebText [36], running five warmup iterations and averaging the subsequent five iterations. All experiments are conducted with activation recomputation [6, 14, 18, 19] because preserving all activations results in GPU out-of-memory (OOM) errors even for the smallest model.

Table 2: Configurations of the LLaMA-2 models used in the evaluation. Model sizes are on a scale of billion (B) parameters. Columns $l$, $d$, $d_{\rm FFN}$, # KV heads, and # Nodes represent the number of transformer layers, hidden dimension size, FFN layer's hidden dimension size, number of KV heads, and number of nodes used, respectively.

| Model Size | $l$ | $d$ | $d_{\rm FFN}$ | # KV heads | # Nodes |
|------------|-----|-------|--------|----|---|
| 10B | 48 | 4,096 | 10,880 | 2 | 1 |
| 19B | 48 | 5,632 | 14,976 | 4 | 1 |
| 30B | 96 | 5,120 | 13,632 | 4 | 2 |
| 40B | 96 | 5,888 | 15,680 | 4 | 2 |
| 52B | 96 | 6,656 | 17,728 | 8 | 4 |
| 69B | 96 | 7,680 | 20,480 | 8 | 4 |
| 88B | 192 | 6,144 | 16,384 | 16 | 8 |
| 110B | 192 | 6,912 | 18,432 | 16 | 8 |

Figure 9: Speedups of DeepSpeed, Megatron, and SPipe over Mobius on Cluster V100.

Figure 10: Speedups of DeepSpeed, Megatron, and SPipe over Mobius on Cluster RTX 3090.

Figure 11: Resource usage during an iteration when the model size is 19B and the sequence length is 1024.

**Model training configurations**. We vary model sizes, sequence lengths, batch sizes, and the number of stages to capture diverse configurations used in practice. In pipeline parallelism, the input mini-batch is divided into micro-batches. Increasing the size of micro-batches enhances GPU computational efficiency but is constrained by memory capacity. To address this, we scale the mini-batch size by increasing the number of micro-batches.
Our experiments explore various micro-batch sizes ($\mu$BS) and mini-batch sizes (MBS). Stage configuration determines how the model is partitioned and mapped to the GPUs; by default, each GPU is assigned two forward and six backward stages for SPipe.

**Comparison baselines**. Our baselines for comparison are DeepSpeed [30], Mobius [9], and Megatron [33]. DeepSpeed provides ZeRO-3 parallelism with offloading support [43]. Mobius is a state-of-the-art pipeline framework using an interleaved AFAB (all forward, all backward) schedule optimized for offloading all training states. Megatron is a widely used pipeline framework with an interleaved 1F1B (one forward, one backward) schedule, but it lacks offloading support, which results in GPU OOM even for our smallest workload, the 10B model. Thus, we extend it to support offloading. All four techniques (DeepSpeed, Mobius, Megatron, and SPipe) offload parameters, gradients, and optimizer states to the CPU memory, have access to all CPU cores for the optimizer steps, and exploit activation recomputation to minimize GPU memory usage. For Mobius and the offloading-extended Megatron, we use our own implementations due to the lack of public availability, and verify their completeness in the supplementary material. All experiments are conducted without adding any other parallelism strategies such as 3D parallelism [33, 59].

#### 5.2 Comparison

Figure 9 and Figure 10 show the speedups of DeepSpeed, Megatron, and SPipe over Mobius on Cluster V100 and Cluster RTX 3090, respectively. For this experiment, we train all models in Table 2, each on the number of nodes given in its # Nodes column, with a mini-batch size (MBS) of $16 \times$ # Nodes, a fixed micro-batch size ($\mu$BS) of two, and two sequence lengths (SEQ) of 1024 and 2048. Megatron and Mobius use two stages per GPU, while SPipe uses two forward and six backward stages per GPU.
Overall, SPipe achieves average speedups of **4.13**, **1.31**, and **1.26** on Cluster V100 and **4.26**, **1.22**, and **1.20** on Cluster RTX 3090 over DeepSpeed, Megatron, and Mobius, respectively. Pipeline-based methods show a large advantage over DeepSpeed, and among them, SPipe consistently outperforms both Mobius and Megatron, with performance gains varying by model size, sequence length, and batch size.

DeepSpeed shows inferior performance to pipeline-based methods because the collective communication required by ZeRO parallelism causes substantial congestion on PCIe links, which is further exacerbated by the GPU-CPU data transfers caused by offloading. In contrast, pipeline parallelism relies on peer-to-peer (P2P) communication, avoiding such congestion. In addition, we also observe memory inefficiency in DeepSpeed's implementation, as evidenced by CPU OOM in cases where other methods run successfully.

Figure 12: Effect of scaling micro-batch size ($\mu$BS).

Figure 13: Speedup contribution breakdowns for various micro-batch sizes ($\mu$BS) with the 30B model. +GPU and +CPU denote the respective incremental contributions from the GPU pipeline and CPU optimizer.

Among pipeline-based methods, SPipe outperforms Mobius and Megatron in all cases. The AFAB schedule of Mobius and SPipe outperforms the 1F1B schedule of Megatron by better hiding the offloading overhead. While Megatron closely matches Mobius's performance at the larger sequence length of 2048, SPipe continues to outperform both because of improved GPU and CPU efficiency. For instance, as shown in Figure 11 for the 19B model (SEQ=1024), SPipe's GPU pipeline completes earlier (SPipe: 7.97s; Mobius: 9.54s; Megatron: 11.07s), SPipe's CPU optimizer starts sooner (3.47s; 9.48s; 10.96s), and its peak GPU memory usage is lower (52%; 61%; 58%) than that of the others.
As discussed next, the performance gains from the GPU pipeline and GPU-CPU pipeline vary significantly with model size, sequence length, and batch size.

*Model sizes*. Comparing two model sizes per node configuration at the same sequence length, the overall improvements of SPipe remain consistent regardless of the model size. This is because, as the model grows, the time saved by reducing GPU pipeline bubbles and overlapping CPU optimizer steps increases proportionally with the total iteration time. This demonstrates that SPipe delivers consistent performance improvements across varying model sizes.

Figure 14: Effect of scaling mini-batch size (MBS).

*Sequence lengths*. In most cases, the speedup is larger for smaller sequence lengths. This is because the computation required for the forward/backward passes increases quadratically with sequence length while the optimizer step remains unaffected. As a result, the benefits of overlapping CPU optimizer steps diminish as sequence length increases. In contrast, the GPU pipeline speedup remains constant because the size of pipeline bubbles also increases quadratically with sequence length.

*Number of nodes and batch sizes*. Due to the nature of pipelining, the minimum required number of micro-batches increases with the number of nodes. An increase in batch size leads to longer GPU computation time and causes it to dominate the total iteration time. Hence, the benefits of overlapping CPU optimizer steps are less evident in the overall speedup. On the other hand, the speedup gained from reducing the pipeline bubble time remains constant because the pipeline depth also increases with the number of nodes along with the batch size, maintaining a steady bubble ratio.

# 5.3 Effect of the Batch Size

SPipe's GPU pipeline eliminates the bubbles in ordinary pipeline schedules. However, the performance gain from bubble reduction is sensitive to the batch size.
We conduct experiments in two scenarios: scaling the micro-batch size ($\mu$BS) and scaling the mini-batch size (MBS).

**Scaling micro-batch size**. When the MBS is fixed, increasing the $\mu$BS causes the computation required per micro-batch to grow linearly with the $\mu$BS. Consequently, the size of pipeline bubbles also increases. As a result, Mobius experiences larger bubbles and a higher bubble ratio, which increases the GPU pipeline speedup of SPipe, as SPipe effectively minimizes these bubbles. To analyze the individual effects of GPU pipeline bubble reduction and CPU optimizer step overlapping, we break down the total speedup into the GPU pipeline speedup and the CPU optimizer speedup.

Figure 12 shows the results of $\mu$BS scaling. As the $\mu$BS increases through 1, 2, 4, and 8, the total speedup of SPipe over Mobius also increases, averaging 1.14, 1.17, 1.21, and 1.26, respectively. Similarly, the GPU pipeline speedup over Mobius is 1.00, 1.03, 1.07, and 1.12, respectively, validating larger gains from the efficient GPU pipeline schedule of SPipe. On the other hand, scaling the $\mu$BS has essentially no impact on the CPU optimizer speedup because it does not affect the total GPU computation time per iteration, and the overlapping time for optimizer steps remains unchanged. For $\mu$BS values of 1, 2, 4, and 8, the CPU optimizer speedup over Mobius stays nearly constant, averaging 3.68, 3.66, 3.67, and 3.61, respectively.

As an example, Figure 13 illustrates how the GPU pipeline and CPU optimizer speedups contribute to the total speedup over Mobius for the 30B model. Their incremental gains are labeled +GPU and +CPU, respectively. As the $\mu$BS increases, the GPU pipeline's contribution becomes more pronounced, while that of the CPU optimizer remains steady, thereby driving the overall speedup higher. When $\mu$BS = 8, SPipe's GPU pipeline and CPU optimizer each provide nearly identical performance gains.
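How two component speedups combine into a total can be sketched with a simple model. It assumes, purely for illustration, that the baseline iteration time is the sum of GPU time and non-overlapping CPU optimizer time, so the total speedup is a weighted harmonic combination of the component speedups. The function name `total_speedup` and the baseline time fractions below are hypothetical and are not taken from the measurements.

```python
def total_speedup(gpu_fraction, gpu_speedup, cpu_speedup):
    """Combine per-component speedups into an end-to-end speedup.

    gpu_fraction: share of the baseline iteration time spent in the GPU
                  pipeline; the remainder is non-overlapping CPU optimizer time.
    """
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be in [0, 1]")
    if gpu_speedup <= 0 or cpu_speedup <= 0:
        raise ValueError("speedups must be positive")
    cpu_fraction = 1.0 - gpu_fraction
    # Each component shrinks by its own speedup; the new time is their sum.
    new_time = gpu_fraction / gpu_speedup + cpu_fraction / cpu_speedup
    return 1.0 / new_time


if __name__ == "__main__":
    # Hypothetical baseline: 80% GPU time, 20% non-overlapping CPU time.
    for gpu_s in (1.0, 1.1, 1.2):
        print(gpu_s, round(total_speedup(0.8, gpu_s, cpu_speedup=4.0), 3))
```

With the CPU speedup held fixed, the total in this model grows only through the GPU term, which is the qualitative trend described for increasing $\mu$BS.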
Additionally, SPipe can train models with a larger $\mu$BS than Mobius: only SPipe successfully trained the 19B, 40B, and 69B models with a $\mu$BS of 8. This results from the lower GPU memory consumption during the backward pass enabled by SPipe's fine-grained backward-stage partitioning.

**Scaling mini-batch size**. When the $\mu$BS is fixed, increasing the number of micro-batches translates to a larger MBS. This increases the overall execution time of the pipeline on a GPU while the idle time consumed by the bubbles remains constant. Hence, scaling the MBS lowers the bubble ratio of Mobius and eventually reduces the GPU pipeline speedup of SPipe.

Figure 14 shows the results of MBS scaling. As the MBS scales by factors of 1, 2, 3, and 4, the total speedup over Mobius decreases: the average speedups are 1.36, 1.28, 1.23, and 1.18, respectively. Similarly, the GPU pipeline speedup over Mobius decreases, averaging 1.16, 1.08, 1.04, and 1.02, respectively.

On the other hand, scaling the MBS improves the overlap of the optimizer step, which may not overlap well at smaller MBS values. As the MBS scales by factors of 1, 2, 3, and 4, the average CPU optimizer speedup increases to 2.38, 3.07, 3.22, and 3.68, respectively. However, a larger MBS does not change the duration of the CPU optimizer step, which still operates on accumulated gradients of the same size, whereas the duration of the forward/backward passes grows proportionally with the MBS. Consequently, although scaling the MBS enables better overlap of the CPU optimizer step, the total speedup over Mobius diminishes: the GPU pipeline speedup decreases, and the growing CPU optimizer speedup accounts for a progressively smaller share of the iteration time.
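The shrinking bubble ratio can be illustrated with the textbook estimate for a simple GPipe-style AFAB schedule, in which $p$ pipeline stages and $m$ equal-cost micro-batches give an idle fraction of $(p-1)/(m+p-1)$. This is an assumption made for illustration; it is not the exact schedule of Mobius or SPipe, and `bubble_ratio` is a hypothetical helper.

```python
def bubble_ratio(num_stages, num_microbatches):
    """Idle fraction of a simple all-forward-all-backward pipeline.

    Assumes every micro-batch costs the same on every stage and ignores
    offloading and communication, so this is only a first-order estimate.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("both arguments must be positive integers")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


if __name__ == "__main__":
    # Fixed pipeline depth of 4; the mini-batch grows by adding micro-batches.
    for m in (8, 16, 24, 32):
        print(m, round(bubble_ratio(4, m), 3))
```

Because the numerator depends only on the pipeline depth, adding micro-batches can only lower the ratio in this model, which leaves less idle time for a better schedule to recover.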
In addition, we observe that a saturation point of the CPU optimizer speedup exists for each model size. It is reached when the time required to process the optimizer step of a stage on the CPU equals the time required to process the backward passes of all micro-batches of that stage. For example, for the 30B model, this point is reached when the MBS is scaled to 32. Once the saturation point is reached, all optimizer steps of the backward stages are already fully overlapped, leaving only the optimizer step of the last-processed backward stage to run non-overlapped in SPipe.

| Config. | SPipe features included |
|---------|-------------------------|
| CFG0 | None |
| CFG1 | CFG0 with parameter prefetching |
| CFG2 | CFG1 with decoupled pass assignment |
| CFG3 | CFG2 with fine-grained backward stage partitioning |
| CFG4 | CFG3 with optimized activation checkpoint communication |
| CFG5 | CFG4 with asynchronous CPU optimizer |
| Ideal | Theoretically optimal SPipe performance |

Figure 15: Impact of progressively adding system features to SPipe for the 19B model.

# 5.4 Effect of Various Optimizations

SPipe proposes several optimization techniques. As shown in Figure 15, we decompose the proposed optimizations into distinct steps and evaluate various SPipe configurations by incrementally incorporating the proposed techniques. We partition the training iteration time into two components: the GPU time and the non-overlapping CPU time. The GPU time refers to the time spent on forward and backward computations, and the non-overlapping CPU time represents the remaining time spent on the CPU optimizer that is not overlapped with GPU computation.

Parameter prefetching (CFG1) reduces GPU time from 7.75s to 6.28s by overlapping CPU-to-GPU data transfers and GPU computation across stages.
Interestingly, decoupled pass assignment alone (CFG2) increases GPU time to 6.55s due to extra stage transition overhead (from the last forward stage to the backward stage) and activation checkpoint communication, but combining it with fine-grained backward partitioning (CFG3) and communication optimization (CFG4) lowers the GPU time to 5.50s. The non-overlapping CPU optimizer time remains constant at around 9.77s until the asynchronous CPU optimizer (CFG5) reduces it to 7.07s. Although CFG5 slightly increases GPU time by 0.15s for additional GPU-CPU synchronization (particularly for gradient offloading and optimization status), the CPU time improvement (2.70s) far outweighs this. Our final implementation (CFG5) approaches the theoretically optimal (Ideal) performance, with only 3.67% and 1.29% differences in the GPU and non-overlapping CPU times, respectively, compared to the ideal SPipe pipeline performance model.

## 5.5 Effect of Stage Configurations

To evaluate the impact of stage configuration, Figure 16 compares three cases with different numbers of forward and backward stages. More stages imply smaller stage sizes with fewer transformer blocks per stage.

Figure 16: Iteration time with different stage configurations and batch sizes for the 19B and 40B models. Stage configurations are denoted as M-N, where M and N represent the number of forward and backward stages, respectively.

For the smaller 19B model, Figure 16(a) shows that with a small batch size, a larger number of smaller stages is beneficial, while Figure 16(b) shows the opposite. This reflects a trade-off: more stages reduce pipeline bubbles (due to smaller stage size and earlier CPU optimizer start) but increase communication overhead. With small models and few micro-batches, the extra communication is marginal.
However, as the batch size increases beyond the CPU optimizer saturation point (Section 5.3), only the communication overhead grows, making more stages unfavorable. Similar trends are observed with the larger 40B model but with amplified effects. The benefit of smaller stages increases with small batch sizes, as computation (and bubble size) grows quadratically with the model size while communication increases linearly. With large batch sizes, communication overhead also grows due to the model's larger hidden dimension, further highlighting the trade-off.

# 5.6 Offloading and Recomputation Overhead

SPipe targets offloading-based pipelining, in contrast to conventional GPU-only pipelines that store all model states in GPU memory. Also, SPipe currently requires activation recomputation because of the decoupled pass assignment, while Mobius and GPU-only pipelines can selectively apply activation recomputation. While this paper focuses on scenarios with insufficient aggregate GPU memory to store all model states, including activations, we also experiment with smaller models that can be trained without offloading and activation recomputation. Figure 17 compares SPipe against GPU-only Megatron without recomputation, GPU-only Megatron with recomputation, and Mobius without recomputation. These experiments were conducted on a single node using smaller models: 1.4B, 3.1B, 5.2B, and 7.8B (SEQ=1024, $\mu$BS=1, MBS=16). As expected, GPU-only Megatron without recomputation is the fastest, followed by GPU-only Megatron with recomputation, Mobius without recomputation, and SPipe; SPipe's average slowdowns relative to these three baselines are 2.10×, 1.66×, and 1.20×, respectively. However, all three baselines encounter GPU out-of-memory (OOM) errors beyond 5.2B parameters, making them infeasible for larger models.
While SPipe shows relatively lower performance on small models that do not require offloading or recomputation, these scenarios fall outside the primary scope of this study. SPipe is designed to address memory and scalability challenges in larger models, where offloading and recomputation become indispensable.

Figure 17: Iteration time for smaller models.

Figure 18: Rollback overhead.

## 5.7 Rollback Overhead

SPipe bypasses optimizer synchronization while ensuring numerical stability through post-validation. If validation fails, parameters are reverted to their pre-update state using the rollback algorithm. However, the rollback process does not overlap with GPU computation, leading to some overhead. The rollback overhead includes both the time spent on the rollback and the time of the non-overlapping optimization step that would have been skipped. Figure 18 shows the rollback overhead for a 10B model across different batch sizes, demonstrating that the overhead varies significantly from 8% to 53% depending on the batch size.

To assess the frequency of rollbacks during training, we set the initial scale factor to $2^{32}$ and trained a 10B model for 4,500 iterations, resulting in 20 rollbacks. Notably, rollbacks occurred during the first 11 iterations, which could have been avoided with a lower initial scale factor. Even in a conservative scenario where a rollback occurs once every 100 iterations for a batch size of 8, the resulting overhead is only 0.53%. This is negligible compared to the speedup SPipe achieves, making the rollback overhead an acceptable trade-off in the context of SPipe's overall performance benefits.

#### 6 CONCLUSION

This paper presents SPipe, a hybrid GPU-CPU pipelining mechanism that efficiently overcomes GPU memory limits in LLM training. SPipe consists of two pipelines: a GPU pipeline and a GPU-CPU pipeline.
The GPU pipeline presents a novel pipeline scheduling scheme that decouples a stage's forward and backward passes for the same micro-batch to different GPUs by leveraging the CPU's shared memory and activation recomputation. It further optimizes its pipeline stages through fine-grained model partitioning that balances the passes' execution times and asynchronous checkpoint communication that hides the additional communication overhead. The GPU-CPU pipeline presents an asynchronous CPU optimizer that executes the optimizer steps on the CPU in parallel with the GPU pipeline stages. It efficiently utilizes the CPU to overlap the CPU optimizer overhead while guaranteeing the training correctness by a post-step validation and rollback mechanism. As a result, SPipe advances the state of the art in offloading-based LLM training, achieving an average 1.26× speedup with negligible overhead.

# **ACKNOWLEDGMENTS**

This work was partially supported by the National Research Foundation of Korea (NRF) under Grant No. RS-2023-00222663 (Center for Optimizing Hyperscale AI Models and Platforms), and by the Institute for Information and Communications Technology Promotion (IITP) under Grant No. 2018-0-00581 (CUDA Programming Environment for FPGA Clusters) and No. RS-2025-02304554 (Efficient and Scalable Framework for AI Heterogeneous Cluster Systems), all funded by the Ministry of Science and ICT (MSIT) of Korea. Additional support was provided by the BK21 Plus Program for Innovative Data Science Talent Education (Department of Data Science, SNU, No. 5199990914569) and the BK21 FOUR Program for Intelligent Computing (Department of Computer Science and Engineering, SNU, No. 4199990214639), both funded by the Ministry of Education (MOE) of Korea. This work was also partially supported by the Artificial Intelligence Industrial Convergence Cluster Development Project, funded by the MSIT and Gwangju Metropolitan City.
ICT at Seoul National University provided research facilities for this study.

#### A ARTIFACT APPENDIX

Our artifact contains the complete source code for SPipe, along with scripts to reproduce the evaluation results presented in the paper. Our implementation is based on Megatron-LM and includes the comparison baselines: DeepSpeed, Mobius, and Megatron. For Mobius and the offloading-extended Megatron, we provide our own implementations due to the lack of public availability. This appendix describes how to obtain and install the artifact and how to run the experiments using the provided scripts.

## A.1 Evaluation Check List

The following is the check list for artifact evaluation:

- **Algorithm:** Parallelization method for LLM training that reduces pipeline bubbles and alleviates CPU bottlenecks.
- **Program:** Baselines: DeepSpeed, Mobius, Megatron. Custom implementations provided for Mobius and offloading-extended Megatron due to lack of public code.
- **Model:** Configuration files for LLaMA models ranging from 10B to 110B parameters are provided. The pretrained model weights are not included or downloaded.
- **Dataset:** OpenWebText. Download and preprocessing instructions are provided as scripts.
- **Run-time environment:** Supports Linux and has been tested on Ubuntu 20.04, Python 3.8, CUDA 12.4, UCX 1.14.1, Open MPI 4.1.0, and PyTorch 2.4.1.
- **Hardware:** GPU clusters described in Section 5.1.
- **Metrics:** Iteration time is averaged over multiple runs, partitioned into GPU computation time and non-overlapping CPU optimizer time.
- **Output:** Iteration-wise logs with iteration times saved in CSV format for analysis.
- **Experiments:** Shell scripts are provided for experiment preparation and result reproduction. Scripts are written for Slurm clusters, but Slurm is not a strict requirement.
- **How much time is needed to prepare workflow (approximately)?:** About 1 hour for installation.
- **How much time is needed to complete the experiments (approximately)?:** About 6 hours for all experiments.
- **Publicly available?:** https://github.com/mcrl/spipe
- **Code licenses (if publicly available)?:** Apache-2.0 license.
- **Archived (provide DOI)?:** https://doi.org/10.5281/zenodo.16812303

## A.2 Obtaining SPipe

SPipe can be obtained from GitHub (https://github.com/mcrl/spipe). Figure 19 shows the directory structure of the artifact.

Figure 19: Directory structure of the artifact

**Hardware dependencies.** SPipe targets GPU clusters, specifically focusing on scenarios where the aggregate GPU memory is insufficient to store all model states. Our evaluation environment is described in Section 5.1.

**Software dependencies.** SPipe requires a Linux environment. Our evaluation environment includes PyTorch 2.4.1, CUDA 12.4, NVIDIA Apex, and Open MPI 4.1.0. Scripts for installing all third-party dependencies are provided with the artifact.

# A.3 Installation

In the spipe-aec/spipe/scripts directory, run the initialization scripts as follows:

```
$ source setup_env.sh
$ source setup_mpi.sh
$ source setup_conda.sh
$ source setup_data.sh
```

- setup\_env.sh: Sets the necessary environment variables. We recommend also adding these settings to the shell profile so that the script need not be rerun in every new shell session.
- setup\_mpi.sh: Installs UCX and CUDA-aware MPI.
- setup\_conda.sh: Creates a conda environment and installs PyTorch along with other dependencies.
- setup\_data.sh: Downloads and preprocesses the training dataset.

## A.4 Experiment Workflow

In the spipe-aec/spipe directory, run the experiment scripts as follows:

```
$ scripts/eval_speedup.sh
$ scripts/eval_batch_scaling.sh
$ scripts/eval_optimizations.sh
```

- eval\_speedup.sh: Compares speedup between DeepSpeed, Mobius, Megatron, and SPipe.
- eval\_batch\_scaling.sh: Measures scaling of micro-batch size and mini-batch size for SPipe.
- eval\_optimizations.sh: Measures the impact of adding system optimizations to SPipe.
Inside the experiment scripts are Slurm job launch commands, each of which corresponds to a single result in Section 5 and generates a log file slurm-<jobId>-<jobName>.out in spipe-aec/spipe/results/.

# A.5 Evaluation and Expected results

In the spipe-aec/spipe/results directory, extract each Slurm job's results and compare them with the original results from the paper as follows:

    $ ../scripts/result_extract.sh
    $ ../scripts/result_compare.sh

- result\_extract.sh: Extracts results into actual.csv in the spipe-aec/spipe/results directory.
- result\_compare.sh: Compares the extracted actual.csv with the original results in expected.csv and calculates the difference.

#### REFERENCES

- [1] Sanjith Athlur, Nitika Saran, Muthian Sivathanu, Ramachandran Ramjee, and Nipun Kwatra. 2022. Varuna: scalable, low-cost training of massive deep learning models. In Proceedings of the Seventeenth European Conference on Computer Systems. 472–487.
- [2] Jonghyun Bae, Jongsung Lee, Yunho Jin, Sam Son, Shine Kim, Hakbeom Jang, Tae Jun Ham, and Jae W Lee. 2021. FlashNeuron: SSD-Enabled Large-Batch Training of Very Deep Neural Networks. In 19th USENIX Conference on File and Storage Technologies (FAST 21). 387–401.
- [3] Tal Ben-Nun and Torsten Hoefler. 2019. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. ACM Computing Surveys (CSUR) 52, 4 (2019), 1–43.
- [4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models Are Few-Shot Learners.
arXiv preprint arXiv:2005.14165 (2020).
- [5] C++. 2024. std::thread. https://cplusplus.com/reference/thread/thread/.
- [6] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174 (2016).
- [7] Daning Cheng, Shigang Li, Hanping Zhang, Fen Xia, and Yunquan Zhang. 2021. Why dataset properties bound the scalability of parallel machine learning training algorithms. IEEE Transactions on Parallel and Distributed Systems 32, 7 (2021), 1702–1712.
- [8] Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, Lansong Diao, Xiaoyong Liu, and Wei Lin. 2021. DAPPLE: A pipelined data parallel approach for training large models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 431–445.
- [9] Yangyang Feng, Minhui Xie, Zijie Tian, Shuo Wang, Youyou Lu, and Jiwu Shu. 2023. Mobius: Fine tuning large-scale models on commodity GPU servers. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 489–501.
- [10] Swapnil Gandhi, Mark Zhao, Athinagoras Skiadopoulos, and Christos Kozyrakis. 2024. ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles. 211–228.
- [11] Mark Hildebrand, Jawad Khan, Sanjeev Trika, Jason Lowe-Power, and Venkatesh Akella. 2020. AutoTM: Automatic tensor movement in heterogeneous memory systems using integer linear programming. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 875–890.
- [12] Chien-Chin Huang, Gu Jin, and Jinyang Li. 2020. SwapAdvisor: Pushing deep learning beyond the GPU memory limit via smart swapping.
In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 1341–1355.
- [13] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, and Zhifeng Chen. 2019. GPipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems 32 (2019).
- [14] Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Joseph Gonzalez, Kurt Keutzer, and Ion Stoica. 2020. Checkmate: Breaking the memory wall with optimal tensor rematerialization. Proceedings of Machine Learning and Systems 2 (2020), 497–511.
- [15] Hai Jin, Bo Liu, Wenbin Jiang, Yang Ma, Xuanhua Shi, Bingsheng He, and Shaofeng Zhao. 2018. Layer-centric memory reuse and data migration for extreme-scale deep learning on many-core architectures. ACM Transactions on Architecture and Code Optimization (TACO) 15, 3 (2018), 1–26.
- [16] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).
- [17] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
- [18] Marisa Kirisame, Steven Lyubomirsky, Altan Haan, Jennifer Brennan, Mike He, Jared Roesch, Tianqi Chen, and Zachary Tatlock. 2021. Dynamic Tensor Rematerialization. In International Conference on Learning Representations. https://openreview.net/forum?id=Vfs\_2RnOD0H
- [19] Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. 2023. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems 5 (2023).
- [20] Tung D Le, Haruki Imai, Yasushi Negishi, and Kiyokuni Kawachiya. 2018.
TFLMS: Large model support in TensorFlow by graph rewriting. arXiv preprint arXiv:1807.02037 (2018).
- [21] Shigang Li and Torsten Hoefler. 2021. Chimera: Efficiently training large-scale neural networks with bidirectional pipelines. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–14.
- [22] Zhenxing Li, Qiang Cao, Yajie Chen, and Wenrui Yan. 2023. CoTrain: Efficient Scheduling for Large-Model Training upon GPU and CPU in Parallel. In Proceedings of the 52nd International Conference on Parallel Processing. 92–101.
- [23] Bo Liu, Wenbin Jiang, Hai Jin, Xuanhua Shi, and Yang Ma. 2018. Layrub: layer-centric GPU memory reuse and data migration in extreme-scale deep learning systems. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 405–406.
- [24] Ziming Liu, Shenggan Cheng, Haotian Zhou, and Yang You. 2023. Hanayo: Harnessing Wave-like Pipeline Parallelism for Enhanced Large Model Training Efficiency. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–13.
- [25] Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations. https://arxiv.org/abs/1711.05101
- [26] Avinash Maurya, Jie Ye, M Mustafa Rafique, Franck Cappello, and Bogdan Nicolae. 2024. Breaking the memory wall: A study of I/O patterns and GPU memory utilization for hybrid CPU-GPU offloaded optimizers. In Proceedings of the 14th Workshop on AI and Scientific Computing at Scale using Flexible Computing Infrastructures. 9–16.
- [27] Avinash Maurya, Jie Ye, M Mustafa Rafique, Franck Cappello, and Bogdan Nicolae. 2024. Deep Optimizer States: Towards Scalable Training of Transformer Models using Interleaved Offloading. In Proceedings of the 25th International Middleware Conference. 404–416.
- [28] Meta. 2024. LLaMA3. https://llama.meta.com/llama3/.
- [29] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2017. Mixed precision training. arXiv preprint arXiv:1710.03740 (2017).
- [30] Microsoft. 2024. DeepSpeed. https://github.com/deepspeedai/DeepSpeed.
- [31] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. 2019. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. 1–15.
- [32] Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. 2021. Memory-efficient pipeline-parallel DNN training. In International Conference on Machine Learning. PMLR, 7937–7947.
- [33] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. 2021. Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–15.
- [34] NVIDIA. 2024. Megatron-LM. https://github.com/NVIDIA/Megatron-LM.
- [35] Xuan Peng, Xuanhua Shi, Hulin Dai, Hai Jin, Weiliang Ma, Qian Xiong, Fan Yang, and Xuehai Qian. 2020. Capuchin: Tensor-based GPU memory management for deep learning. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 891–
- [36] Joshua Peterson, Stephan Meylan, and David Bourgin. 2019. OpenWebText. https://github.com/jcpeterson/openwebtext#openwebtext.
- [37] POSIX. 2024. The Open Group Base Specifications. https://pubs.opengroup.org/onlinepubs/9699919799/.
- [38] Bharadwaj Pudipeddi, Maral Mesmakhosroshahi, Jinwen Xi, and Sujeeth Bharadwaj. 2020.
Training large neural networks with constant memory using a new execution algorithm. arXiv preprint arXiv:2002.05645 (2020).
- [39] Penghui Qi, Xinyi Wan, Guangxing Huang, and Min Lin. 2023. Zero bubble pipeline parallelism. arXiv preprint arXiv:2401.10241 (2023).
- [40] Penghui Qi, Xinyi Wan, Guangxing Huang, and Min Lin. 2024. Zero Bubble (Almost) Pipeline Parallelism. In The Twelfth International Conference on Learning Representations.
- [41] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
- [42] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. ZeRO: Memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–16.
- [43] Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. 2021. ZeRO-Infinity: Breaking the GPU memory wall for extreme scale deep learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–14.
- [44] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3505–3506.
- [45] Jie Ren, Jiaolin Luo, Kai Wu, Minjia Zhang, Hyeran Jeon, and Dong Li. 2021. Sentinel: Efficient tensor migration and allocation on heterogeneous memory systems for deep learning. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 598–611.
- [46] Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: Democratizing billion-scale model training.
In 2021 USENIX Annual Technical Conference (USENIX ATC 21). 551–564.
- [47] Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W Keckler. 2016. vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 1–13.
- [48] Murray Shanahan. 2024. Talking about large language models. Commun. ACM 67, 2 (2024), 68–79.
- [49] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019).
- [50] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
- [51] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. LLaMA 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
- [52] Guanhua Wang, Masahiro Tanaka, Xiaoxia Wu, Lok Chand Koppaka, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2024. DeepSpeed ZeRO-Offload++: 6x higher training throughput via collaborative CPU/GPU twin-flow. https://github.com/microsoft/DeepSpeed/tree/offloadppnews/blogs/deepspeed-offloadpp.
- [53] Yang You, Jonathan Hseu, Chris Ying, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. 2019. Large-batch training for LSTM and beyond. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–16.
- [54] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. 2019. Large batch optimization for deep learning: Training BERT in 76 minutes. arXiv preprint arXiv:1904.00962 (2019).
- [55] Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. 2018. ImageNet training in minutes. In Proceedings of the 47th International Conference on Parallel Processing. 1–10.
- [56] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022).
- [57] Weigang Zhang, Biyu Zhou, Xuehai Tang, Zhaoxing Wang, and Songlin Hu. 2023. MixPipe: Efficient Bidirectional Pipeline Parallelism for Training Large-Scale Models. In 2023 60th ACM/IEEE Design Automation Conference (DAC). IEEE, 1–6.
- [58] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223 (2023).
- [59] Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, Joseph E Gonzalez, and Ion Stoica. 2022. Alpa: Automating inter-and intra-operator parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 559–578.