How to Use FasterTransformer for Single-Node and Distributed Model Inference

Source: 深度學習自然語言處理 (Deep Learning and Natural Language Processing) · Author: 深度學習自然語言 · 2023-05-18 14:35

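In recent months, following ChatGPT's breakout success, large models have sprung up everywhere. Model inference is the last mile that connects an abstract algorithmic model to a concrete business application.

Yet this stage still has widely recognized pain points and demands, for example:

The user experience of any online product degrades as service response time grows. How can request latency be squeezed to the minimum for a complex model?

Model inference is usually a resident service that holds its resources permanently. How can single-node performance be raised to increase QPS while substantially cutting resource costs?

Device-edge-cloud deployment is the inevitable trend for model serving. How can an offline-trained model be slimmed down and reshaped so that it can be deployed quickly on more devices?

As a result, accelerating and optimizing model inference has become an important research area in AI.

This article introduces the basic usage of FasterTransformer, an inference acceleration engine for large models.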


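Introduction to FasterTransformer

NVIDIA FasterTransformer (FT) is an acceleration engine for inference with Transformer-based neural networks. It contains highly optimized implementations of the Transformer block, covering both the encoder and the decoder parts. With it you can run inference for encoder-decoder models (such as T5), encoder-only models (such as BERT), and decoder-only models (such as GPT).

FT is written in C++/CUDA and relies on the highly optimized cuBLAS, cuBLASLt, and cuSPARSELt libraries, which enables fast Transformer inference on GPUs.

Compared with other compilers such as NVIDIA TensorRT, FT's distinguishing feature is that it supports distributed inference of large Transformer models.

The figure below shows how a Transformer-based network can be split across multiple GPUs and nodes using tensor parallelism (TP) and pipeline parallelism (PP).

Tensor parallelism splits each tensor into chunks and places each chunk on a separate GPU. During computation every chunk is processed in parallel on its own GPU, and the final tensor is obtained by combining the results from all GPUs.

Pipeline parallelism splits the model by depth and places different complete layers on different GPUs or nodes.

[Figure: a Transformer network split across GPUs and nodes with tensor parallelism and pipeline parallelism]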

Under the hood, inter-node and intra-node communication relies on MPI, NVIDIA NCCL, Gloo, and similar libraries. With FasterTransformer you can therefore run a large Transformer with tensor parallelism across multiple GPUs to reduce compute latency. TP and PP can also be combined to run large Transformer models with billions or even trillions of parameters in a multi-GPU, multi-node environment.

Besides the C++ backend, FasterTransformer also integrates with TensorFlow (via a TensorFlow op), PyTorch (via a PyTorch op), and Triton as deployment backends. Currently the TensorFlow op supports only a single GPU, while the PyTorch op and the Triton backend support multiple GPUs and multiple nodes.

FT currently supports models including Megatron-LM GPT-3, GPT-J, BERT, ViT, Swin Transformer, Longformer, T5, and XLNet. See the FasterTransformer repository on GitHub for the latest support matrix.

FT works on GPUs with compute capability >= 7.0, such as V100, A10, and A100.

The figure below compares inference speedups for the 6B-parameter GPT-J model:

[Figure: inference speedup comparison for GPT-J 6B]

Optimization Techniques in FasterTransformer

Compared with general-purpose deep learning training frameworks, FT gives you a faster inference pipeline, with lower latency and higher throughput for Transformer-based networks. The optimizations FT applies to GPT-3 and other large Transformer models include:

Layer fusion

This is a set of techniques applied in a preprocessing stage that merges several neural network layers into a single one computed by a single kernel. It reduces data transfer and increases arithmetic intensity, which speeds up computation at inference time. For example, all operations in the multi-head attention block can be merged into one kernel.
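To make the idea of fusion concrete, here is a minimal pure-Python sketch. It is only an analogy for what a fused GPU kernel does, not FT's code: the function names are hypothetical, and "add bias, then GeLU" is chosen as an illustrative pair of operations. The unfused version writes an intermediate buffer and reads it back, while the fused version makes a single pass.

```python
import math


def gelu(x: float) -> float:
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def add_bias_then_gelu_unfused(xs, bias):
    # Pass 1: write an intermediate buffer (an extra round trip to memory).
    tmp = [x + b for x, b in zip(xs, bias)]
    # Pass 2: read the intermediate buffer back and apply the activation.
    return [gelu(t) for t in tmp]


def add_bias_gelu_fused(xs, bias):
    # Single pass: each element is loaded once, with no intermediate buffer.
    return [gelu(x + b) for x, b in zip(xs, bias)]


activations = [0.5, -1.0, 2.0]
bias = [0.1, 0.2, -0.3]
print(add_bias_gelu_fused(activations, bias))
```

Both functions return the same values; the benefit of fusion on a GPU comes from fewer kernel launches and less memory traffic, which this sketch can only hint at.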

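Inference optimization for autoregressive models (activation caching)

To avoid recomputing the keys and values of earlier tokens every time a new token is generated, FT allocates a buffer that stores them at each step.

This costs some extra memory, but it saves the cost of recomputation. The process is shown in the figure below. The same caching mechanism is used in several parts of the network.

[Figure: key/value caching during autoregressive generation]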
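The caching idea can be sketched with a toy single-head attention in pure Python. Everything here is an assumption made for illustration: the "projections" are simple scalings, the function names are hypothetical, and nothing reflects FT's actual buffer layout. The sketch counts how many key/value projections each strategy performs.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(d)]


def project(x, scale, calls):
    calls[0] += 1  # count every key/value projection performed
    return [scale * xi for xi in x]


def decode_without_cache(tokens):
    calls, outputs = [0], []
    for t in range(1, len(tokens) + 1):
        # Recompute keys and values of ALL previous tokens at every step.
        keys = [project(x, 0.5, calls) for x in tokens[:t]]
        values = [project(x, 2.0, calls) for x in tokens[:t]]
        outputs.append(attend(tokens[t - 1], keys, values))
    return outputs, calls[0]


def decode_with_cache(tokens):
    calls, outputs, k_cache, v_cache = [0], [], [], []
    for x in tokens:
        # Only the newest token is projected; older entries come from the cache.
        k_cache.append(project(x, 0.5, calls))
        v_cache.append(project(x, 2.0, calls))
        outputs.append(attend(x, k_cache, v_cache))
    return outputs, calls[0]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
slow_out, slow_calls = decode_without_cache(tokens)
fast_out, fast_calls = decode_with_cache(tokens)
print(slow_calls, fast_calls)  # 20 8
```

For four tokens the uncached loop performs 20 projections and the cached loop 8, while both produce identical outputs. The gap grows quadratically with sequence length.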

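Memory optimization

Unlike traditional models such as BERT, large Transformer models can have up to trillions of parameters and occupy hundreds of GB of storage. Even stored in half precision, GPT-3 175B needs 350 GB. It is therefore necessary to reduce memory use elsewhere.

For example, FasterTransformer reuses the memory buffers for activations/outputs across decoder layers. Since GPT-3 has 96 layers, only 1/96 of the activation memory is needed.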
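The two figures above follow from simple arithmetic, sketched below. The helper names are hypothetical, GB is taken as 10^9 bytes, and the activation comparison assumes every layer would otherwise hold a buffer of the same size.

```python
def weight_bytes(num_params: int, bytes_per_param: int) -> int:
    """Storage needed for the weights alone."""
    return num_params * bytes_per_param


def activation_ratio_with_reuse(num_layers: int) -> float:
    """Fraction of activation memory needed when one buffer is shared by all layers,
    relative to giving each layer its own equally sized buffer (an assumption)."""
    return 1 / num_layers


GPT3_PARAMS = 175_000_000_000

fp16_gb = weight_bytes(GPT3_PARAMS, 2) / 1e9  # half precision: 2 bytes per parameter
fp32_gb = weight_bytes(GPT3_PARAMS, 4) / 1e9  # single precision: 4 bytes per parameter
print(fp16_gb, fp32_gb)                 # 350.0 700.0
print(activation_ratio_with_reuse(96))  # 1/96, about 0.0104
```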

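Inter-node/intra-node communication with MPI and NCCL, and model parallelism

FasterTransformer provides both tensor parallelism and pipeline parallelism. For tensor parallelism it follows the idea of Megatron. In both the self-attention block and the feed-forward block, FT splits the weights of the first matrix multiplication by row and the weights of the second by column. With this optimization, FT reduces the number of reduction operations per Transformer block to two.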
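The Megatron-style split of a feed-forward block can be sketched in pure Python as follows. This is an illustrative model, not FT's implementation: ReLU stands in for the activation, the function names are hypothetical, and the loop over ranks simulates separate GPUs. In the math convention used here (Y = X·W1, Z = act(Y)·W2), the first weight is sharded along its output features and the second along the matching input features; whether that is called a "row" or "column" split depends on how the weights are stored. One summation (the all-reduce) recovers the full result.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def mlp(x, w1, w2):
    """Reference feed-forward block on a single device."""
    return matmul(relu(matmul(x, w1)), w2)


def mlp_tensor_parallel(x, w1, w2, world_size):
    hidden = len(w1[0])
    if hidden % world_size != 0:
        raise ValueError("hidden size must be divisible by world_size")
    shard = hidden // world_size
    partials = []
    for rank in range(world_size):  # each iteration stands for one GPU
        lo, hi = rank * shard, (rank + 1) * shard
        w1_shard = [row[lo:hi] for row in w1]  # slice of the first GEMM's output features
        w2_shard = w2[lo:hi]                   # matching input features of the second GEMM
        # The activation is element-wise, so it needs no communication.
        partials.append(matmul(relu(matmul(x, w1_shard)), w2_shard))
    # All-reduce (sum) of the partial outputs: the only reduction for this block.
    return [[sum(p[i][j] for p in partials) for j in range(len(w2[0]))]
            for i in range(len(x))]


x = [[1.0, -2.0, 3.0]]
w1 = [[1.0, 0.0, -1.0, 2.0], [0.0, 1.0, 1.0, -1.0], [2.0, -1.0, 0.0, 1.0]]
w2 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0], [-2.0, 0.0]]
print(mlp(x, w1, w2), mlp_tensor_parallel(x, w1, w2, world_size=2))
```

The same pattern applied to the attention block gives the second reduction, which matches the count of two reductions per Transformer block stated above.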

For pipeline parallelism, FasterTransformer splits the whole batch of requests into multiple micro-batches to hide the communication bubbles. It automatically adjusts the micro-batch size for different cases.
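Why micro-batches shrink the bubble can be seen in an idealized model, sketched below. It assumes every stage takes one time unit per micro-batch and ignores communication cost; it is not FT's scheduler, and the function name is hypothetical. With p stages and m micro-batches the pipeline needs m + p - 1 steps, so the idle fraction is (p - 1) / (m + p - 1).

```python
def pipeline_schedule(num_stages: int, num_micro_batches: int):
    """Return (total time steps, idle fraction) for an idealized forward-only pipeline."""
    total_steps = num_micro_batches + num_stages - 1
    busy_slots = num_stages * num_micro_batches  # slots where a stage does useful work
    all_slots = num_stages * total_steps         # slots available across all stages
    return total_steps, 1 - busy_slots / all_slots


for micro_batches in (1, 2, 8):
    steps, bubble = pipeline_schedule(num_stages=4, num_micro_batches=micro_batches)
    print(micro_batches, steps, round(bubble, 3))
```

With four stages, a single batch leaves each stage idle 75% of the time, while eight micro-batches reduce the idle share to 3/11.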

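MatMul kernel autotuning (GEMM autotuning)

Matrix multiplication is the dominant and heaviest operation in Transformer-based networks. FT uses functionality from the cuBLAS and CUTLASS libraries to perform these operations. It is important to know that a MatMul can be executed in dozens of different ways at the hardware level, using different low-level algorithms.

The GemmBatchedEx function (cublasGemmBatchedEx in cuBLAS) implements the MatMul operation and takes a cublasGemmAlgo_t as an input parameter. With this parameter you can choose among the low-level algorithms.

FasterTransformer uses this parameter to benchmark all low-level algorithms at run time and picks the best one for the model's parameters and your input data (attention layer size, number of attention heads, hidden size). In addition, FT uses hardware-accelerated low-level functions such as __expf and __shfl_xor_sync in some parts of the network.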
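The autotuning principle (time every candidate on the real problem shape, keep the fastest) can be sketched in pure Python. This is only an analogy: the two candidate implementations and all names are hypothetical stand-ins for the cuBLAS algorithm choices, and no GPU is involved.

```python
import time


def matmul_loops(a, b):
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for k in range(inner):
                acc += a[i][k] * b[k][j]
            out[i][j] = acc
    return out


def matmul_zip(a, b):
    columns = list(zip(*b))  # transpose once so each column is contiguous
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


CANDIDATES = {"loops": matmul_loops, "zip": matmul_zip}


def pick_fastest(a, b, repeats=5):
    """Benchmark every candidate on the given shapes and return the winner's name."""
    timings = {}
    for name, fn in CANDIDATES.items():
        start = time.perf_counter()
        for _ in range(repeats):
            fn(a, b)
        timings[name] = time.perf_counter() - start
    return min(timings, key=timings.get), timings


a = [[float(i + j) for j in range(16)] for i in range(16)]
b = [[float(i - j) for j in range(16)] for i in range(16)]
best, timings = pick_fastest(a, b)
print(best)
```

The winner can differ between machines and problem sizes, which is exactly why the selection is made per model configuration rather than hard-coded.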

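Low-precision inference

FT's kernels support inference with low-precision input data such as fp16 and int8. Both speed things up because less data is transferred and less memory is needed. In addition, int8 and fp16 computation can run on specialized hardware such as Tensor Cores (available on all GPU architectures from Volta onward) and the Transformer Engine in Hopper GPUs.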
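The trade-off behind fp16 (half the bytes per value, at the cost of rounding) can be checked with Python's standard struct module, whose 'e' format is IEEE 754 half precision. The helper name below is hypothetical; the sketch does not touch FT or a GPU.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


fp16_size = struct.calcsize("<e")  # bytes per half-precision value
fp32_size = struct.calcsize("<f")  # bytes per single-precision value
print(fp16_size, fp32_size)  # 2 4

print(to_fp16(0.5))  # 0.5 is exactly representable
print(to_fp16(0.1))  # 0.0999755859375: only 10 mantissa bits are kept
```

Halving the storage per value also halves the data that must move between memory and compute units, which is where much of the speedup comes from.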

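Beyond these, FT includes a fast C++ beam search implementation and an all-reduce optimized for TensorParallelism 8 mode, where the model's weights are partitioned across eight GPUs.

That concludes the overview of FasterTransformer. The rest of this article demonstrates using FasterTransformer with the BLOOM model and PyTorch as the backend.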

Introduction to FasterTransformer GPT

The demonstration below uses the BLOOM model. BLOOM is a GPT variant that uses ALiBi (for adding positional information), so this article first briefly reviews how FT handles GPT. GPT is a decoder-only architecture: it has no encoder module and uses GeLU as its activation.

FasterTransformer GPT workflow

The figure below shows the FasterTransformer GPT workflow. Unlike BERT (encoder-only) and encoder-decoder models, GPT takes some input ids as context and generates the corresponding output ids as the response. The main bottleneck in this workflow is GptDecoderLayer (the transformer block), because the time grows linearly with the number of layers. In GPT-3, GptDecoderLayer accounts for about 95% of the total time.

[Figure: FasterTransformer GPT workflow]

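FasterTransformer splits the whole workflow into two parts.

The first part computes the k/v cache of the context (the input ids).

The second part autoregressively generates the output ids.

The operations in the two parts are similar; only the tensor shapes in SelfAttention differ. Two different implementations are therefore used for the two cases, as shown in the figure below.

[Figure: ContextSelfAttention and DecoderSelfAttention]

In DecoderSelfAttention the query sequence length is always 1, so a custom fused masked multi-head attention kernel handles it. In ContextSelfAttention, by contrast, the query sequence length is the maximum input length, so cuBLAS is used to take advantage of Tensor Cores.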
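The shape difference between the two phases can be written down explicitly. The sketch below only computes the shape of the attention score matrix (queries times keys); the function names and the (batch, heads, query_len, key_len) layout are assumptions for illustration, not FT's tensor layout.

```python
def context_score_shape(batch: int, heads: int, max_input_len: int):
    # Context phase: every input position queries every input position.
    return (batch, heads, max_input_len, max_input_len)


def decoder_score_shape(batch: int, heads: int, cache_len: int):
    # Generation phase: one new query attends to all cached keys.
    return (batch, heads, 1, cache_len)


# Example values: batch size 8 and 16 heads; the lengths are arbitrary.
print(context_score_shape(8, 16, 128))  # (8, 16, 128, 128)
print(decoder_score_shape(8, 16, 129))  # (8, 16, 1, 129)
```

A query length of 1 turns the matrix-matrix product into a matrix-vector product, which is why a hand-written fused kernel fits the generation phase better than a general GEMM.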

The following examples show how to run a GPT model on multiple GPUs and multiple nodes.

examples/cpp/multi_gpu_gpt_example.cc: uses MPI to organize all GPUs.

examples/cpp/multi_gpu_gpt_triton_example.cc: uses threads within a node and MPI across nodes. This example also shows how to run a GPT model through the FasterTransformer-based Triton backend API.

examples/pytorch/gpt/multi_gpu_gpt_example.py: similar to examples/cpp/multi_gpu_gpt_example.cc, but wraps the FasterTransformer instance in a PyTorch op.

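In summary, the workflow for running a GPT model is:

1. Initialize NCCL communication through MPI or threads and set the ranks for tensor parallelism and pipeline parallelism.

2. Load the weights according to the tensor-parallel rank, the pipeline-parallel rank, and the other model hyperparameters.

3. Create a ParallelGpt instance from the tensor-parallel rank, the pipeline-parallel rank, and the other model hyperparameters.

4. Receive a request from the client and convert it into the input tensor format of ParallelGpt.

5. Run forward.

6. Convert the output tensors of ParallelGpt into a response and return it to the client.

The C++ example code skips steps 4 and 6 and loads the request from examples/cpp/multi_gpu_gpt/start_ids.csv. In the PyTorch example code the request comes from the PyTorch side. The Triton example code is a complete example covering steps 1 through 6.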
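Setting the tensor-parallel and pipeline-parallel ranks in the first step amounts to mapping each process's global rank onto a two-dimensional grid. The sketch below shows one common mapping, in which consecutive global ranks form a tensor-parallel group. This layout is an assumption made for illustration and the function name is hypothetical, so check the example sources for the mapping FT actually uses.

```python
def parallel_ranks(world_rank: int, tensor_para_size: int, pipeline_para_size: int):
    """Map a global rank to (tensor_para_rank, pipeline_para_rank)."""
    world_size = tensor_para_size * pipeline_para_size
    if not 0 <= world_rank < world_size:
        raise ValueError(f"rank {world_rank} is outside a world of size {world_size}")
    # Consecutive ranks share a pipeline stage and differ in tensor-parallel rank.
    return world_rank % tensor_para_size, world_rank // tensor_para_size


grid = [parallel_ranks(rank, tensor_para_size=2, pipeline_para_size=2) for rank in range(4)]
print(grid)  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Keeping a tensor-parallel group on consecutive ranks tends to place it inside one node, which suits the heavier communication of tensor parallelism.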

The source code is in src/fastertransformer/models/multi_gpu_gpt/ParallelGpt.cc. The GPT constructor parameters include head_num, num_layer, tensor_para, and pipeline_para; the inputs include input_ids, input_lengths, and output_seq_len; the outputs include output_ids (containing the input_ids and the generated ids), sequence_length, output_log_probs, cum_log_probs, and context_embeddings.

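FasterTransformer GPT optimizations

Kernel optimization: many kernels are based on the already highly optimized kernels of the decoder and decoding modules. To avoid recomputing previous keys and values, a buffer is allocated to store them at each step. This takes some extra memory but saves the cost of recomputation.

Memory optimization: unlike traditional models such as BERT, GPT-3 has 175 billion parameters and needs 350 GB even when stored in half precision, so memory use elsewhere must be reduced. FasterTransformer reuses the memory buffers across decoder layers. Since GPT-3 has 96 layers, only 1/96 of the memory is needed.

Model parallelism: for GPT models, FasterTransformer provides both tensor parallelism and pipeline parallelism. For tensor parallelism it follows the idea of Megatron: in both the self-attention block and the feed-forward block, the weights of the first matrix multiplication are split by row and those of the second by column. With this optimization, each transformer block needs only 2 reduction operations; the workflow is shown in the figure below. For pipeline parallelism, FasterTransformer splits the whole batch of requests into multiple micro-batches and hides the communication bubbles, automatically adjusting the micro-batch size for different cases. Users can adjust the degree of model parallelism by editing gpt_config.ini. We recommend tensor parallelism within a node and pipeline parallelism across nodes, because tensor parallelism requires more NCCL communication.

[Figure: tensor-parallel workflow of a transformer block]

Multiple frameworks: besides the C++ source code, FasterTransformer also provides a TensorFlow op, a PyTorch op, and a Triton backend. Currently the TensorFlow op supports only a single GPU, while the PyTorch op and the Triton backend support multiple GPUs and multiple nodes. FasterTransformer also provides a tool that splits and converts Megatron models into FasterTransformer binary files, so that FasterTransformer can load the binaries directly and avoid extra model-splitting work for model parallelism.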

FasterTransformer GPT inference options

FasterTransformer GPT also provides environment variables for tuning to specific use cases.

Name | Description | Default | Accepted values
FMHA_ENABLE | Enable the fused multi-head attention kernels (fp16 accumulation) | disabled | ON = enable fmha, otherwise disabled
CONTEXT_ATTENTION_BMM1_HALF_ACCUM | Use fp16 accumulation for the qk gemm; only affects the unfused multi-head attention kernels | fp32 accumulation | ON = fp16 accumulation, otherwise fp32 accumulation
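One way to apply these variables from Python is to build the environment for a child process, as sketched below. The variable names come from the table; the helper function and its arguments are hypothetical, and the sketch assumes the variables are read when the FT process starts.

```python
import os


def ft_environment(enable_fmha: bool = False, qk_half_accum: bool = False, base=None) -> dict:
    """Return a copy of the environment with the FT GPT options applied."""
    env = dict(os.environ if base is None else base)
    if enable_fmha:
        env["FMHA_ENABLE"] = "ON"  # turn on the fused multi-head attention kernels
    if qk_half_accum:
        env["CONTEXT_ATTENTION_BMM1_HALF_ACCUM"] = "ON"  # non-default accumulation for the qk gemm
    return env


env = ft_environment(enable_fmha=True, base={"PATH": "/usr/bin"})
print(env)
```

The resulting dictionary can be handed to subprocess.run(command, env=env) when launching an inference script, which keeps the setting local to that one run.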

Environment setup

Basic environment

First make sure you have the following:

NVIDIA Docker and NGC containers

An NVIDIA Pascal/Volta/Turing/Ampere series GPU

Component version requirements:

CMake: 3.13 or later

CUDA: 11.0 or later

NCCL: 2.10 or later

Python: 3.8.13

PyTorch: 1.13.0

These components are readily available in the official NVIDIA TensorFlow/PyTorch Docker images.

Building FasterTransformer

The official NVIDIA images are recommended, for example nvcr.io/nvidia/tensorflow:22.09-tf1-py3 or nvcr.io/nvidia/pytorch:22.09-py3. The official PyTorch images can also be used.

First, pull the matching PyTorch image.

docker pull nvcr.io/nvidia/pytorch:22.09-py3

Once the image has been downloaded, create a container in which to compile and build FasterTransformer.

nvidia-docker run -dti --name bloom_faster_transformer \
--restart=always --gpus all --network=host \
--shm-size 5g \
-v /home/gdong/workspace/code:/workspace/code \
-v /home/gdong/workspace/data:/workspace/data \
-v /home/gdong/workspace/model:/workspace/model \
-v /home/gdong/workspace/output:/workspace/output \
-w /workspace \
nvcr.io/nvidia/pytorch:22.09-py3 \
bash

Enter the container.

docker exec -it bloom_faster_transformer bash

Download the FasterTransformer source code.

cd code
git clone https://github.com/NVIDIA/FasterTransformer.git
cd FasterTransformer/
git submodule init && git submodule update

Create and enter the build directory to build FasterTransformer.

mkdir -p build
cd build

Then run cmake PATH to generate the Makefile.

cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON ..

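Notes:

First: the xx in -DSM=xx is the compute capability of your GPU. The table below lists the compute capability of common GPUs.

GPU | Compute capability
P40 | 61
P4 | 61
V100 | 70
T4 | 75
A100 | 80
A30 | 80
A10 | 86

By default, -DSM is set to 70, 75, 80, and 86. Setting more -DSM values lengthens the compilation, so we recommend setting -DSM only for the device you actually use.

Second: this article uses PyTorch as the backend, so the command adds -DBUILD_PYT=ON. This builds the TorchScript custom class, so make sure your PyTorch version is greater than 1.5.0.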
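Choosing the right -DSM value can be automated with a small lookup, sketched below. The table values and the cmake flags are the ones used in this article; the helper itself is hypothetical and only assembles the command string without running it.

```python
# Compute capability per GPU, taken from the table in this article.
SM_BY_GPU = {"P4": 61, "V100": 70, "T4": 75, "A100": 80, "A30": 80, "A10": 86}


def build_cmake_command(gpu: str, build_pytorch: bool = True, multi_gpu: bool = True) -> str:
    """Assemble the cmake command line for a single target GPU."""
    if gpu not in SM_BY_GPU:
        raise KeyError(f"unknown GPU {gpu!r}; look up its compute capability first")
    flags = [f"-DSM={SM_BY_GPU[gpu]}", "-DCMAKE_BUILD_TYPE=Release"]
    if build_pytorch:
        flags.append("-DBUILD_PYT=ON")
    if multi_gpu:
        flags.append("-DBUILD_MULTI_GPU=ON")
    return " ".join(["cmake", *flags, ".."])


print(build_cmake_command("A100"))
```

For an A100 this reproduces the cmake command used above. On a machine with PyTorch installed, torch.cuda.get_device_capability() returns the (major, minor) pair directly, which avoids the lookup table.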

Output:

-- The CXX compiler identification is GNU 9.4.0
-- The CUDA compiler identification is NVIDIA 11.8.89
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /usr/local/cuda (found suitable version "11.8", minimum required is "10.2") 
CUDA_VERSION 11.8 is greater or equal than 11.0, enable -DENABLE_BF16 flag
-- Found CUDNN: /usr/lib/x86_64-linux-gnu/libcudnn.so  
-- Add DBUILD_CUTLASS_MOE, requires CUTLASS. Increases compilation time
-- Add DBUILD_CUTLASS_MIXED_GEMM, requires CUTLASS. Increases compilation time
-- Running submodule update to fetch cutlass
-- Add DBUILD_MULTI_GPU, requires MPI and NCCL
-- Found MPI_CXX: /opt/hpcx/ompi/lib/libmpi.so (found version "3.1") 
-- Found MPI: TRUE (found version "3.1")  
-- Found NCCL: /usr/include  
-- Determining NCCL version from /usr/include/nccl.h...
-- Looking for NCCL_VERSION_CODE
-- Looking for NCCL_VERSION_CODE - not found
-- Found NCCL (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libnccl.so.2.15.1)
-- NVTX is enabled.
-- Assign GPU architecture (sm=80)
-- Use WMMA
CMAKE_CUDA_FLAGS_RELEASE: -O3 -DNDEBUG -Xcompiler -O3 -DCUDA_PTX_FP8_F2FP_ENABLED --use_fast_math
-- COMMON_HEADER_DIRS: /workspace/code/FasterTransformer;/usr/local/cuda/include;/workspace/code/FasterTransformer/3rdparty/cutlass/include;/workspace/code/FasterTransformer/src/fastertransformer/cutlass_extensions/include;/workspace/code/FasterTransformer/3rdparty/trt_fp8_fmha/src;/workspace/code/FasterTransformer/3rdparty/trt_fp8_fmha/generated
-- Found CUDA: /usr/local/cuda (found version "11.8") 
-- Caffe2: CUDA detected: 11.8
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 11.8
-- Found cuDNN: v8.6.0  (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libcudnn.so)
-- /usr/local/cuda/lib64/libnvrtc.so shorthash is 672ee683
-- Added CUDA NVCC flags for: -gencode;arch=compute_80,code=sm_80
-- Found Torch: /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch.so  
-- USE_CXX11_ABI=True
-- The C compiler identification is GNU 9.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Found Python: /opt/conda/bin/python3.8 (found version "3.8.13") found components: Interpreter 
-- Configuring done
-- Generating done
-- Build files have been written to: /workspace/code/FasterTransformer/build

Then compile with make, using 12 parallel jobs to speed up the build:

make -j12

Output:

[  0%] Building CXX object src/fastertransformer/kernels/cutlass_kernels/CMakeFiles/cutlass_preprocessors.dir/cutlass_preprocessors.cc.o
[  0%] Building CXX object src/fastertransformer/utils/CMakeFiles/nvtx_utils.dir/nvtx_utils.cc.o
[  0%] Building CUDA object src/fastertransformer/kernels/CMakeFiles/layernorm_kernels.dir/layernorm_kernels.cu.o
[  0%] Building CXX object src/fastertransformer/utils/CMakeFiles/cuda_utils.dir/cuda_utils.cc.o
[  0%] Building CXX object src/fastertransformer/utils/CMakeFiles/logger.dir/logger.cc.o
[  1%] Building CXX object 3rdparty/common/CMakeFiles/cuda_driver_wrapper.dir/cudaDriverWrapper.cpp.o
[  1%] Building CUDA object src/fastertransformer/kernels/CMakeFiles/custom_ar_kernels.dir/custom_ar_kernels.cu.o
[  1%] Building CUDA object src/fastertransformer/kernels/CMakeFiles/add_residual_kernels.dir/add_residual_kernels.cu.o
[  1%] Building CUDA object src/fastertransformer/kernels/CMakeFiles/activation_kernels.dir/activation_kernels.cu.o
[  1%] Building CUDA object src/fastertransformer/kernels/CMakeFiles/transpose_int8_kernels.dir/transpose_int8_kernels.cu.o
[  2%] Building CUDA object src/fastertransformer/kernels/CMakeFiles/unfused_attention_kernels.dir/unfused_attention_kernels.cu.o
[  2%] Building CUDA object src/fastertransformer/kernels/CMakeFiles/bert_preprocess_kernels.dir/bert_preprocess_kernels.cu.o
[  2%] Linking CUDA device code CMakeFiles/cuda_driver_wrapper.dir/cmake_device_link.o
[  2%] Linking CXX static library ../../lib/libcuda_driver_wrapper.a
[  2%] Built target cuda_driver_wrapper
...
[100%] Linking CXX executable ../../../bin/gptneox_example
[100%] Built target gptj_triton_example
[100%] Building CXX object examples/cpp/multi_gpu_gpt/CMakeFiles/multi_gpu_gpt_triton_example.dir/multi_gpu_gpt_triton_example.cc.o
[100%] Built target gptj_example
[100%] Building CXX object examples/cpp/multi_gpu_gpt/CMakeFiles/multi_gpu_gpt_interactive_example.dir/multi_gpu_gpt_interactive_example.cc.o
[100%] Built target gptneox_example
[100%] Linking CXX executable ../../../bin/multi_gpu_gpt_example
[100%] Linking CXX executable ../../../bin/gptneox_triton_example
[100%] Built target multi_gpu_gpt_example
[100%] Built target gptneox_triton_example
[100%] Linking CXX executable ../../../bin/multi_gpu_gpt_triton_example
[100%] Linking CXX static library ../../../../lib/libth_t5.a
[100%] Built target th_t5
[100%] Built target multi_gpu_gpt_triton_example
[100%] Linking CXX executable ../../../bin/multi_gpu_gpt_async_example
[100%] Linking CXX executable ../../../bin/multi_gpu_gpt_interactive_example
[100%] Built target multi_gpu_gpt_async_example
[100%] Linking CXX static library ../../../../lib/libth_parallel_gpt.a
[100%] Built target th_parallel_gpt
[100%] Linking CXX shared library ../../../lib/libth_transformer.so
[100%] Built target multi_gpu_gpt_interactive_example
[100%] Built target th_transformer

FasterTransformer is now built.

Installing dependencies

Install the packages required for model inference.

cd /workspace/code/FasterTransformer
pip install -r examples/pytorch/gpt/requirement.txt -i https://pypi.tuna.tsinghua.edu.cn/simple --trusted-host pypi.tuna.tsinghua.edu.cn

Preparing the data and model

Model

This article uses the BLOOM model for the demonstration. BLOOM does not need learned positional encodings and allows the model to generate sequences longer than those used in training. BLOOM also has a structure similar to OpenAI GPT, so, as with OPT, FT provides BLOOM as a variant through the GPT class. Users can convert a pretrained Huggingface BLOOM model to the FasterTransformer file format with examples/pytorch/gpt/utils/huggingface_bloom_convert.py.

We use bloomz-560m as the base model. It was obtained by multitask fine-tuning bloom-560m on the xP3 dataset.

Download the model:

cd /workspace/model
git lfs clone https://huggingface.co/bigscience/bloomz-560m

Model files:

> ls -al bloomz-560m
total 2198796
drwxr-xr-x 4 root root       4096 Apr 25 16:50 .
drwxr-xr-x 4 root root       4096 Apr 26 07:06 ..
drwxr-xr-x 9 root root       4096 Apr 25 16:53 .git
-rw-r--r-- 1 root root       1489 Apr 25 16:50 .gitattributes
-rw-r--r-- 1 root root      24778 Apr 25 16:50 README.md
-rw-r--r-- 1 root root        715 Apr 25 16:50 config.json
drwxr-xr-x 4 root root       4096 Apr 25 16:50 logs
-rw-r--r-- 1 root root 1118459450 Apr 25 16:53 model.safetensors
-rw-r--r-- 1 root root 1118530423 Apr 25 16:53 pytorch_model.bin
-rw-r--r-- 1 root root         85 Apr 25 16:50 special_tokens_map.json
-rw-r--r-- 1 root root   14500438 Apr 25 16:50 tokenizer.json
-rw-r--r-- 1 root root        222 Apr 25 16:50 tokenizer_config.json

Dataset

This article uses the LAMBADA dataset, a dataset for natural language processing (NLP) tasks. It contains a large number of English passages and asks the model to predict the next word, that is, the final word of each passage; this is a language modeling task. LAMBADA passages are long and carry rich semantic context, which makes the dataset a good test set for evaluating language models.

Download the LAMBADA test set.

cd /workspace/data
wget -c https://github.com/cybertronai/bflm/raw/master/lambada_test.jsonl

The data format is as follows:

{"text": "In my palm is a clear stone, and inside it is a small ivory statuette. A guardian angel.

"Figured if you're going to be out at night getting hit by cars, you might as well have some backup."

I look at him, feeling stunned. Like this is some sort of sign. But as I stare at Harlin, his mouth curved in a confident grin, I don't care about signs"}
{"text": "Give me a minute to change and I'll meet you at the docks." She'd forced those words through her teeth.

"No need to change. We won't be that long."

Shane gripped her arm and started leading her to the dock.

"I can make it there on my own, Shane"}
...
{"text": ""Only one source I know of that would be likely to cough up enough money to finance a phony sleep research facility and pay people big bucks to solve crimes in their dreams," Farrell concluded dryly.

"What can I say?" Ellis unfolded his arms and widened his hands. "Your tax dollars at work."

Before Farrell could respond, Leila's voice rose from inside the house.

"No insurance?" she wailed. "What do you mean you don't have any insurance"}
{"text": "Helen's heart broke a little in the face of Miss Mabel's selfless courage. She thought that because she was old, her life was of less value than the others'. For all Helen knew, Miss Mabel had a lot more years to live than she did. "Not going to happen," replied Helen"}
{"text": "Preston had been the last person to wear those chains, and I knew what I'd see and feel if they were slipped onto my skin-the Reaper's unending hatred of me. I'd felt enough of that emotion already in the amphitheater. I didn't want to feel anymore.

"Don't put those on me," I whispered. "Please."

Sergei looked at me, surprised by my low, raspy please, but he put down the chains"}
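A file in this format can be read line by line with the standard json module. The sketch below assumes one JSON object per line with a "text" field, as in the samples above, and splits off the final word as the prediction target. The split rule and the function name are assumptions for illustration and may differ from what bloom_lambada.py does internally.

```python
import io
import json


def read_lambada(stream):
    """Yield (context, target_word) pairs from a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        text = json.loads(line)["text"]
        context, target = text.rsplit(" ", 1)  # the last word is what the model must predict
        yield context, target


# A shortened in-memory sample; with the real file, use
# open("/workspace/data/lambada_test.jsonl", encoding="utf-8") instead.
sample = io.StringIO('{"text": "Sergei looked at me, but he put down the chains"}\n')
examples = list(read_lambada(sample))
print(examples)
```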

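Model format conversion

To avoid the extra work of splitting the model for model parallelism, FasterTransformer provides a tool that splits models from different formats and converts them into the FasterTransformer binary file format. FasterTransformer can then load the model directly in binary form.

Convert the Huggingface Transformers weight files to the FasterTransformer format.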

cd /workspace/code/FasterTransformer

python examples/pytorch/gpt/utils/huggingface_bloom_convert.py \
    --input-dir /workspace/model/bloomz-560m \
    --output-dir /workspace/model/bloomz-560m-convert \
    --data-type fp16 \
    -tp 1 -v

Conversion output:

python examples/pytorch/gpt/utils/huggingface_bloom_convert.py 
>     --input-dir /workspace/model/bloomz-560m 
>     --output-dir /workspace/model/bloomz-560m-convert 
>     --data-type fp16 
>     -tp 1 -v

======================= Arguments =======================
 - input_dir...........: /workspace/model/bloomz-560m
 - output_dir..........: /workspace/model/bloomz-560m-convert
 - tensor_para_size....: 1
 - data_type...........: fp16
 - processes...........: 1
 - verbose.............: True
 - by_shard............: False
=========================================================
loading from pytorch bin format
model file num: 1
 - model.wte.......................................: shape (250880, 1024)     | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.wte.bin
 - model.pre_decoder_layernorm.weight..............: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.pre_decoder_layernorm.weight.bin
 - model.pre_decoder_layernorm.bias................: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.pre_decoder_layernorm.bias.bin
 - model.layers.0.input_layernorm.weight...........: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.0.input_layernorm.weight.bin
 - model.layers.0.input_layernorm.bias.............: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.0.input_layernorm.bias.bin
 - model.layers.0.attention.query_key_value.weight.: shape (1024, 3, 1024)  s | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.0.attention.query_key_value.weight.0.bin (0/1)
 - model.layers.0.attention.query_key_value.bias...: shape (3, 1024)        s | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.0.attention.query_key_value.bias.0.bin (0/1)
 - model.layers.0.attention.dense.weight...........: shape (1024, 1024)     s | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.0.attention.dense.weight.0.bin (0/1)
 - model.layers.0.attention.dense.bias.............: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.0.attention.dense.bias.bin
 - model.layers.0.post_attention_layernorm.weight..: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.0.post_attention_layernorm.weight.bin
 - model.layers.0.post_attention_layernorm.bias....: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.0.post_attention_layernorm.bias.bin
 - model.layers.0.mlp.dense_h_to_4h.weight.........: shape (1024, 4096)     s | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.0.mlp.dense_h_to_4h.weight.0.bin (0/1)
 - model.layers.0.mlp.dense_h_to_4h.bias...........: shape (4096,)          s | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.0.mlp.dense_h_to_4h.bias.0.bin (0/1)
...
rs.22.mlp.dense_4h_to_h.bias.bin
 - model.layers.23.input_layernorm.weight..........: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.23.input_layernorm.weight.bin
 - model.layers.23.input_layernorm.bias............: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.23.input_layernorm.bias.bin
 - model.layers.23.attention.query_key_value.weight: shape (1024, 3, 1024)  s | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.23.attention.query_key_value.weight.0.bin (0/1)
 - model.layers.23.attention.query_key_value.bias..: shape (3, 1024)        s | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.23.attention.query_key_value.bias.0.bin (0/1)
 - model.layers.23.attention.dense.weight..........: shape (1024, 1024)     s | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.23.attention.dense.weight.0.bin (0/1)
 - model.layers.23.attention.dense.bias............: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.23.attention.dense.bias.bin
 - model.layers.23.post_attention_layernorm.weight.: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.23.post_attention_layernorm.weight.bin
 - model.layers.23.post_attention_layernorm.bias...: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.23.post_attention_layernorm.bias.bin
 - model.layers.23.mlp.dense_h_to_4h.weight........: shape (1024, 4096)     s | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.23.mlp.dense_h_to_4h.weight.0.bin (0/1)
 - model.layers.23.mlp.dense_h_to_4h.bias..........: shape (4096,)          s | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.23.mlp.dense_h_to_4h.bias.0.bin (0/1)
 - model.layers.23.mlp.dense_4h_to_h.weight........: shape (4096, 1024)     s | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.23.mlp.dense_4h_to_h.weight.0.bin (0/1)
 - model.layers.23.mlp.dense_4h_to_h.bias..........: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.23.mlp.dense_4h_to_h.bias.bin
 - model.final_layernorm.weight....................: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.final_layernorm.weight.bin
 - model.final_layernorm.bias......................: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.final_layernorm.bias.bin
Checkpoint conversion (HF >> FT) has done (elapsed time: 17.07 sec)

The files produced by the conversion to the FasterTransformer format look like this:

> tree bloomz-560m-convert/
bloomz-560m-convert/
└── 1-gpu
    ├── config.ini
    ├── model.final_layernorm.bias.bin
    ├── model.final_layernorm.weight.bin
    ├── model.layers.0.attention.dense.bias.bin
    ├── model.layers.0.attention.dense.weight.0.bin
    ├── model.layers.0.attention.query_key_value.bias.0.bin
    ├── model.layers.0.attention.query_key_value.weight.0.bin
    ├── model.layers.0.input_layernorm.bias.bin
    ├── model.layers.0.input_layernorm.weight.bin
    ├── model.layers.0.mlp.dense_4h_to_h.bias.bin
    ├── model.layers.0.mlp.dense_4h_to_h.weight.0.bin
    ├── model.layers.0.mlp.dense_h_to_4h.bias.0.bin
    ├── model.layers.0.mlp.dense_h_to_4h.weight.0.bin
    ├── model.layers.0.post_attention_layernorm.bias.bin
    ├── model.layers.0.post_attention_layernorm.weight.bin
    ├── model.layers.1.attention.dense.bias.bin
    ...
    ├── model.layers.8.post_attention_layernorm.weight.bin
    ├── model.layers.9.attention.dense.bias.bin
    ├── model.layers.9.attention.dense.weight.0.bin
    ├── model.layers.9.attention.query_key_value.bias.0.bin
    ├── model.layers.9.attention.query_key_value.weight.0.bin
    ├── model.layers.9.input_layernorm.bias.bin
    ├── model.layers.9.input_layernorm.weight.bin
    ├── model.layers.9.mlp.dense_4h_to_h.bias.bin
    ├── model.layers.9.mlp.dense_4h_to_h.weight.0.bin
    ├── model.layers.9.mlp.dense_h_to_4h.bias.0.bin
    ├── model.layers.9.mlp.dense_h_to_4h.weight.0.bin
    ├── model.layers.9.post_attention_layernorm.bias.bin
    ├── model.layers.9.post_attention_layernorm.weight.bin
    ├── model.pre_decoder_layernorm.bias.bin
    ├── model.pre_decoder_layernorm.weight.bin
    └── model.wte.bin
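A quick sanity check on the converted files is to compare their sizes with the shapes printed in the conversion log. The sketch below assumes each .bin file is a raw, headerless array of the chosen data type; that is an assumption about the file format, so verify it before relying on the check. The helper names are hypothetical.

```python
import os

BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2}


def expected_file_bytes(shape, data_type: str) -> int:
    """Size of a raw array with the given shape and data type."""
    count = 1
    for dim in shape:
        count *= dim
    return count * BYTES_PER_ELEMENT[data_type]


def size_matches(path: str, shape, data_type: str) -> bool:
    """Compare a file's size on disk with the expected raw array size."""
    return os.path.getsize(path) == expected_file_bytes(shape, data_type)


# model.wte has shape (250880, 1024) in the conversion log and was saved as fp16.
print(expected_file_bytes((250880, 1024), "fp16"))  # 513802240
```

Calling size_matches("/workspace/model/bloomz-560m-convert/1-gpu/model.wte.bin", (250880, 1024), "fp16") would then flag a truncated or mis-typed file under the stated assumption.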

Model benchmarks

The official examples are used below to benchmark and compare the response time of Huggingface Transformers and FasterTransformer.

Huggingface Transformers benchmark

Command:

# Run HF benchmark
CUDA_VISIBLE_DEVICES=1 python examples/pytorch/gpt/bloom_lambada.py \
    --tokenizer-path /workspace/model/bloomz-560m \
    --dataset-path /workspace/data/lambada_test.jsonl \
    --lib-path build/lib/libth_transformer.so \
    --test-hf \
    --show-progress

Output:

python examples/pytorch/gpt/bloom_lambada.py 
>     --tokenizer-path /workspace/model/bloomz-560m 
>     --dataset-path /workspace/data/lambada_test.jsonl 
>     --lib-path bulid/lib/libth_transformer.so 
>     --test-hf 
>     --show-progress

=================== Arguments ===================
 - num_heads................: None
 - size_per_head............: None
 - inter_size...............: None
 - num_layers...............: None
 - vocab_size...............: None
 - tensor_para_size.........: 1
 - pipeline_para_size.......: 1
 - remove_padding...........: True
 - shared_contexts_ratio....: 1.0
 - batch_size...............: 8
 - output_length............: 32
 - beam_width...............: 1
 - top_k....................: 1
 - top_p....................: 1.0
 - temperature..............: 1.0
 - len_penalty..............: 0.0
 - beam_search_diversity_rate: 0.0
 - start_id.................: 0
 - end_id...................: 2
 - repetition_penalty.......: 1.0
 - random_seed..............: None
 - return_cum_log_probs.....: 0
 - checkpoint_path..........: None
 - dataset_path.............: /workspace/data/lambada_test.jsonl
 - output_path..............: None
 - tokenizer_path...........: /workspace/model/bloomz-560m
 - lib_path.................: bulid/lib/libth_transformer.so
 - test_hf..................: True
 - acc_threshold............: None
 - show_progress............: True
 - inference_data_type......: None
 - weights_data_type........: None
 - int8_mode................: 0
=================================================
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 645/645 [02:33<00:00,  4.21it/s]
Accuracy: 39.4722% (2034/5153) (elapsed time: 146.7230 sec)

FasterTransformer benchmark

Command:

# Run FT benchmark
python examples/pytorch/gpt/bloom_lambada.py \
    --checkpoint-path /workspace/model/bloomz-560m-convert/1-gpu \
    --tokenizer-path /workspace/model/bloomz-560m \
    --dataset-path /workspace/data/lambada_test.jsonl \
    --lib-path build/lib/libth_transformer.so \
    --show-progress

Note: you can also add --data-type fp16 to load the model in half precision and reduce GPU memory use.

Output:

python examples/pytorch/gpt/bloom_lambada.py 
>     --checkpoint-path /workspace/model/bloomz-560m-convert/1-gpu 
>     --tokenizer-path /workspace/model/bloomz-560m 
>     --dataset-path /workspace/data/lambada_test.jsonl 
>     --lib-path build/lib/libth_transformer.so 
>     --show-progress

=================== Arguments ===================
 - num_heads................: None
 - size_per_head............: None
 - inter_size...............: None
 - num_layers...............: None
 - vocab_size...............: None
 - tensor_para_size.........: 1
 - pipeline_para_size.......: 1
 - remove_padding...........: True
 - shared_contexts_ratio....: 1.0
 - batch_size...............: 8
 - output_length............: 32
 - beam_width...............: 1
 - top_k....................: 1
 - top_p....................: 1.0
 - temperature..............: 1.0
 - len_penalty..............: 0.0
 - beam_search_diversity_rate: 0.0
 - start_id.................: 0
 - end_id...................: 2
 - repetition_penalty.......: 1.0
 - random_seed..............: None
 - return_cum_log_probs.....: 0
 - checkpoint_path..........: /workspace/model/bloomz-560m-convert/1-gpu
 - dataset_path.............: /workspace/data/lambada_test.jsonl
 - output_path..............: None
 - tokenizer_path...........: /workspace/model/bloomz-560m
 - lib_path.................: build/lib/libth_transformer.so
 - test_hf..................: False
 - acc_threshold............: None
 - show_progress............: True
 - inference_data_type......: None
 - weights_data_type........: None
 - int8_mode................: 0
=================================================
[FT][INFO] Load BLOOM model
 - head_num.................: 16
 - size_per_head............: 64
 - layer_num................: 24
 - tensor_para_size.........: 1
 - vocab_size...............: 250880
 - start_id.................: 1
 - end_id...................: 2
 - weights_data_type........: fp16
 - layernorm_eps............: 1e-05
 - inference_data_type......: fp16
 - lib_path.................: build/lib/libth_transformer.so
 - pipeline_para_size.......: 1
 - shared_contexts_ratio....: 1.0
 - int8_mode................: 0
[WARNING] gemm_config.in is not found; using default GEMM algo
[FT][WARNING] Skip NCCL initialization since requested tensor/pipeline parallel sizes are equals to 1.
[FT][INFO] Device NVIDIA A800 80GB PCIe
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 645/645 [00:18<00:00, 34.58it/s]
Accuracy: 39.4722% (2034/5153) (elapsed time: 13.0032 sec)

Comparing Hugging Face Transformers and FasterTransformer

HF:    Accuracy: 39.4722% (2034/5153) (elapsed time: 146.7230 sec)
FT:    Accuracy: 39.4722% (2034/5153) (elapsed time: 13.0032 sec)

Both runs reach the same accuracy (39.4722%), but FasterTransformer finishes in 13.0 s versus 146.7 s for Hugging Face Transformers, roughly an 11x speedup in this test.
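The comparison can be reproduced from the two elapsed times and the sample count printed in the logs. The following is a small sketch; the helper names `speedup` and `throughput` are illustrative only.

```python
# Elapsed times and sample count copied from the HF and FT runs above.
NUM_SAMPLES = 5153
HF_ELAPSED_SEC = 146.7230
FT_ELAPSED_SEC = 13.0032


def speedup(baseline_sec: float, optimized_sec: float) -> float:
    """How many times faster the optimized run is than the baseline."""
    return baseline_sec / optimized_sec


def throughput(num_samples: int, elapsed_sec: float) -> float:
    """Samples processed per second."""
    return num_samples / elapsed_sec


if __name__ == "__main__":
    print(f"speedup: {speedup(HF_ELAPSED_SEC, FT_ELAPSED_SEC):.1f}x")
    print(f"HF throughput: {throughput(NUM_SAMPLES, HF_ELAPSED_SEC):.1f} samples/s")
    print(f"FT throughput: {throughput(NUM_SAMPLES, FT_ELAPSED_SEC):.1f} samples/s")
```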

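Model-parallel inference (multi-GPU)

Models as large as GPT-3 (175B) or OPT-175B do not fit on a single GPU, so inference has to run in a distributed (model-parallel) fashion. There are two ways to do this, tensor parallelism and pipeline parallelism; both were described earlier and are not repeated here.

Tensor parallelism

Model conversion

To split the model across multiple GPUs with tensor parallelism (TP), convert it with the command below, which produces a checkpoint for inference on 2 GPUs.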

python examples/pytorch/gpt/utils/huggingface_bloom_convert.py \
    --input-dir /workspace/model/bloomz-560m \
    --output-dir /workspace/model/bloomz-560m-convert \
    --data-type fp16 \
    -tp 2 -v

After conversion to FasterTransformer format with a tensor-parallel degree of 2, the files look like this:

tree /workspace/model/bloomz-560m-convert/2-gpu
/workspace/model/bloomz-560m-convert/2-gpu
├── config.ini
├── model.final_layernorm.bias.bin
├── model.final_layernorm.weight.bin
├── model.layers.0.attention.dense.bias.bin
├── model.layers.0.attention.dense.weight.0.bin
├── model.layers.0.attention.dense.weight.1.bin
├── model.layers.0.attention.query_key_value.bias.0.bin
├── model.layers.0.attention.query_key_value.bias.1.bin
├── model.layers.0.attention.query_key_value.weight.0.bin
├── model.layers.0.attention.query_key_value.weight.1.bin
├── model.layers.0.input_layernorm.bias.bin
├── model.layers.0.input_layernorm.weight.bin
├── model.layers.0.mlp.dense_4h_to_h.bias.bin
├── model.layers.0.mlp.dense_4h_to_h.weight.0.bin
├── model.layers.0.mlp.dense_4h_to_h.weight.1.bin
├── model.layers.0.mlp.dense_h_to_4h.bias.0.bin
├── model.layers.0.mlp.dense_h_to_4h.bias.1.bin
├── model.layers.0.mlp.dense_h_to_4h.weight.0.bin
├── model.layers.0.mlp.dense_h_to_4h.weight.1.bin
├── model.layers.0.post_attention_layernorm.bias.bin
├── model.layers.0.post_attention_layernorm.weight.bin
...
├── model.layers.9.attention.dense.bias.bin
├── model.layers.9.attention.dense.weight.0.bin
├── model.layers.9.attention.dense.weight.1.bin
├── model.layers.9.attention.query_key_value.bias.0.bin
├── model.layers.9.attention.query_key_value.bias.1.bin
├── model.layers.9.attention.query_key_value.weight.0.bin
├── model.layers.9.attention.query_key_value.weight.1.bin
├── model.layers.9.input_layernorm.bias.bin
├── model.layers.9.input_layernorm.weight.bin
├── model.layers.9.mlp.dense_4h_to_h.bias.bin
├── model.layers.9.mlp.dense_4h_to_h.weight.0.bin
├── model.layers.9.mlp.dense_4h_to_h.weight.1.bin
├── model.layers.9.mlp.dense_h_to_4h.bias.0.bin
├── model.layers.9.mlp.dense_h_to_4h.bias.1.bin
├── model.layers.9.mlp.dense_h_to_4h.weight.0.bin
├── model.layers.9.mlp.dense_h_to_4h.weight.1.bin
├── model.layers.9.post_attention_layernorm.bias.bin
├── model.layers.9.post_attention_layernorm.weight.bin
├── model.pre_decoder_layernorm.bias.bin
├── model.pre_decoder_layernorm.weight.bin
└── model.wte.bin

0 directories, 438 files
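The listing shows which tensors are split per rank (suffix .0 and .1) and which are stored once. As a sanity check, the sketch below enumerates the file names implied by that pattern. It assumes every layer has the same set of files as layers 0 and 9 in the listing, and takes the layer count of 24 from the layer_num value in the load log; `expected_files` is a hypothetical helper, not part of the converter.

```python
from typing import List

# Tensors the listing shows split across tensor-parallel ranks.
SPLIT_TENSORS = [
    "attention.dense.weight",
    "attention.query_key_value.bias",
    "attention.query_key_value.weight",
    "mlp.dense_4h_to_h.weight",
    "mlp.dense_h_to_4h.bias",
    "mlp.dense_h_to_4h.weight",
]
# Tensors the listing shows stored once per layer.
SHARED_TENSORS = [
    "attention.dense.bias",
    "input_layernorm.bias",
    "input_layernorm.weight",
    "mlp.dense_4h_to_h.bias",
    "post_attention_layernorm.bias",
    "post_attention_layernorm.weight",
]
# Files that appear once for the whole model.
GLOBAL_FILES = [
    "config.ini",
    "model.final_layernorm.bias.bin",
    "model.final_layernorm.weight.bin",
    "model.pre_decoder_layernorm.bias.bin",
    "model.pre_decoder_layernorm.weight.bin",
    "model.wte.bin",
]


def expected_files(num_layers: int, tp: int) -> List[str]:
    """Enumerate the file names implied by the directory listing."""
    files = list(GLOBAL_FILES)
    for layer in range(num_layers):
        prefix = f"model.layers.{layer}."
        files += [f"{prefix}{name}.bin" for name in SHARED_TENSORS]
        files += [
            f"{prefix}{name}.{rank}.bin"
            for name in SPLIT_TENSORS
            for rank in range(tp)
        ]
    return files


if __name__ == "__main__":
    # 6 global files + 24 layers * (6 shared + 6 split * 2 ranks) = 438,
    # which matches the "438 files" summary printed by tree.
    print(len(expected_files(num_layers=24, tp=2)))
```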

Tensor-parallel model inference

Run command:

mpirun -n 2 --allow-run-as-root python examples/pytorch/gpt/bloom_lambada.py \
    --checkpoint-path /workspace/model/bloomz-560m-convert/2-gpu \
    --tokenizer-path /workspace/model/bloomz-560m \
    --dataset-path /workspace/data/lambada_test.jsonl \
    --lib-path build/lib/libth_transformer.so \
    --tensor-para-size 2 \
    --pipeline-para-size 1 \
    --show-progress

Run output:

mpirun -n 2 --allow-run-as-root python examples/pytorch/gpt/bloom_lambada.py \
>     --checkpoint-path /workspace/model/bloomz-560m-convert/2-gpu \
>     --tokenizer-path /workspace/model/bloomz-560m \
>     --dataset-path /workspace/data/lambada_test.jsonl \
>     --lib-path build/lib/libth_transformer.so \
>     --tensor-para-size 2 \
>     --pipeline-para-size 1 \
>     --show-progress

=================== Arguments ===================
 - num_heads................: None
 - size_per_head............: None
 - inter_size...............: None
 - num_layers...............: None
 - vocab_size...............: None
 - tensor_para_size.........: 2
 - pipeline_para_size.......: 1
 - remove_padding...........: True
 - shared_contexts_ratio....: 1.0
 - batch_size...............: 8
 - output_length............: 32
 - beam_width...............: 1
 - top_k....................: 1
 - top_p....................: 1.0
 - temperature..............: 1.0
 - len_penalty..............: 0.0
 - beam_search_diversity_rate: 0.0
 - start_id.................: 0
 - end_id...................: 2
 - repetition_penalty.......: 1.0
 - random_seed..............: None
 - return_cum_log_probs.....: 0
 - checkpoint_path..........: /workspace/model/bloomz-560m-convert/2-gpu
 - dataset_path.............: /workspace/data/lambada_test.jsonl
 - output_path..............: None
 - tokenizer_path...........: /workspace/model/bloomz-560m
 - lib_path.................: build/lib/libth_transformer.so
 - test_hf..................: False
 - acc_threshold............: None
 - show_progress............: True
 - inference_data_type......: None
 - weights_data_type........: None
 - int8_mode................: 0
=================================================

=================== Arguments ===================
 - num_heads................: None
 - size_per_head............: None
 - inter_size...............: None
 - num_layers...............: None
 - vocab_size...............: None
 - tensor_para_size.........: 2
 - pipeline_para_size.......: 1
 - remove_padding...........: True
 - shared_contexts_ratio....: 1.0
 - batch_size...............: 8
 - output_length............: 32
 - beam_width...............: 1
 - top_k....................: 1
 - top_p....................: 1.0
 - temperature..............: 1.0
 - len_penalty..............: 0.0
 - beam_search_diversity_rate: 0.0
 - start_id.................: 0
 - end_id...................: 2
 - repetition_penalty.......: 1.0
 - random_seed..............: None
 - return_cum_log_probs.....: 0
 - checkpoint_path..........: /workspace/model/bloomz-560m-convert/2-gpu
 - dataset_path.............: /workspace/data/lambada_test.jsonl
 - output_path..............: None
 - tokenizer_path...........: /workspace/model/bloomz-560m
 - lib_path.................: build/lib/libth_transformer.so
 - test_hf..................: False
 - acc_threshold............: None
 - show_progress............: True
 - inference_data_type......: None
 - weights_data_type........: None
 - int8_mode................: 0
=================================================
[FT][INFO] Load BLOOM model
 - head_num.................: 16
 - size_per_head............: 64
 - layer_num................: 24
 - tensor_para_size.........: 2
 - vocab_size...............: 250880
 - start_id.................: 1
 - end_id...................: 2
 - weights_data_type........: fp16
 - layernorm_eps............: 1e-05
 - inference_data_type......: fp16
 - lib_path.................: build/lib/libth_transformer.so
 - pipeline_para_size.......: 1
 - shared_contexts_ratio....: 1.0
 - int8_mode................: 0
[FT][INFO] Load BLOOM model
 - head_num.................: 16
 - size_per_head............: 64
 - layer_num................: 24
 - tensor_para_size.........: 2
 - vocab_size...............: 250880
 - start_id.................: 1
 - end_id...................: 2
 - weights_data_type........: fp16
 - layernorm_eps............: 1e-05
 - inference_data_type......: fp16
 - lib_path.................: build/lib/libth_transformer.so
 - pipeline_para_size.......: 1
 - shared_contexts_ratio....: 1.0
 - int8_mode................: 0
world_size: 2
world_size: 2
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[FT][INFO] NCCL initialized rank=0 world_size=2 tensor_para=NcclParam[rank=0, world_size=2, nccl_comm=0x5556305627d0] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0x5556305d5d20]
[FT][INFO] Device NVIDIA A800 80GB PCIe
[FT][INFO] NCCL initialized rank=1 world_size=2 tensor_para=NcclParam[rank=1, world_size=2, nccl_comm=0x55b9600a9ca0] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0x55b96011cff0]
[FT][INFO] Device NVIDIA A800 80GB PCIe
/workspace/code/FasterTransformer/examples/pytorch/gpt/utils/gpt.py SyntaxWarning: assertion is always true, perhaps remove parentheses?
  assert(self.pre_embed_idx < self.post_embed_idx, "Pre decoder embedding index should be lower than post decoder embedding index.")
  0%|          | 0/645 [00:00
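The SyntaxWarning in the output above is harmless for this run, but it points at a common Python pitfall: wrapping the condition and the message of an assert in parentheses turns them into a tuple, and a non-empty tuple is always truthy. The sketch below reproduces the effect with made-up index values; `check` is a hypothetical function showing the intended form, not a patch to gpt.py.

```python
# Made-up values that deliberately violate "pre < post".
pre_embed_idx, post_embed_idx = 5, 3

# A parenthesized (condition, message) pair is a 2-tuple. Any non-empty tuple
# is truthy, so asserting on it can never fail, whatever the condition is.
as_tuple = (pre_embed_idx < post_embed_idx, "index check message")
print(bool(as_tuple))  # truthy even though the condition itself is False


def check(pre: int, post: int) -> None:
    # Intended form: condition and message separated by a comma, no parentheses.
    assert pre < post, (
        "Pre decoder embedding index should be lower than "
        "post decoder embedding index."
    )


try:
    check(pre_embed_idx, post_embed_idx)
except AssertionError as err:
    print("caught:", err)
```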

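Pipeline parallelism

Model conversion

If you use only pipeline parallelism without tensor parallelism, set tp to 1. To combine tensor and pipeline parallelism, set tp to the tensor-parallel degree. For the exact command, see the model conversion section above.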
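When choosing tp and the pipeline size, it helps to see how many processes are needed and which group each process lands in. The sketch below assumes the number of MPI processes equals tp times pp, and that ranks are assigned to tensor-parallel groups first (rank % tp) and pipeline stages second (rank // tp). That mapping agrees with the NCCL log lines of the two-GPU runs in this article, but treat it as an illustrative assumption for larger layouts; `rank_layout` is a hypothetical helper.

```python
from typing import Dict, List


def rank_layout(tp: int, pp: int) -> List[Dict[str, int]]:
    """Map each global rank to its tensor-parallel and pipeline-parallel rank.

    Assumption: world_size == tp * pp, tensor-parallel ranks vary fastest.
    """
    world_size = tp * pp
    return [
        {
            "rank": rank,
            "tensor_para_rank": rank % tp,
            "pipeline_para_rank": rank // tp,
        }
        for rank in range(world_size)
    ]


if __name__ == "__main__":
    # Pipeline parallelism only (tp=1, pp=2) needs 2 processes.
    for entry in rank_layout(tp=1, pp=2):
        print(entry)
    # Combined layout (tp=2, pp=2) would need 4 processes.
    for entry in rank_layout(tp=2, pp=2):
        print(entry)
```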

Pipeline-parallel model inference

Run command:

CUDA_VISIBLE_DEVICES=1,2 mpirun -n 2 --allow-run-as-root python examples/pytorch/gpt/bloom_lambada.py \
    --checkpoint-path /workspace/model/bloomz-560m-convert/1-gpu \
    --tokenizer-path /workspace/model/bloomz-560m \
    --dataset-path /workspace/data/lambada_test.jsonl \
    --lib-path build/lib/libth_transformer.so \
    --tensor-para-size 1 \
    --pipeline-para-size 2 \
    --batch-size 1 \
    --show-progress

Run output:

CUDA_VISIBLE_DEVICES=1,2 mpirun -n 2 --allow-run-as-root python examples/pytorch/gpt/bloom_lambada.py \
>     --checkpoint-path /workspace/model/bloomz-560m-convert/1-gpu \
>     --tokenizer-path /workspace/model/bloomz-560m \
>     --dataset-path /workspace/data/lambada_test.jsonl \
>     --lib-path build/lib/libth_transformer.so \
>     --tensor-para-size 1 \
>     --pipeline-para-size 2 \
>     --batch-size 1 \
>     --show-progress

=================== Arguments ===================
 - num_heads................: None
 - size_per_head............: None
 - inter_size...............: None
 - num_layers...............: None
 - vocab_size...............: None
 - tensor_para_size.........: 1
 - pipeline_para_size.......: 2
 - remove_padding...........: True
 - shared_contexts_ratio....: 1.0
 - batch_size...............: 1
 - output_length............: 32
 - beam_width...............: 1
 - top_k....................: 1
 - top_p....................: 1.0
 - temperature..............: 1.0
 - len_penalty..............: 0.0
 - beam_search_diversity_rate: 0.0
 - start_id.................: 0
 - end_id...................: 2
 - repetition_penalty.......: 1.0
 - random_seed..............: None
 - return_cum_log_probs.....: 0
 - checkpoint_path..........: /workspace/model/bloomz-560m-convert/1-gpu
 - dataset_path.............: /workspace/data/lambada_test.jsonl
 - output_path..............: None
 - tokenizer_path...........: /workspace/model/bloomz-560m
 - lib_path.................: build/lib/libth_transformer.so
 - test_hf..................: False
 - acc_threshold............: None
 - show_progress............: True
 - inference_data_type......: None
 - weights_data_type........: None
 - int8_mode................: 0
=================================================

=================== Arguments ===================
 - num_heads................: None
 - size_per_head............: None
 - inter_size...............: None
 - num_layers...............: None
 - vocab_size...............: None
 - tensor_para_size.........: 1
 - pipeline_para_size.......: 2
 - remove_padding...........: True
 - shared_contexts_ratio....: 1.0
 - batch_size...............: 1
 - output_length............: 32
 - beam_width...............: 1
 - top_k....................: 1
 - top_p....................: 1.0
 - temperature..............: 1.0
 - len_penalty..............: 0.0
 - beam_search_diversity_rate: 0.0
 - start_id.................: 0
 - end_id...................: 2
 - repetition_penalty.......: 1.0
 - random_seed..............: None
 - return_cum_log_probs.....: 0
 - checkpoint_path..........: /workspace/model/bloomz-560m-convert/1-gpu
 - dataset_path.............: /workspace/data/lambada_test.jsonl
 - output_path..............: None
 - tokenizer_path...........: /workspace/model/bloomz-560m
 - lib_path.................: build/lib/libth_transformer.so
 - test_hf..................: False
 - acc_threshold............: None
 - show_progress............: True
 - inference_data_type......: None
 - weights_data_type........: None
 - int8_mode................: 0
=================================================
[FT][INFO] Load BLOOM model
 - head_num.................: 16
 - size_per_head............: 64
 - layer_num................: 24
 - tensor_para_size.........: 1
 - vocab_size...............: 250880
 - start_id.................: 1
 - end_id...................: 2
 - weights_data_type........: fp16
 - layernorm_eps............: 1e-05
 - inference_data_type......: fp16
 - lib_path.................: build/lib/libth_transformer.so
 - pipeline_para_size.......: 2
 - shared_contexts_ratio....: 1.0
 - int8_mode................: 0
[FT][INFO] Load BLOOM model
 - head_num.................: 16
 - size_per_head............: 64
 - layer_num................: 24
 - tensor_para_size.........: 1
 - vocab_size...............: 250880
 - start_id.................: 1
 - end_id...................: 2
 - weights_data_type........: fp16
 - layernorm_eps............: 1e-05
 - inference_data_type......: fp16
 - lib_path.................: build/lib/libth_transformer.so
 - pipeline_para_size.......: 2
 - shared_contexts_ratio....: 1.0
 - int8_mode................: 0
world_size: 2
world_size: 2
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[FT][INFO] NCCL initialized rank=0 world_size=2 tensor_para=NcclParam[rank=0, world_size=1, nccl_comm=0x5557a53dc1b0] pipeline_para=NcclParam[rank=0, world_size=2, nccl_comm=0x5557a5444df0]
[FT][INFO] NCCL initialized rank=1 world_size=2 tensor_para=NcclParam[rank=0, world_size=1, nccl_comm=0x560cf3452820] pipeline_para=NcclParam[rank=1, world_size=2, nccl_comm=0x560cf34bb190]
[FT][INFO] Device NVIDIA A800 80GB PCIe
[FT][INFO] Device NVIDIA A800 80GB PCIe
100%|██████████| 5153/5153 [01:51<00:00, 46.12it/s]
current process id: 47861   Accuracy: 39.4527% (2033/5153) (elapsed time: 102.1145 sec)
current process id: 47862   Accuracy: 39.4527% (2033/5153) (elapsed time: 102.3391 sec)

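Comparing single-GPU, pipeline-parallel, and tensor-parallel inference

The following is a simple test of single-GPU, tensor-parallel, and pipeline-parallel inference with a batch size of 1. The numbers are for reference only: other training jobs were running on the machine during the test and may have affected the results.

TP=1, PP=1, batch size 1:

Cumulative response time: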
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5153/5153 [02:21<00:00, 36.31it/s]
current process id: 47645   Accuracy: 39.4527% (2033/5153) (elapsed time: 132.2274 sec)

GPU memory usage:
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      8356      C   python                           1740MiB |
+-----------------------------------------------------------------------------+

TP=2, PP=1, batch size 1:

Cumulative response time:
100%|██████████| 5153/5153 [00:35<00:00, 144.80it/s]
current process id: 49111   Accuracy: 39.4916% (2035/5153) (elapsed time: 26.1384 sec)
current process id: 49112   Accuracy: 39.4916% (2035/5153) (elapsed time: 26.5110 sec)

GPU memory usage:
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A     41339      C   python                           1692MiB |
|    2   N/A  N/A     41340      C   python                           1692MiB |
+-----------------------------------------------------------------------------+

TP=1, PP=2, batch size 1:

Cumulative response time:
100%|██████████| 5153/5153 [00:33<00:00, 153.92it/s]
current process id: 48755   Accuracy: 39.4527% (2033/5153) (elapsed time: 24.1695 sec)
current process id: 48754   Accuracy: 39.4527% (2033/5153) (elapsed time: 24.4391 sec)

GPU memory usage:
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A      4001      C   python                           1952MiB |
|    2   N/A  N/A      4002      C   python                           1952MiB |
+-----------------------------------------------------------------------------+

TP=1, PP=3, batch size 1:

Cumulative response time:
100%|██████████| 5153/5153 [00:33<00:00, 152.46it/s]
current process id: 48220   Accuracy: 0.0000% (0/5153) (elapsed time: 24.9212 sec)
100%|██████████| 5153/5153 [00:33<00:00, 153.63it/s]
current process id: 48219   Accuracy: 39.4527% (2033/5153) (elapsed time: 24.9767 sec)
current process id: 48221   Accuracy: 39.4527% (2033/5153) (elapsed time: 24.3489 sec)

GPU memory usage:
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     57588      C   python                           1420MiB |
|    1   N/A  N/A     57589      C   python                           1468MiB |
|    2   N/A  N/A     57590      C   python                           1468MiB |
+-----------------------------------------------------------------------------+
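To compare configurations without reading the logs by eye, the result lines can be parsed with a regular expression. This is a minimal sketch based only on the line format visible in the outputs above; `parse_result` is a hypothetical helper, and the sample lines are copied from the runs in this section.

```python
import re
from typing import Dict, Optional

# Matches lines such as:
# "Accuracy: 39.4527% (2033/5153) (elapsed time: 132.2274 sec)"
RESULT_RE = re.compile(
    r"Accuracy:\s*(?P<acc>[\d.]+)%\s*"
    r"\((?P<correct>\d+)/(?P<total>\d+)\)\s*"
    r"\(elapsed time:\s*(?P<sec>[\d.]+) sec\)"
)


def parse_result(line: str) -> Optional[Dict[str, float]]:
    """Extract accuracy, counts, and elapsed time from one result line."""
    match = RESULT_RE.search(line)
    if match is None:
        return None
    return {
        "accuracy": float(match["acc"]),
        "correct": int(match["correct"]),
        "total": int(match["total"]),
        "elapsed_sec": float(match["sec"]),
    }


if __name__ == "__main__":
    # One result line per configuration, copied from the logs above.
    runs = {
        "TP=1, PP=1": "current process id: 47645   Accuracy: 39.4527% (2033/5153) (elapsed time: 132.2274 sec)",
        "TP=2, PP=1": "current process id: 49111   Accuracy: 39.4916% (2035/5153) (elapsed time: 26.1384 sec)",
        "TP=1, PP=2": "current process id: 48755   Accuracy: 39.4527% (2033/5153) (elapsed time: 24.1695 sec)",
    }
    for name, line in runs.items():
        result = parse_result(line)
        print(name, result["accuracy"], result["elapsed_sec"])
```

Keeping the parsed values in a dictionary makes it easy to add further runs (for example, a different batch size) and compare them side by side.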

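Conclusion

This article briefly introduced the basic concepts of FasterTransformer and showed how to use it for single-machine and distributed model inference. Hopefully it helps you get up to speed with FasterTransformer quickly.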

Reviewed and edited by: Peng Jing
Original title: A Good Companion for Large Models: A Brief Look at the Inference Acceleration Engine FasterTransformer

Source: the WeChat official account "Deep Learning Natural Language Processing" (WeChat ID: zenRRan).
