部署模型到飛騰派
本章作為一個這幾天,我嘗試將訓(xùn)練的佩奇檢測模型部署到飛騰派的階段總結(jié),在部署的過程中,我的計劃是分三步走:
- 在第三章的基礎(chǔ)上,直接遷移模型到飛騰派,通過python測試模型是否正常使用;
- 使用 C++ 將 TensorFlow Lite 移植到飛騰派,分別在 PC 端和飛騰派上使用 sample/image_label 進(jìn)行測試;
- 開發(fā)基于 TensorFlow Lite 的應(yīng)用程序,完成佩奇檢測模型的部署;
第一個部分,比較容易,因為我已經(jīng)在第一章編譯了 jammy 到飛騰派,這部分的工作可以認(rèn)為就是第二章和第三章測試模型的工作在飛騰派進(jìn)行一個重復(fù),步驟幾乎是一樣的,我對比了 PC 和飛騰派使用同一個模型進(jìn)行檢測的時間,可以對比看下:
飛騰派耗時大概 9s
PC端耗時大概 2s
第二部分,分別在PC和飛騰派測試 TensorFlow Lite 的簡單模型例程,這里我選擇的例程是label_image。
2.1 首先是在 PC 端編譯 TensorFlow Lite,這部分參考[官網(wǎng)的 build_cmake],主要執(zhí)行的命令是,我這里執(zhí)行的命令是在 tensorflow 的源碼目錄:
mkdir tflite_pc; cd tflite_pc;
cmake --build ../tensorflow/lite/ -j -t label_image
根據(jù)實際的網(wǎng)絡(luò)環(huán)境(保證可以正常下載部分文件),我作了部分修改,
diff --git a/tensorflow/workspace3.bzl b/tensorflow/workspace3.bzl
index af1613994a7..baba25d3d4d 100644
--- a/tensorflow/workspace3.bzl
+++ b/tensorflow/workspace3.bzl
@@ -55,7 +55,7 @@ def workspace():
name = "rules_jvm_external",
strip_prefix = "rules_jvm_external-%s" % RULES_JVM_EXTERNAL_TAG,
sha256 = "6274687f6fc5783b589f56a2f1ed60de3ce1f99bc4e8f9edef3de43bdf7c6e74",
- url = "https://github.com/bazelbuild/rules_jvm_external/archive/%s.zip" % RULES_JVM_EXTERNAL_TAG,
+ url = "https://hub.gitmirror.com/https://github.com/bazelbuild/rules_jvm_external/archive/%s.zip" % RULES_JVM_EXTERNAL_TAG,
)
# Load the raw llvm-project. llvm does not have build rules set up by default,
編譯完成后,進(jìn)入 examples/label_image/ 目錄,從網(wǎng)絡(luò)下載[mobilenet_v1_1.0_224.tgz],進(jìn)行測試,還需要一個 labels.txt 文件,這里列出了目標(biāo)的類別,這個文件的內(nèi)容大概是這樣的:
dummy
tench
goldfish
great white shark
tiger shark
hammerhead
electric ray
stingray
cock
hen
...
下面使用這個文件[tensorflow/lite/examples/label_image/testdata/grace_hopper.bmp] 進(jìn)行測試,
![grace_hopper.bmp]
測試命令及其結(jié)果為:
? ./label_image -m mobilenet_quant_v1_224.tflite -l labels.txt grace_hopper.bmp
INFO: Loaded model mobilenet_quant_v1_224.tflite
INFO: resolved reporter
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: invoked
INFO: average time: 5.001 ms
INFO: 0.796078: 653 military uniform
INFO: 0.0901961: 907 Windsor tie
INFO: 0.0156863: 458 bow tie
INFO: 0.0117647: 466 bulletproof vest
INFO: 0.00392157: 922 book jacket
可以看出正常檢測了出來。
2.2 下一步就是在 PC 上交叉編譯,然后將 label_image 放在飛騰派上運(yùn)行。交叉編譯的,這里需要的一個交叉工具鏈的配置文件:
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_PROCESSOR aarch64)
set(TOOLCHAIN_PATH /home/red/.local/gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu/)
set(CMAKE_C_COMPILER ${TOOLCHAIN_PATH}/bin/aarch64-none-linux-gnu-gcc)
set(CMAKE_CXX_COMPILER ${TOOLCHAIN_PATH}/bin/aarch64-none-linux-gnu-g++)
然后執(zhí)行下面的命令進(jìn)行構(gòu)建
mkdir tflite_build; cd tflite_build;
cmake -DCMAKE_TOOLCHAIN_FILE=cross.txt ../tensorflow/lite/ -j -t label_image
make -j10
交叉編譯之后,看下 label_image 的動態(tài)庫依賴:
? aarch64-none-linux-gnu-readelf -d examples/label_image/label_image | head
Dynamic section at offset 0x523e90 contains 31 entries:
Tag Type Name/Value
0x0000000000000001 (NEEDED) Shared library: [librt.so.1]
0x0000000000000001 (NEEDED) Shared library: [libdl.so.2]
0x0000000000000001 (NEEDED) Shared library: [libpthread.so.0]
0x0000000000000001 (NEEDED) Shared library: [libm.so.6]
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
這里,看到都是常規(guī)的依賴庫,將相關(guān)的模型文件,labels.txt,圖片文件等發(fā)送到飛騰派進(jìn)行測試:
red@phytiumpi:~$ ./label_image -m mobilenet_quant_v1_224.tflite -l labels.txt grace_hopper.bmp
INFO: Loaded model mobilenet_quant_v1_224.tflite
INFO: resolved reporter
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: invoked
INFO: average time: 34.457 ms
INFO: 0.768627: 653 military uniform
INFO: 0.105882: 907 Windsor tie
INFO: 0.0196078: 458 bow tie
INFO: 0.0117647: 466 bulletproof vest
INFO: 0.00784314: 835 suit
可以看到平均時間是34ms而PC端大概5ms,差距還是比較明顯的。
第三部分就有點復(fù)雜了,首先是模型導(dǎo)出,使用下面腳本,將佩奇檢測模型導(dǎo)出為tflite格式。
#!/bin/python3.8
import tensorflow as tf
saved_model_dir="saved_model"
# Convert the model
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir) # path to the SavedModel directory
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.experimental_new_converter=True
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS, tf.lite.OpsSet.SELECT_TF_OPS]
tflite_model = converter.convert()
# Save the model.
with open('model.tflite', 'wb') as f:
f.write(tflite_model)
在第三章訓(xùn)練出的模型目錄下執(zhí)行上述轉(zhuǎn)換命令,可以在當(dāng)前目錄下得到 model.tflite 文件,打印是這樣的:
? ./convert_tflite.py
2023-12-19 18:59:58.400519: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round
-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-19 18:59:58.417367: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-19 18:59:58.556611: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-19 18:59:58.557891: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critica
l operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-19 18:59:59.146661: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-12-19 19:00:00.078886: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:268] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2023-12-19 19:00:00.078951: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:168] retrieving CUDA diagnostic information for host: fedora
2023-12-19 19:00:00.078955: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:175] hostname: fedora
2023-12-19 19:00:00.079032: I tensorflow/colempiler/xla/stream_executor/cuda/cuda_diagnostics.cc:199] libcuda reported version is: 535.146.2
2023-12-19 19:00:00.079041: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:203] kernel reported version is: 535.146.2
2023-12-19 19:00:00.079043: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:309] kernel version seems to match DSO: 535.146.2
2023-12-19 19:00:06.622272: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:364] Ignored output_format.
2023-12-19 19:00:06.622301: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:367] Ignored drop_control_dependency.
2023-12-19 19:00:06.622673: I tensorflow/cc/saved_model/reader.cc:45] Reading SavedModel from: saved_model
2023-12-19 19:00:06.677554: I tensorflow/cc/saved_model/reader.cc:91] Reading meta graph with tags { serve }
2023-12-19 19:00:06.677584: I tensorflow/cc/saved_model/reader.cc:132] Reading SavedModel debug info (if present) from: saved_model
2023-12-19 19:00:06.777090: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:375] MLIR V1 optimization pass is not enabled
2023-12-19 19:00:06.810628: I tensorflow/cc/saved_model/loader.cc:231] Restoring SavedModel bundle.
2023-12-19 19:00:07.527629: I tensorflow/cc/saved_model/loader.cc:215] Running initialization op on SavedModel bundle at path: saved_model
2023-12-19 19:00:07.821400: I tensorflow/cc/saved_model/loader.cc:314] SavedModel load for tags { serve }; Status: success: OK. Took 1198726 microseconds.
2023-12-19 19:00:08.118777: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:255] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-12-19 19:00:14.333479: W tensorflow/compiler/mlir/lite/flatbuffer_export.cc:2073] TFLite interpreter needs to link Flex delegate in order to run the model since it contains the following Select TFop(s):
Flex ops: FlexStridedSlice, FlexTensorListReserve, FlexTensorListSetItem, FlexTensorListStack
Details:
tf.StridedSlice(tensor< ?x?x3xf32 >, tensor< 4xi32 >, tensor< 4xi32 >, tensor< 4xi32 >) - > (tensor< 1x?x?x3xf32 >) : {begin_mask = 14 : i64, device = "", ellipsis_mask = 0 : i64, end_mask = 14 : i64, new_axis_mask = 1 : i64, shrink_axis_mask = 0 : i64}
tf.TensorListReserve(tensor< i32 >, tensor< i32 >) - > (tensor< !tf_type.variant< tensor< *xf32 >> >) : {device = ""}
tf.TensorListReserve(tensor< i32 >, tensor< i32 >) - > (tensor< !tf_type.variant< tensor< *xi32 >> >) : {device = ""}
tf.TensorListSetItem(tensor< !tf_type.variant< tensor< *xf32 >> >, tensor< i32 >, tensor< ?x?x3xf32 >) - > (tensor< !tf_type.variant< tensor< *xf32 >> >) : {device = "", resize_if_index_out_of_bounds = false}
tf.TensorListSetItem(tensor< !tf_type.variant< tensor< *xi32 >> >, tensor< i32 >, tensor< 3xi32 >) - > (tensor< !tf_type.variant< tensor< *xi32 >> >) : {device = "", resize_if_index_out_of_bounds = false}
tf.TensorListStack(tensor< !tf_type.variant< tensor< *xf32 >> >, tensor< 3xi32 >) - > (tensor< 1x512x512x3xf32 >) : {device = "", num_elements = 1 : i64}
tf.TensorListStack(tensor< !tf_type.variant< tensor< *xi32 >> >, tensor< 1xi32 >) - > (tensor< 1x3xi32 >) : {device = "", num_elements = 1 : i64}
See instructions: https://www.tensorflow.org/lite/guide/ops_select
這里有很重要的一段內(nèi)容:
TFLite interpreter needs to link Flex delegate in order to run the model since it contains the following Select TFop(s):
Flex ops: FlexStridedSlice, FlexTensorListReserve, FlexTensorListSetItem, FlexTensorListStack ,這段話提示我們這個模型使用了部分算子需要 Flex 代理,之前使用 cmake 編譯出來的 TensorFlow Lite 并沒有包含 Flex 代理,所以這個模型是無法被前面編譯的 TensorFlow Lite 解析的,這里我使用 TensorFlow 的 minimal 工具嘗試解析這個模型,得到的結(jié)果如下:
可以看到因為缺少部分算子,導(dǎo)致無法解析這個我們轉(zhuǎn)換的佩奇檢測模型,經(jīng)過一段時間的搜索發(fā)現(xiàn)了解決方法 [Adding Select Tf Ops to Cmake]。簡單來說,首先通過bazel工具(我嘗試強(qiáng)制使用7.0.0發(fā)現(xiàn)無法構(gòu)建成功,后來還是使用bazel-6.1.0才正常構(gòu)建了出來) 使用如下命令構(gòu)建出來庫文件。
bazel build -c opt --config=monolithic tensorflow/lite/delegates/flex:tensorflowlite_flex
然后修改 minimal 工具的 CMakeLists.txt 文件,diff 文件為:
diff --git a/tensorflow/lite/examples/minimal/CMakeLists.txt b/tensorflow/lite/examples/minimal/CMakeLists.txt
index 7f8301162bb..1dd8ae05089 100644
--- a/tensorflow/lite/examples/minimal/CMakeLists.txt
+++ b/tensorflow/lite/examples/minimal/CMakeLists.txt
@@ -35,10 +35,14 @@ add_subdirectory(
EXCLUDE_FROM_ALL
)
+find_library(TF_LIB_FLEX tensorflowlite_flex HINTS "${TENSORFLOW_SOURCE_DIR}/bazel-bin/tensorflow/lite/delegates/flex/")
+
set(CMAKE_CXX_STANDARD 17)
add_executable(minimal
minimal.cc
)
target_link_libraries(minimal
+ -Wl,--no-as-needed # Need --no-as-needed to link tensorflowlite_flex
tensorflow-lite
+ ${TF_LIB_FLEX{TF_LIB_FLEX}
)
構(gòu)建成功后,會在目錄中看到對應(yīng)的庫文件:
? ll bazel-bin/tensorflow/lite/delegates/flex/libtensorflowlite_flex.so
-r-xr-xr-x. 1 red red 194047768 Dec 19 16:10 bazel-bin/tensorflow/lite/delegates/flex/libtensorflowlite_flex.so
使用命令
cmake ../tensorflow_src/tensorflow/lite/examples/minimal;
cmake --build . -j;
重新構(gòu)建 minimal 工具,然后再次測試:
? ldd minimal
linux-vdso.so.1 (0x00007ffdab5e4000)
libtensorflowlite_flex.so = > /home/red/Projects/ai_track_feiteng/demo_tflite/tensorflow_src/bazel-bin/tensorflow/lite/delegates/flex/libtensorflowlite_flex.so (0x00007f8a00800000)
libm.so.6 = > /lib64/libm.so.6 (0x00007f8a0c189000)
libstdc++.so.6 = > /lib64/libstdc++.so.6 (0x00007f8a00400000)
libgcc_s.so.1 = > /lib64/libgcc_s.so.1 (0x00007f8a0c165000)
libc.so.6 = > /lib64/libc.so.6 (0x00007f8a00222000)
/lib64/ld-linux-x86-64.so.2 (0x00007f8a0c287000)
? ./minimal model.tflite
2023-12-20 08:51:21.278490: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
INFO: Created TensorFlow Lite delegate for select TF ops.
2023-12-20 08:51:21.300452: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO: TfLiteFlexDelegate delegate: 4 nodes delegated out of 21284 nodes with 2 partitions.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
WARNING: Attempting to use a delegate that only supports static-sized tensors with a graph that has dynamic-sized tensors (tensor#394 is a dynamic-sized tensor).
=== Pre-invoke Interpreter State ===
Interpreter has 5 subgraphs.
-----------Subgraph-0 has 164547 tensors and 21286 nodes------------
1 Inputs: [0] - > 3B (0.00MB)
8 Outputs: [41871,42025,41990,41891,42078,42060,42007,42043] - > 982104B (0.94MB)
Tensor ID Name Type AllocType Size (Bytes/MB) Shape MemAddr-Offset
Tensor 0 serving_default_input_... kTfLiteUInt8 kTfLiteArenaRw 3 / 0.00 [1,1,1,3] [0, 3)
Tensor 1 Postprocessor/Decode/g... kTfLiteFloat32 kTfLiteMmapRo 196416 / 0.19 [49104] [4628736, 4825152)
...... 省略
--------------Subgraph-4 dump has completed--------------
--------------Memory Arena Status Start--------------
Total memory usage: 65676804 bytes (62.634 MB)
- Total arena memory usage: 56230520 bytes (53.626 MB)
- Total dynamic memory usage: 9446284 bytes (9.009 MB)
Subgraph#0 Arena (Normal) 56098956 (85.42%)
Subgraph#0 Arena (Persistent) 131112 (0.20%)
Subgraph#0 Dyanmic Tensors 9100 (0.01%)
Subgraph#1 Dyanmic Tensors 3145728 (4.79%)
Subgraph#2 Arena (Normal) 452 (0.00%)
Subgraph#4 Dyanmic Tensors 6291456 (9.58%)
--------------Memory Arena Status End--------------
至此,就完成了在 PC 端使用 C++ 對我們佩奇模型的檢測。下面就是關(guān)鍵的在飛騰派上使用 C++ 通過 tensorflowlite 完成對模型的檢測和調(diào)用。
這里參照在 PC 端的過程經(jīng)驗,首先使用 Bazel 交叉編譯出飛騰派對應(yīng)的庫文件,這個過程可以參照 Bazel 的官方文檔了解如何配置工具鏈,然后配置到 tensorflow 的倉庫。這里注意的是要添加額外的編譯參數(shù):
build:android_arm64 --cxxopt=-fPIC
build:android_arm64 --copt=-fPIC
否則的話在編譯最后階段會提示類似如下的錯誤:
在PC上經(jīng)過一段時間的編譯,最后編譯成功的截圖為:
檢查下庫文件的類型:
? file bazel-bin/tensorflow/lite/delegates/flex/libtensorflowlite_flex.so
bazel-bin/tensorflow/lite/delegates/flex/libtensorflowlite_flex.so: ELF 64-bit LSB shared object, ARM aarch64, version 1 (SYSV), dynamically linked, stripped
可以看到 aarch64 版本的類型,然后參照[tensorflow/lite/examples/minimal/README.md]交叉編譯出 minimal 例程進(jìn)行測試。
? aarch64-none-linux-gnu-readelf -d minimal | head -n 12
Dynamic section at offset 0x4ecea0 contains 33 entries:
Tag Type Name/Value
0x0000000000000001 (NEEDED) Shared library: [libtensorflowlite_flex.so]
0x0000000000000001 (NEEDED) Shared library: [librt.so.1]
0x0000000000000001 (NEEDED) Shared library: [libdl.so.2]
0x0000000000000001 (NEEDED) Shared library: [libpthread.so.0]
0x0000000000000001 (NEEDED) Shared library: [libm.so.6]
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x000000000000000f (RPATH) Library rpath: [/home/red/Projects/ai_track_feiteng/demo_tflite/tensorflow_src_for_aarch64/bazel-bin/tensorflow/lite/delegates/flex]
┏─?[red]?─?[17:51:57]?─?[0]
┗─?[~/Projects/ai_track_feiteng/demo_tflite/tensorflow_src_for_aarch64/tflite_aarch64_minimal (master)]
? file minimal
minimal: ELF 64-bit LSB executable, ARM aarch64, version 1 (GNU/Linux), dynamically linked, interpreter /lib/ld-linux-aarch64.so.1, for GNU/Linux 3.7.0, with debug_info, not stripped
下面將對應(yīng)的 minimal 和 libtensorflowlite_flex.so 文件發(fā)送到飛騰派并對佩奇檢測模型進(jìn)行測試。測試過程和結(jié)果分別如下:
red@phytiumpi:~$ file minimal
minimal: ELF 64-bit LSB executable, ARM aarch64, version 1 (GNU/Linux), dynamically linked, interpreter /lib/ld-linux-aarch64.so.1, for GNU/Linux 3.7.0, with debug_info, not stripped
red@phytiumpi:~$ ldd minimal
linux-vdso.so.1 (0x0000ffffb29be000)
libtensorflowlite_flex.so = > /lib/libtensorflowlite_flex.so (0x0000ffff96480000)
librt.so.1 = > /lib/aarch64-linux-gnu/librt.so.1 (0x0000ffff96460000)
libdl.so.2 = > /lib/aarch64-linux-gnu/libdl.so.2 (0x0000ffff96440000)
libpthread.so.0 = > /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000ffff96420000)
libm.so.6 = > /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffff96380000)
libstdc++.so.6 = > /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000ffff96150000)
libgcc_s.so.1 = > /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffff96120000)
libc.so.6 = > /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffff95f70000)
/lib/ld-linux-aarch64.so.1 (0x0000ffffb2985000)
red@phytiumpi:~$
可以看到檢測的結(jié)果和PC端的一致。
至此已經(jīng)完成了佩奇檢測模型部署到飛騰派的前期準(zhǔn)備工作(環(huán)境搭建,功能測試驗證)。接下來的章節(jié),就是開始使用 C++ 開發(fā)佩奇檢測模型的應(yīng)用功能了,敬請期待。
審核編輯 黃宇
-
4G
+關(guān)注
關(guān)注
15文章
5524瀏覽量
119218 -
飛騰派
+關(guān)注
關(guān)注
2文章
9瀏覽量
237
發(fā)布評論請先 登錄
相關(guān)推薦
評論