

    Preface

    The two most common bottlenecks for a MySQL database are CPU and I/O. CPU saturation typically occurs while data is being loaded into memory or read from disk; the disk I/O bottleneck appears when the data being loaded is far larger than the available memory.

    When MySQL performance hits a bottleneck, quickly locating the root cause is something every DBA or operations engineer has to think about. Used correctly, performance analysis tools can help DBAs and operations engineers pinpoint problems fast.

    Below is a roundup of performance analysis tools that come in handy in day-to-day MySQL DBA work.

    1. pt-query-digest

    pt-query-digest is mainly used to analyze MySQL slow query logs. Compared with mysqldumpslow, its analysis results are more specific and more complete. pt-query-digest is part of the Percona Toolkit.

    1.1 Installation

    yum install percona-toolkit-3.0.13-1.el7.x86_64.rpm 
    

    1.2 Usage

    1. Analyze a slow query log file directly:

    pt-query-digest /var/lib/mysql/slowtest-slow.log > slow_report.log
    

    2. Analyze queries from the last 12 hours:

    pt-query-digest --since=12h /var/lib/mysql/slowtest-slow.log > slow_report2.log
    

    3. Analyze queries within a specified time range:

    pt-query-digest /var/lib/mysql/slowtest-slow.log --since '2017-01-07 09:30:00' --until '2017-01-07 10:00:00' > slow_report3.log
    

    4. Analyze only the slow queries that contain SELECT statements:

    pt-query-digest --filter '$event->{fingerprint}=~ m/^select/i' /var/lib/mysql/slowtest-slow.log> slow_report4.log
    

    5. Slow queries issued by a specific user:

    pt-query-digest --filter '($event->{user} || "")=~ m/^root/i' /var/lib/mysql/slowtest-slow.log> slow_report5.log
    

    6. Find all slow queries that performed a full table scan or a full join:

    pt-query-digest --filter '(($event->{Full_scan} || "") eq "yes") ||(($event->{Full_join} || "") eq "yes")' /var/lib/mysql/slowtest-slow.log> slow_report6.log
    

    2. Innotop

    innotop is a MySQL and InnoDB transaction/status monitor. It displays queries, InnoDB transactions, lock waits, deadlocks, foreign key errors, open tables, replication status, buffer information, row operations, logs, I/O operations, load graphs, and more. You can use innotop to monitor several servers at the same time, and it gives you a comprehensive view of your MySQL instances.

    2.1 Installing innotop

    yum install innotop-1.11.4-1.el7.noarch 
    

    2.2 Usage

    1. Basic usage:

    innotop --host 192.168.1.181 --user admin --password 123456 --port 3306 --delay 1 -m Q
    

    2. Parameter descriptions, as shown in the figure below.

    Note: press ? to switch to the other views and commands.

    3. Orzdba

    Orzdba is a real-time database performance viewer open-sourced by Taobao. With it you can keep track of your database's performance at any moment.

    3.1 Usage:

    ./orzdba_remote --host=192.168.1.181 --user="admin" --password=123456 --port=3306 -mysql -sys 2>/dev/null 
    

    Parameter descriptions:

    • --host: host to connect to
    • --user: user name
    • --password: database password
    • --port: database port

    4. TCP packet capture

    4.1 Installing tcpdump

    yum install tcpdump-4.9.2-3.el7.x86_64 -y
    

    4.2 Usage

    1. Capture packets with tcpdump:

    tcpdump -i any port 3306 -l -s 0 -w - |strings |grep -A 5 select|less
    

    2. tcpdump + pt-query-digest:

    tcpdump -s 65535 -x -nn -q -tttt -i any -c 1000 port 3306 > mysql.tcp.txt
    pt-query-digest --type tcpdump mysql.tcp.txt> slow_report9.log
    

    5. pt-ioprofile

    5.1 pt-ioprofile

    pt-ioprofile pinpoints which files the I/O load is coming from; use ps first to find the process that is generating the load.

    5.2 Usage

    pt-ioprofile --profile-pid=12036 --cell=sizes 
    

    Parameter descriptions:

    • --profile-pid: PID of the mysqld process
    • --cell=sizes: aggregate the results by I/O size, in bytes

    6. Tcprstat

    Tcprstat judges how the database is doing from its response times.

    6.1 Installing tcprstat

    On a 64-bit operating system you can simply download the prebuilt binary. The steps are:

    1) Download the file http://github.com/downloads/Lowercases/tcprstat/tcprstat-static.v0.3.1.x86_64

    2) Move the downloaded file to /usr/bin

    3) Rename the file to tcprstat

    4) Make it executable: chmod +x /usr/bin/tcprstat. If you want to use it on a 32-bit operating system, you will have to compile it yourself.

    Source code: https://github.com/Lowercases/tcprstat and https://launchpad.net/tcprstat

    6.2 Usage

    [root@localhost ~]# tcprstat -p 3306 -t 1 -n 10
    timestamp count max min avg med stddev 95_max 95_avg 95_std 99_max 99_avg 99_std
    1539760803 1 103 103 103 103 0 0 0 0 0 0 0
    1539760804 1 108 108 108 108 0 0 0 0 0 0 0
    1539760805 1 124 124 124 124 0 0 0 0 0 0 0
    1539760806 1 115 115 115 115 0 0 0 0 0 0 0
    1539760807 1 112 112 112 112 0 0 0 0 0 0 0
    

    Each request took roughly 103-124 microseconds (tcprstat reports times in microseconds), i.e. about 0.1 ms.

    Parameter descriptions:

    • -p: database port
    • -t: refresh interval in seconds
    • -n: number of outputs to print

    7. Nicstat

    nicstat is a network workhorse that gives you a full picture of how your NICs are doing.

    7.1 Installing nicstat

    yum install http://rpmfind.net/linux/fedora/linux/releases/28/Everything/x86_64/os/Packages/n/nicstat-1.95-7.fc27.x86_64.rpm
    

    7.2 Usage

    [root@lkjtest ~]# nicstat -z 1
     Time Int rKB/s wKB/s rPk/s wPk/s rAvs wAvs %Util Sat
    15:29:14 ens160 4.03 0.91 43.18 1.60 95.61 581.8 0.00 0.00
    15:29:15 ens160 3.09 0.73 35.95 2.00 88.11 375.5 0.00 0.00
    15:29:16 ens160 3.93 0.66 43.99 2.00 91.52 335.5 0.00 0.00
    15:29:17 ens160 3.99 0.66 45.00 2.00 90.71 335.5 0.00 0.00
    15:29:18 ens160 4.04 0.66 46.99 2.00 88.04 335.5 0.00 0.00
    15:29:19 ens160 3.64 0.66 42.00 2.00 88.76 335.5 0.00 0.00
    

    Parameter descriptions:

    • -z: skip lines that are all zeros

    Output field descriptions:

    • wKB/s: kilobytes written (transmitted) per second
    • rKB/s: kilobytes read (received) per second
    • %Util: interface utilization percentage
    • Sat: errors per second, an indicator that the interface is approaching saturation

    8. Dstat

    8.1 Installing dstat

    yum install dstat -y
    

    8.2 Usage

    [root@localhost ~]# dstat -tclmndy 1 
    

    Parameter descriptions:

    • -t:enable time/date output
    • -c:enable cpu stats
    • -l:enable load stats
    • -m:enable memory stats
    • -n:enable network stats
    • -d:enable disk stats
    • -y:enable system stats

    9. vmtouch

    vmtouch is a tool for learning about and controlling the file system cache of Unix and Unix-like systems.

    9.1 Quick install

    $ git clone https://github.com/hoytech/vmtouch.git
    $ cd vmtouch
    $ make
    $ sudo make install
    

    9.2 Usage

    • Show how much of a file is resident in the cache:
    $ vmtouch -v big-dataset.txt
    
    • Evict a file from the cache:
    vmtouch -ve a.txt
    

    10. OProfile

    OProfile is an open-source, system-wide statistical profiler that works by sampling. CPU usage inexplicably high? Application responding slowly? No analysis tool at hand? This is the one to reach for. Through counter sampling it helps us find the CPU "culprits" at the process, function, and source-code level.

    10.1 Installation

    yum install http://www.rpmfind.net/linux/centos/7.5.1804/os/x86_64/Packages/oprofile-0.9.9-25.el7.x86_64.rpm -y
    

    10.2 Usage

    1. Basic workflow:

    # Load the oprofile kernel module
    opcontrol --init
    # We are not interested in sampling the kernel
    opcontrol --setup --no-vmlinux
    # Review the settings before collecting samples
    opcontrol --status
    # Clear the data from the previous sampling run
    opcontrol --reset
    # Start sampling (then run your program)
    opcontrol --start
    # Flush the collected samples
    opcontrol --dump
    # Shut down the daemon and finalize the sample data
    opcontrol --shutdown
    

    Note: if you see the error "Cannot find event CPU_CLK_UNHALTED", fix it as follows:

    # Fix:
    $sudo opcontrol --deinit
    Daemon not running
    Unloading oprofile module
     
    $sudo modprobe oprofile timer=1 
     
    $dmesg|grep oprofile|tail -n 1
    oprofile: using timer interrupt.
    # If you see the line above, the workaround worked.
    # Try again:
    $sudo opcontrol --init && sudo opcontrol --reset && sudo opcontrol --start
    Using 2.6+ OProfile kernel interface.
    Using log file /var/lib/oprofile/samples/oprofiled.log
    Daemon started.
    Profiler running.
    

    2. Retrieving the sample data:

    # System-wide report
    opreport --long-filenames
    # Per-module report
    opreport image:foo -l
    # Source-level annotation
    opannotate image:foo -s
    

    3. Example session:

    opcontrol --deinit
    modprobe oprofile timer=1
    $dmesg|grep oprofile|tail -n 1
    (oprofile: using timer interrupt.)
    opcontrol --reset
    opcontrol --separate=lib --no-vmlinux --start --image=/home/mysql_user/mysqlhome/bin/mysqld
    opcontrol --dump
    opcontrol --shutdown
    opreport -l /home/mysql_user/mysqlhome/bin/mysqld
    

    If you have better tools to recommend, feel free to share them in the comments.



    Authors: 付輝輝, 周鈺臣

    Editor: 極市平臺

    Preface

    In recent years, research on deep-learning-based human action recognition has kept growing. The SlowFast model, with its fast and slow pathways, performs extremely well on action recognition datasets. This article covers SlowFast data preparation and training, SlowFast inference with ONNX, and, in particular, SlowFast inference with TensorRT, person tracking with YOLOv5 + DeepSORT, and C++ deployment.

    1. Data preparation

    1.1 Trimming the videos

    Prepare several groups of video data, where IN_DATA_DIR is the directory holding the original videos and OUT_DATA_DIR is the directory for the trimmed videos. This step ensures that all videos end up with the same length.

    IN_DATA_DIR="/project/train/src_repo/data/video"
    OUT_DATA_DIR="/project/train/src_repo/data/splitvideo"
    str="_"
    if [[ ! -d "${OUT_DATA_DIR}" ]]; then
      echo "${OUT_DATA_DIR} doesn't exist. Creating it.";
      mkdir -p ${OUT_DATA_DIR}
    fi
    
    for video in $(ls -A1 -U ${IN_DATA_DIR}/*)
    do 
        for i in {0..10}
        do 
          index=$(expr $i \* 10)
          out_name="${OUT_DATA_DIR}/${i}${str}${video##*/}"
          if [ ! -f "${out_name}" ]; then
            ffmpeg -ss ${index} -t 80 -i "${video}" "${out_name}"
          fi
        done
    done
    

    1.2 Extracting keyframes

    Keyframes are extracted at one frame per second of video. IN_DATA_DIR is the directory of videos produced in the previous step, and OUT_DATA_DIR is where the extracted keyframes are stored.

    # Extract frames, one frame per second
    IN_DATA_DIR="/project/train/src_repo/data/splitvideo/"
    OUT_DATA_DIR="/project/train/src_repo/data/splitimages/"
     
    if [[ ! -d "${OUT_DATA_DIR}" ]]; then
      echo "${OUT_DATA_DIR} doesn't exist. Creating it.";
      mkdir -p ${OUT_DATA_DIR}
    fi
     
    for video in $(ls -A1 -U ${IN_DATA_DIR}/*)
    do
      video_name=${video##*/}
     
      if [[ $video_name == *".webm" ]]; then
        video_name=${video_name::-5}
      else
        video_name=${video_name::-4}
      fi
     
      out_video_dir=${OUT_DATA_DIR}/${video_name}/
      mkdir -p "${out_video_dir}"
     
      out_name="${out_video_dir}/${video_name}_%06d.jpg"
     
      ffmpeg -i "${video}" -r 1 -q:v 1 "${out_name}"
    done
     
    

    1.3 Splitting the videos

    Split the videos generated in step 1 into frames with ffmpeg at 30 frames per second. IN_DATA_DIR is the directory holding the videos and OUT_DATA_DIR is the output directory.

    IN_DATA_DIR="/project/train/src_repo/video"
    OUT_DATA_DIR="/project/train/src_repo/spiltvideo"
    
    if [[ ! -d "${OUT_DATA_DIR}" ]]; then
      echo "${OUT_DATA_DIR} doesn't exist. Creating it.";
      mkdir -p ${OUT_DATA_DIR}
    fi
    
    for video in $(ls -A1 -U ${IN_DATA_DIR}/*)
    do
      out_name="${OUT_DATA_DIR}/${video##*/}"
      if [ ! -f "${out_name}" ]; then
        ffmpeg -ss 0 -t 100 -i "${video}" "${out_name}"
      fi
    done
    

    1.4 Directory layout

    ava  # top-level folder holding the video annotation information
    —person_box_67091280_iou90 # second-level folder holding the person-detection files
    ——ava_detection_train_boxes_and_labels_include_negative_v2.2.csv # person-detection results used for training
    ——ava_detection_val_boxes_and_labels.csv # person-detection results used for validation
    —ava_action_list_v2.2_for_activitynet_2019.pbtxt # file holding the action label definitions
    —ava_val_excluded_timestamps_v2.2.csv # frames that contain no person; they are discarded during training
    —ava_train_v2.2.csv # training data: keyframe annotations (example rows below)
    —ava_val_v2.2.csv  # validation data: keyframe annotations
    
    frame_lists  # top-level folder holding the paths of the frames generated in 1.3
    —train.csv
    —val.csv
    
    frames  # top-level folder holding the frames generated in 1.3
    —A
    ——A_000001.jpg
    ——A_0000012.jpg
    …
    ——A_000090.jpg
    —B
    ——B_000001.jpg
    ——B_0000012.jpg
    …
    ——B_000090.jpg

    2. Environment setup

    2.1 Installing dependencies

    pip install iopath
    pip install fvcore
    pip install simplejson
    pip install pytorchvideo
    

    2.2 Installing detectron2

    !python -m pip install pyyaml==5.1
    import sys, os, distutils.core
    # Note: This is a faster way to install detectron2 in Colab, but it does not include all functionalities.
    # See https://detectron2.readthedocs.io/tutorials/install.html for full installation instructions
    !git clone 'https://github.com/facebookresearch/detectron2'
    dist=distutils.core.run_setup("./detectron2/setup.py")
    !python -m pip install {' '.join([f"'{x}'" for x in dist.install_requires])}
    sys.path.insert(0, os.path.abspath('./detectron2'))
    

    3. SlowFast training

    3.1 Training

    python tools/run_net.py --cfg configs/AVA/SLOWFAST_32x2_R50_SHORT.yaml
    

    SLOWFAST_32x2_R50_SHORT.yaml

    TRAIN:
      ENABLE: True
      DATASET: ava
      BATCH_SIZE: 8 #64
      EVAL_PERIOD: 5
      CHECKPOINT_PERIOD: 1
      AUTO_RESUME: True
      CHECKPOINT_FILE_PATH: '/content/SLOWFAST_32x2_R101_50_50.pkl'  # path to the pretrained model
      CHECKPOINT_TYPE: pytorch
    DATA:
      NUM_FRAMES: 32
      SAMPLING_RATE: 2
      TRAIN_JITTER_SCALES: [256, 320]
      TRAIN_CROP_SIZE: 224
      TEST_CROP_SIZE: 224
      INPUT_CHANNEL_NUM: [3, 3]
      PATH_TO_DATA_DIR: '/content/ava'
    DETECTION:
      ENABLE: True
      ALIGNED: True
    AVA:
      FRAME_DIR: '/content/ava/frames'   # directory generated during data preparation
      FRAME_LIST_DIR: '/content/ava/frame_lists'
      ANNOTATION_DIR: '/content/ava/annotations'
      DETECTION_SCORE_THRESH: 0.5
      FULL_TEST_ON_VAL: True
      TRAIN_PREDICT_BOX_LISTS: [
        "ava_train_v2.2.csv",
        "person_box_67091280_iou90/ava_detection_train_boxes_and_labels_include_negative_v2.2.csv",
      ]
      TEST_PREDICT_BOX_LISTS: [
        "person_box_67091280_iou90/ava_detection_val_boxes_and_labels.csv"]
      
      
    SLOWFAST:
      ALPHA: 4
      BETA_INV: 8
      FUSION_CONV_CHANNEL_RATIO: 2
      FUSION_KERNEL_SZ: 7
    RESNET:
      ZERO_INIT_FINAL_BN: True
      WIDTH_PER_GROUP: 64
      NUM_GROUPS: 1
      DEPTH: 50
      TRANS_FUNC: bottleneck_transform
      STRIDE_1X1: False
      NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]
      SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [2, 2]]
      SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [1, 1]]
    NONLOCAL:
      LOCATION: [[[], []], [[], []], [[], []], [[], []]]
      GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]
      INSTANTIATION: dot_product
      POOL: [[[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]]]
    BN:
      USE_PRECISE_STATS: False
      NUM_BATCHES_PRECISE: 20
    SOLVER:
      BASE_LR: 0.1
      LR_POLICY: steps_with_relative_lrs
      STEPS: [0, 10, 15, 20]
      LRS: [1, 0.1, 0.01, 0.001]
      MAX_EPOCH: 20
      MOMENTUM: 0.9
      WEIGHT_DECAY: 1e-7
      WARMUP_EPOCHS: 5.0
      WARMUP_START_LR: 0.000125
      OPTIMIZING_METHOD: sgd
    MODEL:
      NUM_CLASSES: 1
      ARCH: slowfast
      MODEL_NAME: SlowFast
      LOSS_FUNC: bce
      DROPOUT_RATE: 0.5
      HEAD_ACT: sigmoid
    TEST:
      ENABLE: False
      DATASET: ava
      BATCH_SIZE: 8
    DATA_LOADER:
      NUM_WORKERS: 0
      PIN_MEMORY: True
    NUM_GPUS: 1
    NUM_SHARDS: 1
    RNG_SEED: 0
    OUTPUT_DIR: .
    

    3.2 Common errors during training

    1. In slowfast/datasets/ava_helper.py, change AVA_VALID_FRAMES to the length of your videos (see the sketch below).
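
    A minimal sketch of that change. The original range comes from the AVA dataset, where the valid keyframes are seconds 902-1798 of each movie; the replacement value below is purely illustrative and should cover the timestamps of your own clips.

    # slowfast/datasets/ava_helper.py
    # AVA_VALID_FRAMES = range(902, 1799)   # original AVA setting
    AVA_VALID_FRAMES = range(0, 100)        # illustrative: clips of roughly 100 seconds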

    2. pytorchvideo.layers.distributed error:

    from pytorchvideo.layers.distributed import ( # noqa
    ImportError: cannot import name 'cat_all_gather' from 'pytorchvideo.layers.distributed' 
    (/site-packages/pytorchvideo/layers/distributed.py)
    

    3. pytorchvideo.losses error:

    File "SlowFast/slowfast/models/losses.py", line 11, in
    from pytorchvideo.losses.soft_target_cross_entropy import (
    ModuleNotFoundError: No module named 'pytorchvideo.losses'
    

    Errors 2 and 3 can be resolved by following reference link 1.

    4. SlowFast inference

    Option 1: run inference with the official script.

    python tools/run_net.py --cfg demo/AVA/SLOWFAST_32x2_R101_50_50.yaml
    

    Option 2: because of the detectron2 installation problems and a series of later deployment issues, you can run inference with YOLOv5 plus SlowFast instead.

    First, let's walk through the SlowFast inference pipeline.


    Step 1: read 64 consecutive frames and check that a full 64 frames were obtained

    while was_read:
        frames=[]
        seq_length=64
        while was_read and len(frames) < seq_length:
            was_read, frame=cap.read()
            frames.append(frame)
    

    Step 2: person detection with YOLOv5

    1. YOLOv5 inference code. Change the sys.path.insert path and the weights path to your own:

    import argparse
    import os
    import platform
    import shutil
    import time
    from pathlib import Path
    import sys
    import json
    sys.path.insert(1, '/content/drive/MyDrive/yolov5/')
    import cv2
    import torch
    import torch.backends.cudnn as cudnn
    import numpy as np
    import argparse
    import time
    import cv2
    import torch
    import torch.backends.cudnn as cudnn
    from numpy import random
    from models.common import DetectMultiBackend
    from utils.augmentations import letterbox
    from utils.general import check_img_size, non_max_suppression, scale_coords, set_logging
    from utils.torch_utils import select_device
    # ####### parameter settings
    conf_thres=0.6
    iou_thres=0.5
    #######
    imgsz=640
    weights="/content/yolov5l.pt"
    device='0'
    stride=32
    names=["person"]
    import os
    def init():
        # Initialize
        global imgsz, device, stride
        set_logging()
        device=select_device('0')
        half=device.type !='cpu'  # half precision only supported on CUDA
        model=DetectMultiBackend(weights, device=device, dnn=False)
        stride, pt, jit, engine=model.stride, model.pt, model.jit, model.engine
        imgsz=check_img_size(imgsz, s=stride)  # check img_size
        model.half()  # to FP16
        model.eval()
        return model
    
    def process_image(model, input_image=None, args=None, **kwargs):
        img0=input_image
        img=letterbox(img0, new_shape=imgsz, stride=stride, auto=True)[0]
        img=img.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB
        img=np.ascontiguousarray(img)
    
        img=torch.from_numpy(img).to(device)
        img=img.half()
        img /=255.0  # 0 - 255 to 0.0 - 1.0
        if len(img.shape)==3:
            img=img[None]
        pred=model(img, augment=False, val=True)[0]
        pred=non_max_suppression(pred, conf_thres, iou_thres, agnostic=False)
        result=[]
        for i, det in enumerate(pred):  # detections per image
            gn=torch.tensor(img0.shape)[[1, 0, 1, 0]]  # normalization gain whwh
            if det is not None and len(det):
                # Rescale boxes from img_size to im0 size
                det[:, :4]=scale_coords(img.shape[2:], det[:, :4], img0.shape).round()
                for *xyxy, conf, cls in det:
                    if cls==0:
                        result.append([float(xyxy[0]),float(xyxy[1]),float(xyxy[2]),float(xyxy[3])])
        if len(result)==0:
          return None
        return torch.from_numpy(np.array(result))
    

    2. bbox preprocessing

    def scale_boxes(size, boxes, height, width):
        """
        Scale the short side of the box to size.
        Args:
            size (int): size to scale the image.
            boxes (ndarray): bounding boxes to peform scale. The dimension is
            `num boxes` x 4.
            height (int): the height of the image.
            width (int): the width of the image.
        Returns:
            boxes (ndarray): scaled bounding boxes.
        """
        if (width <=height and width==size) or (
            height <=width and height==size
        ):
            return boxes
    
        new_width=size
        new_height=size
        if width < height:
            new_height=int(math.floor((float(height) / width) * size))
            boxes *=float(new_height) / height
        else:
            new_width=int(math.floor((float(width) / height) * size))
            boxes *=float(new_width) / width
        return boxes
    

    Step 3: image preprocessing

    1. Resizing the image

    def scale(size, image):
        """
        Scale the short side of the image to size.
        Args:
            size (int): size to scale the image.
            image (array): image to perform short side scale. Dimension is
                `height` x `width` x `channel`.
        Returns:
            (ndarray): the scaled image with dimension of
                `height` x `width` x `channel`.
        """
        height=image.shape[0]
        width=image.shape[1]
        # print(height,width)
        if (width <=height and width==size) or (
            height <=width and height==size
        ):
            return image
        new_width=size
        new_height=size
        if width < height:
            new_height=int(math.floor((float(height) / width) * size))
        else:
            new_width=int(math.floor((float(width) / height) * size))
        img=cv2.resize(
            image, (new_width, new_height), interpolation=cv2.INTER_LINEAR
        )
        # print(new_width, new_height)
        return img.astype(np.float32)
    

    2. Normalization

    def tensor_normalize(tensor, mean, std, func=None):
        """
        Normalize a given tensor by subtracting the mean and dividing the std.
        Args:
            tensor (tensor): tensor to normalize.
            mean (tensor or list): mean value to subtract.
            std (tensor or list): std to divide.
        """
        if tensor.dtype==torch.uint8:
            tensor=tensor.float()
            tensor=tensor / 255.0
        if type(mean)==list:
            mean=torch.tensor(mean)
        if type(std)==list:
            std=torch.tensor(std)
        if func is not None:
            tensor=func(tensor)
        tensor=tensor - mean
        tensor=tensor / std
        return tensor
    

    3. Building the slow and fast pathway inputs

    The main idea is to pick 32 of the 64 frames as the fast pathway input, then pick 8 of those frames as the slow pathway input, and permute T H W C -> C T H W. The final fast_pathway therefore has shape (b, 3, 32, h, w) and slow_pathway has shape (b, 3, 8, h, w).

    def process_cv2_inputs(frames):
        """
        Normalize and prepare inputs as a list of tensors. Each tensor
        correspond to a unique pathway.
        Args:
            frames (list of array): list of input images (correspond to one clip) in range [0, 255].
            cfg (CfgNode): configs. Details can be found in
                slowfast/config/defaults.py
        """
        inputs=torch.from_numpy(np.array(frames)).float() / 255
        inputs=tensor_normalize(inputs, [0.45,0.45,0.45], [0.225,0.225,0.225])
        # T H W C -> C T H W.
        inputs=inputs.permute(3, 0, 1, 2)
        # Sample frames for num_frames specified.
        index=torch.linspace(0, inputs.shape[1] - 1, 32).long()
        print(index)
        inputs=torch.index_select(inputs, 1, index)
        fast_pathway=inputs
        slow_pathway=torch.index_select(
                inputs,
                1,
                torch.linspace(
                    0, inputs.shape[1] - 1, inputs.shape[1] // 4
                ).long(),
            )
        frame_list=[slow_pathway, fast_pathway]
        print(np.shape(frame_list[0]))
        inputs=[inp.unsqueeze(0) for inp in frame_list]
        return inputs
    

    5. SlowFast ONNX inference

    5.1 Exporting the ONNX model

    import os
    import sys
    from collections import OrderedDict
    import torch
    import argparse
    work_root=os.path.split(os.path.realpath(__file__))[0]
    from slowfast.config.defaults import get_cfg
    import slowfast.utils.checkpoint as cu
    from slowfast.models import build_model
    
    
    def parser_args():
        parser=argparse.ArgumentParser()
        parser.add_argument(
            "--cfg",
            dest="cfg_file",
            type=str,
            default=os.path.join(
                work_root, "/content/drive/MyDrive/SlowFast/demo/AVA/SLOWFAST_32x2_R101_50_50.yaml"),
            help="Path to the config file",
        )
        parser.add_argument(
            '--half',
            type=bool,
            default=False,
            help='use half mode',
        )
        parser.add_argument(
            '--checkpoint',
            type=str,
            default=os.path.join(work_root,
                                 "/content/SLOWFAST_32x2_R101_50_50.pkl"),
            help='test model file path',
        )
        parser.add_argument(
            '--save',
            type=str,
            default=os.path.join(work_root, "/content/SLOWFAST_head.onnx"),
            help='save model file path',
        )
        return parser.parse_args()
    
    
    def main():
        args=parser_args()
        print(args)
        cfg_file=args.cfg_file
        checkpoint_file=args.checkpoint
        save_checkpoint_file=args.save
        half_flag=args.half
        cfg=get_cfg()
        cfg.merge_from_file(cfg_file)
        cfg.TEST.CHECKPOINT_FILE_PATH=checkpoint_file
        print(cfg.DATA)
        print("export pytorch model to onnx!\n")
        device="cuda:0"
        with torch.no_grad():
            model=build_model(cfg)
            model=model.to(device)
            model.eval()
            cu.load_test_checkpoint(cfg, model)
            if half_flag:
                model.half()
            fast_pathway=torch.randn(1, 3, 32, 256, 455)
            slow_pathway=torch.randn(1, 3, 8, 256, 455)
            bbox=torch.randn(32,5).to(device)
            fast_pathway=fast_pathway.to(device)
            slow_pathway=slow_pathway.to(device)
            inputs=[slow_pathway, fast_pathway]
            for p in model.parameters():
             p.requires_grad=False
            torch.onnx.export(model, (inputs,bbox), save_checkpoint_file, input_names=['slow_pathway','fast_pathway','bbox'],output_names=['output'], opset_version=12)
            onnx_check()
    
    
    def onnx_check():
        import onnx
        args=parser_args()
        print(args)
        onnx_model_path=args.save
        model=onnx.load(onnx_model_path)
        onnx.checker.check_model(model)
    
    
    if __name__=='__main__':
        main()
    

    5.2 ONNX inference

    import torch
    import math
    import onnxruntime
    from torchvision.ops import roi_align
    import argparse
    import os
    import platform
    import shutil
    import time
    from pathlib import Path
    import sys
    import json
    sys.path.insert(1, '/content/drive/MyDrive/yolov5/')
    import cv2
    import torch
    import torch.backends.cudnn as cudnn
    import numpy as np
    import argparse
    import time
    import cv2
    import torch
    import torch.backends.cudnn as cudnn
    from numpy import random
    from models.common import DetectMultiBackend
    from utils.augmentations import letterbox
    from utils.general import check_img_size, non_max_suppression, scale_coords, set_logging
    from utils.torch_utils import select_device
    # ####### parameter settings
    conf_thres=0.6
    iou_thres=0.5
    #######
    imgsz=640
    weights="/content/yolov5l.pt"
    device='0'
    stride=32
    names=["person"]
    import os
    def init():
        # Initialize
        global imgsz, device, stride
        set_logging()
        device=select_device('0')
        half=device.type !='cpu'  # half precision only supported on CUDA
        model=DetectMultiBackend(weights, device=device, dnn=False)
        stride, pt, jit, engine=model.stride, model.pt, model.jit, model.engine
        imgsz=check_img_size(imgsz, s=stride)  # check img_size
        model.half()  # to FP16
        model.eval()
        return model
    
    def process_image(model, input_image=None, args=None, **kwargs):
        img0=input_image
        img=letterbox(img0, new_shape=imgsz, stride=stride, auto=True)[0]
        img=img.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB
        img=np.ascontiguousarray(img)
    
        img=torch.from_numpy(img).to(device)
        img=img.half()
        img /=255.0  # 0 - 255 to 0.0 - 1.0
        if len(img.shape)==3:
            img=img[None]
        pred=model(img, augment=False, val=True)[0]
        pred=non_max_suppression(pred, conf_thres, iou_thres, agnostic=False)
        result=[]
        for i, det in enumerate(pred):  # detections per image
            gn=torch.tensor(img0.shape)[[1, 0, 1, 0]]  # normalization gain whwh
            if det is not None and len(det):
                # Rescale boxes from img_size to im0 size
                det[:, :4]=scale_coords(img.shape[2:], det[:, :4], img0.shape).round()
                for *xyxy, conf, cls in det:
                    if cls==0:
                        result.append([float(xyxy[0]),float(xyxy[1]),float(xyxy[2]),float(xyxy[3])])
        if len(result)==0:
          return None
        for i in range(32-len(result)):
          result.append([float(0),float(0),float(0),float(0)])
        return torch.from_numpy(np.array(result))
    def scale(size, image):
        """
        Scale the short side of the image to size.
        Args:
            size (int): size to scale the image.
            image (array): image to perform short side scale. Dimension is
                `height` x `width` x `channel`.
        Returns:
            (ndarray): the scaled image with dimension of
                `height` x `width` x `channel`.
        """
        height=image.shape[0]
        width=image.shape[1]
        # print(height,width)
        if (width <=height and width==size) or (
            height <=width and height==size
        ):
            return image
        new_width=size
        new_height=size
        if width < height:
            new_height=int(math.floor((float(height) / width) * size))
        else:
            new_width=int(math.floor((float(width) / height) * size))
        img=cv2.resize(
            image, (new_width, new_height), interpolation=cv2.INTER_LINEAR
        )
        # print(new_width, new_height)
        return img.astype(np.float32)
    def tensor_normalize(tensor, mean, std, func=None):
        """
        Normalize a given tensor by subtracting the mean and dividing the std.
        Args:
            tensor (tensor): tensor to normalize.
            mean (tensor or list): mean value to subtract.
            std (tensor or list): std to divide.
        """
        if tensor.dtype==torch.uint8:
            tensor=tensor.float()
            tensor=tensor / 255.0
        if type(mean)==list:
            mean=torch.tensor(mean)
        if type(std)==list:
            std=torch.tensor(std)
        if func is not None:
            tensor=func(tensor)
        tensor=tensor - mean
        tensor=tensor / std
        return tensor
    def scale_boxes(size, boxes, height, width):
        """
        Scale the short side of the box to size.
        Args:
            size (int): size to scale the image.
            boxes (ndarray): bounding boxes to peform scale. The dimension is
            `num boxes` x 4.
            height (int): the height of the image.
            width (int): the width of the image.
        Returns:
            boxes (ndarray): scaled bounding boxes.
        """
        if (width <=height and width==size) or (
            height <=width and height==size
        ):
            return boxes
    
        new_width=size
        new_height=size
        if width < height:
            new_height=int(math.floor((float(height) / width) * size))
            boxes *=float(new_height) / height
        else:
            new_width=int(math.floor((float(width) / height) * size))
            boxes *=float(new_width) / width
        return boxes
    def process_cv2_inputs(frames):
        """
        Normalize and prepare inputs as a list of tensors. Each tensor
        correspond to a unique pathway.
        Args:
            frames (list of array): list of input images (correspond to one clip) in range [0, 255].
            cfg (CfgNode): configs. Details can be found in
                slowfast/config/defaults.py
        """
        inputs=torch.from_numpy(np.array(frames)).float() / 255
        inputs=tensor_normalize(inputs, [0.45,0.45,0.45], [0.225,0.225,0.225])
        # T H W C -> C T H W.
        inputs=inputs.permute(3, 0, 1, 2)
        # Sample frames for num_frames specified.
        index=torch.linspace(0, inputs.shape[1] - 1, 32).long()
        print(index)
        inputs=torch.index_select(inputs, 1, index)
        fast_pathway=inputs
        slow_pathway=torch.index_select(
                inputs,
                1,
                torch.linspace(
                    0, inputs.shape[1] - 1, inputs.shape[1] // 4
                ).long(),
            )
        frame_list=[slow_pathway, fast_pathway]
        print(np.shape(frame_list[0]))
        inputs=[inp.unsqueeze(0) for inp in frame_list]
        return inputs
    # load the models
    yolov5=init()
    slowfast=onnxruntime.InferenceSession('/content/SLOWFAST_32x2_R101_50_50.onnx')
    # load the video and start inference
    cap=cv2.VideoCapture("/content/atm_125.mp4")
    was_read=True
    while was_read:
        frames=[]
        seq_length=64
        while was_read and len(frames) < seq_length:
            was_read, frame=cap.read()
            frames.append(frame)
        
        bboxes=process_image(yolov5,frames[64//2])
        if bboxes is not None:
          frames=[cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) for frame in frames]
          frames=[scale(256, frame) for frame in frames]
          inputs=process_cv2_inputs(frames)
          if bboxes is not None:
              bboxes=scale_boxes(256,bboxes,1080,1920)
              index_pad=torch.full(
                  size=(bboxes.shape[0], 1),
                  fill_value=float(0),
                  device=bboxes.device,
              )
              # Pad frame index for each box.
              bboxes=torch.cat([index_pad, bboxes], axis=1)
          for i in range(len(inputs)):
            inputs[i]=inputs[i].numpy()
          if bboxes is not None:
              outputs=slowfast.run(None, {'slow_pathway': inputs[0],'fast_pathway':inputs[1],'bbox':bboxes})
              for i in range(80):
                if outputs[0][0][i]>0.3:
                  print(i)
              print(np.shape(outputs[0]))
        else:
            print("沒有檢測到任何人物")
    

    6. SlowFast Python TensorRT inference

    6.1 Exporting to TensorRT

    This is where the main contribution of this article begins.

    At first we tried to convert the ONNX model directly to TensorRT, but the export failed. The reason is that roi_align has not been implemented in TensorRT yet (roi_align is slated for the next TensorRT release).

    Looking at the exported ONNX graph, roi_align is only used in the head.


    So we take the following approach, shown in the figure below: split out the roi_align module so that it is not accelerated by TensorRT, and divide SlowFast into two networks, a body network that extracts features and a head network that performs the action classification.
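
    The engines loaded in the next section have to be built from the two ONNX files first. The article does not show that step, so here is a minimal sketch using the TensorRT Python API (assumptions: the TensorRT 8.x Python bindings are installed, the split ONNX files already exist, and the file names are illustrative):

    import tensorrt as trt
    
    def build_engine(onnx_path, engine_path):
        # parse an ONNX file and serialize a TensorRT engine to disk
        logger=trt.Logger(trt.Logger.INFO)
        builder=trt.Builder(logger)
        network=builder.create_network(
            1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
        parser=trt.OnnxParser(network, logger)
        with open(onnx_path, 'rb') as f:
            if not parser.parse(f.read()):
                for i in range(parser.num_errors):
                    print(parser.get_error(i))
                raise RuntimeError('failed to parse ' + onnx_path)
        config=builder.create_builder_config()
        config.max_workspace_size=1 << 30  # 1 GB, mirrors setMaxWorkspaceSize in the C++ code later
        serialized=builder.build_serialized_network(network, config)
        with open(engine_path, 'wb') as f:
            f.write(serialized)
    
    # illustrative paths -- adjust to wherever your ONNX files live
    build_engine('/content/SLOWFAST_32x2_R101_50_50.onnx', '/content/SLOWFAST_32x2_R101_50_50.engine')
    build_engine('/content/SLOWFAST_head.onnx', '/content/SLOWFAST_head.engine')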


    6.2 TensorRT inference code

    import ctypes
    import os
    import numpy as np
    import cv2
    import random
    import tensorrt as trt
    import pycuda.autoinit
    import pycuda.driver as cuda
    import threading
    import time
    
    
    class TrtInference():
        _batch_size=1
        def __init__(self, model_path=None, cuda_ctx=None):
            self._model_path=model_path
            if self._model_path is None:
                print("please set trt model path!")
                exit()
            self.cuda_ctx=cuda_ctx
            if self.cuda_ctx is None:
                self.cuda_ctx=cuda.Device(0).make_context()
            if self.cuda_ctx:
                self.cuda_ctx.push()
            self.trt_logger=trt.Logger(trt.Logger.INFO)
            self._load_plugins()
            self.engine=self._load_engine()
            try:
                self.context=self.engine.create_execution_context()
                self.stream=cuda.Stream()
                for index, binding in enumerate(self.engine):
                    if self.engine.binding_is_input(binding):
                        batch_shape=list(self.engine.get_binding_shape(binding)).copy()
                        batch_shape[0]=self._batch_size
                        self.context.set_binding_shape(index, batch_shape)
                self.host_inputs, self.host_outputs, self.cuda_inputs, self.cuda_outputs, self.bindings=self._allocate_buffers()
            except Exception as e:
                raise RuntimeError('fail to allocate CUDA resources') from e
            finally:
                if self.cuda_ctx:
                    self.cuda_ctx.pop()
    
        def _load_plugins(self):
            pass
    
        def _load_engine(self):
            with open(self._model_path, 'rb') as f, trt.Runtime(self.trt_logger) as runtime:
                return runtime.deserialize_cuda_engine(f.read())
    
        def _allocate_buffers(self):
            host_inputs, host_outputs, cuda_inputs, cuda_outputs, bindings=\
                [], [], [], [], []
            for index, binding in enumerate(self.engine):
                size=trt.volume(self.context.get_binding_shape(index)) * \
                       self.engine.max_batch_size
                host_mem=cuda.pagelocked_empty(size, np.float32)
                cuda_mem=cuda.mem_alloc(host_mem.nbytes)
                bindings.append(int(cuda_mem))
                if self.engine.binding_is_input(binding):
                    host_inputs.append(host_mem)
                    cuda_inputs.append(cuda_mem)
                else:
                    host_outputs.append(host_mem)
                    cuda_outputs.append(cuda_mem)
            return host_inputs, host_outputs, cuda_inputs, cuda_outputs, bindings
    
        def destroy(self):
            """Free CUDA memories and context."""
            del self.cuda_outputs
            del self.cuda_inputs
            del self.stream
            if self.cuda_ctx:
                self.cuda_ctx.pop()
                del self.cuda_ctx
    
        def inference(self, inputs):
            np.copyto(self.host_inputs[0], inputs[0].ravel())
            np.copyto(self.host_inputs[1], inputs[1].ravel())
            if self.cuda_ctx:
                self.cuda_ctx.push()
            cuda.memcpy_htod_async(
                self.cuda_inputs[0], self.host_inputs[0], self.stream)
            cuda.memcpy_htod_async(
                self.cuda_inputs[1], self.host_inputs[1], self.stream)
            self.context.execute_async(
                batch_size=1,
                bindings=self.bindings,
                stream_handle=self.stream.handle)
            cuda.memcpy_dtoh_async(
                self.host_outputs[0], self.cuda_outputs[0], self.stream)
            cuda.memcpy_dtoh_async(
                self.host_outputs[1], self.cuda_outputs[1], self.stream)
            self.stream.synchronize()
            if self.cuda_ctx:
                self.cuda_ctx.pop()
            output=[self.host_outputs[0],self.host_outputs[1]]
            return output
    
    
    class TrtInference_head():
        _batch_size=1
        def __init__(self, model_path=None, cuda_ctx=None):
            self._model_path=model_path
            if self._model_path is None:
                print("please set trt model path!")
                exit()
            self.cuda_ctx=cuda_ctx
            if self.cuda_ctx is None:
                self.cuda_ctx=cuda.Device(0).make_context()
            if self.cuda_ctx:
                self.cuda_ctx.push()
            self.trt_logger=trt.Logger(trt.Logger.INFO)
            self._load_plugins()
            self.engine=self._load_engine()
            try:
                self.context=self.engine.create_execution_context()
                self.stream=cuda.Stream()
                for index, binding in enumerate(self.engine):
                    if self.engine.binding_is_input(binding):
                        batch_shape=list(self.engine.get_binding_shape(binding)).copy()
                        batch_shape[0]=self._batch_size
                        self.context.set_binding_shape(index, batch_shape)
                self.host_inputs, self.host_outputs, self.cuda_inputs, self.cuda_outputs, self.bindings=self._allocate_buffers()
            except Exception as e:
                raise RuntimeError('fail to allocate CUDA resources') from e
            finally:
                if self.cuda_ctx:
                    self.cuda_ctx.pop()
    
        def _load_plugins(self):
            pass
    
        def _load_engine(self):
            with open(self._model_path, 'rb') as f, trt.Runtime(self.trt_logger) as runtime:
                return runtime.deserialize_cuda_engine(f.read())
    
        def _allocate_buffers(self):
            host_inputs, host_outputs, cuda_inputs, cuda_outputs, bindings=\
                [], [], [], [], []
            for index, binding in enumerate(self.engine):
                size=trt.volume(self.context.get_binding_shape(index)) * \
                       self.engine.max_batch_size
                host_mem=cuda.pagelocked_empty(size, np.float32)
                cuda_mem=cuda.mem_alloc(host_mem.nbytes)
                bindings.append(int(cuda_mem))
                if self.engine.binding_is_input(binding):
                    host_inputs.append(host_mem)
                    cuda_inputs.append(cuda_mem)
                else:
                    host_outputs.append(host_mem)
                    cuda_outputs.append(cuda_mem)
            return host_inputs, host_outputs, cuda_inputs, cuda_outputs, bindings
    
        def destroy(self):
            """Free CUDA memories and context."""
            del self.cuda_outputs
            del self.cuda_inputs
            del self.stream
            if self.cuda_ctx:
                self.cuda_ctx.pop()
                del self.cuda_ctx
    
        def inference(self, inputs):
            np.copyto(self.host_inputs[0], inputs[0].ravel())
            np.copyto(self.host_inputs[1], inputs[1].ravel())
            if self.cuda_ctx:
                self.cuda_ctx.push()
            cuda.memcpy_htod_async(
                self.cuda_inputs[0], self.host_inputs[0], self.stream)
            cuda.memcpy_htod_async(
                self.cuda_inputs[1], self.host_inputs[1], self.stream)
            self.context.execute_async(
                batch_size=1,
                bindings=self.bindings,
                stream_handle=self.stream.handle)
            cuda.memcpy_dtoh_async(
                self.host_outputs[0], self.cuda_outputs[0], self.stream)
            self.stream.synchronize()
            if self.cuda_ctx:
                self.cuda_ctx.pop()
            output=self.host_outputs[0]
            return output
    
    import torch
    import math
    from torchvision.ops import roi_align
    import argparse
    import os
    import platform
    import shutil
    import time
    from pathlib import Path
    import sys
    import json
    sys.path.insert(1, '/content/drive/MyDrive/yolov5/')
    import cv2
    import torch
    import torch.backends.cudnn as cudnn
    import numpy as np
    import argparse
    import time
    import cv2
    import torch
    import torch.backends.cudnn as cudnn
    from numpy import random
    from models.common import DetectMultiBackend
    from utils.augmentations import letterbox
    from utils.general import check_img_size, non_max_suppression, scale_coords, set_logging
    from utils.torch_utils import select_device
    # ####### parameter settings
    conf_thres=0.89
    iou_thres=0.5
    #######
    imgsz=640
    weights="/content/yolov5l.pt"
    device='0'
    stride=32
    names=["person"]
    import os
    def init():
        # Initialize
        global imgsz, device, stride
        set_logging()
        device=select_device('0')
        half=device.type !='cpu'  # half precision only supported on CUDA
        model=DetectMultiBackend(weights, device=device, dnn=False)
        stride, pt, jit, engine=model.stride, model.pt, model.jit, model.engine
        imgsz=check_img_size(imgsz, s=stride)  # check img_size
        model.half()  # to FP16
        model.eval()
        return model
    
    def process_image(model, input_image=None, args=None, **kwargs):
        img0=input_image
        img=letterbox(img0, new_shape=imgsz, stride=stride, auto=True)[0]
        img=img.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB
        img=np.ascontiguousarray(img)
    
        img=torch.from_numpy(img).to(device)
        img=img.half()
        img /=255.0  # 0 - 255 to 0.0 - 1.0
        if len(img.shape)==3:
            img=img[None]
        pred=model(img, augment=False, val=True)[0]
        pred=non_max_suppression(pred, conf_thres, iou_thres, agnostic=False)
        result=[]
        for i, det in enumerate(pred):  # detections per image
            gn=torch.tensor(img0.shape)[[1, 0, 1, 0]]  # normalization gain whwh
            if det is not None and len(det):
                # Rescale boxes from img_size to im0 size
                det[:, :4]=scale_coords(img.shape[2:], det[:, :4], img0.shape).round()
                for *xyxy, conf, cls in det:
                    if cls==0:
                        result.append([float(xyxy[0]),float(xyxy[1]),float(xyxy[2]),float(xyxy[3])])
        if len(result)==0:
          return None
        for i in range(32-len(result)):
          result.append([float(0),float(0),float(0),float(0)])
        return torch.from_numpy(np.array(result))
    def scale(size, image):
        """
        Scale the short side of the image to size.
        Args:
            size (int): size to scale the image.
            image (array): image to perform short side scale. Dimension is
                `height` x `width` x `channel`.
        Returns:
            (ndarray): the scaled image with dimension of
                `height` x `width` x `channel`.
        """
        height=image.shape[0]
        width=image.shape[1]
        # print(height,width)
        if (width <=height and width==size) or (
            height <=width and height==size
        ):
            return image
        new_width=size
        new_height=size
        if width < height:
            new_height=int(math.floor((float(height) / width) * size))
        else:
            new_width=int(math.floor((float(width) / height) * size))
        img=cv2.resize(
            image, (new_width, new_height), interpolation=cv2.INTER_LINEAR
        )
        # print(new_width, new_height)
        return img.astype(np.float32)
    def tensor_normalize(tensor, mean, std, func=None):
        """
        Normalize a given tensor by subtracting the mean and dividing the std.
        Args:
            tensor (tensor): tensor to normalize.
            mean (tensor or list): mean value to subtract.
            std (tensor or list): std to divide.
        """
        if tensor.dtype==torch.uint8:
            tensor=tensor.float()
            tensor=tensor / 255.0
        if type(mean)==list:
            mean=torch.tensor(mean)
        if type(std)==list:
            std=torch.tensor(std)
        if func is not None:
            tensor=func(tensor)
        tensor=tensor - mean
        tensor=tensor / std
        return tensor
    def scale_boxes(size, boxes, height, width):
        """
        Scale the short side of the box to size.
        Args:
            size (int): size to scale the image.
            boxes (ndarray): bounding boxes to peform scale. The dimension is
            `num boxes` x 4.
            height (int): the height of the image.
            width (int): the width of the image.
        Returns:
            boxes (ndarray): scaled bounding boxes.
        """
        if (width <=height and width==size) or (
            height <=width and height==size
        ):
            return boxes
    
        new_width=size
        new_height=size
        if width < height:
            new_height=int(math.floor((float(height) / width) * size))
            boxes *=float(new_height) / height
        else:
            new_width=int(math.floor((float(width) / height) * size))
            boxes *=float(new_width) / width
        return boxes
    def process_cv2_inputs(frames):
        """
        Normalize and prepare inputs as a list of tensors. Each tensor
        correspond to a unique pathway.
        Args:
            frames (list of array): list of input images (correspond to one clip) in range [0, 255].
            cfg (CfgNode): configs. Details can be found in
                slowfast/config/defaults.py
        """
        inputs=torch.from_numpy(np.array(frames)).float() / 255
        inputs=tensor_normalize(inputs, [0.45,0.45,0.45], [0.225,0.225,0.225])
        # T H W C -> C T H W.
        inputs=inputs.permute(3, 0, 1, 2)
        # Sample frames for num_frames specified.
        index=torch.linspace(0, inputs.shape[1] - 1, 32).long()
        print(index)
        inputs=torch.index_select(inputs, 1, index)
        fast_pathway=inputs
        slow_pathway=torch.index_select(
                inputs,
                1,
                torch.linspace(
                    0, inputs.shape[1] - 1, inputs.shape[1] // 4
                ).long(),
            )
        frame_list=[slow_pathway, fast_pathway]
        print(np.shape(frame_list[0]))
        inputs=[inp.unsqueeze(0) for inp in frame_list]
        return inputs
    # load the models
    yolov5=init()
    slowfast=TrtInference('/content/SLOWFAST_32x2_R101_50_50.engine',None)
    head=TrtInference_head('/content/SLOWFAST_head.engine',None)
    
    # load the video and start inference
    cap=cv2.VideoCapture("/content/atm_125.mp4")
    was_read=True
    while was_read:
        frames=[]
        seq_length=64
        while was_read and len(frames) < seq_length:
            was_read, frame=cap.read()
            frames.append(frame)
        
        bboxes=process_image(yolov5,frames[64//2])
        if bboxes is not None:
          frames=[cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) for frame in frames]
          frames=[scale(256, frame) for frame in frames]
          inputs=process_cv2_inputs(frames)
          print(bboxes)
          if bboxes is not None:
              bboxes=scale_boxes(256,bboxes,1080,1920)
              index_pad=torch.full(
                  size=(bboxes.shape[0], 1),
                  fill_value=float(0),
                  device=bboxes.device,
              )
              # Pad frame index for each box.
              bboxes=torch.cat([index_pad, bboxes], axis=1)
          for i in range(len(inputs)):
            inputs[i]=inputs[i].numpy()
          if bboxes is not None:
              outputs=slowfast.inference(inputs)
              outputs[0]=outputs[0].reshape(1,2048,16,29)
              outputs[1]=outputs[1].reshape(1,256,16,29)
              outputs[0]=torch.from_numpy(outputs[0])
              outputs[1]=torch.from_numpy(outputs[1])
              outputs[0]=roi_align(outputs[0],bboxes.to(dtype=outputs[0].dtype),7,1.0/16,0,True)
              outputs[1]=roi_align(outputs[1],bboxes.to(dtype=outputs[1].dtype),7,1.0/16,0,True)
              outputs[0]=outputs[0].numpy()
              outputs[1]=outputs[1].numpy()
              prd=head.inference(outputs)
              prd=prd.reshape(32,80)
              for i in range(80):
                if prd[0][i]>0.3:
                  print(i)
        else:
            print("沒有檢測到任何人物")
    

    Reading through the code above: slow_pathway and fast_pathway go through the SlowFast body model, the outputs are reshaped to the dimensions roi_align expects, and the reshaped features, together with the bboxes and the corresponding parameters, are fed into roi_align to produce the inputs required by the head model.

    7. SlowFast C++ TensorRT deployment

    7.1 YOLOv5 C++ object detection

    YOLOv5 itself is not covered here; I directly use the platform's built-in YOLOv5 TensorRT code:

    https://github.com/ExtremeMart/ev_sdk_demo4.0_pedestrian_intrusion_yolov5
    

    7.2 DeepSORT C++ object tracking

    This article builds on the following DeepSORT code:

    https://github.com/RichardoMrMu/deepsort-tensorrt
    

    Since this part is not the focus of this article, you only need to know how to use the code: write the CMakeLists file properly, and then call DeepSORT as follows.

    #include "deepsort.h" 
    /**
     DeepSortBox holds the detections produced by YOLOv5
     DeepSortBox entry structure:
     {
      x1,
      y1,
      x2,
      y2,
      score,
      label,
      trackID
     }
     img is the original image
     the final (tracked) results are written back into DeepSortBox
    */
    DS->sort(img, DeepSortBox); 
    

    7.3 SlowFast C++ action recognition

    Runtime environment:

    TensorRT 8.4

    OpenCV 4.1.1

    cuDNN 8.0

    CUDA 11.1

    Files needed:

    body.onnx

    head.onnx

    SlowFast inference flowchart

    We implement the TensorRT inference code following the prediction flowchart.

    Inspect the inputs and outputs of body.onnx with an ONNX visualizer.


    The inputs and outputs of head.onnx. (A small sketch for printing both programmatically is given below.)
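
    Since the visualization screenshots are not reproduced here, a small helper (a sketch assuming the onnx package is installed and that the exported files are named body.onnx and head.onnx; adjust the paths to your own) can print the same information:

    import onnx
    
    def print_io(path):
        # print the name and static shape of every graph input and output
        model=onnx.load(path)
        for t in model.graph.input:
            dims=[d.dim_value for d in t.type.tensor_type.shape.dim]
            print('input ', t.name, dims)
        for t in model.graph.output:
            dims=[d.dim_value for d in t.type.tensor_type.shape.dim]
            print('output', t.name, dims)
    
    print_io('body.onnx')
    print_io('head.onnx')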


    Step 1: model loading

    Load body.onnx and head.onnx with TensorRT and set up the TensorRT execution context. The code is as follows:

    void loadheadOnnx(const std::string strModelName)
    {
        Logger gLogger;
        // build the network following the TensorRT pipeline
        IBuilder* builder=createInferBuilder(gLogger);
        builder->setMaxBatchSize(1);
        const auto explicitBatch=1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);  
        INetworkDefinition* network=builder->createNetworkV2(explicitBatch);
        nvonnxparser::IParser* parser=nvonnxparser::createParser(*network, gLogger);
        parser->parseFromFile(strModelName.c_str(), static_cast<int>(ILogger::Severity::kWARNING));
        IBuilderConfig* config=builder->createBuilderConfig();
        config->setMaxWorkspaceSize(1ULL << 30);    
        m_CudaheadEngine=builder->buildEngineWithConfig(*network, *config);    
    
        std::string strTrtName=strModelName;
        size_t sep_pos=strTrtName.find_last_of(".");
        strTrtName=strTrtName.substr(0, sep_pos) + ".trt";
        IHostMemory *gieModelStream=m_CudaheadEngine->serialize();
        std::string serialize_str;
        std::ofstream serialize_output_stream;
        serialize_str.resize(gieModelStream->size());   
        memcpy((void*)serialize_str.data(),gieModelStream->data(),gieModelStream->size());
        serialize_output_stream.open(strTrtName.c_str());
        serialize_output_stream<<serialize_str;
        serialize_output_stream.close();
        m_CudaheadContext=m_CudaheadEngine->createExecutionContext();
        parser->destroy();
        network->destroy();
        config->destroy();
        builder->destroy();
    }
    

    Step 2: allocating buffers for the inputs and outputs

    body.onnx takes slow_pathway and fast_pathway as inputs, both of shape (B,C,T,H,W). slow_pathway has T = 8 and its output is (B,2048,16,29); fast_pathway has T = 32 and its output is (B,256,16,29). The head takes (32,2048,7,7) and (32,256,7,7) as inputs and outputs (32,80). The code is as follows:

     slow_pathway_InputIndex=m_CudaslowfastEngine->getBindingIndex(slow_pathway_NAME);
        fast_pathway_InputIndex=m_CudaslowfastEngine->getBindingIndex(fast_pathway_NAME);
        slow_pathway_OutputIndex=m_CudaslowfastEngine->getBindingIndex(slow_pathway_OUTPUT);
        fast_pathway_OutputIndex=m_CudaslowfastEngine->getBindingIndex(fast_pathway_OUTPUT); 
        dims_i=m_CudaslowfastEngine->getBindingDimensions(slow_pathway_InputIndex);
        SDKLOG(INFO)<<slow_pathway_InputIndex<<" "<<fast_pathway_InputIndex<<" "<<slow_pathway_OutputIndex<<" "<<fast_pathway_OutputIndex;
        SDKLOG(INFO) << "slow_pathway dims " << dims_i.d[0] << " " << dims_i.d[1] << " " << dims_i.d[2] << " " << dims_i.d[3]<< " " << dims_i.d[4];
        size=dims_i.d[0] * dims_i.d[1] * dims_i.d[2] * dims_i.d[3]* dims_i.d[4];
        cudaMalloc(&slowfast_ArrayDevMemory[slow_pathway_InputIndex], size * sizeof(float));
        slowfast_ArrayHostMemory[slow_pathway_InputIndex]=malloc(size * sizeof(float));
        slowfast_ArraySize[slow_pathway_InputIndex]=size* sizeof(float);
        
        dims_i=m_CudaslowfastEngine->getBindingDimensions(fast_pathway_InputIndex);
        SDKLOG(INFO) << "fast_pathway dims " << dims_i.d[0] << " " << dims_i.d[1] << " " << dims_i.d[2] << " " << dims_i.d[3]<< " " << dims_i.d[4];
        size=dims_i.d[0] * dims_i.d[1] * dims_i.d[2] * dims_i.d[3]* dims_i.d[4];
        cudaMalloc(&slowfast_ArrayDevMemory[fast_pathway_InputIndex], size * sizeof(float));
        slowfast_ArrayHostMemory[fast_pathway_InputIndex]=malloc(size * sizeof(float));
        slowfast_ArraySize[fast_pathway_InputIndex]=size* sizeof(float);
        
        
        dims_i=m_CudaslowfastEngine->getBindingDimensions(slow_pathway_OutputIndex);
        SDKLOG(INFO) << "slow_out dims " << dims_i.d[0] << " " << dims_i.d[1] << " " << dims_i.d[2] << " " << dims_i.d[3];
        size=dims_i.d[0] * dims_i.d[1] * dims_i.d[2] * dims_i.d[3];
        cudaMalloc(&slowfast_ArrayDevMemory[slow_pathway_OutputIndex], size * sizeof(float));
        slowfast_ArrayHostMemory[slow_pathway_OutputIndex]=malloc(size * sizeof(float));
        slowfast_ArraySize[slow_pathway_OutputIndex]=size* sizeof(float);
        
        
        
        dims_i=m_CudaslowfastEngine->getBindingDimensions(fast_pathway_OutputIndex);
        SDKLOG(INFO) << "fast_out dims " << dims_i.d[0] << " " << dims_i.d[1] << " " << dims_i.d[2] << " " << dims_i.d[3];
        size=dims_i.d[0] * dims_i.d[1] * dims_i.d[2] * dims_i.d[3];
        cudaMalloc(&slowfast_ArrayDevMemory[fast_pathway_OutputIndex], size * sizeof(float));
        slowfast_ArrayHostMemory[fast_pathway_OutputIndex]=malloc(size * sizeof(float));
        slowfast_ArraySize[fast_pathway_OutputIndex]=size* sizeof(float);
        
        
        
        size=32*2048*7*7;
        cudaMalloc(&ROIAlign_ArrayDevMemory[0], size * sizeof(float));
        ROIAlign_ArrayHostMemory[0]=malloc(size * sizeof(float));
        ROIAlign_ArraySize[0]=size* sizeof(float);
        
        size=32*256*7*7;
        cudaMalloc(&ROIAlign_ArrayDevMemory[1], size * sizeof(float));
        ROIAlign_ArrayHostMemory[1]=malloc(size * sizeof(float));
        ROIAlign_ArraySize[1]=size* sizeof(float);
        
        
        size=32*80;
        cudaMalloc(&ROIAlign_ArrayDevMemory[2], size * sizeof(float));
        ROIAlign_ArrayHostMemory[2]=malloc(size * sizeof(float));
        ROIAlign_ArraySize[2]=size* sizeof(float);
        size=32*5;
        boxes_data=malloc(size * sizeof(float));
        dims_i=m_CudaheadEngine->getBindingDimensions(0);
    

    Step 3: input preprocessing

    First, because the ONNX file was exported without dynamic shapes, the input image size is fixed at 256*455 (1080*1920 scaled proportionally). SlowFast expects RGB, so the image has to be converted from BGR to RGB and then resized to 256*455. The code is as follows:

      cv::Mat framesimg=img.clone();
            cv::cvtColor(framesimg, framesimg, cv::COLOR_BGR2RGB);
            int height=framesimg.rows;
            int width=framesimg.cols;
            // preprocess the image
            //cv2.COLOR_BGR2RGB
            int size=256;
            int new_width=width;
            int new_height=height;
            if ((width <=height && width==size) || (height <=width and height==size)){
                
            }
            else{
                new_width=size;
                new_height=size;
                if(width<height){
                    new_height=int((float(height) / width) * size);
                }else{  
                    new_width=int((float(width) / height) * size);
                }
                cv::resize(framesimg, framesimg, cv::Size{new_width,new_height},cv::INTER_LINEAR);
            } 
    

    Next, the image is normalized and laid out in C, T, H, W order, where C is the channel, T the frame index, H the image height, and W the image width. SlowFast has two inputs: fast_pathway takes 32 frames with shape (b,c,T,h,w) where T is 32, so one frame out of every two is added to fast_pathway; slow_pathway takes 8 frames with shape (b,c,T,h,w) where T is 8, so one frame out of every four is added to slow_pathway. The code is as follows:

      float *data=(float *)slowfast_ArrayHostMemory[fast_pathway_InputIndex];
            new_width=framesimg.cols;
            new_height=framesimg.rows;
            for (size_t c=0; c < 3; c++)
            {
                for (size_t  h=0; h < new_height; h++)
                {
                    for (size_t w=0; w < new_width; w++)
                    {
                        float v=((float)framesimg.at<cv::Vec3b>(h, w)[c]) / 255.0f;
                        v -=0.45;
                        v /=0.225;
                        data[c*32*256*455+fast_index* new_width * new_height + h * new_width + w]=v;
                    }
                }
            }
            fast_index++;
            if(frames==0||frames==8||frames==16||frames==26||frames==34||frames==44||frames==52||frames==63){
                data=(float *)slowfast_ArrayHostMemory[slow_pathway_InputIndex];
                for (size_t c=0; c < 3; c++)
                {
                    for (size_t  h=0; h < new_height; h++)
                    {
                        for (size_t w=0; w < new_width; w++)
                        {
                           float v=((float)framesimg.at<cv::Vec3b>(h, w)[c]) / 255.0f;
                            v -=0.45;
                            v /=0.225;
                            data[c*8*256*455+slow_index* new_width * new_height + h * new_width + w]=v;
                        }
                    }
                }  
                slow_index++;
            }
    

    Step 4: roi_align implementation

    As described in the previous section, roi_align is not implemented in the current TensorRT release. torchvision.ops does implement roi_align, so the Python inference code can call it directly, but the C++ code has to implement roi_align itself. The details are not covered here; you can simply think of roi_align as a crop-and-resize operation: it extracts the features corresponding to each bbox from the feature map and resizes them to 7*7. The code is as follows:

    void ROIAlignForwardCpu(const float* bottom_data, const float spatial_scale, const int num_rois,
                         const int height, const int width, const int channels,
                         const int aligned_height, const int aligned_width, const float * bottom_rois,
                         float* top_data)
    {
        const int output_size=num_rois * aligned_height * aligned_width * channels;
    
        int idx=0;
        for (idx=0; idx < output_size; ++idx)
        {
            int pw=idx % aligned_width;
            int ph=(idx / aligned_width) % aligned_height;
            int c=(idx / aligned_width / aligned_height) % channels;
            int n=idx / aligned_width / aligned_height / channels;  
    
            float roi_batch_ind=0; 
            float roi_start_w=bottom_rois[n * 5 + 1] * spatial_scale;
            float roi_start_h=bottom_rois[n * 5 + 2] * spatial_scale;
            float roi_end_w=bottom_rois[n * 5 + 3] * spatial_scale;
            float roi_end_h=bottom_rois[n * 5 + 4] * spatial_scale; 
            float roi_width=fmaxf(roi_end_w - roi_start_w + 1., 0.);
            float roi_height=fmaxf(roi_end_h - roi_start_h + 1., 0.);
            float bin_size_h=roi_height / (aligned_height - 1.);
            float bin_size_w=roi_width / (aligned_width - 1.);
    
            float h=(float)(ph) * bin_size_h + roi_start_h;
            float w=(float)(pw) * bin_size_w + roi_start_w;
    
            int hstart=fminf(floor(h), height - 2); 
            int wstart=fminf(floor(w), width - 2);
    
            int img_start=roi_batch_ind * channels * height * width; 
            if (h < 0 || h >=height || w < 0 || w >=width)  
            {
                top_data[idx]=0.; 
            }
            else
            {
                float h_ratio=h - (float)(hstart); 
                float w_ratio=w - (float)(wstart);
                int upleft=img_start + (c * height + hstart) * width + wstart;
                
                int upright=upleft + 1;
                int downleft=upleft + width; 
                int downright=downleft + 1; 
    
                top_data[idx]=bottom_data[upleft] * (1. - h_ratio) * (1. - w_ratio)
                    + bottom_data[upright] * (1. - h_ratio) * w_ratio
                    + bottom_data[downleft] * h_ratio * (1. - w_ratio)
                    + bottom_data[downright] * h_ratio * w_ratio;  
            }
        }
    }
    

    Step 5: inference

    First run the body network on the data prepared in Step 3, then use the roi_align function from Step 4 to extract the features corresponding to each bbox from the body outputs, and finally run the head model on the extracted features to get the output. The code is as follows:

    cudaMemcpyAsync(slowfast_ArrayDevMemory[slow_pathway_InputIndex], slowfast_ArrayHostMemory[slow_pathway_InputIndex], slowfast_ArraySize[slow_pathway_InputIndex], cudaMemcpyHostToDevice, m_CudaStream);
        cudaMemcpyAsync(slowfast_ArrayDevMemory[fast_pathway_InputIndex], slowfast_ArrayHostMemory[fast_pathway_InputIndex], slowfast_ArraySize[fast_pathway_InputIndex], cudaMemcpyHostToDevice, m_CudaStream);
        m_CudaslowfastContext->enqueueV2(slowfast_ArrayDevMemory , m_CudaStream, nullptr);    
       cudaMemcpyAsync(slowfast_ArrayHostMemory[slow_pathway_OutputIndex], slowfast_ArrayDevMemory[slow_pathway_OutputIndex], slowfast_ArraySize[slow_pathway_OutputIndex], cudaMemcpyDeviceToHost, m_CudaStream);
        cudaMemcpyAsync(slowfast_ArrayHostMemory[fast_pathway_OutputIndex], slowfast_ArrayDevMemory[fast_pathway_OutputIndex], slowfast_ArraySize[fast_pathway_OutputIndex], cudaMemcpyDeviceToHost, m_CudaStream);
        cudaStreamSynchronize(m_CudaStream);  
        data=(float*)slowfast_ArrayHostMemory[fast_pathway_OutputIndex];
        ROIAlignForwardCpu((float*)slowfast_ArrayHostMemory[slow_pathway_OutputIndex], 0.0625, 32,16,29, 2048,7, 7, (float*)boxes_data,       (float*)ROIAlign_ArrayHostMemory[0]);
        ROIAlignForwardCpu((float*)slowfast_ArrayHostMemory[fast_pathway_OutputIndex], 0.0625, 32,16,29, 256,7, 7, (float*)boxes_data,       (float*)ROIAlign_ArrayHostMemory[1]);
        data=(float*)ROIAlign_ArrayHostMemory[0];
        cudaMemcpyAsync(ROIAlign_ArrayDevMemory[0], ROIAlign_ArrayHostMemory[0], ROIAlign_ArraySize[0], cudaMemcpyHostToDevice, m_CudaStream);
        cudaMemcpyAsync(ROIAlign_ArrayDevMemory[1], ROIAlign_ArrayHostMemory[1], ROIAlign_ArraySize[1], cudaMemcpyHostToDevice, m_CudaStream);
        m_CudaheadContext->enqueueV2(ROIAlign_ArrayDevMemory, m_CudaStream, nullptr); 
        cudaMemcpyAsync(ROIAlign_ArrayHostMemory[2], ROIAlign_ArrayDevMemory[2], ROIAlign_ArraySize[2], cudaMemcpyDeviceToHost, m_CudaStream);
        cudaStreamSynchronize(m_CudaStream); 
    

    References
