TensorRTx
TensorRTx aims to implement popular deep learning networks with TensorRT network definition API.
Why don't we use a parser (ONNX parser, UFF parser, Caffe parser, etc.), but instead use complex APIs to build the networks from scratch? I have summarized the advantages below.
- Flexible: easy to modify the network, add/delete a layer or input/output tensor, replace a layer, merge layers, integrate preprocessing and postprocessing into the network, etc.
- Debuggable: the network is constructed incrementally, so it is easy to get intermediate layer results.
- Educational: you learn the network structure during this development, rather than treating everything as a black box.
The basic workflow of TensorRTx is:
- Get the trained models from pytorch, mxnet or tensorflow, etc. Some pytorch models can be found in my repo pytorchx; the rest are from popular open-source repos.
- Export the weights to a plain text file -- a .wts file.
- Load the weights in TensorRT, define the network, and build a TensorRT engine (a minimal sketch follows this list).
- Load the TensorRT engine and run inference.
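For steps 3 and 4, the per-model code in this repo uses the TensorRT C++ builder API. Below is a minimal sketch, assuming the TensorRT 7.x/8.x API; the toy one-conv network and weight names like `conv1.weight` are purely illustrative, not taken from any model here:

```cpp
#include <NvInfer.h>

#include <fstream>
#include <map>
#include <string>

using namespace nvinfer1;

// Read a .wts file into a map of tensor name -> Weights.
// First line: number of weight tensors. Each following line:
// <name> <element count> <hex-encoded float32 values ...>
std::map<std::string, Weights> loadWeights(const std::string& file) {
    std::map<std::string, Weights> weightMap;
    std::ifstream input(file);
    int32_t count;
    input >> count;
    while (count--) {
        Weights wt{DataType::kFLOAT, nullptr, 0};
        std::string name;
        uint32_t size;
        input >> name >> std::dec >> size;
        uint32_t* val = new uint32_t[size];  // raw bit patterns of float32 values
        for (uint32_t i = 0; i < size; ++i) {
            input >> std::hex >> val[i];
        }
        wt.values = val;
        wt.count = size;
        weightMap[name] = wt;
    }
    return weightMap;
}

// Step 3: define a (toy) network layer by layer and build an engine from it.
ICudaEngine* buildToyEngine(IBuilder* builder, const std::string& wtsPath) {
    std::map<std::string, Weights> weightMap = loadWeights(wtsPath);
    INetworkDefinition* network = builder->createNetworkV2(0U);

    ITensor* data = network->addInput("data", DataType::kFLOAT, Dims3{3, 224, 224});
    IConvolutionLayer* conv = network->addConvolutionNd(
        *data, 64, DimsHW{3, 3}, weightMap["conv1.weight"], weightMap["conv1.bias"]);
    conv->setStrideNd(DimsHW{1, 1});
    conv->setPaddingNd(DimsHW{1, 1});

    conv->getOutput(0)->setName("output");
    network->markOutput(*conv->getOutput(0));

    IBuilderConfig* config = builder->createBuilderConfig();
    config->setMaxWorkspaceSize(1 << 28);  // 256 MiB of build-time scratch space
    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);

    // Step 4 (not shown): serialize the engine to disk, deserialize it with
    // IRuntime at inference time, create an execution context and enqueue.
    network->destroy();
    config->destroy();
    return engine;
}
```

The real models in this repo follow the same pattern, just with many more layers: each model folder pairs an engine-building routine with an inference routine.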
News
- 10 May 2025: pranavm-nvidia: YOLO11 written in Tripy
- 2 May 2025: fazligorkembal: YOLO12
- 12 Apr 2025: pranavm-nvidia: First Lenet example written in Tripy
- 11 Apr 2025: mpj1234: YOLO11-obb
- 22 Oct 2024: lindsayshuo: YOLOv8-obb
- 18 Oct 2024: zgjja: Refactor docker image
- 11 Oct 2024: mpj1234: YOLO11
- 9 Oct 2024: Phoenix8215: GhostNet V1 and V2
- 21 Aug 2024: Lemonononon: real-esrgan-general-x4v3
- 29 Jul 2024: mpj1234: Check the YOLOv5, YOLOv8 & YOLOv10 in TensorRT 10.x API, branch → trt10
- 29 Jul 2024: mpj1234: YOLOv10
- 21 Jun 2024: WuxinrongY: YOLOv9-T, YOLOv9-S, YOLOv9-M
- 28 Apr 2024: lindsayshuo: YOLOv8-pose
- 22 Apr 2024: B1SH0PP: EfficientAd: Accurate Visual Anomaly Detection at Millisecond-Level Latencies
- 18 Apr 2024: lindsayshuo: YOLOv8-p2
Tutorials
- How to make a contribution
- Install the dependencies
- A guide for quickly getting started, taking lenet5 as a demo
- The .wts file content format
- Frequently Asked Questions (FAQ)
- Migrating from TensorRT 4 to 7
- How to implement multi-GPU processing, taking YOLOv4 as an example
- Check if your GPU supports FP16/INT8
- How to compile and run on Windows
- Deploy YOLOv4 with Triton Inference Server
- From pytorch to trt step by step, taking hrnet as an example (Chinese)
Test environment
- TensorRT 7.x
- TensorRT 8.x (some of the models support 8.x)
Each model folder has a README inside, which explains how to run the models in it.
The following models are implemented.
Name | Description |
---|---|
mlp | the very basic model for starters, properly documented |
lenet | the simplest, as a "hello world" of this project |
alexnet | easy to implement, all layers are supported in tensorrt |
googlenet | GoogLeNet (Inception v1) |
inception | Inception v3, v4 |
mnasnet | MNASNet with depth multiplier of 0.5 from the paper |
mobilenet | MobileNet v2, v3-small, v3-large |
resnet | resnet-18, resnet-50 and resnext50-32x4d are implemented |
senet | se-resnet50 |
shufflenet | ShuffleNet v2 with 0.5x output channels |
squeezenet | SqueezeNet 1.1 model |
vgg | VGG 11-layer model |
yolov3-tiny | weights and pytorch implementation from ultralytics/yolov3 |
yolov3 | darknet-53, weights and pytorch implementation from ultralytics/yolov3 |
yolov3-spp | darknet-53, weights and pytorch implementation from ultralytics/yolov3 |
yolov4 | CSPDarknet53, weights from AlexeyAB/darknet, pytorch implementation from ultralytics/yolov3 |
yolov5 | yolov5 v1.0-v7.0 of ultralytics/yolov5, detection, classification and instance segmentation |
yolov7 | yolov7 v0.1, pytorch implementation from WongKinYiu/yolov7 |
yolov8 | yolov8, pytorch implementation from ultralytics |
yolov9 | The Pytorch implementation is WongKinYiu/yolov9. |
yolov10 | The Pytorch implementation is THU-MIG/yolov10. |
yolo11 | The Pytorch implementation is ultralytics. |
yolo12 | The Pytorch implementation is ultralytics. |
yolop | yolop, pytorch implementation from hustvl/YOLOP |
retinaface | resnet50 and mobilenet0.25, weights from biubug6/Pytorch_Retinaface |
arcface | LResNet50E-IR, LResNet100E-IR and MobileFaceNet, weights from deepinsight/insightface |
retinafaceAntiCov | mobilenet0.25, weights from deepinsight/insightface, retinaface anti-COVID-19, detects face and mask attribute |
dbnet | Scene Text Detection, weights from BaofengZan/DBNet.pytorch |
crnn | pytorch implementation from meijieru/crnn.pytorch |
ufld | pytorch implementation from Ultra-Fast-Lane-Detection, ECCV2020 |
hrnet | hrnet-image-classification and hrnet-semantic-segmentation, pytorch implementation from HRNet-Image-Classification and HRNet-Semantic-Segmentation |
psenet | PSENet Text Detection, tensorflow implementation from liuheng92/tensorflow_PSENet |
ibnnet | IBN-Net, pytorch implementation from XingangPan/IBN-Net, ECCV2018 |
unet | U-Net, pytorch implementation from milesial/Pytorch-UNet |
repvgg | RepVGG, pytorch implementation from DingXiaoH/RepVGG |
lprnet | LPRNet, pytorch implementation from xuexingyu24/License_Plate_Detection_Pytorch |
refinedet | RefineDet, pytorch implementation from luuuyi/RefineDet.PyTorch |
densenet | DenseNet-121, from torchvision.models |
rcnn | FasterRCNN and MaskRCNN, model from detectron2 |
tsm | TSM: Temporal Shift Module for Efficient Video Understanding, ICCV2019 |
scaled-yolov4 | yolov4-csp, pytorch from WongKinYiu/ScaledYOLOv4 |
centernet | CenterNet DLA-34, pytorch from xingyizhou/CenterNet |
efficientnet | EfficientNet b0-b8 and l2, pytorch from lukemelas/EfficientNet-PyTorch |
detr | DE⫶TR, pytorch from facebookresearch/detr |
swin-transformer | Swin Transformer - Semantic Segmentation, only supports Swin-T. The Pytorch implementation is microsoft/Swin-Transformer |
real-esrgan | Real-ESRGAN. The Pytorch implementation is real-esrgan |
superpoint | SuperPoint. The Pytorch model is from magicleap/SuperPointPretrainedNetwork |
csrnet | CSRNet. The Pytorch implementation is leeyeehoo/CSRNet-pytorch |
EfficientAd | EfficientAd: Accurate Visual Anomaly Detection at Millisecond-Level Latencies. From anomalib |
The .wts files can be downloaded from the model zoo for quick evaluation, but it is recommended to convert the .wts from a pytorch/mxnet/tensorflow model yourself, so that you can deploy your own retrained models.
GoogleDrive | BaiduPan (pwd: uvv2)
Some tricky operations encountered in these models have already been solved, but there might be better solutions.
Name | Description |
---|---|
BatchNorm | Implemented by a scale layer, used in resnet, googlenet, mobilenet, etc. |
MaxPool2d(ceil_mode=True) | use a padding layer before maxpool to handle ceil_mode=True, see googlenet. |
average pool with padding | use setAverageCountExcludesPadding() when necessary, see inception. |
relu6 | use Relu6(x) = Relu(x) - Relu(x-6), see mobilenet and the sketch after this table. |
torch.chunk() | implement 'chunk(2, dim=C)' with a tensorrt plugin, see shufflenet. |
channel shuffle | use two shuffle layers to implement channel_shuffle, see shufflenet. |
adaptive pool | use a fixed input dimension, and use regular average pooling, see shufflenet. |
leaky relu | I wrote a leaky relu plugin, but PRelu in NvInferPlugin.h can be used instead, see yolov3 in branch trt4. |
yolo layer v1 | yolo layer is implemented as a plugin, see yolov3 in branch trt4. |
yolo layer v2 | three yolo layers implemented in one plugin, see yolov3-spp. |
upsample | replaced by a deconvolution layer, see yolov3. |
hsigmoid | hard sigmoid is implemented as a plugin; hsigmoid and hswish are used in mobilenetv3. |
retinaface output decode | implement a plugin to decode bbox, confidence and landmarks, see retinaface. |
mish | mish activation is implemented as a plugin; mish is used in yolov4. |
prelu | mxnet's prelu activation with trainable gamma is implemented as a plugin, used in arcface. |
HardSwish | hard_swish = x * hard_sigmoid, used in yolov5 v3.0. |
LSTM | Implemented pytorch nn.LSTM() with the tensorrt API. |
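To give a flavour of how these workarounds look in the network definition API, here is a sketch of the Relu6 trick from the table, roughly as done in the mobilenet implementation (assuming the TensorRT C++ API; the helper name `addRelu6` is illustrative):

```cpp
#include <NvInfer.h>

using namespace nvinfer1;

// Relu6(x) = Relu(x) - Relu(x - 6), built only from standard layers.
ITensor* addRelu6(INetworkDefinition* network, ITensor& input) {
    // Relu(x)
    IActivationLayer* relu1 = network->addActivation(input, ActivationType::kRELU);

    // x - 6, via a uniform scale layer: out = (in * scale + shift) ^ power
    static float shiftVal = -6.0f, scaleVal = 1.0f, powerVal = 1.0f;
    Weights shift{DataType::kFLOAT, &shiftVal, 1};
    Weights scale{DataType::kFLOAT, &scaleVal, 1};
    Weights power{DataType::kFLOAT, &powerVal, 1};
    IScaleLayer* shifted = network->addScale(input, ScaleMode::kUNIFORM, shift, scale, power);

    // Relu(x - 6)
    IActivationLayer* relu2 =
        network->addActivation(*shifted->getOutput(0), ActivationType::kRELU);

    // Relu(x) - Relu(x - 6)
    IElementWiseLayer* diff = network->addElementWise(
        *relu1->getOutput(0), *relu2->getOutput(0), ElementWiseOperation::kSUB);
    return diff->getOutput(0);
}
```

On TensorRT 5.1 and newer, the same result can also be obtained with a single kCLIP activation (alpha = 0, beta = 6).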
Speed benchmark:
Models | Device | BatchSize | Mode | Input Shape (HxW) | FPS |
---|---|---|---|---|---|
YOLOv3-tiny | Xeon E5-2620/GTX1080 | 1 | FP32 | 608x608 | 333 |
YOLOv3(darknet53) | Xeon E5-2620/GTX1080 | 1 | FP32 | 608x608 | 39.2 |
YOLOv3(darknet53) | Xeon E5-2620/GTX1080 | 1 | INT8 | 608x608 | 71.4 |
YOLOv3-spp(darknet53) | Xeon E5-2620/GTX1080 | 1 | FP32 | 608x608 | 38.5 |
YOLOv4(CSPDarknet53) | Xeon E5-2620/GTX1080 | 1 | FP32 | 608x608 | 35.7 |
YOLOv4(CSPDarknet53) | Xeon E5-2620/GTX1080 | 4 | FP32 | 608x608 | 40.9 |
YOLOv4(CSPDarknet53) | Xeon E5-2620/GTX1080 | 8 | FP32 | 608x608 | 41.3 |
YOLOv5-s v3.0 | Xeon E5-2620/GTX1080 | 1 | FP32 | 608x608 | 142 |
YOLOv5-s v3.0 | Xeon E5-2620/GTX1080 | 4 | FP32 | 608x608 | 173 |
YOLOv5-s v3.0 | Xeon E5-2620/GTX1080 | 8 | FP32 | 608x608 | 190 |
YOLOv5-m v3.0 | Xeon E5-2620/GTX1080 | 1 | FP32 | 608x608 | 71 |
YOLOv5-l v3.0 | Xeon E5-2620/GTX1080 | 1 | FP32 | 608x608 | 43 |
YOLOv5-x v3.0 | Xeon E5-2620/GTX1080 | 1 | FP32 | 608x608 | 29 |
YOLOv5-s v4.0 | Xeon E5-2620/GTX1080 | 1 | FP32 | 608x608 | 142 |
YOLOv5-m v4.0 | Xeon E5-2620/GTX1080 | 1 | FP32 | 608x608 | 71 |
YOLOv5-l v4.0 | Xeon E5-2620/GTX1080 | 1 | FP32 | 608x608 | 40 |
YOLOv5-x v4.0 | Xeon E5-2620/GTX1080 | 1 | FP32 | 608x608 | 27 |
RetinaFace(resnet50) | Xeon E5-2620/GTX1080 | 1 | FP32 | 480x640 | 90 |
RetinaFace(resnet50) | Xeon E5-2620/GTX1080 | 1 | INT8 | 480x640 | 204 |
RetinaFace(mobilenet0.25) | Xeon E5-2620/GTX1080 | 1 | FP32 | 480x640 | 417 |
ArcFace(LResNet50E-IR) | Xeon E5-2620/GTX1080 | 1 | FP32 | 112x112 | 333 |
CRNN | Xeon E5-2620/GTX1080 | 1 | FP32 | 32x100 | 1000 |
Help wanted: if you have speed test results, please share them via an issue or PR.
Any contributions, questions and discussions are welcome; contact me via the info below.
E-mail: wangxinyu_es@163.com
WeChat ID: wangxinyu0375 (you can add me on WeChat to join the tensorrtx discussion group; please note "tensorrtx" in the request)