UniTR: The First Unified Multi-modal Transformer Backbone for 3D Perception

This repo is the official implementation of the ICCV 2023 paper UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation, as well as its follow-ups. UniTR achieves state-of-the-art performance on the nuScenes dataset with a truly unified, weight-sharing multi-modal (e.g., cameras and LiDARs) backbone. UniTR is built upon the codebase of DSVT. We have made every effort to ensure that the codebase is clean, concise, easily readable, state-of-the-art, and relies only on minimal dependencies.

UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation

Haiyang Wang, Hao Tang, Shaoshuai Shi $^\dagger$, Aoxue Li, Zhenguo Li, Bernt Schiele, Liwei Wang $^\dagger$

Contact: Haiyang Wang (wanghaiyang6@stu.pku.edu.cn), Hao Tang (tanghao@stu.pku.edu.cn), Shaoshuai Shi (shaoshuaics@gmail.com)

Gratitude to Tang Hao for extensive code refactoring and noteworthy contributions to open-source initiatives. His invaluable efforts were pivotal in ensuring the seamless completion of UniTR.

Honestly, the partition in UniTR is slow and takes about 40% of the total time, but this can be optimized to zero with better strategies or some engineering efforts, indicating that there is still huge room for speed optimization. We're not HPC experts, but if anyone in the industry wants to improve this, we believe it could be halved. Importantly, this part doesn't scale with model size, making it friendly for larger models.

I am going to share my understanding and future plan of the general 3D perception foundation model without reservation. Please refer to Potential Research. If you find it useful for your research or inspiring, feel free to join me in building this blueprint.

Interpretive Articles: [CVer] [自动驾驶之心] [ReadPaper] [知乎] [CSDN] [TechBeat (将门创投)]

News

[23-09-21] Code of NuScenes is released.
[23-08-16] SOTA: Our single multi-modal UniTR outshines all other non-TTA approaches on the nuScenes Detection benchmark (Aug 2023) in terms of NDS (74.5).
[23-08-16] SOTA performance of multi-modal 3D object detection and BEV Map Segmentation on the NuScenes validation set.
[23-08-15] UniTR is released on arXiv.
[23-07-13] UniTR is accepted at ICCV 2023.

TODO

[x] Release the arXiv version.
[x] SOTA performance of multi-modal 3D object detection (NuScenes) and BEV Map Segmentation (NuScenes).
[x] Clean up and release the code of NuScenes.
[ ] Merge UniTR to OpenPCDet.

Introduction

Jointly processing information from multiple sensors is crucial to achieving accurate and robust perception for reliable autonomous driving systems. However, current 3D perception research follows a modality-specific paradigm, leading to additional computation overheads and inefficient collaboration between different sensor data.

In this paper, we present an efficient multi-modal backbone for outdoor 3D perception, which processes a variety of modalities with unified modeling and shared parameters. It is a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks. It sets a new state-of-the-art performance on the nuScenes benchmark, achieving +1.1 NDS higher for 3D object detection and +12.0 mIoU higher for BEV map segmentation with lower inference latency.

Main results

3D Object Detection (on NuScenes validation)

Model	NDS	mAP	mATE	mASE	mAOE	mAVE	mAAE	ckpt	Log
UniTR	73.0	70.1	26.3	24.7	26.8	24.6	17.9	ckpt	Log
UniTR+LSS	73.3	70.5	26.0	24.4	26.8	24.8	18.7	ckpt	Log
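
As a sanity check on how the columns relate, NDS can be recomputed from the other metrics with the nuScenes detection score definition, NDS = (5 x mAP + sum of (1 - min(1, mTP))) / 10 over the five true-positive error metrics. The sketch below assumes the error columns are reported multiplied by 100 (so 26.3 stands for 0.263); the function name `nds` is ours and is not part of the codebase. Because the table entries are rounded, a recomputed value can differ from the reported one by about 0.1.

```python
def nds(map_pct, tp_errors_pct):
    """Recompute the nuScenes detection score (in percent) from table entries.

    map_pct:        mAP in percent, e.g. 70.1
    tp_errors_pct:  the five TP errors (mATE, mASE, mAOE, mAVE, mAAE),
                    assumed to be reported multiplied by 100
    """
    m_ap = map_pct / 100.0
    # Each error is clipped to 1 and turned into a score in [0, 1].
    tp_scores = [1.0 - min(1.0, e / 100.0) for e in tp_errors_pct]
    return 100.0 * (5.0 * m_ap + sum(tp_scores)) / 10.0


# UniTR row of the validation table above.
unitr_val = nds(70.1, [26.3, 24.7, 26.8, 24.6, 17.9])
print(f"recomputed NDS: {unitr_val:.1f}")
```

For the UniTR row this reproduces the reported 73.0.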

3D Object Detection (on NuScenes test)

Model	NDS	mAP	mATE	mASE	mAOE	mAVE	mAAE
UniTR	74.1	70.5	24.4	23.3	25.7	24.1	13.0
UniTR+LSS	74.5	70.9	24.1	22.9	25.6	24.0	13.1

BEV Map Segmentation (on NuScenes validation)

Model	mIoU	Drivable	Ped.Cross.	Walkway	StopLine	Carpark	Divider	ckpt	Log
UniTR	73.2	90.4	73.1	78.2	66.6	67.3	63.8	ckpt	Log
UniTR+LSS	74.7	90.7	74.0	79.3	68.2	72.9	64.2	ckpt	Log

What's new here?

Beats previous SOTAs of outdoor multi-modal 3D Object Detection and BEV Segmentation

Our approach has achieved the best performance on multiple tasks (e.g., 3D Object Detection and BEV Map Segmentation), and it is highly versatile, requiring only the replacement of the backbone.

3D Object Detection

BEV Map Segmentation

Weight-Sharing among all modalities

We introduce a modality-agnostic transformer encoder to handle these view-discrepant sensor data for parallel modal-wise representation learning and automatic cross-modal interaction without additional fusion steps.
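
To make the weight-sharing idea concrete, here is a toy sketch in plain Python. It is not the UniTR implementation: one set of parameters processes camera tokens and LiDAR tokens alike, and a second pass over the concatenated tokens stands in for cross-modal interaction. The class `SharedBlock`, the mean-based mixing step, and the token values are all illustrative assumptions.

```python
class SharedBlock:
    """Toy 'encoder block': a per-token linear map plus mean-based token mixing.

    This only illustrates parameter sharing; the real backbone uses
    transformer blocks, which are not reproduced here.
    """

    def __init__(self, weight):
        self.weight = weight  # dim x dim matrix shared by every modality

    def project(self, token):
        # Plain matrix-vector product with the shared weight.
        return [sum(w * x for w, x in zip(row, token)) for row in self.weight]

    def __call__(self, tokens):
        projected = [self.project(t) for t in tokens]
        dim = len(projected[0])
        # Mean over tokens acts as a crude stand-in for attention-style mixing.
        mean = [sum(t[d] for t in projected) / len(projected) for d in range(dim)]
        return [[t[d] + mean[d] for d in range(dim)] for t in projected]


block = SharedBlock(weight=[[1.0, 0.5], [0.0, 1.0]])  # one parameter set

camera_tokens = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical image patch features
lidar_tokens = [[2.0, 2.0]]               # hypothetical voxel features

# Intra-modal step: each modality is processed separately with the same weights.
camera_out = block(camera_tokens)
lidar_out = block(lidar_tokens)

# Cross-modal step: tokens from both sensors are mixed in a single call.
fused = block(camera_out + lidar_out)
print(len(fused), "fused tokens from", len(camera_tokens), "camera and",
      len(lidar_tokens), "lidar tokens")
```

The point of the sketch is that no modality-specific branch or separate fusion module appears anywhere: the same `block` object handles both inputs and their union.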

Prerequisite for 3D vision foundation models

A weight-shared unified multimodal encoder is a prerequisite for foundation models, especially in the context of 3D perception, unifying information from both images and LiDAR data. This is the first truly multimodal fusion backbone, seamlessly connecting to any 3D detection head.

Quick Start

Installation

conda create -n unitr python=3.8
# Install torch, we only test it in pytorch 1.10
pip install torch==1.10.1+cu113 torchvision==0.11.2+cu113 -f https://download.pytorch.org/whl/torch_stable.html

git clone https://github.com/Haiyang-W/UniTR
cd UniTR

# Install extra dependency
pip install -r requirements.txt

# Install nuscenes-devkit
pip install nuscenes-devkit==1.0.5

# Develop
python setup.py develop
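
Since only PyTorch 1.10 was tested, it can help to confirm what actually got installed before building. The following is a small sketch using only the standard library; the helper names `major_minor` and `check_install` are hypothetical and not part of the repo.

```python
from importlib import metadata


def major_minor(version):
    """Return (major, minor) from a version string such as '1.10.1+cu113'."""
    public = version.split("+")[0]  # drop the local tag, e.g. '+cu113'
    major, minor = public.split(".")[:2]
    return int(major), int(minor)


def check_install(package, tested):
    """Describe whether `package` is installed and matches the tested release."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package}: not installed"
    if major_minor(installed) == tested:
        return f"{package} {installed}: matches the tested {tested[0]}.{tested[1]} release"
    return f"{package} {installed}: untested (tested with {tested[0]}.{tested[1]})"


print(check_install("torch", (1, 10)))
print(check_install("torchvision", (0, 11)))
```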

Dataset Preparation

Please download the official NuScenes 3D object detection dataset and organize the downloaded files as follows:

OpenPCDet
├── data
│   ├── nuscenes
│   │   │── v1.0-trainval (or v1.0-mini if you use mini)
│   │   │   │── samples
│   │   │   │── sweeps
│   │   │   │── maps
│   │   │   │── v1.0-trainval
├── pcdet
├── tools
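
Before generating the data infos, the layout above can be checked with a few lines of Python. This is a minimal sketch: the helper `missing_dirs` is hypothetical, and it assumes that the innermost metadata folder carries the same name as the chosen version (e.g. `v1.0-mini` for the mini split).

```python
from pathlib import Path


def missing_dirs(repo_root, version="v1.0-trainval"):
    """Return the expected nuScenes sub-folders that are absent under repo_root."""
    base = Path(repo_root) / "data" / "nuscenes" / version
    # Sub-folders from the layout above; the last one is the metadata folder.
    expected = ("samples", "sweeps", "maps", version)
    return [name for name in expected if not (base / name).is_dir()]


missing = missing_dirs(".")  # run from the repository root
print("layout looks complete" if not missing else f"missing: {missing}")
```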

(optional) To install the Map expansion for the BEV map segmentation task, please download the files from Map expansion (Map expansion pack (v1.3)) and copy the files into your nuScenes maps folder, e.g. /data/nuscenes/v1.0-trainval/maps, as follows:

OpenPCDet
├── maps
│   ├── ......
│   ├── boston-seaport.json
│   ├── singapore-onenorth.json
│   ├── singapore-queenstown.json
│   ├── singapore-hollandvillage.json

Generate the data infos by running the following command (it may take several hours):

```shell
# Create dataset info file, lidar and image gt database
python -m pcdet.datasets.nuscenes.nuscenes_dataset --func create_nuscenes_infos \
    --cfg_file tools/cfgs/dataset_configs/nuscenes_dataset.yaml \
    --version v1.0-trainval \
    --with_cam \
    --with_cam_gt \
    # --share_memory # if use share mem for lidar and image gt sampling (about 24G+143G or 12G+72G)
# share mem will greatly improve your training speed, but need 150G or 75G extra cache mem.
# NOTE: all the experiments used share memory. Share mem will not affect performance
```

* The format of the generated data is as follows:

OpenPCDet
├── data
│   ├── nuscenes
│   │   │── v1.0-trainval (or v1.0-mini if you use mini)
│   │   │   │── samples
│   │   │   │── sweeps
│   │   │   │── maps
│   │   │   │── v1.0-trainval
│   │   │   │── img_gt_database_10sweeps_withvelo
│   │   │   │── gt_database_10sweeps_withvelo
│   │   │   │── nuscenes_10sweeps_withvelo_lidar.npy (optional) # if open share mem
│   │   │   │── nuscenes_10sweeps_withvelo_img.npy (optional) # if open share mem
│   │   │   │── nuscenes_infos_10sweeps_train.pkl
│   │   │   │── nuscenes_infos_10sweeps_val.pkl
│   │   │   │── nuscenes_dbinfos_10sweeps_withvelo.pkl
├── pcdet
├── tools
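
Once generation finishes, a quick look at the info files confirms that they were written and are readable. The sketch below assumes each info file unpickles to a list with one entry per sample, which is the usual OpenPCDet convention but is not stated in this README; `count_infos` is a hypothetical helper.

```python
import pickle
from pathlib import Path


def count_infos(pkl_path):
    """Load an info file and return how many entries it holds.

    Assumes the file unpickles to a list with one entry per sample.
    """
    with open(pkl_path, "rb") as f:
        infos = pickle.load(f)
    return len(infos)


data_dir = Path("data/nuscenes/v1.0-trainval")  # relative to the repository root
for name in ("nuscenes_infos_10sweeps_train.pkl", "nuscenes_infos_10sweeps_val.pkl"):
    path = data_dir / name
    if path.is_file():
        print(f"{name}: {count_infos(path)} entries")
    else:
        print(f"{name}: not found")
```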

### Training
Please download the pretrained checkpoint from [unitr_pretrain.pth](https://drive.google.com/u/0/uc?id=1SJQRI4TAKuO2GwqJ4otzMo7qGGjlBQ9u&export=download) and copy the file under the root folder, e.g. `UniTR/unitr_pretrain.pth`. This file contains the weights of DSVT pretrained on the ImageNet and nuImages datasets.

3D object detection:

# multi-gpu training
# normal
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr.yaml --sync_bn --pretrained_model ../unitr_pretrain.pth --logger_iter_interval 1000

# add lss
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+lss.yaml --sync_bn --pretrained_model ../unitr_pretrain.pth --logger_iter_interval 1000

BEV Map Segmentation:

# multi-gpu training
# note that we don't use image pretrain in BEV Map Segmentation
# normal
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map.yaml --sync_bn --eval_map --logger_iter_interval 1000

# add lss
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map+lss.yaml --sync_bn --eval_map --logger_iter_interval 1000

### Testing

3D object detection:

# multi-gpu testing
# normal
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr.yaml --ckpt <CHECKPOINT_FILE>

# add LSS
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+lss.yaml --ckpt <CHECKPOINT_FILE>

BEV Map Segmentation:

# multi-gpu testing
# normal
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map.yaml --ckpt <CHECKPOINT_FILE> --eval_map

# add LSS
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map+lss.yaml --ckpt <CHECKPOINT_FILE> --eval_map

NOTE: evaluation results will not be logged in *.log; they are only printed in the terminal.
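
When evaluating several checkpoints, the four test invocations above differ only in the config file and the `--eval_map` flag, so they can be generated rather than typed by hand. The helper below is a sketch of that idea; `build_test_command` and the short task keys are hypothetical, and the checkpoint path is a placeholder rather than a file shipped with the repo.

```python
import shlex

# Config files named in the commands above (paths relative to tools/).
CONFIGS = {
    "det": "./cfgs/nuscenes_models/unitr.yaml",
    "det+lss": "./cfgs/nuscenes_models/unitr+lss.yaml",
    "map": "./cfgs/nuscenes_models/unitr_map.yaml",
    "map+lss": "./cfgs/nuscenes_models/unitr_map+lss.yaml",
}


def build_test_command(task, ckpt, num_gpus=8):
    """Build the dist_test.sh argument list for one of the four setups."""
    cmd = ["bash", "scripts/dist_test.sh", str(num_gpus),
           "--cfg_file", CONFIGS[task], "--ckpt", ckpt]
    if task.startswith("map"):
        cmd.append("--eval_map")  # only the BEV map segmentation runs use this flag
    return cmd


for task in CONFIGS:
    # Placeholder checkpoint path; replace it with your own file.
    print(shlex.join(build_test_command(task, "path/to/checkpoint.pth")))
```

The resulting lists can be passed to `subprocess.run(cmd, cwd="tools")` if you prefer to launch the runs from Python.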

### Cache Testing
- If the camera and LiDAR parameters of the dataset you are using remain constant, then using our cache mode will not affect performance. You can even cache all mapping calculations during the training phase, which can significantly accelerate training.
- Each sample in NuScenes will `have some variations in camera parameters`, and during normal inference, we disable the cache mode to ensure result accuracy. However, due to the robustness of our mapping, even in scenarios with camera parameter variations like NuScenes, the performance will only drop slightly (around 0.4 NDS).
- Cache mode currently supports only a batch size of 1 per GPU, i.e. `--batch_size 8` on 8 GPUs (8x1=8).
- Backbone caching reduces inference latency by about 40% in our observation.
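
The caching idea in the bullets above can be illustrated with a small memoization sketch: the camera-LiDAR mapping is recomputed only when the calibration changes. Everything here is hypothetical (the `MappingCache` class, the `compute_mapping` stand-in, the calibration tuples), and it does not mirror the repo's cache mode, which according to the text above reuses the cached mapping even on data whose camera parameters vary slightly.

```python
class MappingCache:
    """Recompute an expensive sensor mapping only when the calibration changes."""

    def __init__(self, compute_fn):
        self.compute_fn = compute_fn
        self.store = {}
        self.misses = 0

    def get(self, calibration):
        key = tuple(calibration)  # calibration must be hashable to act as a key
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.compute_fn(calibration)
        return self.store[key]


def compute_mapping(calibration):
    # Stand-in for the real camera-LiDAR mapping computation.
    return [2.0 * c for c in calibration]


cache = MappingCache(compute_mapping)

# Constant calibration: computed once, then reused for every sample.
for _ in range(100):
    cache.get((0.1, 0.2, 0.3))
print("misses with constant calibration:", cache.misses)

# Calibration that varies per sample: every sample is a miss.
varying = MappingCache(compute_mapping)
for i in range(100):
    varying.get((0.1, 0.2, 0.3 + 1e-3 * i))
print("misses with varying calibration:", varying.misses)
```

With constant calibration the mapping is computed once; with per-sample variation an exact-match cache never hits, which is why datasets like NuScenes need either recomputation or an approximate reuse that trades a little accuracy for speed.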

# Only for 3D Object Detection
# normal
# cache the mapping computation of multi-modal backbone
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_cache.yaml --ckpt <CHECKPOINT_FILE> --batch_size 8

# add LSS
# cache the mapping computation of multi-modal backbone
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+lss_cache.yaml --ckpt <CHECKPOINT_FILE> --batch_size 8

# add LSS
# cache the mapping computation of multi-modal backbone and LSS
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+lss_cache_plus.yaml --ckpt <CHECKPOINT_FILE> --batch_size 8

#### Performance of cache testing on NuScenes validation (some variations in camera parameters)
| Model | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE |
|---------|---------|--------|---------|---------|--------|---------|--------|
| [UniTR (Cache Backbone)](https://github.com/Haiyang-W/UniTR/blob/main/tools/cfgs/nuscenes_models/unitr_cache.yaml) | 72.6(-0.4) | 69.4(-0.7) | 26.9 | 24.8 | 26.3 | 24.6 | 18.2 |
| [UniTR+LSS (Cache Backbone)](https://github.com/Haiyang-W/UniTR/blob/main/tools/cfgs/nuscenes_models/unitr%2Blss_cache.yaml) | 73.1(-0.2) | 70.2(-0.3) | 25.8 | 24.4 | 26.0 | 25.3 | 18.2 |
| [UniTR+LSS (Cache Backbone and LSS)](https://github.com/Haiyang-W/UniTR/blob/main/tools/cfgs/nuscenes_models/unitr%2Blss_cache_plus.yaml) | 72.6(-0.7) | 69.3(-1.2) | 26.7 | 24.3 | 25.9 | 25.3 | 18.2 |

## Potential Research
* **Infrastructure of 3D Vision Foundation Model.**
  An efficient network design is crucial for large models. With a reliable model structure, the development of large models can be advanced. An open question is how to make a general multimodal backbone more efficient and easier to deploy. Honestly, the partition in UniTR is slow and takes about 40% of the total time, but this can be optimized to zero with better `partition strategies` or `some engineering efforts`, indicating that there is still huge room for speed optimization. We're not HPC experts, but if anyone in the industry wants to improve this, we believe it could be halved. Importantly, this part doesn't scale with model size, making it friendly for larger models.
* **Multi-Modal Self-supervised Learning based on Image-Lidar pair and UniTR.**
  Please refer to the following figure. The images and point clouds both describe the same 3D scene; they complement each other in terms of highly informative correspondence. This allows for the unsupervised learning of more generic scene representation with shared parameters.
* **Single-Modal Pretraining.** Our model is almost the same as ViT (except for some position embedding strategies). If we adjust the position embedding appropriately, DSVT and UniTR can directly load the pretrained parameters of ViT. This is beneficial for better integration with the 2D community.
* **Unified Modeling of 3D Vision.**
  Please refer to the following figure.
<div align="center">
  <img src="assets/Figure6.png" width="800"/>
</div>
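
A quick back-of-the-envelope calculation puts numbers on the partition overhead mentioned in the first item of the list above. If partitioning accounts for about 40% of the runtime and everything else stays fixed, Amdahl's law gives the overall speedup for a given improvement of that part. Reading "halved" as halving the partition cost is our interpretation, and the function name is hypothetical.

```python
def overall_speedup(fraction, factor):
    """Amdahl's law: speedup when `fraction` of the runtime gets `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


partition_share = 0.40  # share of total time spent in partitioning, per the text

halved = overall_speedup(partition_share, 2.0)
removed = overall_speedup(partition_share, float("inf"))
print(f"partition cost halved:  {halved:.2f}x")
print(f"partition cost removed: {removed:.2f}x")
```

Under these assumptions, halving the partition cost gives about 1.25x end to end, and eliminating it entirely gives about 1.67x.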

## Possible Issues
* If you encounter gradients that become NaN during fp16 training: fp16 training is not supported.
* If you can't find a solution, search the open and closed issues on our GitHub issues page [here](https://github.com/Haiyang-W/UniTR/issues).
* We enable the torch checkpoint option [here](https://github.com/Haiyang-W/UniTR/blob/3f75dc1a362fe8f325dabd2e878ac57df2ab7323/tools/cfgs/nuscenes_models/unitr.yaml#L125) by default during training, which saves about 50% of CUDA memory.
* Samples in NuScenes have some variations in camera parameters. So, during training, every sample recalculates the camera-lidar mapping, which significantly slows down training (~40%). If the extrinsic parameters in your dataset are consistent, I recommend caching this computation during training.
* If you are still out of luck, open a new issue on our GitHub. Our turnaround is usually a couple of days.

## Citation
Please consider citing our work as follows if it is helpful.

@inproceedings{wang2023unitr,
    title={UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation},
    author={Haiyang Wang and Hao Tang and Shaoshuai Shi and Aoxue Li and Zhenguo Li and Bernt Schiele and Liwei Wang},
    booktitle={ICCV},
    year={2023}
}

Acknowledgments

UniTR uses code from a few open source repositories. Without the efforts of these folks (and their willingness to release their implementations), UniTR would not be possible. We thank these authors for their efforts!

Shaoshuai Shi: OpenPCDet
Chen Shi: DSVT
Zhijian Liu: BevFusion
