UniTR: The First Unified Multi-modal Transformer Backbone for 3D Perception
This repo is the official implementation of the ICCV 2023 paper UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation, as well as its follow-ups. UniTR achieves state-of-the-art performance on the nuScenes dataset with a truly unified, weight-sharing multi-modal (e.g., cameras and LiDARs) backbone. UniTR is built upon the codebase of DSVT. We have made every effort to ensure that the codebase is clean, concise, easily readable, state-of-the-art, and reliant on only minimal dependencies.
UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation
Haiyang Wang, Hao Tang, Shaoshuai Shi $^\dagger$, Aoxue Li, Zhenguo Li, Bernt Schiele, Liwei Wang $^\dagger$
Contact: Haiyang Wang (wanghaiyang6@stu.pku.edu.cn), Hao Tang (tanghao@stu.pku.edu.cn), Shaoshuai Shi (shaoshuaics@gmail.com)
Gratitude to Tang Hao for extensive code refactoring and noteworthy contributions to open-source initiatives. His invaluable efforts were pivotal in ensuring the seamless completion of UniTR.
Honestly, the partition in UniTR is slow and takes about 40% of the total time, but this overhead can be optimized toward zero with better strategies or some engineering effort, so there is still huge room for speed optimization. We are not HPC experts, but if anyone in the industry wants to improve this, we believe it could be halved. Importantly, this part does not scale with model size, which makes it friendly to larger models.
I am going to share my understanding of, and future plans for, a general 3D perception foundation model without reservation. Please refer to Potential Research. If you find it useful or inspiring for your research, feel free to join me in building this blueprint.
Interpretive Articles: [CVer] [自动驾驶之心] [ReadPaper] [知乎] [CSDN] [TechBeat (将门创投)]
News
- [23-09-21] Code for nuScenes is released.
- [23-08-16] SOTA: Our single multi-modal UniTR outshines all other non-TTA approaches on the nuScenes detection benchmark (Aug 2023) with an NDS of 74.5.
- [23-08-16] SOTA performance of multi-modal 3D object detection and BEV map segmentation on the nuScenes validation set.
- [23-08-15] UniTR is released on arXiv.
- [23-07-13] UniTR is accepted at ICCV 2023.
Overview
TODO
- [x] Release the arXiv version.
- [x] SOTA performance of multi-modal 3D object detection (nuScenes) and BEV map segmentation (nuScenes).
- [x] Clean up and release the code for nuScenes.
- [ ] Merge UniTR into OpenPCDet.
Introduction
Jointly processing information from multiple sensors is crucial to achieving accurate and robust perception for reliable autonomous driving systems. However, current 3D perception research follows a modality-specific paradigm, leading to additional computation overhead and inefficient collaboration between different sensor data.
In this paper, we present an efficient multi-modal backbone for outdoor 3D perception, which processes a variety of modalities with unified modeling and shared parameters. It is a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks. It sets a new state of the art on the nuScenes benchmark, improving 3D object detection by +1.1 NDS and BEV map segmentation by +12.0 mIoU, with lower inference latency.
Main results
3D Object Detection (on nuScenes validation)

| Model | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE | ckpt | Log |
|---------|---------|--------|---------|---------|--------|---------|--------|--------|--------|
| UniTR | 73.0 | 70.1 | 26.3 | 24.7 | 26.8 | 24.6 | 17.9 | ckpt | Log |
| UniTR+LSS | 73.3 | 70.5 | 26.0 | 24.4 | 26.8 | 24.8 | 18.7 | ckpt | Log |
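To make the relation between the columns of the validation table concrete, the sketch below recomputes NDS for the UniTR row using the standard nuScenes definition, NDS = (5 · mAP + Σ (1 − min(1, mTP))) / 10, where the sum runs over the five true-positive error metrics. The function name `nds` is purely illustrative and is not part of this repository. Because the table entries are rounded, the recomputed score should only be expected to agree with the reported one up to rounding.

```python
def nds(m_ap, tp_errors):
    """nuScenes detection score from mAP and the five TP errors.

    All inputs are fractions (e.g. 0.701 for 70.1), not percentages.
    """
    # Each error is clipped at 1 and turned into a score in [0, 1].
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    # mAP carries a weight of 5, each of the five TP scores a weight of 1.
    return (5.0 * m_ap + sum(tp_scores)) / 10.0


# UniTR row of the validation table: mAP, then mATE, mASE, mAOE, mAVE, mAAE.
unitr_val = nds(0.701, [0.263, 0.247, 0.268, 0.246, 0.179])
print(f"Recomputed NDS for UniTR (val): {100 * unitr_val:.1f}")
```

The weighting means that mAP accounts for half of the score, so a one-point gain in mAP moves NDS by half a point.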
3D Object Detection (on nuScenes test)

| Model | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE |
|---------|---------|--------|---------|---------|--------|---------|--------|
| UniTR | 74.1 | 70.5 | 24.4 | 23.3 | 25.7 | 24.1 | 13.0 |
| UniTR+LSS | 74.5 | 70.9 | 24.1 | 22.9 | 25.6 | 24.0 | 13.1 |
BEV Map Segmentation (on nuScenes validation)

| Model | mIoU | Drivable | Ped.Cross. | Walkway | StopLine | Carpark | Divider | ckpt | Log |
|---------|---------|--------|---------|---------|--------|---------|--------|--------|--------|
| UniTR | 73.2 | 90.4 | 73.1 | 78.2 | 66.6 | 67.3 | 63.8 | ckpt | Log |
| UniTR+LSS | 74.7 | 90.7 | 74.0 | 79.3 | 68.2 | 72.9 | 64.2 | ckpt | Log |
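The mIoU column can be read as an average over the six map classes. The minimal sketch below assumes mIoU is the unweighted mean of the per-class IoUs (an assumption, since the evaluation code is not shown here) and applies it to the UniTR row; `mean_iou` is an illustrative helper, not a function from this repository.

```python
def mean_iou(per_class_iou):
    """Unweighted mean of per-class IoU values given in percent."""
    values = list(per_class_iou.values())
    return sum(values) / len(values)


# UniTR row of the BEV map segmentation table (values in percent).
unitr_map = {
    "Drivable": 90.4,
    "Ped.Cross.": 73.1,
    "Walkway": 78.2,
    "StopLine": 66.6,
    "Carpark": 67.3,
    "Divider": 63.8,
}
print(f"Recomputed mIoU for UniTR: {mean_iou(unitr_map):.1f}")
```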
What's new here?
Beats previous SOTAs of outdoor multi-modal 3D Object Detection and BEV Segmentation
Our approach achieves the best performance on multiple tasks (e.g., 3D Object Detection and BEV Map Segmentation), and it is highly versatile, requiring only the replacement of the backbone.
3D Object Detection
BEV Map Segmentation
Weight-Sharing among all modalities
We introduce a modality-agnostic transformer encoder to handle these view-discrepant sensor data for parallel modal-wise representation learning and automatic cross-modal interaction without additional fusion steps.
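As a rough illustration of what "weight sharing among modalities" means, the toy sketch below applies one set of attention weights to image tokens, to LiDAR tokens, and to their concatenation. It is written in plain Python so that it runs without any dependencies, and it is not the repository's implementation: the class `SharedSelfAttention`, the token shapes, and the random features are all assumptions made for the example, and details such as token partitioning, multi-head attention, and normalization are left out.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


class SharedSelfAttention:
    """Toy single-head self-attention whose weights are reused for every modality."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)

        def init():
            return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]

        self.dim = dim
        self.wq, self.wk, self.wv = init(), init(), init()

    def __call__(self, tokens):
        q, k, v = (matmul(tokens, w) for w in (self.wq, self.wk, self.wv))
        scale = math.sqrt(self.dim)
        scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
        attn = [softmax(row) for row in scores]
        return matmul(attn, v)


dim = 4
rng = random.Random(1)
image_tokens = [[rng.random() for _ in range(dim)] for _ in range(3)]  # hypothetical camera features
lidar_tokens = [[rng.random() for _ in range(dim)] for _ in range(5)]  # hypothetical voxel features

block = SharedSelfAttention(dim)
image_out = block(image_tokens)                 # modal-wise pass, same weights
lidar_out = block(lidar_tokens)                 # modal-wise pass, same weights
fused_out = block(image_tokens + lidar_tokens)  # joint pass: tokens attend across modalities
```

The point of the sketch is only that no modality-specific parameters or separate fusion module appear: the same `block` serves both the per-modality and the cross-modal passes.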
Prerequisite for 3D vision foundation models
A weight-shared unified multimodal encoder is a prerequisite for foundation models, especially in the context of 3D perception, unifying information from both images and LiDAR data. This is the first truly multimodal fusion backbone, seamlessly connecting to any 3D detection head.
Quick Start
Installation
conda create -n unitr python=3.8
# Install torch; we have only tested with PyTorch 1.10
pip install torch==1.10.1+cu113 torchvision==0.11.2+cu113 -f https://download.pytorch.org/whl/torch_stable.html
git clone https://github.com/Haiyang-W/UniTR
cd UniTR
# Install extra dependencies
pip install -r requirements.txt
# Install nuscenes-devkit
pip install nuscenes-devkit==1.0.5
# Develop
python setup.py develop
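Since only one PyTorch version has been tested, it can help to confirm that the active environment matches the pinned versions before building. The following is a minimal sketch using only the standard library; `PINNED` and `check_environment` are illustrative names, and the pins are copied from the installation commands above.

```python
from importlib import metadata

# Version pins taken from the installation commands above.
PINNED = {
    "torch": "1.10.1+cu113",
    "torchvision": "0.11.2+cu113",
    "nuscenes-devkit": "1.0.5",
}


def check_environment(pinned=PINNED):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_environment().items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every pinned package is installed at the expected version; other versions may still work but are untested.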
Dataset Preparation
OpenPCDet
├── data
│ ├── nuscenes
│ │ │── v1.0-trainval (or v1.0-mini if you use mini)
│ │ │ │── samples
│ │ │ │── sweeps
│ │ │ │── maps
│ │ │ │── v1.0-trainval
├── pcdet
├── tools
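Before generating the data infos, the directory layout above can be verified with a few lines of Python. This is a small sketch, not a script shipped with the repository: `missing_dirs` is a hypothetical helper, and it assumes the innermost metadata folder is named after the dataset version (e.g. `v1.0-mini` when the mini split is used).

```python
from pathlib import Path


def missing_dirs(repo_root, version="v1.0-trainval"):
    """List the expected nuScenes sub-directories that do not exist yet."""
    base = Path(repo_root) / "data" / "nuscenes" / version
    # Assumption: the metadata folder inside `base` carries the version name.
    required = ("samples", "sweeps", "maps", version)
    return [str(base / name) for name in required if not (base / name).is_dir()]


if __name__ == "__main__":
    for path in missing_dirs("."):
        print(f"missing: {path}")
```

Running it from the repository root prints one line per missing folder and nothing when the layout is complete.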
- (optional) To install the map expansion for the BEV map segmentation task, please download the files from Map expansion (Map expansion pack (v1.3)) and copy the files into your nuScenes maps folder, e.g. /data/nuscenes/v1.0-trainval/maps, as follows:
OpenPCDet
├── maps
│ ├── ......
│ ├── boston-seaport.json
│ ├── singapore-onenorth.json
│ ├── singapore-queenstown.json
│ ├── singapore-hollandvillage.json
- Generate the data infos by running the following command (it may take several hours):
```shell
# Create dataset info file, lidar and image gt database
python -m pcdet.datasets.nuscenes.nuscenes_dataset --func create_nuscenes_infos \
    --cfg_file tools/cfgs/dataset_configs/nuscenes_dataset.yaml \
    --version v1.0-trainval \
    --with_cam \
    --with_cam_gt
    # --share_memory # add this flag to use shared memory for lidar and image gt sampling (about 24G+143G or 12G+72G)
# Shared memory greatly improves training speed, but needs 150G or 75G of extra cache memory.
# NOTE: all the experiments used shared memory. Shared memory does not affect performance.
* The format of the generated data is as follows:
OpenPCDet
├── data
│ ├── nuscenes
│ │ │── v1.0-trainval (or v1.0-mini if you use mini)
│ │ │ │── samples
│ │ │ │── sweeps
│ │ │ │── maps
│ │ │ │── v1.0-trainval
│ │ │ │── img_gt_database_10sweeps_withvelo
│ │ │ │── gt_database_10sweeps_withvelo
│ │ │ │── nuscenes_10sweeps_withvelo_lidar.npy (optional) # if shared memory is enabled
│ │ │ │── nuscenes_10sweeps_withvelo_img.npy (optional) # if shared memory is enabled
│ │ │ │── nuscenes_infos_10sweeps_train.pkl
│ │ │ │── nuscenes_infos_10sweeps_val.pkl
│ │ │ │── nuscenes_dbinfos_10sweeps_withvelo.pkl
├── pcdet
├── tools
### Training
Please download the pretrained checkpoint from [unitr_pretrain.pth](https://drive.google.com/u/0/uc?id=1SJQRI4TAKuO2GwqJ4otzMo7qGGjlBQ9u&export=download) and copy the file under the root folder, e.g. `UniTR/unitr_pretrain.pth`. This file contains the weights of DSVT pretrained on the ImageNet and nuImages datasets.
3D object detection:
# multi-gpu training
# normal
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr.yaml --sync_bn --pretrained_model ../unitr_pretrain.pth --logger_iter_interval 1000
# add lss
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+lss.yaml --sync_bn --pretrained_model ../unitr_pretrain.pth --logger_iter_interval 1000
BEV Map Segmentation:
# multi-gpu training
# note that we don't use image pretraining in BEV Map Segmentation
# normal
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map.yaml --sync_bn --eval_map --logger_iter_interval 1000
# add lss
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map+lss.yaml --sync_bn --eval_map --logger_iter_interval 1000
### Testing
3D object detection:
# multi-gpu testing
# normal
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr.yaml --ckpt <CHECKPOINT_FILE>
# add LSS
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+lss.yaml --ckpt <CHECKPOINT_FILE>
BEV Map Segmentation:
# multi-gpu testing
# normal
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map.yaml --ckpt <CHECKPOINT_FILE> --eval_map
# add LSS
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map+lss.yaml --ckpt <CHECKPOINT_FILE> --eval_map
# NOTE: evaluation results will not be logged in *.log; they are only printed in the terminal
### Cache Testing
- If the camera and LiDAR parameters of the dataset you are using remain constant, then using our cache mode will not affect performance. You can even cache all mapping calculations during the training phase, which can significantly accelerate training.
- Each sample in nuScenes has `some variations in camera parameters`, so during normal inference we disable the cache mode to ensure accurate results. However, thanks to the robustness of our mapping, even with camera parameter variations such as those in nuScenes, performance drops only slightly (around 0.4 NDS).
- Cache mode currently supports only batch_size 1 per GPU (8 GPUs x 1 = 8 in total).
- In our observation, backbone caching reduces inference latency by 40%.
# Only for 3D Object Detection
# normal
# cache the mapping computation of multi-modal backbone
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_cache.yaml --ckpt <CHECKPOINT_FILE> --batch_size 8
# add LSS
# cache the mapping computation of multi-modal backbone
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+lss_cache.yaml --ckpt <CHECKPOINT_FILE> --batch_size 8
# add LSS
# cache the mapping computation of multi-modal backbone and LSS
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+lss_cache_plus.yaml --ckpt <CHECKPOINT_FILE> --batch_size 8
#### Performance of cache testing on nuScenes validation (some variations in camera parameters)
| Model | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE |
|---------|---------|--------|---------|---------|--------|---------|--------|
| [UniTR (Cache Backbone)](https://github.com/Haiyang-W/UniTR/blob/main/tools/cfgs/nuscenes_models/unitr_cache.yaml) | 72.6(-0.4) | 69.4(-0.7) | 26.9 | 24.8 | 26.3 | 24.6 | 18.2 |
| [UniTR+LSS (Cache Backbone)](https://github.com/Haiyang-W/UniTR/blob/main/tools/cfgs/nuscenes_models/unitr%2Blss_cache.yaml) | 73.1(-0.2) | 70.2(-0.3) | 25.8 | 24.4 | 26.0 | 25.3 | 18.2 |
| [UniTR+LSS (Cache Backbone and LSS)](https://github.com/Haiyang-W/UniTR/blob/main/tools/cfgs/nuscenes_models/unitr%2Blss_cache_plus.yaml) | 72.6(-0.7) | 69.3(-1.2) | 26.7 | 24.3 | 25.9 | 25.3 | 18.2 |
## Potential Research
* **Infrastructure of 3D Vision Foundation Model.**
An efficient network design is crucial for large models. With a reliable model structure, the development of large models can be advanced. An open question is how to make a general multimodal backbone more efficient and easier to deploy. Honestly, the partition in UniTR is slow and takes about 40% of the total time, but this overhead can be optimized toward zero with better `partition strategies` or `some engineering efforts`, so there is still huge room for speed optimization. We are not HPC experts, but if anyone in the industry wants to improve this, we believe it could be halved. Importantly, this part does not scale with model size, which makes it friendly to larger models.
* **Multi-Modal Self-supervised Learning based on Image-Lidar pair and UniTR.**
Please refer to the following figure. The images and point clouds both describe the same 3D scene; they complement each other through highly informative correspondences. This allows for the unsupervised learning of more generic scene representations with shared parameters.
* **Single-Modal Pretraining.** Our model is almost the same as ViT (except for some position embedding strategies). If we adjust the position embedding appropriately, DSVT and UniTR can directly load the pretrained parameters of ViT. This is beneficial for better integration with the 2D community.
* **Unified Modeling of 3D Vision.**
Please refer to the following figure.
<div align="center">
<img src="assets/Figure6.png" width="800"/>
</div>
## Possible Issues
* If you encounter gradients that become NaN during fp16 training: fp16 training is not supported.
* If you couldn't find a solution, search open and closed issues on our GitHub issues page [here](https://github.com/Haiyang-W/UniTR/issues).
* We enable the torch checkpoint option [here](https://github.com/Haiyang-W/UniTR/blob/3f75dc1a362fe8f325dabd2e878ac57df2ab7323/tools/cfgs/nuscenes_models/unitr.yaml#L125) by default in the training stage, which saves about 50% of CUDA memory.
* Samples in nuScenes have some variations in camera parameters. So, during training, every sample recalculates the camera-lidar mapping, which significantly slows down training (~40%). If the extrinsic parameters in your dataset are consistent, I recommend caching this computation during training.
* If you still have no luck, open a new issue on our GitHub. Our turnaround is usually a couple of days.
## Citation
Please consider citing our work as follows if it is helpful.
@inproceedings{wang2023unitr,
title={UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation},
author={Haiyang Wang and Hao Tang and Shaoshuai Shi and Aoxue Li and Zhenguo Li and Bernt Schiele and Liwei Wang},
booktitle={ICCV},
year={2023}
}
```
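The caching idea from the Cache Testing and Possible Issues notes (reuse the camera-LiDAR mapping when the calibration does not change) can be sketched as plain memoization. This is only an illustration under stated assumptions, not the repository's cache mode: `make_cached_mapping` and `toy_mapping` are hypothetical names, the calibration is represented as a flat tuple of numbers, and the "mapping" is a stand-in for the real projection. Unlike the cache configs above, this sketch recomputes whenever the calibration changes, so it models only the constant-calibration case.

```python
def make_cached_mapping(compute_mapping):
    """Wrap an expensive mapping function so each distinct calibration is computed once."""
    cache = {}

    def cached(calibration):
        key = tuple(calibration)  # flat, hashable view of the calibration values
        if key not in cache:
            cache[key] = compute_mapping(calibration)
        return cache[key]

    return cached


calls = []


def toy_mapping(calibration):
    """Stand-in for the real camera-LiDAR mapping; records how often it runs."""
    calls.append(calibration)
    return [2.0 * value for value in calibration]


mapping = make_cached_mapping(toy_mapping)
fixed_calibration = (1.0, 0.0, 0.5)  # hypothetical, constant across samples
for _ in range(3):
    result = mapping(fixed_calibration)

print(f"mapping computed {len(calls)} time(s) for 3 samples")
```

With a constant calibration the cached result is identical to a fresh computation, which is why caching does not change accuracy in that setting.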
Acknowledgments
UniTR uses code from a few open source repositories. Without the efforts of these folks (and their willingness to release their implementations), UniTR would not be possible. We thank these authors for their efforts!