Open-source address: https://modelscope.cn/models/AiQlearing/UniTR

UniTR: The First Unified Multi-modal Transformer Backbone for 3D Perception

This repo is the official implementation of the ICCV2023 paper: UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation, as well as the follow-ups. Our UniTR achieves state-of-the-art performance on the nuScenes Dataset with a real unified and weight-sharing multi-modal (e.g., Cameras and LiDARs) backbone. UniTR is built upon the codebase of DSVT; we have made every effort to ensure that the codebase is clean, concise, easily readable, state-of-the-art, and relies only on minimal dependencies.

UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation

Haiyang Wang, Hao Tang, Shaoshuai Shi $^\dagger$, Aoxue Li, Zhenguo Li, Bert Schiele, Liwei Wang $^\dagger$

Contact: Haiyang Wang (wanghaiyang6@stu.pku.edu.cn), Hao Tang (tanghao@stu.pku.edu.cn), Shaoshuai Shi (shaoshuaics@gmail.com)

Gratitude to Tang Hao for extensive code refactoring and noteworthy contributions to open-source initiatives. His invaluable efforts were pivotal in ensuring the seamless completion of UniTR.

Honestly, the partition in UniTR is slow and takes about 40% of the total time, but this can be optimized to zero with better strategies or some engineering effort, indicating that there is still huge room for speed optimization. We're not HPC experts, but if anyone in the industry wants to improve this, we believe it could be halved. Importantly, this part doesn't scale with model size, making it friendly for larger models.

I am going to share my understanding and future plan of a general 3D perception foundation model without reservation; please refer to Potential Research below. If you find it useful for your research or inspiring, feel free to join me in building this blueprint.

Interpretive Articles: [CVer] [自动驾驶之心] [ReadPaper] [知乎] [CSDN] [TechBeat (将门创投)]

## News

  • [23-09-21] Code of NuScenes is released.
  • [23-08-16] SOTA: Our single multi-modal UniTR outshines all other non-TTA approaches on the nuScenes Detection benchmark (Aug 2023) in terms of NDS 74.5.
  • [23-08-16] SOTA performance of multi-modal 3D object detection and BEV Map Segmentation on the NuScenes validation set.
  • [23-08-15] UniTR is released on arXiv.
  • [23-07-13] UniTR is accepted at ICCV 2023.

## Overview

## TODO

  • [x] Release the arXiv version.
  • [x] SOTA performance of multi-modal 3D object detection (NuScenes) and BEV Map Segmentation (NuScenes).
  • [x] Clean up and release the code of NuScenes.
  • [ ] Merge UniTR to OpenPCDet.

## Introduction

Jointly processing information from multiple sensors is crucial to achieving accurate and robust perception for reliable autonomous driving systems. However, current 3D perception research follows a modality-specific paradigm, leading to additional computation overheads and inefficient collaboration between different sensor data.

In this paper, we present an efficient multi-modal backbone for outdoor 3D perception, which processes a variety of modalities with unified modeling and shared parameters. It is a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks. It sets a new state-of-the-art performance on the nuScenes benchmark, achieving +1.1 NDS higher for 3D object detection and +12.0 mIoU higher for BEV map segmentation with lower inference latency.

## Main results

### 3D Object Detection (on NuScenes validation)

| Model | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE | ckpt | Log |
|-------|-----|-----|------|------|------|------|------|------|-----|
| UniTR | 73.0 | 70.1 | 26.3 | 24.7 | 26.8 | 24.6 | 17.9 | ckpt | Log |
| UniTR+LSS | 73.3 | 70.5 | 26.0 | 24.4 | 26.8 | 24.8 | 18.7 | ckpt | Log |

### 3D Object Detection (on NuScenes test)

| Model | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE |
|-------|-----|-----|------|------|------|------|------|
| UniTR | 74.1 | 70.5 | 24.4 | 23.3 | 25.7 | 24.1 | 13.0 |
| UniTR+LSS | 74.5 | 70.9 | 24.1 | 22.9 | 25.6 | 24.0 | 13.1 |

### BEV Map Segmentation (on NuScenes validation)

| Model | mIoU | Drivable | Ped. Cross. | Walkway | Stop Line | Carpark | Divider | ckpt | Log |
|-------|------|----------|-------------|---------|-----------|---------|---------|------|-----|
| UniTR | 73.2 | 90.4 | 73.1 | 78.2 | 66.6 | 67.3 | 63.8 | ckpt | Log |
| UniTR+LSS | 74.7 | 90.7 | 74.0 | 79.3 | 68.2 | 72.9 | 64.2 | ckpt | Log |

## What's new here?

### Beats previous SOTAs of outdoor multi-modal 3D Object Detection and BEV Segmentation

Our approach has achieved the best performance on multiple tasks (e.g., 3D Object Detection and BEV Map Segmentation), and it is highly versatile, requiring only the replacement of the backbone.

3D Object Detection
BEV Map Segmentation

### Weight-Sharing among all modalities

We introduce a modality-agnostic transformer encoder to handle these view-discrepant sensor data for parallel modal-wise representation learning and automatic cross-modal interaction without additional fusion steps.
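
To make the weight-sharing idea concrete, here is a minimal PyTorch sketch (illustrative only, not the actual UniTR code): camera tokens and LiDAR tokens pass through the same transformer blocks, and cross-modal interaction falls out of attention over the concatenated token sequence. All module names, dimensions, and the simple linear tokenizers are assumptions made for this example.

```python
# Minimal sketch of a weight-shared, modality-agnostic encoder (illustrative only,
# not the actual UniTR implementation). Names and sizes are assumptions.
import torch
import torch.nn as nn

class SharedMultiModalEncoder(nn.Module):
    def __init__(self, embed_dim=256, num_heads=8, num_layers=4):
        super().__init__()
        # Modality-specific tokenizers project raw inputs into a common token space.
        self.camera_proj = nn.Linear(3 * 16 * 16, embed_dim)  # e.g. flattened image patches
        self.lidar_proj = nn.Linear(64, embed_dim)            # e.g. pooled voxel features
        # One transformer encoder whose weights are shared by both modalities.
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, image_patches, voxel_feats):
        cam_tokens = self.camera_proj(image_patches)  # (B, N_img, C)
        pts_tokens = self.lidar_proj(voxel_feats)     # (B, N_pts, C)
        # Concatenating tokens lets intra- and cross-modal attention happen inside the
        # same shared blocks, without a separate fusion module.
        tokens = torch.cat([cam_tokens, pts_tokens], dim=1)
        return self.encoder(tokens)

# Example: 2 samples, 100 image-patch tokens and 500 voxel tokens.
model = SharedMultiModalEncoder()
out = model(torch.randn(2, 100, 3 * 16 * 16), torch.randn(2, 500, 64))
print(out.shape)  # torch.Size([2, 600, 256])
```

The point of the sketch is only that a single set of encoder weights serves both modalities; UniTR additionally uses specialized set partitioning and position encodings that are omitted here.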

### Prerequisite for 3D vision foundation models

A weight-shared unified multimodal encoder is a prerequisite for foundation models, especially in the context of 3D perception, unifying information from both images and LiDAR data. This is the first truly multimodal fusion backbone, seamlessly connecting to any 3D detection head.

## Quick Start

### Installation

```shell
conda create -n unitr python=3.8
# Install torch, we only test it in pytorch 1.10
pip install torch==1.10.1+cu113 torchvision==0.11.2+cu113 -f https://download.pytorch.org/whl/torch_stable.html

git clone https://github.com/Haiyang-W/UniTR
cd UniTR

# Install extra dependency
pip install -r requirements.txt

# Install nuscenes-devkit
pip install nuscenes-devkit==1.0.5

# Develop
python setup.py develop
```
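
After installation, a quick sanity check along the following lines can confirm that the CUDA build of PyTorch and the compiled `pcdet` package are importable. This is not part of the official instructions, and `pcdet.__version__` is assumed to exist as in OpenPCDet:

```python
# Minimal environment sanity check (not part of the official instructions).
import torch
import pcdet  # installed above via `python setup.py develop`

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("pcdet:", getattr(pcdet, "__version__", "unknown"))
```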

### Dataset Preparation

```
OpenPCDet
├── data
│   ├── nuscenes
│   │   │── v1.0-trainval (or v1.0-mini if you use mini)
│   │   │   │── samples
│   │   │   │── sweeps
│   │   │   │── maps
│   │   │   │── v1.0-trainval
├── pcdet
├── tools
```
  • (optional) To install the Map expansion for the bev map segmentation task, please download the files from Map expansion (Map expansion pack (v1.3)) and copy the files into your nuScenes maps folder, e.g. /data/nuscenes/v1.0-trainval/maps as follows:

```
OpenPCDet
├── maps
│   ├── ......
│   ├── boston-seaport.json
│   ├── singapore-onenorth.json
│   ├── singapore-queenstown.json
│   ├── singapore-hollandvillage.json
```
  • Generate the data infos by running the following command (it may take several hours):

```python
# Create dataset info file, lidar and image gt database
python -m pcdet.datasets.nuscenes.nuscenes_dataset --func create_nuscenes_infos \
    --cfg_file tools/cfgs/dataset_configs/nuscenes_dataset.yaml \
    --version v1.0-trainval \
    --with_cam \
    --with_cam_gt
    # --share_memory  # if use share mem for lidar and image gt sampling (about 24G+143G or 12G+72G)

# share mem will greatly improve your training speed, but needs 150G or 75G extra cache mem
# NOTE: all the experiments used share memory. Share mem will not affect performance
```

  • The format of the generated data is as follows:

```
OpenPCDet
├── data
│   ├── nuscenes
│   │   │── v1.0-trainval (or v1.0-mini if you use mini)
│   │   │   │── samples
│   │   │   │── sweeps
│   │   │   │── maps
│   │   │   │── v1.0-trainval
│   │   │   │── img_gt_database_10sweeps_withvelo
│   │   │   │── gt_database_10sweeps_withvelo
│   │   │   │── nuscenes_10sweeps_withvelo_lidar.npy (optional) # if open share mem
│   │   │   │── nuscenes_10sweeps_withvelo_img.npy (optional) # if open share mem
│   │   │   │── nuscenes_infos_10sweeps_train.pkl
│   │   │   │── nuscenes_infos_10sweeps_val.pkl
│   │   │   │── nuscenes_dbinfos_10sweeps_withvelo.pkl
├── pcdet
├── tools
```

### Training

Please download the pretrained checkpoint from [unitr_pretrain.pth](https://drive.google.com/u/0/uc?id=1SJQRI4TAKuO2GwqJ4otzMo7qGGjlBQ9u&export=download) and copy the file under the root folder, e.g. `UniTR/unitr_pretrain.pth`. This file is the weight of pretraining DSVT on the ImageNet and NuImages datasets.

3D object detection:

```shell
# multi-gpu training
# normal
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr.yaml --sync_bn --pretrained_model ../unitr_pretrain.pth --logger_iter_interval 1000

# add lss
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+lss.yaml --sync_bn --pretrained_model ../unitr_pretrain.pth --logger_iter_interval 1000
```

BEV Map Segmentation:

```shell
# multi-gpu training
# note that we don't use image pretrain in BEV Map Segmentation
# normal
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map.yaml --sync_bn --eval_map --logger_iter_interval 1000

# add lss
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map+lss.yaml --sync_bn --eval_map --logger_iter_interval 1000
```

### Testing

3D object detection:

```shell
# multi-gpu testing
# normal
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr.yaml --ckpt <checkpoint_path>

# add LSS
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+lss.yaml --ckpt <checkpoint_path>
```

BEV Map Segmentation:

```shell
# multi-gpu testing
# normal
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map.yaml --ckpt <checkpoint_path> --eval_map

# add LSS
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map+lss.yaml --ckpt <checkpoint_path> --eval_map
```

NOTE: evaluation results will not be logged in *.log, only printed in the terminal.

### Cache Testing

- If the camera and LiDAR parameters of the dataset you are using remain constant, then using our cache mode will not affect performance. You can even cache all mapping calculations during the training phase, which can significantly accelerate your training speed (a rough sketch of the caching idea follows this list).
- Each sample in NuScenes will `have some variations in camera parameters`, and during normal inference we disable the cache mode to ensure result accuracy. However, due to the robustness of our mapping, even in scenarios with camera parameter variations like NuScenes, the performance only drops slightly (around 0.4 NDS).
- Cache mode only supports batch_size 1 now, 8x1=8.
- Backbone caching reduces inference latency by about 40% in our observation.
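
Conceptually, the cache works because the image-to-LiDAR mapping depends only on the sensor calibration: if the intrinsics and extrinsics do not change between samples, the expensive mapping/partition computation can be done once and reused. The sketch below illustrates that idea only; the function names and cache-key layout are assumptions, not the repo's actual API.

```python
# Illustrative sketch of caching the camera-LiDAR mapping when calibration is fixed.
# `compute_image_to_lidar_mapping` and the cache-key layout are assumptions.
import hashlib
import numpy as np

_mapping_cache = {}

def compute_image_to_lidar_mapping(intrinsics, extrinsics):
    # Placeholder for the (expensive) projection / set-partition computation.
    return np.einsum('ij,jk->ik', intrinsics, extrinsics[:3, :])

def get_mapping(intrinsics, extrinsics):
    # Key the cache on the calibration; identical calibration -> reuse the result.
    key = hashlib.md5(
        np.concatenate([intrinsics.ravel(), extrinsics.ravel()]).tobytes()
    ).hexdigest()
    if key not in _mapping_cache:
        _mapping_cache[key] = compute_image_to_lidar_mapping(intrinsics, extrinsics)
    return _mapping_cache[key]

# With fixed calibration, every call after the first is a cheap dictionary lookup.
K, T = np.eye(3), np.eye(4)
assert get_mapping(K, T) is get_mapping(K, T)
```

In the repo this behaviour is enabled through the `*_cache.yaml` configs used in the commands below.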

```shell
# Only for 3D Object Detection
# normal
# cache the mapping computation of multi-modal backbone
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_cache.yaml --ckpt <checkpoint_path> --batch_size 8

# add LSS
# cache the mapping computation of multi-modal backbone
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+lss_cache.yaml --ckpt <checkpoint_path> --batch_size 8

# add LSS
# cache the mapping computation of multi-modal backbone and LSS
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+lss_cache_plus.yaml --ckpt <checkpoint_path> --batch_size 8
```

#### Performance of cache testing on NuScenes validation (some variations in camera parameters)
|  Model  | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE |
|---------|---------|--------|---------|---------|--------|---------|--------|
|  [UniTR (Cache Backbone)](https://github.com/Haiyang-W/UniTR/blob/main/tools/cfgs/nuscenes_models/unitr_cache.yaml) | 72.6(-0.4) | 69.4(-0.7) | 26.9 | 24.8 | 26.3 | 24.6 | 18.2 |
|  [UniTR+LSS (Cache Backbone)](https://github.com/Haiyang-W/UniTR/blob/main/tools/cfgs/nuscenes_models/unitr%2Blss_cache.yaml) | 73.1(-0.2) | 70.2(-0.3) | 25.8 | 24.4 | 26.0 | 25.3 | 18.2 |
|  [UniTR+LSS (Cache Backbone and LSS)](https://github.com/Haiyang-W/UniTR/blob/main/tools/cfgs/nuscenes_models/unitr%2Blss_cache_plus.yaml) | 72.6(-0.7) | 69.3(-1.2) | 26.7 | 24.3 | 25.9 | 25.3 | 18.2 |

## Potential Research
* **Infrastructure of 3D Vision Foundation Model.**
  An efficient network design is crucial for large models. With a reliable model structure, the development of large models can be advanced. How can a general multimodal backbone be made more efficient and easier to deploy? Honestly, the partition in UniTR is slow and takes about 40% of the total time, but this can be optimized to zero with better `partition strategies` or `some engineering efforts`, indicating that there is still huge room for speed optimization. We're not HPC experts, but if anyone in the industry wants to improve this, we believe it could be halved. Importantly, this part doesn't scale with model size, making it friendly for larger models.
* **Multi-Modal Self-supervised Learning based on Image-LiDAR pairs and UniTR.**
  Please refer to the following figure. The images and point clouds both describe the same 3D scene; they complement each other in terms of highly informative correspondence. This allows for the unsupervised learning of a more generic scene representation with shared parameters.
* **Single-Modal Pretraining.** Our model is almost the same as ViT (except for some position embedding strategies). If we adjust the position embedding appropriately, DSVT and UniTR can directly load the pretrained parameters of ViT. This is beneficial for better integration with the 2D community (a minimal loading sketch is given after the figure below).
* **Unified Modeling of 3D Vision.**
  Please refer to the following figure.
<div align="center">
  <img src="assets/Figure6.png" width="800"/>
</div>
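
As a rough illustration of the Single-Modal Pretraining point above, the following sketch shows how ViT weights could be copied into a matching backbone while skipping the position embeddings. It is a hypothetical helper written for this README, under the assumption that the remaining parameter names and shapes line up; it is not code from this repository.

```python
# Illustrative sketch: initialize a transformer backbone from ViT weights while skipping
# position embeddings. The "pos_embed" key name follows common ViT conventions and is an
# assumption, not UniTR's actual parameter naming.
import torch

def load_vit_weights(model, vit_state_dict):
    own_state = model.state_dict()
    loaded, skipped = [], []
    for name, param in vit_state_dict.items():
        if "pos_embed" in name:
            skipped.append(name)  # position embeddings need task-specific handling
            continue
        if name in own_state and own_state[name].shape == param.shape:
            own_state[name].copy_(param)
            loaded.append(name)
        else:
            skipped.append(name)
    model.load_state_dict(own_state)
    print(f"loaded {len(loaded)} tensors, skipped {len(skipped)}")
    return model

# usage (hypothetical): load_vit_weights(backbone, torch.load("vit_base.pth"))
```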

## Possible Issues
* If you encounter a gradient that becomes NaN during fp16 training, this is not supported yet.
* If you couldn't find a solution, search the open and closed issues on our GitHub issues page [here](https://github.com/Haiyang-W/UniTR/issues).
* We provide the torch checkpointing option [here](https://github.com/Haiyang-W/UniTR/blob/3f75dc1a362fe8f325dabd2e878ac57df2ab7323/tools/cfgs/nuscenes_models/unitr.yaml#L125) in the training stage by default, saving about 50% of CUDA memory (a generic illustration follows this list).
* Samples in NuScenes have some variations in camera parameters. So, during training, every sample recalculates the camera-lidar mapping, which significantly slows down the training speed (~40%). If the extrinsic parameters in your dataset are consistent, I recommend caching this computation during training.
* If still no luck, open a new issue on our GitHub. Our turnaround is usually a couple of days.
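
For reference, the memory-saving mechanism mentioned in the third bullet is activation (gradient) checkpointing. Below is a generic PyTorch example of the technique; it is not the repository's code, and the layer sizes are arbitrary illustrations.

```python
# Generic example of activation (gradient) checkpointing in PyTorch (not UniTR code).
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(512, 512), nn.GELU()) for _ in range(8)]
        )

    def forward(self, x):
        for block in self.blocks:
            # Activations inside `block` are not stored; they are recomputed during the
            # backward pass, trading extra compute for lower peak CUDA memory.
            x = checkpoint(block, x)
        return x

model = CheckpointedMLP()
out = model(torch.randn(4, 512, requires_grad=True))
out.sum().backward()
```

Checkpointed blocks recompute their forward pass during backward, which is why peak activation memory drops at the cost of some extra compute.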

## Citation
Please consider citing our work as follows if it is helpful.

```
@inproceedings{wang2023unitr,
    title={UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation},
    author={Haiyang Wang, Hao Tang, Shaoshuai Shi, Aoxue Li, Zhenguo Li, Bert Schiele, Liwei Wang},
    booktitle={ICCV},
    year={2023}
}
```

## Acknowledgments

UniTR uses code from a few open source repositories. Without the efforts of these folks (and their willingness to release their implementations), UniTR would not be possible. We thank these authors for their efforts!
