mmdet

Latest version: v3.3.0

Users can also run evaluation with the test script we provide. Here is a basic example:

```shell
# 1 gpu
python tools/test.py configs/glip/glip_atss_swin-t_fpn_dyhead_pretrain_obj365.py glip_tiny_a_mmdet-b3654169.pth

# 8 GPU
./tools/dist_test.sh configs/glip/glip_atss_swin-t_fpn_dyhead_pretrain_obj365.py glip_tiny_a_mmdet-b3654169.pth 8
```


The result will be similar to this:

```shell
Average Precision (AP) [ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.428
Average Precision (AP) [ IoU=0.50 | area= all | maxDets=1000 ] = 0.594
Average Precision (AP) [ IoU=0.75 | area= all | maxDets=1000 ] = 0.466
Average Precision (AP) [ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.300
Average Precision (AP) [ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.477
Average Precision (AP) [ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.534
Average Recall (AR) [ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.634
Average Recall (AR) [ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.634
Average Recall (AR) [ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.634
Average Recall (AR) [ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.473
Average Recall (AR) [ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.690
Average Recall (AR) [ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.789
```


XDecoder

<div align=center>
<img src="https://github.com/open-mmlab/mmdetection/assets/17425982/cb126615-9402-4c19-8ea9-133722d7519c" width="70%"/>
</div>

Installation

```shell
# if source
pip install -r requirements/multimodal.txt

# if wheel
mim install mmdet[multimodal]
```
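
To quickly confirm the install worked, you can check that `mmdet` imports and prints its version:

```shell
python -c "import mmdet; print(mmdet.__version__)"
```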


How to use it?

For convenience, you can download the weights to the `mmdetection` root dir:

```shell
wget https://download.openmmlab.com/mmdetection/v3.0/xdecoder/xdecoder_focalt_last_novg.pt
wget https://download.openmmlab.com/mmdetection/v3.0/xdecoder/xdecoder_focalt_best_openseg.pt
```


The above two weights are copied directly from the official repository without any modification. The specific source is https://github.com/microsoft/X-Decoder.

For convenience of demonstration, please download [the folder](https://github.com/microsoft/X-Decoder/tree/main/images) and place it in the root directory of mmdetection.

**(1) Open Vocabulary Semantic Segmentation**

```shell
cd projects/XDecoder
python demo.py ../../images/animals.png configs/xdecoder-tiny_zeroshot_open-vocab-semseg_coco.py --weights ../../xdecoder_focalt_last_novg.pt --texts zebra.giraffe
```


<div align=center>
<img src="https://github.com/open-mmlab/mmdetection/assets/17425982/c397c0ed-859a-4004-8725-78a591742bc8" width="70%"/>
</div>

**(2) Open Vocabulary Instance Segmentation**

```shell
cd projects/XDecoder
python demo.py ../../images/owls.jpeg configs/xdecoder-tiny_zeroshot_open-vocab-instance_coco.py --weights ../../xdecoder_focalt_last_novg.pt --texts owl
```


<div align=center>
<img src="https://github.com/open-mmlab/mmdetection/assets/17425982/494b0b1c-4a42-4019-97ae-d33ee68af3d2" width="70%"/>
</div>

**(3) Open Vocabulary Panoptic Segmentation**

```shell
cd projects/XDecoder
python demo.py ../../images/street.jpg configs/xdecoder-tiny_zeroshot_open-vocab-panoptic_coco.py --weights ../../xdecoder_focalt_last_novg.pt --text car.person --stuff-text tree.sky
```


<div align=center>
<img src="https://github.com/open-mmlab/mmdetection/assets/17425982/9ad1e0f4-75ce-4e37-a5cc-83e0e8a722ed" width="70%"/>
</div>

**(4) Referring Expression Segmentation**

```shell
cd projects/XDecoder
python demo.py ../../images/fruit.jpg configs/xdecoder-tiny_zeroshot_open-vocab-ref-seg_refcocog.py --weights ../../xdecoder_focalt_last_novg.pt --text "The larger watermelon. The front white flower. White tea pot."
```


<div align=center>
<img src="https://github.com/open-mmlab/mmdetection/assets/17425982/f3ecdb50-20f0-4dc4-aa9c-90995ae04893" width="70%"/>
</div>

**(5) Image Caption**

```shell
cd projects/XDecoder
python demo.py ../../images/penguin.jpeg configs/xdecoder-tiny_zeroshot_caption_coco2014.py --weights ../../xdecoder_focalt_last_novg.pt
```


<div align=center>
<img src="https://github.com/open-mmlab/mmdetection/assets/17425982/7690ab79-791e-4011-ab0c-01f46c4a3d80" width="70%"/>
</div>

**(6) Referring Expression Image Caption**

```shell
cd projects/XDecoder
python demo.py ../../images/fruit.jpg configs/xdecoder-tiny_zeroshot_ref-caption.py --weights ../../xdecoder_focalt_last_novg.pt --text 'White tea pot'
```


<div align=center>
<img src="https://github.com/open-mmlab/mmdetection/assets/17425982/bae2fdba-0172-4fc8-8ad1-73b54c64ec30" width="70%"/>
</div>

**(7) Text Image Region Retrieval**

```shell
cd projects/XDecoder
python demo.py ../../images/coco configs/xdecoder-tiny_zeroshot_text-image-retrieval.py --weights ../../xdecoder_focalt_last_novg.pt --text 'pizza on the plate'
```


```text
The image that best matches the given text is ../../images/coco/000.jpg and probability is 0.998
```


<div align=center>
<img src="https://github.com/open-mmlab/mmdetection/assets/17425982/479de6b2-88e7-41f0-8228-4b9a48f52954" width="70%"/>
</div>

We have also prepared a gradio program in the `projects/gradio_demo` directory, which lets you run all the inference tasks supported by MMDetection interactively in your browser.
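
A minimal way to try it locally is sketched below; the launcher script name is an assumption, so check `projects/gradio_demo/README.md` for the exact entry point.

```shell
pip install gradio
# launch.py is assumed here; see projects/gradio_demo/README.md for the actual entry point.
python projects/gradio_demo/launch.py
```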

Models and results

Semantic segmentation on ADE20K

Prepare your dataset according to the [docs](../../docs/en/user_guides/dataset_prepare.md#ade20k-2016-dataset-preparation).

**Test Command**

Since semantic segmentation is a pixel-level task, we don't need to use a threshold to filter out low-confidence predictions. So we set `model.test_cfg.use_thr_for_mc=False` in the test command.

```shell
./tools/dist_test.sh projects/XDecoder/configs/xdecoder-tiny_zeroshot_open-vocab-semseg_ade20k.py xdecoder_focalt_best_openseg.pt 8 --cfg-options model.test_cfg.use_thr_for_mc=False
```


| Model | mIoU | mIoU (official) | Config |
| :-------------------------------- | :---: | :-------------: | :------------------------------------------------------------------: |
| `xdecoder_focalt_best_openseg.pt` | 25.24 | 25.13 | [config](configs/xdecoder-tiny_zeroshot_open-vocab-semseg_ade20k.py) |

Instance segmentation on ADE20K

Prepare your dataset according to the [docs](../../docs/en/user_guides/dataset_prepare.md#ade20k-2016-dataset-preparation).

```shell
./tools/dist_test.sh projects/XDecoder/configs/xdecoder-tiny_zeroshot_open-vocab-instance_ade20k.py xdecoder_focalt_best_openseg.pt 8
```


| Model | mIoU | mIoU (official) | Config |
| :-------------------------------- | :--: | :-------------: | :--------------------------------------------------------------------: |
| `xdecoder_focalt_best_openseg.pt` | 10.1 | 10.1 | [config](configs/xdecoder-tiny_zeroshot_open-vocab-instance_ade20k.py) |

Panoptic segmentation on ADE20K

Prepare your dataset according to the [docs](../../docs/en/user_guides/dataset_prepare.md#ade20k-2016-dataset-preparation).

```shell
./tools/dist_test.sh projects/XDecoder/configs/xdecoder-tiny_zeroshot_open-vocab-panoptic_ade20k.py xdecoder_focalt_best_openseg.pt 8
```


| Model | mIoU | mIoU (official) | Config |
| :-------------------------------- | :---: | :-------------: | :--------------------------------------------------------------------: |
| `xdecoder_focalt_best_openseg.pt` | 19.11 | 18.97 | [config](configs/xdecoder-tiny_zeroshot_open-vocab-panoptic_ade20k.py) |

Semantic segmentation on COCO2017

Prepare your dataset according to the [docs](../../docs/en/user_guides/dataset_prepare.md#coco-semantic-dataset-preparation), following the `(2) use panoptic dataset` part.

```shell
./tools/dist_test.sh projects/XDecoder/configs/xdecoder-tiny_zeroshot_open-vocab-semseg_coco.py xdecoder_focalt_last_novg.pt 8 --cfg-options model.test_cfg.use_thr_for_mc=False
```


| Model | mIoU | mIoU (official) | Config |
| :---------------------------------------------- | :--: | :-------------: | :----------------------------------------------------------------: |
| `xdecoder-tiny_zeroshot_open-vocab-semseg_coco` | 62.1 | 62.1 | [config](configs/xdecoder-tiny_zeroshot_open-vocab-semseg_coco.py) |

Instance segmentation on COCO2017

Prepare your dataset according to the [docs](../../docs/en/user_guides/dataset_prepare.md#basic-detection-dataset-preparation).

```shell
./tools/dist_test.sh projects/XDecoder/configs/xdecoder-tiny_zeroshot_open-vocab-instance_coco.py xdecoder_focalt_last_novg.pt 8
```


| Model | Mask mAP | Mask mAP(official) | Config |
| :------------------------------------------------ | :------: | :----------------: | :------------------------------------------------------------------: |
| `xdecoder-tiny_zeroshot_open-vocab-instance_coco` | 39.8 | 39.7 | [config](configs/xdecoder-tiny_zeroshot_open-vocab-instance_coco.py) |

Panoptic segmentation on COCO2017

Prepare your dataset according to the [docs](../../docs/en/user_guides/dataset_prepare.md#basic-detection-dataset-preparation).

```shell
./tools/dist_test.sh projects/XDecoder/configs/xdecoder-tiny_zeroshot_open-vocab-panoptic_coco.py xdecoder_focalt_last_novg.pt 8
```


| Model | PQ | PQ(official) | Config |
| :------------------------------------------------ | :---: | :----------: | :------------------------------------------------------------------: |
| `xdecoder-tiny_zeroshot_open-vocab-panoptic_coco` | 51.42 | 51.16 | [config](configs/xdecoder-tiny_zeroshot_open-vocab-panoptic_coco.py) |

Referring segmentation on RefCOCO

Prepare your dataset according to the [docs](../../docs/en/user_guides/dataset_prepare.md#refcoco-dataset-preparation).

```shell
./tools/dist_test.sh projects/XDecoder/configs/xdecoder-tiny_zeroshot_open-vocab-ref-seg_refcocog.py xdecoder_focalt_last_novg.pt 8 --cfg-options test_dataloader.dataset.split='val'
```


| Model | text mode | cIoU | cIoU (official) | Config |
| :----------------------------- | :----------: | :-----: | :-------------: | :---------------------------------------------------------------------: |
| `xdecoder_focalt_last_novg.pt` | select first | 58.8415 | 57.85 | [config](configs/xdecoder-tiny_zeroshot_open-vocab-ref-seg_refcocog.py) |
| `xdecoder_focalt_last_novg.pt` | original | 60.0321 | - | [config](configs/xdecoder-tiny_zeroshot_open-vocab-ref-seg_refcocog.py) |
| `xdecoder_focalt_last_novg.pt` | concat | 60.3551 | - | [config](configs/xdecoder-tiny_zeroshot_open-vocab-ref-seg_refcocog.py) |

**Note:**

1. If you set the scale of `Resize` to (1024, 512), the result will be `57.69`.
2. `text mode` is the `RefCoCoDataset` parameter in MMDetection that determines which texts are loaded into the data list. It can be set to `select_first`, `original`, `concat`, or `random`; an override example follows this list.
   - `select_first`: select the first text in the text list as the description of an instance.
   - `original`: use all texts in the text list as descriptions of an instance.
   - `concat`: concatenate all texts in the text list into one description of an instance.
   - `random`: randomly select one text in the text list as the description of an instance; usually used for training.
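
As a sketch, assuming the RefCOCOg test dataset in this config exposes `text_mode` directly (an assumption; check the config for the exact key), the mode can be switched at test time via `--cfg-options`:

```shell
# text_mode is assumed to live at test_dataloader.dataset.text_mode; verify against the config.
./tools/dist_test.sh projects/XDecoder/configs/xdecoder-tiny_zeroshot_open-vocab-ref-seg_refcocog.py xdecoder_focalt_last_novg.pt 8 \
    --cfg-options test_dataloader.dataset.split='val' test_dataloader.dataset.text_mode='concat'
```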

Image Caption on COCO2014

Prepare your dataset according to the [docs](../../docs/en/user_guides/dataset_prepare.md#coco-caption-dataset-preparation).

Before testing, you need to install JDK 1.8; otherwise the evaluation will fail with a prompt that `java` does not exist.
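
For example, on Ubuntu a 1.8 JDK can be installed and verified roughly like this (package names differ across platforms):

```shell
# Ubuntu/Debian example; any JDK 1.8 build works.
sudo apt-get install openjdk-8-jdk
java -version   # should report a 1.8.x version
```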


```shell
./tools/dist_test.sh projects/XDecoder/configs/xdecoder-tiny_zeroshot_caption_coco2014.py xdecoder_focalt_last_novg.pt 8
```


| Model | BLEU-4 | CIDEr | Config |
| :---------------------------------------- | :----: | :----: | :----------------------------------------------------------: |
| `xdecoder-tiny_zeroshot_caption_coco2014` | 35.26 | 116.81 | [config](configs/xdecoder-tiny_zeroshot_caption_coco2014.py) |


Gradio Demo

<div align="center">
<img src="https://github.com/open-mmlab/mmdetection/assets/17425982/6c29886f-ae7a-4a55-8be4-352ee85b7d3e"/>
</div>

Please refer to https://github.com/open-mmlab/mmdetection/blob/dev-3.x/projects/gradio_demo/README.md for details.

Contributors

A total of 30 developers contributed to this release.

Thanks jjjkkkjjj, lovelykite, minato-ellie, freepoet, wufan-tb, yalibian, keyakiluo, gihanjayatilaka, i-aki-y, xin-li-67, RangeKing, JingweiZhang12, MambaWong, lucianovk, tall-josh, xiuqhou, jamiechoi1995, YQisme, yechenzhi, bjzhb666, xiexinch, jamiechoi1995, yarkable, Renzhihan, nijkah, amaizr, Lum1104, zwhus, Czm369, hhaAndroid

3.3.0

MM Grounding DINO

[An Open and Comprehensive Pipeline for Unified Object Grounding and Detection](https://arxiv.org/abs/2401.02361)

Grounding-DINO is a state-of-the-art open-set detection model that tackles multiple vision tasks including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC). Its effectiveness has led to its widespread adoption as a mainstream architecture for various downstream applications. However, despite its significance, the original Grounding-DINO model lacks comprehensive public technical details due to the unavailability of its training code. To bridge this gap, we present MM-Grounding-DINO, an open-source, comprehensive, and user-friendly baseline, which is built with the MMDetection toolbox. It adopts abundant vision datasets for pre-training and various detection and grounding datasets for fine-tuning. We give a comprehensive analysis of each reported result and detailed settings for reproduction. The extensive experiments on the benchmarks mentioned demonstrate that our MM-Grounding-DINO-Tiny outperforms the Grounding-DINO-Tiny baseline. We release all our models to the research community.

<div align=center>
<img src="https://github.com/open-mmlab/mmdetection/assets/17425982/4214e282-a553-4abf-b8a4-84ea566851c9"/>
</div>

<div align=center>
<img src="https://github.com/open-mmlab/mmdetection/assets/17425982/fb14d1ee-5469-44d2-b865-aac9850c429c"/>
</div>

Details: https://github.com/open-mmlab/mmdetection/tree/main/configs/mm_grounding_dino
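
As a rough sketch, inference with MM Grounding DINO follows the same `demo/image_demo.py` pattern as GLIP below; the config and weight names in this example are placeholders, so pick the actual files from the linked `configs/mm_grounding_dino` directory.

```shell
# Config and checkpoint names are placeholders; substitute real files from configs/mm_grounding_dino/.
python demo/image_demo.py demo/demo.jpg \
    configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py \
    --weights mm_grounding_dino_swin-t.pth \
    --texts 'bench . car .'
```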

3.2.0

**1. A comprehensive collection of SOTA detection Transformer models**
(1) Supports four newer and stronger SOTA Transformer models: [DDQ](configs/ddq/README.md), [CO-DETR](projects/CO-DETR/README.md), [AlignDETR](projects/AlignDETR/README.md), and [H-DINO](projects/HDINO/README.md).
(2) Based on CO-DETR, MMDet released a model that reaches 64.1 mAP on COCO.
(3) Algorithms such as DINO support AMP, gradient checkpointing, and FrozenBN, which can effectively reduce GPU memory usage; see the example below.
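
For instance, mixed-precision training can be enabled from the command line with the `--amp` flag of the training scripts (the DINO config shown is one of the configs shipped with MMDetection; substitute your own):

```shell
# Single-GPU DINO training with automatic mixed precision to reduce GPU memory.
python tools/train.py configs/dino/dino-4scale_r50_8xb2-12e_coco.py --amp
```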

**2. [Provides a comprehensive performance comparison between CNN and Transformer models](projects/RF100-Benchmark/README_zh-CN.md)**
RF100 consists of 100 real-world datasets covering 7 domains. It can be used to assess the performance gap between Transformer models such as DINO and CNN-based algorithms across different scenarios and data scales, and users can use this benchmark to quickly check the robustness of their own algorithms in different settings.

<div align=center>
<img src="https://github.com/open-mmlab/mmdetection/assets/17425982/86420903-36a8-410d-9251-4304b9704f7d"/>
</div>

**3. Supports fine-tuning of [GLIP](configs/glip/README.md) and [Grounding DINO](configs/grounding_dino/README.md); the only codebase that supports Grounding DINO fine-tuning**
The Grounding DINO in MMDet is the only implementation that supports fine-tuning, and it outperforms the official results by about 1 mAP; GLIP is also higher than the official numbers.
We also provide a detailed workflow for training and evaluating Grounding DINO on custom datasets; everyone is welcome to try it. A sketch of the fine-tuning command follows the table below.

| Model | Backbone | Style | COCO mAP | Official COCO mAP |
| :----------------: | :------: | :-------: | :--------: | :---------------: |
| Grounding DINO-T | Swin-T | Zero-shot | 48.5 | 48.4 |
| Grounding DINO-T | Swin-T | Finetune | 58.1(+0.9) | 57.2 |
| Grounding DINO-B | Swin-B | Zero-shot | 56.9 | 56.7 |
| Grounding DINO-B | Swin-B | Finetune | 59.7 | |
| Grounding DINO-R50 | R50 | Scratch | 48.9(+0.8) | 48.1 |
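
A fine-tuning run behind the numbers above would look roughly like the following; the config file name is an assumption, so check `configs/grounding_dino/` for the exact fine-tuning configs.

```shell
# 8-GPU COCO fine-tuning; the config name is assumed, verify it under configs/grounding_dino/.
./tools/dist_train.sh configs/grounding_dino/grounding_dino_swin-t_finetune_16xb2_1x_coco.py 8
```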

**4. Supports the open-vocabulary detection algorithm [Detic](projects/Detic_new/README.md) and enables joint training on multiple datasets**

**5. Easily train detection models with [FSDP and DeepSpeed](projects/example_largemodel/README_zh-CN.md)**

| ID | AMP | GC of Backbone | GC of Encoder | FSDP | Peak Mem (GB) | Iter Time (s) |
| :-: | :-: | :------------: | :-----------: | :--: | :-----------: | :-----------: |
| 1 | | | | | 49 (A100) | 0.9 |
| 2 | √ | | | | 39 (A100) | 1.2 |
| 3 | | √ | | | 33 (A100) | 1.1 |
| 4 | √ | √ | | | 25 (A100) | 1.3 |
| 5 | | √ | √ | | 18 | 2.2 |
| 6 | √ | √ | √ | | 13 | 1.6 |
| 7 | | √ | √ | √ | 14 | 2.9 |
| 8 | √ | √ | √ | √ | 8.5 | 2.4 |

**6. Supports [V3Det](configs/v3det/README.md), a super-large vocabulary detection dataset with more than 13,000 categories**

<div align=center>
<img width=960 src="https://github.com/open-mmlab/mmdetection/assets/17425982/9c216387-02be-46e6-b0f2-b856f80f6d84"/>
</div>

3.1.0

Highlights

- Supports tracking algorithms, including the multi-object tracking (MOT) algorithms SORT, DeepSORT, StrongSORT, OCSORT, ByteTrack, and QDTrack, and the video instance segmentation (VIS) algorithms MaskTrackRCNN and Mask2Former-VIS.
- Supports [ViTDet](https://github.com/open-mmlab/mmdetection/tree/dev-3.x/projects/ViTDet).
- Supports inference and evaluation of the multimodal algorithms [GLIP](https://github.com/open-mmlab/mmdetection/tree/dev-3.x/configs/glip) and [XDecoder](https://github.com/open-mmlab/mmdetection/tree/dev-3.x/projects/XDecoder), and also supports datasets such as COCO semantic segmentation, COCO Caption, ADE20k general segmentation, and RefCOCO. GLIP fine-tuning will be supported in the future.
- Provides a [gradio demo](https://github.com/open-mmlab/mmdetection/blob/dev-3.x/projects/gradio_demo/README.md) for the image-based tasks of MMDetection, making it easy for users to try them out.

Exciting Features

GLIP inference and evaluation

As multimodal vision algorithms continue to evolve, MMDetection has also supported such algorithms. This section demonstrates how to use the demo and evaluation scripts for multimodal algorithms, using the GLIP algorithm and model as an example. Moreover, MMDetection has integrated a [gradio_demo project](../../../projects/gradio_demo/), which allows developers to quickly try out all image input tasks in MMDetection on their local devices. Check the [document](../../../projects/gradio_demo/README.md) for more details.

Preparation

Please first make sure that you have the correct dependencies installed:

```shell
# if source
pip install -r requirements/multimodal.txt

# if wheel
mim install mmdet[multimodal]
```


MMDetection has already implemented GLIP and provides the pre-trained weights; you can download them directly:

```shell
cd mmdetection
wget https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_a_mmdet-b3654169.pth
```


Inference

Once the model is successfully downloaded, you can use the `demo/image_demo.py` script to run the inference.

```shell
python demo/image_demo.py demo/demo.jpg glip_tiny_a_mmdet-b3654169.pth --texts bench
```


The demo result will be similar to this:

<div align=center>
<img src="https://user-images.githubusercontent.com/17425982/234547841-266476c8-f987-4832-8642-34357be621c6.png" height="300"/>
</div>

If users would like to detect multiple targets, please declare them in the format of `xx . xx .` after the `--texts`.

```shell
python demo/image_demo.py demo/demo.jpg glip_tiny_a_mmdet-b3654169.pth --texts 'bench . car .'
```


And the result will be like this one:

<div align=center>
<img src="https://user-images.githubusercontent.com/17425982/234548156-ef9bbc2e-7605-4867-abe6-048b8578893d.png" height="300"/>
</div>

You can also use a sentence as the input prompt for the `--texts` field, for example:

```shell
python demo/image_demo.py demo/demo.jpg glip_tiny_a_mmdet-b3654169.pth --texts 'There are a lot of cars here.'
```


The result will be similar to this:

<div align=center>
<img src="https://user-images.githubusercontent.com/17425982/234548490-d2e0a16d-1aad-4708-aea0-c829634fabbd.png" height="300"/>
</div>

Evaluation

The GLIP implementation in MMDetection does not have any performance degradation; our benchmark is as follows:

| Model | official mAP | mmdet mAP |
| ----------------------- | :----------: | :-------: |
