This release brings several new features to torchvision, including models for semantic segmentation, object detection, instance segmentation and person keypoint detection, and custom C++ / CUDA ops specific to computer vision.
**Note: torchvision 0.3 requires PyTorch 1.1 or newer**
# Highlights
## Reference training / evaluation scripts
We now provide, under the `references/` folder, scripts for training and evaluation of the following tasks: classification, semantic segmentation, object detection, instance segmentation and person keypoint detection.
Their purpose is twofold:
* serve as a log of how to train a specific model
* provide baseline training and evaluation scripts to bootstrap research
They all have an entry-point `train.py` which performs both training and evaluation for a particular task. Other helper files, specific to each training script, are also present in the folder, and they might get integrated into the torchvision library in the future.
We expect users to copy-paste and modify those reference scripts to fit their own needs.
## TorchVision Ops
TorchVision now contains custom C++ / CUDA operators in `torchvision.ops`. Those operators are specific to computer vision, and make it easier to build object detection models.
Those operators currently do not support PyTorch script mode, but support for it is planned for future releases.
### List of supported ops
* `roi_pool` (and the module version `RoIPool`)
* `roi_align` (and the module version `RoIAlign`)
* `nms`, for non-maximum suppression of bounding boxes
* `box_iou`, for computing the intersection over union metric between two sets of bounding boxes
All the other ops present in `torchvision.ops` and its subfolders are experimental, in particular:
* `FeaturePyramidNetwork` is a module that adds an FPN on top of a module that returns a set of feature maps.
* `MultiScaleRoIAlign` is a wrapper around `roi_align` that works with multiple feature map scales.
Here are a few examples of using torchvision ops:

```python
import torch
import torchvision

# create 10 random boxes
boxes = torch.rand(10, 4) * 100
# they need to be in [x0, y0, x1, y1] format
boxes[:, 2:] += boxes[:, :2]

# create a random image
image = torch.rand(1, 3, 200, 200)

# extract regions in `image` defined in `boxes`, rescaling
# them to have a size of 3x3
pooled_regions = torchvision.ops.roi_align(image, [boxes], output_size=(3, 3))

# check the size
print(pooled_regions.shape)
# torch.Size([10, 3, 3, 3])

# or compute the intersection over union between
# all pairs of boxes
print(torchvision.ops.box_iou(boxes, boxes).shape)
# torch.Size([10, 10])
```
## Models for more tasks
The 0.3 release of torchvision includes pre-trained models for tasks other than image classification on ImageNet.
We include two new categories of models: region-based models, like Faster R-CNN, and dense pixelwise prediction models, like DeepLabV3.
### Object Detection, Instance Segmentation and Person Keypoint Detection models
**Warning: The API is currently experimental and might change in future versions of torchvision**
The 0.3 release contains pre-trained models for Faster R-CNN, Mask R-CNN and Keypoint R-CNN, all of them using a ResNet-50 backbone with FPN.
They have been trained on COCO train2017 following the reference scripts in `references/`, and give the following results on COCO val2017:
Network | box AP | mask AP | keypoint AP
-- | -- | -- | --