Automatic annotation

Automatic annotation of tasks

Automatic annotation in CVAT is a tool that you can use to automatically pre-annotate your data with pre-trained models.

CVAT can use models from the following sources:

The following table describes the available options:

Self-hosted Cloud
Price Free See Pricing
Models You have to add models You can use pre-installed models
Hugging Face & Roboflow
Not supported Supported


Running Automatic annotation

To start automatic annotation, do the following:

  1. On the top menu, click Tasks.

  2. Find the task you want to annotate and click Action > Automatic annotation.

  3. In the Automatic annotation dialog, from the drop-down list, select a model.

  4. Match the labels of the model and the task.

  5. (Optional) In case you need the model to return masks as polygons, switch toggle Return masks as polygons.

  6. (Optional) In case you need to remove all previous annotations, switch toggle Clean old annotations.

  7. Click Annotate.

CVAT will show the progress of annotation on the progress bar.

Progress bar

You can stop the automatic annotation at any moment by clicking cancel.

Labels matching

Each model is trained on a dataset and supports only the dataset’s labels.

For example:

  • DL model has the label car.
  • Your task (or project) has the label vehicle.

To annotate, you need to match these two labels to give CVAT a hint that, in this case, car = vehicle.

If you have a label that is not on the list of DL labels, you will not be able to match them.

For this reason, supported DL models are suitable only for certain labels.

To check the list of labels for each model, see Models papers and official documentation.


Automatic annotation uses pre-installed and added models.

For self-hosted solutions, you need to install Automatic Annotation first and add models.

List of pre-installed models:

Model Description
Attributed face detection Three OpenVINO models work together:

  • Face Detection 0205: face detector based on MobileNetV2 as a backbone with a FCOS head for indoor and outdoor scenes shot by a front-facing camera.
  • Emotions recognition retail 0003: fully convolutional network for recognition of five emotions (‘neutral’, ‘happy’, ‘sad’, ‘surprise’, ‘anger’).
  • Age gender recognition retail 0013: fully convolutional network for simultaneous Age/Gender recognition. The network can recognize the age of people in the [18 - 75] years old range; it is not applicable for children since their faces were not in the training set.
  • RetinaNet R101 RetinaNet is a one-stage object detection model that utilizes a focal loss function to address class imbalance during training. Focal loss applies a modulating term to the cross entropy loss to focus learning on hard negative examples. RetinaNet is a single, unified network composed of a backbone network and two task-specific subnetworks.

    For more information, see:
  • Site: RetinaNET
  • Text detection Text detector based on PixelLink architecture with MobileNetV2, depth_multiplier=1.4 as a backbone for indoor/outdoor scenes.

    For more information, see:
  • Site: OpenVINO Text detection 004
  • YOLO v3 YOLO v3 is a family of object detection architectures and models pre-trained on the COCO dataset.

    For more information, see:
  • Site: YOLO v3
  • YOLO v7 YOLOv7 is an advanced object detection model that outperforms other detectors in terms of both speed and accuracy. It can process frames at a rate ranging from 5 to 160 frames per second (FPS) and achieves the highest accuracy with 56.8% average precision (AP) among real-time object detectors running at 30 FPS or higher on the V100 graphics processing unit (GPU).

    For more information, see:
  • GitHub: YOLO v7
  • Paper: YOLO v7
  • Adding models from Hugging Face and Roboflow

    In case you did not find the model you need, you can add a model of your choice from Hugging Face or Roboflow.

    Note, that you cannot add models from Hugging Face and Roboflow to self-hosted CVAT.

    For more information, see Streamline annotation by integrating Hugging Face and Roboflow models.

    This video demonstrates the process: