Auto-annotation API

Overview

This layer provides functionality that allows you to automate the process of annotating a CVAT dataset by delegating this process (or parts of it) to a program running on a machine under your control.

To make use of this delegation, you must implement an “auto-annotation function”, or “AA function” for short. This is a Python object that implements one of the protocols defined by this layer. The particular protocol implemented defines which part of the annotation process the AA function will be able to automate.

An AA function may be used in one of the following modes:

  • Immediate mode. This involves annotating a specific CVAT task by passing the AA function to a driver, along with the identifier of the task and optional additional parameters. This may be done either:

    • programmatically (consult the “Auto-annotation driver” section (TODO)); or

    • via the CVAT CLI (consult the description of the task auto-annotate command in the CLI documentation).

  • Agent mode. This involves registering the AA function with the CVAT server (creating a resource on the server known as a “native function”) and then running one or more agent processes.

    This makes the AA function usable from the CVAT UI. CVAT users can choose to use the native function as the model when using CVAT’s AI tools. When they do, the agents detect this, and process their requests by calling appropriate methods on the corresponding AA function.

    Depending on how you create the native function, it will be accessible either only to you or to your entire organization.

    For more details, consult the descriptions of the function create-native and function run-agent commands in the CLI documentation.
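
    For illustration, an agent-mode session from the CLI might look roughly like the following. The function name, module name, and exact flags here are assumptions for the sketch; consult the CLI documentation for the authoritative syntax:

    # register the AA function on the server as a native function
    cvat-cli function create-native "my function" --function-module my_functions.my_func

    # start an agent that serves UI requests using that function
    cvat-cli function run-agent <function ID> --function-module my_functions.my_func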

This SDK layer can be divided into several parts:

  • The interface, containing the protocols that an AA function must implement, as well as helpers for use by such functions. Consult “…”

  • The driver, containing functionality to annotate a CVAT dataset using an AA function. Consult “…”

  • Predefined AA functions based on torchvision. Consult “…”

Example

An AA function may be implemented in any way that is appropriate for your use case. However, a typical AA function will be based on a machine learning model and consist of the following basic elements:

  • Code to load the ML model.

  • A specification defining which protocol the AA function implements, as well as static properties of the AA function (such as a description of the annotations that the AA function can produce).

  • Code to convert data from SDK data structures to a format the ML model can understand.

  • Code to run the ML model.

  • Code to convert resulting annotations to SDK data structures.

The following code snippet shows an example AA function implementation (specifically, a detection function), as well as code that creates an instance of the function and uses it for auto-annotation.

import PIL.Image

import torchvision.models

from cvat_sdk import make_client
import cvat_sdk.models as models
import cvat_sdk.auto_annotation as cvataa

class TorchvisionDetectionFunction:
    def __init__(self, model_name: str, weights_name: str, **kwargs) -> None:
        # load the ML model
        weights_enum = torchvision.models.get_model_weights(model_name)
        self._weights = weights_enum[weights_name]
        self._transforms = self._weights.transforms()
        self._model = torchvision.models.get_model(model_name, weights=self._weights, **kwargs)
        self._model.eval()

    @property
    def spec(self) -> cvataa.DetectionFunctionSpec:
        # describe the annotations
        return cvataa.DetectionFunctionSpec(
            labels=[
                cvataa.label_spec(cat, i, type="rectangle")
                for i, cat in enumerate(self._weights.meta["categories"])
                if cat != "N/A"
            ]
        )

    def detect(
        self, context: cvataa.DetectionFunctionContext, image: PIL.Image.Image
    ) -> list[models.LabeledShapeRequest]:
        # determine the threshold for filtering results
        conf_threshold = context.conf_threshold or 0

        # convert the input into a form the model can understand
        transformed_image = [self._transforms(image)]

        # run the ML model
        results = self._model(transformed_image)

        # convert the results into the form the SDK requires
        return [
            cvataa.rectangle(label.item(), [x.item() for x in box])
            for result in results
            for box, label, score in zip(result["boxes"], result["labels"], result["scores"])
            if score >= conf_threshold
        ]

# log into the CVAT server
with make_client(host="http://localhost", credentials=("user", "password")) as client:
    # create a function that uses Faster R-CNN
    func = TorchvisionDetectionFunction("fasterrcnn_resnet50_fpn_v2", "DEFAULT", box_score_thresh=0.5)

    # annotate task 12345 using the function
    cvataa.annotate_task(client, 12345, func)

Auto-annotation interface

This part of the auto-annotation layer defines the protocols that an AA function must implement.

Detection function protocol

A detection function is a type of AA function that accepts an image and returns a list of shapes found in that image.

A detection function can be used in the following ways:

  • In immediate mode, the AA function is run for every image in a given CVAT task, and the resulting lists of shapes are combined and uploaded to CVAT.

  • In agent mode, the AA function can be used from the CVAT UI to annotate either a complete task (similar to immediate mode) or a single frame in a task.

A detection function must have two attributes, spec and detect.

spec must contain the AA function’s specification, which is an instance of DetectionFunctionSpec.

DetectionFunctionSpec must be initialized with a sequence of PatchedLabelRequest objects that represent the labels that the AA function knows about. See the docstring of DetectionFunctionSpec for more information on the constraints that these objects must follow. BadFunctionError will be raised if any constraint violations are detected.
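
For example, a minimal specification describing two labels might look like this (the label names are illustrative):

spec = cvataa.DetectionFunctionSpec(
    labels=[
        # the IDs (0 and 1) are chosen by the AA function itself;
        # the driver later maps them to the dataset's own label IDs
        cvataa.label_spec("cat", 0),
        cvataa.label_spec("dog", 1),
    ]
)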

detect must be a function/method accepting two parameters:

  • context (DetectionFunctionContext). Contains invocation parameters and information about the current image. The following fields are available:

    • frame_name (str). The file name of the frame on the CVAT server.

    • conf_threshold (float | None). The confidence threshold that the function should use to filter objects. If None, the function may apply a default threshold at its discretion.

  • image (PIL.Image.Image). Contains image data.

detect must return a list of LabeledShapeRequest objects, representing shapes found in the image. See the docstring of DetectionFunctionSpec for more information on the constraints that these objects must follow.
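
To illustrate the contract, here is a deliberately trivial detect implementation that returns one hard-coded rectangle per image, using the imports from the earlier example (a sketch only; the label ID and coordinates are placeholders):

def detect(
    context: cvataa.DetectionFunctionContext, image: PIL.Image.Image
) -> list[models.LabeledShapeRequest]:
    # a real function would run a model on the image here
    return [cvataa.rectangle(0, [10.0, 10.0, 100.0, 100.0])]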

The same AA function may be used with any dataset that contains labels with the same names as those in the AA function’s specification. The driver matches labels between the spec and the dataset by name, and replaces the label IDs in the shape objects with those defined in the dataset.

For example, suppose the AA function’s spec defines the following labels:

Name  ID
bat   0
rat   1

And the dataset defines the following labels:

Name  ID
bat   100
cat   101
rat   102

Then suppose detect returns a shape with label_id equal to 1. The driver will see that it refers to the rat label, and replace it with 102, since that’s the ID this label has in the dataset.

The same logic is used for sublabel and attribute IDs.

Helper factory functions

The CVAT API model types used in the detection function protocol are somewhat unwieldy to work with, so it’s recommended to use the helper factory functions provided by this layer. These helpers instantiate an object of their corresponding model type, passing their arguments to the model constructor and sometimes setting some attributes to fixed values.

The following helpers are available for building specifications:

Name                     Model type           Fixed attributes
label_spec               PatchedLabelRequest  -
skeleton_label_spec      PatchedLabelRequest  type="skeleton"
keypoint_spec            SublabelRequest      type="points"
attribute_spec           AttributeRequest     mutable=False
checkbox_attribute_spec  AttributeRequest     mutable=False, input_type="checkbox", values=[]
number_attribute_spec    AttributeRequest     mutable=False, input_type="number"
radio_attribute_spec     AttributeRequest     mutable=False, input_type="radio"
select_attribute_spec    AttributeRequest     mutable=False, input_type="select"
text_attribute_spec      AttributeRequest     mutable=False, input_type="text", values=[]

For number_attribute_spec, it’s recommended to use the cvat_sdk.attributes.number_attribute_values function to create the values argument, since this function will enforce the constraints expected for attribute specs of this type. For example:

cvataa.number_attribute_spec("size", 1, number_attribute_values(0, 10))

The following helpers are available for use in detect:

Name       Model type              Fixed attributes
shape      LabeledShapeRequest     frame=0
mask       LabeledShapeRequest     frame=0, type="mask"
polygon    LabeledShapeRequest     frame=0, type="polygon"
rectangle  LabeledShapeRequest     frame=0, type="rectangle"
skeleton   LabeledShapeRequest     frame=0, type="skeleton"
keypoint   SubLabeledShapeRequest  frame=0, type="points"

For mask, it is recommended to create the points list using the cvat_sdk.masks.encode_mask function, which will convert a bitmap into a list in the format that CVAT expects. For example:

cvataa.mask(my_label, encode_mask(
    my_mask,  # boolean 2D array, same size as the input image
    [x1, y1, x2, y2],  # top left and bottom right coordinates of the mask
))

To create shapes with attributes, it’s recommended to use the cvat_sdk.attributes.attribute_vals_from_dict function, which returns a list of objects that can be passed to an attributes argument:

cvataa.rectangle(
    my_label, [x1, y1, x2, y2],
    attributes=attribute_vals_from_dict({my_attr1: val1, my_attr2: val2})
)

Tracking function protocol

A tracking function is a type of AA function that analyzes an image with one or more shapes on it, and then predicts the positions of those shapes on subsequent images.

A tracking function can only be used in agent mode. An agent running a tracking function uses it to process requests from the AI tracking tools in the CVAT UI.

WARNING: Currently, only one agent should be run for each tracking function. If multiple agents for one tracking function are run at the same time, CVAT users may experience intermittent “Tracking state not found” errors when using the function.

A tracking function must have three attributes, spec, init_tracking_state, and track. It may also optionally have a preprocess_image attribute.

spec must contain the AA function’s specification, which is an instance of TrackingFunctionSpec. This specification must be initialized with a single supported_shape_types parameter, defining which types of shapes the AA function is able to track. For example:

spec = cvataa.TrackingFunctionSpec(supported_shape_types=["rectangle"])

init_tracking_state must be a function accepting the following parameters:

  • context (TrackingFunctionShapeContext). An object with information about the shape being tracked. See details below.

  • pp_image (type varies). A preprocessed image. Consult the description of preprocess_image for more details.

  • shape (TrackableShape). A shape within the preprocessed image. TrackableShape is a minimal version of the LabeledShape SDK model, containing only the type and points fields. The shape’s type is guaranteed to be one of the types listed in the supported_shape_types field of the spec.

init_tracking_state must analyze the shape and create a state object containing any information that the AA function will need to predict its location on a subsequent image. It must then return this object.

init_tracking_state must not modify either pp_image or shape.

track must be a function accepting the following parameters:

  • context (TrackingFunctionShapeContext). An object with information about the shape being tracked. See details below.

  • pp_image (type varies). A preprocessed image. Consult the description of preprocess_image for more details. This image will have the same dimensions as those of the image used to create the state object.

  • state (type varies). The object returned by a previous call to init_tracking_state.

track must locate the shape that was used to create the state object on the new preprocessed image. If it is able to do that, it must return its prediction as a new TrackableShape object. This object must have the same value of type as the original shape.

If track is unable to locate the shape, it must return None.

track may modify state as needed to improve prediction accuracy on subsequent frames. It must not modify pp_image.

A TrackingFunctionShapeContext object passed to both init_tracking_state and track will have the following field:

  • original_shape_type (str). The type of the shape being tracked. In init_tracking_state, this is the same as shape.type. In track, this is the type of the shape that state was created from.

preprocess_image, if implemented, must accept the following parameters:

  • context (TrackingFunctionContext). This is currently a dummy object and should be ignored. In future versions, this may contain additional information.

  • image (PIL.Image.Image). An image that will be used to either start or continue tracking.

preprocess_image must perform any analysis on the image that the function can perform independently of the shapes being tracked and return an object representing the results of that analysis. This object will be passed as pp_image to init_tracking_state and track.

If preprocess_image is not implemented, then the pp_image object will be the original image. In other words, the default implementation is:

def preprocess_image(context, image):
    return image
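
Putting these pieces together, a minimal (and deliberately naive) tracking function might look like the sketch below. It assumes TrackableShape can be constructed directly from the cvat_sdk.auto_annotation module; the “tracking” logic is a placeholder that simply predicts that every shape stays where it was:

import cvat_sdk.auto_annotation as cvataa

class StationaryTrackingFunction:
    # this sketch only supports rectangles
    spec = cvataa.TrackingFunctionSpec(supported_shape_types=["rectangle"])

    def init_tracking_state(self, context, pp_image, shape):
        # remember the shape's last known position
        return {"points": list(shape.points)}

    def track(self, context, pp_image, state):
        # naive prediction: the shape has not moved
        return cvataa.TrackableShape(
            type=context.original_shape_type, points=state["points"]
        )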

Auto-annotation driver

The annotate_task function uses a detection function to annotate a CVAT task. It must be called as follows:

annotate_task(<client>, <task ID>, <AA function>, <optional arguments...>)

The supplied client will be used to make all API calls.

By default, new annotations will be appended to the old ones. Use clear_existing=True to remove old annotations instead.

If a detection function declares a label that has no matching label in the task, then by default, BadFunctionError is raised, and auto-annotation is aborted. If you use allow_unmatched_labels=True, such labels will be ignored instead, and any shapes referring to them will be dropped. The same logic applies to sublabels and attributes.

It’s possible to pass a custom confidence threshold to the function via the conf_threshold parameter.

annotate_task will raise a BadFunctionError exception if it detects that the function violated the detection function protocol.
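
For example, combining these options (reusing the client and func objects from the earlier example):

cvataa.annotate_task(
    client, 12345, func,
    clear_existing=True,          # replace existing annotations
    allow_unmatched_labels=True,  # skip labels missing from the task
    conf_threshold=0.75,          # filter out low-confidence shapes
)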

Predefined AA functions

This layer includes several predefined detection functions. You can use them as-is, or as a base on which to build your own.

Each function is implemented as a module, which allows it to be used via the CLI task auto-annotate command. To use one from the SDK, you’ll need to import the corresponding module.

cvat_sdk.auto_annotation.functions.torchvision_detection

This AA function uses object detection models from the torchvision library. It produces rectangle annotations.

To use it, install CVAT SDK with the pytorch extra:

$ pip install "cvat-sdk[pytorch]"

Usage from Python:

import cvat_sdk.auto_annotation as cvataa
from cvat_sdk.auto_annotation.functions.torchvision_detection import create as create_torchvision

cvataa.annotate_task(<client>, <task ID>, create_torchvision(<model name>, ...))

Usage from the CLI:

cvat-cli task auto-annotate "<task ID>" --function-module cvat_sdk.auto_annotation.functions.torchvision_detection \
      -p model_name=str:"<model name>" ...

The create function accepts the following parameters:

  • model_name (str) - the name of the model, such as fasterrcnn_resnet50_fpn_v2. This parameter is required.
  • weights_name (str) - the name of a weights enum value for the model, such as COCO_V1. Defaults to DEFAULT.

It also accepts arbitrary additional parameters, which are passed directly to the model constructor.

cvat_sdk.auto_annotation.functions.torchvision_instance_segmentation

This AA function is analogous to torchvision_detection, except it uses torchvision’s instance segmentation models and produces mask or polygon annotations (depending on the value of conv_mask_to_poly).

Refer to that function’s description for usage instructions and parameter information.
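
For instance, a Python invocation analogous to the one shown for torchvision_detection might look like this (the import alias and model name are illustrative):

import cvat_sdk.auto_annotation as cvataa
from cvat_sdk.auto_annotation.functions.torchvision_instance_segmentation import create as create_torchvision_seg

cvataa.annotate_task(<client>, <task ID>, create_torchvision_seg("maskrcnn_resnet50_fpn_v2"))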

cvat_sdk.auto_annotation.functions.torchvision_keypoint_detection

This AA function is analogous to torchvision_detection, except it uses torchvision’s keypoint detection models and produces skeleton annotations. Keypoints which the model marks as invisible will be marked as occluded in CVAT.

Refer to that function’s description for usage instructions and parameter information.