Auto-annotation API
Overview
This layer provides functionality that allows you to automate the process of annotating a CVAT dataset by delegating this process (or parts of it) to a program running on a machine under your control.
To make use of this delegation, you must implement an “auto-annotation function”, or “AA function” for short. This is a Python object that implements one of the protocols defined by this layer. The particular protocol implemented defines which part of the annotation process the AA function will be able to automate.
An AA function may be used in one of the following modes:
- Immediate mode. This involves annotating a specific CVAT task by passing the AA function to a driver, along with the identifier of the task and optional additional parameters. This may be done either:

  - programmatically (consult the “Auto-annotation driver” section (TODO)); or
  - via the CVAT CLI (consult the description of the `task auto-annotate` command in the CLI documentation).

- Agent mode. This involves registering the AA function with the CVAT server (creating a resource on the server known as a “native function”) and then running one or more agent processes.

  This makes the AA function usable from the CVAT UI. CVAT users can select the native function as the model when using CVAT’s AI tools. When they do, the agents detect this and process their requests by calling the appropriate methods on the corresponding AA function.

  Depending on how you create the native function, it will be accessible either only to you or to your organization.

  For more details, consult the descriptions of the `function create-native` and `function run-agent` commands in the CLI documentation.
This SDK layer can be divided into several parts:
- The interface, containing the protocols that an AA function must implement, as well as helpers for use by such functions. Consult “…”
- The driver, containing functionality to annotate a CVAT dataset using an AA function. Consult “…”
- Predefined AA functions based on torchvision. Consult “…”
Example
An AA function may be implemented in any way that is appropriate for your use case. However, a typical AA function will be based on a machine learning model and consist of the following basic elements:
- Code to load the ML model.
- A specification defining which protocol the AA function implements, as well as static properties of the AA function (such as a description of the annotations that the AA function can produce).
- Code to convert data from SDK data structures to a format the ML model can understand.
- Code to run the ML model.
- Code to convert the resulting annotations to SDK data structures.
The following code snippet shows an example AA function implementation (specifically, a detection function), as well as code that creates an instance of the function and uses it for auto-annotation.
```python
import PIL.Image
import torchvision.models

from cvat_sdk import make_client
import cvat_sdk.models as models
import cvat_sdk.auto_annotation as cvataa


class TorchvisionDetectionFunction:
    def __init__(self, model_name: str, weights_name: str, **kwargs) -> None:
        # load the ML model
        weights_enum = torchvision.models.get_model_weights(model_name)
        self._weights = weights_enum[weights_name]
        self._transforms = self._weights.transforms()
        self._model = torchvision.models.get_model(model_name, weights=self._weights, **kwargs)
        self._model.eval()

    @property
    def spec(self) -> cvataa.DetectionFunctionSpec:
        # describe the annotations
        return cvataa.DetectionFunctionSpec(
            labels=[
                cvataa.label_spec(cat, i, type="rectangle")
                for i, cat in enumerate(self._weights.meta["categories"])
                if cat != "N/A"
            ]
        )

    def detect(
        self, context: cvataa.DetectionFunctionContext, image: PIL.Image.Image
    ) -> list[models.LabeledShapeRequest]:
        # determine the threshold for filtering results
        conf_threshold = context.conf_threshold or 0

        # convert the input into a form the model can understand
        transformed_image = [self._transforms(image)]

        # run the ML model
        results = self._model(transformed_image)

        # convert the results into the form the SDK requires
        return [
            cvataa.rectangle(label.item(), [x.item() for x in box])
            for result in results
            for box, label, score in zip(result["boxes"], result["labels"], result["scores"])
            if score >= conf_threshold
        ]


# log into the CVAT server
with make_client(host="http://localhost", credentials=("user", "password")) as client:
    # create a function that uses Faster R-CNN
    func = TorchvisionDetectionFunction("fasterrcnn_resnet50_fpn_v2", "DEFAULT", box_score_thresh=0.5)

    # annotate task 12345 using the function
    cvataa.annotate_task(client, 12345, func)
```
Auto-annotation interface
This part of the auto-annotation layer defines the protocols that an AA function must implement.
Detection function protocol
A detection function is a type of AA function that accepts an image and returns a list of shapes found in that image.
A detection function can be used in the following ways:
- In immediate mode, the AA function is run for every image in a given CVAT task, and the resulting lists of shapes are combined and uploaded to CVAT.
- In agent mode, the AA function can be used from the CVAT UI to annotate either a complete task (similarly to immediate mode) or a single frame in a task.
A detection function must have two attributes, `spec` and `detect`.

`spec` must contain the AA function’s specification, which is an instance of `DetectionFunctionSpec`.

`DetectionFunctionSpec` must be initialized with a sequence of `PatchedLabelRequest` objects that represent the labels that the AA function knows about. See the docstring of `DetectionFunctionSpec` for more information on the constraints that these objects must follow. `BadFunctionError` will be raised if any constraint violations are detected.
`detect` must be a function/method accepting two parameters:

- `context` (`DetectionFunctionContext`). Contains invocation parameters and information about the current image. The following fields are available:
  - `frame_name` (`str`). The file name of the frame on the CVAT server.
  - `conf_threshold` (`float | None`). The confidence threshold that the function should use to filter objects. If `None`, the function may apply a default threshold at its discretion.
- `image` (`PIL.Image.Image`). Contains image data.
`detect` must return a list of `LabeledShapeRequest` objects, representing shapes found in the image. See the docstring of `DetectionFunctionSpec` for more information on the constraints that these objects must follow.
The same AA function may be used with any dataset that contains labels with the same names as those in the AA function’s specification. The way this works is that the driver matches labels between the spec and the dataset, and replaces the label IDs in the shape objects with the IDs defined in the dataset.
For example, suppose the AA function’s spec defines the following labels:
| Name | ID |
|------|----|
| bat  | 0  |
| rat  | 1  |
And the dataset defines the following labels:
| Name | ID  |
|------|-----|
| bat  | 100 |
| cat  | 101 |
| rat  | 102 |
Then suppose `detect` returns a shape with `label_id` equal to 1. The driver will see that it refers to the `rat` label, and replace it with 102, since that’s the ID this label has in the dataset.
The same logic is used for sublabel and attribute IDs.
Helper factory functions
The CVAT API model types used in the detection function protocol are somewhat unwieldy to work with, so it’s recommended to use the helper factory functions provided by this layer. These helpers instantiate an object of their corresponding model type, passing their arguments to the model constructor and sometimes setting some attributes to fixed values.
The following helpers are available for building specifications:
| Name | Model type | Fixed attributes |
|------|------------|------------------|
| `label_spec` | `PatchedLabelRequest` | - |
| `skeleton_label_spec` | `PatchedLabelRequest` | `type="skeleton"` |
| `keypoint_spec` | `SublabelRequest` | `type="points"` |
| `attribute_spec` | `AttributeRequest` | `mutable=False` |
| `checkbox_attribute_spec` | `AttributeRequest` | `mutable=False`, `input_type="checkbox"`, `values=[]` |
| `number_attribute_spec` | `AttributeRequest` | `mutable=False`, `input_type="number"` |
| `radio_attribute_spec` | `AttributeRequest` | `mutable=False`, `input_type="radio"` |
| `select_attribute_spec` | `AttributeRequest` | `mutable=False`, `input_type="select"` |
| `text_attribute_spec` | `AttributeRequest` | `mutable=False`, `input_type="text"`, `values=[]` |
For `number_attribute_spec`, it’s recommended to use the `cvat_sdk.attributes.number_attribute_values` function to create the `values` argument, since this function will enforce the constraints expected for attribute specs of this type.
For example:
```python
cvataa.number_attribute_spec("size", 1, number_attribute_values(0, 10))
```
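As a broader illustration, here is a sketch of a specification that combines several of these helpers. The label and attribute names and IDs are made up for this example, and the `attributes` and `sublabels` keyword arguments are assumed to map onto the corresponding fields of the underlying model types; consult the model reference to confirm.

```python
import cvat_sdk.auto_annotation as cvataa
from cvat_sdk.attributes import number_attribute_values

spec = cvataa.DetectionFunctionSpec(
    labels=[
        # a rectangle label with two immutable attributes
        cvataa.label_spec(
            "car", 0, type="rectangle",
            attributes=[
                cvataa.checkbox_attribute_spec("parked", 0),
                cvataa.number_attribute_spec("occupants", 1, number_attribute_values(1, 9)),
            ],
        ),
        # a skeleton label whose keypoints are defined as sublabels
        cvataa.skeleton_label_spec(
            "person", 1,
            sublabels=[
                cvataa.keypoint_spec("head", 0),
                cvataa.keypoint_spec("feet", 1),
            ],
        ),
    ]
)
```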
The following helpers are available for use in `detect`:
| Name | Model type | Fixed attributes |
|------|------------|------------------|
| `shape` | `LabeledShapeRequest` | `frame=0` |
| `mask` | `LabeledShapeRequest` | `frame=0`, `type="mask"` |
| `polygon` | `LabeledShapeRequest` | `frame=0`, `type="polygon"` |
| `rectangle` | `LabeledShapeRequest` | `frame=0`, `type="rectangle"` |
| `skeleton` | `LabeledShapeRequest` | `frame=0`, `type="skeleton"` |
| `keypoint` | `SubLabeledShapeRequest` | `frame=0`, `type="points"` |
For `mask`, it is recommended to create the points list using the `cvat_sdk.masks.encode_mask` function, which will convert a bitmap into a list in the format that CVAT expects. For example:
```python
cvataa.mask(my_label, encode_mask(
    my_mask,  # boolean 2D array, same size as the input image
    [x1, y1, x2, y2],  # top left and bottom right coordinates of the mask
))
```
To create shapes with attributes, it’s recommended to use the `cvat_sdk.attributes.attribute_vals_from_dict` function, which returns a list of objects that can be passed to an `attributes` argument:
```python
cvataa.rectangle(
    my_label, [x1, y1, x2, y2],
    attributes=attribute_vals_from_dict({my_attr1: val1, my_attr2: val2})
)
```
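Skeleton shapes are built by nesting keypoints. Assuming the `skeleton` helper accepts its keypoints via an `elements` argument (corresponding to the field of the same name in `LabeledShapeRequest`), a sketch might look like this; the label and sublabel ID variables are placeholders:

```python
cvataa.skeleton(
    person_label_id,
    elements=[
        cvataa.keypoint(head_sublabel_id, [head_x, head_y]),
        cvataa.keypoint(feet_sublabel_id, [feet_x, feet_y]),
    ],
)
```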
Tracking function protocol
A tracking function is a type of AA function that analyzes an image with one or more shapes on it, and then predicts the positions of those shapes on subsequent images.
A tracking function can only be used in agent mode. When used with a tracking function, an agent will use it to process requests from the AI tracking tools in the CVAT UI.
WARNING: Currently, only one agent should be run for each tracking function. If multiple agents for one tracking function are run at the same time, CVAT users may experience intermittent “Tracking state not found” errors when using the function.
A tracking function must have three attributes, `spec`, `init_tracking_state`, and `track`. It may also optionally have a `preprocess_image` attribute.
`spec` must contain the AA function’s specification, which is an instance of `TrackingFunctionSpec`. This specification must be initialized with a single `supported_shape_types` parameter, defining which types of shapes the AA function is able to track. For example:
```python
spec = cvataa.TrackingFunctionSpec(supported_shape_types=["rectangle"])
```
`init_tracking_state` must be a function accepting the following parameters:

- `context` (`TrackingFunctionShapeContext`). An object with information about the shape being tracked. See details below.
- `pp_image` (type varies). A preprocessed image. Consult the description of `preprocess_image` for more details.
- `shape` (`TrackableShape`). A shape within the preprocessed image. `TrackableShape` is a minimal version of the `LabeledShape` SDK model, containing only the `type` and `points` fields. The shape’s `type` is guaranteed to be one of the types listed in the `supported_shape_types` field of the spec.
`init_tracking_state` must analyze the shape and create a state object containing any information that the AA function will need to predict its location on a subsequent image. It must then return this object.

`init_tracking_state` must not modify either `pp_image` or `shape`.
`track` must be a function accepting the following parameters:

- `context` (`TrackingFunctionShapeContext`). An object with information about the shape being tracked. See details below.
- `pp_image` (type varies). A preprocessed image. Consult the description of `preprocess_image` for more details. This image will have the same dimensions as those of the image used to create the `state` object.
- `state` (type varies). The object returned by a previous call to `init_tracking_state`.
`track` must locate the shape that was used to create the `state` object on the new preprocessed image. If it is able to do that, it must return its prediction as a new `TrackableShape` object. This object must have the same value of `type` as the original shape. If `track` is unable to locate the shape, it must return `None`.

`track` may modify `state` as needed to improve prediction accuracy on subsequent frames. It must not modify `pp_image`.
A `TrackingFunctionShapeContext` object passed to both `init_tracking_state` and `track` will have the following field:

- `original_shape_type` (`str`). The type of the shape being tracked. In `init_tracking_state`, this is the same as `shape.type`. In `track`, this is the type of the shape that `state` was created from.
`preprocess_image`, if implemented, must accept the following parameters:

- `context` (`TrackingFunctionContext`). This is currently a dummy object and should be ignored. In future versions, this may contain additional information.
- `image` (`PIL.Image.Image`). An image that will be used to either start or continue tracking.
`preprocess_image` must perform any analysis on the image that the function can perform independently of the shapes being tracked, and return an object representing the results of that analysis. This object will be passed as `pp_image` to `init_tracking_state` and `track`.

If `preprocess_image` is not implemented, then the `pp_image` object will be the original image. In other words, the default implementation is:
```python
def preprocess_image(context, image):
    return image
```
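Putting these pieces together, here is a minimal sketch of a tracking function that simply carries each shape over to later frames unchanged (a real function would analyze `pp_image` in `track`). It assumes that `TrackableShape` and `TrackingFunctionShapeContext` are importable from `cvat_sdk.auto_annotation` and that `TrackableShape` can be constructed with `type` and `points` keyword arguments; check the SDK reference for the exact names.

```python
import dataclasses

import PIL.Image

import cvat_sdk.auto_annotation as cvataa


@dataclasses.dataclass
class _TrackingState:
    # everything this example needs to predict the shape's position later
    shape_type: str
    points: list[float]


class StaticTrackingFunction:
    # only rectangles are supported by this example
    spec = cvataa.TrackingFunctionSpec(supported_shape_types=["rectangle"])

    # preprocess_image is not implemented, so pp_image is the original PIL image

    def init_tracking_state(
        self,
        context: cvataa.TrackingFunctionShapeContext,
        pp_image: PIL.Image.Image,
        shape: cvataa.TrackableShape,
    ) -> _TrackingState:
        # remember the shape as it appeared on the first image
        return _TrackingState(shape_type=shape.type, points=list(shape.points))

    def track(
        self,
        context: cvataa.TrackingFunctionShapeContext,
        pp_image: PIL.Image.Image,
        state: _TrackingState,
    ) -> cvataa.TrackableShape | None:
        # predict the same position on the new image
        return cvataa.TrackableShape(type=state.shape_type, points=state.points)
```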
Auto-annotation driver
The `annotate_task` function uses a detection function to annotate a CVAT task. It must be called as follows:

```
annotate_task(<client>, <task ID>, <AA function>, <optional arguments...>)
```
The supplied client will be used to make all API calls.
By default, new annotations will be appended to the old ones. Use `clear_existing=True` to remove old annotations instead.
If a detection function declares a label that has no matching label in the task, then by default, `BadFunctionError` is raised, and auto-annotation is aborted. If you use `allow_unmatched_labels=True`, then such labels will be ignored, and any shapes referring to them will be dropped. The same logic applies to sublabels and attributes.
It’s possible to pass a custom confidence threshold to the function via the `conf_threshold` parameter.
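For example, a call that combines these options might look as follows (a sketch using the keyword names described above; check the function signature in the SDK reference):

```python
cvataa.annotate_task(
    client, 12345, func,
    clear_existing=True,          # replace any existing annotations
    allow_unmatched_labels=True,  # ignore spec labels that the task does not define
    conf_threshold=0.75,          # drop detections below this confidence
)
```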
`annotate_task` will raise a `BadFunctionError` exception if it detects that the function violated the detection function protocol.
Predefined AA functions
This layer includes several predefined detection functions. You can use them as-is, or as a base on which to build your own.
Each function is implemented as a module to allow usage via the CLI `auto-annotate` command. Therefore, in order to use it from the SDK, you’ll need to import the corresponding module.
cvat_sdk.auto_annotation.functions.torchvision_detection
This AA function uses object detection models from the torchvision library. It produces rectangle annotations.
To use it, install CVAT SDK with the `pytorch` extra:

```
$ pip install "cvat-sdk[pytorch]"
```
Usage from Python:
```
from cvat_sdk.auto_annotation.functions.torchvision_detection import create as create_torchvision

annotate_task(<client>, <task ID>, create_torchvision(<model name>, ...))
```
Usage from the CLI:
```
cvat-cli auto-annotate "<task ID>" --function-module cvat_sdk.auto_annotation.functions.torchvision_detection \
    -p model_name=str:"<model name>" ...
```
The `create` function accepts the following parameters:
- `model_name` (`str`) - the name of the model, such as `fasterrcnn_resnet50_fpn_v2`. This parameter is required.
- `weights_name` (`str`) - the name of a weights enum value for the model, such as `COCO_V1`. Defaults to `DEFAULT`.
It also accepts arbitrary additional parameters, which are passed directly to the model constructor.
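For example, a sketch that selects specific weights and forwards an extra argument (`box_score_thresh`, a Faster R-CNN constructor parameter in torchvision) to the model:

```python
from cvat_sdk.auto_annotation.functions.torchvision_detection import create as create_torchvision

func = create_torchvision("fasterrcnn_resnet50_fpn_v2", weights_name="COCO_V1", box_score_thresh=0.5)
```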
cvat_sdk.auto_annotation.functions.torchvision_instance_segmentation
This AA function is analogous to `torchvision_detection`, except it uses torchvision’s instance segmentation models and produces mask or polygon annotations (depending on the value of `conv_mask_to_poly`).
Refer to that function’s description for usage instructions and parameter information.
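Usage from Python mirrors `torchvision_detection`; for instance, a sketch using `maskrcnn_resnet50_fpn_v2` (one of torchvision’s instance segmentation models) as the model name:

```python
from cvat_sdk.auto_annotation.functions.torchvision_instance_segmentation import create as create_torchvision_is

cvataa.annotate_task(client, 12345, create_torchvision_is("maskrcnn_resnet50_fpn_v2"))
```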
cvat_sdk.auto_annotation.functions.torchvision_keypoint_detection
This AA function is analogous to `torchvision_detection`, except it uses torchvision’s keypoint detection models and produces skeleton annotations.
Keypoints which the model marks as invisible will be marked as occluded in CVAT.
Refer to that function’s description for usage instructions and parameter information.