The animation shown here is compressed. When you use the tracker, the framerate will be higher and the resolution perfectly sharp. This tracker supports an arbitrary number of advancements, recipes, custom ...
The main model is composed of a pretrained convolutional encoder that extracts features and a transformer decoder that generates captions. For more information, please refer to the corresponding DCASE task ...
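
To make the encoder-decoder layout described above more concrete, here is a minimal PyTorch sketch. It is not the repository's actual model: the class name `CaptionModel`, the layer sizes, the vocabulary size, and the log-mel spectrogram input are all assumptions chosen for illustration. The small convolutional stack is randomly initialized and only stands in for the pretrained encoder, and positional encodings are left out for brevity.

```python
import torch
import torch.nn as nn


class CaptionModel(nn.Module):
    """Hypothetical captioner: convolutional encoder + transformer decoder."""

    def __init__(self, vocab_size=100, d_model=64, n_mels=64):
        super().__init__()
        # Stand-in for the pretrained convolutional encoder (random weights here).
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AvgPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AvgPool2d(2),
        )
        # Two 2x poolings shrink the mel axis by 4; flatten channels x mel bins.
        self.proj = nn.Linear(32 * (n_mels // 4), d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(
            d_model, nhead=4, dim_feedforward=128, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, spec, tokens):
        # spec: (batch, time, n_mels) assumed log-mel input
        # tokens: (batch, length) caption word ids fed to the decoder
        feats = self.encoder(spec.unsqueeze(1))  # (B, 32, T/4, n_mels/4)
        b, c, t, m = feats.shape
        # One feature vector per (downsampled) time step.
        memory = self.proj(feats.permute(0, 2, 1, 3).reshape(b, t, c * m))
        length = tokens.size(1)
        # Causal mask: each position may only attend to earlier caption tokens.
        causal = torch.triu(
            torch.full((length, length), float("-inf"), device=tokens.device),
            diagonal=1,
        )
        hidden = self.decoder(self.embed(tokens), memory, tgt_mask=causal)
        return self.out(hidden)  # (B, length, vocab_size) word logits


model = CaptionModel()
model.eval()
logits = model(torch.randn(2, 100, 64), torch.randint(0, 100, (2, 7)))
```

The decoder attends to the encoder features through cross-attention (`memory`), so the caption is conditioned on the input at every decoding step. In a real setup the encoder weights would be loaded from a pretrained checkpoint rather than initialized randomly.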