Streaming video captioning

Check out the new paper from Google, the first in a class of models for streaming video captioning:
[Streaming Dense Video Captioning - https://lnkd.in/ghXgigfF]

The paper proposes a new method for streaming dense video captioning that can handle long videos and emit captions before the entire video has been processed. The method combines a clustering-based memory module with a streaming decoding algorithm, achieving significant improvements on several benchmarks.

The memory module maintains a k-means clustering of the image tokens from all frames seen so far in the video; the tokens come from a CLIP image encoder. Prior methods rely on aggressive, lossy downsampling of frames, whereas clustering provides a fixed-memory alternative for summarizing the video seen so far.
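To make the fixed-size memory concrete, here is a minimal NumPy sketch of a k-means-style memory update. The function name, the memory size K, the token dimension, and the plain (unweighted) Lloyd iterations are my own assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def update_memory(memory, new_tokens, num_iters=5):
    """Summarize all tokens seen so far into a fixed set of K centroids.

    memory:     (K, D) current cluster centroids (the fixed-size memory)
    new_tokens: (N, D) CLIP image tokens from the newly arrived frames
    Returns an updated (K, D) memory via a few k-means iterations,
    initialized from the previous centroids so the memory evolves smoothly.
    """
    K, D = memory.shape
    # Treat the old centroids as points alongside the new tokens (a simplification).
    points = np.concatenate([memory, new_tokens], axis=0)
    centroids = memory.copy()
    for _ in range(num_iters):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # Recompute centroids; keep the old centroid if a cluster is empty.
        for k in range(K):
            members = points[assign == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return centroids

# Hypothetical usage over a stream of frames (sizes are assumed):
K, D = 64, 768                                  # memory slots, CLIP token dimension
memory = np.random.randn(K, D)                  # in practice, seeded from the first frames
for t in range(10):                             # simulate 10 incoming frames
    frame_tokens = np.random.randn(257, D)      # e.g. one ViT token grid per frame
    memory = update_memory(memory, frame_tokens)
```

Whatever the exact clustering variant, the key property is that the memory stays the same size no matter how many frames have streamed in.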

The algorithm decodes captions at designated points in the video stream. From the second decoding point onwards, the outputs of the previous decoding points are concatenated and fed to the text decoder as a prefix. During training, some earlier event captions are randomly removed from this prefix and added to the target instead, which makes the model robust to errors in its earlier predictions. GIT and T5 decoders are used for text. The model improves the state of the art by 4-11 points on popular video captioning datasets - ActivityNet, YouCook2 and ViTT.
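The sketch below illustrates the two decoding ideas in plain Python: earlier captions are concatenated as a prefix at each decoding point, and during training some earlier captions are randomly moved into the target. The `decode_fn` and `memory_at` callables and the `drop_prob` value are hypothetical placeholders, not the paper's API.

```python
import random

def streaming_decode(decode_fn, memory_at, decode_points):
    """Decode event captions at designated points in the stream.

    decode_fn:     hypothetical callable (memory, prefix_text) -> caption string,
                   standing in for a GIT/T5-style decoder conditioned on the
                   visual memory and a text prefix.
    memory_at:     hypothetical callable t -> memory state at decode point t.
    decode_points: timestamps at which captions are emitted.
    """
    predictions = []
    for t in decode_points:
        # From the second decoding point on, earlier outputs become the text prefix.
        prefix = " ".join(predictions)
        predictions.append(decode_fn(memory_at(t), prefix))
    return predictions

def make_training_example(earlier_captions, current_captions, drop_prob=0.5):
    """Training-time augmentation: randomly move some earlier event captions out
    of the prefix and into the prediction target, so the model learns to recover
    from missing or wrong earlier predictions."""
    prefix, target = [], []
    for cap in earlier_captions:
        (target if random.random() < drop_prob else prefix).append(cap)
    target.extend(current_captions)
    return " ".join(prefix), " ".join(target)
```

The prefix concatenation is what lets captions come out as the video plays, while the random dropping keeps the model from over-trusting its own earlier outputs.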

Streaming video captioning is an exciting step forward. It can provide real-time subtitles that improve accessibility for viewers with hearing impairments, and it can improve engagement and comprehension for non-native speakers or people watching in noisy environments. It also facilitates content indexing and search, making videos easier to discover across platforms.
