Why Captioning Videos is Difficult

Text superimposed on video content is everywhere: YouTube even offers to transcribe the audio of uploaded videos and display the text in sync with the video. Most video sources (including DVDs) have an option to show subtitles in any one of a number of languages. The distinction between subtitles and closed captioning is explained here. Closed captions are optional: you can turn them on or off, whereas burned-in (open) captions and subtitles are part of the video picture itself.

If text over video is so common, why is it difficult to do? The short answer is that a video contains far more data than a single image, and writing text into the picture itself means decoding, modifying, and re-encoding the frames, which takes much more computation than re-creating one image. Showing a caption, by contrast, only requires storing the text, the time at which it is to be shown, and its location on the screen. This information is easily encoded in a file and rendered by the video player software, and this is how most text on video is displayed.

One problem is that different video players use different file formats for this data. Another is that the caption file is separate from the video file. If you use a third-party application to download a video that YouTube has auto-captioned, the auto-captions will not be included. Web-based services that display remotely stored video, such as YouTube, can ensure that every video is shown in a player that supports separately stored captions, but because the captions are kept apart from the video file, the caption data is not easily available, or not available at all. YouTube’s takeout facility, which allows users to download all their YouTube video content, includes a JSON metadata file for each video file. This file includes many metadata fields for the video but not the caption data.
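To make the point about caption files concrete, here is a minimal sketch in Python that writes caption text and timings in the widely used SubRip (.srt) layout. The `Cue` class and the function names are invented for this illustration and are not taken from any particular player or service. Basic SubRip stores only the text and the timing, so screen position is left out here.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One caption (hypothetical helper type for this sketch)."""
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SubRip."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues: list[Cue]) -> str:
    """Render cues as numbered SubRip blocks separated by blank lines."""
    blocks = []
    for number, cue in enumerate(cues, start=1):
        timing = f"{srt_timestamp(cue.start)} --> {srt_timestamp(cue.end)}"
        blocks.append(f"{number}\n{timing}\n{cue.text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    cues = [Cue(0.0, 2.5, "Hello"), Cue(2.5, 6.0, "Welcome to the video")]
    print(to_srt(cues))
```

The whole caption track for a video is a few kilobytes of text like this, which is why players can render it so cheaply compared with re-encoding the picture.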

Using a separate file for text and video is great for flexibility, but it does require that the caption file be kept alongside the video. A further problem is that not all video players support all the available caption file formats. Some container formats, such as Matroska (MKV) and MP4, can carry caption tracks inside the video file itself, but this only helps if the video player is able to read them. The default Windows 10 video player, Photos, supports a number of caption file formats and there are many online facilities for generating them, some of which are reviewed here.
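When a player does not accept the caption format you have, converting between formats is often straightforward. As an illustration only, the following Python sketch converts simple SubRip (.srt) text to WebVTT (.vtt), the format used by web browsers. It assumes a plain, well-formed input file and handles just the two differences that matter for such files: the `WEBVTT` header line and the decimal separator in timestamps. The function name `srt_to_vtt` is made up for this example.

```python
import re

# Matches SubRip timestamps such as 00:01:02,500 and captures both halves.
_SRT_TIME = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert simple SubRip text to WebVTT.

    Adds the WEBVTT header and replaces ',' with '.' in timing lines.
    Caption text lines are passed through unchanged.
    """
    body = srt_text.replace("\r\n", "\n").strip()
    lines = []
    for line in body.split("\n"):
        if "-->" in line:  # only timing lines contain the arrow
            line = _SRT_TIME.sub(r"\1.\2", line)
        lines.append(line)
    return "WEBVTT\n\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "1\n00:00:00,000 --> 00:00:02,500\nHello, world\n"
    print(srt_to_vtt(sample))
```

Files that use styling tags or positioning extensions may need more care than this sketch provides.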

So if you want to ensure that your captions are readable on any video player and are embedded in the video data, what are the options? If you want to keep the entire original video frame and place the caption beneath it, then the video needs to be padded out with a uniform colour bar below the video frame. This can be done using the command-line application ffmpeg, which is available for Windows. A complication is that portrait mode videos from smart phones may already be padded with black at the sides to give the video frame the same dimensions as in landscape mode. Videos copied from analog media such as Hi8 or VHS cassette tapes may have similar padding added.

The caption can then be written on the uniform colour bar or on top of the video, using either a web-based service such as Kapwing Subtitler or a desktop video editor such as Photos for Windows 10. Photos does not offer the flexibility of caption font, colour, and position selection offered by Kapwing Subtitler, but it is simple to use and available as part of Windows 10. Using a desktop application is likely to be much slower than using a cloud-based facility, which can apply more computing resources than the average domestic computer has to the task of reading video frames, adding text, and re-encoding the video.
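As a rough sketch of the padding step, the Python function below builds an ffmpeg command that uses ffmpeg's `pad` filter to add a colour bar beneath the original frame. The function name, the file names, and the default bar height of 80 pixels are assumptions chosen for illustration, and the sketch assumes the source video has an even height. It only builds the argument list; running it requires ffmpeg to be installed and on the PATH.

```python
def pad_command(src: str, dst: str, bar_height: int = 80,
                colour: str = "black") -> list[str]:
    """Build an ffmpeg argument list that adds a colour bar below the frame."""
    if bar_height <= 0 or bar_height % 2:
        # Common encoder settings (H.264 with yuv420p) need even dimensions.
        raise ValueError("bar_height must be a positive even number")
    # pad=width:height:x:y:color
    # Keep the input width (iw), grow the input height (ih) by the bar,
    # and place the original frame at the top-left corner (0, 0).
    video_filter = f"pad=iw:ih+{bar_height}:0:0:{colour}"
    # The video is re-encoded because the frame size changes;
    # the audio stream is copied unchanged.
    return ["ffmpeg", "-i", src, "-vf", video_filter, "-c:a", "copy", dst]


if __name__ == "__main__":
    # Hypothetical file names; print the command rather than running it.
    print(pad_command("holiday.mp4", "holiday_padded.mp4"))
```

To run the command from Python, the list can be passed to `subprocess.run(..., check=True)`. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.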
