BACKGROUND
Closed captions and subtitles allow users to display text on a display of a device to provide additional or interpretive information. "Closed" indicates that captions are not visible until activated by a user, whereas "open" captions are visible to all viewers. Accordingly, closed captions may allow a user to display textual transcriptions of an audio portion of content or textual descriptions of non-speech elements of the content to a user. Ideally, closed captions and subtitles are in synchronization with audio/visual content. However, there may be lag between closed captions and audio and/or video content (e.g., by several seconds) due to, for example, technical delays associated with manual or live transcriptions. Improvements are needed for synchronization of closed captioning systems to improve viewing experience.
SUMMARY
Methods and systems are disclosed for improved alignment between closed captioned text and audio output (e.g., audio from a content creator, content provider, video player, etc.). Content including video, audio, and closed caption text may be received and, based on a portion of the audio or a portion of the video, text associated with the portion of the audio or the portion of the video may be determined. The determined text may be compared to a portion of the closed caption text and, based on the comparison, a delay may be determined. The audio or video of the content may be buffered based on the determined delay. If the closed caption text is ahead, the closed caption text stream may be buffered. For example, encoded audio may be removed from an audiovisual stream, decoded, converted to text, and then compared to a closed captioned stream. Based on the comparison, the closed captioned stream may be realigned with the audiovisual stream.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate examples and, together with the description, serve to explain the principles of the methods and systems:
FIG. 1 shows an example environment;
FIG. 2 shows an example encoder;
FIG. 3 shows an example user device;
FIG. 4 shows an example method;
FIG. 5 shows an example method;
FIG. 6 shows an example method; and
FIG. 7 shows an example computing device.
DETAILED DESCRIPTION
Closed captions provide a text version of what takes place on a screen. For example, closed captions can provide a text version of dialogue, sound effects, or music to a viewer with a hearing impairment. However, the viewer may have difficulty understanding a program if associated closed captions do not line up properly with events or dialogue taking place on screen.
For live programming, audio including spoken words, soundtrack, sound effects, etc. may be transcribed by a human operator. For example, a speech-to-text reporter may use a stenotype (e.g., shorthand) or stenomask (e.g., voice writing) machine to convert audio into text so it may be displayed on a screen. As another example, voice recognition software may be used to convert audio into text. Due to processing times associated with existing approaches, closed captions of live broadcasts (e.g., news bulletins, sports events, live entertainment shows, etc.) often lag by several seconds. For prerecorded programs, unlike live programs, audio may be transcribed and closed captions may be prepared, positioned, and timed in advance.
In National Television Standards Committee (NTSC) programming, closed captions may be encoded into a part of the television (TV) picture that sits just above the visible portion and is usually unseen (e.g., line 21 of the vertical blanking interval). In Advanced Television Systems Committee (ATSC) programming, three streams may be encoded in the video. For example, two streams may be backward compatible “line 21” captions and a third stream may be a set of up to 63 additional caption streams (e.g., encoded in EIA-708 format).
FIG. 1 shows an example environment in which the systems and methods described herein may be implemented. Such an environment may comprise a content database 102, an encoder/packager 112, and at least one device 116 (e.g., a player). The content database 102, the encoder/packager 112, and the at least one device 116 may be in communication via a network 114. The content database 102, the encoder/packager 112, or the at least one device 116 may be associated with an individual or entity seeking to align content with closed captions or subtitles.
The encoder/packager 112 may implement a number of the functions and techniques described herein. For example, the encoder/packager 112 may receive content 104 from the content database 102. The content 104 may comprise, for example, audio 106, video 108, and/or closed captions 110. Audio 106 or video 108 may refer generally to any audio or video content produced for viewer consumption regardless of the type, format, genre, or delivery method. Audio 106 or video 108 may comprise audio or video content produced for broadcast via over-the-air radio, cable, satellite, or the internet. Audio 106 or video 108 may comprise digital audio or video content produced for digital video or audio streaming (e.g., video- or audio-on-demand). Audio 106 or video 108 may comprise a movie, a television show or program, an episodic or serial television series, or a documentary series, such as a nature documentary series. As yet another example, video 108 may comprise a regularly scheduled video program series, such as a nightly news program. The content 104 may be associated with one or more content distributors that distribute the content 104 to viewers for consumption.
The content 104 may comprise text data associated with content, such as closed captions 110. The closed captions 110 may indicate textual information associated with the content 104. For example, the closed captions 110 may comprise text associated with spoken dialogue, sound effects, music, etc. Content 104 may be associated with one or more genres, including sports, news, music or concert, documentary, or movie. For example, if the content is associated with the genre "sports," this may indicate that the content is a sports game, such as a livestream of a sports game.
The closed captions 110 may indicate speech associated with the content 104. For example, the closed captions 110 may indicate which speech is associated with which portions of the content 104. Subtitles may be part of a content track included in the closed captions 110. The presence or absence of dialogue may be detected through subtitling, for example, using Supplemental Enhancement Information (SEI) messages in the video elementary stream. If the subtitles for content are part of a separate track, the absence of dialogue may be detected, for example, by detecting an "empty segment."
The content 104 (e.g., video 108) may indicate movement associated with the content. For example, the video 108 may indicate which specific movements may be associated with portions of content. The movement associated with the content 104 may be based on the encoding parameters of the content 104. The movement associated with the content 104 may comprise camera movement, where the entire scene moves. For example, if the content is a soccer game, camera movement may involve a camera panning over the soccer field. The movement associated with the content 104 may additionally, or alternatively, comprise movement of objects in the content. For example, if the content is a soccer game, object movement may involve the soccer ball being kicked.
AI or machine learning may be used (e.g., by the encoder/packager 112 or user devices 116) to align and sync audio 106 and/or video 108 with closed captions 110. For example, the encoder/packager 112 or user devices 116 may implement a software algorithm that listens to audio 106 and/or processes video 108 to determine when words being spoken in content 104 match those of closed captions 110.
Audio-to-text translation may be used to find accompanying text in closed captions 110 (e.g., transcribed conversations, subtitles, descriptive text, etc.) to serve as a point in the audiovisual stream (e.g., a first marker in time) and closed caption stream (e.g., a second marker in time) to establish a sync. Audio-to-text translation may also be used to find accompanying text to audio content that describes aspects of the video that are purely visual (e.g., descriptive audio, audio description, and/or video description). For example, an audiovisual presentation device (e.g., user devices 116) may be equipped with sufficient storage, e.g., dynamic random-access memory (DRAM), hard disk drive (HDD), embedded multimedia card (eMMC), to buffer an incoming audiovisual stream (e.g., content 104) for several seconds. The buffered content may be concurrently demultiplexed and the audio 106, video 108, and closed caption 110 components extracted. For example, the audio 106 may be decoded by a digital signal processor (DSP) or a central processing unit (CPU) (e.g., associated with the encoder/packager 112 or user devices 116) and further processed by algorithms that convert the audio 106 to text.
The closed captions 110 may be decoded by the CPU, and an algorithm in the CPU may compare the closed caption text to an audio-to-text translation, e.g., looking for a match in words. If the words do not match, the CPU may hold onto the closed caption text and continue to process and compare the audio-to-text translation until it finds a match. Moreover, one or more markers in time may be used by the CPU as a reference to compare the closed caption text and the audio-to-text translation. For example, one or more first markers in time may be associated with the closed caption text and one or more second markers in time may be associated with the audio-to-text translation. The one or more first markers in time and the one or more second markers in time may correspond to a time in the playback of the content 104 (e.g., when a delay is to occur).
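By way of illustration only, the following is a minimal Python sketch of the hold-and-compare behavior described above: a piece of closed caption text is held and checked against successive audio-to-text segments until enough words match. The segment structure, the text normalization, and the 80% word-overlap threshold are assumptions made for this sketch and are not part of the disclosure.

```python
# Hold closed caption text and scan audio-to-text segments until a word match is found.

def normalize(text):
    """Lowercase and strip punctuation so the word comparison tolerates formatting."""
    return [w.strip(".,!?\"'").lower() for w in text.split() if w.strip(".,!?\"'")]

def words_match(caption_text, transcript_text, min_overlap=0.8):
    """Return True if enough caption words appear in the audio-to-text output."""
    caption_words = normalize(caption_text)
    transcript_words = set(normalize(transcript_text))
    if not caption_words:
        return False
    hits = sum(1 for w in caption_words if w in transcript_words)
    return hits / len(caption_words) >= min_overlap

def find_matching_transcript(caption_text, transcript_segments):
    """Hold the caption text and compare it to successive transcript segments."""
    for segment in transcript_segments:
        if words_match(caption_text, segment["text"]):
            return segment  # carries its own marker in time
    return None  # no match yet; keep holding the caption and keep transcribing

# Example: the caption matches the second transcribed segment.
transcripts = [
    {"time": 12.0, "text": "welcome back to the broadcast"},
    {"time": 14.5, "text": "the home team takes the lead"},
]
print(find_matching_transcript("The home team takes the lead!", transcripts))
```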
If the audiovisual stream and closed captions are in sync, the CPU may determine there is no need to add any delay to the audiovisual stream, and the content 104 may be sent to a video render engine and/or audio engine for output, e.g., over High-Definition Multimedia Interface (HDMI). According to some examples, if the content 104 and closed captions 110 are not in sync, then the CPU may determine a delay (e.g., in milliseconds) that needs to be applied to audio 106 and/or video 108 to align the content 104 with the closed captions 110. For example, the CPU may calculate the delay by comparing one or more first markers in time associated with the closed caption text to one or more second markers in time associated with the audio-to-text translation. Moreover, the one or more markers in time may be used to identify a point in time associated with the content at which a delay may occur.
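The delay calculation itself can be illustrated with a short, hedged sketch: the difference between a first marker in time from the closed caption text and a second marker in time from the audio-to-text translation gives the offset to apply. The sign convention (a positive value meaning the captions trail the audio) and the rounding to milliseconds are assumptions for this example only.

```python
# Compute the delay (ms) from the two kinds of markers in time.

def compute_delay_ms(caption_marker_s, audio_text_marker_s):
    """Delay in milliseconds to apply to the audio/video so captions line up.

    A positive result means the caption for a phrase is presented later than
    the phrase is spoken, so the audiovisual stream should be buffered by that
    amount; zero means no delay is needed.
    """
    offset_s = caption_marker_s - audio_text_marker_s
    return max(0, round(offset_s * 1000))

# Example: the caption marker trails the audio-to-text marker by 2.4 seconds.
print(compute_delay_ms(caption_marker_s=16.9, audio_text_marker_s=14.5))  # 2400
```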
A synchronization component of the CPU may synchronize a timer to one or multiple synchronization markers (e.g., markers in time) of the content, the closed caption text, and/or the audio-to-text translation. In some examples, the markers in time may be determined by the synchronization component processing the audiovisual stream and the closed captions.
AI, machine learning, or other video processing techniques may be used to process video 108, e.g., where scenes, movements, or other visual features align with closed captions 110. For example, closed captions 110 may textually describe video associated with a scene (e.g., a car stopping or a door opening), and the CPU may process video 108 to identify a scene where a car is stopping or a door opens. Upon identifying the scene matching closed captions 110, the scene may be flagged (e.g., by recording a time, location, stamp, etc.) and a delay may be determined by comparing the flagged scene with the matching captions or subtitles. The delay may then be applied to audio 106 and/or video 108 to align the content 104 with the closed captions 110.
Artificial Intelligence (AI), machine learning, or other audio processing techniques may be used to process audio 106, e.g., where sounds, sound effects, or other audio features align with closed captions 110. A machine learning model may comprise Deep Speech, a Deep Learning Network (DLN), a recurrent neural network (RNN), or any other suitable learning algorithm. For example, closed captions 110 may textually describe audio associated with a scene (e.g., a car's brakes squealing or a door creaking as it opens), and the CPU may process audio 106 to identify a scene including a squealing or creaking sound. Upon identifying the scene matching the closed captions 110, the scene may be flagged (e.g., by recording a time, location, stamp, etc.) and a delay may be determined by comparing the flagged scene with the matching captions or subtitles. The delay may then be applied to audio 106 and/or video 108 to align the content 104 with the closed captions 110.
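As a non-limiting illustration, the flagging step might look like the sketch below: sound labels produced from the decoded audio are matched against sound-descriptive caption text, and the first matching chunk is flagged with its marker in time. The DummyAudioTagger is a stand-in for whatever model is actually used (e.g., a Deep Speech-style or RNN-based recognizer); its keyword-style output is an assumption made purely for illustration.

```python
# Flag a scene by matching caption sound descriptions against labeled audio chunks.

class DummyAudioTagger:
    """Stand-in for an ML model that labels sounds in chunks of decoded audio."""
    def __init__(self, labelled_chunks):
        # labelled_chunks: list of (start_time_s, set of sound labels)
        self.labelled_chunks = labelled_chunks

    def labels_at(self, index):
        return self.labelled_chunks[index]

def flag_scene(tagger, num_chunks, caption_text):
    """Return the first audio chunk whose sound labels appear in the caption text."""
    caption = caption_text.lower()
    for i in range(num_chunks):
        start_time_s, labels = tagger.labels_at(i)
        if any(label in caption for label in labels):
            return {"marker_s": start_time_s, "labels": labels}  # flagged scene
    return None

tagger = DummyAudioTagger([(10.0, {"music"}), (12.5, {"squeal", "brakes"})])
print(flag_scene(tagger, 2, "[Brakes squeal as the car stops]"))  # marker_s 12.5
```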
The encoder/packager 112 may use the content 104 and the associated content data to determine portions of content that are candidates for inserting a delay. For example, a scene change in the content 104 may be indicative of a start of a new portion of the content 104. A scene in the content 104 may be a single camera shot of an event. A scene change may occur when a viewer perspective is switched to a different camera shot. In order to avoid any negative reaction from a viewer, a delay may be associated with a scene change. For example, the delay may be inserted at a scene change, immediately before, or immediately after.
It may be desirable to adjust the output speed of content during certain portions of the content when an output speed change is not as detectable as a delay to viewers of the content. An output speed change may be less detectable to viewers, for example, during a portion of the content that contains less motion or less speech. For example, a scenery shot without any dialogue may be a good candidate for an output speed change. Accordingly, the encoder/packager 112 may use the content 104 to determine portions of content that do not contain large amounts of motion or speech. Different genres may be able to be sped up or slowed down at different rates or for different periods of time without being easily detectable to viewers of the content. For example, for sports content, output may be sped up or slowed down for only 5 seconds or during a timeout with less motion. For news content, output may be sped up or slowed down for only 10 seconds or during transitions between stories. For concert or music content, output may be sped up or slowed down at any time but only for 2 seconds.
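By way of illustration only, such genre-dependent limits might be expressed as a simple lookup, using the example figures from the paragraph above. The rule structure, field names, and the idea of a fixed table are assumptions for this sketch rather than requirements of the disclosure.

```python
# Genre-dependent limits on how long an output speed change may last.

SPEED_CHANGE_RULES = {
    "sports":  {"max_duration_s": 5,  "when": "timeouts or low-motion stretches"},
    "news":    {"max_duration_s": 10, "when": "transitions between stories"},
    "concert": {"max_duration_s": 2,  "when": "any time"},
}

def allowed_speed_change(genre, candidate_duration_s):
    """Return True if a speed change of this length is unlikely to be noticeable."""
    rule = SPEED_CHANGE_RULES.get(genre)
    return rule is not None and candidate_duration_s <= rule["max_duration_s"]

print(allowed_speed_change("sports", 4))   # True
print(allowed_speed_change("concert", 4))  # False
```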
The content database 102 may provide the content and the content data to the encoder/packager 112. The content database 102 may be integrated with one or more of the encoder/packager 112 or the at least one device 116. The network 114 may comprise one or more public networks (e.g., the Internet) and/or one or more private networks. A private network may comprise a wireless local area network (WLAN), a local area network (LAN), a wide area network (WAN), a cellular network, or an intranet. The network 114 may comprise wired network(s) and/or wireless network(s).
The content database 102, the encoder/packager 112, and the at least one device 116 may each be implemented on the same or different computing devices. For example, the content database 102 may be located in a datastore of the same organization as the encoder/packager 112, or in the datastore of a different organization. Such a computing device may comprise one or more processors and memory storing instructions that, when executed by the one or more processors, cause the computing device to perform one or more of the various methods or techniques described here. The memory may comprise volatile memory (e.g., random access memory (RAM)) and/or non-volatile memory (e.g., a hard or solid-state drive). The memory may comprise a non-transitory computer-readable medium.
FIG. 2 shows an exemplary encoding environment 200. The encoding environment 200 may comprise source content 202, an encoder/packager 204, and an encoded bitstream 206. The source content 202 may comprise content, such as the content 104 (e.g., including audio 106, video 108, and/or closed captions 110). The source content 202 may be input into the encoder/packager 204. For example, the encoder/packager 204 may be the encoder/packager 112 of FIG. 1. The encoder/packager 204 may generate the encoded bitstream 206 associated with the source content 202. For example, the encoded bitstream 206 may comprise one or more of a closed caption bitstream 206a, a video bitstream 206b, or an audio bitstream 206c. If the encoded bitstream 206 comprises the closed caption bitstream 206a, the closed caption bitstream 206a may comprise textual data associated with the source content 202. If the encoded bitstream 206 comprises the video bitstream 206b, the video bitstream 206b may indicate video data associated with the source content 202. If the encoded bitstream 206 comprises the audio bitstream 206c, the audio bitstream 206c may indicate audio data associated with the source content 202.
The encoded bitstream 206 may comprise at least one indication of portions of content that are good candidates for delay or output speed change, such as portions 208a-c. As discussed above, it may be desirable to insert a delay and/or adjust the output speed of content during certain portions of the content when an output speed change is less detectable to viewers of the content. Accordingly, the portions 208a-c may be portions of content during which a delay or an output speed change may not be easily detectable by viewers of the content. The encoded bitstream 206 may indicate at least one of a start time (e.g., a marker in time) associated with each of these portions of content or a duration of each of these portions of content. For example, the encoded bitstream 206 may indicate that the portion 208a has a start time t1 and a duration d1, the portion 208b has a start time t2 and a duration d2, and the portion 208c has a duration d3. The durations of the different portions may be different or may be the same. The encoded bitstream 206 may comprise an indication of a rate of output speed change associated with each portion of content that is a good candidate for output speed change, such as the portions 208a-c. The rate of output speed change associated with a particular portion of content may indicate how much output of content may be sped up or slowed down, or both sped up and slowed down, during that portion of content without being easily detectable to viewers of the content. For example, output of content may be either sped up or slowed down during a portion of content in a scenery view that contains no dialogue. The encoded bitstream 206 may be used by a device, such as the at least one device 116 of FIG. 1, to output the content associated with the source content 202 and to adjust the output speed of the content during the portions 208a-c.
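As a non-limiting illustration, the per-portion indications carried with the encoded bitstream (start time, duration, and permissible rate of output speed change) might be represented as in the sketch below. The field names, the numeric values, and the lookup helper are assumptions for illustration only.

```python
# Candidate portions for delay or output speed change, carried with the bitstream.

from dataclasses import dataclass

@dataclass
class SpeedChangePortion:
    start_time_s: float     # t1, t2, ... (marker in time)
    duration_s: float       # d1, d2, ...
    max_rate_change: float  # e.g., 0.05 = output may run up to 5% faster or slower

portions = [
    SpeedChangePortion(start_time_s=12.0, duration_s=5.0, max_rate_change=0.05),
    SpeedChangePortion(start_time_s=47.5, duration_s=10.0, max_rate_change=0.10),
    SpeedChangePortion(start_time_s=90.0, duration_s=2.0, max_rate_change=0.02),
]

def portion_at(playback_time_s):
    """Return the candidate portion covering a playback time, if any."""
    for p in portions:
        if p.start_time_s <= playback_time_s < p.start_time_s + p.duration_s:
            return p
    return None

print(portion_at(49.0))  # the second portion, which allows a 10% rate change
```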
FIG. 3 shows an example user device 300 in which the systems and methods described herein may be implemented. The user device 300 may comprise a buffer 302 (e.g., DRAM, FLASH, HDD, etc.), a decoder 304, a digital signal processor (DSP) 306, a central processing unit (CPU) 308, and a graphics processing unit (GPU) 310. The device 300 may provide playback of an audiovisual stream 316 by a video rendering engine 312 and/or a high-definition multimedia interface (HDMI) 314. Moreover, the device 300 may be associated with an individual or entity seeking to align content with closed captions or subtitles.
The device 300 may receive an audiovisual stream 316. The audiovisual stream 316 may comprise encoded video 318, encoded audio 320, and closed captions 322. The encoded video 318 may be decoded by the decoder 304, resulting in decoded video 324. The decoded video 324 may be provided by the decoder 304 to the GPU 310. The encoded audio 320 may be decoded by the DSP 306, resulting in decoded audio 326.
The CPU 308 may be configured to receive the decoded audio 326. The CPU 308 may be configured to perform an audio-to-text conversion of the decoded audio 326. The CPU 308 may be configured to compare the converted text to the closed captions 322. One or more markers in time may be used by the CPU 308 as a reference to compare the closed captions 322 and the audio-to-text conversion of the decoded audio 326. For example, one or more first markers in time may be associated with the closed captions 322 and one or more second markers in time may be associated with the audio-to-text conversion of the decoded audio 326. The one or more first markers in time may synchronize to a time of the decoded audio 326 and the one or more second markers in time may synchronize to a time of the audio-to-text translation, so as to correspond to a time in the playback of the audiovisual stream 316 (e.g., when a buffering delay is to occur).
Based on the comparison, the CPU 308 may be configured to determine a buffering delay to synchronize the closed captioning content 322 with the audiovisual stream 316. For example, the buffering delay may compensate for an offset between the timing of the decoded audio 326 and the closed captions 322. For example, the CPU 308 may calculate the buffering delay by comparing one or more first markers in time associated with the closed captions 322 to one or more second markers in time associated with the decoded audio 326. Moreover, the one or more markers in time may be used to identify a point in time associated with the audiovisual stream 316 at which a delay may occur.
A synchronization component of the CPU 308 may synchronize a timer to one or multiple synchronization markers in time of the audiovisual stream 316, the closed captions 322, or the decoded audio 326. In some examples, the markers in time may be determined by the synchronization component processing the decoded audio 326 and the closed captions 322.
The CPU 308 may be configured to provide the determined delay to the buffer 302. The buffer 302 may be configured to insert the determined delay into the audiovisual stream to synchronize audio and visual components of the audiovisual stream (e.g., encoded video 318 and encoded audio 320) with closed caption content (e.g., closed captions 322). For example, the buffer 302 may synchronize the audio and visual components of the audiovisual stream 316 with closed caption text by buffering one or more of the audio (e.g., encoded audio 320), the video (e.g., encoded video 318), and the closed caption text (e.g., closed captions 322).
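By way of illustration only, a buffer that holds individual components by a configurable delay might be sketched as follows. The per-component queues, the arrival-time bookkeeping, and the 2.4-second example delay are assumptions and not a prescribed implementation of buffer 302.

```python
# Apply a per-component delay by holding packets until they have aged enough.

import collections

class SyncBuffer:
    def __init__(self):
        self.queues = {"video": collections.deque(),
                       "audio": collections.deque(),
                       "captions": collections.deque()}
        self.delays_s = {"video": 0.0, "audio": 0.0, "captions": 0.0}

    def set_delay(self, component, delay_s):
        """Delay provided by the CPU after comparing markers in time."""
        self.delays_s[component] = delay_s

    def push(self, component, arrival_s, packet):
        self.queues[component].append((arrival_s, packet))

    def pop_ready(self, component, now_s):
        """Release packets that have been held for at least the configured delay."""
        out = []
        q = self.queues[component]
        while q and now_s - q[0][0] >= self.delays_s[component]:
            out.append(q.popleft()[1])
        return out

# Example: hold the audio and video by 2.4 s so the undelayed captions line up.
buf = SyncBuffer()
buf.set_delay("video", 2.4)
buf.set_delay("audio", 2.4)
buf.push("video", arrival_s=10.0, packet=b"\x00")
print(buf.pop_ready("video", now_s=11.0))  # [] -- still held
print(buf.pop_ready("video", now_s=12.5))  # [b'\x00'] -- released after 2.4 s
```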
FIG. 4 shows an exemplary method 400. The method 400 may be used to align closed captions with audiovisual content, such as the content 104 associated with FIG. 1. The method 400 may be performed, for example, by the system 100 of FIG. 1 or the device 300 of FIG. 3. Content may be received. Content may comprise a content asset or program, such as linear content, and may further comprise sequential content such as, for example, a television show, a movie, a sports event broadcast, or the like. Moreover, the content may comprise livestreaming video content or other types of content. As used herein, content may additionally include a portion of a program or content asset.
At step 402, at least one content may be received. For example, the at least one content may be received by an encoder, such as the encoder/packager 112 or the encoder/packager 204. The at least one content may comprise video content (e.g., video 206b or encoded video 318), audio content (e.g., audio 206c or encoded audio 320), and closed captioning (e.g., closed captions 206a or closed captions 322). The at least one content may comprise livestreaming content (e.g., source content 202 or audiovisual stream 316) and, for example, the at least one content may comprise a livestreaming concert, sports program, news program, documentary, or movie. One or more markers in time may be associated with the video content (e.g., video 206b or encoded video 318), audio content (e.g., audio 206c or encoded audio 320), and closed captioning (e.g., closed captions 206a or closed captions 322).
At step 404, the audiovisual content may be buffered (e.g., based on a computed delay from step 430) and, at step 406, the buffered content from step 404 (e.g., carrying multiple encoded data streams) may enter a demultiplexer (demux) 406. This demux 406 may serve as a switch, e.g., by selecting which video and which audio data stream in a multiplexed transport stream to pass on. For example, the demux 406 may pass on an audio stream, a video stream, and/or a closed caption stream.
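As a non-limiting illustration, the selection performed by the demux might look like the sketch below: the multiplexed stream is split and only the selected elementary streams are passed on. The packet dictionary layout and the stream labels ("video", "audio_eng", "cc") are assumptions made for this sketch.

```python
# Split a multiplexed stream and pass on only the selected elementary streams.

def demux(transport_packets, selected):
    """Return the payloads of the selected streams, keyed by stream name."""
    outputs = {name: [] for name in selected}
    for packet in transport_packets:
        name = packet.get("stream")  # e.g., "video", "audio_eng", "cc"
        if name in outputs:
            outputs[name].append(packet["payload"])
    return outputs

streams = demux(
    [{"stream": "video", "payload": b"\x00"}, {"stream": "cc", "payload": b"\x01"}],
    selected=("video", "audio_eng", "cc"),
)
print(streams)  # {'video': [b'\x00'], 'audio_eng': [], 'cc': [b'\x01']}
```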
At step 408, it may be determined whether the content comprises closed captions. If the content does not comprise closed captions, it may be determined to play the audiovisual stream in real-time at step 436. If the content does comprise closed captions, the closed captions may be decoded at step 410 and the audio may be decoded at step 412. Voice or audio-to-text conversion may be performed at step 414.
At step 416, the decoded captions from step 410 may be compared to the converted text from step 414 and, at step 418, it may be determined, based on the comparison at step 416, whether the closed captions (e.g., from step 410) match the converted text (e.g., from step 414). If the closed captions do match the converted text, it may be determined to play the audiovisual stream in real-time at step 436. If the closed captions do not match the converted text, the closed captions may be held at step 420 (e.g., for a constant or variable time period) and the audio may be decoded at step 422. Voice or audio-to-text conversion may be performed at step 424.
At step 426, the held captions from step 420 may be compared to the converted text from step 424. At step 428, it may be determined, based on the comparison at step 426, whether the held closed captions (e.g., from step 420) match the converted text (e.g., from step 424). If the held closed captions do not match the converted text, the process may repeat itself by once again holding the closed captions at step 420, decoding the audio at step 422, performing voice or audio-to-text conversion at step 424, and, at step 426, comparing the held captions from step 420 with the converted text from step 424. This process may be repeated iteratively until, at step 428, it is determined (e.g., based on the comparison at step 426) that the held closed captions (e.g., from step 420) match the converted text (e.g., from step 424).
Once a match has been determined at step 428, a delay may be computed at step 430. The delay may be a time offset of the closed captions to the decoded audio. For example, the delay may be computed based on a length of time and/or a number of times that the closed captions are held at step 420. Moreover, the delay may be computed based on a comparison of markers in time associated with the closed captions and the decoded audio. At step 432, the audiovisual stream may be buffered and played with the computed delay and, at step 434, the captions may be played without delay.
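By way of illustration only, the iterative hold-and-compare flow of steps 416-434 might be sketched as below: the caption text is held while successive converted audio segments are checked, and the delay at step 430 is derived from how long the captions were held. The fixed segment length and the simple word-overlap test are assumptions for this sketch, not part of the disclosure.

```python
# Derive the delay (step 430) from the number of times the captions were held.

SEGMENT_S = 0.5  # assumed duration of each decoded-and-converted audio segment

def matches(caption, transcript):
    """Loose word-overlap test between caption text and converted audio text."""
    cap = set(caption.lower().split())
    return bool(cap) and len(cap & set(transcript.lower().split())) / len(cap) >= 0.8

def compute_delay(caption_text, audio_segments):
    """Return the delay (seconds) used to buffer and play the A/V stream (step 432)
    while the captions are played without delay (step 434)."""
    holds = 0
    for segment_text in audio_segments:
        if matches(caption_text, segment_text):  # step 428: match found
            break
        holds += 1                               # steps 420-426: hold and retry
    return holds * SEGMENT_S                     # step 430: computed delay

print(compute_delay("goal scored by number ten",
                    ["crowd noise", "and the pass goes wide",
                     "goal scored by number ten"]))  # 1.0
```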
At step 438, the audiovisual stream from step 436 and/or the captions from step 434 and the buffered audiovisual stream from step 432 may be stored to disk. If the user has turned on/enabled closed captions, then at step 440, the video and captions may be rendered and, at step 442, the process may end (e.g., terminate).
FIG. 5 shows an exemplary method 500. The method 500 may be used for aligning closed captions with audiovisual content, such as the content 104 associated with FIG. 1. The method 500 may be performed, for example, by one or more components of the system 100 of FIG. 1 or the device 300 of FIG. 3.
At step 502, content including video, audio, and/or closed captions may be received. For example, livestreaming video content may be received, although content may refer generally to any audio or video content produced for viewer consumption regardless of the type, format, genre, or delivery method. The content may be associated with one or more content distributors that distribute the content to viewers for consumption. The closed captions may indicate textual information associated with the content. For example, the closed captions may comprise text associated with spoken dialogue, sound effects, music, etc. The content may be associated with one or more genres, including sports, news, music or concert, documentary, or movie.
The closed captions may indicate speech associated with the content. For example, the closed captions may indicate which speech is associated with which portions of the content. Subtitles may be part of a content track included in the closed captions. The presence or absence of dialogue may be detected through subtitling, for example, using SEI messages in the video elementary stream. If the subtitles for content are part of a separate track, the absence of dialogue may be detected, for example, by detecting an "empty segment."
At step 504, text associated with at least a portion of the audio or a portion of the video may be determined based on the at least the portion of the audio or the portion of the video. An audio-to-text conversion (e.g., transcribed conversations, subtitles, descriptive text, etc.) may be performed on a portion of the audio, or a visual analysis may be performed to describe a portion of the video with text. For example, a software algorithm may listen to audio and/or process video associated with the content to determine words being spoken in the content. As another example, a software algorithm may identify descriptive audio (e.g., additional audio content that describes aspects of the video that are purely visual) and may convert the descriptive audio of the content to text.
An audiovisual presentation device may be equipped with sufficient storage (e.g., DRAM, HDD, eMMC) to buffer an incoming audiovisual stream for several seconds. The buffered content may be concurrently demultiplexed and the audio, video, and/or closed caption components extracted. For example, the audio or video may be decoded by a digital signal processor (DSP) or a central processing unit (CPU) and further processed by algorithms that convert the audio or video to text. AI, machine learning, or other video or audio processing techniques may be used to process video or audio associated with the content, e.g., where scenes, movements, sounds, sound effects, or other visual/audio features align with closed captions.
At step 504, a first time marker associated with the closed caption text may be determined based on a timeline associated with the content. For example, closed captions at a first time of a timeline associated with the content may textually describe video associated with a scene (e.g., a car stopping or a door opening).
At step 506, a second time marker associated with the determined text (e.g., from step 502) may be determined based on the timeline associated with the content and a comparison of the determined text to at least a portion of the closed caption text. For example, the video associated with the content may be processed to identify a scene where a car is stopping or a door opens. Upon identifying the scene matching closed captions at a second time of the timeline associated with the content, the scene may be flagged (e.g., by recording a time, location, stamp, etc.).
At step 508, a delay may be determined based on a comparison of the first time marker and the second time marker. For example, the delay may be determined by comparing the second time marker associated with the flagged scene to the first time marker associated with the matching captions or subtitles.
The closed captions may be decoded and the closed caption text may be compared to an audio-to-text translation, e.g., looking for a match in words. If the words do not match, the closed caption text may be held and the audio-to-text translation may be iteratively processed and compared to the held closed caption text until a match is identified. If the audiovisual stream and closed captions are in sync, it may be determined that there is no need to add any delay to the audiovisual stream and the content may be sent to a video render engine and/or audio engine for output (e.g., over HDMI). If the content and closed captions are not in sync, then a delay may be determined (e.g., in milliseconds) that needs to be applied to the audio and/or video to align the content with the closed captions.
At step 510, at least one of the audio, the video, or the closed captions of the content may be buffered based on the determined delay. For example, the determined delay may be applied to the audio and/or the video associated with the content to align the content with the closed captions. As another example, the determined delay may be applied to the closed captions associated with the content to align the audio and/or the video with the closed captions. Portions of the content may be identified as candidates for inserting a delay. For example, in order to avoid any negative reaction from a viewer, a delay may be associated with a scene change. For example, the delay may be inserted at a scene change, immediately before, or immediately after.
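As a non-limiting illustration, choosing where to insert the determined delay might be sketched as below: prefer a scene-change boundary near the point at which the mismatch was detected, so the pause is less noticeable. The availability of scene-change times (e.g., from the encoder) and the 5-second search window are assumptions for this sketch only.

```python
# Prefer a nearby scene-change boundary as the point at which to insert the delay.

def pick_insertion_point(mismatch_time_s, scene_change_times_s, window_s=5.0):
    """Return the nearest scene change within the window, else the mismatch time."""
    candidates = [t for t in scene_change_times_s
                  if abs(t - mismatch_time_s) <= window_s]
    if not candidates:
        return mismatch_time_s
    return min(candidates, key=lambda t: abs(t - mismatch_time_s))

print(pick_insertion_point(31.2, [10.0, 29.8, 55.1]))  # 29.8
```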
It may be desirable to adjust the output speed of content during certain portions of the content when an output speed change is not as detectable as a delay to viewers of the content. An output speed change may be less detectable to viewers, for example, during a portion of the content that contains less motion or less speech. For example, a scenery shot without any dialogue may be a good candidate for an output speed change. Accordingly, portions of content may be identified that do not contain large amounts of motion or speech. Different genres may be able to be sped up or slowed down at different rates or for different periods of time without being easily detectable to viewers of the content. For example, for sports content, output may be sped up or slowed down for only 5 seconds or during a timeout with less motion. For news content, output may be sped up or slowed down for only 10 seconds or during transitions between stories. For concert or music content, output may be sped up or slowed down at any time but only for 2 seconds. As another example, transitions into and out of commercials may be used to align the content with the closed captions, e.g., metadata associated with the transitions may indicate timing for a local advertisement insertion and may be used to align content with the closed captions.
FIG. 6 shows an exemplary method 600. The method 600 may be used for aligning closed captions with audiovisual content, such as the content 104 associated with FIG. 1. The method 600 may be performed, for example, by one or more components of the system 100 of FIG. 1 or the device 300 of FIG. 3.
At step 602, content (e.g., including video, audio, and/or closed caption text) may be received. For example, livestreaming video content may be received, although content may refer generally to any audio or video content produced for viewer consumption regardless of the type, format, genre, or delivery method. The content may be associated with one or more content distributors that distribute the content to viewers for consumption. The closed captions may indicate textual information associated with the content. For example, the closed captions may comprise text associated with spoken dialogue, sound effects, music, etc. The content may be associated with one or more genres, including sports, news, music or concert, documentary, or movie.
The closed captions may indicate speech associated with the content. For example, the closed captions may indicate which speech is associated with which portions of the content. Subtitles may be part of a content track included in the closed captions. The presence or absence of dialogue may be detected through subtitling, for example, using SEI messages in the video elementary stream. If the subtitles for content are part of a separate track, the absence of dialogue may be detected, for example, by detecting an "empty segment."
At step 604, text may be determined based on an audio-to-text conversion (e.g., transcribed conversations, subtitles, descriptive text, etc.) of at least a portion of the audio. For example, a software algorithm may listen to audio associated with the content to determine words being spoken in the content.
An audiovisual presentation device may be equipped with sufficient storage (e.g., DRAM, HDD, eMMC) to buffer an incoming audiovisual stream for several seconds. The buffered content may be concurrently demultiplexed and the audio, video, and/or closed caption components extracted. For example, the audio may be decoded by a digital signal processor (DSP) or a central processing unit (CPU) and further processed by algorithms that convert the audio to text. AI, machine learning, or other video or audio processing techniques may be used to process audio associated with the content, e.g., where sounds, sound effects, or other audio features align with closed captions.
The closed captions may be decoded and the closed caption text may be compared to an audio-to-text translation, e.g., looking for a match in words. If the words do not match, the closed caption text may be held and the audio-to-text translation may be iteratively processed and compared to the held closed caption text until a match is identified. If the audiovisual stream and closed captions are in sync, it may be determined that there is no need to add any delay to the audiovisual stream and the content may be sent to a video render engine and/or audio engine for output (e.g., over HDMI). If the content and closed captions are not in sync, then a delay may be determined (e.g., in milliseconds) that needs to be applied to the audio to align the content with the closed captions.
At step 608, at least one of the audio, the video, or the closed captions of the content may be buffered based on the determined delay. For example, the determined delay may be applied to the audio and/or the video associated with the content to align the content with the closed captions. As another example, the determined delay may be applied to the closed captions associated with the content to align the audio and/or the video with the closed captions. Portions of the content may be identified as candidates for inserting a delay. For example, in order to avoid any negative reaction from a viewer, a delay may be associated with a scene change. For example, the delay may be inserted at a scene change, immediately before, or immediately after.
It may be desirable to adjust the output speed of content during certain portions of the content when an output speed change is not as detectable as a delay to viewers of the content. An output speed change may be less detectable to viewers, for example, during a portion of the content that contains less motion or less speech. For example, a scenery shot without any dialogue may be a good candidate for an output speed change. Accordingly, portions of content may be identified that do not contain large amounts of motion or speech. Different genres may be able to be sped up or slowed down at different rates or for different periods of time without being easily detectable to viewers of the content. For example, for sports content, output may be sped up or slowed down for only 5 seconds or during a timeout with less motion. For news content, output may be sped up or slowed down for only 10 seconds or during transitions between stories. For concert or music content, output may be sped up or slowed down at any time but only for 2 seconds.
FIG. 7 shows an example computing device that may be used in various examples. With regard to the example environment of FIG. 1, one or more of the content database 102, the encoder/packager 112, or the at least one device 116 may be implemented in an instance of a computing device 700 of FIG. 7. The computer architecture shown in FIG. 7 shows a conventional server computer, workstation, desktop computer, laptop, tablet, network appliance, PDA, e-reader, digital cellular phone, or other computing node, and may be utilized to execute any aspects of the computers described herein, such as to implement the methods described in FIGS. 4-6.
The computing device 700 may comprise a baseboard, or "motherboard," which is a printed circuit board to which a multitude of components or devices may be connected by way of a system bus or other electrical communication paths. One or more central processing units (CPUs) 704 may operate in conjunction with a chipset 706. The CPU(s) 704 may be standard programmable processors that perform arithmetic and logical operations necessary for the operation of the computing device 700.
The CPU(s)704 may perform the necessary operations by transitioning from one discrete physical state to the next through the manipulation of switching elements that differentiate between and change these states. Switching elements may generally comprise electronic circuits that maintain one of two binary states, such as flip-flops, and electronic circuits that provide an output state based on the logical combination of the states of one or more other switching elements, such as logic gates. These basic switching elements may be combined to create more complex logic circuits including registers, adders-subtractors, arithmetic logic units, floating-point units, and the like.
The CPU(s) 704 may be augmented with or replaced by other processing units, such as graphics processing units (GPUs) 705. The GPU(s) 705 may comprise processing units specialized for, but not necessarily limited to, highly parallel computations, such as graphics and other visualization-related processing.
A user interface may be provided between the CPU(s) 704 and the remainder of the components and devices on the baseboard. The interface may be used to access a random-access memory (RAM) 708 used as the main memory in the computing device 700. The interface may be used to access a computer-readable storage medium, such as a read-only memory (ROM) 720 or non-volatile RAM (NVRAM) (not shown), for storing basic routines that may help to start up the computing device 700 and to transfer information between the various components and devices. The ROM 720 or NVRAM may also store other software components necessary for the operation of the computing device 700 in accordance with the examples described herein. The user interface may be provided by one or more electrical components, such as the chipset 706.
The computing device 700 may operate in a networked environment using logical connections to remote computing nodes and computer systems through a local area network (LAN) 716. The chipset 706 may comprise functionality for providing network connectivity through a network interface controller (NIC) 722, such as a gigabit Ethernet adapter. A NIC 722 may be capable of connecting the computing device 700 to other computing nodes over a network 716. It should be appreciated that multiple NICs 722 may be present in the computing device 700, connecting the computing device to other types of networks and remote computer systems.
The computing device 700 may be connected to a storage device 728 that provides non-volatile storage for the computer. The storage device 728 may store system programs, application programs, other program modules, and data, which have been described in greater detail herein. The storage device 728 may be connected to the computing device 700 through a storage controller 724 connected to the chipset 706. The storage device 728 may consist of one or more physical storage units. A storage controller 724 may interface with the physical storage units through a serial attached SCSI (SAS) interface, a serial advanced technology attachment (SATA) interface, a fiber channel (FC) interface, or other type of interface for physically connecting and transferring data between computers and physical storage units.
The computing device 700 may store data on a storage device 728 by transforming the physical state of the physical storage units to reflect the information being stored. The specific transformation of a physical state may depend on various factors and on different implementations of this description. Examples of such factors may include, but are not limited to, the technology used to implement the physical storage units and whether the storage device 728 is characterized as primary or secondary storage and the like.
For example, the computing device 700 may store information to the storage device 728 by issuing instructions through a storage controller 724 to alter the magnetic characteristics of a particular location within a magnetic disk drive unit, the reflective or refractive characteristics of a particular location in an optical storage unit, or the electrical characteristics of a particular capacitor, transistor, or other discrete component in a solid-state storage unit. Other transformations of physical media are possible without departing from the scope and spirit of the present description, with the foregoing examples provided only to facilitate this description. The computing device 700 may read information from the storage device 728 by detecting the physical states or characteristics of one or more particular locations within the physical storage units.
In addition or alternatively to the storage device 728 described herein, the computing device 700 may have access to other computer-readable storage media to store and retrieve information, such as program modules, data structures, or other data. It should be appreciated by those skilled in the art that computer-readable storage media may be any available media that provides for the storage of non-transitory data and that may be accessed by the computing device 700.
By way of example and not limitation, computer-readable storage media may comprise volatile and non-volatile, transitory computer-readable storage media and non-transitory computer-readable storage media, and removable and non-removable media implemented in any method or technology. Computer-readable storage media includes, but is not limited to, RAM, ROM, erasable programmable ROM (“EPROM”), electrically erasable programmable ROM (“EEPROM”), flash memory or other solid-state memory technology, compact disc ROM (“CD-ROM”), digital versatile disk (“DVD”), high definition DVD (“HD-DVD”), BLU-RAY, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, other magnetic storage devices, or any other medium that may be used to store the desired information in a non-transitory fashion.
A storage device, such as the storage device 728 depicted in FIG. 7, may store an operating system utilized to control the operation of the computing device 700. The operating system may comprise a version of the LINUX operating system. The operating system may comprise a version of the WINDOWS SERVER operating system from the MICROSOFT Corporation. According to additional examples, the operating system may comprise a version of the UNIX operating system. Various mobile phone operating systems, such as IOS and ANDROID, may also be utilized. It should be appreciated that other operating systems may also be utilized. The storage device 728 may store other system or application programs and data utilized by the computing device 700.
The storage device 728 or other computer-readable storage media may also be encoded with computer-executable instructions, which, when loaded into the computing device 700, transform the computing device from a general-purpose computing system into a special-purpose computer capable of implementing the examples described herein. These computer-executable instructions transform the computing device 700 by specifying how the CPU(s) 704 transition between states, as described herein. The computing device 700 may have access to computer-readable storage media storing computer-executable instructions, which, when executed by the computing device 700, may perform the methods described in relation to FIGS. 4-6.
A computing device, such as the computing device 700 depicted in FIG. 7, may also comprise an input/output controller 732 for receiving and processing input from a number of input devices, such as a keyboard, a mouse, a touchpad, a touch screen, an electronic stylus, or other type of input device. Similarly, an input/output controller 732 may provide output to a display, such as a computer monitor, a flat-panel display, a digital projector, a printer, a plotter, or other type of output device. It will be appreciated that the computing device 700 may not comprise all of the components shown in FIG. 7, may comprise other components that are not explicitly shown in FIG. 7, or may utilize an architecture completely different than that shown in FIG. 7. In some implementations of the computing device 700, certain components, such as, for example, the network interface controller 722, the input/output controller 732, the CPU(s) 704, the GPU(s) 705, and the storage controller 724, may be implemented using a System on Chip (SoC) architecture.
As described herein, a computing device may be a physical computing device, such as the computing device 700 of FIG. 7. A computing node may also comprise a virtual machine host process and one or more virtual machine instances. Computer-executable instructions may be executed by the physical hardware of a computing device indirectly through interpretation and/or execution of instructions stored and executed in the context of a virtual machine.
Components are described that may be used to perform the described methods and systems. When combinations, subsets, interactions, groups, etc., of these components are described, it is understood that while specific references to each of the various individual and collective combinations and permutations of these may not be explicitly described, each is specifically contemplated and described herein, for all methods and systems. This applies to all aspects of this application including, but not limited to, operations in described methods. Thus, if there are a variety of additional operations that may be performed it is understood that each of these additional operations may be performed with any specific example or combination of examples of the described methods.
The present methods and systems may be understood more readily by reference to the following detailed description of examples included therein and to the Figures and their descriptions.
As will be appreciated by one skilled in the art, the methods and systems may take the form of an entirely hardware example, an entirely software example, or an example combining software and hardware aspects. Furthermore, the methods and systems may take the form of a computer program product on a computer-readable storage medium having computer-readable program instructions (e.g., computer software) embodied in the storage medium. More particularly, the present methods and systems may take the form of web-implemented computer software. Any suitable computer-readable storage medium may be utilized including hard disks, CD-ROMs, optical storage devices, or magnetic storage devices.
Examples of the methods and systems are described below with reference to block diagrams and flowchart illustrations of methods, systems, apparatuses and computer program products. It will be understood that each block of the block diagrams and flowchart illustrations, and combinations of blocks in the block diagrams and flowchart illustrations, respectively, may be implemented by computer program instructions. These computer program instructions may be loaded on a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions which execute on the computer or other programmable data processing apparatus create a means for implementing the functions specified in the flowchart block or blocks.
These computer program instructions may also be stored in a computer-readable memory that may direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including computer-readable instructions for implementing the function specified in the flowchart block or blocks. The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process such that the instructions that execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart block or blocks.
The various features and processes described herein may be used independently of one another, or may be combined in various ways. All possible combinations and sub-combinations are intended to fall within the scope of this disclosure. In addition, certain methods or process blocks may be omitted in some implementations. The methods and processes described herein are also not limited to any particular sequence, and the blocks or states relating thereto may be performed in other sequences that are appropriate. For example, described blocks or states may be performed in an order other than that specifically described, or multiple blocks or states may be combined in a single block or state. The example blocks or states may be performed in serial, in parallel, or in some other manner. Blocks or states may be added to or removed from the described examples. The example systems and components described herein may be configured differently than described. For example, elements may be added to, removed from, or rearranged compared to the described examples.
It will also be appreciated that various items are illustrated as being stored in memory or on storage while being used, and that these items or portions thereof may be transferred between memory and other storage devices for purposes of memory management and data integrity. Alternatively, in other examples, some or all of the software modules and/or systems may execute in memory on another device and communicate with the illustrated computing systems via inter-computer communication. Furthermore, in some examples, some or all of the systems and/or modules may be implemented or provided in other ways, such as at least partially in firmware and/or hardware, including, but not limited to, one or more application-specific integrated circuits (“ASICs”), standard integrated circuits, controllers (e.g., by executing appropriate instructions, and including microcontrollers and/or embedded controllers), field-programmable gate arrays (“FPGAs”), complex programmable logic devices (“CPLDs”), etc. Some or all of the modules, systems, and data structures may also be stored (e.g., as software instructions or structured data) on a computer-readable medium, such as a hard disk, a memory, a network, or a portable media article to be read by an appropriate device or via an appropriate connection. The systems, modules, and data structures may also be transmitted as determined data signals (e.g., as part of a carrier wave or other analog or digital propagated signal) on a variety of computer-readable transmission media, including wireless-based and wired/cable-based media, and may take a variety of forms (e.g., as part of a single or multiplexed analog signal, or as multiple discrete digital packets or frames). Such computer program products may also take other forms in other examples. Accordingly, the present examples may be practiced with other computer system configurations.
While the methods and systems have been described in connection with specific examples, it is not intended that the scope be limited to the particular examples set forth, as the examples herein are intended in all respects to be illustrative rather than restrictive.
Unless otherwise expressly stated, it is in no way intended that any method set forth herein be construed as requiring that its operations be performed in a specific order. Accordingly, where a method claim does not actually recite an order to be followed by its operations or it is not otherwise specifically stated in the claims or descriptions that the operations are to be limited to a specific order, it is in no way intended that an order be inferred, in any respect. This holds for any possible non-express basis for interpretation, including: matters of logic with respect to arrangement of steps or operational flow; plain meaning derived from grammatical organization or punctuation; and the number or type of examples described in the specification.
It will be apparent to those skilled in the art that various modifications and variations may be made without departing from the scope or spirit of the present disclosure. Other embodiments will be apparent to those skilled in the art from consideration of the specification and practices described herein. It is intended that the specification and example figures be considered as exemplary only, with a true scope and spirit being indicated by the following claims.