
Cross-lingual media editing

Info

Publication number
WO2025101169A1
Authority
WO
WIPO (PCT)
Prior art keywords
language
content
transcript
edits
item
Prior art date
Legal status
Pending
Application number
PCT/US2023/036870
Other languages
French (fr)
Inventor
Feifan CHEN
Vicky ZAYATS
Noah Benjamin MURAD
Melissa Lianna FERRARI
Dirk Ryan PADFIELD
Daniel David WALKER
Current Assignee
Google LLC
Original Assignee
Google LLC
Priority date
Filing date
Publication date
Application filed by Google LLC
Priority to PCT/US2023/036870
Publication of WO2025101169A1

Abstract

A media editing system (100) utilizes translated text (118) that is edited in one language (406) to be analyzed by a trained model (122, 410) using contextual information (408) to provide corresponding edits in a second language (412). The second language may be the original language of a piece of multimedia content, or may be a different language. The model output is used to modify original video that tracks changes in the original language (126, 504). Thus, a user can edit multimedia content in a different language with fidelity to the original language and have the video correctly sync with the edited original language. The output video can include system-generated content. The model may not produce an intermediate textual representation of the edited video in its original language, rather directly generating the edited video.

Description

CROSS-LINGUAL MEDIA EDITING
BACKGROUND
[0001] Multimedia content including audio and video is available in many languages today. It is possible to automatically caption videos in different languages, including languages other than the original language(s). It is also possible to edit a video in different languages. However, it can be very difficult to edit a video in a language that one does not understand. It is also very challenging to know the correct places to cut the video so as to preserve phrases and sentences, and to properly preserve different parts in order for the edited video to make sense after editing. Simply translating a transcript into a different language, editing it, and then translating the edited document back to the original language can introduce different types of linguistic errors associated with the audio content, and timing errors associated with the video itself. This can create distortions from the original media content, rendering the new version in the different language unsuitable for its intended use.
BRIEF SUMMARY
[0002] The technology relates to multimedia systems that enable editing and creation of content in a language other than the original (source) language. For instance, editing may be performed on a transcript that has been translated from the original language to a different language. The edited transcript is applied to a language model, e.g., a large language model (LLM), which uses the edited document and contextual information to suggest or generate appropriate edits in the original language that can match the semantics, intent and tone of the edits in the other language. This process is able to determine how the edits fit into the original language, given that it may have very different idioms, grammar and/or morphology than the other language. Once any edits have been finalized in the original language, the system is able to align these changes with the source recording and the source recording is edited to match. In addition, the video with the original language may be cut down or otherwise modified to incorporate the changes.
[0003] According to one aspect of the technology, a method is provided that comprises: receiving, by one or more processors of a computing system, edits in a first language to an item of content translated from a second language, the item of content including audio in the second language and at least one of imagery or video content; applying, by the one or more processors, the edits in the first language and contextual information of the item of content to a trained model to generate corresponding edits in a selected language to match at least one of semantics, intent or tone of the edits in the first language; creating, by the one or more processors, revised audio in the selected language for the item of content according to the corresponding edits; and generating, by the one or more processors, a revised version of the item of content having the revised audio in the selected language.
[0004] The revised version of the item of content may include one or more changes to sync the revised audio with the imagery or video content. For instance, the changes to sync the revised audio with the imagery or video content may include modifications to the imagery or video content. The selected language may be the second language. Alternatively or additionally, the second language may be a source language of the item of content.
[0005] Alternatively or additionally to any of the above, the method may further comprise: prior to receiving the edits in the first language, generating a transcript, in the first language, corresponding to the item of content; and causing the transcript in the first language to be presented to a user. Here, the method may further comprise: prior to generating the transcript in the first language, generating a transcript in the second language of the audio of the item of content; wherein the transcript in the first language is generated by translating the transcript in the second language into the first language. Alternatively or additionally to any of the above, creating the revised audio in the selected language may include aligning the revised audio with the corresponding edits in the selected language.
[0006] Alternatively or additionally to any of the above, the corresponding edits may comprise a set of suggestions in the selected language, and in this case the method further comprises: causing the set of suggestions to be presented to a user; receiving acceptance of at least a portion of the set of suggestions in response to the presentation; and upon receiving the acceptance, performing the creating of the revised audio using the at least the portion of the set of suggestions.
[0007] Alternatively or additionally to any of the above, the method may further comprise generating an edited transcript in the second language. Here, the method may further comprise incorporating the edited transcript in the second language as captioning for the revised version of the item of content.
[0008] Alternatively or additionally to any of the above, the edits in the first language may include one or more of an insertion, deletion or re-arrangement of a word or phrase. Alternatively or additionally to any of the above, creating the revised audio in the selected language may be done using an alignment matrix between words or phrases between a transcript in the first language and a transcript in the second language. Alternatively or additionally to any of the above, the contextual information of the item of content may comprise one or more of audio information or video information associated with the item of content.
[0009] According to another aspect of the technology, a computing system is provided that comprises memory configured to store a trained neural network model, and one or more processors operatively coupled to the memory. The one or more processors are configured to: receive edits in a first language to an item of content translated from a second language, the item of content including audio in the second language and at least one of imagery or video content; apply the edits in the first language and contextual information of the item of content to the trained neural network model to generate corresponding edits in a selected language to match at least one of semantics, intent or tone of the edits in the first language; create revised audio in the selected language for the item of content according to the corresponding edits; and generate a revised version of the item of content having the revised audio in the selected language.
[0010] The revised version of the item of content may include one or more changes to sync the revised audio with the imagery or video content. Alternatively or additionally to any of the above, the one or more processors may be further configured: prior to reception of the edits in the first language, to generate a transcript, in the first language, corresponding to the item of content; and to cause the transcript in the first language to be presented to a user. Here, the one or more processors may be further configured, prior to generation of the transcript in the first language, to generate a transcript in the second language of the audio of the item of content; and the transcript in the first language being generated by translation of the transcript in the second language into the first language.
[0011] Alternatively or additionally to any of the above, the creation of the revised audio in the selected language may include alignment of the revised audio with the corresponding edits in the selected language. Alternatively or additionally to any of the above, the corresponding edits may comprise a set of suggestions in the selected language, and the one or more processors are further configured to: cause the set of suggestions to be presented to a user; receive acceptance of at least a portion of the set of suggestions in response to the presentation; and upon reception of the acceptance, perform the creation of the revised audio using the at least the portion of the set of suggestions.
[0012] Alternatively or additionally to any of the above, the one or more processors may be further configured to generate an edited transcript in the second language. Alternatively or additionally to any of the above, the creation of the revised audio in the selected language may be done using an alignment matrix between words or phrases between a transcript in the first language and a transcript in the second language.
[0013] According to a further aspect of the technology, a method comprises: receiving, by one or more processors of a computing device, edits in a first language to an item of content translated from a second language, the item of content including audio in the second language and at least one of imagery or video content; providing, by the one or more processors, the edits in the first language and contextual information of the item of content to a trained model to obtain generated corresponding edits in a selected language to match at least one of semantics, intent or tone of the edits in the first language; obtaining, by the one or more processors, revised audio in the selected language for the item of content according to the corresponding edits; obtaining, by the one or more processors, a revised version of the item of content having the revised audio in the selected language; and presenting, by the one or more processors via a user interface subsystem of the computing device, the revised version of the item of content.
[0014] Here, providing the edits in the first language and the contextual information of the item of content to the trained model to obtain generated corresponding edits in the selected language may include communicating the edits in the first language and the contextual information to a remote computing device; and obtaining the revised audio in the selected language may include obtaining the revised audio from the remote computing device.
[0015] Alternatively or additionally to any of the above, the revised version of the item of content may include one or more changes to sync the revised audio with the imagery or video content. Here, the changes to sync the revised audio with the imagery or video content may include modifications to the imagery or video content. Alternatively or additionally to any of the above, the selected language may be the second language. Alternatively or additionally to any of the above, the second language may be a source language of the item of content.
[0016] Alternatively or additionally to any of the above, the method may further comprise: prior to receiving the edits in the first language, causing generation of a transcript in the first language corresponding to the item of content; and presenting the transcript in the first language to a user of the computing device. In this case, prior to causing generation of the transcript in the first language, the method may include causing generation of a transcript in the second language of the audio of the item of content; wherein the transcript in the first language is generated by translating the transcript in the second language into the first language.
[0017] Alternatively or additionally to any of the above, the revised audio in the selected language may be obtained via aligning the revised audio with the corresponding edits in the selected language. Alternatively or additionally to any of the above, the corresponding edits may comprise a set of suggestions in the selected language, and in this case the method further comprises: presenting the set of suggestions to a user of the computing device; receiving acceptance of at least a portion of the set of suggestions in response to the presentation; and upon receiving the acceptance, obtaining the revised audio using the at least the portion of the set of suggestions.
[0018] Alternatively or additionally to any of the above, the method may further comprise obtaining an edited transcript in the second language. Here, the method may further comprise presenting, via the user interface subsystem, the edited transcript in the second language as captioning for the revised version of the item of content. Alternatively or additionally to any of the above, the edits in the first language may include one or more of an insertion, deletion or re-arrangement of a word or phrase. Alternatively or additionally to any of the above, the revised audio in the selected language may be created using an alignment matrix between words or phrases between a transcript in the first language and a transcript in the second language.
BRIEF DESCRIPTION OF THE DRAWINGS
[0019] Figs. 1A-B illustrate an example cross-lingual media editing system in accordance with aspects of the technology.
[0020] Fig. 2 illustrates a Transformer-type architecture for use in accordance with aspects of the technology.
[0021] Fig. 3 illustrates an exemplary training system in accordance with aspects of the technology.
[0022] Fig. 4 illustrates an example of a text editing phase of a cross-lingual media editing process in accordance with aspects of the technology.
[0023] Fig. 5 illustrates an example of a video editing phase of a cross-lingual media editing process in accordance with aspects of the technology.
[0024] Figs. 6A-B illustrate an exemplary computing system for use with aspects of the technology.
[0025] Figs. 7A-B illustrate example methods in accordance with aspects of the technology.
DETAILED DESCRIPTION
[0026] The technology employs a large language model-based media editing system. As discussed herein, translated text that is edited in one language can be analyzed by an LLM using contextual information to suggest, generate or otherwise provide corresponding edits in a second language. The second language may be the original language of a piece of multimedia content, such as a training video, course tutorial, show, movie or other content. The LLM output is used to make changes to original video that tracks changes in the original language. In this way, even though a person may not speak a particular language, or may not be fluent enough to work with text in that language, they are able to edit multimedia content in a different language with fidelity to the original language and have the video correctly sync with the edited original language. The output video could include or entirely comprise generated content. For instance, the large language model could be a multi-modal language model that does not produce an intermediate textual representation of the edited video in its original language, but instead directly generates the edited video given the original video in a language that the user may not understand or be fluent in, together with the transcript and edits in another language the user does understand or is fluent in.
[0027] Fig. 1A illustrates an example cross-lingual media editing system 100. The system 100 may include one or more processors 102 and memory 104 for storing data. In one example, the memory 104 may store one or more trained LLMs 106 and/or a multimedia corpus 108 such as videos or other content to be edited and already-edited content. A user 110 can edit transcripts in a selected language on their client device, which may be, e.g., a laptop or desktop computer, a tablet PC, a mobile phone or PDA, etc. The system applies an LLM to the edited material in view of a set of contextual information, in order to generate suggested edits in one or more different languages. The LLM may be a multimodal model, can be trained in various ways, and may have a Transformer or different type of neural network architecture, as discussed further below. The user input and system interaction may be presented via an app displayable to the user 110 on a graphical user interface (GUI) 112 of the user’s client device.
[0028] For instance, in this example assume that a physics tutorial was originally created in Turkish. As shown in the GUI 112, the original content 114 is a tutorial that has video and textual content. There is also corresponding audio content by the presenter of the tutorial (e.g., a lecture). In this example, the user 110 may not speak or write in Turkish, but does speak and write in English. The system is able to support editing in English. First, a transcript 116 in the original language (here, Turkish) is created from the original content. Then another, translated transcript 118 in the user’s language (here, English) is created.
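A minimal sketch of this two-step pipeline (transcribe in the original language, then translate into the user's language) is shown below. The speech_to_text and translate_text helpers are hypothetical stand-ins for an ASR system and a translation model, neither of which is named in this description:

```python
from dataclasses import dataclass

@dataclass
class TranscriptSegment:
    text: str        # transcribed or translated words
    start_s: float   # start time in the source media, in seconds
    end_s: float     # end time in the source media, in seconds

def build_transcripts(media_path: str, source_lang: str, user_lang: str):
    """Produce the source-language transcript (116) and its translation (118).

    speech_to_text() and translate_text() are hypothetical helpers used here
    only for illustration; they are not part of the described system.
    """
    source_segments = speech_to_text(media_path, language=source_lang)
    translated_segments = [
        TranscriptSegment(
            text=translate_text(seg.text, src=source_lang, tgt=user_lang),
            start_s=seg.start_s,
            end_s=seg.end_s,
        )
        for seg in source_segments
    ]
    return source_segments, translated_segments
```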
[0029] The user is able to edit the translated transcript 118 as they choose, resulting in an edited transcript 120. This could include changing certain words or phrases, reorganizing or adding certain sections of the text (e.g., using a brief part of the tutorial as an introduction to the video), and/or making other changes. By way of example, editing may include insertions, deletions, re-arrangements, etc.
[0030] Once the changes have been made, edited transcript 120 is fed to a media module 122. The module 122 utilizes the LLM(s) 106 to process the edited transcript 120 and contextual information, generating one or more suggestions for appropriate edits in the original language (or even a different language) that will match the semantics, intent, and tone of the edits from the other language (here, English), as shown by output 124. Note that according to one aspect of the technology, such models 106 may include one or more multimodal large foundation models. The contextual information may comprise audio and/or visual content of the video (e.g., audio or video metadata associated with the video). This information can be incorporated into the model via multimodal training which may include techniques like visual and audio language models, audio/visual tokenizers, or other neural multimodal input encoding models. Once accepted, the output 124 becomes an edited version of the original transcript 116. This can then be aligned with the source recording (e.g., audio content from the original video 114) and the source recording is edited to match by the system, as shown at 126. The system may also generate a corresponding video in the other language, as shown at 128. In one scenario, the system need not produce an intermediate textual representation (124) of the edited video in its original language. Rather, as noted above, the model may be trained to directly make the edits to the multimedia content in the original language. This may be efficiently done when the model is a multi-modal model.
[0031] The system can be employed in a variety of scenarios and use cases. These can include, by way of example, educational videos and tutorials, movie and show translations (with or without subtitles), music videos, memes, advertisements, etc. Captioning may be provided for videos based on the translated text. According to another aspect, the system may prepare content for publication in a location or according to a format where the cultural norms may involve modifying the content so as not to offend the target audience. By way of example, a movie preview may have originally been created using language intended for one audience (e.g., a “rated R” movie), but the editing may have included changes so that the preview can be presented to a broader audience (e.g., with language satisfactory for a “rated PG-13” movie). Such an approach may address other cultural norms.
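Returning to the media module 122 of paragraph [0030], one way to picture its use of the LLM(s) 106 is as prompt construction over the edited transcript plus contextual signals. The prompt wording and the llm.generate() interface below are assumptions for illustration, not the disclosed implementation:

```python
def propose_source_language_edits(llm, edited_transcript_b: str,
                                  original_transcript_a: str,
                                  context: dict, lang_a: str, lang_b: str) -> str:
    """Ask a trained LLM for edits in language A that mirror edits made in language B.

    `llm` is assumed to be any object exposing a generate(prompt) -> str method;
    the template simply asks for semantics, intent, and tone to be preserved.
    """
    prompt = (
        f"Original transcript ({lang_a}):\n{original_transcript_a}\n\n"
        f"Edited transcript ({lang_b}):\n{edited_transcript_b}\n\n"
        f"Context (audio/video metadata): {context}\n\n"
        f"Rewrite the {lang_a} transcript so it matches the semantics, intent, "
        f"and tone of the {lang_b} edits, preserving {lang_a} idioms and grammar."
    )
    return llm.generate(prompt)  # suggested edits in language A (output 124)
```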
[0032] Fig. 1B illustrates an approach 150, in which the user uses keyboard 152 of user device 154 to make text edits on the translated transcript 118. Alternatively or additionally, the user device may support spoken and/or gesture-based edits. In this approach, the user device 154 has the components described above, including one or more processors 102 and memory 104, which stores one or more trained LLM (or LLMs) 106. The memory 104 stores corpus 108, and may also store one or more user profiles 156, which may be used in conjunction with the LLM(s) 106 to generate personalized rewritten text, for instance according to language or other preference information associated with the user(s).
[0033] As shown, the user device 154 also includes one or more applications 158 (e.g., a video editing or document writing app, etc.), a communication module 160 (e.g., to communicate with other computing devices), and a display module 162 configured to generate the GUI 112 including any rewritten text for edited transcript 120. The memory may also store edit-related information, which may be associated with the user profile(s) 156. In one scenario, the media module 122 may be part of a particular application 158. In another scenario, the media module 122 may be separate from the application(s) 158, and can be called as needed.
[0034] In this scenario, once the translated transcript 118 is presented to the user and modified into the edited transcript 120 using application 158, edited video 128 in the user’s language can be created and presented via the GUI 112. This can help ensure that the user’s text edits are properly integrated with the video. One or both of the edited transcript and video may be saved locally in memory, and/or sent to other computing devices (such as a server or client devices). Note that in different architectures, the LLMs, corpus and media module may be maintained together, such as for comprehensive processing by a back-end server or by a user device. In one such scenario, these components may be employed by the back-end server, with edit information supplied by a user device. An example of back-end processing is described below with regard to Figs. 6A-B.
EXAMPLE SYSTEMS AND METHODS
[0035] As noted above, one or more LLMs may be employed in the system 100. While there are a number of different possible system configurations, they each incorporate LLMs. The system can employ one or more cross-lingual media editing modules, such as module 122. For example, the module(s) can be or can otherwise include various machine-learned models. Example machine-learned models include neural networks or other multi-layer non-linear models. Example neural networks include feed forward neural networks, deep neural networks, recurrent neural networks, and convolutional neural networks. Certain machine-learned models can leverage an attention mechanism such as self-attention. For example, some example machine-learned models can include multi-headed self-attention models, e.g., having a Transformer architecture. The arrangements discussed herein can utilize one or more encoders.
[0036] In particular, such models may excel in the zero or few-shot learning setting, where through appropriately engineered prompts they can be adapted to specific tasks without modifying the model parameters. When more training data is available, parameter efficient tuning methods such as prompt tuning can achieve even better performance while still enabling a single LLM to handle multiple types of tasks. LLMs may also execute multi-step reasoning using chain of thought prompting. The technology described herein shows how to harness the attributes of LLMs within the cross-lingual media editing space.
Transformer-type Neural Network
[0037] By way of example only, a suitable Transformer architecture is presented in Fig. 2. In particular, system 200 of Fig. 2 is implementable via a computer program by processors of one or more computers in one or more locations. The system 200 receives an input sequence 202 (e.g., a query) and processes the input sequence 202 to transduce the input sequence 202 into an output sequence 204 (e.g., an answer). The input sequence 202 has a respective network input at each of multiple input positions in an input order and the output sequence 204 has a respective network output at each of multiple output positions in an output order.
[0038] System 200 can perform any of a variety of tasks that require processing sequential inputs to generate sequential outputs. System 200 includes an attention-based sequence transduction neural network 206, which in turn includes an encoder neural network 208 and a decoder neural network 210. The encoder neural network 208 is configured to receive the input sequence 202 and generate a respective encoded representation of each of the network inputs in the input sequence. An encoded representation is a vector or other ordered collection of numeric values. The decoder neural network 210 is then configured to use the encoded representations of the network inputs to generate the output sequence 204. Generally, both the encoder 208 and the decoder 210 are attention-based. In some cases, neither the encoder nor the decoder includes any convolutional layers or any recurrent layers. The encoder neural network 208 includes an embedding layer (input embedding) 212 and a sequence of one or more encoder subnetworks 214. The encoder neural network 208 may include N encoder subnetworks 214.
[0039] The embedding layer 212 is configured, for each network input in the input sequence, to map the network input to a numeric representation of the network input in an embedding space, e.g., into a vector in the embedding space. The embedding layer 212 then provides the numeric representations of the network inputs to the first subnetwork in the sequence of encoder subnetworks 214. The embedding layer 212 may be configured to map each network input to an embedded representation of the network input and then combine, e.g., sum or average, the embedded representation of the network input with a positional embedding of the input position of the network input in the input order to generate a combined embedded representation of the network input. In some cases, the positional embeddings are learned. As used herein, “learned” means that an operation or a value has been adjusted during the training of the sequence transduction neural network 206. In other cases, the positional embeddings may be fixed and are different for each position.
[0040] The combined embedded representation is then used as the numeric representation of the network input. Each of the encoder subnetworks 214 is configured to receive a respective encoder subnetwork input for each of the plurality of input positions and to generate a respective subnetwork output for each of the plurality of input positions. The encoder subnetwork outputs generated by the last encoder subnetwork in the sequence are then used as the encoded representations of the network inputs. For the first encoder subnetwork in the sequence, the encoder subnetwork input is the numeric representations generated by the embedding layer 212, and, for each encoder subnetwork other than the first encoder subnetwork in the sequence, the encoder subnetwork input is the encoder subnetwork output of the preceding encoder subnetwork in the sequence.
[0041] Each encoder subnetwork 214 includes an encoder self-attention sub-layer 216. The encoder self-attention sub-layer 216 is configured to receive the subnetwork input for each of the plurality of input positions and, for each particular input position in the input order, apply an attention mechanism over the encoder subnetwork inputs at the input positions using one or more queries derived from the encoder subnetwork input at the particular input position to generate a respective output for the particular input position. In some cases, the attention mechanism is a multi-head attention mechanism as shown. In some implementations, each of the encoder subnetworks 214 may also include a residual connection layer that combines the outputs of the encoder self-attention sub-layer with the inputs to the encoder self-attention sub-layer to generate an encoder self-attention residual output and a layer normalization layer that applies layer normalization to the encoder self-attention residual output. These two layers are collectively referred to as an “Add & Norm” operation in Fig. 2.
[0042] Some or all of the encoder subnetworks can also include a position-wise feed-forward layer 218 that is configured to operate on each position in the input sequence separately. In particular, for each input position, the feed-forward layer 218 is configured to receive an input at the input position and apply a sequence of transformations to the input at the input position to generate an output for the input position. The inputs received by the position-wise feed-forward layer 218 can be the outputs of the layer normalization layer when the residual and layer normalization layers are included or the outputs of the encoder self-attention sub-layer 216 when the residual and layer normalization layers are not included. The transformations applied by the layer 218 will generally be the same for each input position (but different feed-forward layers in different subnetworks may apply different transformations).
[0043] In cases where an encoder subnetwork 214 includes a position-wise feed-forward layer 218 as shown, the encoder subnetwork can also include a residual connection layer that combines the outputs of the position-wise feed-forward layer with the inputs to the position-wise feed-forward layer to generate an encoder position-wise residual output and a layer normalization layer that applies layer normalization to the encoder position-wise residual output. As noted above, these two layers are also collectively referred to as an "Add & Norm" operation. The outputs of this layer normalization layer can then be used as the outputs of the encoder subnetwork 214.
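For readers less familiar with this arrangement, the following is a minimal PyTorch sketch of one encoder subnetwork as described above: multi-head self-attention and a position-wise feed-forward layer, each wrapped in a residual "Add & Norm" step. The dimensions and layer sizes are illustrative defaults, not values taken from the description:

```python
import torch
import torch.nn as nn

class EncoderSubnetwork(nn.Module):
    """One encoder block: self-attention + position-wise feed-forward,
    each followed by a residual connection and layer normalization."""

    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # "Add & Norm" around multi-head self-attention
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        # "Add & Norm" around the position-wise feed-forward layer
        x = self.norm2(x + self.ffn(x))
        return x

# Example: a batch of 2 sequences, 10 positions each, embedding size 512
encoded = EncoderSubnetwork()(torch.randn(2, 10, 512))
```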
[0044] Once the encoder neural network 208 has generated the encoded representations, the decoder neural network 210 is configured to generate the output sequence in an auto-regressive manner. That is, the decoder neural network 210 generates the output sequence, by at each of a plurality of generation time steps, generating a network output for a corresponding output position conditioned on (i) the encoded representations and (ii) network outputs at output positions preceding the output position in the output order. In particular, for a given output position, the decoder neural network generates an output that defines a probability distribution over possible network outputs at the given output position. The decoder neural network can then select a network output for the output position by sampling from the probability distribution or by selecting the network output with the highest probability.
[0045] Because the decoder neural network 210 is auto-regressive, at each generation time step, the decoder network 210 operates on the network outputs that have already been generated before the generation time step, i.e., the network outputs at output positions preceding the corresponding output position in the output order. In some implementations, to ensure this is the case during both inference and training, at each generation time step the decoder neural network 210 shifts the already generated network outputs right by one output order position (i.e., introduces a one position offset into the already generated network output sequence) and (as will be described in more detail below) masks certain operations so that positions can only attend to positions up to and including that position in the output sequence (and not subsequent positions). While the remainder of the description below describes that, when generating a given output at a given output position, various components of the decoder 210 operate on data at output positions preceding the given output positions (and not on data at any other output positions), it will be understood that this type of conditioning can be effectively implemented using shifting.
[0046] The decoder neural network 210 includes an embedding layer (output embedding) 220, a sequence of decoder subnetworks 222, a linear layer 224, and a softmax layer 226. In particular, the decoder neural network can include N decoder subnetworks 222. However, while the example of Fig. 2 shows the encoder 208 and the decoder 210 including the same number of subnetworks, in some cases the encoder 208 and the decoder 210 include different numbers of subnetworks. The embedding layer 220 is configured to, at each generation time step, for each network output at an output position that precedes the current output position in the output order, map the network output to a numeric representation of the network output in the embedding space. The embedding layer 220 then provides the numeric representations of the network outputs to the first subnetwork 222 in the sequence of decoder subnetworks.
[0047] In some implementations, the embedding layer 220 is configured to map each network output to an embedded representation of the network output and combine the embedded representation of the network output with a positional embedding of the output position of the network output in the output order to generate a combined embedded representation of the network output. The combined embedded representation is then used as the numeric representation of the network output. The embedding layer 220 generates the combined embedded representation in the same manner as described above with reference to the embedding layer 212.
[0048] Each decoder subnetwork 222 is configured to, at each generation time step, receive a respective decoder subnetwork input for each of the plurality of output positions preceding the corresponding output position and to generate a respective decoder subnetwork output for each of the plurality of output positions preceding the corresponding output position (or equivalently, when the output sequence has been shifted right, each network output at a position up to and including the current output position). In particular, each decoder subnetwork 222 includes two different attention sub-layers: a decoder self-attention sub-layer 228 and an encoder-decoder attention sub-layer 230. Each decoder self-attention sub-layer 228 is configured to, at each generation time step, receive an input for each output position preceding the corresponding output position and, for each of the particular output positions, apply an attention mechanism over the inputs at the output positions preceding the corresponding position using one or more queries derived from the input at the particular output position to generate an updated representation for the particular output position. That is, the decoder self-attention sub-layer 228 applies an attention mechanism that is masked so that it does not attend over or otherwise process any data that is not at a position preceding the current output position in the output sequence.
[0049] Each encoder-decoder attention sub-layer 230, on the other hand, is configured to, at each generation time step, receive an input for each output position preceding the corresponding output position and, for each of the output positions, apply an attention mechanism over the encoded representations at the input positions using one or more queries derived from the input for the output position to generate an updated representation for the output position. Thus, the encoder-decoder attention sub-layer 230 applies attention over encoded representations while the decoder self-attention sub-layer 228 applies attention over inputs at output positions.
[0050] In the example of Fig. 2, the decoder self-attention sub-layer 228 is shown as being before the encoder-decoder attention sub-layer in the processing order within the decoder subnetwork 222. In other examples, however, the decoder self-attention sub-layer 228 may be after the encoder-decoder attention sub-layer 230 in the processing order within the decoder subnetwork 222 or different subnetworks may have different processing orders. In some implementations, each decoder subnetwork 222 includes, after the decoder self-attention sub-layer 228, after the encoder-decoder attention sub-layer 230, or after each of the two sub-layers, a residual connection layer that combines the outputs of the attention sub-layer with the inputs to the attention sub-layer to generate a residual output and a layer normalization layer that applies layer normalization to the residual output. These two layers, inserted after each of the two sub-layers, are also collectively referred to as an "Add & Norm" operation.
[0051] Some or all of the decoder subnetworks 222 also include a position-wise feed-forward layer 232 that is configured to operate in a similar manner as the position-wise feed-forward layer 218 from the encoder 208. In particular, the layer 232 is configured to, at each generation time step: for each output position preceding the corresponding output position: receive an input at the output position, and apply a sequence of transformations to the input at the output position to generate an output for the output position. The inputs received by the position-wise feed-forward layer 232 can be the outputs of the layer normalization layer (following the last attention sub-layer in the subnetwork 222) when the residual and layer normalization layers are included or the outputs of the last attention sub-layer in the subnetwork 222 when the residual and layer normalization layers are not included. In cases where a decoder subnetwork 222 includes a position-wise feed-forward layer 232, the decoder subnetwork can also include a residual connection layer that combines the outputs of the position-wise feed-forward layer with the inputs to the position-wise feed-forward layer to generate a decoder position-wise residual output and a layer normalization layer that applies layer normalization to the decoder position-wise residual output. These two layers are also collectively referred to as an "Add & Norm" operation. The outputs of this layer normalization layer can then be used as the outputs of the decoder subnetwork 222.
[0052] At each generation time step, the linear layer 224 applies a learned linear transformation to the output of the last decoder subnetwork 222 in order to project the output of the last decoder subnetwork 222 into the appropriate space for processing by the softmax layer 226. The softmax layer 226 then applies a softmax function over the outputs of the linear layer 224 to generate the probability distribution (output probabilities) 234 over the possible network outputs at the generation time step. The decoder 210 can then select a network output from the possible network outputs using the probability distribution, to output final result 204.
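The selection step described in [0044] and [0052] can be sketched as a simple greedy decoding loop. The decoder_step callable below, which is assumed to return next-token logits (the linear plus softmax projection) given the encoded representations and the partial output, is an illustrative interface rather than the disclosed one:

```python
import torch

def greedy_decode(decoder_step, encoded, bos_id: int, eos_id: int, max_len: int = 128):
    """Greedy auto-regressive decoding sketch.

    decoder_step(encoded, prefix) is assumed to return logits over the output
    vocabulary for the next position.
    """
    prefix = [bos_id]
    for _ in range(max_len):
        logits = decoder_step(encoded, torch.tensor(prefix))
        probs = torch.softmax(logits, dim=-1)     # probability distribution 234
        next_id = int(torch.argmax(probs))        # select the highest-probability output
        prefix.append(next_id)
        if next_id == eos_id:
            break
    return prefix
```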
[0053] According to aspects of the technology, variations on the Transformer-type architecture can be used. These may include T5, Bidirectional Encoder Representations from Transformers (BERT), Language Model for Dialogue Applications (LaMDA), and/or Pathways Language Model (PaLM) type architectures. Other arrangements may be multimodal, for instance able to handle text, video and/or audio information.
Cross-Lingual Media Editing
[0054] As noted above, in a conventional approach it can be very difficult to edit a video in a language that one does not understand. For instance, it may be tricky to know the correct places to cut the video so as to maintain phrases and sentences, and to know what parts must be preserved in order for the video to make sense after it has been edited.
[0055] The technology provides cross-lingual media editing via a text-based audio and/or video editing interface. By way of example, the user is able to edit a recording by manipulating a textual representation of the video contents (e.g., a transcript of the words spoken in the recording). In this case, the user is editing a recording in language A, or even a multi-lingual recording in a set of languages, even if the user does not speak or understand A. Thus, what the user views is not the original textual representation for language A, but rather a translation that is in language B. The edits are then propagated back to the language A textual representation by the system and from there translated into edits of the original audio/video to make the audio/video match the newly edited transcript.
[0056] According to one aspect of the technology, an alignment matrix of words and phrases from the original transcript in original language A (e.g., transcript 116) to words and phrases in the translated transcript in the other language B (e.g., transcript 118) can be created during the translation process. Alternatively, this may be done as a post-processing step after translation if the translation is not manual or done in an automated fashion in which alignment information is not produced as a by-product. The alignments are used to identify the locations in the original language transcript that must be changed to match the edits in the transcript for the other language. In one scenario, the alignment matrix may comprise an alignment data structure with a mapping of transcription word tokens to start- and end-timings in the source media. In another scenario, the alignment matrix may be an ephemeral activation pattern present in the model during inference that is learned during training. This may manifest, for example, in values fed forward through the neural network model, which may only be evident through deep introspection of specific memory locations representing model activations in e.g., the transformer attention layers during runtime.
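A sketch of the first of these scenarios, in which the alignment is an explicit data structure mapping word tokens to source-media timings and to their counterparts in the other language, might look as follows; the field names and helper are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class AlignedToken:
    token_a: str     # word or phrase in original language A
    token_b: str     # aligned word or phrase in language B
    start_s: float   # start time of token_a in the source media, in seconds
    end_s: float     # end time of token_a in the source media, in seconds

def locate_source_edits(alignment: list[AlignedToken], edited_b_tokens: set[str]):
    """Return the language-A tokens (with their media timings) whose aligned
    language-B counterparts were touched by the user's edits."""
    return [a for a in alignment if a.token_b in edited_b_tokens]
```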
[0057] The alignment information may be applied to the attention mechanism of the neural network, in order to inform the LLM what words to focus on in the translation. Note that alignment need not be employed if the LLM is configured to handle language translation. For instance, if the LLM is multimodal, it can work as trained if prompted correctly, without employing an alignment data structure.
[0058] As noted above, the system can include one or more apps, such as a video editing or document writing app. This can include a text editing app, routine or other program, which enables textual documents to be viewed and edited by a user, such as a transcript for media data. The app or other program can provide textual analysis. Textual analysis can include various analyses on the text of the document being viewed within the editing app and make recommendations for changes to the text of the document. This can include grammar and punctuation analysis, term usage consistency analysis, flow and narrative analysis, etc.
[0059] In addition or alternatively, textual analysis can include disfluency detection. In disfluency detection, filler words in a text document may be identified. This may be particularly prevalent in transcripts, where phrases such as "uh," "uhm," "hm," and others are used as placeholders while a speaker considers a point. Moreover, repetitions of words can be detected, and mistakes (e.g., misspoken words, incorrect grammar during speech, etc.) can also be detected. These detected portions of the text can then be recommended for change and/or automatically changed in the transcript. This could be done on the original transcript of the source language, on the edited transcript of the other language, or on both transcripts.
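A simple, illustrative sketch of such filler-word and repetition detection is shown below; the filler inventory and regular expression are assumptions, not part of the description:

```python
import re

FILLERS = {"uh", "uhm", "um", "hm"}  # illustrative filler inventory

def detect_disfluencies(transcript: str):
    """Return (index, word) pairs flagged as fillers or immediate repetitions."""
    words = re.findall(r"[\w']+", transcript.lower())
    flagged = []
    for i, word in enumerate(words):
        if word in FILLERS:
            flagged.append((i, word))        # filler placeholder
        elif i > 0 and word == words[i - 1]:
            flagged.append((i, word))        # immediately repeated word
    return flagged

print(detect_disfluencies("So, uh, the the souffle needs, hm, four eggs"))
```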
[0060] Textual analysis can also involve extractive summarization, in which the program can analyze portions or the entirety of the transcript and build a textual summary of the entirety of the document (or one or more portions thereof) by, for example, selecting sentences summarizing key points from the transcript. This textual summary can then be added to the transcript or output as its own transcript. In addition or alternatively, textual analysis can create abstractive summarization, which identifies key pieces of the text and builds new sentences, instead of using existing sentences from the transcript, for a summarization of the transcript.
[0061] The app may also perform video summarization and/or highlighting. For instance, if the transcript includes questions and answers by two or more speakers, textual analysis can include generating text summarizing the question-answer back-and-forth. Content re-organization may also be performed by the system. By way of example, text analysis may identify one or more portions of the original or translated transcript that appear to be out of place (e.g., not otherwise associated with the current conversation topic occurring, in reference to a prior question or thought, etc.) and may recommend making changes to move this portion of text to a more relevant portion of the transcript.
[0062] Content reorganization may also include removing a portion of audio and/or video at a first time, identifying a different time in the media data where the portion of audio and/or video would more likely be relevant based on recommendations from the textual analysis, and then performing stitching and other audio and/or video editing techniques to add the portion of audio and/or video in at the different time in the media data. Content reorganization may also include cleaning up one or more portions of the content where data has been moved (e.g., when media is moved to a new time, the old time can be stitched together to not have an interruption in the media data).
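As a sketch of this kind of reorganization over a simple list of (start, end, label) clips, the snippet below moves one clip to a more relevant position and lets the neighbouring clips close ranks; actual stitching would be done with a media-editing library, which is not specified here:

```python
def move_segment(timeline, segment_index: int, new_position: int):
    """Reorder a timeline of (start_s, end_s, label) clips.

    Removing the clip from its old position and inserting it at the new one
    mirrors the described stitch-out / stitch-in: the remaining clips close
    ranks, so no gap is left at the old time.
    """
    timeline = list(timeline)             # don't mutate the caller's list
    clip = timeline.pop(segment_index)    # remove at the first time
    timeline.insert(new_position, clip)   # add back at the more relevant time
    return timeline

# Example: move the third clip to the front (e.g., use it as an introduction)
clips = [(0.0, 12.5, "intro"), (12.5, 40.0, "step 1"), (40.0, 55.0, "key point")]
print(move_segment(clips, 2, 0))
```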
[0063] Thus, when the user makes edits in the translated transcript, this may include moving text around, shortening sentences, tightening up language, etc. The system may map the textual changes to the audio and/or video for the content of interest, in order to present the information correctly. Here, the user may review the changes to ensure they are consistent with the transcript edits. Once the edited text is translated to the original language (and/or to any other languages), the system performs the appropriate mapping in the desired language. For instance, an LLM trained in the target language will have learned valid ways of expressing certain concepts in the target language, and can perform the mappings according to those trained ways of expressing the concepts. In some situations, there may be fine-tuning, e.g., on a parallel corpus with the same content in two (or more) languages. This can be done to address situations where there are different idioms, grammar and/or morphology in the languages.
[0064] According to another aspect, the edits which have been translated into the original language can be used for captioning of the media content. Thus, in addition to any video and/or audio editing, captioning can be generated in the original language for presentation to a user.
[0065] The user may choose to incorporate some or all of these options when performing their editing. In one example, the user may make their edits in the other language, then be presented with a video with audio in the other language to allow the user to determine whether the edits made sense. Then, once accepted, the edited transcript can be converted to an edited version of the original transcript. At that point, video and/or audio editing of the original content may be performed to obtain the edited video in the source language. Some edits may be possible by direct manipulation of the video as-is, such as removing or rearranging frames of video and samples of audio. Other edits can make use of one or more other models (e.g., stable diffusion or other generative-type models) to create new video frames (or sections of video frames) and new audio samples to blend seamlessly with the surrounding content.
[0066] While the above scenario illustrates how the system can operate using two languages, the system may support any number of languages. If translations already exist, the LLM approach could be applied to those languages too. Personalization of the video may correspond to different languages for one or more users of the system.
Training
[0067] The LLM may be trained for one or more specific language pairs (e.g., Turkish-English, English-Japanese, etc.). Training data can include, e.g., language model edits with translations to one or more other languages, which can be used to train the model in a supervised way. In one scenario, the LLM may be trained using few-shot prompting. By way of example, this may include the kind of reply that is expected to be generated, such as a type of change to the text. The base language model and examples for training need to be multilingual. The model may also be trained for prosody, which involves the cadence of the speech, as well as intonation, since that may vary from language to language. Moreover, how many seconds of video/audio are generated for the translation process could be fine-tuned or prompted.
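Few-shot prompting of this kind can be sketched as a template that prepends worked edit-propagation examples before the new input. The wording is invented for illustration, and the example pair is borrowed from the Fig. 4 discussion later in this description:

```python
FEW_SHOT_EXAMPLES = [
    # (edited text in language B, corresponding edit in language A) - illustrative pair
    ("Steps to making a chocolate souffle", "Etapes pour faire un souffle au chocolat !"),
]

def build_few_shot_prompt(edited_b: str, lang_a: str = "French", lang_b: str = "English") -> str:
    """Assemble a few-shot prompt showing the kind of reply the model should generate."""
    lines = [f"Propagate {lang_b} transcript edits into {lang_a}, keeping semantics, intent and tone."]
    for b_text, a_text in FEW_SHOT_EXAMPLES:
        lines.append(f"{lang_b}: {b_text}\n{lang_a}: {a_text}")
    lines.append(f"{lang_b}: {edited_b}\n{lang_a}:")
    return "\n\n".join(lines)
```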
[0068] The technology may use a dual encoder LLM having two neural net towers, for instance one to encode contextual information and one to encode the media content. By way of example only, textual context may be input to a first LLM encoder, while textual metadata is input to a second LLM encoder. The outputs from each encoder can then be applied to a similarity module that is used to generate a contrastive loss. The contrastive loss can then be used to train the dual encoder LLM.
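A compact PyTorch sketch of this dual-encoder arrangement with an in-batch contrastive loss follows; the tower internals, vocabulary size and embedding width are placeholders rather than disclosed values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """One encoder tower: token embedding, mean pooling, linear projection."""

    def __init__(self, vocab_size: int = 32000, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:   # ids: (batch, seq_len)
        pooled = self.embed(ids).mean(dim=1)                 # mean-pool token embeddings
        return F.normalize(self.proj(pooled), dim=-1)        # unit-length embedding

def contrastive_loss(context_emb: torch.Tensor, content_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """In-batch contrastive loss: matching (context, content) pairs lie on the diagonal."""
    logits = context_emb @ content_emb.t() / temperature
    targets = torch.arange(context_emb.size(0))
    return F.cross_entropy(logits, targets)

# Two towers: one encodes contextual information, one encodes the media-content text
context_tower, content_tower = Tower(), Tower()
ctx_ids = torch.randint(0, 32000, (4, 16))    # toy batch of 4 token sequences
cnt_ids = torch.randint(0, 32000, (4, 16))
loss = contrastive_loss(context_tower(ctx_ids), content_tower(cnt_ids))
```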
[0069] An example training computer system 300 is shown in Fig. 3. The system 300 can include a training device 302, one or more servers 304, and one or more client devices 306. Here, the training device 302 includes a model trainer 308 that trains machine-learned model(s) 310 using training data 312, which may be stored at the server(s) 304, the client device(s) 306 and/or a database 314 using various training or learning techniques, such as, for example, backwards propagation of errors. The database may also store the model trainer 308 and/or the training data 312.
[0070] By way of example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function). Various loss functions can be used such as mean squared error, likelihood loss, cross entropy loss, hinge loss, and/or various other loss functions. Gradient descent techniques can be used to iteratively update the parameters over a number of training iterations. In some implementations, performing backwards propagation of errors can include performing truncated backpropagation through time. The model trainer 308 may perform one or more generalization techniques (e.g., weight decays, dropouts, etc.) to improve the generalization capability of the models being trained.
[0071] The model trainer 308 can train models 310, such as cross-lingual media editing models associated with two or more languages, based on the training data 312. The training data 312 can include, for example, text, audio and/or video data segments, and associated desired outputs (e.g., clipped media segments, cropped media segments, stitched media segments, combined media segments, removed media segments, annotated media segments, and other modified media segments).
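A generic training-step sketch along the lines of paragraph [0070] (forward pass, loss backpropagation, gradient-based update, with weight decay available as a regularizer) might look like this; the optimizer choice and hyperparameters are illustrative assumptions:

```python
import torch

def train_step(model, optimizer, loss_fn, batch):
    """One training iteration: forward pass, backpropagate the loss, update parameters."""
    optimizer.zero_grad()
    outputs = model(batch["inputs"])
    loss = loss_fn(outputs, batch["targets"])   # e.g., cross entropy loss
    loss.backward()                             # backwards propagation of errors
    optimizer.step()                            # gradient-based parameter update
    return loss.item()

# Weight decay, one of the mentioned generalization techniques, can be set on the optimizer:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
```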
[0072] The models 310 may be multimodal, able to receive and process multiple types of content such as videos, audio files, and/or text files. The models 310 can analyze edits made in one language, compare such edits to desired outputs, and, once trained, create corresponding edits in a different language and sync the edits with accompanying video. Thus, the training can be done so that edits in one language match the semantics, intent and tone of the edits in another language. The trained model is able to determine how the edits should fit into the other language, given that it may have different idioms, grammar and/or morphology. The trained model is also able to align the language changes with corresponding video to match, for instance so that when a person is speaking, the movement of their mouth matches the edited audio (or accompanying captioning).
[0073] The model trainer 308 can also train the models using, for example, textual data, such as transcripts in different languages for media data. In particular, the model trainer 308 can provide a transcript with a modification made (e.g., a sentence removed, a sentence added, a sentence modified, a sentence moved within the transcript, or other modifications that can be performed on text) and an associated desired media output (e.g., audio and/or video). Based on the modified transcript and the desired media output, the model trainer 308 can train the models 310 to perform media editing actions to achieve the desired media output.
[0074] In some implementations, if a user has provided consent, training examples can be provided by the client device 306. Thus, in such implementations, the model 310 provided to that client device 306 may be trained by the training device 302 on user-specific data received from the client device 306. In some instances, this process can be referred to as personalizing the model, for instance according to a particular language (or set of languages), or a particular type of media content (e.g., training or educational videos).
[0075] As shown in this example, the training device 302, server(s) 304, client device(s) 306 and/or database 314 may be in operative communication via a network 316.
Textual and Video Editing Examples
[0076] Fig. 4 illustrates an example 400 of how a cross-lingual media editing system can function in a first editing phase involving text edits. As shown, at 402 a segment of original text is provided or obtained in the source language. For instance, the text may be a transcript from a cooking show about how to make a souffle. The original text, in this example in French (“Comment faire un souffle”), is converted into another language such as English (“How to make a souffle”). This conversion may be done by a cross-lingual LLM or a text translation module.
[0077] In this example, a user modifies the text, making it clear that the video involves the process for making a chocolate souffle (“Steps to making a chocolate souffle”). Here, “How to” was changed to “Steps”, “make” to “making”, and “chocolate” has been inserted, as indicated by the underlining.
[0078] Edited language text 406 and any contextual data 408 are applied to a cross-lingual module 410, which includes one or more trained cross-lingual models. The module generates, as shown at 412, text with proposed edits in the original language. Here, as shown at 414, this result may be “Etapes pour faire un souffle au chocolat !” The underlining may be presented to show the changes to the original text. Note that in this example, an exclamation mark has been proposed by the cross-lingual module 410 to indicate excitement.
[0079] Fig. 5 illustrates another example 500 of how the cross-lingual media editing system can function in a second editing phase involving video edits. Here, the original video in the source language is shown at 502. Once the translated text has been edited, and the proposed changes have been accepted in the source language, the cross-lingual module 410 makes corresponding edits to the video as shown at 504. This can include changing the displayed text to match the edited text, for instance as part of a caption or textual summary of the content. In this example, because the edited text at 504 is longer than the original text at 502, the system may reformat, move or otherwise modify the text for presentation in the video. This may also involve graphics editing, such as moving the placement of the delicious souffle to the right of the frame, rather than in the center as originally presented.
[0080] The processing by the cross-lingual module 410 may involve content reorganization as discussed above. By way of example, if the steps in the souffle-making process were changed (e.g., whipping the egg whites before melting the chocolate), or a step was added (e.g., chill the whipped egg whites before folding into the melted chocolate), then content reorganization may involve changing the order of certain video segments, repeating a video segment, and/or inserting a new video segment, as well as syncing the baker’s voice to the corresponding video (and/or to any displayed captioning).
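A simplified sketch of such reorganization is given below, assuming each video segment has already been aligned to a transcript sentence; the Segment type and the source of the alignment are hypothetical.

```python
# Sketch of content reorganization: reorder, drop or repeat aligned video
# segments so that they follow the edited transcript. Segment and the
# alignment identifiers are hypothetical.
from dataclasses import dataclass

@dataclass
class Segment:
    sentence_id: str  # identifier of the aligned transcript sentence
    start_s: float    # segment start time in the source video (seconds)
    end_s: float      # segment end time in the source video (seconds)

def reorder_segments(segments, edited_sentence_order):
    """Return segments in the order implied by the edited transcript.

    Sentences removed from the transcript are dropped; a repeated sentence id
    reuses the same segment, which corresponds to repeating a video segment.
    """
    by_id = {seg.sentence_id: seg for seg in segments}
    return [by_id[sid] for sid in edited_sentence_order if sid in by_id]
```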
Example Computing Architecture
[0081] As noted above, the cross-lingual media technology discussed herein may involve training one or more LLMs. By way of example, one LLM may be configured for use with a specific language pair, e.g., Turkish-English or English-French. Another LLM may be configured for use with three or more languages. This can enable different users to edit the original content in various languages. It can also enable generation of edited videos and other media content in multiple languages. For instance, if an original video was produced in Turkish and edited in English, the system may generate not only an edited version of the Turkish video but also one in English or any other language supported by the LLM. Or there may be a set of LLMs configured to generate new videos in different languages based upon one common editing language. In this case, if a souffle baking video was edited in English, one or more different LLMs could be used by the cross-lingual module to generate videos in French, Turkish, Japanese, Spanish, etc. The videos could be stored by a server and/or provided to various client devices.
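The selection among language-pair and multilingual models could be organized along the following lines; the mapping and model identifiers are purely illustrative.

```python
# Purely illustrative registry of language-pair models; identifiers are placeholders.
LANGUAGE_PAIR_MODELS = {
    ("tr", "en"): "llm-turkish-english",
    ("en", "fr"): "llm-english-french",
}
MULTILINGUAL_MODEL = "llm-multilingual"  # covers three or more languages

def select_model(source_lang: str, editing_lang: str) -> str:
    # Fall back to a multilingual model when no dedicated pair model exists.
    return LANGUAGE_PAIR_MODELS.get((source_lang, editing_lang), MULTILINGUAL_MODEL)
```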
[0082] One example computing architecture is shown in Figs. 6A and 6B. In particular, Figs. 6A and 6B are pictorial and functional diagrams, respectively, of an example system 600 that includes a plurality of computing devices and databases connected via a network. For instance, computing device(s) 602 may be implemented as a cloud-based server system. Databases 604 and 606 may store, e.g., a corpus of media content, training data and/or trained models (including LLMs as discussed herein). The server system may access the databases via network 608.
[0083] Client devices may include one or more of a desktop computer 610 and a laptop or tablet PC 612, for instance that can be employed as editing and/or training devices as discussed here. Editing and/or training tools could be provided to the user via a web-based service, app or other program. Other client devices may include handheld devices including a personal communication device such as a mobile phone or PDA 614 or a tablet 616. Another example is a wearable device 618 such as a smartwatch (or head-mounted display device). Each of these can be used to play videos or otherwise present multimedia content in an original language or in a different language as discussed herein.
[0084] As shown in Fig. 6B, each of the computing devices 602 and 610-618 may include one or more processors, memory, data and instructions. The memory stores information accessible by the one or more processors, including instructions and data (e.g., models) that may be executed or otherwise used by the processor(s). The memory may be of any type capable of storing information accessible by the processor(s), including a computing device-readable medium. The memory is a non-transitory medium such as a hard drive, memory card, optical disk, solid-state, etc. Systems may include different combinations of the foregoing, whereby different portions of the instructions and data are stored on different types of media. The instructions may be any set of instructions to be executed directly (such as machine code) or indirectly (such as scripts) by the processor(s). For example, the instructions may be stored as computing device code on the computing device-readable medium. In that regard, the terms “instructions”, “modules” and “programs” may be used interchangeably herein. The instructions may be stored in object code format for direct processing by the processor, or in any other computing device language including scripts or collections of independent source code modules that are interpreted on demand or compiled in advance.

[0085] The processors may be any hardware processors, such as commercially available CPUs, tensor processing units (TPUs), graphical processing units (GPUs), etc. Alternatively, each processor may be a dedicated device such as an ASIC or other hardware-based processor. Although Fig. 6B functionally illustrates the processors, memory, and other elements of a given computing device as being within the same block, such devices may actually include multiple processors, computing devices, or memories that may or may not be stored within the same physical housing. Similarly, the memory may be a hard drive or other storage media located in a housing different from that of the processor(s), for instance in a cloud computing system of server 602. Accordingly, references to a processor or computing device will be understood to include references to a collection of processors or computing devices or memories that may or may not operate in parallel. As discussed herein, one or more processors may be configured to retrieve and store data in memory, and may execute instructions to implement LLMs and/or multimodal models. When multiple processors are employed, each processor may implement one or more of the instructions or otherwise implement a portion (e.g., one or more layers) of the model(s).
[0086] The computing devices may include all of the components normally used in connection with a computing device such as the processor and memory described above as well as a user interface subsystem for receiving input from a user and presenting information to the user (e.g., text, audio, and imagery and/or other graphical elements). The user interface subsystem may include one or more user inputs (e.g., at least one front (user) facing camera, a mouse, keyboard, touch screen and/or microphone) and one or more display devices (e.g., a monitor having a screen or any other electrical device that is operable to display information (e.g., text, imagery and/or other graphical elements)). Other output devices, such as speaker(s), may also provide information to users.
[0087] The user-related computing devices (e.g., 610-618) may communicate with a back-end computing system (e.g., server 602) via one or more networks, such as network 608. The network 608, and intervening nodes, may include various configurations and protocols including short-range communication protocols such as Bluetooth™, Bluetooth LE™, the Internet, World Wide Web, intranets, virtual private networks, wide area networks, local networks, private networks using communication protocols proprietary to one or more companies, Ethernet, WiFi and HTTP, and various combinations of the foregoing. Such communication may be facilitated by any device capable of transmitting data to and from other computing devices, such as modems and wireless interfaces.
[0088] In one example, computing device 602 may include one or more server computing devices having a plurality of computing devices, e.g., a load balanced server farm or cloud computing system, that exchange information with different nodes of a network for the purpose of receiving, processing and transmitting the data to and from other computing devices. For instance, computing device 602 may include one or more server computing devices that are capable of communicating with any of the computing devices 610-618 via the network 608. The computing device 602 may implement a back-end server (e.g., a cloud-based video content server), which receives information from desktop computer 610, laptop/tablet PC 612, mobile phone or PDA 614, tablet 616 or wearable device 618 such as a smartwatch or head-mounted display.
[0089] The applications used by the user, such as video apps, word processing, social media or messaging applications, may utilize the technology by making a call to an API for a service that uses the LLM as described herein. The service may be locally hosted on the client device such as any of client devices 610, 612, 614, 616 and/or 618, or remotely hosted such as by a back-end server such as computing device 602. In one scenario, the client device may provide the textual edits but rely on a separate service for the LLM. In another scenario, the client application and the model(s) may be provided by the same entity but associated with different services. In a further scenario, a client application may integrate with a third-party service for the baseline functionality of the application, such as captioning for videos. Thus, one or more neural network models may be provided by various entities, including an entity that also provides the client application, a back-end service that can support different applications, or an entity that provides such models for use by different services and/or applications.
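By way of example only, a client application might call such a remotely hosted service as sketched below; the endpoint URL and JSON fields are assumptions for illustration and do not describe an actual API.

```python
# Hypothetical client-side call to a remotely hosted cross-lingual editing
# service; the endpoint path and request fields are illustrative assumptions.
import json
import urllib.request

def request_cross_lingual_edits(edited_text, source_lang, editing_lang,
                                endpoint="https://example.com/v1/cross-lingual-edit"):
    payload = json.dumps({
        "edited_text": edited_text,
        "source_language": source_lang,
        "editing_language": editing_lang,
    }).encode("utf-8")
    req = urllib.request.Request(endpoint, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```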
[0090] Resultant information (e.g., sets of edited text in one or more languages, or modified videos based on edited text) or other data derived from the approaches discussed herein may be shared by the server with one or more of the client computing devices. Alternatively or additionally, the client device(s) may maintain their own databases, models, etc. Thus, the client device(s) may locally process text for media editing in accordance with the approaches discussed herein. Moreover, the client device(s) may receive updated models and/or cross-lingual modules from the computing device 602 or directly from database 606 via the network 608.
[0091] In one scenario, a user can use a keyboard of their client device to make text edits on a translated transcript. Alternatively or additionally, the client device may support spoken and/or gesture-based edits. In this scenario, the computing device(s) 602 may be a back-end system that implements the cross-lingual media editing module to use the edited transcript in the user’s preferred language to generate an edited transcript in the original language (and/or in other selected languages), as well as to make any necessary edits to the video corresponding to the original language (and/or to other selected languages). This can include creating new video frames (or sections of video frames) and new audio samples to blend seamlessly with the surrounding content as described above. There may be multiple editing rounds performed by the user, for instance to change a word or phrase, add an introduction or conclusion, remove a portion of the text entirely, etc.

[0092] Once the back-end or other computing device(s) modify and/or create content based upon the user’s edits, such content can be transmitted to the user’s client device and/or other client devices for presentation by rendering a user interface at the respective client device(s). In this way, the client device(s) need not store or run the LLM and/or multimodal model(s) locally, while still being able to present the desired content to the user(s). In one example, some of the editing/creation may be done locally on the client device, while other editing/creation may be done by the back-end device. For instance, the client device may be configured to run an LLM to generate textual transcripts in one or more languages, while a back-end server is configured to use those transcripts in video editing according to a multimodal model or other video editing approach.
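The division of work between client and back-end described above could be sketched as follows; the function and method names are hypothetical placeholders, and only the split of responsibilities is intended.

```python
# Hypothetical sketch of the client/back-end split: transcripts are produced on
# the device, while video editing runs on the server. All names are placeholders.
def client_side_phase(local_llm, video, editing_lang):
    """Runs on the client device: produce an editable transcript."""
    transcript = local_llm.transcribe_and_translate(video, editing_lang)
    return transcript  # presented to the user for editing in the UI

def backend_phase(multimodal_model, video, edited_transcript):
    """Runs on the back-end server: apply the edited transcript to the video."""
    return multimodal_model.edit_video(video, edited_transcript)
```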
[0093] The modified and/or created content may be stored locally on the client device(s), maintained by the back-end or other computing device, or reside in the multimedia corpus storage system (e.g., database 604). Each client device may render a UI to present the desired content. This may be done in ad hoc fashion, such as when a user clicks on an icon or link for the desired content.
Personalization
[0094] Media content may be modified or otherwise personalized based upon translated text that has been edited by a user. Another kind of personalization can be personalization specific to the content and the original content creator. For instance, the editing user may not be the original creator, since the editor may not understand or be fluent in the language in which the content was originally created. In this case, edits to the original content can be made in a way that preserves the creator's language patterns and/or personality. This could be done by using the full source media context (not just the content immediately surrounding the edited regions), or even by passing the system a pointer to a corpus of other content by the creator that can be used to train personalization models and parameters. Here, for example, if a Japanese person is editing a Japanese transcript of an English video by an American actor who has a very distinctive communication style, the resulting edited English video content could follow that actor’s particular pattern of speech, even if the resulting language would not necessarily match “proper” American English. Personalization may also be impacted by the language(s) used by the user, past editing sessions and/or other information.
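One way such personalization might be conditioned on a creator's prior content is sketched below; the prompt structure and the use of corpus excerpts are illustrative assumptions.

```python
# Illustrative sketch of style conditioning: a few excerpts from the creator's
# prior content are included so that generated edits follow the creator's
# speech patterns. The prompt layout is an assumption.
def build_personalized_prompt(edited_text, source_text, creator_corpus_excerpts):
    style_examples = "\n".join(creator_corpus_excerpts[:5])
    return (
        "Samples of the creator's speaking style:\n" + style_examples + "\n\n"
        "Original transcript: " + source_text + "\n"
        "Edited transcript (editor's language): " + edited_text + "\n"
        "Rewrite the edits in the creator's language, matching the creator's style."
    )
```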
[0095] The system may maintain information associated with one or more personal profiles (e.g., stored locally at the client device). Further to the descriptions above, a user may be provided with controls allowing the user to make an election as to both if and when systems, programs, or features described herein may enable collection of user information (e.g., information about a user’s language preferences, writing or communication style, or a user’s current location), and if the user is sent content or communications from a server. In addition, certain data may be treated in one or more ways before it is stored or used, so that personally identifiable information is removed. For example, a user’s identity may be treated so that no personally identifiable information can be determined for the user, or a user’s geographic location may be generalized where location information is obtained (such as to a city, ZIP code, or state level), so that a particular location of a user cannot be determined. Thus, the user may have control over what information is collected about the user, how that information is used, and what information is provided to the user.
[0096] Personalization options may depend on how the cross-lingual media editing technology is deployed, e.g., on the client device versus on a back-end application server. One option for an on-device model may involve fine tuning the model(s) using data from the device for personalization. By way of example, as the user uses one or more applications, the on-device model may use or flag certain text segments as personalized inputs to update, retrain or otherwise refine the model.
Exemplary Methods
[0097] Fig. 7A illustrates a method 700 in accordance with the technology discussed herein. The method includes, at block 702, receiving, by one or more processors of a computing system, edits in a first language to an item of content translated from a second language. The item of content includes audio in the second language and at least one of imagery or video content. At block 704, the method includes applying, by the one or more processors, the edits in the first language and contextual information of the item of content to a trained model to generate corresponding edits in a selected language to match at least one of semantics, intent or tone of the edits in the first language. At block 706, the method includes creating, by the one or more processors, revised audio in the selected language for the item of content according to the corresponding edits. And at block 708 the method includes generating, by the one or more processors, a revised version of the item of content having the revised audio in the selected language.
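A compact sketch that mirrors blocks 702-708 is shown below; the model interface and helper method names are hypothetical stand-ins for the components described above.

```python
# Minimal sketch mirroring blocks 702-708 of method 700; generate_edits(),
# synthesize_audio() and render_revised_content() are hypothetical stand-ins.
def cross_lingual_edit(model, item_of_content, edits_first_lang, context, selected_lang):
    # Block 704: generate corresponding edits in the selected language.
    corresponding_edits = model.generate_edits(edits_first_lang, context, selected_lang)
    # Block 706: create revised audio in the selected language.
    revised_audio = model.synthesize_audio(corresponding_edits, selected_lang)
    # Block 708: generate the revised item of content with the revised audio.
    return model.render_revised_content(item_of_content, revised_audio)
```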
[0098] Fig. 7B illustrates another method 720 in accordance with the technology discussed herein. At block 722 the method includes receiving, by one or more processors of a computing device, edits in a first language to an item of content translated from a second language. The item of content includes audio in the second language and at least one of imagery or video content. At block 724 the method includes providing, by the one or more processors, the edits in the first language and contextual information of the item of content to a trained model to obtain generated corresponding edits in a selected language to match at least one of semantics, intent or tone of the edits in the first language. At block 726, the method includes obtaining, by the one or more processors, revised audio in the selected language for the item of content according to the corresponding edits. At block 728, the method includes obtaining, by the one or more processors, a revised version of the item of content having the revised audio in the selected language. And at block 730, the method includes presenting, by the one or more processors via a user interface subsystem of the computing device, the revised version of the item of content.

[0099] Although the technology herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present technology. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present technology as defined by the appended claims.

Claims

1. A method, comprising: receiving, by one or more processors of a computing system, edits in a first language to an item of content translated from a second language, the item of content including audio in the second language and at least one of imagery or video content; applying, by the one or more processors, the edits in the first language and contextual information of the item of content to a trained model to generate corresponding edits in a selected language to match at least one of semantics, intent or tone of the edits in the first language; creating, by the one or more processors, revised audio in the selected language for the item of content according to the corresponding edits; and generating, by the one or more processors, a revised version of the item of content having the revised audio in the selected language.
2. The method of claim 1, wherein the revised version of the item of content includes one or more changes to sync the revised audio with the imagery or video content.
3. The method of claim 2, wherein the changes to sync the revised audio with the imagery or video content include modifications to the imagery or video content.
4. The method of claim 1, wherein the selected language is the second language.
5. The method of claim 1, wherein the second language is a source language of the item of content.
6. The method of claim 1, further comprising: prior to receiving the edits in the first language, generating a transcript, in the first language, corresponding to the item of content; and causing the transcript in the first language to be presented to a user.
7. The method of claim 6, further comprising: prior to generating the transcript in the first language, generating a transcript in the second language of the audio of the item of content; wherein the transcript in the first language is generated by translating the transcript in the second language into the first language.
8. The method of claim 1, wherein creating the revised audio in the selected language includes aligning the revised audio with the corresponding edits in the selected language.
9. The method of claim 1, wherein the corresponding edits comprise a set of suggestions in the selected language, and the method further comprises: causing the set of suggestions to be presented to a user; receiving acceptance of at least a portion of the set of suggestions in response to the presentation; and upon receiving the acceptance, performing the creating of the revised audio using the at least the portion of the set of suggestions.
10. The method of claim 1, further comprising generating an edited transcript in the second language.
11. The method of claim 10, further comprising incorporating the edited transcript in the second language as captioning for the revised version of the item of content.
12. The method of claim 1, wherein the edits in the first language include one or more of an insertion, deletion or re-arrangement of a word or phrase.
13. The method of claim 1, wherein creating the revised audio in the selected language is done using an alignment matrix between words or phrases between a transcript in the first language and a transcript in the second language.
14. The method of claim 1, wherein the contextual information of the item of content comprises one or more of audio information or video information associated with the item of content.
15. A computing system, comprising: memory configured to store a trained neural network model; and one or more processors operatively coupled to the memory, the one or more processors being configured to: receive edits in a first language to an item of content translated from a second language, the item of content including audio in the second language and at least one of imagery or video content; apply the edits in the first language and contextual information of the item of content to the trained neural network model to generate corresponding edits in a selected language to match at least one of semantics, intent or tone of the edits in the first language; create revised audio in the selected language for the item of content according to the corresponding edits; and generate a revised version of the item of content having the revised audio in the selected language.
16. The computing system of claim 15, wherein the revised version of the item of content includes one or more changes to sync the revised audio with the imagery or video content.
17. The computing system of claim 15, wherein the one or more processors are further configured: prior to reception of the edits in the first language, to generate a transcript, in the first language, corresponding to the item of content; and to cause the transcript in the first language to be presented to a user.
18. The computing system of claim 17, wherein: the one or more processors are further configured, prior to generation of the transcript in the first language, to generate a transcript in the second language of the audio of the item of content; and the transcript in the first language being generated by translation of the transcript in the second language into the first language.
19. The computing system of claim 15, wherein the creation of the revised audio in the selected language includes alignment of the revised audio with the corresponding edits in the selected language.
20. The computing system of claim 15, wherein the corresponding edits comprise a set of suggestions in the selected language, and the one or more processors are further configured to: cause the set of suggestions to be presented to a user; receive acceptance of at least a portion of the set of suggestions in response to the presentation; and upon reception of the acceptance, perform the creation of the revised audio using the at least the portion of the set of suggestions.
21. The computing system of claim 15, wherein the one or more processors are further configured to generate an edited transcript in the second language.
22. The computing system of claim 15, wherein the creation of the revised audio in the selected language is done using an alignment matrix between words or phrases between a transcript in the first language and a transcript in the second language.
23. A method, comprising: receiving, by one or more processors of a computing device, edits in a first language to an item of content translated from a second language, the item of content including audio in the second language and at least one of imagery or video content; providing, by the one or more processors, the edits in the first language and contextual information of the item of content to a trained model to obtain generated corresponding edits in a selected language to match at least one of semantics, intent or tone of the edits in the first language; obtaining, by the one or more processors, revised audio in the selected language for the item of content according to the corresponding edits; obtaining, by the one or more processors, a revised version of the item of content having the revised audio in the selected language; and presenting, by the one or more processors via a user interface subsystem of the computing device, the revised version of the item of content.
24. The method of claim 23, wherein: providing the edits in the first language and the contextual information of the item of content to the trained model to obtain generated corresponding edits in the selected language includes communicating the edits in the first language and the contextual information to a remote computing device; and obtaining the revised audio in the selected language includes obtaining the revised audio from the remote computing device.
25. The method of claim 23, wherein the revised version of the item of content includes one or more changes to sync the revised audio with the imagery or video content.
26. The method of claim 25, wherein the changes to sync the revised audio with the imagery or video content include modifications to the imagery or video content.
27. The method of claim 23, wherein the selected language is the second language.
28. The method of claim 23, wherein the second language is a source language of the item of content.
29. The method of claim 23, further comprising: prior to receiving the edits in the first language, causing generation of a transcript in the first language corresponding to the item of content; and presenting the transcript in the first language to a user of the computing device.
30. The method of claim 29, further comprising: prior to causing generation of the transcript in the first language, causing generation of a transcript in the second language of the audio of the item of content; wherein the transcript in the first language is generated by translating the transcript in the second language into the first language.
31. The method of claim 23, wherein the revised audio in the selected language is obtained via aligning the revised audio with the corresponding edits in the selected language.
32. The method of claim 23, wherein the corresponding edits comprise a set of suggestions in the selected language, and the method further comprises: presenting the set of suggestions to a user of the computing device; receiving acceptance of at least a portion of the set of suggestions in response to the presentation; and upon receiving the acceptance, obtaining the revised audio using the at least the portion of the set of suggestions.
33. The method of claim 23, further comprising obtaining an edited transcript in the second language.
34. The method of claim 33, further comprising presenting, via the user interface subsystem, the edited transcript in the second language as captioning for the revised version of the item of content.
35. The method of claim 23, wherein the edits in the first language include one or more of an insertion, deletion or re-arrangement of a word or phrase.
36. The method of claim 23, wherein the revised audio in the selected language is created using an alignment matrix between words or phrases between a transcript in the first language and a transcript in the second language.