US20240015371A1 - Systems and methods for generating a video summary of a virtual event - Google Patents

Systems and methods for generating a video summary of a virtual event

Info

Publication number
US20240015371A1
Authority
US
United States
Prior art keywords
generate
embedding
transcription
image
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US17/811,732
Other versions
US11889168B1
Inventor
Subham BISWAS
Saurabh Tahiliani
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Verizon Patent and Licensing Inc
Original Assignee
Verizon Patent and Licensing Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Verizon Patent and Licensing Inc
Priority to US17/811,732 (US11889168B1)
Assigned to Verizon Patent and Licensing Inc. Assignment of assignors interest (see document for details). Assignors: BISWAS, SUBHAM; TAHILIANI, SAURABH
Priority to US18/389,764 (US12200322B2)
Publication of US20240015371A1
Application granted
Publication of US11889168B1
Legal status: Active
Adjusted expiration

Abstract

A video summary device may generate a textual summary of a transcription of a virtual event. The video summary device may generate a phonemic transcription of the textual summary and generate a text embedding based on the phonemic transcription. The video summary device may generate an audio embedding based on a target voice. The video summary device may generate an audio output of the phonemic transcription uttered by the target voice. The audio output may be generated based on the text embedding and the audio embedding. The video summary device may generate an image embedding based on video data of a target user. The image embedding may include information regarding images of facial movements of the target user. The video summary device may generate a video output of different facial movements of the target user uttering the phonemic transcription, based on the text embedding and the image embedding.
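Read as a pipeline, the abstract chains five generation stages: summarize, phonemize, embed (text, audio, image), synthesize audio, and render video. Below is a minimal sketch of that flow, in which every attribute of models (summarize, phonemize, encode_text, and the rest) is a hypothetical placeholder rather than a component the patent names.

    # Sketch of the pipeline in the abstract. Every attribute of `models`
    # is a hypothetical placeholder; the patent names no concrete models.
    def generate_video_summary(transcription, target_voice_audio,
                               target_user_video, models):
        summary = models.summarize(transcription)            # textual summary
        phonemes = models.phonemize(summary)                 # phonemic transcription
        text_emb = models.encode_text(phonemes)              # text embedding
        audio_emb = models.encode_voice(target_voice_audio)  # audio embedding
        image_emb = models.encode_face(target_user_video)    # image embedding
        audio_out = models.synthesize_audio(text_emb, audio_emb)
        video_out = models.render_video(text_emb, image_emb)
        return models.mux(audio_out, video_out)              # final video summary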

Claims (20)

What is claimed is:
1. A method performed by a video summary device, the method comprising:
generating a textual summary of a transcription of a virtual event;
generating a phonemic transcription of the textual summary;
generating a text embedding based on the phonemic transcription, wherein the text embedding includes information regarding text classification of the phonemic transcription;
generating an audio embedding based on a target voice, wherein the audio embedding includes information regarding audio classification of the target voice;
generating an audio output of the phonemic transcription uttered by the target voice, wherein the audio output is generated based on the text embedding and the audio embedding;
generating an image embedding based on video data of a target user, wherein the image embedding includes information regarding images of facial movements of the target user;
generating a video output of different facial movements of the target user uttering the phonemic transcription, wherein the video output is generated based on the text embedding and the image embedding;
generating a video summary of the virtual event based on the audio output and the video output; and
providing the video summary to a user device.
2. The method of claim 1, wherein the information regarding text classification comprises one or more of information regarding grammatical rules associated with the textual summary, information regarding contexts associated with the textual summary, information regarding semantics associated with the textual summary, or information regarding emotions conveyed by the textual summary, and
wherein the information regarding audio classification comprises one or more of information regarding an amplitude of the target voice, information regarding a frequency of the target voice, or information regarding a tone of the target voice.
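As an illustration of the audio-classification features claim 2 lists (amplitude, frequency, tone), the sketch below computes simple proxies for each from a mono sample using NumPy; the specific feature definitions are assumptions for illustration, not the patent's.

    import numpy as np

    def basic_voice_features(samples: np.ndarray, sample_rate: int) -> dict:
        # Amplitude: root-mean-square level of the signal.
        rms = float(np.sqrt(np.mean(samples ** 2)))
        # Frequency: dominant frequency from the magnitude spectrum.
        spectrum = np.abs(np.fft.rfft(samples))
        freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
        dominant_hz = float(freqs[np.argmax(spectrum)])
        # Tone: spectral centroid as a crude proxy for vocal brightness.
        centroid_hz = float(np.sum(freqs * spectrum) / np.sum(spectrum))
        return {"amplitude_rms": rms,
                "dominant_frequency_hz": dominant_hz,
                "tone_centroid_hz": centroid_hz}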
3. The method of claim 1, wherein generating the textual summary comprises:
processing the transcription to generate a preprocessed input; and
processing the preprocessed input, using a machine learning model, to generate the textual summary.
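Claim 3's two steps (preprocess, then summarize with a machine learning model) map onto off-the-shelf tooling. A sketch assuming a Hugging Face summarization pipeline and a BART checkpoint, neither of which the patent specifies:

    from transformers import pipeline

    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

    def textual_summary(transcription: str) -> str:
        # Preprocessing: collapse whitespace and drop empty lines.
        preprocessed = " ".join(
            line.strip() for line in transcription.splitlines() if line.strip())
        # Summarize the preprocessed input with the ML model.
        result = summarizer(preprocessed, max_length=150, min_length=30,
                            do_sample=False)
        return result[0]["summary_text"]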
4. The method of claim 3, wherein processing the transcription comprises:
determining a type of utterance for each portion of a plurality of portions of the transcription of the virtual event;
filtering the plurality of portions, based on the type of utterance determined for each portion of the plurality of portions, to generate filtered portions; and
generating the textual summary based on the filtered portions.
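Claim 4 filters transcript portions by utterance type before summarizing, for example dropping greetings and filler while keeping substantive speech. A sketch in which classify_utterance and the type set are stand-ins for whatever classifier an implementation actually uses:

    # Keep only content-bearing utterance types; the type set and the
    # classifier are illustrative assumptions.
    CONTENT_TYPES = {"statement", "question", "decision", "action_item"}

    def filter_portions(portions, classify_utterance):
        filtered = []
        for portion in portions:
            utterance_type = classify_utterance(portion)  # e.g. "greeting", "filler"
            if utterance_type in CONTENT_TYPES:
                filtered.append(portion)
        return filtered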
5. The method of claim 1, wherein generating the audio output comprises:
generating a spectrogram based on the text embedding and the audio embedding; and
generating a waveform based on the spectrogram,
wherein the audio output includes the waveform.
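The second step of claim 5, turning a spectrogram into a waveform, is a vocoder. Neural vocoders are the usual choice; as a dependency-light illustration, librosa's Griffin-Lim implementation can invert a magnitude spectrogram (the library choice is an assumption, not the patent's):

    import librosa
    import numpy as np

    def spectrogram_to_waveform(magnitude_spec: np.ndarray) -> np.ndarray:
        # Griffin-Lim iteratively estimates phase, then inverts the STFT
        # to recover time-domain audio samples.
        return librosa.griffinlim(magnitude_spec, n_iter=60)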
6. The method of claim 1, wherein generating the video output comprises:
generating a plurality of images for each portion of a plurality of portions of the phonemic transcription,
wherein the plurality of images, of a particular portion of the plurality of portions of the phonemic transcription, depict the target user uttering the particular portion, and
wherein the video output includes the plurality of images generated for each portion of the plurality of portions of the phonemic transcription.
7. The method of claim 6, wherein generating the plurality of images comprises:
generating, based on the text embedding and the image embedding, a first image of the plurality of images; and
generating a second image of the plurality of images after generating the first image,
wherein the second image is determined based on the first image, the text embedding, and the image embedding.
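Claims 6 and 7 describe frame-by-frame video generation in which each image after the first is conditioned on its predecessor as well as the two embeddings, which keeps successive mouth shapes coherent. A sketch with generate_frame as a placeholder for the actual image model:

    def render_frames(text_emb, image_emb, generate_frame, n_frames):
        frames = []
        previous = None  # the first frame is conditioned on embeddings alone
        for _ in range(n_frames):
            frame = generate_frame(text_emb, image_emb, previous)
            frames.append(frame)
            previous = frame  # later frames also see the frame just produced
        return frames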
8. A device, comprising:
one or more processors configured to:
generate a phonemic transcription of a virtual event;
generate a text embedding based on the phonemic transcription, wherein the text embedding includes information regarding text classification of the phonemic transcription;
generate an audio embedding based on a target voice, wherein the audio embedding includes information regarding audio classification of the target voice;
generate an audio output of the phonemic transcription being uttered by the target voice, wherein the audio output is generated based on the text embedding and the audio embedding;
generate an image embedding based on video data of a target user, wherein the image embedding includes information regarding images of facial movements of the target user;
generate a video output of different facial movements of the target user uttering the phonemic transcription, wherein the video output is generated based on the text embedding and the image embedding; and
generate a video summary of the virtual event based on the audio output and the video output,
wherein the video summary is provided to one or more devices.
9. The device of claim 8, wherein the one or more processors, to generate the audio output, are configured to:
combine the audio embedding and the text embedding to generate a combined embedding;
provide the combined embedding as an input to a neural network model to cause the neural network model to generate a spectrogram; and
generate the audio output based on the spectrogram.
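Claim 9 combines the text and audio embeddings into a single input for the spectrogram-predicting neural network. Concatenation is the simplest combination operator; the PyTorch sketch below assumes it, along with an arbitrary two-layer network, since the patent fixes neither:

    import torch
    import torch.nn as nn

    class SpectrogramNet(nn.Module):
        def __init__(self, text_dim: int, audio_dim: int, n_mels: int = 80):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(text_dim + audio_dim, 256),
                nn.ReLU(),
                nn.Linear(256, n_mels))

        def forward(self, text_emb, audio_emb):
            # Combined embedding: concatenate along the feature axis.
            combined = torch.cat([text_emb, audio_emb], dim=-1)
            return self.net(combined)  # one mel-spectrogram frame per step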
10. The device of claim 8, wherein the one or more processors, to generate the phonemic transcription, are configured to:
generate a textual summary of a transcription of the virtual event; and
generate the phonemic transcription based on the textual summary.
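For claim 10's text-to-phoneme step, a grapheme-to-phoneme converter such as the g2p_en package (an assumption here; the patent names no library) maps the textual summary to ARPAbet phonemes:

    from g2p_en import G2p

    g2p = G2p()
    phonemes = g2p("Generate a video summary of the meeting.")
    # -> a list of ARPAbet symbols with stress markers, e.g. 'JH', 'EH1', ...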
11. The device of claim 10, wherein the one or more processors are further configured to:
determine a tag for each portion of a plurality of portions of the transcription of the virtual event,
wherein the transcription of the virtual event identifies one or more participants of the virtual event;
identify one or more pronouns, included in the transcription of the virtual event, based on the tag determined for each portion of the plurality of portions; and
replace the one or more pronouns with information identifying a respective participant, of the one or more participants, that uttered the one or more pronouns.
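Claim 11 resolves pronouns against the speaker attribution in the transcript. A narrow sketch using spaCy part-of-speech tags, covering only a speaker's own first-person pronouns (real coreference resolution would be broader):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def replace_pronouns(portion_text: str, speaker_name: str) -> str:
        doc = nlp(portion_text)
        out = []
        for token in doc:
            # Tag-based check: swap first-person pronouns for the speaker.
            if token.pos_ == "PRON" and token.lower_ in {"i", "me"}:
                out.append(speaker_name)
            else:
                out.append(token.text)
            out.append(token.whitespace_)
        return "".join(out)

    # replace_pronouns("I will send the report.", "Alice")
    # -> "Alice will send the report."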
12. The device of claim 8, wherein the one or more processors, to generate the video output, are configured to:
generate a plurality of images for each portion of a plurality of portions of the phonemic transcription,
wherein the plurality of images, of a particular portion of the plurality of portions of the phonemic transcription, depict the target user uttering the particular portion, and
wherein the video output includes the plurality of images generated for each portion of the plurality of portions of the phonemic transcription.
13. The device of claim 8, wherein the one or more processors, to generate the audio output, are configured to:
generate the audio output using a first machine learning model, and
wherein the one or more processors, to generate the video output, are configured to:
generate the video output using a second machine learning model.
14. The device of claim 8, wherein the one or more processors, to generate the video output, are configured to:
modify one or more pixel values of an image of the target user to generate a particular image that is included in the video output,
wherein the particular image is generated based on the text embedding and the image embedding.
15. A non-transitory computer-readable medium storing a set of instructions, the set of instructions comprising:
one or more instructions that, when executed by one or more processors of a device, cause the device to:
generate a phonemic transcription of a virtual event;
generate a text embedding based on the phonemic transcription, wherein the text embedding includes information regarding text classification of the phonemic transcription;
generate an audio embedding based on a target voice, wherein the audio embedding includes information regarding audio classification of the target voice;
generate an audio output of the phonemic transcription being uttered by the target voice, wherein the audio output is generated based on the text embedding and the audio embedding;
generate an image embedding based on video data of a target user, wherein the image embedding includes information regarding images of facial movements of the target user;
generate a video output of different facial movements of the target user uttering the phonemic transcription, wherein the video output is generated based on the text embedding and the image embedding; and
generate a video summary of the virtual event based on the audio output and the video output.
16. The non-transitory computer-readable medium of claim 15, wherein the one or more instructions, when executed by the one or more processors, further cause the device to:
determine a label for each portion of a plurality of portions of a transcription of the virtual event;
filter the plurality of portions, based on the label determined for each portion of the plurality of portions, to generate filtered portions;
generate a textual summary of the virtual event based on the filtered portions; and
generate the phonemic transcription based on the textual summary.
17. The non-transitory computer-readable medium of claim 15, wherein the one or more instructions, that cause the device to generate the video output, cause the device to:
generate a plurality of images for each portion of a plurality of portions of the phonemic transcription,
wherein the plurality of images, of a particular portion of the plurality of portions of the phonemic transcription, depict the target user uttering the particular portion, and
wherein the video output includes the plurality of images generated for each portion of the plurality of portions of the phonemic transcription.
18. The non-transitory computer-readable medium of claim 17, wherein the one or more instructions, that cause the device to generate the plurality of images, cause the device to:
generate, based on the text embedding and the image embedding, a first image of the plurality of images; and
generate a second image of the plurality of images after generating the first image,
wherein the second image is determined based on the first image, the text embedding, and the image embedding.
19. The non-transitory computer-readable medium of claim 15, wherein the one or more instructions, that cause the device to generate the audio output, cause the device to:
generate a spectrogram based on the text embedding and the audio embedding; and
generate a waveform based on the spectrogram,
wherein the audio output includes the waveform.
20. The non-transitory computer-readable medium of claim 15, wherein the one or more instructions, that cause the device to generate the phonemic transcription, cause the device to:
generate a textual summary of a transcription of the virtual event; and
generate the phonemic transcription based on the textual summary.
US17/811,732, filed 2022-07-11 (priority date 2022-07-11): Systems and methods for generating a video summary of a virtual event. Active; adjusted expiration 2042-07-13; granted as US11889168B1.

Priority Applications (2)

Application Number | Priority Date | Filing Date | Title
US17/811,732 (US11889168B1) | 2022-07-11 | 2022-07-11 | Systems and methods for generating a video summary of a virtual event
US18/389,764 (US12200322B2) | 2022-07-11 | 2023-12-19 | Systems and methods for generating a video summary of a virtual event

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
US17/811,732 (US11889168B1) | 2022-07-11 | 2022-07-11 | Systems and methods for generating a video summary of a virtual event

Related Child Applications (1)

Application Number | Relation | Priority Date | Filing Date | Title
US18/389,764 (US12200322B2) | Continuation | 2022-07-11 | 2023-12-19 | Systems and methods for generating a video summary of a virtual event

Publications (2)

Publication Number | Publication Date
US20240015371A1 | 2024-01-11
US11889168B1 | 2024-01-30

Family

ID: 89431038

Family Applications (2)

Application NumberTitlePriority DateFiling Date
US17/811,732Active2042-07-13US11889168B1 (en)2022-07-112022-07-11Systems and methods for generating a video summary of a virtual event
US18/389,764ActiveUS12200322B2 (en)2022-07-112023-12-19Systems and methods for generating a video summary of a virtual event

Family Applications After (1)

Application NumberTitlePriority DateFiling Date
US18/389,764ActiveUS12200322B2 (en)2022-07-112023-12-19Systems and methods for generating a video summary of a virtual event

Country Status (1)

Country | Link
US (2) | US11889168B1


Citations (8)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US8209623B2* | 2003-12-05 | 2012-06-26 | Sony Deutschland GmbH | Visualization and control techniques for multimedia digital content
US20120284276A1* | 2011-05-02 | 2012-11-08 | Barry Fernando | Access to Annotated Digital File Via a Network
US20160014482A1* | 2014-07-14 | 2016-01-14 | The Board of Trustees of the Leland Stanford Junior University | Systems and Methods for Generating Video Summary Sequences From One or More Video Segments
US10096033B2* | 2011-09-15 | 2018-10-09 | Stephan HEATH | System and method for providing educational related social/geo/promo link promotional data sets for end user display of interactive ad links, promotions and sale of products, goods, and/or services integrated with 3D spatial geomapping, company and local information for selected worldwide locations and social networking
US20210081056A1* | 2015-12-07 | 2021-03-18 | SRI International | VPA with integrated object recognition and facial expression recognition
US11334618B1* | 2020-11-17 | 2022-05-17 | Audiocodes Ltd. | Device, system, and method of capturing the moment in audio discussions and recordings
US20230169990A1* | 2021-12-01 | 2023-06-01 | Verizon Patent and Licensing Inc. | Emotionally-aware voice response generation method and apparatus
US20230283851A1* | 2014-02-26 | 2023-09-07 | Rovi Guides, Inc. | Methods and systems for supplementing media assets during fast-access playback operations

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
NO327155B1* | 2005-10-19 | 2009-05-04 | Fast Search & Transfer ASA | Procedure for displaying video data within result presentations in systems for accessing and searching for information
US9635307B1* | 2015-12-18 | 2017-04-25 | Amazon Technologies, Inc. | Preview streaming of video data
US10885942B2* | 2018-09-18 | 2021-01-05 | AT&T Intellectual Property I, L.P. | Video-log production system
US20220067385A1* | 2020-09-03 | 2022-03-03 | Sony Interactive Entertainment Inc. | Multimodal game video summarization with metadata
US11616658B2* | 2021-04-30 | 2023-03-28 | Zoom Video Communications, Inc. | Automated recording highlights for conferences
US20230140369A1* | 2021-10-28 | 2023-05-04 | Adobe Inc. | Customizable framework to extract moments of interest
US12124508B2* | 2022-07-12 | 2024-10-22 | Adobe Inc. | Multimodal intent discovery system
US20240037824A1* | 2022-07-26 | 2024-02-01 | Verizon Patent and Licensing Inc. | System and method for generating emotionally-aware virtual facial expressions
US11910073B1* | 2022-08-15 | 2024-02-20 | Amazon Technologies, Inc. | Automated preview generation for video entertainment content
US12367343B2* | 2022-09-16 | 2025-07-22 | Verizon Patent and Licensing Inc. | Systems and methods for adjusting a transcript based on output from a machine learning model


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US12238059B2 | 2021-12-01 | 2025-02-25 | Meta Platforms Technologies, LLC | Generating a summary of a conversation between users for an additional user in response to determining the additional user is joining the conversation
US20240127790A1* | 2022-10-12 | 2024-04-18 | Verizon Patent and Licensing Inc. | Systems and methods for reconstructing voice packets using natural language generation during signal loss
US12334048B2* | 2022-10-12 | 2025-06-17 | Verizon Patent and Licensing Inc. | Systems and methods for reconstructing voice packets using natural language generation during signal loss
US20250182752A1* | 2023-12-05 | 2025-06-05 | Nice Ltd. | System and method for the generation of worklists from interaction recordings

Also Published As

Publication number | Publication date
US20240121487A1 | 2024-04-11
US11889168B1 | 2024-01-30
US12200322B2 | 2025-01-14

Similar Documents

Publication | Title
JP7638282B2 | Anaphora resolution method, system, and program
US11184298B2 | Methods and systems for improving chatbot intent training by correlating user feedback provided subsequent to a failed response to an initial user intent
US10909328B2 | Sentiment adapted communication
US11889168B1 | Systems and methods for generating a video summary of a virtual event
US10339923B2 | Ranking based on speech pattern detection
US20170097929A1 | Facilitating a meeting using graphical text analysis
US10789576B2 | Meeting management system
US11443227B2 | System and method for cognitive multilingual speech training and recognition
US11645561B2 | Question answering system influenced by user behavior and text metadata generation
US11955127B2 | Cognitive correlation of group interactions
US11734348B2 | Intelligent audio composition guidance
US11158210B2 | Cognitive real-time feedback speaking coach on a mobile device
US11682318B2 | Methods and systems for assisting pronunciation correction
US12243438B2 | Enhancing video language learning by providing catered context sensitive expressions
US10616532B1 | Behavioral influence system in socially collaborative tools
US12394405B2 | Systems and methods for reconstructing video data using contextually-aware multi-modal generation during signal loss
US11750671B2 | Cognitive encapsulation of group meetings
US12367343B2 | Systems and methods for adjusting a transcript based on output from a machine learning model
US11397857B2 | Methods and systems for managing chatbots with respect to rare entities
JP2023530970A | A system for voice-to-text tagging of rich transcripts of human speech
US11386056B2 | Duplicate multimedia entity identification and processing
US12334048B2 | Systems and methods for reconstructing voice packets using natural language generation during signal loss
US11483262B2 | Contextually-aware personalized chatbot
US12190887B2 | Adversarial speech-text protection against automated analysis
US12306809B2 | Identifying duplication multimedia entities

Legal Events

AS: Assignment
Owner name: VERIZON PATENT AND LICENSING INC., NEW JERSEY
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: BISWAS, SUBHAM; TAHILIANI, SAURABH; REEL/FRAME: 060475/0968
Effective date: 20220710

FEPP: Fee payment procedure
Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STCF: Information on status: patent grant
Free format text: PATENTED CASE

