
Live playback video generation and playing method and device, storage medium and electronic equipment

Info

Publication number
CN112399258B
Authority
CN
China
Prior art keywords
live
video
voice data
video stream
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910745446.XA
Other languages
Chinese (zh)
Other versions
CN112399258A (en)
Inventor
陈春勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910745446.XA
Publication of CN112399258A
Application granted
Publication of CN112399258B
Legal status: Active (current)
Anticipated expiration

Abstract

The present disclosure provides a live playback video generation method and apparatus, a live playback video playing method and apparatus, an electronic device, and a storage medium, and relates to the field of communication technologies. The live playback video generation method comprises the following steps: acquiring a live video stream in a live broadcast process, and monitoring whether the live video stream is associated with a recommendation; when it is monitored that the live video stream is associated with a recommendation, determining whether target voice data corresponding to the live video stream matches the recommendation; adding a marker to the live video stream when the target voice data matches the recommendation; and generating a live playback video using a plurality of live video streams, wherein the plurality of live video streams includes the live video stream to which the marker is added. The present disclosure can help a user quickly locate a video segment of interest while watching a live playback video.

Description

Live playback video generation and playing method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of audio and video technologies, and in particular, to a live playback video generation method, a live playback video generation apparatus, a live playback video playing method, a live playback video playing apparatus, an electronic device, and a computer-readable storage medium.
Background
With the development of network technology, live streaming has become a popular form of entertainment. In a live scene, an anchor user can broadcast in a live room, and audience users can log in to a server and enter the anchor user's live room to watch the anchor's live video.
Many anchor users introduce recommendations, such as goods or services, to audience users during the live broadcast. When an audience user misses the anchor user's live broadcast, the user can learn the live content by watching the live playback video.
An existing live playback video is usually obtained by directly splicing live video streams or by directly recording the live frames; the specific video clips in which the anchor user introduces a recommendation cannot be identified, so the user cannot quickly locate the video clips of interest when watching the playback video.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
An object of the embodiments of the present disclosure is to provide a live playback video generation method, a live playback video generation apparatus, an electronic device, and a computer-readable storage medium, which can help a user quickly locate a video segment of interest when watching a live playback video.
According to an aspect of the present disclosure, there is provided a live playback video generation method including:
acquiring a live video stream in a live broadcasting process, and monitoring whether the live video stream is associated with a recommendation;
when it is monitored that the live video stream is associated with a recommendation, determining whether target voice data corresponding to the live video stream matches the recommendation;
adding a marker to the live video stream when the target speech data matches the recommendation;
generating a live playback video using a plurality of live video streams, wherein the plurality of live video streams includes the live video stream to which the marker is added.
In an exemplary embodiment of the present disclosure, monitoring whether the live video stream is associated with a recommendation includes:
and monitoring whether a second layer related to the recommendation exists on a first layer where a live frame of the live video stream is located.
In an exemplary embodiment of the present disclosure, the method further comprises:
determining whether real-time voice data corresponding to the live video stream matches a reference feature;
and when the real-time voice data matches the reference feature, starting to collect the real-time voice data to form the target voice data.
In an exemplary embodiment of the present disclosure, determining whether real-time voice data corresponding to the live video stream matches a reference feature includes:
obtaining sample voice data of a target object, and framing the sample voice data;
extracting a feature vector of each frame of the sample voice data, and training based on the feature vector of each frame of the sample voice data to obtain the reference feature;
framing the real-time voice data corresponding to the live video stream, and extracting a feature vector of each frame of the real-time voice data;
and determining whether the real-time voice data matches the reference feature according to the similarity between the feature vector of each frame of the real-time voice data and the reference feature.
In an exemplary embodiment of the present disclosure, the method further comprises:
and when the similarity between the feature vector of each frame of the real-time voice data and the reference feature is greater than a threshold value, determining that the real-time voice data matches the reference feature.
In an exemplary embodiment of the present disclosure, determining whether the target speech data matches the recommendation includes:
performing text recognition on the target voice data to obtain first text data;
acquiring text information related to the recommendation as second text data;
and determining whether the target voice data matches the recommendation according to the similarity between the first text data and the second text data.
In an exemplary embodiment of the present disclosure, the method further comprises:
determining that the target speech data matches the recommendation when a similarity between the first text data and the second text data is greater than a threshold.
In an exemplary embodiment of the present disclosure, the current live video stream has identification information pointing to a previous live video stream, and generating a live playback video using a plurality of live video streams includes:
determining the order of the live video streams according to the received identification information of each live video stream;
sorting the live video streams in the determined order;
and splicing the sorted live video streams to obtain the live playback video.
According to an aspect of the present disclosure, there is provided a live playback video playing method, including:
acquiring a live playback video; the live playback video comprises one or more markers, and each marker corresponds to a video clip related to a recommendation;
and when it is detected that any marker is triggered, controlling the live playback video to jump to the video clip corresponding to the marker.
In an exemplary embodiment of the present disclosure, the method further comprises:
providing a time axis according to the time length of the live playback video;
presenting each marker on the timeline according to the position, in the live playback video, of the video clip related to each recommendation.
In an exemplary embodiment of the present disclosure, the method further comprises:
and taking the marker closest to the current playback position of the live playback video as a target marker, and providing guidance information of the recommendation corresponding to the target marker.
According to an aspect of the present disclosure, there is provided a live playback video generation apparatus including:
a video stream acquisition module, configured to acquire a live video stream in a live broadcast process and monitor whether the live video stream is associated with a recommendation;
a voice matching module, configured to determine, when it is monitored that the live video stream is associated with a recommendation, whether target voice data corresponding to the live video stream matches the recommendation;
a marker adding module, configured to add a marker to the live video stream when the target voice data matches the recommendation;
and a video generation module, configured to generate a live playback video using a plurality of live video streams, wherein the plurality of live video streams includes the live video stream to which the marker is added.
In an exemplary embodiment of the disclosure, the video stream acquisition module monitors whether the live video stream is associated with a recommendation by monitoring whether a second layer related to the recommendation exists on a first layer where a live frame of the live video stream is located.
In an exemplary embodiment of the present disclosure, the apparatus further includes:
a target voice determination module, configured to determine whether real-time voice data corresponding to the live video stream matches a reference feature, and when the real-time voice data matches the reference feature, start collecting the real-time voice data to form the target voice data.
In an exemplary embodiment of the present disclosure, the target voice determination module determines whether real-time voice data corresponding to the live video stream matches a reference feature by: obtaining sample voice data of a target object and framing the sample voice data; extracting a feature vector of each frame of the sample voice data, and training based on these feature vectors to obtain the reference feature; framing the real-time voice data corresponding to the live video stream and extracting a feature vector of each frame of the real-time voice data; and determining whether the real-time voice data matches the reference feature according to the similarity between the feature vector of each frame of the real-time voice data and the reference feature.
In an exemplary embodiment of the disclosure, the target voice determination module is further configured to determine that the real-time voice data matches the reference feature when the similarity between the feature vector of each frame of the real-time voice data and the reference feature is greater than a threshold value.
In an exemplary embodiment of the disclosure, the voice matching module determines whether the target voice data matches the recommendation by: performing text recognition on the target voice data to obtain first text data; acquiring text information related to the recommendation as second text data; and determining whether the target voice data matches the recommendation according to the similarity between the first text data and the second text data.
In an exemplary embodiment of the disclosure, the voice matching module is further configured to: determining that the target speech data matches the recommendation when a similarity between the first text data and the second text data is greater than a threshold.
In an exemplary embodiment of the present disclosure, the current live video stream has identification information pointing to a previous live video stream, and the video generation module generates the live playback video by: determining the order of the live video streams according to the received identification information of each live video stream; sorting the live video streams in the determined order; and splicing the sorted live video streams to obtain the live playback video.
According to an aspect of the present disclosure, there is provided a live playback video playback apparatus including:
a video acquisition module, configured to acquire a live playback video, where the live playback video comprises one or more markers, and each marker corresponds to a video clip related to a recommendation;
and a playing control module, configured to control the live playback video to jump to the video clip corresponding to a marker when it is detected that the marker is triggered.
In an exemplary embodiment of the present disclosure, the apparatus further includes:
a marker display module, configured to provide a time axis according to the duration of the live playback video, and present each marker on the time axis according to the position, in the live playback video, of the video clip related to each recommendation.
In an exemplary embodiment of the disclosure, the marker display module is further configured to take the marker closest to the current playback position of the live playback video as a target marker, and provide guidance information of the recommendation corresponding to the target marker.
According to an aspect of the present disclosure, there is provided an electronic device including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of any one of the above via execution of the executable instructions.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of the above.
Exemplary embodiments of the present disclosure may have some or all of the following benefits:
in the live playback video generation method provided by an example embodiment of the disclosure, when a live video stream is associated with a recommendation, it is determined whether target voice data corresponding to the live video stream matches the recommendation, and when the target voice data matches the recommendation, a marker is added to the live video stream; therefore, after the live playback video is synthesized, the video clips in which the anchor user or other users introduce a recommendation can be quickly located through the added markers, which in turn helps the user quickly locate the video clips of interest when watching the live playback video, reduces the user's review time cost, and improves the efficiency of information transmission.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It should be apparent that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived by those of ordinary skill in the art without inventive effort.
Fig. 1 is a schematic diagram illustrating an exemplary system architecture of a live playback video generation method and apparatus to which embodiments of the present disclosure may be applied;
FIG. 2 illustrates a schematic structural diagram of a computer system suitable for use with the electronic device used to implement embodiments of the present disclosure;
fig. 3 schematically illustrates a flow diagram of a live playback video generation method according to one embodiment of the present disclosure;
FIG. 4 schematically illustrates a live interface diagram in one embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart of a process of determining target speech data in one embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow chart of a process of determining whether real-time speech data matches a reference feature in one embodiment of the present disclosure;
FIG. 7 is a schematic diagram illustrating speech data framing processing in one embodiment of the present disclosure;
FIG. 8 schematically illustrates a flow chart of a process of determining whether the target speech data matches the recommendation in one embodiment of the present disclosure;
fig. 9 schematically illustrates a flow diagram of a live playback video playback method according to one embodiment of the present disclosure;
Figs. 10A and 10B schematically show live playback video playing interface diagrams in one embodiment of the disclosure;
fig. 11 schematically shows a flow diagram of a live playback video generation method and a playing method according to one embodiment of the present disclosure;
FIG. 12 schematically illustrates a flow chart of a process of determining whether an anchor user begins speaking in one embodiment of the present disclosure;
fig. 13 schematically illustrates a block diagram of a live playback video generation apparatus according to one embodiment of the present disclosure;
fig. 14 schematically illustrates a block diagram of a live playback video playback apparatus according to one embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
Fig. 1 is a schematic diagram illustrating a system architecture of an exemplary application environment to which a live playback video generation method and apparatus, and a live playback video playing method and apparatus according to an embodiment of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables, to name a few. The terminal devices 101, 102, 103 may be various electronic devices having a display screen, including but not limited to desktop computers, portable computers, smart phones, tablet computers, and the like. It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 105 may be a server cluster comprised of multiple servers, or the like.
The live playback video generation or playing method provided by the embodiments of the present disclosure may be executed by the terminal devices 101, 102, and 103, and accordingly, the live playback video generation or playing apparatus may be disposed in the terminal devices 101, 102, and 103. The method may also be executed by the server 105, and accordingly, the apparatus may also be disposed in the server 105. The method may likewise be executed jointly by the terminal devices 101, 102, and 103 and the server 105, in which case the apparatus may be disposed in both the terminal devices and the server 105, which is not particularly limited in this exemplary embodiment. For example, in one exemplary embodiment, after the terminal devices 101, 102, 103 acquire the live video stream during the live broadcast, they monitor whether the live video stream is associated with a recommendation; when the live video stream is associated with a recommendation, the target voice data corresponding to the live video stream is sent to the server 105 to confirm whether the target voice data matches the recommendation; the server 105 returns the matching result to the terminal devices 101, 102, 103 so that they add a marker to the live video stream accordingly; thereafter, the terminal devices 101, 102, 103 upload the live video streams, with and without the added marker, to the server 105, so that the server 105 generates the live playback video.
FIG. 2 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present disclosure.
It should be noted that the computer system 200 of the electronic device shown in fig. 2 is only an example, and should not bring any limitation to the functions and the scope of application of the embodiments of the present disclosure.
As shown in fig. 2, the computer system 200 includes a Central Processing Unit (CPU) 201 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 202 or a program loaded from a storage section 208 into a Random Access Memory (RAM) 203. In the RAM 203, various programs and data necessary for system operation are also stored. The CPU 201, ROM 202, and RAM 203 are connected to each other via a bus 204. An input/output (I/O) interface 205 is also connected to the bus 204.
The following components are connected to the I/O interface 205: an input portion 206 including a keyboard, a mouse, and the like; an output section 207 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 208 including a hard disk and the like; and a communication section 209 including a network interface card such as a LAN card or a modem. The communication section 209 performs communication processing via a network such as the Internet. A drive 210 is also connected to the I/O interface 205 as needed. A removable medium 211, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 210 as necessary, so that a computer program read out therefrom is installed into the storage section 208 as needed.
In particular, the processes described below with reference to the flowcharts may be implemented as computer software programs, according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 209 and/or installed from the removable medium 211. The computer program, when executed by the Central Processing Unit (CPU) 201, performs various functions defined in the methods and apparatus of the present application. In some embodiments, the computer system 200 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making.
Artificial intelligence is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning.
The key technologies of speech processing (Speech Technology) include automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and speech is expected to become one of the most promising modes of human-computer interaction.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science, and mathematics; research in this field involves natural language, i.e. the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Some steps of the technical solution in the present disclosure may involve a speech processing technology and a natural language processing technology. The technical scheme of the embodiment of the disclosure is explained in detail as follows:
the present example embodiment provides a live playback video generation method. The live playback video generation method can be applied to one or more of theterminal devices 101, 102 and 103; may also be applied to theserver 105 described above; but may also be applied to one or more of the above-mentionedterminal devices 101, 102, 103 and the above-mentionedserver 105 at the same time. Referring to fig. 3, the live playback video generation method may include the steps of:
s310, acquiring a live video stream in a live broadcasting process, and monitoring whether the live video stream is associated with a recommendation;
step S320, when the fact that the live video stream is related to a recommendation is monitored, whether target voice data corresponding to the live video stream is matched with the recommendation or not is determined;
s330, when the target voice data is matched with the recommendation, adding a mark to the live video stream;
step S340, generating a live broadcast playback video by utilizing a plurality of live broadcast video streams; wherein the plurality of live video streams includes the live video stream to which the mark is added.
In the live playback video generation method provided by this exemplary embodiment, after the live playback video is synthesized, the video clips in which the anchor user or other users introduce a recommendation can be quickly located through the added markers. This helps the user quickly locate the video clips of interest when watching the live playback video, reduces the user's review time cost, improves the efficiency of information transmission, and can also improve the commercial conversion rate.
Next, in another embodiment, the above steps are explained in more detail.
In step S310, a live video stream in a live process is acquired, and whether the live video stream is associated with a recommendation is monitored.
In this example embodiment, during the live broadcast, the terminal device at the anchor end may continuously push the live video stream to the server, and correspondingly, the terminal device at the viewer end may continuously pull the live video stream from the server. For example, during the anchor user's live broadcast, the terminal device at the anchor end collects image data and voice data in real time and outputs video data meeting the video coding requirements, such as video data in YUV or RGB format; the terminal device at the anchor end can then encode and package the video data using a preset coding and packaging mode to obtain a live video stream, and upload the live video stream to a server. An audience user can log in to the server on the viewer terminal device and enter the anchor user's live room to obtain the live video stream; after the viewer terminal device acquires the live video stream, a live frame can be generated from the live video stream for the audience user to watch.
Fig. 4 is a schematic diagram of a live frame in this exemplary embodiment. On the viewer terminal device, the user can like, follow, send gifts, send bullet comments, and the like on the basis of the live frame; in addition, many anchor users recommend items, such as goods or services, to audience users during the live broadcast. In this exemplary embodiment, when recommending an item, the anchor user may upload guidance information of the recommendation, for example a purchase link, to the server; after the server pushes the guidance information to the viewer terminal device, the viewer terminal device may generate a second layer 412 related to the recommendation on the first layer 411 where the live frame is located, and display the guidance information. The presentation mode may be, for example, a bubble pop-up window in which the name, price, and so on of the recommendation are displayed; when the viewer clicks the bubble pop-up window, the viewer jumps to the detailed description page of the recommendation.
Therefore, in this exemplary embodiment, monitoring whether the live video stream is associated with a recommendation may be implemented by monitoring whether a second layer 412 related to the recommendation exists on the first layer 411 where the live frame of the live video stream is located; for example, whether the second layer 412 exists may be determined by monitoring the layer rendering interface of the viewer terminal device. Of course, depending on how the guidance information of the recommendation is displayed, other exemplary embodiments of the present disclosure may monitor whether the live video stream is associated with a recommendation in other manners, which is not limited in this exemplary embodiment.
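As a rough illustration of this layer-monitoring idea, the sketch below checks whether a recommendation layer sits on top of the live-frame layer. The `render_tree` object, its methods, and the tag names are hypothetical stand-ins for a real client rendering API, not anything defined in this disclosure.

```python
# Hypothetical layer-stack check; `render_tree`, `find`, `children`, and the
# tag names are illustrative stand-ins for a real client rendering interface.
def stream_has_recommendation(render_tree) -> bool:
    """True if a recommendation layer (412) exists on the live-frame layer (411)."""
    live_layer = render_tree.find(tag="live_frame")        # first layer 411
    if live_layer is None:
        return False
    return any(child.tag == "recommendation_bubble"        # second layer 412
               for child in live_layer.children)
```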
In step S320, when it is monitored that the live video stream is associated with a recommendation, it is determined whether target voice data corresponding to the live video stream matches the recommendation.
In this example embodiment, after determining that the live video stream is associated with a recommendation, it is necessary to determine when the anchor user introduces the recommendation. To this end, it may be determined whether target voice data corresponding to the live video stream matches the recommendation, so as to determine whether the anchor introduces the recommendation in the current live video stream. The target voice data may be, for example, voice data of a target user such as the anchor user; in some scenarios, the target voice data may also be voice data of other target users, which is not particularly limited in this exemplary embodiment.
In an actual live scene, the voice data in the live video stream generally does not include only the target voice data; for example, it may also include background music and environmental sound, and in some live scenes, such as game live scenes, in-game voice data, game sound effects, and the like. Therefore, in this exemplary embodiment, voice data is collected only after determining that the anchor user or another specified target user has started speaking; this not only reduces the computer's resource consumption but also reduces the amount of data in subsequent matching. Further, as shown in fig. 5, in this exemplary embodiment, the live playback video generation method may further include the following steps S510 and S520. Wherein:
in step S510, it is determined whether real-time voice data corresponding to the live video stream matches a reference feature. For example, referring to fig. 6, in the present exemplary embodiment, it may be determined whether the real-time speech data and the reference feature are matched through the following steps S610 to S640. Wherein:
in step S610, sample voice data of a target object is acquired and framed. For example, in the present exemplary embodiment, the voice data of the anchor user may be collected in advance as sample voice data; the acquisition mode may be actively uploaded by the anchor user, or may be extracted from other live videos of the anchor user, which is not particularly limited in this exemplary embodiment. After the sample Voice data of the target object is obtained, framing processing may be performed on the sample Voice data by using, for example, Voice Activity Detection (VAD) technology; for example, sample speech data is first segmented into segments; referring to FIG. 7, there will typically be an overlap between the various small segments; if each segment is 25 milliseconds in length, there is a 15 millisecond overlap between each two segments; therefore, in order to reduce the interference to the subsequent steps, the silence at the head and tail ends of each segment can be cut off, so as to obtain the framing result for the sample voice data, i.e. each segment is one frame after processing. Of course, in other exemplary embodiments of the present disclosure, the sample speech data may be framed in other manners, which also belongs to the scope of the present disclosure.
In step S620, a feature vector of each frame of the sample speech data is extracted, and the reference feature is obtained by training on the feature vectors of the frames of the sample speech data. For example, after the sample voice data is framed, the waveform of each frame can be converted into a multidimensional vector according to the physiological characteristics of the human ear, and the multidimensional vectors of the frames ordered in time sequence to form a data matrix; feature extraction can then be performed on the data matrix through a deep convolutional network model or in other ways. For example, the data matrix may be input into a deep convolutional network model, a convolutional feature matrix obtained through a forward propagation operation, and the reference feature obtained by flattening the convolutional feature matrix. In addition, as those skilled in the art readily understand, the reference feature may also be extracted in other ways, which is not particularly limited in this exemplary embodiment.
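The sketch below illustrates the "stack per-frame vectors, convolve, flatten" idea with a toy PyTorch module. The per-frame vectors (e.g. 13-dimensional MFCCs), layer sizes, and pooling choice are assumptions for illustration, not the patent's actual model or training objective.

```python
import torch
import torch.nn as nn

class RefFeatureNet(nn.Module):
    """Toy convolutional extractor: per-frame vectors in, flat feature out."""
    def __init__(self, dim_per_frame: int = 13):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(dim_per_frame, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(16),  # fixed length regardless of frame count
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, dim_per_frame, n_frames) -> (batch, 32 * 16)
        return self.conv(frames).flatten(start_dim=1)
```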
In step S630, the real-time speech data corresponding to the live video stream is framed, and feature vectors of each frame of the real-time speech data are extracted. In this example embodiment, when it is monitored that a second layer related to the recommendation exists on the first layer where the live frame is located, real-time voice data corresponding to the live video stream may be collected; after collection, the real-time voice data may be framed in a manner similar to steps S610 and S620, and the feature vectors of its frames extracted, which is not repeated here.
In step S640, it is determined whether the real-time speech data matches the reference feature according to the similarity between the feature vector of each frame of the real-time speech data and the reference feature. For example, in the present exemplary embodiment, a feature vector of each frame of real-time speech data may be compared with the reference feature by using a DTW (dynamic time warping) algorithm to obtain a similarity therebetween. In addition, in other exemplary embodiments of the present disclosure, feature extraction may also be performed on feature vectors of each frame of the real-time speech data through, for example, a deep convolutional network model or in other manners, so as to obtain feature data of the real-time speech data, and then whether the real-time speech data matches the reference feature is determined according to a similarity between the feature data and the reference feature, which is not limited in this exemplary embodiment.
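A textbook DTW distance between two frame-feature sequences, written out for clarity; converting the distance into a similarity score via 1/(1+d) is one common convention assumed here, not something mandated by the text.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """O(len(a)*len(b)) dynamic time warping over per-frame feature vectors."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 / (1.0 + dtw_distance(a, b))  # maps distance into (0, 1]
```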
In step S520, when the real-time speech data matches the reference feature, the real-time speech data starts to be collected to form the target speech data.
For example, in this exemplary embodiment, it may be determined whether the similarity between the feature vector of each frame of the real-time speech data and the reference feature is greater than a threshold, and when the similarity between the feature vector of each frame of the real-time speech data and the reference feature is greater than a threshold (e.g., 80%), it is determined that the real-time speech data matches the reference feature. The threshold may be a preset fixed value or a dynamically changing value, which is not particularly limited in this exemplary embodiment. In addition, it may also be determined that the real-time speech data matches the reference feature in other manners, for example, when a distance between a feature vector of each frame of the real-time speech data and the reference feature is smaller than a threshold value, it is determined that the real-time speech data matches the reference feature, and the like, which also belongs to the protection scope of the present disclosure.
When it is detected that the real-time voice data matches the reference feature, this indicates that the anchor user or another target user is speaking, so the real-time voice data can be collected by the current terminal device to form the target voice data. In this exemplary embodiment, the target voice data may be stored locally on the terminal device for subsequent local processing, or uploaded to the server for subsequent processing there. After the target voice data is obtained, whether it matches the recommendation can be determined. Referring to fig. 8, in this exemplary embodiment, whether the target voice data matches the recommendation may be determined through steps S810 to S830 described below. Wherein:
in step S810, text recognition is performed on the target voice data to obtain first text data. For example, in this exemplary embodiment, the first text data may be obtained by performing speech recognition on each of the target speech data through one or more of a deep neural network model, a hidden markov model, and a gaussian mixture model. For example, the time-series information may be modeled by a hidden markov model, and after a state of the hidden markov model is given, the probability distribution of the speech feature vector belonging to the state is modeled based on a gaussian mixture model by a maximum expectation value algorithm or the like; after the modeling is successful, voice recognition can be performed on the target voice data to obtain corresponding first text data. Of course, in other exemplary embodiments of the present invention, the speech recognition may also be performed by combining Context information (Context Dependent) or by other methods, which is not particularly limited in this exemplary embodiment.
In step S820, text information related to the recommendation is acquired as second text data. For example, in this exemplary embodiment, when it is monitored that a second layer related to the recommendation exists on the first layer where the live frame is located, the text information related to the recommendation in the second layer may be automatically extracted as second text data; for example, it can be extracted directly from the JSON file corresponding to the second layer. The text information may include the name, price, profile, and other copy of the recommendation. In addition, in other exemplary embodiments of the present disclosure, an image corresponding to the second layer may be obtained, and the text information related to the recommendation may be obtained by performing character recognition on that image, which also falls within the protection scope of the present disclosure.
In step S830, it is determined whether the target speech data matches the recommendation according to the similarity between the first text data and the second text data.
In this example embodiment, the number of words common to the first text data and the second text data may be obtained first, and the similarity between the two determined from the ratio of this common-word count to the total number of words in the second text data. Alternatively, word segmentation may be performed on the first text data and the first text data converted into a first vector according to the vector corresponding to each word; word segmentation is likewise performed on the second text data, which is converted into a second vector; the similarity between the first text data and the second text data is then determined according to the distance (such as a Euclidean distance, Manhattan distance, cosine distance, or Hamming distance) between the first vector and the second vector. The similarity may also be calculated by other methods, such as the BM25 algorithm, which is not particularly limited in this exemplary embodiment.
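Two of the similarity measures mentioned above, sketched in Python under stated assumptions: the `jieba` segmenter is assumed for Chinese word segmentation (any tokenizer would do), and the bag-of-words cosine is one simple vectorization, not the patent's prescribed one.

```python
import numpy as np
import jieba  # assumed Chinese word-segmentation library

def overlap_similarity(first_text: str, second_text: str) -> float:
    """Shared-word count over the total word count of the second text."""
    first_words = set(jieba.lcut(first_text))
    second_words = jieba.lcut(second_text)
    shared = sum(1 for w in second_words if w in first_words)
    return shared / max(len(second_words), 1)

def cosine_similarity(first_text: str, second_text: str) -> float:
    """Cosine similarity between bag-of-words vectors of the two texts."""
    w1, w2 = jieba.lcut(first_text), jieba.lcut(second_text)
    vocab = sorted(set(w1) | set(w2))
    v1 = np.array([w1.count(w) for w in vocab], dtype=float)
    v2 = np.array([w2.count(w) for w in vocab], dtype=float)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(v1 @ v2 / denom) if denom else 0.0
```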
In step S330, when the target speech data matches the recommendation, a marker is added to the live video stream.
For example, in this example embodiment, it may be determined whether the similarity between the first text data and the second text data is greater than a threshold, and when the similarity is greater than the threshold (e.g., 90%), it is determined that the target speech data matches the recommendation; otherwise, it is determined that the target voice data does not match the recommendation, and new target voice data continues to be acquired and checked for a match. The threshold may be a preset fixed value or a dynamically changing value, which is not particularly limited in this exemplary embodiment.
Taking the case where the matching process is executed in the server as an example: after determining that the target voice data matches the recommendation, the server can return the determination result to the terminal device; after receiving the result, the terminal device automatically adds a marker to the corresponding live video stream. The marker may be added by, for example, writing data related to the identification information of the recommendation (such as its ID or unique code) into the extension information of the live video stream, thereby establishing an association between the marker and the corresponding recommendation. In addition, in other exemplary embodiments of the present disclosure, the marker may be added to the live video stream in other manners, which is not limited in this exemplary embodiment.
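What the marker written into the stream's extension information might carry, expressed as a Python dict serialized to JSON; the field names and values are invented for illustration and are not the patent's actual wire format.

```python
import json

# Illustrative marker payload; all field names here are hypothetical.
marker = {
    "item_id": "SKU-001",    # identification info of the recommendation
    "offset_seconds": 495,   # e.g. the introduction starts at 8 min 15 s
    "item_name": "article A",
}
extension_blob = json.dumps(marker, ensure_ascii=False)
```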
Step S340: generating a live playback video using a plurality of live video streams, wherein the plurality of live video streams includes the live video stream to which the marker is added.
In this example embodiment, each current live video stream has identification information pointing to the previous live video stream; for example, the identification information may be a timestamp. After the terminal device uploads all live video streams of the current broadcast to the server, the server can determine the order of the live video streams according to the received identification information of each stream, sort the live video streams in that order, and splice the sorted live video streams to obtain the live playback video; thereafter, the resulting live playback video may be further processed by compression encoding and the like. Of course, in other exemplary embodiments of the present disclosure, the live playback video may also be generated locally on the terminal device, which is not limited in this exemplary embodiment.
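A minimal sketch of this ordering-and-splicing step, assuming each received chunk is a dict carrying a timestamp-style identifier and a media payload; real splicing would remux the media containers rather than concatenate raw bytes.

```python
def splice_streams(chunks: list[dict]) -> bytes:
    """Order chunks by their identification info, then concatenate payloads."""
    ordered = sorted(chunks, key=lambda c: c["timestamp"])
    return b"".join(c["payload"] for c in ordered)
```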
In the above exemplary embodiment, the live video streams are sorted in time order, but in other exemplary embodiments of the present disclosure they may be sorted by other rules, for example by the item information or the supplier information of the recommendation. In addition, in the above exemplary embodiment all live video streams of a broadcast are spliced, but in other exemplary embodiments only part of the live video streams may be selected for splicing, for example only the live video streams associated with recommendations, or only the live video streams to which markers have been added; these also fall within the scope of the present disclosure.
Because a marker is added to the extension information of the live video stream in which the anchor user introduces a recommendation, the extension information of the spliced live playback video naturally also includes the marker. For example, the markers may be stored in a track element of the live playback video: a series of marker-related text files may be stored in the track element, containing data in a format such as JSON or CSV. When the live playback video is played, the text files in the track element are read to generate the markers, and the user can quickly locate a video clip of interest according to the markers.
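Reading the marker records back out of the track element might look like the sketch below, assuming each record is a JSON text shaped like the illustrative payload shown earlier (CSV records would be handled analogously).

```python
import json

def load_markers(track_texts: list[str]) -> list[dict]:
    """Parse marker records stored as JSON texts in the track element."""
    return [json.loads(t) for t in track_texts]
```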
This exemplary embodiment further provides a live playback video playing method, which is used to play the live playback video generated in the foregoing exemplary embodiment. The live playback video playing method can be applied to one or more of the terminal devices 101, 102, and 103; it may also be applied to the server 105; or it may be applied to one or more of the terminal devices 101, 102, 103 and the server 105 at the same time. Referring to fig. 9, the live playback video playing method may include steps S910 to S920 described below. Wherein:
in step S910, a live playback video is acquired; the live playback video comprises one or more marks, and each mark corresponds to a video clip related to the media. In this exemplary embodiment, after the live playback video with the mark is generated by the live playback video generation method, the live playback video may be stored in a server; and then after the audience user enters a playback interface of the live broadcast room, the terminal equipment can pull the live broadcast playback video from the server to play. Of course, if the live playback video is stored locally in the terminal device, the live playback video can also be directly acquired locally from the terminal device to be played.
In step S920, when it is detected that any marker is triggered, the live playback video is controlled to jump to the video segment corresponding to the marker.
For example, referring to fig. 10A, in this exemplary embodiment, a time axis 1010 may first be provided according to the duration of the live playback video, and each marker may then be presented on the time axis 1010 according to the position, in the live playback video, of the video clip related to each recommendation. For example, if the live playback video is 19 minutes 23 seconds long, a time axis 1010 covering all time points between 0 and 19 minutes 23 seconds is provided. After the time axis 1010 is provided, the text files contained in the track elements of the live playback video may be read, thereby generating the markers and displaying them on the time axis 1010. As shown in fig. 10A, markers 1021 to 1023 are displayed on the time axis, and their positions on the time axis 1010 correspond to the positions of the video clips related to the respective recommendations in the live playback video; for example, if the video clip related to article A starts at 8 minutes 15 seconds and ends at 8 minutes 55 seconds, the marker 1021 corresponding to article A is located at the 8-minutes-15-seconds position on the time axis. When a marker on the time axis is detected to be triggered, for example by a user's click or other trigger operation, the live playback video can be controlled to jump to the video clip corresponding to the marker.
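A sketch of placing markers on the time axis and jumping on a trigger, using the same dict-shaped markers as above; `player.seek` is a hypothetical player method, and the pixel mapping assumes a linear time axis.

```python
def marker_x_px(offset_s: float, duration_s: float, axis_px: int) -> int:
    """Map a marker's offset onto the time axis, e.g. 495 s (8 min 15 s)
    of a 1163 s (19 min 23 s) video onto a 600 px axis."""
    return round(offset_s / duration_s * axis_px)

def on_marker_triggered(player, marker: dict) -> None:
    """Jump playback to the clip the marker points at (hypothetical API)."""
    player.seek(marker["offset_seconds"])
```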
As shown in fig. 10A, when a marker on the time axis is triggered, guidance information of the recommendation corresponding to the marker may be displayed; for example, when a user's click on a marker on the time axis is detected, guidance information of the corresponding recommendation may be displayed in a floating window or another form. The guidance information may include a name, price, profile, and other copy. Furthermore, in this exemplary embodiment, during live playback video playing, if the currently played video clip does not correspond to any of the markers, the next marker on the time axis may be taken as the target marker, and guidance information of the recommendation corresponding to the target marker may be provided. For example, referring to fig. 10B, the video clip corresponding to article A starts at 8 minutes 15 seconds and ends at 8 minutes 55 seconds, and the video clip corresponding to article B starts at 12 minutes 4 seconds; at 9 minutes 36 seconds the currently played video segment corresponds to none of markers 1021, 1022, and 1023, so the next marker 1022 can be used as the target marker and guidance information of the recommendation corresponding to the target marker 1022 can be provided. Additional copy, such as "best-selling item" or "item with the highest repurchase rate", may also be added to the guidance information of the recommendation corresponding to the target marker 1022. Based on this scheme, after the video clip related to one recommendation is played, the user can be guided by the guidance information to watch the video clip related to the next recommendation.
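Choosing the target marker when the current position falls between clips, per the 9-minutes-36-seconds example above; a small sketch using the same dict-shaped markers as before.

```python
def pick_target_marker(markers: list[dict], now_s: float) -> dict | None:
    """Return the next marker at or after the current playback position,
    e.g. at 576 s (9 min 36 s) the 724 s (12 min 4 s) marker for article B."""
    upcoming = [m for m in markers if m["offset_seconds"] >= now_s]
    return min(upcoming, key=lambda m: m["offset_seconds"], default=None)
```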
In the above-described exemplary embodiment, markers 1021, 1022, and 1023 are displayed on the time axis of the live playback video; however, in other exemplary embodiments of the present disclosure, markers 1021, 1022, and 1023 may be displayed in other locations. When a marker displayed elsewhere is triggered, the live playback video can likewise be controlled to jump to the video segment related to the recommendation corresponding to that marker; this is also within the scope of the present disclosure.
The live playback video generation method and the live playback video playing method in the present exemplary embodiment are further described below with reference to specific scenes.
Referring to fig. 11, in step S1101, when the anchor user starts live broadcasting, the terminal device at the anchor end continuously pushes the live video stream to the server, and correspondingly, the terminal device at the viewer end continuously pulls the live video stream from the server. In step S1102, if it is monitored that a second layer related to a recommendation, such as a bubble pop-up window, exists on the first layer where the live frame of the live video stream is located, it is determined that the current live video stream is associated with the recommendation. In step S1103, when it is determined that the current live video stream is associated with a recommendation, whether the anchor user starts speaking is determined through steps S1201 to S1206 shown in fig. 12; after it is determined that the anchor user has started speaking, real-time voice data is collected to form target voice data, which is sent to the server. In step S1104, the server decodes the received target voice data and determines whether it matches the recommendation; for example, it determines whether the text corresponding to the target voice data includes words related to the recommendation. In step S1105, after determining that the target voice data matches the recommendation, the server returns the determination result to the terminal device, and the terminal device adds a marker to the current live video stream in response. In step S1106, each live video stream is uploaded to the server. In step S1107, the server generates a live playback video using a plurality of live video streams, including the marked live video stream described above. In step S1108, after entering the live room, an audience user can request the live playback video on the playback interface; while the live playback video is playing, whether the user triggers an added marker is detected. In step S1109, when it is detected that the user triggers a marker, the live playback video is controlled to jump to the video clip related to the recommendation corresponding to that marker. Steps S1201 to S1206 may include, for example:
in the model training process: in step S1201, sample voice data of a target user, such as the anchor user, is acquired; in step S1202, preprocessing such as framing is performed on the sample voice data; in step S1203, feature extraction is performed on the preprocessed sample voice data; in step S1204, training is performed based on the extracted features to obtain the reference feature. When the model is used: first, feature data of the real-time speech is extracted by a method similar to steps S1201 to S1204 above. In step S1205, the similarity between the feature data of the real-time speech and the reference feature is calculated. In step S1206, when the calculated similarity is greater than a threshold, it is determined that the real-time voice data matches the reference feature; when the real-time voice data matches the reference feature, it indicates that the anchor user or another target user is speaking, so the current terminal device can start collecting the real-time voice data to form the target voice data.
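As one possible realization of steps S1201 to S1206, the sketch below uses MFCC frame features and mean cosine similarity; these concrete choices are assumptions for illustration, since the patent does not prescribe a particular feature or similarity measure.

```python
import numpy as np
import librosa  # assumed available for framing and feature extraction

def frame_features(audio: np.ndarray, sr: int) -> np.ndarray:
    """Steps S1202-S1203: frame the audio and extract one MFCC vector per frame."""
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13).T  # shape (frames, 13)

def train_reference(sample_audio: np.ndarray, sr: int) -> np.ndarray:
    """Steps S1201-S1204: reduce the anchor's sample speech to a reference feature."""
    return frame_features(sample_audio, sr).mean(axis=0)

def matches_reference(live_audio: np.ndarray, sr: int,
                      reference: np.ndarray, threshold: float = 0.8) -> bool:
    """Steps S1205-S1206: mean cosine similarity of live frames vs. the reference."""
    feats = frame_features(live_audio, sr)
    sims = feats @ reference / (
        np.linalg.norm(feats, axis=1) * np.linalg.norm(reference) + 1e-9)
    return float(sims.mean()) > threshold
```

Only when matches_reference returns true would the terminal device start accumulating the real-time voice data into the target voice data.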
It should be noted that although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order or that all of the depicted steps must be performed to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
Further, in the present exemplary embodiment, a live playback video generation apparatus is also provided. The live playback video generation apparatus may be applied to a terminal device, or may be applied to a terminal device and a server simultaneously. Referring to fig. 13, the live playback video generation apparatus 1300 may include a video stream acquisition module 1310, a voice matching module 1320, a mark adding module 1330, and a video generation module 1340. Wherein:
the video stream acquisition module 1310 may be configured to acquire a live video stream in a live broadcast process and monitor whether the live video stream is associated with a recommendation;
the voice matching module 1320 may be configured to determine whether target voice data corresponding to the live video stream matches a recommendation when it is monitored that the live video stream is associated with the recommendation;
the mark adding module 1330 may be configured to add a mark to the live video stream when the target voice data matches the recommendation;
the video generation module 1340 may be configured to generate a live playback video using a plurality of the live video streams; wherein the plurality of live video streams includes the live video stream to which the mark is added.
In an exemplary embodiment of the disclosure, the video stream acquisition module 1310 may monitor whether the live video stream is associated with a recommendation by: monitoring whether a second layer related to the recommendation exists on a first layer where a live frame of the live video stream is located.
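As a minimal sketch of this monitoring, assuming a hypothetical layer stack exposed by the live-broadcast client (the patent does not prescribe any particular UI API, so the Layer structure and tags below are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Layer:
    z_index: int   # stacking order; larger values sit on top
    tag: str       # e.g. "live_frame", "recommendation_popup"

def is_associated_with_recommendation(layers: list) -> bool:
    """True if a recommendation-related layer sits above the live-frame layer."""
    live = next((l for l in layers if l.tag == "live_frame"), None)
    if live is None:
        return False
    return any(l.tag == "recommendation_popup" and l.z_index > live.z_index
               for l in layers)
```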
In an exemplary embodiment of the present disclosure, the apparatus further includes:
the target voice determination module may be configured to determine whether real-time voice data corresponding to the live video stream matches a reference feature, and, when the real-time voice data matches the reference feature, to start collecting the real-time voice data to form the target voice data.
In an exemplary embodiment of the disclosure, the target voice determination module may determine whether real-time voice data corresponding to the live video stream matches a reference feature by: obtaining sample voice data of a target object, and framing the sample voice data; extracting a feature vector of each frame of the sample voice data, and training based on the feature vector of each frame of the sample voice data to obtain the reference feature; framing the real-time voice data corresponding to the live video stream, and extracting a feature vector of each frame of the real-time voice data; and determining whether the real-time voice data is matched with the reference feature according to the similarity between the feature vector of each frame of the real-time voice data and the reference feature.
In an exemplary embodiment of the disclosure, the target voice determination module may be further configured to: determine that the real-time voice data matches the reference feature when the similarity between the feature vector of each frame of the real-time voice data and the reference feature is greater than a threshold.
In an exemplary embodiment of the present disclosure, the voice matching module 1320 may determine whether the target voice data matches the recommendation by: performing text recognition on the target voice data to obtain first text data; acquiring text information related to the recommendation as second text data; and determining whether the target voice data matches the recommendation according to the similarity between the first text data and the second text data.
In an exemplary embodiment of the disclosure, the voice matching module 1320 may be further configured to: determine that the target voice data matches the recommendation when the similarity between the first text data and the second text data is greater than a threshold.
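A minimal sketch of this matching, assuming speech recognition has already produced the first text data and using character-bigram Jaccard similarity as a stand-in for whatever similarity measure an implementation actually adopts:

```python
def bigrams(text: str) -> set:
    """Character bigrams, which work for both Chinese and English text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

def matches_recommendation(first_text: str, second_text: str,
                           threshold: float = 0.3) -> bool:
    """Jaccard similarity between the recognized speech text and the
    recommendation's text information, compared against a threshold."""
    a, b = bigrams(first_text), bigrams(second_text)
    if not a or not b:
        return False
    return len(a & b) / len(a | b) > threshold
```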
In an exemplary embodiment of the present disclosure, the current live video stream has identification information pointing to the previous live video stream; the video generation module 1340 may generate the live playback video by: determining the order of each live video stream according to the received identification information of each live video stream; sorting the live video streams according to the determined order; and splicing the sorted live video streams to obtain the live playback video.
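Because each live video stream carries identification information pointing to its predecessor, the received streams form a linked list that can be reordered even if they arrive out of order. A minimal sketch (the field names are assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StreamSegment:
    stream_id: str
    prev_id: Optional[str]  # identification info pointing to the previous stream
    data: bytes

def order_segments(segments: list) -> list:
    """Follow the prev_id chain starting from the segment with no predecessor."""
    by_prev = {s.prev_id: s for s in segments}
    ordered, current = [], by_prev.get(None)
    while current is not None:
        ordered.append(current)
        current = by_prev.get(current.stream_id)
    return ordered

def splice(segments: list) -> bytes:
    """Concatenate the ordered segments into the live playback video."""
    return b"".join(s.data for s in order_segments(segments))
```

In a real system the splicing would happen at the container level (for example by concatenating transport-stream segments or remuxing), but the ordering logic is the same.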
The specific details of each module or unit in the live playback video generation apparatus have been described in detail in the corresponding live playback video generation method, and therefore are not described herein again.
Further, in this exemplary embodiment, a live playback video playing apparatus is also provided, which is used to play the live playback video generated in the above exemplary embodiment. The live playback video playing apparatus may be applied to a terminal device, or may be applied to a terminal device and a server simultaneously. Referring to fig. 14, the live playback video playing apparatus 1400 may include a video acquisition module 1410 and a play control module 1420. Wherein:
the video acquisition module 1410 may be configured to acquire a live playback video; the live playback video includes one or more marks, and each mark corresponds to a video clip related to a recommendation;
the play control module 1420 may be configured to, when it is detected that any of the marks is triggered, control the live playback video to jump to the video segment corresponding to that mark.
In an exemplary embodiment of the present disclosure, the apparatus may further include: a mark display module, which may be configured to provide a time axis according to the duration of the live playback video, and to present each mark on the time axis according to the position of the video clip related to each recommendation in the live playback video.
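Placing the marks on the time axis is a simple proportional mapping from each clip's start time to a horizontal offset; a minimal sketch (the pixel width is an assumed parameter):

```python
def mark_positions(clip_starts_s: list, duration_s: float,
                   timeline_width_px: int) -> list:
    """Map each clip's start time to a pixel offset on the time axis."""
    return [round(start / duration_s * timeline_width_px)
            for start in clip_starts_s]

# Marks for clips starting at 8:15 and 12:04 in a 20-minute playback video,
# drawn on a 600-pixel-wide time axis:
print(mark_positions([495, 724], 1200, 600))  # -> [248, 362]
```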
In an exemplary embodiment of the present disclosure, the mark display module may be further configured to: take the mark closest to the current playing progress of the live playback video as a target mark, and provide guidance information of the recommendation corresponding to the target mark.
The specific details of each module or unit in the live playback video playing apparatus have been described in detail in the corresponding live playback video playing method, and therefore are not described herein again.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the methods described in the above embodiments. For example, the electronic device may implement the steps shown in fig. 3 to fig. 12.
It should be noted that the computer readable media shown in the present disclosure may be computer readable signal media or computer readable storage media or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
