
CN117058575A - Target object identification method, device, equipment and storage medium - Google Patents

Target object identification method, device, equipment and storage medium

Info

Publication number
CN117058575A
CN117058575A
Authority
CN
China
Prior art keywords
lip
target object
target
image sequence
video file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310947120.1A
Other languages
Chinese (zh)
Inventor
孙超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202310947120.1A
Publication of CN117058575A
Status: Pending

Abstract

The invention relates to the technical field of artificial intelligence, and in particular to a target object identification method, which includes: obtaining a video file and an audio file corresponding to the video file; performing face detection on the target objects in the video file to obtain a target lip image sequence corresponding to each target object; inputting each audio file into a preset synchronization model and obtaining the synchronized lip video file corresponding to each audio file; extracting the synchronized lip image sequence from the synchronized lip video file; and comparing the target lip image sequence with the synchronized lip image sequence through a preset comparison model to obtain an identity recognition result for the target object. The invention is applied to target object identity recognition in businesses such as finance and insurance: the target object's identity is recognized by comparing the target lip image sequence with the synchronized lip image sequence through the preset comparison model, which improves the accuracy of target object identification in such businesses.

Description

Translated from Chinese
Target object identification method, device, equipment and storage medium

Technical Field

The present invention relates to the field of artificial intelligence technology, and in particular to a target object identification method, device, equipment, and storage medium.

Background

With the continuous development of computer technology, target object identification has advanced considerably in recent years and is applied in a growing number of fields. For example, as the business volume of financial institutions such as banks, securities firms, and insurance companies continues to expand, a large number of identity verification needs arise.

In the prior art, target object identification generally verifies the identity of the target object through voiceprint recognition. For example, in the banking field, when identity information needs to be verified, the voiceprint of the target object is typically matched against pre-stored voiceprint information to determine the target object's identity. This requires the cooperation of a large number of people to store a large amount of voiceprint information in advance so that identities can be confirmed at verification time. Moreover, lip information in face images is generally used by performing lip reading on videos to assist the voiceprint verification; this approach uses lip information inefficiently, so the accuracy of the auxiliary verification is also low.

Summary of the Invention

Embodiments of the present invention provide a target object identification method, device, equipment, and storage medium to address problems in the prior art such as the low utilization efficiency and low accuracy of verifying a target object's identity from lip information.

A target object identification method, including:

obtaining a video file and an audio file corresponding to the video file;

performing face detection on each target object in the video file to obtain a target lip image sequence corresponding to each target object in the video file;

inputting each audio file into a preset synchronization model to generate a synchronized lip video file corresponding to each audio file;

extracting, from the synchronized lip video file, a synchronized lip image sequence corresponding to each target object; and

obtaining a preset comparison model, and comparing, through the preset comparison model, the target lip image sequence and the synchronized lip image sequence corresponding to the same target object to obtain an identity recognition result for that target object.

A target object identification device, including:

a file acquisition module, configured to obtain a video file and an audio file corresponding to the video file;

a face detection module, configured to perform face detection on the target objects in the video file to obtain a target lip image sequence corresponding to each target object in the video file;

a synchronized video module, configured to input each audio file into a preset synchronization model and generate a synchronized lip video file corresponding to each audio file;

an image extraction module, configured to extract, from the synchronized lip video file, a synchronized lip image sequence corresponding to each target object; and

a recognition result module, configured to obtain a preset comparison model, and compare, through the preset comparison model, the target lip image sequence and the synchronized lip image sequence corresponding to the same target object to obtain an identity recognition result for that target object.

A computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the above target object identification method when executing the computer program.

A computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the above target object identification method.

The present invention provides a target object identification method, device, equipment, and storage medium. By performing face detection on the target objects in an acquired video file, the method extracts the lip images of each target object in the video file and thus obtains the target lip image sequences needed in businesses such as finance or insurance. By inputting each audio file into a preset synchronization model, the audio file is converted into a synchronized lip video file. By extracting the synchronized lip image sequence corresponding to each target object from the synchronized lip video file, the synchronized lip image sequence is obtained. By comparing, through a preset comparison model, the target lip image sequence and the synchronized lip image sequence corresponding to the same target object, the identity of the target object is determined, which improves the utilization efficiency of lip images in businesses such as finance or insurance. Further, comparing the two lip image sequences through the preset comparison model allows the target object to be identified from lip images alone, improving the accuracy of target object identification in such businesses.

Description of the Drawings

In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings required for the description of the embodiments are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.

Figure 1 is a schematic diagram of the application environment of the target object identification method in an embodiment of the present invention;

Figure 2 is a flow chart of the target object identification method in an embodiment of the present invention;

Figure 3 is a flow chart of step S30 of the target object identification method in an embodiment of the present invention;

Figure 4 is a flow chart of step S50 of the target object identification method in an embodiment of the present invention;

Figure 5 is a functional block diagram of the target object identification device in an embodiment of the present invention;

Figure 6 is a schematic diagram of a computer device in an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.

The target object identification method provided by the embodiments of the present invention can be applied in the application environment shown in Figure 1. Specifically, the method is applied in a target object identification device that includes a client and a server as shown in Figure 1; the client and the server communicate over a network, and the method is used to address problems in the prior art such as the low utilization efficiency and low accuracy of lip-information-assisted identity verification. The server may be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, a content delivery network (CDN), and big data and artificial intelligence platforms. The client, also known as the user end, is a program that corresponds to the server and provides services to customers; it can be installed on, but is not limited to, various computers, laptops, smartphones, tablets, and portable wearable devices.

In one embodiment, as shown in Figure 2, a target object identification method is provided. Taking the application of the method to the server in Figure 1 as an example, it includes the following steps:

S10: Obtain a video file and an audio file corresponding to the video file.

Understandably, the video file is a video recording of one or more persons. For example, in an insurance scenario, the video file may be a recording of a conversation between a salesperson and a customer while handling an insurance policy; in a banking scenario, it may be a recording of a conversation between a staff member and a user while applying for a credit card. The audio file contains the audio data of the persons in the video file; for example, in an insurance scenario it may be the conversation between the salesperson and the customer, and in a banking scenario the conversation between the staff member and the user. The video and audio files may be prepared in advance and obtained from a database, or sent from the client to the server. For example, during agent quality inspection in an insurance scenario, video files may be preprocessed and stored in a database, or inspected directly in real time.
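In practice, the audio file corresponding to a video file can be obtained by demultiplexing the recording's audio track, for instance with ffmpeg. A minimal sketch that only assembles the command (the tool choice, sample rate, and file names are illustrative assumptions, not specified by the patent):

```python
def build_audio_extract_cmd(video_path: str, audio_path: str) -> list:
    """Build an ffmpeg command that drops the video stream (-vn) and
    saves the audio track as 16 kHz mono WAV, a common input format
    for speech models."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,   # input recording
        "-vn",              # drop the video stream
        "-ac", "1",         # mono
        "-ar", "16000",     # 16 kHz sample rate
        audio_path,
    ]

cmd = build_audio_extract_cmd("session.mp4", "session.wav")
```

The command list can then be run with `subprocess.run(cmd, check=True)` on a host where ffmpeg is installed.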

S20: Perform face detection on the target objects in the video file, and obtain a target lip image sequence corresponding to each target object in the video file.

Understandably, the target lip image sequence is a sequence spliced from multiple target lip images.

Specifically, the face image of each target object in the video file is recognized; that is, key point detection is performed on the face images through computer vision technology. A trained facial key point recognition model performs facial feature recognition on each target object's face image, yielding the facial feature points corresponding to each frame of face image. The facial feature points corresponding to the lips are then filtered out from all facial feature points, and the lip image of each frame is extracted through these points, giving a lip image corresponding to each frame of face image. All lip images are scaled and enhanced, and the scaled lip images of all frames are then filtered: they are arranged in the time order of the frames and sampled at a preset frame interval, yielding the target lip images, which are spliced together to obtain the target lip image sequence. For example, in an insurance scenario, the face images of the salesperson or customer in the acquired video file are detected to obtain the target lip image sequence corresponding to each target object, and the identity information of the corresponding audio file is determined through lip image quality inspection.
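Filtering the lip points out of the full landmark set might look like the sketch below, which assumes the common 68-point landmark convention in which indices 48–67 outline the mouth; the patent itself does not fix a landmark scheme, so both the indices and the margin are illustrative:

```python
def lip_bounding_box(landmarks, margin=4):
    """Given one frame's facial landmarks as (x, y) pairs, return the
    bounding box (x0, y0, x1, y1) of the mouth region. Assumes the
    68-point convention where indices 48-67 are the lip contour."""
    mouth = landmarks[48:68]
    xs = [p[0] for p in mouth]
    ys = [p[1] for p in mouth]
    return (min(xs) - margin, min(ys) - margin,
            max(xs) + margin, max(ys) + margin)

# Toy landmarks: 48 dummy points followed by a small grid as the mouth.
pts = [(0, 0)] * 48 + [(30 + (i % 5) * 2, 60 + (i // 5)) for i in range(20)]
box = lip_bounding_box(pts)
```

The box would then be used to crop the lip region from the corresponding frame before scaling.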

S30: Input each audio file into a preset synchronization model, and generate a synchronized lip video file corresponding to each audio file.

Understandably, the synchronized lip video file is a video generated by the preset synchronization model from the audio file.

Specifically, after the audio files are obtained, speech extraction is performed on each audio file: the speech may be extracted through a trained model or using a publicly available method, yielding the speech information corresponding to each target object. The speech information corresponding to each target object and a preset image are then input into the preset synchronization model, whose feature extraction layer extracts features from the speech information and the preset image separately. A generator then converts the extracted speech and image features to generate a synchronized video. A discriminator judges the synchronization of the lip features in the synchronized video, and a visual quality discriminator is used to improve visual quality and synchronization accuracy, yielding the synchronized lip video file corresponding to each audio file. For example, in the insurance field, the acquired audio file and an image of the salesperson are input into the preset synchronization model to generate the synchronized lip video file corresponding to the audio file.
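The generation loop described above can be reduced to a skeleton: the audio is split into per-frame feature chunks, and the generator produces one lip-synced frame per chunk from the reference image. The generator below is a stand-in stub, not the trained network:

```python
def synthesize_lip_video(audio_feats, ref_image, generator):
    """Produce one output frame per audio feature chunk by calling the
    generator on (chunk, reference image). `generator` is a stub for
    the trained generator network described in the text."""
    return [generator(chunk, ref_image) for chunk in audio_feats]

# Stub generator: pairs the reference image with the chunk's value.
frames = synthesize_lip_video(
    audio_feats=[[0.1], [0.2], [0.3]],
    ref_image="agent.png",
    generator=lambda chunk, img: (img, chunk[0]),
)
```

In the real pipeline the discriminators would score `frames` for lip sync and visual quality before the video is written out.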

S40: Extract, from the synchronized lip video file, the synchronized lip image sequence corresponding to each target object.

Understandably, the synchronized lip image sequence is a sequence spliced from multiple synchronized lip images.

Specifically, face image recognition is performed on each target object in the synchronized lip video file; that is, the trained facial key point recognition model performs facial feature recognition on each target object's face image, yielding the synchronized facial feature points corresponding to each frame of face image. The synchronized lip image corresponding to each frame is determined from all synchronized facial feature points: the points corresponding to the lips are filtered out, and the synchronized lip image of each frame is extracted through them, giving a synchronized lip image for each frame of face image. The synchronized lip images of all frames are then filtered: they are arranged in the time order of the frames and sampled at a preset frame interval, yielding the synchronized lip image sequence. Understandably, the synchronized lip image sequence could be extracted with a method different from that used for the target lip image sequence; in this embodiment, the same method is used for both to improve recognition efficiency.

S50: Obtain a preset comparison model, and compare, through the preset comparison model, the target lip image sequence and the synchronized lip image sequence corresponding to the same target object to obtain an identity recognition result for that target object.

Understandably, the identity recognition result indicates whether the target object in the audio file and the target object in the video file are the same target object. The preset comparison model is built on a Siamese target object recognition network.

Specifically, the lip images of corresponding frames in the target lip image sequence and the synchronized lip image sequence are first compared for similarity: all lip features are passed through depthwise convolution layers to obtain the convolution features corresponding to each lip feature. All convolution features are then processed temporally through a long short-term memory (LSTM) network to obtain temporal features, and the similarity between the two temporal features of each corresponding frame is computed, yielding a similarity value for each frame in the sequence. From all similarity values corresponding to the same video file, a confidence for that video file is determined. When the confidence is greater than or equal to a preset confidence threshold, a first recognition result is confirmed: the target object in the audio file and the target object in the video file are the same target object. When the confidence is less than the preset confidence threshold, a second recognition result is confirmed: they are not the same target object. For example, in an insurance scenario, the target lip image sequence of the acquired video file is compared with the generated synchronized lip image sequence to determine whether the person corresponding to the audio file is the salesperson, and to determine the audio file corresponding to the customer.
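The comparison and thresholding step can be sketched as follows, with the convolution-plus-LSTM feature extractor abstracted into precomputed per-frame feature vectors (the cosine similarity, the aggregation by mean, and the 0.8 threshold are illustrative assumptions):

```python
import math

def cosine(u, v):
    """Cosine similarity of two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def identify(target_feats, synced_feats, threshold=0.8):
    """Compare per-frame temporal features of the two lip sequences,
    aggregate the per-frame similarities into a confidence, and apply
    the preset confidence threshold."""
    sims = [cosine(t, s) for t, s in zip(target_feats, synced_feats)]
    confidence = sum(sims) / len(sims)
    return confidence >= threshold, confidence

# Identical feature tracks -> first recognition result (same person).
same, conf = identify([[1.0, 0.0], [0.6, 0.8]],
                      [[1.0, 0.0], [0.6, 0.8]])
```

A confidence below the threshold would instead yield the second recognition result (different persons).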

In the target object identification method of this embodiment of the present invention, performing face detection on the target objects in the acquired video file extracts the lip images of each target object in the video file, thereby obtaining the target lip image sequences needed in businesses such as finance or insurance. Inputting each audio file into the preset synchronization model converts the audio file into a synchronized lip video file. Extracting the synchronized lip image sequence corresponding to each target object from the synchronized lip video file yields the synchronized lip image sequence. Comparing, through the preset comparison model, the target lip image sequence and the synchronized lip image sequence corresponding to the same target object determines the identity of the target object and improves the utilization efficiency of lip images in businesses such as finance or insurance. Further, comparing the two lip image sequences through the preset comparison model allows the target object to be identified from the lip images, improving the accuracy of target object identification in financial or insurance services.

In one embodiment, step S20 — performing face detection on the target objects in the video file to obtain a target lip image sequence corresponding to each target object in the video file — includes:

S201: Recognize the face image of each target object in the video file to obtain the facial feature points corresponding to each frame of face image.

Understandably, the facial feature points are key points of the face; the key points can be organs such as the eyes and mouth, or features such as a single eyelid or the left and right corners of the lips.

Specifically, facial key point detection is performed on the face image of each target object in the video file: a trained facial key point detection model is obtained, and the video file is input into it. The key point detection network of the model performs feature recognition on each target object's face image and marks the feature points with dots in each corresponding frame, yielding the facial feature points corresponding to each frame of face image. A preset number of facial feature points identifying the contours of the facial features can be recognized in each frame of the video file based on the facial key point detection model.

S202: Determine the lip image corresponding to each frame of face image based on all the facial feature points.

S203: Filter the lip images of all frames to obtain the target lip image sequence.

Understandably, the target lip image sequence consists of lip images of the same target object in different frames.

Specifically, the lip image corresponding to each frame of face image is determined from all facial feature points: the facial key points corresponding to the lips are selected from the facial feature points of each frame, and the lip image of each frame is extracted through them; that is, the region corresponding to the lip key points is segmented from each frame of face image, yielding the lip image corresponding to each frame. Further, each extracted lip image is enhanced by scaling it (for example, to a size of 32x38), and the scaled lip images are sorted in the time order of the frames to obtain an image ordering result. Then, all target lip images are selected from the ordering result at a preset frame interval (set according to the actual situation, for example every 10 frames), and the selected target lip images are spliced in time order into the target lip image sequence. For example, in a financial-institution security verification scenario, to prevent fraud, the speaker needs to be checked to confirm whether he or she is the target person; the check usually requires the speaker to say a passage of content or verification information so that lip information can be extracted and recognized.
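The sort, sample, and splice steps above can be sketched directly; the 32x38 size and 10-frame interval follow the examples in the text, while representing "resizing" by tagging each crop with its target size is a simplification for illustration:

```python
def build_lip_sequence(lip_images, interval=10, size=(32, 38)):
    """Sort per-frame lip crops by timestamp, keep every `interval`-th
    frame, and splice the result into one sequence. `lip_images` is a
    list of (timestamp, crop) pairs; actual pixel resizing is stood in
    for by pairing each crop with the target size."""
    ordered = sorted(lip_images, key=lambda item: item[0])
    sampled = ordered[::interval]
    return [(img, size) for _, img in sampled]

# 100 frames of placeholder crops -> a 10-element target sequence.
frames = [(t, f"lip_{t}") for t in range(100)]
seq = build_lip_sequence(frames)
```

With a real image library the `(img, size)` pairing would be replaced by an actual resize of the cropped region.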

By recognizing the face image of each target object in the video file, the embodiment of the present invention determines the facial feature points in each frame of face image and selects the lip image of each frame. Filtering the lip images of all frames achieves the acquisition of the target lip image sequence.

In one embodiment, step S30 — inputting each audio file into a preset synchronization model and generating a synchronized lip video file corresponding to each audio file — includes:

S301: Perform speech extraction on all audio files to obtain the speech information corresponding to each target object.

Understandably, the audio file is the audio data corresponding to the video file, and the speech information is the speech data of a target object.

Specifically, after the audio files are obtained, speech extraction is performed on all of them; that is, the speech information corresponding to each target object is extracted from the audio files. For example, all utterances in an audio file may be clustered to obtain the speech information corresponding to each target object. Alternatively, the similarity between all utterances may be computed — that is, the similarity between their voiceprint features — and utterances whose similarity exceeds a threshold are clustered together, yielding the speech information corresponding to each target object. Note that this embodiment does not limit the extraction method. For example, in an insurance claims scenario, the audio file is a conversation between a user and a staff member, and the speech information of the two target objects is obtained through speech extraction; likewise, in a bank loan scenario, the audio file is a conversation between bank staff and a customer, and the speech information of the two target objects is obtained through speech extraction.
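The threshold-based clustering variant can be sketched as a greedy single pass over utterance embeddings; the dot-product similarity, the 0.75 threshold, and first-match assignment are illustrative assumptions, since the patent does not prescribe an algorithm:

```python
def cluster_by_voiceprint(embeddings, threshold=0.75, sim=None):
    """Greedy single-pass clustering: each utterance embedding joins
    the first cluster whose representative is similar enough, and
    otherwise starts a new cluster. Each cluster is stored as
    (representative embedding, list of member embeddings)."""
    if sim is None:
        sim = lambda u, v: sum(a * b for a, b in zip(u, v))  # dot product
    clusters = []
    for emb in embeddings:
        for rep, members in clusters:
            if sim(rep, emb) >= threshold:
                members.append(emb)
                break
        else:
            clusters.append((emb, [emb]))
    return clusters

# Two speakers with near-orthogonal unit embeddings -> two clusters.
utts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
groups = cluster_by_voiceprint(utts)
```

Each resulting cluster would correspond to the speech information of one target object.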

S302,通过所述预设同步模型对各所述语音信息进行视频转换,得到与各所述语音信息分别对应的同步唇形视频文件。S302: Perform video conversion on each of the voice information through the preset synchronization model to obtain a synchronized lip video file corresponding to each of the voice information.

Understandably, the preset synchronization model is built on the Wav2Lip method and trained on a large amount of data. A synchronized lip video file is the synchronized lip video that the preset synchronization model generates from the voice information.

Specifically, the voice information is input into the preset synchronization model, which performs video conversion on each piece of voice information; that is, the model makes a preset face image articulate the voice information. Feature extraction is performed on the voice information and on the preset face image to obtain the corresponding voice features and image features, and the voice features and image features of each target object are fed into a generator to produce the synchronized lip video corresponding to each piece of voice information. A pre-trained discriminator then judges whether the generated synchronized lip video is in sync, and a visual-quality discriminator is used to improve visual quality and synchronization accuracy. Once the generated video passes the discriminator's check, the synchronized lip video file corresponding to each piece of voice information is obtained. The preset synchronization model may be trained in-house or a public implementation may be called; this is not limited here. For example, in the financial field, the preset synchronization model generates a video from the extracted audio file and a preset image, and the generated synchronized lip video is used to determine the person corresponding to that audio file.
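The generate-then-verify flow described above can be outlined with the following control-flow skeleton. Every component here is a toy stand-in for illustration only: a real Wav2Lip-style system uses trained neural encoders, a trained generator, and a pre-trained sync discriminator, none of which are reproduced here, and the function names are hypothetical.

```python
def audio_encoder(audio_chunk):
    # Hypothetical stand-in: a real model returns a learned audio embedding.
    return [sum(audio_chunk) / len(audio_chunk)]

def face_encoder(face_image):
    # Hypothetical stand-in: a real model returns a learned face embedding.
    return [sum(face_image) / len(face_image)]

def generator(audio_feat, face_feat):
    # Stand-in for the generator that fuses voice and image features
    # into one video frame.
    return [a + f for a, f in zip(audio_feat, face_feat)]

def sync_discriminator(frames, audio_feats):
    # Stand-in sync score in [0, 1]; a real discriminator is pre-trained
    # to judge audio-lip synchronization.
    return 1.0 if len(frames) == len(audio_feats) else 0.0

def synthesize_lip_video(audio_chunks, face_image, sync_threshold=0.5):
    """Encode voice and face, generate one frame per audio chunk, and
    accept the result only if it passes the sync discriminator's check."""
    face_feat = face_encoder(face_image)
    audio_feats = [audio_encoder(c) for c in audio_chunks]
    frames = [generator(a, face_feat) for a in audio_feats]
    if sync_discriminator(frames, audio_feats) >= sync_threshold:
        return frames  # accepted as the synchronized lip video
    return None        # rejected: regenerate or flag for review

frames = synthesize_lip_video([[0.2, 0.4], [0.6, 0.8]], [0.5, 0.5])
print(len(frames))  # 2 frames, one per audio chunk
```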

By performing speech extraction on all audio files, this embodiment of the invention extracts the voice information. By performing video conversion on each piece of voice information through the preset synchronization model, it converts the voice information into video and obtains the synchronized lip video files, thereby improving the accuracy of subsequent target object identification.

In one embodiment, step S50, namely comparing the target lip image sequence and the synchronized lip image sequence of the same target object through the preset comparison model to obtain the identity recognition result of that target object, includes:

S501: Perform a similarity comparison between the lip images of the same sequence frames in the target lip image sequence and the synchronized lip image sequence, to obtain the similarity value corresponding to each sequence frame.

Specifically, after the synchronized lip image sequence is obtained, a similarity comparison is performed between the lip images of the same sequence frames in the target lip image sequence and the synchronized lip image sequence. Feature recognition is performed on the lip images of all sequence frames in both sequences, that is, on every frame of each target lip image in the target lip image sequence and on every frame of each synchronized lip image in the synchronized lip image sequence, yielding at least one lip feature for each frame of lip image. All lip features corresponding to the lip images of the same sequence frame in the two sequences are then obtained, and the similarity between the lip features of the same frame is calculated, giving the similarity value corresponding to each sequence frame. For example, in an insurance scenario, the identity of the target object is determined by comparing the similarity between the synchronized lip image sequence of the generated synchronized video and the lip images of the same sequence frames in the target lip image sequence of the acquired video file.
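The frame-by-frame comparison above can be sketched as follows, assuming each frame's lip image has already been reduced to a feature vector and the two sequences are aligned and of equal length. Cosine similarity is one illustrative choice of measure; the embodiment does not fix a particular one.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def frame_similarities(target_feats, synced_feats):
    """Compare the lip features of the two sequences frame by frame,
    returning one similarity value per sequence frame."""
    return [cosine_similarity(t, s) for t, s in zip(target_feats, synced_feats)]

# Per-frame lip feature vectors from the recorded video (target) and from
# the video generated by the synchronization model (synced).
target = [[0.6, 0.2, 0.1], [0.5, 0.3, 0.2]]
synced = [[0.6, 0.2, 0.1], [0.1, 0.9, 0.4]]
sims = frame_similarities(target, synced)
print([round(s, 2) for s in sims])  # first frame matches, second diverges
```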

S502: Determine the confidence corresponding to the video file based on all the similarity values corresponding to that same video file.

Further, the confidence corresponding to the video file is determined from all the similarity values corresponding to that video file. The confidence of each target object is calculated from the similarity value of every sequence frame corresponding to that object, namely by counting the frames whose similarity value is greater than or equal to a preset similarity threshold. The confidence of the video file is then computed from the confidences of all target objects corresponding to it: each target object's confidence is multiplied by a preset weight (the weights may be the same or different), thereby determining the confidence corresponding to the video file. Alternatively, the confidence of each target object is scored, all scores are summed, and the confidence corresponding to the video file is determined through a preset mapping.

S503: When the confidence is greater than or equal to a preset confidence threshold, confirm that the identity recognition result is the first recognition result, the first recognition result representing that the target object in the audio file and the target object in the video file are the same target object.

S504: When the confidence is less than the preset confidence threshold, confirm that the identity recognition result is the second recognition result, the second recognition result representing that the target object in the audio file and the target object in the video file are not the same target object.

Specifically, the preset confidence threshold is retrieved and compared with the confidence corresponding to the video file. When the confidence is greater than or equal to the preset confidence threshold, the identity recognition result is confirmed to be the first recognition result, which represents that the target object in the audio file and the target object in the video file are the same target object. When the confidence is less than the preset confidence threshold, the identity recognition result is confirmed to be the second recognition result, which represents that they are not the same target object. For example, in a banking scenario, opening an account with Ping An Bank or Ping An Securities, or transferring a large amount of funds, requires a security confirmation of the operator: the operator's lip information while speaking is compared and identified to determine whether it is the same person.
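The confidence calculation and threshold decision of steps S502 to S504 can be sketched together as follows. The frame-counting rule, the equal default weights, and both threshold values are illustrative assumptions; the embodiment leaves these choices open.

```python
def object_confidence(similarities, sim_threshold=0.7):
    """Confidence of one target object: the fraction of sequence frames
    whose similarity value reaches the similarity threshold (one plausible
    reading of the counting rule in step S502)."""
    hits = sum(1 for s in similarities if s >= sim_threshold)
    return hits / len(similarities)

def identify(per_object_sims, weights=None, conf_threshold=0.8):
    """Weighted confidence over all target objects of one video file,
    then the threshold decision of steps S503/S504."""
    confidences = [object_confidence(s) for s in per_object_sims]
    if weights is None:  # equal weights by default
        weights = [1.0 / len(confidences)] * len(confidences)
    confidence = sum(c * w for c, w in zip(confidences, weights))
    if confidence >= conf_threshold:
        return "same target object"       # first recognition result
    return "different target object"      # second recognition result

# Two target objects in one video file, with per-frame similarity values.
print(identify([[0.9, 0.8, 0.95], [0.85, 0.9, 0.6]]))
```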

By comparing the similarity between the lip images of the same sequence frames in the target lip image sequence and the synchronized lip image sequence, this embodiment obtains the similarity value corresponding to each sequence frame. The confidence is calculated from all similarity values corresponding to the same video file, and the identity of the target object is recognized by comparing the confidence with the preset confidence threshold, thereby improving the accuracy of target object identification.

In one embodiment, step S501, namely performing a similarity comparison between the lip images of the same sequence frames in the target lip image sequence and the synchronized lip image sequence to obtain the similarity value corresponding to each sequence frame, includes:

S5011: Perform feature recognition on the lip images of all sequence frames in the target lip image sequence and in the synchronized lip image sequence respectively, to obtain at least one lip feature corresponding to each frame of lip image.

Specifically, feature recognition is performed on the lip images of all sequence frames in the target lip image sequence and the synchronized lip image sequence; that is, one or more of the lip opening degree feature, the lip left-skew degree feature, and the lip right-skew degree feature are recognized in every lip image. Feature points of the lip region are identified first, and the lip opening degree feature is determined from the distance between the inner center feature point of the upper lip and the inner center feature point of the lower lip. The left lip-corner feature point is connected to the feature points on the outer contours of the upper and lower lips that are closest to it, forming a pair of first vectors, and the lip left-skew degree feature is obtained by calculating the angle between the first vectors. Likewise, the right lip-corner feature point is connected to the closest feature points on the outer contours of the upper and lower lips, forming a pair of second vectors, and the lip right-skew degree feature is obtained by calculating the angle between the second vectors. In this way, at least one lip feature corresponding to each frame of lip image is obtained. For example, in the insurance field, features are extracted from lip images of different sources to obtain the lip opening degree feature, the lip left-skew degree feature, the lip right-skew degree feature, and other lip features of the lips while speaking.
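The geometric features above can be computed directly from 2-D feature-point coordinates. This sketch assumes the landmark points have already been detected by a face-landmark model (not shown); the coordinates in the example are illustrative.

```python
import math

def lip_opening(upper_inner_center, lower_inner_center):
    """Opening degree feature: distance between the inner center feature
    points of the upper and lower lip."""
    dx = upper_inner_center[0] - lower_inner_center[0]
    dy = upper_inner_center[1] - lower_inner_center[1]
    return math.hypot(dx, dy)

def corner_angle(corner, upper_nearest, lower_nearest):
    """Skew degree feature at one lip corner: the angle (degrees) between
    the vectors from the corner feature point to the nearest upper- and
    lower-contour feature points; used for both the left- and right-skew
    features."""
    v1 = (upper_nearest[0] - corner[0], upper_nearest[1] - corner[1])
    v2 = (lower_nearest[0] - corner[0], lower_nearest[1] - corner[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    return math.degrees(math.acos(dot / (n1 * n2)))

# Wide-open mouth: inner centers two units apart; a symmetric corner
# whose contour vectors are perpendicular gives a 90-degree angle.
print(round(lip_opening((0.0, 1.0), (0.0, -1.0)), 2))                # 2.0
print(round(corner_angle((0.0, 0.0), (1.0, 1.0), (1.0, -1.0)), 1))   # 90.0
```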

S5012: Obtain all the lip features corresponding to the lip images of the same sequence frames in the target lip image sequence and the synchronized lip image sequence respectively, and determine the similarity value based on the similarity between the obtained lip features of the same frame.

Specifically, all lip features corresponding to the lip images of the same sequence frames in the target lip image sequence and the synchronized lip image sequence are obtained, namely the lip opening degree feature, the lip left-skew degree feature, and the lip right-skew degree feature of each frame. Image similarity is then calculated between like features of the same sequence frames, that is, between the opening degree features, between the left-skew degree features, or between the right-skew degree features. A deep convolutional layer first performs deep convolution on each of these features to obtain the convolutional features corresponding to the lip features of each frame; a long short-term memory network then performs temporal processing on the convolutional features to obtain temporal features; and the similarity between the two temporal features of the same sequence frame is calculated to obtain the similarity value corresponding to the lip images of that sequence frame. For example, in a financial scenario, during identity recognition the similarity between two sets of lip images from different sources is calculated to determine whether the person being verified is the target person. Or, during remote security information verification, the identity of the target object in the video is recognized and verified to ensure that the operation is performed by the person themselves.

By performing feature recognition on the lip images of all sequence frames in the target lip image sequence and the synchronized lip image sequence respectively, this embodiment extracts the lip features and thereby obtains all lip features of the lip images of the same sequence frames. The similarity value is calculated from the similarity between the obtained lip features of the same frame, thereby ensuring the accuracy of the recognition result.

In one embodiment, before step S50, that is, before obtaining the preset comparison model, the method includes:

S601: Obtain a sample data set, the sample data set including at least one piece of sample data and a sample label corresponding to the sample data.

Understandably, a piece of sample data is a group of lip images of a target object in the same frame: one image is extracted from the face information in a recorded video file, and the other is extracted from the face information in a video converted from the audio data through the preset synchronization model. The sample label represents the identity recognition result of the target object corresponding to that group of lip images. For example, in the insurance field, the target object corresponding to the audio data is determined by identifying the parties in a video between an insurance agent and a customer; or, in the banking field, when an account is opened, lip information is compared to verify the passphrase spoken by the operator. Sample data may be collected from different databases or sent to a database from a client. A sample data set is then constructed from all sample data and all sample labels.

S602: Obtain a preset training model, input all the sample data into the preset training model, and obtain the predicted label corresponding to each piece of sample data.

Understandably, a predicted label represents the similarity value that the preset training model predicts for a piece of sample data.

Specifically, after the sample data set is obtained, a preset training model is obtained and all sample data are input into it. The preset training model compares the group of lip images in each piece of sample data, that is, it performs a similarity comparison between the lip images of the same sequence frames in the first lip image sequence and the second lip image sequence: a deep convolutional layer first performs deep convolution on the first and second lip images in the sample data to obtain the sample convolutional features corresponding to each frame of lip image; a long short-term memory network then performs temporal processing on the sample convolutional features to obtain sample temporal features; and the similarity between the two sample temporal features of the same sequence frame is calculated to obtain the sample similarity value corresponding to each sequence frame. The sample confidence is determined from all similarity values corresponding to the same video file. When the sample confidence is greater than or equal to a sample confidence threshold, the target objects corresponding to the lip images in the sample data are confirmed to be the same target object; when it is less than the threshold, they are confirmed not to be the same target object. The specific process is the same as steps S501 to S504 above and is not repeated here.

S603: Determine the prediction loss value corresponding to the preset training model based on the sample label and the predicted label corresponding to the same piece of sample data.

Understandably, the prediction loss value is generated in the process of making predictions on the sample data.

Specifically, after the predicted labels are obtained, all predicted labels corresponding to the same sample data are arranged in the order of the sample data in the sample data set, and each sample label is compared with the predicted label at the same position in the sequence. That is, following the ordering of the sample data, the first sample label is compared with the first predicted label, and the loss value between them is calculated through a loss function; the second sample label is then compared with the second predicted label, and so on, until all sample labels and all predicted labels have been compared. The loss values of all sample data are summed to obtain the prediction loss value corresponding to the preset training model.
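The ordered, pairwise loss summation described above can be sketched as follows. Squared error is used as an illustrative per-sample loss, since the embodiment does not fix a particular loss function.

```python
def sample_loss(sample_label, predicted_label):
    # Squared-error loss per sample; one illustrative choice of loss function.
    return (sample_label - predicted_label) ** 2

def prediction_loss(sample_labels, predicted_labels):
    """Compare labels pairwise in dataset order (first with first, second
    with second, ...) and sum the per-sample losses over all samples."""
    assert len(sample_labels) == len(predicted_labels)
    return sum(sample_loss(s, p)
               for s, p in zip(sample_labels, predicted_labels))

# Ground-truth labels and the model's predicted similarity values,
# in dataset order.
print(round(prediction_loss([1.0, 0.0, 1.0], [0.9, 0.2, 0.7]), 2))
```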

S604: When the prediction loss value reaches a convergence condition, determine the preset training model after convergence as the preset comparison model.

Understandably, the convergence condition may be that the prediction loss value is less than a set threshold, or that after 500 iterations the prediction loss value is very small and no longer decreases, at which point training stops.

Specifically, after the prediction loss value is obtained, if it has not reached the preset convergence condition, the initial parameters of the preset training model are adjusted according to the prediction loss value, all sample data are re-input into the model with adjusted initial parameters, and that model is trained iteratively to obtain its corresponding prediction loss value. If this value still has not reached the preset convergence condition, the initial parameters are adjusted again according to it, so that the prediction loss value of the re-adjusted model reaches the preset convergence condition. In this way, the accuracy of the preset training model keeps increasing and its outputs keep approaching the correct results; when the prediction loss value of the preset training model reaches the preset convergence condition, the converged preset training model is determined as the preset comparison model.
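The adjust-and-retrain loop described above can be outlined on a toy one-parameter problem. The update rule here is a hand-written stand-in for a real gradient-based parameter adjustment, and the target value and thresholds are purely illustrative.

```python
def train_until_convergence(initial_params, compute_loss, adjust,
                            loss_threshold=1e-3, max_rounds=500):
    """Iterate: compute the prediction loss, stop when it meets the
    convergence condition (loss below threshold, or the round budget is
    spent), otherwise adjust the parameters by the loss and retrain."""
    params = initial_params
    for _ in range(max_rounds):
        loss = compute_loss(params)
        if loss < loss_threshold:
            # Converged: this model becomes the preset comparison model.
            return params, loss
        params = adjust(params, loss)
    return params, compute_loss(params)

# Toy problem: drive a single parameter toward the target value 3.0.
target = 3.0
loss_fn = lambda p: (p - target) ** 2
step_fn = lambda p, loss: p + 0.5 * (target - p)  # stand-in for a gradient step
params, final_loss = train_until_convergence(0.0, loss_fn, step_fn)
print(final_loss < 1e-3)  # True: the loop stopped at the convergence condition
```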

This embodiment of the invention iteratively trains the preset training model on a large amount of sample data and calculates its overall loss value, thereby determining the prediction loss value of the preset training model. The initial parameters of the preset training model are adjusted according to the prediction loss value until the model converges, thereby determining the preset comparison model and ensuring that it has a high prediction accuracy.

It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present invention.

In one embodiment, a target object recognition apparatus is provided, which corresponds one-to-one to the target object recognition method in the above embodiments. As shown in FIG. 5, the target object recognition apparatus includes a file acquisition module 11, a face detection module 12, a synchronized video module 13, an image extraction module 14, and a recognition result module 15. The functional modules are described in detail as follows:

the file acquisition module 11, configured to acquire a video file and the audio file corresponding to the video file;

the face detection module 12, configured to perform face detection on the target objects in the video file, to obtain the target lip image sequence corresponding to each of the target objects in the video file;

the synchronized video module 13, configured to input each of the audio files into the preset synchronization model, and generate the synchronized lip video file corresponding to each of the audio files;

the image extraction module 14, configured to extract, from the synchronized lip video file, the synchronized lip image sequence corresponding to each of the target objects;

the recognition result module 15, configured to obtain the preset comparison model, and compare the target lip image sequence and the synchronized lip image sequence of the same target object through the preset comparison model, to obtain the identity recognition result of that target object.

In one embodiment, the recognition result module 15 includes:

a similarity unit, configured to perform a similarity comparison between the lip images of the same sequence frames in the target lip image sequence and the synchronized lip image sequence, to obtain the similarity value corresponding to each sequence frame;

a confidence unit, configured to determine the confidence corresponding to the video file based on all the similarity values corresponding to that same video file;

a first recognition result unit, configured to confirm, when the confidence is greater than or equal to a preset confidence threshold, that the identity recognition result is the first recognition result, the first recognition result representing that the target object in the audio file and the target object in the video file are the same target object;

a second recognition result unit, configured to confirm, when the confidence is less than the preset confidence threshold, that the identity recognition result is the second recognition result, the second recognition result representing that the target object in the audio file and the target object in the video file are not the same target object.

In one embodiment, the similarity unit includes:

a feature recognition subunit, configured to perform feature recognition on the lip images of all sequence frames in the target lip image sequence and the synchronized lip image sequence respectively, to obtain at least one lip feature corresponding to each frame of lip image;

a similarity value subunit, configured to obtain all the lip features corresponding to the lip images of the same sequence frames in the target lip image sequence and the synchronized lip image sequence respectively, and determine the similarity value based on the similarity between the obtained lip features of the same frame.

In one embodiment, the synchronized video module 13 includes:

a voice information unit, configured to perform speech extraction on all the audio files, to obtain the voice information corresponding to each of the target objects;

a lip video unit, configured to perform video conversion on each piece of the voice information through the preset synchronization model, to obtain the synchronized lip video file corresponding to each piece of the voice information.

In one embodiment, the face detection module 12 includes:

a facial feature point unit, configured to recognize the face images of each of the target objects in the video file, to obtain the facial feature points corresponding to each frame of face image;

a lip image unit, configured to determine the lip image corresponding to each frame of face image based on all the facial feature points;

an image sequence unit, configured to filter the lip images of all frames, to obtain the target lip image sequence.

In one embodiment, the recognition result module 15 further includes:

a sample acquisition unit, configured to obtain a sample data set, the sample data set including at least one piece of sample data and a sample label corresponding to the sample data;

a label prediction unit, configured to obtain a preset training model, input all the sample data into the preset training model, and obtain the predicted label corresponding to each piece of sample data;

a loss prediction unit, configured to determine the prediction loss value corresponding to the preset training model based on the sample label and the predicted label corresponding to the same piece of sample data;

a model convergence unit, configured to determine, when the prediction loss value reaches the convergence condition, the preset training model after convergence as the preset comparison model.

For specific limitations on the target object recognition apparatus, reference may be made to the limitations on the target object recognition method above, which are not repeated here. Each module in the above target object recognition apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in or independent of a processor in a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the above modules.

在一个实施例中,提供了一种计算机设备,该计算机设备可以是服务器,其内部结构图可以如图6所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口和数据库。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统、计算机程序和数据库。该内存储器为非易失性存储介质中的操作系统和计算机程序的运行提供环境。该计算机设备的数据库用于存储上述实施例中目标对象识别方法所用到的数据。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机程序被处理器执行时以实现一种目标对象识别方法。In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure diagram may be as shown in FIG. 6 . The computer device includes a processor, memory, network interface, and database connected through a system bus. Wherein, the processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes non-volatile storage media and internal memory. The non-volatile storage medium stores operating systems, computer programs and databases. This internal memory provides an environment for the execution of operating systems and computer programs in non-volatile storage media. The database of the computer device is used to store data used in the target object recognition method in the above embodiment. The network interface of the computer device is used to communicate with external terminals through a network connection. The computer program implements a target object recognition method when executed by the processor.

In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the target object recognition method described above is implemented.

In one embodiment, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program. When the computer program is executed by a processor, the target object recognition method described above is implemented.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing the relevant hardware through a computer program. The computer program may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Those skilled in the art can clearly understand that, for convenience and brevity of description, the division into the functional units and modules described above is used only as an example. In practical applications, the above functions can be allocated to different functional units and modules as needed; that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.

The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention.

Claims (10)

Translated from Chinese
1. A target object recognition method, characterized by comprising:
obtaining a video file and an audio file corresponding to the video file;
performing face detection on each target object in the video file to obtain a target lip image sequence corresponding to each target object in the video file;
inputting each audio file into a preset synchronization model to generate a synchronized lip video file corresponding to each audio file;
extracting, from the synchronized lip video file, a synchronized lip image sequence corresponding to each target object;
obtaining a preset comparison model, and comparing, through the preset comparison model, the target lip image sequence and the synchronized lip image sequence of the same target object to obtain an identity recognition result of the target object.

2. The target object recognition method according to claim 1, wherein comparing, through the preset comparison model, the target lip image sequence and the synchronized lip image sequence of the same target object to obtain the identity recognition result of the target object comprises:
performing a similarity comparison on the lip images of the same sequence frame in the target lip image sequence and the synchronized lip image sequence to obtain a similarity value corresponding to each sequence frame;
determining, according to all the similarity values corresponding to the same video file, a confidence corresponding to the video file;
when the confidence is greater than or equal to a preset confidence threshold, confirming that the identity recognition result is a first recognition result, the first recognition result indicating that the target object in the audio file and the target object in the video file are the same target object;
when the confidence is less than the preset confidence threshold, confirming that the identity recognition result is a second recognition result, the second recognition result indicating that the target object in the audio file and the target object in the video file are not the same target object.

3. The target object recognition method according to claim 2, wherein performing a similarity comparison on the lip images of the same sequence frame in the target lip image sequence and the synchronized lip image sequence to obtain a similarity value corresponding to each sequence frame comprises:
performing feature recognition on the lip images of all sequence frames in the target lip image sequence and the synchronized lip image sequence, respectively, to obtain at least one lip feature corresponding to each frame of lip image;
obtaining all the lip features respectively corresponding to the lip images of the same sequence frame in the target lip image sequence and the synchronized lip image sequence, and determining the similarity value according to the similarity between the obtained lip features of the same frame.

4. The target object recognition method according to claim 1, wherein inputting each audio file into the preset synchronization model to generate the synchronized lip video file corresponding to each audio file comprises:
performing speech extraction on all the audio files to obtain speech information corresponding to each target object;
performing video conversion on each piece of speech information through the preset synchronization model to obtain the synchronized lip video file corresponding to each piece of speech information.

5. The target object recognition method according to claim 1, wherein performing face detection on the target objects in the video file to obtain the target lip image sequence corresponding to each target object in the video file comprises:
recognizing the face images of each target object in the video file to obtain facial feature points corresponding to each frame of face image;
determining, according to all the facial feature points, the lip image corresponding to each frame of face image;
filtering the lip images of all frames to obtain the target lip image sequence.

6. The target object recognition method according to claim 1, wherein, before obtaining the preset comparison model, the method comprises:
obtaining a sample data set, the sample data set including at least one piece of sample data and a sample label corresponding to the sample data;
obtaining a preset training model, inputting all the sample data into the preset training model, and obtaining a prediction label corresponding to each piece of sample data;
determining, according to the sample label and the prediction label corresponding to the same sample data, a prediction loss value corresponding to the preset training model;
when the prediction loss value reaches a convergence condition, determining the preset training model after convergence as the preset comparison model.

7. A target object recognition device, characterized by comprising:
a file acquisition module, configured to obtain a video file and an audio file corresponding to the video file;
a face detection module, configured to perform face detection on the target objects in the video file to obtain a target lip image sequence corresponding to each target object in the video file;
a synchronized video module, configured to input each audio file into a preset synchronization model and generate a synchronized lip video file corresponding to each audio file;
an image extraction module, configured to extract, from the synchronized lip video file, a synchronized lip image sequence corresponding to each target object;
a recognition result module, configured to obtain a preset comparison model, and compare, through the preset comparison model, the target lip image sequence and the synchronized lip image sequence of the same target object to obtain an identity recognition result of the target object.

8. The target object recognition device according to claim 7, wherein the recognition result module further includes:
a sample acquisition unit, configured to obtain a sample data set, the sample data set including at least one piece of sample data and a sample label corresponding to the sample data;
a label prediction unit, configured to obtain a preset training model, input all the sample data into the preset training model, and obtain a prediction label corresponding to each piece of sample data;
a loss prediction unit, configured to determine, according to the sample label and the prediction label corresponding to the same sample data, a prediction loss value corresponding to the preset training model;
a model convergence unit, configured to determine, when the prediction loss value reaches a convergence condition, the preset training model after convergence as the preset comparison model.

9. A computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, the target object recognition method according to any one of claims 1 to 6 is implemented.

10. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the target object recognition method according to any one of claims 1 to 6 is implemented.
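As a rough illustration of the decision logic in claims 2 and 3 — a per-frame similarity between the target lip image sequence and the synchronized lip image sequence, aggregated into a confidence that is checked against a preset threshold — a minimal sketch follows. This is not the patented implementation: the cosine similarity, the mean aggregation, and every name here are assumptions for illustration only, and the lip-feature extraction step itself is out of scope.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length lip-feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def identify(target_seq, synced_seq, threshold=0.8):
    """Sketch of claims 2-3 for one target object.

    target_seq / synced_seq: per-frame lip-feature vectors, assumed to come
    from some upstream feature recognizer (hypothetical). Returns the
    confidence and whether the audio and the video are judged to show the
    same target object.
    """
    # similarity value for each pair of same-sequence-frame lip images
    sims = [cosine(t, s) for t, s in zip(target_seq, synced_seq)]
    # confidence for the whole video file, aggregated over all frames
    confidence = sum(sims) / len(sims)
    # first recognition result (same object) vs. second (different object)
    return confidence, confidence >= threshold
```

With identical feature sequences the confidence is 1.0 and the first recognition result is returned; with orthogonal features the confidence is 0.0 and the second recognition result is returned.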
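The training procedure of claim 6 — feed the sample data through a preset training model, compute a prediction loss value against the sample labels, and freeze the model as the preset comparison model once the loss converges — can be outlined as below. Everything here is a hypothetical stand-in: the patent does not specify the model, the loss function, the update rule, or the exact convergence test.

```python
def train_comparison_model(samples, labels, model, loss_fn, step_fn,
                           tol=1e-4, max_epochs=1000):
    """Sketch of claim 6: train until the prediction loss value converges,
    then return the converged model as the preset comparison model.

    model:   callable sample -> prediction label
    loss_fn: (predictions, labels) -> scalar prediction loss value
    step_fn: (model, loss) -> updated model (hypothetical optimizer step)
    """
    prev_loss = float("inf")
    for _ in range(max_epochs):
        preds = [model(x) for x in samples]   # prediction labels
        loss = loss_fn(preds, labels)         # prediction loss value
        if abs(prev_loss - loss) < tol:       # convergence condition
            break
        model = step_fn(model, loss)
        prev_loss = loss
    return model  # converged model becomes the preset comparison model
```

With a toy one-parameter linear model and an update that moves the parameter halfway toward its optimum on each step, the loop stops once successive loss values differ by less than `tol`.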
CN202310947120.1A (filed 2023-07-28, priority 2023-07-28) — Target object identification method, device, equipment and storage medium — Pending — published as CN117058575A (en)

Priority Applications (1)

Application Number: CN202310947120.1A
Priority Date: 2023-07-28
Filing Date: 2023-07-28
Title: Target object identification method, device, equipment and storage medium


Publications (1)

Publication Number: CN117058575A
Publication Date: 2023-11-14

Family

ID: 88661744

Family Applications (1)

Application Number: CN202310947120.1A (Pending, published as CN117058575A)
Priority Date: 2023-07-28
Filing Date: 2023-07-28
Title: Target object identification method, device, equipment and storage medium

Country Status (1)

CN: CN117058575A (en)

Citations (5)

* Cited by examiner, † Cited by third party

CN107404381A* (Alibaba Group Holding Ltd; priority 2016-05-19, published 2017-11-28) — An identity authentication method and device
CN110276259A* (Ping An Technology (Shenzhen) Co., Ltd.; priority 2019-05-21, published 2019-09-24) — Lip reading recognition method, device, computer equipment and storage medium
CN110955874A* (Shenzhen OneConnect Smart Technology Co., Ltd.; priority 2019-10-12, published 2020-04-03) — Identity authentication method, device, computer equipment and storage medium
CN113743160A* (Beijing Zhongguancun Kejin Technology Co., Ltd.; priority 2020-05-29, published 2021-12-03) — Method, apparatus and storage medium for liveness detection
US20230023102A1* (Minds Lab Inc.; priority 2021-07-22, published 2023-01-26) — Apparatus, method, and computer program for providing lip-sync video and apparatus, method, and computer program for displaying lip-sync video



Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination
