CN116521939A

Info

Publication number: CN116521939A
Application number: CN202310551276.8A
Authority: CN
Inventors: 王斌; 宋志鹏; 蒋婕; 李廷超; 曹又潮; 迟鹭璎
Original assignee: Shuangzhibo Shenyang Electromechanical Equipment Manufacturing Co ltd
Current assignee: Shuangzhibo Shenyang Electromechanical Equipment Manufacturing Co ltd
Priority date: 2023-05-16
Filing date: 2023-05-16
Publication date: 2023-08-01

Abstract

Translated from Chinese

This application relates to the field of multimodal data retrieval, and in particular to a cross-modal retrieval method for judicial image-text data. The method includes: determining court trial items to be constructed; obtaining original video data, original audio data and original text data; performing feature extraction on each to obtain extracted video data, extracted audio data and extracted text data; training a standard multimodal data retrieval network on the extracted data to obtain a multimodal court trial data retrieval network; and inputting a court trial item to be retrieved into that network to obtain the corresponding original video data, original audio data or original text data. By building the multimodal court trial data retrieval network from the extracted video, audio and text data of the court trial items to be constructed, the application stores the multimodal data of those items in a unified space, which makes the items easy to retrieve.

Description

Cross-modal retrieval method for judicial image-text data

Technical field

This application relates to the field of multimodal data retrieval, and in particular to a cross-modal retrieval method for judicial image-text data.

Background

As technology develops, the number of cases heard in court keeps growing, and so does the volume of court records. The common, long-standing practice is to record the entire trial on paper. Paper records have several drawbacks. First, they accumulate year after year and take up more and more storage space. Second, they are not environmentally friendly: trials are long, the records are generally bulky, and printing a single trial record in full takes a large amount of paper. Third, they are hard to search. Finding the records for a given year or a specific case file is relatively easy, but finding a specific passage within a case file requires someone to read the whole file, which is laborious and inefficient.

The data to be searched is usually not limited to text. Trial records also include video and audio, and locating the relevant part of a video is likewise time-consuming and labor-intensive. Without an accompanying text record, the entire video file has to be viewed to find it.

What courts currently need for trial records is retrieval across three modalities: text, audio and video. Existing methods for retrieving across multiple media usually rely on a multi-task network, but such networks are not designed for court hearings. They lack domain specificity and do not fit this application scenario.

Summary of the invention

This application provides a cross-modal retrieval method for judicial image-text data, which addresses the problem that existing retrieval methods for court trial documents are not designed specifically for court hearings.

The technical solution of this application is a cross-modal retrieval method for judicial image-text data, including:

S1: determining several court trial items to be constructed, and, based on the court trial items to be constructed, correspondingly obtaining original video data, original audio data and original text data;

S2: based on the court trial items to be constructed, performing feature extraction on the original video data, original audio data and original text data respectively, and correspondingly obtaining extracted video data, extracted audio data and extracted text data that are stored in the same storage form;

S3: training a standard multimodal data retrieval network with the extracted video data, extracted audio data and extracted text data of the several court trial items to be constructed, to obtain a multimodal court trial data retrieval network;

S4: obtaining a court trial item to be retrieved and inputting it into the multimodal court trial data retrieval network, to obtain the original video data, original audio data or original text data corresponding to the court trial item to be retrieved.

Optionally, step S2 includes:

S21: segmenting the original video data to obtain segmented video data, and performing feature extraction on the segmented video data through MovieNet to obtain the extracted video data;

S22: segmenting the original audio data to obtain segmented audio data, and performing feature extraction on the segmented audio data through AudioNet to obtain the extracted audio data;

S23: performing feature extraction on the original text data through Bert to obtain extracted text data comprising several word vectors.

Optionally, step S3 includes:

S31: training the standard multimodal data retrieval network with the extracted video data, extracted audio data and extracted text data of the several court trial items to be constructed, to obtain a preliminary multimodal court trial data network;

S32: determining a test court trial item and inputting it into the multimodal court trial data retrieval network, and evaluating the multimodal court trial data retrieval network through the mAP curve, the PR curve and top-N accuracy to obtain an evaluation result;

judging whether the evaluation result meets a preset evaluation standard, and if so, taking the preliminary multimodal court trial data network as the multimodal court trial data retrieval network;

if not, optimizing the preliminary multimodal court trial data network according to the evaluation result to obtain the multimodal court trial data retrieval network.
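The application names mAP and top-N accuracy as evaluation measures but does not define them. The following is a minimal sketch that assumes the usual information-retrieval definitions: average precision is the mean of precision@k over the ranks k at which a relevant item appears, and top-N accuracy is the fraction of queries with at least one relevant item among the first N results. The function names and the representation of results as ranked lists of identifiers are assumptions made for illustration, not part of the application.

```python
from typing import Hashable, Sequence, Set


def average_precision(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """AP for one query: mean of precision@k over ranks k holding a relevant item."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / k  # precision at rank k
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(
    rankings: Sequence[Sequence[Hashable]], relevants: Sequence[Set[Hashable]]
) -> float:
    """mAP: average of the per-query AP values."""
    aps = [average_precision(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(aps) / len(aps) if aps else 0.0


def top_n_accuracy(
    rankings: Sequence[Sequence[Hashable]], relevants: Sequence[Set[Hashable]], n: int
) -> float:
    """Fraction of queries with at least one relevant item in the first n results."""
    if not rankings:
        return 0.0
    hits = sum(
        1
        for ranked, rel in zip(rankings, relevants)
        if any(item in rel for item in ranked[:n])
    )
    return hits / len(rankings)
```

Each test court trial item would contribute one ranked result list and one set of ground-truth items. The resulting scores could then be compared with the preset evaluation standard, whose thresholds the application leaves unspecified.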

Beneficial effects:

This application builds a multimodal court trial data retrieval network from the extracted video data, extracted audio data and extracted text data of the court trial items to be constructed, and stores the multimodal data of those items in a unified space so that the items are easy to retrieve. It therefore addresses the problem that existing retrieval methods for court trial documents are not designed specifically for court hearings.

Brief description of the drawings

To explain the technical solution of this application more clearly, the drawings used in the embodiments are briefly introduced below. A person of ordinary skill in the art can obviously derive other drawings from these drawings without creative effort.

FIG. 1 is a schematic flowchart of the cross-modal retrieval method for judicial image-text data in an embodiment of this application.

Detailed description

The embodiments are described in detail below, and examples of them are shown in the drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following embodiments do not represent all implementations consistent with this application. They are merely examples of systems and methods consistent with some aspects of this application as detailed in the claims.

This application provides a cross-modal retrieval method for judicial image-text data. As shown in FIG. 1, the method includes:

S1: determining several court trial items to be constructed, and, based on the court trial items to be constructed, correspondingly obtaining original video data, original audio data and original text data.

Specifically, the dataset is preprocessed. The original video data and original audio data of a court trial are often long, so annotations need to be added to them to facilitate the subsequent feature extraction.

S2: based on the court trial items to be constructed, performing feature extraction on the original video data, original audio data and original text data respectively, and correspondingly obtaining extracted video data, extracted audio data and extracted text data that are stored in the same storage form.

Step S2 includes the following.

For the video modality: the video is first segmented, and MovieNet is used to extract the features of each video segment.

For the audio modality: the audio is first segmented, and AudioNet is used to extract the audio features, which are converted into corresponding text data.

For the text modality: the Bert model is used to extract features from the text data. The representation of the whole text is the set of all word vectors, denoted {t_i, ..., t_j}.
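The application does not say how the long recordings are segmented. As one possible reading, the sketch below cuts a recording into fixed-length windows and returns the start and end time of each window, so that each window can then be passed to a feature extractor. The function name, the fixed window length and the use of integer milliseconds are assumptions made for illustration only.

```python
from typing import List, Tuple


def segment_bounds(duration_ms: int, segment_ms: int) -> List[Tuple[int, int]]:
    """Split [0, duration_ms) into consecutive windows of at most segment_ms.

    Times are integer milliseconds so that window edges are exact.
    The last window is shorter when the duration is not a multiple
    of the window length.
    """
    if duration_ms < 0:
        raise ValueError("duration_ms must be non-negative")
    if segment_ms <= 0:
        raise ValueError("segment_ms must be positive")
    return [
        (start, min(start + segment_ms, duration_ms))
        for start in range(0, duration_ms, segment_ms)
    ]
```

For example, a 100-second recording cut into 30-second windows yields three full windows and one 10-second remainder. The same bounds could be applied to the video and audio tracks of a hearing so that the segments of both modalities stay aligned in time.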

S3: training a standard multimodal data retrieval network with the extracted video data, extracted audio data and extracted text data of the several court trial items to be constructed, to obtain a multimodal court trial data retrieval network.

Step S3 includes:

S31: training the standard multimodal data retrieval network with the extracted video data, extracted audio data and extracted text data of the several court trial items to be constructed, to obtain a preliminary multimodal court trial data network.

Specifically, the data of the three modalities are transformed and mapped into the same subspace, in which data of different modalities share the same storage form.

The data of all three modalities are obtained repeatedly for different court trial items to be constructed and mapped into the same subspace. The multimodal court trial data network is then trained on them and its structure is optimized, so that the network can eliminate the heterogeneity between modalities and support retrieval of multimodal court trial data.
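The idea of a shared subspace can be illustrated with a minimal sketch: each modality gets its own linear projection into a common vector space, and retrieval ranks the stored items by cosine similarity to the projected query. The application does not specify the mapping or the similarity measure, so the linear projection, the cosine ranking, the hand-picked matrices and all names below are assumptions. In the described method the mapping would be learned during training rather than fixed by hand.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def project(matrix: Matrix, features: Sequence[float]) -> Vector:
    """Apply a per-modality linear map, then scale the result to unit length."""
    out = [sum(w * x for w, x in zip(row, features)) for row in matrix]
    norm = math.sqrt(sum(v * v for v in out))
    return [v / norm for v in out] if norm > 0 else out


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two unit-length vectors (their dot product)."""
    return sum(x * y for x, y in zip(a, b))


def rank(query: Vector, index: Dict[str, Vector]) -> List[str]:
    """Return the stored item ids ordered from most to least similar."""
    return sorted(index, key=lambda key: cosine(query, index[key]), reverse=True)


# Hypothetical projections: 3-d video features and 2-d text features
# are both mapped into the same 2-d space.
W_VIDEO: Matrix = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_TEXT: Matrix = [[1.0, 0.0], [0.0, 1.0]]

index = {
    "video:case_A": project(W_VIDEO, [0.9, 0.1, 0.5]),
    "video:case_B": project(W_VIDEO, [0.1, 0.8, 0.3]),
}
text_query = project(W_TEXT, [1.0, 0.2])
ranked_ids = rank(text_query, index)
```

Because every modality ends up as a vector of the same length, a text query can be compared directly with stored video or audio items. The ids in the ranked list would then be used to look up the corresponding original data.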

It is then judged whether the evaluation result meets the preset evaluation standard. If it does, the preliminary multimodal court trial data network is taken as the multimodal court trial data retrieval network. If it does not, the preliminary multimodal court trial data network is optimized according to the evaluation result to obtain the multimodal court trial data retrieval network.

Specifically, precision and recall are calculated first. Precision is the fraction of the samples predicted as positive that are truly positive, and recall is the fraction of all positive samples that are predicted correctly. Precision is calculated as follows:

$$\text{precision} = \frac{tp}{tp + fp}$$

where tp is the number of correctly predicted positive samples among the retrieved samples and fp is the number of negative samples predicted as positive. Precision reflects the share of correctly predicted positive samples among all samples predicted as positive, and so indicates the accuracy of the retrieval.

Recall is calculated as follows:

$$\text{recall} = \frac{tp}{tp + fn}$$

where tp is the number of correctly predicted positive samples among the retrieved samples and fn is the number of positive samples predicted as negative. Recall reflects the share of correctly predicted positive samples among all positive samples, and so gives a further indication of the accuracy of the retrieval.
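A short sketch of how these two quantities could be computed for a single query, using the standard definitions precision = tp / (tp + fp) and recall = tp / (tp + fn). Representing the retrieved and relevant items as sets of identifiers, and returning 0.0 when a denominator is zero, are assumptions made for illustration.

```python
from typing import Hashable, Set, Tuple


def confusion_counts(
    retrieved: Set[Hashable], relevant: Set[Hashable]
) -> Tuple[int, int, int]:
    """Return (tp, fp, fn) for one query.

    tp: retrieved items that are relevant
    fp: retrieved items that are not relevant
    fn: relevant items that were not retrieved
    """
    tp = len(retrieved & relevant)
    fp = len(retrieved - relevant)
    fn = len(relevant - retrieved)
    return tp, fp, fn


def precision(tp: int, fp: int) -> float:
    """tp / (tp + fp), or 0.0 when nothing was retrieved."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """tp / (tp + fn), or 0.0 when there are no relevant items."""
    return tp / (tp + fn) if (tp + fn) else 0.0
```

Computing both values at successive cut-offs of a ranked result list gives the points of a PR curve.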

The embodiments of this application have been described in detail above, but they are only preferred embodiments and should not be regarded as limiting the scope of implementation of this application. All equivalent changes and improvements made within the scope of this application remain covered by the patent of this application.

Claims

1. A cross-modal retrieval method for judicial image-text data, characterized by comprising the following steps:

S1: determining a plurality of court trial items to be constructed, and correspondingly acquiring original video data, original audio data and original text data based on the court trial items to be constructed;

S2: based on the court trial items to be constructed, performing feature extraction on the original video data, the original audio data and the original text data respectively, to correspondingly obtain extracted video data, extracted audio data and extracted text data that are stored in the same storage form;

S3: training a standard multimodal data retrieval network with the extracted video data, extracted audio data and extracted text data of the plurality of court trial items to be constructed, to obtain a multimodal court trial data retrieval network;

S4: acquiring a court trial item to be retrieved, and inputting it into the multimodal court trial data retrieval network to obtain the original video data, original audio data or original text data corresponding to the court trial item to be retrieved.

2. The cross-modal retrieval method for judicial image-text data according to claim 1, wherein step S2 includes:

S21: segmenting the original video data to obtain segmented video data, and extracting features of the segmented video data through MovieNet to obtain the extracted video data;

S22: segmenting the original audio data to obtain segmented audio data, and extracting features of the segmented audio data through AudioNet to obtain the extracted audio data;

S23: extracting features of the original text data through Bert to obtain extracted text data comprising a plurality of word vectors.

3. The cross-modal retrieval method for judicial image-text data according to claim 1, wherein step S3 includes:

S31: training the standard multimodal data retrieval network with the extracted video data, extracted audio data and extracted text data of the plurality of court trial items to be constructed, to obtain a preliminary multimodal court trial data network;

S32: determining a test court trial item, inputting it into the multimodal court trial data retrieval network, and evaluating the multimodal court trial data retrieval network through the mAP curve, the PR curve and top-N accuracy to obtain an evaluation result;

judging whether the evaluation result meets a preset evaluation standard, and if so, taking the preliminary multimodal court trial data network as the multimodal court trial data retrieval network;

if not, optimizing the preliminary multimodal court trial data network according to the evaluation result to obtain the multimodal court trial data retrieval network.