Technical Field
The present invention relates to the field of digital intelligence, and in particular to a video retrieval method and apparatus, a storage medium, and a processor.
Background
With the construction and popularization of projects such as safe cities and smart communities, video security surveillance equipment has gradually been installed in every corner of the city and can record and collect video image data around the clock (7x24). For large-scale traffic and community surveillance video systems with vast numbers of cameras, emerging intelligent video analysis based on computer vision makes automatic analysis and target recognition over massive video possible. As is well known, surveillance video is mainly used to maintain community and public safety; through real-time evidence collection and post-event retrieval, it plays a vital role in safeguarding public security. However, video images are unstructured data with a huge volume and little effective information, and many problems remain in storing them in a structured format. In addition, real-time, fast retrieval of video data faces many challenges, and manual retrieval is impractical due to the heavy workload, the large number of retrieval targets, the ease of omission, and low efficiency. Against this background, video retrieval techniques in the prior art mainly fall into the following two approaches:
Approach 1: semantic-based video retrieval. This approach is keyword-based: semantic description data is manually added to or automatically generated for the video, and keyword matching is performed against it, where the keywords may be titles, topics, persons, video events, and so on. However, in security surveillance applications, the accuracy of semantic-based video retrieval depends on a large amount of semantic description information, and there is little description information for a single specific target, so the retrieval effect is very limited. For example, when searching massive public security video for a target person, the available description may only be something like "a person wearing a blue top and black trousers", which cannot capture that person's deeper feature information; the retrieval is poorly targeted and the results will be extremely cluttered.
Approach 2: content-based video retrieval. This approach usually adopts traditional image processing methods, extracting low-level information such as color, texture, edges, and feature points from video frames and using the similarity between videos as the basis for retrieval. Compared with semantic retrieval, content-based video retrieval makes effective use of the low-level features of images and video, and retrieval efficiency is improved. However, most current content-based image retrieval techniques rely on traditional image features, whose descriptive power is still limited; moreover, the feature vectors used for retrieval are high-dimensional and similarity computation is time-consuming, making true real-time retrieval difficult.
In summary, current video retrieval techniques suffer from poor retrieval targeting, low retrieval accuracy and efficiency, and poor real-time performance; that is, the prior art has the technical problem of low video retrieval accuracy and low retrieval efficiency.
No effective solution to the above problems has yet been proposed.
Summary of the Invention
Embodiments of the present invention provide a video retrieval method and apparatus, a storage medium, and a processor, so as to at least solve the technical problem of low video retrieval accuracy and low retrieval efficiency in the prior art.
According to one aspect of the embodiments of the present invention, a video retrieval method is provided. The method includes: acquiring a target retrieval picture and a plurality of video images; preprocessing the plurality of video images to obtain at least one first target video image; performing target detection and target tracking on the at least one first target video image according to a first preset model to obtain all target image sequences of each first target video image in the at least one first target video image; performing feature extraction on all target image sequences of each first target video image according to a second preset model to obtain a first feature and a second feature of each first target video image, where the first feature is a binarized hash feature of the first target video image and the second feature is an original feature of the first target video image; clustering the first feature and the second feature according to a preset approximate nearest neighbor algorithm to obtain a retrieval model; performing matting on the target retrieval picture to obtain a target region image; and retrieving the target region image according to the retrieval model to obtain a retrieval result.
Further, retrieving the target region image according to the retrieval model to obtain the retrieval result includes: acquiring a third feature and a fourth feature of the target region image, where the third feature is a binarized hash feature of the target region image and the fourth feature is an original feature of the target region image; computing the Hamming distance between the third feature and the first feature of each first target video image to obtain at least one second target video image; computing the Euclidean distance between the fourth feature and the second feature of each second target video image in the at least one second target video image to obtain a target image frame, where the similarity between the target image frame and the target retrieval picture is greater than a preset similarity threshold; acquiring the frame ID of the target image frame; and searching the plurality of video images for the video image corresponding to the frame ID to obtain the retrieval result.
Further, after the feature extraction is performed on all target image sequences of each first target video image according to the second preset model, the method further includes: storing the at least one first target video image, the target image sequences, the first feature, and the second feature in a database in a structured manner.
Further, the preset approximate nearest neighbor algorithm is a locality-sensitive hashing algorithm.
Further, preprocessing the plurality of video images to obtain the at least one first target video image includes: performing length normalization and decoding on each of the plurality of video images in turn to obtain the first target video image.
Further, the method further includes: training the first preset model and the second preset model with a stochastic gradient descent algorithm until the first preset model and the second preset model converge.
According to another aspect of the embodiments of the present invention, a video retrieval apparatus is further provided. The apparatus includes: an acquisition unit, configured to acquire a target retrieval picture and a plurality of video images; a first processing unit, configured to preprocess the plurality of video images to obtain at least one first target video image; a second processing unit, configured to perform target detection and target tracking on the at least one first target video image according to a first preset model to obtain all target image sequences of each first target video image in the at least one first target video image; a third processing unit, configured to perform feature extraction on all target image sequences of each first target video image according to a second preset model to obtain a first feature and a second feature of each first target video image, where the first feature is a binarized hash feature of the first target video image and the second feature is an original feature of the first target video image; a fourth processing unit, configured to cluster the first feature and the second feature according to a preset approximate nearest neighbor algorithm to obtain a retrieval model; a fifth processing unit, configured to perform matting on the target retrieval picture to obtain a target region image; and a retrieval unit, configured to retrieve the target region image according to the retrieval model to obtain a retrieval result.
Further, the retrieval unit includes: a first acquisition subunit, configured to acquire a third feature and a fourth feature of the target region image, where the third feature is a binarized hash feature of the target region image and the fourth feature is an original feature of the target region image; a first computation subunit, configured to compute the Hamming distance between the third feature and the first feature of each first target video image to obtain at least one second target video image; a second computation subunit, configured to compute the Euclidean distance between the fourth feature and the second feature of each second target video image in the at least one second target video image to obtain a target image frame, where the similarity between the target image frame and the target retrieval picture is greater than a preset similarity threshold; a second acquisition subunit, configured to acquire the frame ID of the target image frame; and a retrieval subunit, configured to search the plurality of video images for the video image corresponding to the frame ID to obtain the retrieval result.
According to yet another aspect of the embodiments of the present invention, a storage medium is further provided. The storage medium includes a stored program, and when the program runs, a device on which the storage medium resides is controlled to execute the above video retrieval method.
According to yet another aspect of the embodiments of the present invention, a processor is further provided. The processor is configured to run a program, and the above video retrieval method is executed when the program runs.
In the embodiments of the present invention, the following approach is adopted: a target retrieval picture and a plurality of video images are acquired; the plurality of video images are preprocessed to obtain at least one first target video image; target detection and target tracking are performed on the at least one first target video image according to a first preset model to obtain all target image sequences of each first target video image in the at least one first target video image; feature extraction is performed on all target image sequences of each first target video image according to a second preset model to obtain a first feature and a second feature of each first target video image, where the first feature is a binarized hash feature of the first target video image and the second feature is an original feature of the first target video image; the first feature and the second feature are clustered according to a preset approximate nearest neighbor algorithm to obtain a retrieval model; and a target region image is obtained by performing matting on the target retrieval picture. This achieves the purpose of retrieving the target region image according to the retrieval model to obtain a retrieval result, thereby realizing the technical effect of improving video retrieval accuracy and retrieval efficiency and reducing the time cost and labor cost of retrieval, and in turn solving the technical problem of low video retrieval accuracy and low retrieval efficiency in the prior art.
Brief Description of the Drawings
The drawings described here are provided for a further understanding of the present invention and constitute a part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation on the present invention. In the drawings:
Fig. 1 is a schematic flowchart of an optional video retrieval method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of another optional video retrieval method according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of an optional video retrieval apparatus according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of another optional video retrieval apparatus according to an embodiment of the present invention.
Detailed Description
To enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
It should be noted that the terms "first", "second", and the like in the specification, claims, and drawings of the present invention are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the present invention described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to such a process, method, product, or device.
Embodiment 1
According to an embodiment of the present invention, an embodiment of a video retrieval method is provided. It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that described here.
Fig. 1 is a schematic flowchart of an optional video retrieval method according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S102: acquire a target retrieval picture and a plurality of video images;
Step S104: preprocess the plurality of video images to obtain at least one first target video image;
Step S106: perform target detection and target tracking on the at least one first target video image according to a first preset model to obtain all target image sequences of each first target video image in the at least one first target video image;
Step S108: perform feature extraction on all target image sequences of each first target video image according to a second preset model to obtain a first feature and a second feature of each first target video image, where the first feature is a binarized hash feature of the first target video image and the second feature is an original feature of the first target video image;
Step S110: cluster the first feature and the second feature according to a preset approximate nearest neighbor algorithm to obtain a retrieval model;
Step S112: perform matting on the target retrieval picture to obtain a target region image;
Step S114: retrieve the target region image according to the retrieval model to obtain a retrieval result.
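Purely as an illustration of how steps S102 to S114 chain together, the following Python sketch wires the stages into a single function. It is not the patent's implementation: each stage callable (preprocess, detect_and_track, extract_features, build_index, crop_target, search) is a hypothetical placeholder supplied by the caller and stands for the corresponding model or algorithm described in this embodiment.

```python
def video_retrieval(query_picture, videos, *, preprocess, detect_and_track,
                    extract_features, build_index, crop_target, search):
    """Hypothetical sketch of steps S102-S114; every stage is injected by the caller."""
    # S104: length-normalize and decode each raw video into frames
    decoded = [preprocess(v) for v in videos]

    # S106: detection + tracking -> all target image sequences per video
    sequences = [detect_and_track(frames) for frames in decoded]

    # S108: per sequence, an original feature vector and a binarized hash feature
    features = [extract_features(seq) for seq in sequences]

    # S110: cluster/bucket the features with an approximate nearest-neighbor scheme
    index = build_index(features)

    # S112: cut the target region out of the query picture
    target_region = crop_target(query_picture)

    # S114: coarse-to-fine search of the target region against the index
    return search(index, target_region)
```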
In the embodiments of the present invention, the following approach is adopted: a target retrieval picture and a plurality of video images are acquired; the plurality of video images are preprocessed to obtain at least one first target video image; target detection and target tracking are performed on the at least one first target video image according to a first preset model to obtain all target image sequences of each first target video image in the at least one first target video image; feature extraction is performed on all target image sequences of each first target video image according to a second preset model to obtain a first feature and a second feature of each first target video image, where the first feature is a binarized hash feature of the first target video image and the second feature is an original feature of the first target video image; the first feature and the second feature are clustered according to a preset approximate nearest neighbor algorithm to obtain a retrieval model; and a target region image is obtained by performing matting on the target retrieval picture. This achieves the purpose of retrieving the target region image according to the retrieval model to obtain a retrieval result, thereby realizing the technical effect of improving video retrieval accuracy and retrieval efficiency and reducing the time cost and labor cost of retrieval, and in turn solving the technical problem of low video retrieval accuracy and low retrieval efficiency in the prior art.
Optionally, the plurality of video images may be understood as massive video images, and the target retrieval picture is input by a user. It should be noted that the target retrieval picture may or may not be contained in the plurality of video images.
Optionally, by executing steps S102 to S110 of this application, the massive video images may first be processed and the features of each video image extracted (covering target detection, target tracking, and feature extraction). The features include an original feature (with a long dimensionality) and a binarized hash feature (with a short dimensionality, containing only the two values 0 and 1). The original features and binarized hash features of the video images are then saved and clustered, so as to build the retrieval service model.
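As a minimal sketch of the two feature types just described, the snippet below takes a real-valued embedding (standing in for the "original feature") and maps it to a compact 0/1 code (standing in for the binarized hash feature). The random projection is only an illustrative stand-in for the learned hash layer described later in this embodiment; the dimensions and the sign threshold are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
original = rng.standard_normal(4096)           # real-valued "original feature" (long dimensionality)
projection = rng.standard_normal((4096, 128))  # stand-in for the learned hash layer
hash_code = (original @ projection > 0).astype(np.uint8)  # 128-bit 0/1 "hash feature"
print(hash_code.shape, hash_code[:16])
```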
Optionally, when the user inputs a single picture as the target retrieval picture, executing step S112 may preprocess that single picture, remove the information in the picture that is irrelevant to the target region image, and cut out the target region image on its own.
Optionally, the first preset model may contain two sub-models, namely a deep-learning-based target detection sub-model and a deep-learning-based target tracking sub-model; the second preset model may be a deep-learning-based target feature extraction model.
Optionally, Fig. 2 is a schematic flowchart of another optional video retrieval method according to an embodiment of the present invention. As shown in Fig. 2, step S114 of retrieving the target region image according to the retrieval model to obtain the retrieval result includes:
Step S202: acquire a third feature and a fourth feature of the target region image, where the third feature is a binarized hash feature of the target region image and the fourth feature is an original feature of the target region image;
Step S204: compute the Hamming distance between the third feature and the first feature of each first target video image to obtain at least one second target video image;
Step S206: compute the Euclidean distance between the fourth feature and the second feature of each second target video image in the at least one second target video image to obtain a target image frame, where the similarity between the target image frame and the target retrieval picture is greater than a preset similarity threshold;
Step S208: acquire the frame ID of the target image frame;
Step S210: search the plurality of video images for the video image corresponding to the frame ID to obtain the retrieval result.
Optionally, by executing step S202, the original feature with a longer dimensionality and the binarized hash feature with a shorter dimensionality of the target region image can be obtained.
Optionally, by executing step S204, the Hamming distance between the binarized feature of the user-input image and the binarized features of the massive video data can be computed, thereby narrowing the retrieval scope and obtaining the features of the massive video data within the narrowed range. The Hamming distance characterizes the similarity between these features: the larger the Hamming distance, the lower the similarity. For example, computing the Hamming distance narrows the retrieval scope: if the massive database contains one hundred thousand video images and the user inputs a picture of a husky, ten thousand video images may remain after the Hamming comparison, all of which may contain dogs.
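A minimal NumPy sketch of this coarse filtering step is given below: the Hamming distance between 0/1 codes is simply the number of differing bits, and only candidates within a radius are kept. The database size, code length, and radius are made-up illustration values, not values fixed by the patent.

```python
import numpy as np

rng = np.random.default_rng(1)
db_codes = rng.integers(0, 2, size=(100_000, 128), dtype=np.uint8)  # binarized hash features of the library
query_code = rng.integers(0, 2, size=128, dtype=np.uint8)           # binarized hash feature of the query

# Hamming distance = number of bit positions where the codes differ
hamming = np.count_nonzero(db_codes != query_code, axis=1)

# Keep only candidates whose codes are close enough (radius chosen arbitrarily here)
candidates = np.flatnonzero(hamming <= 40)
print(len(candidates), "candidates survive the coarse Hamming filter")
```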
Optionally, by executing steps S206 to S210, the Euclidean distance between the original feature of the user-input image and the original features of the narrowed-down massive video data can be computed, so as to obtain the top N image frames in the massive video data that are most similar to the user-input image. The corresponding video identifier, the frame number of the image, and other related information are then looked up in the massive video data according to the image frame ID, and the video retrieval result is finally obtained. For example, computing the Euclidean distance over the ten thousand dog-containing video images in the example above may yield the one thousand video images that contain only huskies. Computing the Hamming distance and then the Euclidean distance in turn therefore narrows the retrieval scope step by step.
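Continuing the sketch above, the surviving candidates can then be re-ranked by the Euclidean distance between the real-valued original features, keeping the top N frames. The arrays and the value of N below are illustrative, and the final mapping from frame index back to video identifier and frame number is only hinted at.

```python
import numpy as np

rng = np.random.default_rng(2)
db_features = rng.standard_normal((10_000, 4096)).astype(np.float32)  # original features of the candidates
query_feature = rng.standard_normal(4096).astype(np.float32)          # original feature of the query region

# Euclidean distance between the query and every candidate's original feature
dists = np.linalg.norm(db_features - query_feature, axis=1)

top_n = 10  # this embodiment returns the 10 most similar video sequences
best = np.argsort(dists)[:top_n]
print("frame indices to map back to video IDs / frame numbers:", best)
```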
Optionally, based on the above, the position of the corresponding bucket is first determined from the binarized hash feature of the target retrieval picture using a standard normal distribution table, and the corresponding set of binary vectors is fetched from Redis according to the bucket label. The binarized hash features with high similarity are then obtained by Hamming distance comparison and sorting, completing a preliminary retrieval. A further, precise retrieval can be performed by computing the Euclidean distance using the original feature of the target retrieval picture. Finally, after comparison and sorting, the top N image frames with high similarity are obtained, and the corresponding video identifier, the frame number of the image, and other related information are looked up according to the image frame ID, thereby obtaining the video retrieval result. Here N is set to 10, i.e., the search returns the 10 video sequences with the highest similarity.
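A hedged sketch of this bucket lookup is shown below, assuming buckets are keyed by a short signature obtained from standard-normal projections and that bucket members are stored as Redis sets accessed through the redis-py client. The key format, signature length, and projection matrix are invented for illustration and are not details fixed by the patent.

```python
import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def bucket_key(code: np.ndarray, projections: np.ndarray) -> str:
    """Project the binarized hash feature with standard-normal vectors and keep the signs."""
    signature = (code.astype(np.float32) @ projections > 0).astype(np.uint8)
    return "bucket:" + "".join(map(str, signature))

rng = np.random.default_rng(3)
projections = rng.standard_normal((128, 8))            # 8 standard-normal hyperplanes -> 256 buckets
query_code = rng.integers(0, 2, 128, dtype=np.uint8)   # binarized hash feature of the query

members = r.smembers(bucket_key(query_code, projections))  # candidate IDs stored in that bucket
print(f"{len(members)} candidates fetched from Redis for Hamming comparison")
```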
Optionally, after step S108 is completed, that is, after the feature extraction is performed on all target image sequences of each first target video image according to the second preset model, the method may further include:
Step S10: store the at least one first target video image, the target image sequences, the first feature, and the second feature in a database in a structured manner. The database may be a MongoDB database or a Poseidon database and may serve as the retrieval database: whenever video image retrieval is performed, the target features need to be compared against the data in this database to obtain the retrieval result.
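For the MongoDB case, a minimal pymongo sketch of such a structured record might look like the following. The database and collection names and the document fields are assumptions chosen for illustration, not a schema prescribed by the patent.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["video_retrieval"]["frames"]  # hypothetical database / collection names

record = {
    "video_id": "cam_012_20170518_0001",     # which normalized video segment the frame came from
    "frame_id": 734,                          # frame number inside that segment
    "track_id": 5,                            # which tracked target the crop belongs to
    "hash_feature": [0, 1, 1, 0, 1],          # binarized hash feature (truncated for brevity)
    "original_feature": [0.12, -0.83, 0.44],  # original feature vector (truncated for brevity)
}
collection.insert_one(record)
```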
Optionally, the preset approximate nearest neighbor algorithm is a locality-sensitive hashing algorithm. Specifically, the structured information of the video files is clustered based on an ANN (Approximate Nearest Neighbor) algorithm. Bucketing is performed based on standard-normal-distribution binary hashing, and the bucketed binary vector data is stored in the in-memory data store Redis, thereby building the retrieval service.
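On the indexing side, the same bucketing idea can be sketched as follows: each stored item's binarized hash feature is mapped to a bucket signature, and its identifier is added to that bucket's set in Redis. As in the earlier query-side sketch, the projection matrix, key format, and item IDs are illustrative assumptions.

```python
import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
rng = np.random.default_rng(3)
projections = rng.standard_normal((128, 8))  # must match the projections used at query time

def index_item(item_id: str, code: np.ndarray) -> None:
    """Drop an item's ID into the bucket addressed by its signature."""
    signature = (code.astype(np.float32) @ projections > 0).astype(np.uint8)
    key = "bucket:" + "".join(map(str, signature))
    r.sadd(key, item_id)

for i in range(1000):                                  # pretend library of 1,000 target crops
    code = rng.integers(0, 2, 128, dtype=np.uint8)
    index_item(f"frame:{i}", code)
```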
Optionally, executing step S104, that is, preprocessing the plurality of video images to obtain the at least one first target video image, includes:
Step S20: perform length normalization and decoding on each of the plurality of video images in turn to obtain the first target video image.
Specifically, length normalization of the video images cuts a continuous video stream into video stream segments of fixed length, which facilitates later analysis and storage. When decoding the video images, the video files may be decoded with OpenCV, and each frame is scaled and normalized in size. The scaling uses bilinear interpolation, and the scaled size is 1920*1080.
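A minimal OpenCV sketch of this decoding and size-normalization step is shown below; the input path is a placeholder and error handling is omitted.

```python
import cv2

def decode_and_normalize(video_path: str, size=(1920, 1080)):
    """Decode a video file frame by frame and resize each frame with bilinear interpolation."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # cv2.INTER_LINEAR is OpenCV's bilinear interpolation
        frames.append(cv2.resize(frame, size, interpolation=cv2.INTER_LINEAR))
    cap.release()
    return frames

frames = decode_and_normalize("segment_0001.mp4")  # hypothetical fixed-length segment
```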
Optionally, the method may further include: Step S30, training the first preset model and the second preset model with a stochastic gradient descent algorithm until the first preset model and the second preset model converge.
Specifically, the first preset model may be trained as follows. First, an image dataset and its corresponding category label information may be divided into two parts, one part serving as the training sample set and the other as the test sample set, where each sample in the training and test sample sets includes an image and its corresponding category label. The two sub-models of the first preset model are then constructed: a deep-learning-based target detection sub-model and a deep-learning-based target tracking sub-model, where the target detection sub-model adopts the classic YOLO architecture and the target tracking sub-model adopts an RNN architecture. Finally, the target detection sub-model and the target tracking sub-model may be trained on the training sample set by stochastic gradient descent (SGD), with the training learning-rate step set to 0.01.
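The patent does not give training code, but the SGD setup described here (learning rate 0.01) corresponds to a standard loop such as the following PyTorch sketch. The model, loss function, and data loader are placeholders standing in for the YOLO-style detection sub-model or RNN tracking sub-model and their respective losses.

```python
import torch

def train(model: torch.nn.Module, loader, loss_fn, epochs: int = 10) -> None:
    """Generic SGD training loop with the 0.01 learning-rate step mentioned above."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    model.train()
    for _ in range(epochs):
        for images, labels in loader:        # training samples: image + category label
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()                   # backpropagate
            optimizer.step()                  # SGD update
```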
Specifically, the second preset model may be trained as follows. First, an image dataset and its corresponding category label information are divided into two parts, one part serving as the training sample set and the other as the test sample set, where each sample in the training and test sample sets includes an image and its corresponding category label. A deep convolutional neural network architecture is then constructed, comprising a convolutional sub-network, a hash layer, and a loss layer. The convolutional sub-network is used to learn the original features of an image; the hash layer compresses the original features, reduces their dimensionality, and converts them into binary codes to obtain the binarized hash feature of the input image; and the loss layer measures the softmax classification error. The convolutional sub-network adopts the VGG architecture, the original feature has 4096 dimensions, and the binarized hash feature has 128 dimensions. Finally, using the training sample set and this deep convolutional neural network architecture, the second preset model is trained by stochastic gradient descent (SGD) to obtain the deep-learning-based target feature extraction model, with the training learning-rate step set to 0.01.
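A hedged PyTorch sketch of such an architecture is given below: a VGG-16 backbone produces a 4096-dimensional "original feature", a hash layer compresses it to 128 dimensions, and a classification head feeds the softmax loss; at inference time the 128-dimensional output is thresholded into the binarized hash feature. Choices such as the tanh activation and the zero threshold are assumptions made for the sketch, not details stated in the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class DeepHashNet(nn.Module):
    """Sketch: VGG convolutional sub-network + hash layer + softmax classification head."""

    def __init__(self, num_classes: int, hash_bits: int = 128):
        super().__init__()
        backbone = vgg16(weights=None)
        self.features = backbone.features           # convolutional sub-network
        self.avgpool = backbone.avgpool
        self.flatten = nn.Flatten()
        # Drop VGG's final 1000-way layer; the remaining stack ends in a 4096-D activation
        self.fc = nn.Sequential(*list(backbone.classifier.children())[:-1])
        self.hash_layer = nn.Sequential(nn.Linear(4096, hash_bits), nn.Tanh())
        self.classifier = nn.Linear(hash_bits, num_classes)  # feeds the softmax (cross-entropy) loss

    def forward(self, x):
        original = self.fc(self.flatten(self.avgpool(self.features(x))))  # 4096-D original feature
        hash_real = self.hash_layer(original)                              # 128-D real-valued code
        logits = self.classifier(hash_real)                                # classification logits
        return original, hash_real, logits

    @torch.no_grad()
    def binary_hash(self, x):
        """Threshold the real-valued code at zero to get the 0/1 binarized hash feature."""
        _, hash_real, _ = self.forward(x)
        return (hash_real > 0).to(torch.uint8)
```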
In the embodiments of the present invention, the following approach is adopted: a target retrieval picture and a plurality of video images are acquired; the plurality of video images are preprocessed to obtain at least one first target video image; target detection and target tracking are performed on the at least one first target video image according to a first preset model to obtain all target image sequences of each first target video image in the at least one first target video image; feature extraction is performed on all target image sequences of each first target video image according to a second preset model to obtain a first feature and a second feature of each first target video image, where the first feature is a binarized hash feature of the first target video image and the second feature is an original feature of the first target video image; the first feature and the second feature are clustered according to a preset approximate nearest neighbor algorithm to obtain a retrieval model; and a target region image is obtained by performing matting on the target retrieval picture. This achieves the purpose of retrieving the target region image according to the retrieval model to obtain a retrieval result, thereby realizing the technical effect of improving video retrieval accuracy and retrieval efficiency and reducing the time cost and labor cost of retrieval, and in turn solving the technical problem of low video retrieval accuracy and low retrieval efficiency in the prior art.
Embodiment 2
According to another aspect of the embodiments of the present invention, a video retrieval apparatus is further provided. As shown in Fig. 3, the apparatus includes: an acquisition unit 301, a first processing unit 303, a second processing unit 305, a third processing unit 307, a fourth processing unit 309, a fifth processing unit 311, and a retrieval unit 313.
The acquisition unit 301 is configured to acquire a target retrieval picture and a plurality of video images; the first processing unit 303 is configured to preprocess the plurality of video images to obtain at least one first target video image; the second processing unit 305 is configured to perform target detection and target tracking on the at least one first target video image according to a first preset model to obtain all target image sequences of each first target video image in the at least one first target video image; the third processing unit 307 is configured to perform feature extraction on all target image sequences of each first target video image according to a second preset model to obtain a first feature and a second feature of each first target video image, where the first feature is a binarized hash feature of the first target video image and the second feature is an original feature of the first target video image; the fourth processing unit 309 is configured to cluster the first feature and the second feature according to a preset approximate nearest neighbor algorithm to obtain a retrieval model; the fifth processing unit 311 is configured to perform matting on the target retrieval picture to obtain a target region image; and the retrieval unit 313 is configured to retrieve the target region image according to the retrieval model to obtain a retrieval result.
Optionally, as shown in Fig. 4, the retrieval unit 313 may include: a first acquisition subunit 401, a first computation subunit 403, a second computation subunit 405, a second acquisition subunit 407, and a retrieval subunit 409.
The first acquisition subunit 401 is configured to acquire a third feature and a fourth feature of the target region image, where the third feature is a binarized hash feature of the target region image and the fourth feature is an original feature of the target region image; the first computation subunit 403 is configured to compute the Hamming distance between the third feature and the first feature of each first target video image to obtain at least one second target video image; the second computation subunit 405 is configured to compute the Euclidean distance between the fourth feature and the second feature of each second target video image in the at least one second target video image to obtain a target image frame, where the similarity between the target image frame and the target retrieval picture is greater than a preset similarity threshold; the second acquisition subunit 407 is configured to acquire the frame ID of the target image frame; and the retrieval subunit 409 is configured to search the plurality of video images for the video image corresponding to the frame ID to obtain the retrieval result.
In the embodiments of the present invention, the following approach is adopted: a target retrieval picture and a plurality of video images are acquired; the plurality of video images are preprocessed to obtain at least one first target video image; target detection and target tracking are performed on the at least one first target video image according to a first preset model to obtain all target image sequences of each first target video image in the at least one first target video image; feature extraction is performed on all target image sequences of each first target video image according to a second preset model to obtain a first feature and a second feature of each first target video image, where the first feature is a binarized hash feature of the first target video image and the second feature is an original feature of the first target video image; the first feature and the second feature are clustered according to a preset approximate nearest neighbor algorithm to obtain a retrieval model; and a target region image is obtained by performing matting on the target retrieval picture. This achieves the purpose of retrieving the target region image according to the retrieval model to obtain a retrieval result, thereby realizing the technical effect of improving video retrieval accuracy and retrieval efficiency and reducing the time cost and labor cost of retrieval, and in turn solving the technical problem of low video retrieval accuracy and low retrieval efficiency in the prior art.
Embodiment 3
According to yet another aspect of the embodiments of the present invention, a storage medium is further provided. The storage medium includes a stored program, and when the program runs, a device on which the storage medium resides is controlled to execute the video retrieval method of Embodiment 1 of this application.
According to yet another aspect of the embodiments of the present invention, a processor is further provided. The processor is configured to run a program, and the video retrieval method of Embodiment 1 of this application is executed when the program runs.
In the embodiments of the present invention, the following approach is adopted: a target retrieval picture and a plurality of video images are acquired; the plurality of video images are preprocessed to obtain at least one first target video image; target detection and target tracking are performed on the at least one first target video image according to a first preset model to obtain all target image sequences of each first target video image in the at least one first target video image; feature extraction is performed on all target image sequences of each first target video image according to a second preset model to obtain a first feature and a second feature of each first target video image, where the first feature is a binarized hash feature of the first target video image and the second feature is an original feature of the first target video image; the first feature and the second feature are clustered according to a preset approximate nearest neighbor algorithm to obtain a retrieval model; and a target region image is obtained by performing matting on the target retrieval picture. This achieves the purpose of retrieving the target region image according to the retrieval model to obtain a retrieval result, thereby realizing the technical effect of improving video retrieval accuracy and retrieval efficiency and reducing the time cost and labor cost of retrieval, and in turn solving the technical problem of low video retrieval accuracy and low retrieval efficiency in the prior art.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for parts not described in detail in a particular embodiment, reference may be made to the relevant descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of the units may be a division by logical function, and there may be other ways of division in actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the essence of the technical solution of the present invention, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art may make several improvements and refinements without departing from the principle of the present invention, and such improvements and refinements shall also fall within the protection scope of the present invention.