Disclosure of Invention
In order to achieve the above purpose, the present application provides a multimodal security fence method based on vector matching, comprising the following steps:
Acquiring information to be checked, loading a pre-trained model set, and vectorizing the information to be checked to generate vectors to be checked, wherein the information to be checked comprises live real-time streams and newly added media files;
establishing connection with a violation vector database set, wherein the violation vector database set comprises vector features of various types of violation data;
obtaining the compliance similarity of the vectors to be checked according to the violation vector database set, and judging whether the information to be checked can be safely played;
The pre-trained model set is composed of a text vector model, a picture vector model and a voice vector model, and is used for generating vector information from the various types of information to be checked, wherein the types of information to be checked comprise text, pictures, audio and video.
The types of violation data likewise comprise text, pictures, audio, and video.
When the type is text, the vectorization is realized by text vectorization;
The text vectorization refers to processing the text content through the text vector model to obtain a high-dimensional vector representation;
the text vector model is trained and constructed from semantic vectors, Chinese STS data sets and self-labeled data.
When the type is a picture, the vectorization processing comprises cropping and converting the picture to generate a pre-identified picture, then loading the picture vector model to generate a vector representation of the pre-identified picture, wherein the picture vector model is trained on the Flickr30K-CN data set and self-labeled data;
and when the type is a picture, the vectorization processing further comprises recognizing the text information in the information to be checked, performing text vectorization, and generating a text-based vector representation.
When the type is audio, the vectorization processing comprises performing speech recognition on the audio to generate text information, and performing text vectorization on the text information to generate a text-based vector representation;
when the type is audio, the vectorization processing further comprises loading a timbre vector model, extracting timbre from the audio, and obtaining a vectorized representation of the timbre.
Further, timbre extraction refers to detecting voice activity in the audio, splitting out the speech segments of multiple speakers, extracting the speakers' voiceprints from the speech segments, and obtaining the vectorized representation of timbre through the timbre vector model;
the timbre vector model is a voiceprint recognition model trained on a sensitive data set, the content of which is the voiceprint features of pre-defined sensitive sounds.
When the type is video, the vectorization processing comprises:
Acquiring scene features of the video, and obtaining vector representations of the scene features;
acquiring picture-change frames in the video, acquiring the scene features of the picture-change frames, and obtaining vector representations of those scene features;
when obtaining the vector representation of scene features, the type is converted to picture and the picture vectorization processing is executed.
Wherein, obtaining the frame of picture change in the video refers to:
Extracting the HSV color space of a specified video frame, and comparing it with the previous frame by pixel-value differencing;
obtaining the average differences of corresponding pixels in hue, saturation and value, and weighting and summing these differences with their corresponding weights to generate a feature difference value;
and when the feature difference value is larger than a specified threshold, the specified video frame is a picture-change frame.
Further, the compliance similarity of the vectors to be audited is obtained through vector matching;
vector matching uses cosine similarity to measure the similarity between two embedded vectors, expressed as cos(e1, e2) = (e1 · e2) / (‖e1‖ ‖e2‖), where e1 and e2 are the two embedded vectors respectively.
Further, judging whether the information to be checked can be safely broadcast comprises adopting different strategies according to the source of the information when the compliance similarity is higher than a specified threshold: if violating content is detected in a live real-time stream, the violating content is replaced and then broadcast; if the violating content is a media asset, it is directly intercepted and further escalated to staff for manual intervention.
The invention performs all-round detection on the text, pictures, audio and video of media assets: newly added media assets are security-checked and intercepted, the broadcast of existing media assets and live real-time streams is dynamically detected and modified, and violating text, pictures, audio and video are cut out or replaced, which ensures broadcast safety and greatly improves content-auditing efficiency.
Detailed Description
The vector-based multimodal security fence method first collects violating media-asset data or sensitive voiceprint features as the violation supervision standard, vectorizes and stores them as the violation vector database set, generates multiple vectors for the content to be checked using the pre-trained model set, and then performs vector matching; content to be broadcast with a high matching degree is blocked and intercepted, or further classified so that a per-class processing strategy can be executed.
The following describes in detail the specific implementation of the present invention with reference to the drawings accompanying the specification.
Before the method provided by the invention is executed, a violation vector database set is established. The violation vector database set is a feature set comprising a large number of violating texts, pictures and timbres, where the timbres include the typical sound features of specified sensitive persons, by which those persons' speech can be identified. The violation vector database set is an updatable feature library; staff can continuously add specific violating content and the sound features of sensitive groups as required.
The steps of the vector-based multimodal security fence method provided by the invention are shown in FIG. 1, and the method comprises the following steps:
Step S100, obtaining information to be checked, loading the pre-trained model set, and vectorizing the information to be checked to generate vectors to be checked;
the information to be checked supported by the method comprises live real-time streams and newly added media files; generally, live real-time streams impose higher requirements on auditing accuracy and timeliness.
In the method, the type of the information to be audited is first determined, the types comprising text, picture, audio and video; a different model is loaded from the pre-trained model set for each type, and a single piece of picture, audio or video information may generate multiple pieces of vector data.
Specifically, as shown in FIG. 2, the pre-trained model set is composed of a text vector model, a picture vector model and a voice vector model, and is used for generating vector information from the various types of information to be checked.
1) When the type is text, the vectorization is realized by text vectorization;
the text vectorization refers to processing text content through a text vector model to obtain high-dimensional vector representation;
The text vector model is trained and constructed from semantic vectors, Chinese STS data sets and self-labeled data.
For example, the model is trained with the CoSENT method on top of the pre-trained embedding model BAAI/bge-large-zh-noinstruct, using a manually curated Chinese STS dataset supplemented with self-labeled training data. Compared with the base model, a model trained in this way is noticeably more discriminative and more accurate;
meanwhile, the text vector model is also used in auditing picture and audio content, via the recognized text and transcribed speech respectively.
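A minimal sketch of this text-vectorization step, assuming a sentence-transformers-compatible checkpoint; the fine-tuning itself is omitted, and a CoSENT fine-tuned checkpoint would be loaded the same way:

```python
# Minimal sketch: load the base embedding model named above and embed text.
# The model name and the normalization choice are illustrative assumptions.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-zh-noinstruct")

def vectorize_text(texts):
    # One high-dimensional embedding per input string; normalized embeddings
    # make cosine similarity reduce to a plain dot product.
    return model.encode(texts, normalize_embeddings=True)
```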
2) When the type is a picture, the vectorization processing comprises cropping and converting the picture to generate a pre-identified picture, then loading the picture vector model and generating a vector representation of the pre-identified picture from the picture's own features;
the picture vector model is trained on top of OpenAI CLIP, using the Flickr30K-CN data set supplemented with self-labeled data;
when the type is a picture, the vectorization processing also generates a vector for the picture's text features: OCR is used to recognize the text information in the information to be checked (the picture), text vectorization is performed, and a text-based vector representation is generated.
Based on the above operations, two types of vector data may be generated for a picture, one based on the characteristics of the picture itself and the other based on the text characteristics.
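A minimal sketch of these two picture vectors, assuming the Hugging Face CLIP implementation and pytesseract for OCR; a Flickr30K-CN fine-tuned checkpoint would replace the stock CLIP weights named here:

```python
# Minimal sketch: one vector from the picture's own features via CLIP,
# one from its OCR text via the text pipeline sketched earlier.
from PIL import Image
import pytesseract
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def picture_vectors(path):
    image = Image.open(path).convert("RGB")
    # Vector 1: visual features from the CLIP image tower.
    inputs = clip_processor(images=image, return_tensors="pt")
    image_vec = clip_model.get_image_features(**inputs)[0]
    # Vector 2: text recognized in the picture, embedded by the text model.
    ocr_text = pytesseract.image_to_string(image, lang="chi_sim")
    text_vec = vectorize_text([ocr_text])[0] if ocr_text.strip() else None
    return image_vec, text_vec
```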
3) When the type is audio, the vectorization processing comprises performing speech recognition on the audio to generate text information, and performing text vectorization on the text information to generate a text-based vector representation;
at the same time, timbre extraction is performed: the timbre vector model in the pre-trained model set is loaded, timbre is extracted from the audio, and a vectorized representation of the corresponding timbre is obtained.
Specifically, timbre extraction refers to detecting speakers' voice activity in the audio, splitting out the speech segments of multiple speakers, extracting the speakers' voiceprints from the speech segments, and obtaining the vectorized representation of timbre through the timbre vector model;
the timbre vector model is generated by ECAPA-TDNN training on a sensitive data set, the content of which is the voiceprint features of pre-defined sensitive sounds, for example the voices of performers with misconduct records in the IPTV industry and the voiceprint features of other common sensitive sounds.
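A minimal sketch of this timbre branch, using SpeechBrain's public ECAPA-TDNN speaker-embedding model as a stand-in for the timbre vector model trained on the sensitive data set; voice-activity detection and diarization are elided and a single speech segment is embedded directly:

```python
# Minimal sketch: embed a 16 kHz mono speech segment into a voiceprint vector.
import torchaudio
from speechbrain.pretrained import EncoderClassifier

spk_model = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb")

def timbre_vector(wav_path):
    signal, _sr = torchaudio.load(wav_path)
    # One ECAPA-TDNN voiceprint embedding per segment; it is later matched
    # against the sensitive voiceprint set by cosine similarity.
    return spk_model.encode_batch(signal)
```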
4) When the type is video, the vectorization processing comprises:
The method comprises acquiring scene features of the video and obtaining vector representations of those features: specifically, video scene-change detection is first performed to obtain split video frames; the frames are then treated as pictures, and the vector representation of each video scene is obtained through the picture vector model.
An input video contains a large number of frames; attending to every frame would incur heavy computation and redundant information. The invention therefore takes the picture vectors of the frames within a similar scene and applies average pooling over them to obtain a single feature representation of that scene, thereby compressing the features of similar scenes;
on the other hand, for frames with large picture-to-picture variation, i.e. for a changed scene, this step captures the picture-change frame, acquires the scene features of the picture-change frame, and then obtains a vector representation of those features.
Specifically, the method for acquiring the picture change frame in the video comprises the following steps:
Extracting the HSV color space of a specified video frame, and comparing it with the previous frame by pixel-value differencing;
obtaining the average differences of corresponding pixels in hue, saturation and value, and weighting and summing these differences with their corresponding weights to generate a feature difference value, i.e. D = wH·ΔH + wS·ΔS + wV·ΔV;
and when the feature difference value is larger than the specified threshold, the specified video frame is a picture-change frame.
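A minimal sketch of the HSV picture-change test and the scene pooling described above, using OpenCV; the channel weights and the threshold are illustrative placeholders, not values fixed by the method:

```python
# Minimal sketch: per-channel mean HSV difference, weighted into one value.
import cv2
import numpy as np

W_H, W_S, W_V = 0.5, 0.25, 0.25  # assumed hue/saturation/value weights
THRESHOLD = 30.0                 # assumed feature-difference threshold

def is_change_frame(prev_bgr, curr_bgr):
    prev_hsv = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    curr_hsv = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    # Mean absolute difference of corresponding pixels, per channel: [dH, dS, dV].
    d = np.abs(curr_hsv - prev_hsv).mean(axis=(0, 1))
    return (W_H * d[0] + W_S * d[1] + W_V * d[2]) > THRESHOLD

def scene_vector(frame_vectors):
    # Average pooling of the picture vectors within one similar scene.
    return np.mean(frame_vectors, axis=0)
```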
Step S110, establishing a connection with the violation vector database set, wherein the violation vector database set comprises the vector features of multiple types of violation data, the types comprising text, pictures, audio and video;
in this step, the compliance similarity of the vectors to be checked is obtained according to the violation vector database set.
Cosine similarity is used to measure the similarity between two embedded vectors, expressed as:
cos(e1, e2) = (e1 · e2) / (‖e1‖ ‖e2‖),
where e1 and e2 are the two embedded vectors respectively.
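A minimal sketch of this vector-matching step with NumPy; a production violation vector database would typically be searched through an approximate-nearest-neighbor index instead:

```python
# Minimal sketch: cosine similarity and a brute-force compliance check.
import numpy as np

def cosine_similarity(e1, e2):
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))

def compliance_similarity(query_vec, violation_vecs):
    # The similarity to the closest violation vector drives the verdict.
    return max(cosine_similarity(query_vec, v) for v in violation_vecs)
```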
Step S120, judging whether the information to be checked can be safely broadcast according to the compliance similarity of the vectors to be checked;
the content to be checked covered by the invention comprises media assets and live real-time streams, and these two sources have completely different requirements on playout timeliness; therefore, when the compliance similarity is higher than the specified threshold, different strategies are adopted according to the source, as shown at S120 of FIG. 1:
if violating content is detected in a live real-time stream, the violating content is replaced and then played; if the violating content is a media asset, it is directly intercepted and further escalated to staff for manual intervention.
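A minimal sketch of this source-dependent handling policy; the threshold value and the returned action names are illustrative, not part of the published method:

```python
# Minimal sketch: route content by source once the similarity is known.
SIMILARITY_THRESHOLD = 0.85  # assumed stand-in for the specified threshold

def handle(source, similarity):
    if similarity <= SIMILARITY_THRESHOLD:
        return "play"                      # compliant: play as-is
    if source == "live_stream":
        return "replace_then_play"         # swap out the violating segment
    return "intercept_for_manual_review"   # media asset: block and escalate
```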
The method is applied in practical IPTV scenarios: text, picture, audio and video matching is performed on user input, AI-produced content and AI response content, yielding a judgment of whether the material to be checked is similar to known violation features. The operator-side multi-GPU inference cluster responds quickly and can judge whether content is compliant within a very short time, improving both the user's operating experience and safe-production efficiency.
For example, when an IPTV user interacts with a large language model on the big screen, the actual content is checked both while the user types input and while the large model generates its response; if sensitive or violating words appear, the user is prompted that the current question cannot be answered.
For auditing AIGC content at release or creation time, a user creates picture, audio and video content through prompt words using AI capabilities and submits it for release; the large model's multimodal responses to the user's prompt words are traversed against the feature library to determine whether the content meets the safety requirements.
With the multimodal security fence method provided by the invention, a violation judgment for a new media asset or live stream can be made within 20 milliseconds on a Tesla T4 GPU, improving auditing efficiency, with a detection coverage rate above 99%.
The above disclosure presents only a few specific embodiments of the present invention, but the invention is not limited thereto; any variation that can be conceived by those skilled in the art shall fall within the protection scope of the present invention.