Disclosure of Invention
In order to achieve the above purpose, the present application provides a multimodal security fence method based on vector matching, comprising the following steps:
Acquiring information to be checked, loading a pre-trained model set, and vectorizing the information to be checked to generate vectors to be checked, wherein the information to be checked comprises live real-time streams and newly added media files;
establishing connection with a violation vector database set, wherein the violation vector database set comprises vector features of various types of violation data;
obtaining the compliance similarity of the vectors to be checked according to the violation vector database set, and judging whether the information to be checked can be safely played;
The pre-trained model set is composed of a text vector model, a picture vector model and a voice vector model, and is used for generating vector information from the various types of information to be checked, wherein the types of information to be checked comprise text, pictures, audio and video.
The types of violation data likewise comprise text, pictures, audio, and video.
When the type is text, the vectorization is realized by text vectorization;
The text vectorization refers to processing the text content through the text vector model to obtain a high-dimensional vector representation;
the text vector model is trained and constructed from semantic vectors, Chinese STS data sets and self-labeled data.
When the type is a picture, the vectorization processing comprises cropping and converting the picture to generate a pre-identified picture, then loading the picture vector model to generate a vector representation of the pre-identified picture, wherein the picture vector model is trained on the Flickr30K-CN data set and self-labeled data;
and when the type is a picture, the vectorization processing further comprises recognizing the text information in the information to be checked, performing text vectorization, and generating a text-based vector representation.
When the type is audio, the vectorization processing comprises performing speech recognition on the audio to generate text information, and performing text vectorization on the text information to generate a text-based vector representation;
when the type is audio, the vectorization processing further comprises loading a timbre vector model, extracting timbre from the audio, and obtaining a vectorized representation of the timbre.
Further, timbre extraction refers to detecting voice activity in the audio, splitting out the speech segments of multiple speakers, extracting the speakers' voiceprints from the speech segments, and obtaining the vectorized representation of timbre through the timbre vector model;
the timbre vector model is a voiceprint recognition model trained on a sensitive data set, the content of which is the voiceprint features of pre-defined sensitive sounds.
When the type is video, the vectorization processing comprises:
Acquiring scene features of the video, and obtaining vector representations of the scene features;
acquiring picture-change frames in the video, acquiring the scene features of the picture-change frames, and obtaining vector representations of those scene features;
when obtaining the vector representation of scene features, the type is converted to picture and the picture vectorization processing is executed.
Wherein, obtaining the frame of picture change in the video refers to:
Extracting the HSV color space of a specified video frame, and comparing it with the previous frame by pixel-value differencing;
obtaining the average differences of corresponding pixels in hue, saturation and value, and weighting and summing these differences with their corresponding weights to generate a feature difference value;
and when the feature difference value is larger than a specified threshold, the specified video frame is a picture-change frame.
Further, the compliance similarity of the vectors to be audited is obtained through vector matching;
vector matching uses cosine similarity to measure the similarity between two embedded vectors, expressed as cos(e1, e2) = (e1 · e2) / (‖e1‖ ‖e2‖), where e1 and e2 are the two embedded vectors respectively.
Further, judging whether the information to be checked can be safely broadcast comprises adopting different strategies according to the source of the information when the compliance similarity is higher than a specified threshold: if violating content is detected in a live real-time stream, the violating content is replaced and then broadcast; if the violating content is a media asset, it is directly intercepted and further escalated to staff for manual intervention.
The invention performs all-round detection on the text, pictures, audio and video of media assets: newly added media assets are security-checked and intercepted, the broadcast of existing media assets and live real-time streams is dynamically detected and modified, and violating text, pictures, audio and video are cut out or replaced, which ensures broadcast safety and greatly improves content-auditing efficiency.
Detailed Description
The vector-based multimodal security fence method first collects violating media-asset data or sensitive voiceprint features as the violation supervision standard, vectorizes and stores them as the violation vector database set, generates multiple vectors for the content to be checked using the pre-trained model set, and then performs vector matching; content to be broadcast with a high matching degree is blocked and intercepted, or further classified so that a per-class processing strategy can be executed.
The following describes in detail the specific implementation of the present invention with reference to the drawings accompanying the specification.
Before the method provided by the invention is executed, a violation vector database set is established. The violation vector database set is a feature set comprising a large number of violating texts, pictures and timbres, where the timbres include the typical sound features of specified sensitive persons, by which those persons' speech can be identified. The violation vector database set is an updatable feature library; staff can continuously add specific violating content and the sound features of sensitive groups as required.
The steps of the vector-based multimodal security fence method provided by the invention are shown in FIG. 1, and the method comprises the following steps:
Step S100, obtaining information to be checked, loading the pre-trained model set, and vectorizing the information to be checked to generate vectors to be checked;
the information to be checked supported by the method comprises live real-time streams and newly added media files; generally, live real-time streams impose higher requirements on auditing accuracy and timeliness.
In the method, the type of the information to be audited is first determined, the types comprising text, picture, audio and video; a different model is loaded from the pre-trained model set for each type, and a single piece of picture, audio or video information may generate multiple pieces of vector data.
Specifically, as shown in FIG. 2, the pre-trained model set is composed of a text vector model, a picture vector model and a voice vector model, and is used for generating vector information from the various types of information to be checked.
1) When the type is text, the vectorization is realized by text vectorization;
the text vectorization refers to processing text content through a text vector model to obtain high-dimensional vector representation;
The text vector model is trained and constructed from semantic vectors, Chinese STS data sets and self-labeled data.
For example, the model is trained with the CoSENT method on top of the pre-trained embedding model BAAI/bge-large-zh-noinstruct, using a manually curated Chinese STS dataset supplemented with self-labeled training data. Compared with the base model, a model trained in this way is noticeably more discriminative and more accurate;
meanwhile, the text vector model is also used in auditing picture and audio content, via the recognized text and transcribed speech respectively.
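A minimal sketch of this text-vectorization step, assuming a sentence-transformers-compatible checkpoint; the fine-tuning itself is omitted, and a CoSENT fine-tuned checkpoint would be loaded the same way:

```python
# Minimal sketch: load the base embedding model named above and embed text.
# The model name and the normalization choice are illustrative assumptions.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-zh-noinstruct")

def vectorize_text(texts):
    # One high-dimensional embedding per input string; normalized embeddings
    # make cosine similarity reduce to a plain dot product.
    return model.encode(texts, normalize_embeddings=True)
```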
2) When the type is a picture, the vectorization processing comprises cropping and converting the picture to generate a pre-identified picture, then loading the picture vector model and generating a vector representation of the pre-identified picture from the picture's own features;
the picture vector model is trained on top of OpenAI CLIP, using the Flickr30K-CN data set supplemented with self-labeled data;
when the type is a picture, the vectorization processing also generates a vector for the picture's text features: OCR is used to recognize the text information in the information to be checked (the picture), text vectorization is performed, and a text-based vector representation is generated.
Based on the above operations, two types of vector data may be generated for a picture, one based on the characteristics of the picture itself and the other based on the text characteristics.
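A minimal sketch of these two picture vectors, assuming the Hugging Face CLIP implementation and pytesseract for OCR; a Flickr30K-CN fine-tuned checkpoint would replace the stock CLIP weights named here:

```python
# Minimal sketch: one vector from the picture's own features via CLIP,
# one from its OCR text via the text pipeline sketched earlier.
from PIL import Image
import pytesseract
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def picture_vectors(path):
    image = Image.open(path).convert("RGB")
    # Vector 1: visual features from the CLIP image tower.
    inputs = clip_processor(images=image, return_tensors="pt")
    image_vec = clip_model.get_image_features(**inputs)[0]
    # Vector 2: text recognized in the picture, embedded by the text model.
    ocr_text = pytesseract.image_to_string(image, lang="chi_sim")
    text_vec = vectorize_text([ocr_text])[0] if ocr_text.strip() else None
    return image_vec, text_vec
```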
3) When the type is audio, the vectorization processing comprises performing speech recognition on the audio to generate text information, and performing text vectorization on the text information to generate a text-based vector representation;
at the same time, timbre extraction is performed: the timbre vector model in the pre-trained model set is loaded, timbre is extracted from the audio, and a vectorized representation of the corresponding timbre is obtained.
Specifically, timbre extraction refers to detecting speakers' voice activity in the audio, splitting out the speech segments of multiple speakers, extracting the speakers' voiceprints from the speech segments, and obtaining the vectorized representation of timbre through the timbre vector model;
the timbre vector model is generated by ECAPA-TDNN training on a sensitive data set, the content of which is the voiceprint features of pre-defined sensitive sounds, for example the voices of performers with misconduct records in the IPTV industry and the voiceprint features of other common sensitive sounds.
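A minimal sketch of this timbre branch, using SpeechBrain's public ECAPA-TDNN speaker-embedding model as a stand-in for the timbre vector model trained on the sensitive data set; voice-activity detection and diarization are elided and a single speech segment is embedded directly:

```python
# Minimal sketch: embed a 16 kHz mono speech segment into a voiceprint vector.
import torchaudio
from speechbrain.pretrained import EncoderClassifier

spk_model = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb")

def timbre_vector(wav_path):
    signal, _sr = torchaudio.load(wav_path)
    # One ECAPA-TDNN voiceprint embedding per segment; it is later matched
    # against the sensitive voiceprint set by cosine similarity.
    return spk_model.encode_batch(signal)
```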
4) When the type is video, the vectorization processing comprises:
The method comprises acquiring scene features of the video and obtaining vector representations of those features: specifically, video scene-change detection is first performed to obtain split video frames; the frames are then treated as pictures, and the vector representation of each video scene is obtained through the picture vector model.
An input video contains a large number of frames; attending to every frame would incur heavy computation and redundant information. The invention therefore takes the picture vectors of the frames within a similar scene and applies average pooling over them to obtain a single feature representation of that scene, thereby compressing the features of similar scenes;
on the other hand, for frames with large picture-to-picture variation, i.e. for a changed scene, this step captures the picture-change frame, acquires the scene features of the picture-change frame, and then obtains a vector representation of those features.
Specifically, the method for acquiring the picture change frame in the video comprises the following steps:
Extracting the HSV color space of a specified video frame, and comparing it with the previous frame by pixel-value differencing;
obtaining the average differences of corresponding pixels in hue, saturation and value, and weighting and summing these differences with their corresponding weights to generate a feature difference value, i.e. D = wH·ΔH + wS·ΔS + wV·ΔV;
and when the feature difference value is larger than the specified threshold, the specified video frame is a picture-change frame.
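A minimal sketch of the HSV picture-change test and the scene pooling described above, using OpenCV; the channel weights and the threshold are illustrative placeholders, not values fixed by the method:

```python
# Minimal sketch: per-channel mean HSV difference, weighted into one value.
import cv2
import numpy as np

W_H, W_S, W_V = 0.5, 0.25, 0.25  # assumed hue/saturation/value weights
THRESHOLD = 30.0                 # assumed feature-difference threshold

def is_change_frame(prev_bgr, curr_bgr):
    prev_hsv = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    curr_hsv = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    # Mean absolute difference of corresponding pixels, per channel: [dH, dS, dV].
    d = np.abs(curr_hsv - prev_hsv).mean(axis=(0, 1))
    return (W_H * d[0] + W_S * d[1] + W_V * d[2]) > THRESHOLD

def scene_vector(frame_vectors):
    # Average pooling of the picture vectors within one similar scene.
    return np.mean(frame_vectors, axis=0)
```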
Step S110, establishing a connection with the violation vector database set, wherein the violation vector database set comprises the vector features of multiple types of violation data, the types comprising text, pictures, audio and video;
in this step, the compliance similarity of the vectors to be checked is obtained according to the violation vector database set.
Cosine similarity is used to measure the similarity between two embedded vectors, expressed as:
cos(e1, e2) = (e1 · e2) / (‖e1‖ ‖e2‖),
where e1 and e2 are the two embedded vectors respectively.
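A minimal sketch of this vector-matching step with NumPy; a production violation vector database would typically be searched through an approximate-nearest-neighbor index instead:

```python
# Minimal sketch: cosine similarity and a brute-force compliance check.
import numpy as np

def cosine_similarity(e1, e2):
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))

def compliance_similarity(query_vec, violation_vecs):
    # The similarity to the closest violation vector drives the verdict.
    return max(cosine_similarity(query_vec, v) for v in violation_vecs)
```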
Step S120, judging whether the information to be checked can be safely broadcast according to the compliance similarity of the vectors to be checked;
the content to be checked covered by the invention comprises media assets and live real-time streams, and these two sources have completely different requirements on playout timeliness; therefore, when the compliance similarity is higher than the specified threshold, different strategies are adopted according to the source, as shown at S120 of FIG. 1:
if violating content is detected in a live real-time stream, the violating content is replaced and then played; if the violating content is a media asset, it is directly intercepted and further escalated to staff for manual intervention.
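A minimal sketch of this source-dependent handling policy; the threshold value and the returned action names are illustrative, not part of the published method:

```python
# Minimal sketch: route content by source once the similarity is known.
SIMILARITY_THRESHOLD = 0.85  # assumed stand-in for the specified threshold

def handle(source, similarity):
    if similarity <= SIMILARITY_THRESHOLD:
        return "play"                      # compliant: play as-is
    if source == "live_stream":
        return "replace_then_play"         # swap out the violating segment
    return "intercept_for_manual_review"   # media asset: block and escalate
```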
The method is applied in practical IPTV scenarios: text, picture, audio and video matching is performed on user input, AI-produced content and AI response content, yielding a judgment of whether the material to be checked is similar to known violation features. The operator-side multi-GPU inference cluster responds quickly and can judge whether content is compliant within a very short time, improving both the user's operating experience and safe-production efficiency.
For example, when an IPTV user interacts with a large language model on the big screen, the actual content is checked both while the user types input and while the large model generates its response; if sensitive or violating words appear, the user is prompted that the current question cannot be answered.
For auditing AIGC content at release or creation time, a user creates picture, audio and video content through prompt words using AI capabilities and submits it for release; the large model's multimodal responses to the user's prompt words are traversed against the feature library to determine whether the content meets the safety requirements.
With the multimodal security fence method provided by the invention, a violation judgment for a new media asset or live stream can be made within 20 milliseconds on a Tesla T4 GPU, improving auditing efficiency, with a detection coverage rate above 99%.
The above disclosure presents only a few specific embodiments of the present invention, but the invention is not limited thereto; any variation that can be conceived by those skilled in the art shall fall within the protection scope of the present invention.