CN119066223B - Video question-answering method and device, electronic equipment and storage medium - Google Patents

Video question-answering method and device, electronic equipment and storage medium

Info

Publication number
CN119066223B
CN119066223B (application number CN202411570790.7A)
Authority
CN
China
Prior art keywords
target
search data
question
video
knowledge graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411570790.7A
Other languages
Chinese (zh)
Other versions
CN119066223A (en)
Inventor
徐聪
周永哲
吴忠人
黄鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd
Priority to CN202411570790.7A
Publication of CN119066223A
Application granted
Publication of CN119066223B
Legal status: Active (current)
Anticipated expiration


Abstract

Translated from Chinese

The present application relates to the field of artificial intelligence technology, and in particular to a video question-answering method and device, an electronic device, and a storage medium, for improving the accuracy of answers in video question answering. The method includes: obtaining a question input by an input object, and extracting a target entity from the question; querying, based on the target entity, the scene knowledge graph corresponding to the scene to which the target video belongs to obtain first search data, and querying, based on the target entity, the time sequence knowledge graph corresponding to the target video to obtain second search data, wherein the time sequence knowledge graph includes the interaction relationship between each candidate entity in the target video in each time period; and obtaining the answer to the question through a large language model LLM based on the first search data and the second search data. In this way, the LLM considers scene knowledge and time sequence knowledge when answering video questions, thereby improving the accuracy of the answers.

Description

Video question-answering method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a video question-answering method, a video question-answering device, electronic equipment and a storage medium.
Background
Video question answering refers to generating a natural-language answer to a given question by understanding the relevant video content. Video question answering is widely used in many scenarios, such as multimedia information retrieval, intelligent assistants, business services, and the like.
In the prior art, a Contrastive Language-Image Pre-training (CLIP) model is generally used for video question answering.
However, the CLIP model only supports question answering over a single still frame of the video, so the referenced information is one-sided, i.e., the available information is limited, resulting in poor accuracy of the obtained answers.
Disclosure of Invention
The embodiments of the present application provide a video question-answering method and device, an electronic device, and a storage medium, which improve the accuracy of answers.
The specific technical scheme provided by the embodiment of the application is as follows:
in a first aspect, a video question-answering method is provided, the method including:
acquiring a question input by an input object, and extracting a target entity from the question;
querying a scene knowledge graph corresponding to a scene to which a target video belongs based on the target entity to obtain first search data, and querying a time sequence knowledge graph corresponding to the target video based on the target entity to obtain second search data, wherein the time sequence knowledge graph comprises interaction relations among candidate entities in the target video in each time period;
Based on the first search data and the second search data, answers to the questions are obtained through a large language model LLM.
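For illustration only (not part of the claimed subject matter), the sketch below shows how the three steps of the first aspect could fit together in Python. The helper callables extract_entities, query_graph, and llm_answer are hypothetical stand-ins for the entity-extraction, graph-query, and LLM components; the patent does not prescribe these names or interfaces.

```python
# A minimal sketch of the first-aspect method; all helpers are assumed.
def video_qa(question, scene_kg, temporal_kg,
             extract_entities, query_graph, llm_answer):
    # Step 1: extract the target entities mentioned in the question.
    targets = extract_entities(question)

    # Step 2: query the scene knowledge graph (first search data) and the
    # time sequence knowledge graph (second search data) with those entities.
    first_search_data = [hit for e in targets for hit in query_graph(scene_kg, e)]
    second_search_data = [hit for e in targets for hit in query_graph(temporal_kg, e)]

    # Step 3: let the LLM answer, conditioned on both kinds of search data.
    return llm_answer(question, first_search_data, second_search_data)
```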
In a possible embodiment, before querying the scene knowledge graph corresponding to the scene to which the target video belongs based on the target entity to obtain the first search data, and querying the time sequence knowledge graph corresponding to the target video based on the target entity to obtain the second search data, the method further includes:
Extracting a direct relation among first candidate entities in a target document, and setting weights for the direct relation, wherein the target document contains related texts of scenes to which a target video belongs;
Extracting context proximity relations among first candidate entities in the target document, and setting weights for the context proximity relations;
And merging the direct relation and the context proximity relation among the first candidate entities to obtain the scene knowledge graph.
By this method, documents related to the scene are compiled into a scene knowledge graph, so that the LLM considers scene knowledge when answering questions; this reduces the chance of the LLM answering off the question and improves the accuracy of the answers.
In a possible embodiment, before querying the scene knowledge graph corresponding to the scene to which the target video belongs based on the target entity to obtain the first search data, and querying the time sequence knowledge graph corresponding to the target video based on the target entity to obtain the second search data, the method further includes:
Editing the picture information of the target video into time text information, wherein the time text information comprises occurrence time of each event and information of a second candidate entity corresponding to the occurrence time of each event;
and converting the time text information into a time sequence diagram according to the occurrence time of each event, and obtaining a time sequence knowledge graph.
By the method, the picture information of the target video is stored as the time sequence knowledge graph according to the event time, so that longer videos can be understood, more information in the videos can be obtained, and related video clips can be found more quickly.
In one possible embodiment, editing the picture information of the target video into time text information includes:
performing target detection on the target video to obtain attribute information of each second candidate entity;
performing target tracking on each second candidate entity in the target video to obtain tracking information of each second candidate entity;
based on the attribute information and the tracking information, time text information of the target video is generated.
In one possible embodiment, obtaining an answer to a question through a large language model LLM based on the first search data and the second search data includes:
screening, through the LLM, target search data from the first search data and the second search data that satisfies a correlation condition with the question;
taking the target search data as part of the prompt of the LLM, and inputting the prompt into the LLM to obtain an answer to the question.
By the method, the target search data is provided for the LLM, so that the accuracy of video answer can be improved.
In a second aspect, a video question answering apparatus is provided, the apparatus comprising:
The acquisition module is used for acquiring the question input by the input object and extracting a target entity from the question;
The query module is used for querying a scene knowledge graph corresponding to a scene to which the target video belongs based on the target entity to obtain first search data, and querying a time sequence knowledge graph corresponding to the target video based on the target entity to obtain second search data, wherein the time sequence knowledge graph comprises interaction relations among candidate entities in the target video in each time period;
and the processing module is used for obtaining answers to the questions through a large language model LLM based on the first search data and the second search data.
In a possible embodiment, before querying the scene knowledge graph corresponding to the scene to which the target video belongs based on the target entity to obtain the first search data, and querying the time sequence knowledge graph corresponding to the target video based on the target entity to obtain the second search data, the device further comprises a first generation module, the first generation module being used for:
Extracting a direct relation among first candidate entities in a target document, and setting weights for the direct relation, wherein the target document contains related texts of scenes to which a target video belongs;
Extracting context proximity relations among first candidate entities in the target document, and setting weights for the context proximity relations;
And merging the direct relation and the context proximity relation among the first candidate entities to obtain the scene knowledge graph.
In a possible embodiment, before querying the scene knowledge graph corresponding to the scene to which the target video belongs based on the target entity to obtain the first search data, and querying the time sequence knowledge graph corresponding to the target video based on the target entity to obtain the second search data, the device further comprises a second generation module, where the second generation module is configured to:
Editing the picture information of the target video into time text information, wherein the time text information comprises occurrence time of each event and information of a second candidate entity corresponding to the occurrence time of each event;
and converting the time text information into a time sequence diagram according to the occurrence time of each event, and obtaining a time sequence knowledge graph.
In one possible embodiment, when editing the picture information of the target video into the time text information, the second generating module is further configured to:
performing target detection on the target video to obtain attribute information of each second candidate entity;
performing target tracking on each second candidate entity in the target video to obtain tracking information of each second candidate entity;
based on the attribute information and the tracking information, time text information of the target video is generated.
In a possible embodiment, when obtaining an answer to a question by the large language model LLM based on the first search data and the second search data, the processing module is further configured to:
screening, through the LLM, target search data from the first search data and the second search data that satisfies a correlation condition with the question;
taking the target search data as part of the prompt of the LLM, and inputting the prompt into the LLM to obtain an answer to the question.
In a third aspect, the present application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any one of the first aspects when the program is executed.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of any of the first aspects above.
In a fifth aspect, the present application provides a computer program product comprising computer program code which, when run on a computer, causes the computer to perform the method of any of the first aspects.
In the embodiment of the application, the question input by the input object is acquired, the target entity in the question is extracted, then the scene knowledge graph corresponding to the scene to which the target video belongs is queried based on the target entity to obtain the first search data, the time sequence knowledge graph corresponding to the target video is queried based on the target entity to obtain the second search data, and finally the answer to the question is obtained through the large language model LLM based on the first search data and the second search data. In this way, the scene knowledge graph and the time sequence knowledge graph are combined with the LLM, so that the LLM considers scene knowledge and time sequence knowledge when answering video questions, hallucination and incoherent answers are reduced, the LLM is given the capability of understanding ultra-long videos, and the accuracy of video answers is improved.
Drawings
Fig. 1 is a schematic diagram of a possible application scenario in an embodiment of the present application;
FIG. 2 is a flowchart of a video question-answering method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of scene knowledge graph construction in an embodiment of the application;
FIG. 4 is a diagram of a time text message according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a time sequence diagram in an embodiment of the present application;
Fig. 6 is a schematic structural diagram of a video question answering device according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application. Embodiments of the application and features of the embodiments may be combined with one another arbitrarily without conflict. Also, while a logical order of illustration is depicted in the flowchart, in some cases the steps shown or described may be performed in a different order than presented.
The terms "first" and "second" in the description and claims of the application and in the above-mentioned figures are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the term "include" and any variations thereof are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus. The term "plurality" in the present application may mean at least two, for example, two, three or more, and embodiments of the present application are not limited.
Exemplary embodiments of the present application will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present application are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness. It should be noted that the embodiments of the present application may mention existing industry solutions such as software, components, and models; these should be regarded as exemplary, serving only to illustrate the feasibility of the technical solution of the present application, and do not mean that the applicant has used or must use those solutions.
In the technical scheme of the application, the acquisition, transmission, storage, use and the like of the data all meet the requirements of national relevant laws and regulations.
The following briefly describes the design concept of the embodiment of the present application:
Traditional video understanding methods (such as optical flow methods and time-series hidden Markov models) can handle tasks such as target tracking and motion state analysis.
With the development of deep learning, most tasks now use deep learning models for understanding and analyzing videos, such as Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), and three-dimensional convolutional neural networks (3D CNN).
Recently, large language models (Large Language Model, LLM) have evolved rapidly. The advent of large language models pre-trained on large-scale datasets has introduced a new in-context learning capability, which allows them to handle various tasks through prompts without requiring fine-tuning. The Chat Generative Pre-trained Transformer (ChatGPT) was the first breakthrough application based on this capability, including the ability to generate code and call other model tools or APIs. Many studies are exploring how to invoke the Application Programming Interface (API) of visual models using large language models like ChatGPT to solve problems in the computer vision field, including Visual-ChatGPT. The advent of instruction tuning has further enhanced the ability of these models to respond effectively to user requests and perform specific tasks. Large language models incorporating video understanding capabilities provide more sophisticated multimodal understanding, enabling them to handle and interpret complex interactions between visual and textual data. As with their impact in the field of Natural Language Processing (NLP), these models act as more general task solvers, benefiting from the massive knowledge base and context understanding obtained from large amounts of multimodal data, and handling a wider range of tasks. This not only enables them to understand visual content, but also to reason about it in a way that more closely approximates human understanding. Much work is also exploring the use of large language models for video understanding tasks (Vid-LLMs).
Video question answering refers to generating a natural-language answer to a given question by understanding the relevant video content. Video question answering is widely used in many scenarios, such as multimedia information retrieval, intelligent assistants, business services, and the like.
In the prior art, a Contrastive Language-Image Pre-training (CLIP) model is generally used for video question answering, or an LLM is used for video question answering.
However, the CLIP model only supports question answering over a single still frame of the video, so the referenced information is one-sided, i.e., the available information is limited, resulting in poor accuracy of the obtained answers. Meanwhile, when an LLM is used alone, because of the uniqueness of the usage scenario, the LLM may answer off the question, resulting in poor accuracy of the obtained answers.
In view of this, embodiments of the present application provide a video question-answering method, device, electronic device, and storage medium. A question input by an input object is obtained, and a target entity in the question is extracted; then a scene knowledge graph corresponding to the scene to which a target video belongs is queried based on the target entity to obtain first search data, and a time sequence knowledge graph corresponding to the target video is queried based on the target entity to obtain second search data, wherein the time sequence knowledge graph includes the interaction relationship between candidate entities in the target video in each time period; finally, an answer to the question is obtained through a large language model LLM based on the first search data and the second search data. In this way, scene knowledge and time sequence knowledge are combined with the LLM, so that the LLM considers both when answering video questions, entities specific to the scene can be identified, hallucination and incoherent answers are reduced, the LLM is given the capability of understanding ultra-long videos, and the accuracy of video answers is improved.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it being understood that the preferred embodiments described herein are for illustration and explanation only, and not for limitation of the present application, and that the embodiments of the present application and the features of the embodiments may be combined with each other without conflict.
Fig. 1 is a schematic diagram of a possible application scenario in an embodiment of the present application. The application scenario includes a server 110 and terminal devices 120 (including terminal device 1201, terminal device 1202, and so on).
The server 110 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms. The terminal device 120 and the server 110 may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
The terminal device 120 includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a desktop computer, an electronic book reader, an intelligent voice interaction device, an intelligent home appliance, a vehicle-mounted terminal, and the like, and various software such as an application program, an applet, and the like can be installed on the terminal device.
It should be noted that, the number of the terminal devices 120 and the servers 110 shown in fig. 1 is merely illustrative, and the number is not limited in practice, and the embodiment of the present application is not limited in detail.
It should be noted that, the video question-answering method in the embodiment of the present application may be deployed in a computing device, where the computing device may be a server or a terminal device, where the server may be the server 110 shown in fig. 1, or may be another server than the server 110 shown in fig. 1, and the terminal device may be the terminal device 120 shown in fig. 1, or may be another terminal device than the terminal device 120 shown in fig. 1. That is, the method may be executed by the server or the terminal device alone or may be executed by both the server and the terminal device together.
The video question answering method provided by the exemplary embodiments of the present application will be described below with reference to the accompanying drawings in conjunction with the application scenarios described above, and it should be noted that the application scenarios described above are only shown for the convenience of understanding the spirit and principles of the present application, and embodiments of the present application are not limited in this respect.
Referring to fig. 2, a flowchart of implementing a video question-answering method according to an embodiment of the present application is shown, where the specific implementation flow of the method is as follows:
Step 20: acquire the question input by the input object, and extract the target entity from the question.
In the embodiment of the application, an input interface is displayed on a terminal device. When the input object enters a question in the question-answer area of the input interface, a video question-answer request is triggered; the server responds to the video question-answer request, obtains the question input by the input object, performs information extraction on the question, and obtains the target entity corresponding to the question.
For example, when the input object wants to ask a question about a video, it enters the question text in the input box of the question-answering interface of the terminal device. For instance, to ask "Which people entered the yard this morning?", the input object clicks the input box, types the question, and then clicks a confirmation key to trigger a video question-answering request. Information extraction is performed on "Which people entered the yard this morning?" to obtain the target entity corresponding to the question, namely "people".
In addition, the input object can click a voice recognition button to input a question.
In this way, an input interface is presented for the input object to enter questions, which provides convenience and improves the user experience. Voice input is also supported, which saves the input object's text-entry time and improves the efficiency of intelligent question answering.
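As a rough illustration of step 20, one way to extract target entities is to delegate the extraction to an LLM. The prompt wording and the llm callable below are assumptions made for this sketch; the embodiment does not fix a particular extractor.

```python
# A hedged sketch of question entity extraction; the llm callable is assumed
# to take a prompt string and return a completion string.
ENTITY_PROMPT = (
    "List the entities mentioned in the following question, "
    "one per line, with no extra text.\n"
    "Question: {question}"
)

def extract_target_entities(question: str, llm) -> list[str]:
    reply = llm(ENTITY_PROMPT.format(question=question))
    return [line.strip() for line in reply.splitlines() if line.strip()]

# e.g. extract_target_entities("Which people entered the yard this morning?", llm)
# might return ["people", "yard"].
```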
In the embodiment of the application, the scene knowledge graph and the time sequence knowledge graph are constructed in advance.
Optionally, in the embodiment of the present application, a possible embodiment is provided for constructing a scene knowledge graph, and specifically the following operations are performed:
Step A1: extract the direct relations among the first candidate entities in the target document, and set weights for the direct relations.
The target document comprises relevant texts of scenes to which the target video belongs.
For example, the scene to which the target video belongs is a corridor, and the target document is a corridor-related document.
In the embodiment of the application, the target document is divided into text blocks, and the following operations are performed for each text block: extract the first candidate entities contained in the text block and the direct relations among them, and set weights for the direct relations.
Wherein the weights characterize the importance of the corresponding relationships.
For example, referring to fig. 3, a schematic diagram of scene knowledge graph construction in the embodiment of the application: the target document is split into text blocks, the direct relations among the first candidate entities contained in each text block are extracted, and weights are set for the direct relations. The direct relation between the cameras and the corridor is that cameras are deployed in the corridor, with a weight of 2; the direct relation between the corridor and the lamps is that the corridor is equipped with lamps, with a weight of 1.
Step A2: extract the context proximity relations among the first candidate entities in the target document, and set weights for the context proximity relations.
In the embodiment of the application, a context proximity relation is added between the first candidate entities, and the weight of the context proximity relation is set.
For example, the context proximity relations between the first candidate entities are extracted as shown in fig. 3: the context proximity relation between the cameras and the corridor is that a plurality of cameras are installed in the corridor, with a weight of 4; the context proximity relation between the corridor and the lamps is that ceiling lamps and spotlights are installed in the corridor, with a weight of 2.
Step A3: merge the direct relations and the context proximity relations among the first candidate entities to obtain the scene knowledge graph.
In the embodiment of the application, the direct relation and the context proximity relation between the first candidate entities are combined, and the weight of the direct relation and the weight of the context proximity relation are combined to obtain the scene knowledge graph.
The weights of the direct relationships and the weights of the context proximity relationships may be added to obtain the combined weights, which is not limited in the embodiment of the present application.
After merging, the relation between the cameras and the corridor is that cameras are deployed in the corridor and a plurality of cameras are installed in the corridor, with a weight of 7; the relation between the corridor and the lamps is that ceiling lamps and spotlights are installed in the corridor, with a weight of 3.
In addition, it should be noted that in the embodiment of the present application, the manner and details of constructing the scene knowledge graph may be adjusted for different scenes, which is not limited in the embodiment of the present application.
In this way, documents related to the scene are compiled into a knowledge graph and stored in a database, and knowledge of the related scene is matched when the LLM is queried, so that the LLM considers scene knowledge when answering questions; this reduces the chance of the LLM answering off the question and improves the accuracy of the answers.
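A possible shape for steps A1 to A3 is sketched below with networkx. The relation extractors are hypothetical placeholders (they could be rule-based or LLM-based); only the weight-summing merge of step A3 follows the text above.

```python
# A sketch of scene knowledge graph construction (steps A1-A3), assuming
# extractor callables that yield (head, relation, tail, weight) tuples.
import networkx as nx

def _merge_edge(graph, head, tail, relation, weight):
    # Step A3: coinciding edges are merged and their weights summed.
    if graph.has_edge(head, tail):
        graph[head][tail]["weight"] += weight
        graph[head][tail]["relations"].append(relation)
    else:
        graph.add_edge(head, tail, weight=weight, relations=[relation])

def build_scene_kg(text_blocks, extract_direct, extract_proximity):
    graph = nx.Graph()
    for block in text_blocks:
        # Step A1: direct relations, e.g. "cameras are deployed in the corridor".
        for head, rel, tail, w in extract_direct(block):
            _merge_edge(graph, head, tail, rel, w)
        # Step A2: context proximity relations between co-occurring entities.
        for head, rel, tail, w in extract_proximity(block):
            _merge_edge(graph, head, tail, rel, w)
    return graph
```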
Optionally, in the embodiment of the present application, a possible embodiment is provided for constructing a time sequence knowledge graph, and specifically the following operations are performed:
Step B1: edit the picture information of the target video into time text information.
The time text information comprises occurrence time of each event and information of a second candidate entity corresponding to the occurrence time of each event, wherein the information of the second candidate entity comprises attribute information and tracking information.
In the embodiment of the application, the picture information of each video frame of the target video is edited into time text information.
Referring to fig. 4, a schematic diagram of the time text information in an embodiment of the present application: the information of the second candidate entity corresponding to the event time 12:10:05 on September 1, 2026 includes: the entity type is person, the top type is short sleeve, the top color is white, the trousers type is shorts, the trousers color is black, the state is static, the position is [(10, 15), (40, 60)], and the entity id is 20. The information of the second candidate entity corresponding to the event time 12:13:25 on September 1, 2026 includes: the entity type is car, the car color is red, the car type is type 1, the state is moving, the position is [(20, 14), (30, 90)], and the entity id is 123.
Optionally, in the embodiment of the present application, a possible embodiment is provided for obtaining the time text information, and specifically the following operations are performed:
Step B10: perform target detection on the target video to obtain attribute information of each second candidate entity.
In the embodiment of the application, a target detection algorithm is adopted to carry out target detection on the target video, so as to obtain the attribute information of each second candidate entity.
The target detection algorithm may be a YOLO algorithm, and the attribute information of each second candidate entity includes, but is not limited to, basic attribute information such as entity type, color, entity id, etc., which is not limited in the embodiment of the present application.
Step B11: perform target tracking on each second candidate entity in the target video to obtain tracking information of each second candidate entity.
In the embodiment of the application, a target tracking algorithm is adopted to track the target video, so as to obtain the tracking information of each second candidate entity.
Wherein the tracking information of each second candidate entity includes, but is not limited to, status, location, etc.
Step B12: generate the time text information of the target video based on the attribute information and the tracking information.
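The sketch below shows one plausible reading of steps B10 to B12. The detector and tracker back ends are assumptions (the text only mentions YOLO as one option for detection), and the field names on the track object are invented for illustration; the record layout loosely mirrors the fig. 4 example.

```python
# A sketch of time text generation (steps B10-B12) under assumed
# detector/tracker interfaces.
def build_time_text(timestamped_frames, detector, tracker):
    records = []
    for timestamp, frame in timestamped_frames:
        detections = detector(frame)         # step B10: attribute information
        tracks = tracker.update(detections)  # step B11: tracking information
        for track in tracks:
            # Step B12: one time text record per entity per event time.
            records.append({
                "time": timestamp,
                "entity_id": track.entity_id,
                "entity_type": track.entity_type,
                "attributes": track.attributes,  # e.g. clothing or car color
                "state": track.state,            # e.g. "static" or "moving"
                "position": track.box,           # bounding-box corners
            })
    return records
```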
Step B2: convert the time text information into a time sequence diagram according to the occurrence time of each event to obtain the time sequence knowledge graph.
The time sequence diagram comprises the interaction relations among the second candidate entities in each time period, such as being static, moving away, approaching, entering, or leaving; the interaction relations are not limited in the embodiment of the application.
In the embodiment of the application, time text information is recorded into a graph form according to the occurrence time of each event.
For example, referring to fig. 5, a schematic diagram of a time sequence diagram in an embodiment of the present application: entity No. 20 (person) is relatively stationary with respect to entity No. 123 (car) from 12:10:05 to 12:12:08 on September 1, 2026; entity No. 123 (car) moves away from entity No. 20 (person) from 12:20:05 to 12:21:08 on September 1, 2026, and entity No. 20 (person) enters entity No. 6 (building).
In this way, the picture information of the target video is stored as a knowledge graph according to event time, taking the temporal information of the video into account. Compared with other video understanding methods, longer videos can be understood, more information in the video is obtained, and relevant video clips can be found more quickly during video queries.
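Continuing the same assumptions, step B2 could store the interactions as time-stamped edges of a multigraph, as sketched below. The infer_interactions helper, which derives (time period, entity, entity, interaction) tuples from the time text records, is hypothetical.

```python
# A sketch of step B2: turning time text records into a time sequence
# knowledge graph, one labelled edge per interaction per time period.
import networkx as nx

def build_temporal_kg(time_text_records, infer_interactions):
    graph = nx.MultiDiGraph()
    for (start, end), entity_a, entity_b, interaction in infer_interactions(time_text_records):
        # e.g. entity 20 "relatively stationary with respect to" entity 123
        # from 12:10:05 to 12:12:08, as in the fig. 5 example.
        graph.add_edge(entity_a, entity_b,
                       interaction=interaction, start=start, end=end)
    return graph
```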
Step 21: query the scene knowledge graph corresponding to the scene to which the target video belongs based on the target entity to obtain first search data, and query the time sequence knowledge graph corresponding to the target video based on the target entity to obtain second search data.
The time sequence knowledge graph comprises interaction relations among candidate entities in the target video in each time period, and the scene knowledge graph comprises relations among related candidate entities of the scene of the target video.
Step 22: obtain an answer to the question through the large language model LLM based on the first search data and the second search data.
In the embodiment of the application, after the first search data and the second search data corresponding to the target entity are obtained, the answer to the question is obtained through the large language model LLM based on the first search data and the second search data.
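For step 21, a simple neighbourhood lookup over the graphs built above would already yield usable search data; this is only one possible query mechanism, since the embodiment leaves the retrieval details open.

```python
# A sketch of step 21: fetch every edge touching the target entity, with its
# attributes (relation text and weight for the scene graph; interaction and
# time span for the time sequence graph).
def query_kg(graph, target_entity):
    if target_entity not in graph:
        return []
    return list(graph.edges(target_entity, data=True))

# first_search_data = query_kg(scene_kg, "person")
# second_search_data = query_kg(temporal_kg, "person")
```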
Optionally, in the embodiment of the present application, a possible embodiment is provided for obtaining an answer to a question through LLM, specifically performing the following operations:
Step 220: through the LLM, screen out target search data from the first search data and the second search data that satisfies the correlation condition with the question.
In the embodiment of the application, through the LLM, data whose correlation with the question is greater than a correlation threshold is screened out from the first search data and the second search data and used as the target search data.
Step 221: take the target search data as part of the prompt of the LLM, and input the prompt into the LLM to obtain an answer to the question.
In the embodiment of the application, the target search data is used as part of the prompt of the LLM, and the prompt and the question are input into the LLM to obtain the answer to the question.
For example, assume the question is "Which people entered the yard this morning?"; the answer would be "Three people entered the courtyard this morning: A, B, and C."
In this way, providing the LLM with target search data that is useful and comprehensive for answering the question can improve the accuracy of video answers.
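A hedged sketch of steps 220 and 221 follows. The scoring prompt, the 0.5 threshold, and the llm callable are assumptions; the embodiment only requires that the LLM screen the search data for relevance and that the surviving data become part of the prompt.

```python
# A sketch of relevance screening (step 220) and prompt assembly (step 221).
RELEVANCE_PROMPT = (
    "On a scale of 0 to 1, how relevant is this fact to the question?\n"
    "Question: {question}\nFact: {fact}\nAnswer with a number only."
)
ANSWER_PROMPT = (
    "Answer the question using only the facts below.\n"
    "Facts:\n{facts}\nQuestion: {question}"
)

def answer_with_llm(question, first_search_data, second_search_data, llm,
                    threshold=0.5):
    # Step 220: keep only search data whose correlation exceeds the threshold.
    target_search_data = [
        fact for fact in list(first_search_data) + list(second_search_data)
        if float(llm(RELEVANCE_PROMPT.format(question=question, fact=fact))) > threshold
    ]
    # Step 221: fold the surviving facts into the prompt and ask for the answer.
    facts = "\n".join(str(fact) for fact in target_search_data)
    return llm(ANSWER_PROMPT.format(facts=facts, question=question))
```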
Based on the same inventive concept, the embodiment of the present application further provides a video question-answering device, referring to fig. 6, which is a schematic structural diagram of the video question-answering device in the embodiment of the present application, and specifically includes:
the acquiring module 601 is configured to acquire a question input by an input object, and extract a target entity in the question;
The query module 602 is configured to query, based on a target entity, a scene knowledge graph corresponding to a scene to which the target video belongs, obtain first search data, and query, based on the target entity, a time sequence knowledge graph corresponding to the target video, obtain second search data, where the time sequence knowledge graph includes an interaction relationship between candidate entities in the target video in each time period;
The processing module 603 is configured to obtain an answer to the question through the large language model LLM based on the first search data and the second search data.
In one possible embodiment, before querying the scene knowledge graph corresponding to the scene to which the target video belongs based on the target entity to obtain the first search data, and querying the time sequence knowledge graph corresponding to the target video based on the target entity to obtain the second search data, the device further includes a first generating module 604, where the first generating module 604 is configured to:
Extracting a direct relation among first candidate entities in a target document, and setting weights for the direct relation, wherein the target document contains related texts of scenes to which a target video belongs;
Extracting context proximity relations among first candidate entities in the target document, and setting weights for the context proximity relations;
And merging the direct relation and the context proximity relation among the first candidate entities to obtain the scene knowledge graph.
In one possible embodiment, before querying the scene knowledge graph corresponding to the scene to which the target video belongs based on the target entity to obtain the first search data, and querying the time sequence knowledge graph corresponding to the target video based on the target entity to obtain the second search data, the device further includes a second generating module 605, where the second generating module 605 is configured to:
Editing the picture information of the target video into time text information, wherein the time text information comprises occurrence time of each event and information of a second candidate entity corresponding to the occurrence time of each event;
and converting the time text information into a time sequence diagram according to the occurrence time of each event, and obtaining a time sequence knowledge graph.
In one possible embodiment, when editing the picture information of the target video into the time text information, the second generating module 605 is further configured to:
performing target detection on the target video to obtain attribute information of each second candidate entity;
performing target tracking on each second candidate entity in the target video to obtain tracking information of each second candidate entity;
based on the attribute information and the tracking information, time text information of the target video is generated.
In a possible embodiment, when obtaining an answer to a question by the large language model LLM based on the first search data and the second search data, the processing module 603 is further configured to:
screening, through the LLM, target search data from the first search data and the second search data that satisfies a correlation condition with the question;
taking the target search data as part of the prompt of the LLM, and inputting the prompt into the LLM to obtain an answer to the question.
Based on the above embodiments, referring to fig. 7, a schematic structural diagram of an electronic device according to an embodiment of the present application is shown.
Embodiments of the present application provide an electronic device that may include a processor 710 (Central Processing Unit, CPU), a memory 720, an input device 730, an output device 740, and the like. The input device 730 may include a keyboard, a mouse, a touch screen, and the like, and the output device 740 may include a display device, such as a liquid crystal display (LCD), a cathode ray tube (CRT), and the like.
Memory 720 may include Read Only Memory (ROM) and Random Access Memory (RAM) and provides processor 710 with program instructions and data stored in memory 720. In an embodiment of the present application, the memory 720 may be used to store a program of any of the video question-answering methods in the embodiment of the present application.
The processor 710 is configured to call the program instructions stored in the memory 720 and execute any of the video question-answering methods in the embodiments of the present application in accordance with the obtained program instructions.
Based on the above embodiments, in the embodiments of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the video question-answering method in any of the above method embodiments.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (8)

Translated from Chinese

1. A video question-answering method, comprising:
obtaining a question input by an input object, and extracting a target entity from the question;
extracting direct relations between first candidate entities in a target document and setting weights for the direct relations, wherein the target document contains text related to a scene to which a target video belongs;
extracting context proximity relations between the first candidate entities in the target document, and setting weights for the context proximity relations;
merging the direct relations and the context proximity relations between the first candidate entities to obtain a scene knowledge graph corresponding to the scene to which the target video belongs;
querying the scene knowledge graph based on the target entity to obtain first search data, and querying a time sequence knowledge graph corresponding to the target video based on the target entity to obtain second search data, wherein the time sequence knowledge graph comprises the interaction relations between candidate entities in the target video in each time period;
obtaining an answer to the question through a large language model LLM based on the first search data and the second search data.

2. The method according to claim 1, wherein before querying the scene knowledge graph corresponding to the scene to which the target video belongs based on the target entity to obtain the first search data, and querying the time sequence knowledge graph corresponding to the target video based on the target entity to obtain the second search data, the method further comprises:
editing the picture information of the target video into time text information, the time text information comprising the occurrence time of each event and the information of the second candidate entity corresponding to the occurrence time of each event;
converting the time text information into a time sequence diagram according to the occurrence time of each event to obtain the time sequence knowledge graph.

3. The method according to claim 2, wherein editing the picture information of the target video into time text information comprises:
performing target detection on the target video to obtain attribute information of each second candidate entity;
performing target tracking on each second candidate entity in the target video to obtain tracking information of each second candidate entity;
generating the time text information of the target video based on the attribute information and the tracking information.

4. The method according to claim 1, wherein obtaining the answer to the question through a large language model LLM based on the first search data and the second search data comprises:
screening, through the LLM, target search data from the first search data and the second search data that satisfies a correlation condition with the question;
taking the target search data as part of the prompt of the LLM, and inputting the prompt into the LLM to obtain the answer to the question.

5. A video question-answering device, comprising:
an acquisition module, configured to acquire a question input by an input object and extract a target entity from the question;
a first generation module, configured to extract direct relations between first candidate entities in a target document and set weights for the direct relations, wherein the target document contains text related to a scene to which a target video belongs; extract context proximity relations between the first candidate entities in the target document and set weights for the context proximity relations; and merge the direct relations and the context proximity relations between the first candidate entities to obtain a scene knowledge graph corresponding to the scene to which the target video belongs;
a query module, configured to query the scene knowledge graph based on the target entity to obtain first search data, and query a time sequence knowledge graph corresponding to the target video based on the target entity to obtain second search data, wherein the time sequence knowledge graph comprises the interaction relations between candidate entities in the target video in each time period;
a processing module, configured to obtain an answer to the question through a large language model LLM based on the first search data and the second search data.

6. The device according to claim 5, wherein before querying the scene knowledge graph corresponding to the scene to which the target video belongs based on the target entity to obtain the first search data, and querying the time sequence knowledge graph corresponding to the target video based on the target entity to obtain the second search data, the device further comprises a second generation module, the second generation module being configured to:
edit the picture information of the target video into time text information, the time text information comprising the occurrence time of each event and the information of the second candidate entity corresponding to the occurrence time of each event;
convert the time text information into a time sequence diagram according to the occurrence time of each event to obtain the time sequence knowledge graph.

7. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 4 when executing the program.

8. A computer-readable storage medium having a computer program stored thereon, wherein the computer program implements the steps of the method according to any one of claims 1 to 4 when executed by a processor.
CN202411570790.7A | Priority date 2024-11-05 | Filing date 2024-11-05 | Video question-answering method and device, electronic equipment and storage medium | Active | CN119066223B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202411570790.7A (CN119066223B) | 2024-11-05 | 2024-11-05 | Video question-answering method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202411570790.7A (CN119066223B) | 2024-11-05 | 2024-11-05 | Video question-answering method and device, electronic equipment and storage medium

Publications (2)

Publication Number | Publication Date
CN119066223A (en) | 2024-12-03
CN119066223B (en) | 2025-01-17

Family

Family ID: 93639879

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202411570790.7A (Active; CN119066223B) | Video question-answering method and device, electronic equipment and storage medium | 2024-11-05 | 2024-11-05

Country Status (1)

Country | Link
CN | CN119066223B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN119760151B (en)* | 2024-12-10 | 2025-10-03 | Beijing Baidu Netcom Science and Technology Co Ltd | Knowledge graph generation method, question answering method and device, intelligent agent, equipment, medium and product

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN116541490A (en)* | 2023-03-27 | 2023-08-04 | Shandong University | Complex scene video question-answering method and system for cloud service robot
CN118093956A (en)* | 2023-12-16 | 2024-05-28 | South China University of Technology | Question answering method for multi-granularity time sequence knowledge graph

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN112836057B (en)* | 2019-11-22 | 2024-03-26 | Huawei Technologies Co Ltd | Knowledge graph generation method, device, terminal and storage medium
CN114120166B (en)* | 2021-10-14 | 2023-09-22 | Beijing Baidu Netcom Science and Technology Co Ltd | Video question and answer method, device, electronic equipment and storage medium
CN114328947A (en)* | 2021-11-23 | 2022-04-12 | Taikang Insurance Group Co Ltd | Knowledge graph-based question and answer method and device
CN115391511B (en)* | 2022-08-29 | 2025-05-27 | BOE Technology Group Co Ltd | Video question-answering method, device, system and storage medium
CN117370608A (en)* | 2023-09-20 | 2024-01-09 | Sun Yat-sen University | Video question-answering method and system with interpretability and knowledge inspiring capability
CN117648423A (en)* | 2023-12-12 | 2024-03-05 | China Telecom Corp Ltd | Question and answer method and device based on time sequence knowledge graph
CN118865196B (en)* | 2024-06-25 | 2025-08-29 | Zhejiang University | A video understanding method and system based on large language model

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN116541490A (en)* | 2023-03-27 | 2023-08-04 | Shandong University | Complex scene video question-answering method and system for cloud service robot
CN118093956A (en)* | 2023-12-16 | 2024-05-28 | South China University of Technology | Question answering method for multi-granularity time sequence knowledge graph

Also Published As

Publication Number | Publication Date
CN119066223A (en) | 2024-12-03

Similar Documents

Publication | Title
US11899681B2 (en) | Knowledge graph building method, electronic apparatus and non-transitory computer readable storage medium
KR20190139751A (en) | Method and apparatus for processing video
CN113569037A (en) | A message processing method, device and readable storage medium
US20100100371A1 (en) | Method, System, and Apparatus for Message Generation
US20210049354A1 (en) | Human object recognition method, device, electronic apparatus and storage medium
CN119066223B (en) | Video question-answering method and device, electronic equipment and storage medium
KR102122918B1 (en) | Interactive question-answering apparatus and method thereof
CN117932022A (en) | Intelligent question-answering method and device, electronic equipment and storage medium
CN114064943A (en) | Conference management method, conference management device, storage medium and electronic equipment
CN117349515A (en) | Search processing methods, electronic devices and storage media
CN118277588A (en) | Query request processing method, electronic device and storage medium
KR20250044145A (en) | Application prediction based on a visual search determination
CN117093600A (en) | Search prompt word generation method and device, electronic equipment and storage medium
CN112507139A (en) | Knowledge graph-based question-answering method, system, equipment and storage medium
CN111223014B (en) | Method and system for online generation of subdivision scene teaching courses from a large number of subdivision teaching contents
CN118760759B (en) | Document-oriented question-answering method, device, electronic device, storage medium and product
US20180293299A1 (en) | Query processing
CN117093695A (en) | Knowledge graph-based community intelligent question and answer method and device and electronic equipment
CN116303975B (en) | Training method of recall model, recall method and related equipment
CN114301886A (en) | Multimedia resource identification method, device, equipment and storage medium
CN112765447A (en) | Data searching method and device and electronic equipment
CN119781888B (en) | Information processing method, apparatus, device and storage medium
US20250165529A1 (en) | Interactive real-time video search based on knowledge graph
CN120493947B (en) | A method, device and equipment for constructing multimodal cross-domain question-answering data
CN120216660A (en) | Information retrieval method and device

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
