Disclosure of Invention
Embodiments of the present application provide a video question-answering method and apparatus, an electronic device, and a storage medium, which improve the accuracy of answers.
The specific technical scheme provided by the embodiment of the application is as follows:
In a first aspect, a video question-answering method is provided, the method including:
acquiring a question input by an input object, and extracting a target entity in the question;
querying a scene knowledge graph corresponding to a scene to which a target video belongs based on the target entity to obtain first search data, and querying a time sequence knowledge graph corresponding to the target video based on the target entity to obtain second search data, wherein the time sequence knowledge graph includes interaction relations among candidate entities in the target video in each time period;
and obtaining an answer to the question through a large language model LLM based on the first search data and the second search data.
In a possible embodiment, before querying the scene knowledge graph corresponding to the scene to which the target video belongs based on the target entity to obtain the first search data and querying the time sequence knowledge graph corresponding to the target video based on the target entity to obtain the second search data, the method further includes:
extracting direct relations among first candidate entities in a target document, and setting weights for the direct relations, wherein the target document contains text related to the scene to which the target video belongs;
extracting context proximity relations among the first candidate entities in the target document, and setting weights for the context proximity relations;
and merging the direct relations and the context proximity relations among the first candidate entities to obtain the scene knowledge graph.
In this way, documents related to the scene are organized into the scene knowledge graph, so that the LLM takes scene knowledge into account when answering questions, the problem of the LLM giving irrelevant answers is alleviated, and the accuracy of answers is improved.
In a possible embodiment, before querying the scene knowledge graph corresponding to the scene to which the target video belongs based on the target entity to obtain the first search data and querying the time sequence knowledge graph corresponding to the target video based on the target entity to obtain the second search data, the method further includes:
editing picture information of the target video into time text information, wherein the time text information includes an occurrence time of each event and information of a second candidate entity corresponding to the occurrence time of each event;
and converting the time text information into a time sequence diagram according to the occurrence time of each event, to obtain the time sequence knowledge graph.
In this way, the picture information of the target video is stored as a time sequence knowledge graph organized by event time, so that longer videos can be understood, more information in the videos can be obtained, and relevant video clips can be found more quickly.
In one possible embodiment, editing the picture information of the target video into time text information includes:
performing target detection on the target video to obtain attribute information of each second candidate entity;
performing target tracking on each second candidate entity in the target video to obtain tracking information of each second candidate entity;
based on the attribute information and the tracking information, time text information of the target video is generated.
In one possible embodiment, obtaining an answer to a question through a large language model LLM based on the first search data and the second search data, includes:
screening out, through the LLM, target search data in the first search data and the second search data that meets a correlation condition with the question;
and taking the target search data as part of a prompt for the LLM, and inputting the prompt into the LLM to obtain the answer to the question.
In this way, the target search data is provided to the LLM, which can improve the accuracy of video answers.
In a second aspect, a video question answering apparatus is provided, the apparatus comprising:
The acquisition module is used for acquiring the question input by the input object and extracting a target entity in the question;
The query module is used for querying a scene knowledge graph corresponding to a scene to which the target video belongs based on the target entity to obtain first search data, and querying a time sequence knowledge graph corresponding to the target video based on the target entity to obtain second search data, wherein the time sequence knowledge graph comprises interaction relations among candidate entities in the target video in each time period;
and the processing module is used for obtaining answers to the questions through a large language model LLM based on the first search data and the second search data.
In a possible embodiment, before the scene knowledge graph corresponding to the scene to which the target video belongs is queried based on the target entity to obtain the first search data and the time sequence knowledge graph corresponding to the target video is queried based on the target entity to obtain the second search data, the apparatus further comprises a first generation module, the first generation module being used for:
Extracting a direct relation among first candidate entities in a target document, and setting weights for the direct relation, wherein the target document contains related texts of scenes to which a target video belongs;
Extracting context proximity relations among first candidate entities in the target document, and setting weights for the context proximity relations;
And merging the direct relation and the context proximity relation among the first candidate entities to obtain the scene knowledge graph.
In a possible embodiment, before the scene knowledge graph corresponding to the scene to which the target video belongs is queried based on the target entity to obtain the first search data and the time sequence knowledge graph corresponding to the target video is queried based on the target entity to obtain the second search data, the apparatus further comprises a second generation module, where the second generation module is configured to:
Editing the picture information of the target video into time text information, wherein the time text information comprises occurrence time of each event and information of a second candidate entity corresponding to the occurrence time of each event;
and converting the time text information into a time sequence diagram according to the occurrence time of each event, to obtain the time sequence knowledge graph.
In one possible embodiment, when editing the picture information of the target video into the time text information, the second generating module is further configured to:
performing target detection on the target video to obtain attribute information of each second candidate entity;
performing target tracking on each second candidate entity in the target video to obtain tracking information of each second candidate entity;
based on the attribute information and the tracking information, time text information of the target video is generated.
In a possible embodiment, when obtaining an answer to a question by the large language model LLM based on the first search data and the second search data, the processing module is further configured to:
screening out, through the LLM, target search data in the first search data and the second search data that meets a correlation condition with the question;
and taking the target search data as part of a prompt for the LLM, and inputting the prompt into the LLM to obtain the answer to the question.
In a third aspect, the present application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any one of the first aspects when the program is executed.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of any of the first aspects above.
In a fifth aspect, the present application provides a computer program product comprising computer program code which, when run on a computer, causes the computer to perform the method of any of the first aspects.
In the embodiment of the application, the question input by the input object is acquired and the target entity in the question is extracted; the scene knowledge graph corresponding to the scene to which the target video belongs is then queried based on the target entity to obtain the first search data, and the time sequence knowledge graph corresponding to the target video is queried based on the target entity to obtain the second search data; finally, the answer to the question is obtained through the large language model LLM based on the first search data and the second search data. In this way, the scene knowledge graph and the time sequence knowledge graph are combined on top of the LLM, so that the LLM considers both scene knowledge and time sequence knowledge when answering video questions, hallucination and incoherent output during answering are reduced, the LLM is given the capability of understanding ultra-long time sequence videos, and the accuracy of video answers is improved.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application. Embodiments of the application and features of the embodiments may be combined with one another arbitrarily without conflict. Also, while a logical order of illustration is depicted in the flowchart, in some cases the steps shown or described may be performed in a different order than presented.
The terms "first" and "second" in the description and claims of the application and in the above-mentioned figures are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the term "include" and any variations thereof are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus. The term "plurality" in the present application may mean at least two, for example, two, three, or more, and embodiments of the present application are not limited thereto.
Exemplary embodiments of the present application will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present application are included to facilitate understanding, and which are to be considered merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness. It should be noted that the embodiments of the present application may mention some existing solutions in the industry, such as software, components, and models; these should be regarded as exemplary and are only used to illustrate the feasibility of implementing the technical solution of the present application, and this does not mean that the applicant necessarily uses or must use such solutions.
In the technical scheme of the application, the acquisition, transmission, storage, use and the like of the data all meet the requirements of national relevant laws and regulations.
The following briefly describes the design concept of the embodiment of the present application:
Traditional video understanding methods (such as optical flow methods and time-series-based hidden Markov models) can accomplish tasks such as target tracking and motion state analysis.
With the development of deep learning, most tasks now use a deep learning model for understanding and analyzing videos, such as convolutional neural networks (Convolutional Neural Networks, CNN), long Short-Term Memory (LSTM), three-dimensional convolutional neural networks (3D CNN), and the like.
Recently, large language models (Large Language Model, LLM) have developed rapidly. The advent of large language models pre-trained on large-scale data sets has introduced a new in-context learning capability, which allows them to handle various tasks through prompts without requiring fine-tuning. The Chat Generative Pre-trained Transformer (ChatGPT) is the first breakthrough application based on this, with capabilities including generating code and calling other model tools or APIs. Many studies are exploring how to use ChatGPT-like large language models to invoke the application programming interfaces (Application Programming Interface, API) of visual models to solve problems in the computer vision field, including Visual ChatGPT. The advent of instruction tuning has further enhanced the ability of these models to respond effectively to user requests and to perform specific tasks. Large language models incorporating video understanding capabilities provide more sophisticated multimodal understanding, enabling them to handle and interpret complex interactions between visual and textual data. As in the field of natural language processing (Natural Language Processing, NLP), these models act as more general task solvers: benefiting from the massive knowledge base and context understanding obtained from large amounts of multimodal data, they can handle a wider range of tasks. This enables them not only to understand visual content but also to reason about it in a way that more closely approximates human understanding. Much work is also exploring the use of large language models in video understanding tasks (Vid-LLMs).
Video question answering refers to generating answers in natural language to a given question by understanding the relevant video content. Video question answering is widely used in many scenarios, such as multimedia information retrieval, intelligent assistants, and business services.
In the prior art, a Contrastive Language-Image Pre-training (CLIP) model is generally used for video question answering, or an LLM is used for video question answering.
However, the CLIP model only supports question answering about the content of a single still picture in the video, and the information it references is too one-sided, that is, the available information is limited, resulting in poor accuracy of the obtained answers. Meanwhile, when an LLM is used alone, because of the particularity of the usage scenario, the LLM may give irrelevant answers, also resulting in poor accuracy of the obtained answers.
In view of this, in the embodiment of the present application, a video question-answering method, apparatus, electronic device, and storage medium are provided: a question input by an input object is acquired and a target entity in the question is extracted; then a scene knowledge graph corresponding to a scene to which a target video belongs is queried based on the target entity to obtain first search data, and a time sequence knowledge graph corresponding to the target video is queried based on the target entity to obtain second search data, wherein the time sequence knowledge graph includes interaction relations between candidate entities in the target video in each time period; finally, an answer to the question is obtained through a large language model LLM based on the first search data and the second search data. In this way, scene knowledge and time sequence knowledge are combined on top of the LLM, so that the LLM considers both when answering video questions, scene-specific entities can be identified, hallucination and incoherent output during answering are reduced, the LLM is given the capability of understanding ultra-long time sequence videos, and the accuracy of video answers is improved.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it being understood that the preferred embodiments described herein are for illustration and explanation only, and not for limitation of the present application, and that the embodiments of the present application and the features of the embodiments may be combined with each other without conflict.
Fig. 1 is a schematic diagram of a possible application scenario in an embodiment of the present application. The application scenario includes a server 110 and terminal devices 120 (including a terminal device 1201, a terminal device 1202, and so on).
The server 110 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data, and artificial intelligence platforms. The terminal device 120 and the server 110 may be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein.
The terminal device 120 includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a desktop computer, an electronic book reader, an intelligent voice interaction device, an intelligent home appliance, a vehicle-mounted terminal, and the like, and various software such as an application program, an applet, and the like can be installed on the terminal device.
It should be noted that, the number of the terminal devices 120 and the servers 110 shown in fig. 1 is merely illustrative, and the number is not limited in practice, and the embodiment of the present application is not limited in detail.
It should be noted that, the video question-answering method in the embodiment of the present application may be deployed in a computing device, where the computing device may be a server or a terminal device, where the server may be the server 110 shown in fig. 1, or may be another server than the server 110 shown in fig. 1, and the terminal device may be the terminal device 120 shown in fig. 1, or may be another terminal device than the terminal device 120 shown in fig. 1. That is, the method may be executed by the server or the terminal device alone or may be executed by both the server and the terminal device together.
The video question answering method provided by the exemplary embodiments of the present application will be described below with reference to the accompanying drawings in conjunction with the application scenarios described above, and it should be noted that the application scenarios described above are only shown for the convenience of understanding the spirit and principles of the present application, and embodiments of the present application are not limited in this respect.
Referring to fig. 2, a flowchart of implementing a video question-answering method according to an embodiment of the present application is shown, where the specific implementation flow of the method is as follows:
Step 20: acquiring the question input by the input object, and extracting the target entity in the question.
In the embodiment of the application, an input interface is displayed on a terminal device; when the input object enters a question in the question-answer area of the input interface, a video question-answer request is triggered; in response to the video question-answer request, the server acquires the question input by the input object, performs information extraction on the question, and obtains the target entity corresponding to the question.
For example, when the input object wants to ask a question about a video, the input object enters question text in the input box of the question-answering interface of the terminal device. For instance, if the input object wants to ask "which people entered the courtyard this morning", the input object clicks the input box, enters "which people entered the courtyard this morning", and then clicks a confirmation button to trigger a video question-answering request; information extraction is performed on "which people entered the courtyard this morning", and the target entity corresponding to the question is obtained as "people".
In addition, the input object can also click a voice recognition button to input the question by voice.
In this way, an input interface is presented for the input object to enter the question, which provides convenience for the input object and improves the user experience. Voice input is also supported, which saves the time spent entering text and improves the efficiency of intelligent question answering.
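For illustration only, a minimal sketch of this extraction step is given below in Python; the spaCy model and the choice of treating the nouns of the question as target entities are assumptions made for the example rather than the prescribed implementation.

# Illustrative sketch: take the nouns of the question as candidate target entities.
# A deployed system may instead use a dedicated NER or information-extraction model.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed general-purpose English model

def extract_target_entities(question: str) -> list[str]:
    doc = nlp(question)
    return [token.lemma_ for token in doc if token.pos_ in ("NOUN", "PROPN")]

print(extract_target_entities("Which people entered the courtyard this morning?"))
# e.g. ['people', 'courtyard', 'morning']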
In the embodiment of the application, the scene knowledge graph and the time sequence knowledge graph are constructed in advance.
Optionally, in the embodiment of the present application, a possible embodiment is provided for constructing a scene knowledge graph, and specifically the following operations are performed:
Step A1: extracting direct relations among first candidate entities in the target document, and setting weights for the direct relations.
The target document comprises relevant texts of scenes to which the target video belongs.
For example, the scene to which the target video belongs is a corridor, and the target document is a corridor-related document.
In the embodiment of the application, the target document is divided into text blocks, and the following operations are performed for each text block: extracting each first candidate entity contained in the text block and the direct relations among the first candidate entities, and setting weights for the direct relations.
Wherein the weights characterize the importance of the corresponding relationships.
For example, referring to fig. 3, which is a schematic diagram of scene knowledge graph construction in the embodiment of the present application, the target document is split into text blocks, the direct relations among the first candidate entities contained in each text block are extracted, and weights are set for the direct relations; for instance, the direct relation between the camera and the corridor is that cameras are deployed in the corridor, with a weight of 2, and the direct relation between the corridor and the lamp is that the corridor is provided with lamps, with a weight of 1.
Step A2: extracting context proximity relations among the first candidate entities in the target document, and setting weights for the context proximity relations.
In the embodiment of the application, a context proximity relation is added between the first candidate entities, and the weight of the context proximity relation is set.
The context proximity relations between the first candidate entities are extracted, as shown in fig. 3, where the context proximity relation between the camera and the corridor is that a plurality of cameras are installed in the corridor, with a weight of 4, and the context proximity relation between the corridor and the lamp is that ceiling lamps and spotlights are installed in the corridor, with a weight of 2.
Step A3: merging the direct relations and the context proximity relations among the first candidate entities to obtain the scene knowledge graph.
In the embodiment of the application, the direct relation and the context proximity relation between the first candidate entities are combined, and the weight of the direct relation and the weight of the context proximity relation are combined to obtain the scene knowledge graph.
The weights of the direct relationships and the weights of the context proximity relationships may be added to obtain the combined weights, which is not limited in the embodiment of the present application.
For example, after merging, the relation between the camera and the corridor is that cameras are deployed in the corridor and a plurality of cameras are installed in the corridor, with a weight of 7, and the relation between the corridor and the lamp is that ceiling lamps and spotlights are installed in the corridor, with a weight of 3.
In addition, it should be noted that in the embodiment of the present application, the manner or details of constructing the scene knowledge graph may be adjusted according to different scenes, which is not limited in the embodiment of the present application.
In this way, the documents related to the scene are organized into a knowledge graph and stored in a database, and knowledge of the relevant scene is matched when the LLM is queried with a question, so that the LLM considers the scene knowledge when answering the question, the problem of the LLM giving irrelevant answers is alleviated, and the accuracy of the answers is improved.
Optionally, in the embodiment of the present application, a possible embodiment is provided for constructing a time sequence knowledge graph, and specifically the following operations are performed:
Step B1: editing the picture information of the target video into time text information.
The time text information comprises occurrence time of each event and information of a second candidate entity corresponding to the occurrence time of each event, wherein the information of the second candidate entity comprises attribute information and tracking information.
In the embodiment of the application, the picture information of each video frame of the target video is edited into time text information.
Referring to fig. 4, which is a schematic diagram of time text information in an embodiment of the present application, the information of the second candidate entity corresponding to the event occurrence time 12:10:05 on September 1, 2026 includes: the entity type is person, the clothing type is short sleeve, the clothing color is white, the trousers type is shorts, the trousers color is black, the state is static, the position is [(10, 15), (40, 60)], and the entity id is 20. The information of the second candidate entity corresponding to the event occurrence time 12:13:25 on September 1, 2026 includes: the entity type is car, the car color is red, the car type is type 1, the state is moving, the position is [(20, 14), (30, 90)], and the entity id is 123.
Optionally, in the embodiment of the present application, a possible embodiment is provided for obtaining the time text information, and specifically the following operations are performed:
Step B10: performing target detection on the target video to obtain attribute information of each second candidate entity.
In the embodiment of the application, a target detection algorithm is adopted to carry out target detection on the target video, so as to obtain the attribute information of each second candidate entity.
The target detection algorithm may be a YOLO algorithm, and the attribute information of each second candidate entity includes, but is not limited to, basic attribute information such as entity type, color, entity id, etc., which is not limited in the embodiment of the present application.
Step B11: performing target tracking on each second candidate entity in the target video to obtain tracking information of each second candidate entity.
In the embodiment of the application, a target tracking algorithm is adopted to track the target video, so as to obtain the tracking information of each second candidate entity.
Wherein the tracking information of each second candidate entity includes, but is not limited to, status, location, etc.
Step B12: generating time text information of the target video based on the attribute information and the tracking information.
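For illustration, a sketch of steps B10 to B12 is given below in Python; the detector and tracker are passed in as hypothetical callables (a YOLO-style detector and any multi-object tracker could fill these roles), and the record fields mirror the example of fig. 4.

# Sketch: combine per-frame detection and tracking results into timestamped
# text records (time text information). The detector and tracker callables are
# hypothetical placeholders, not part of the patented method.
from dataclasses import dataclass, field

@dataclass
class TimeTextRecord:
    timestamp: str                                   # occurrence time of the event
    entity_id: int                                   # tracked entity id
    entity_type: str                                 # e.g. "person", "car"
    attributes: dict = field(default_factory=dict)   # clothing color, car color, ...
    state: str = "static"                            # "static" or "moving"
    position: tuple = ()                             # bounding box corners

def build_time_text(frames, detect_objects, update_tracks):
    """frames: iterable of (timestamp, frame); detect_objects/update_tracks: injected callables."""
    records = []
    for timestamp, frame in frames:
        detections = detect_objects(frame)   # step B10: attribute information
        tracks = update_tracks(detections)   # step B11: tracking information
        for track in tracks:                 # step B12: merge into time text records
            records.append(TimeTextRecord(
                timestamp=timestamp,
                entity_id=track["id"],
                entity_type=track["type"],
                attributes=track.get("attributes", {}),
                state=track["state"],
                position=track["box"],
            ))
    return records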
Step B2: converting the time text information into a time sequence diagram according to the occurrence time of each event, to obtain the time sequence knowledge graph.
The time sequence diagram includes the interaction relations among the second candidate entities in each time period, where the interaction relations include, for example, remaining stationary relative to another entity, moving away, approaching, entering, or leaving; the interaction relations are not limited in the embodiment of the present application.
In the embodiment of the application, time text information is recorded into a graph form according to the occurrence time of each event.
For example, referring to fig. 5, which is a schematic diagram of a time sequence diagram in an embodiment of the present application, entity No. 20 (a person) remains relatively stationary with respect to entity No. 123 (a car) during the period from 12:10:05 to 12:12:08 on September 1, 2026; entity No. 123 (the car) moves away from entity No. 20 (the person) during the period from 12:20:05 to 12:21:08 on September 1, 2026; and entity No. 20 (the person) then enters entity No. 6 (a building).
In this way, the picture information of the target video is stored as a knowledge graph organized by event time, and the time sequence information of the video is taken into account; compared with other video understanding approaches, longer videos can be understood, more information in the video can be obtained, and relevant video clips can be found more quickly during video query.
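For illustration, a sketch of step B2 is given below in Python; the interaction tuples mirror fig. 5, the timestamps are illustrative, and how an interaction relation is inferred from the tracked positions is deliberately left out of this sketch.

# Sketch: convert time text information into a timing graph whose edges carry
# an interaction relation and the time period over which it holds (step B2).
import networkx as nx

def build_timing_graph(interactions):
    """interactions: iterable of (entity_a, entity_b, relation, start, end)."""
    graph = nx.MultiDiGraph()  # the same pair of entities may interact in several periods
    for a, b, relation, start, end in interactions:
        graph.add_edge(a, b, relation=relation, start=start, end=end)
    return graph

timing_graph = build_timing_graph([
    ("person#20", "car#123", "relatively stationary", "2026-09-01 12:10:05", "2026-09-01 12:12:08"),
    ("car#123", "person#20", "moving away", "2026-09-01 12:20:05", "2026-09-01 12:21:08"),
    ("person#20", "building#6", "enters", "2026-09-01 12:21:08", "2026-09-01 12:21:08"),
])

# Querying the graph for a target entity returns its interactions per time period.
for _, other, data in timing_graph.edges("person#20", data=True):
    print(other, data["relation"], data["start"], "->", data["end"])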
Step 21: querying the scene knowledge graph corresponding to the scene to which the target video belongs based on the target entity to obtain the first search data, and querying the time sequence knowledge graph corresponding to the target video based on the target entity to obtain the second search data.
The time sequence knowledge graph comprises interaction relations among candidate entities in the target video in each time period, and the scene knowledge graph comprises relations among related candidate entities of the scene of the target video.
Step 22: obtaining an answer to the question through the large language model LLM based on the first search data and the second search data.
In the embodiment of the application, after the first search data and the second search data corresponding to the target entity are obtained, the answer to the question is obtained through the large language model LLM based on the first search data and the second search data.
Optionally, in the embodiment of the present application, a possible embodiment is provided for obtaining an answer to a question through LLM, specifically performing the following operations:
Step 220: screening out, through the LLM, target search data in the first search data and the second search data that meets a correlation condition with the question.
In the embodiment of the application, data whose correlation with the question is greater than a correlation threshold is screened out from the first search data and the second search data through the LLM and used as the target search data.
Step 221: taking the target search data as part of the prompt for the LLM, and inputting the prompt into the LLM to obtain the answer to the question.
In the embodiment of the application, the target search data is taken as part of the prompt for the LLM, and the prompt and the question are input into the LLM to obtain the answer to the question.
For example, assume that the question is "which people entered the courtyard this morning?"; the answer obtained is "three people entered the courtyard this morning, namely A, B and C".
In this way, providing the LLM with target search data that is useful and comprehensive for answering the question can improve the accuracy of video answers.
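For illustration, a sketch of steps 220 and 221 is given below in Python; call_llm() is a hypothetical wrapper around whatever LLM service is deployed, and both the relevance threshold and the prompt wording are assumptions made for the example.

# Sketch: filter the retrieved graph data for relevance to the question and fold
# the surviving records into the LLM prompt (steps 220 and 221).
RELEVANCE_THRESHOLD = 0.5  # illustrative correlation threshold

def answer_question(question, first_search_data, second_search_data, call_llm):
    candidates = list(first_search_data) + list(second_search_data)

    # Step 220: ask the LLM to score how relevant each record is to the question.
    relevant = []
    for record in candidates:
        score_text = call_llm(
            "On a scale from 0 to 1, how relevant is this record to the question?\n"
            f"Question: {question}\nRecord: {record}\nAnswer with a single number."
        )
        try:
            if float(score_text.strip()) > RELEVANCE_THRESHOLD:
                relevant.append(record)
        except ValueError:
            continue  # skip records the LLM did not score numerically

    # Step 221: put the filtered records into the prompt and ask for the answer.
    context = "\n".join(str(r) for r in relevant)
    prompt = (
        "Answer the question using the scene knowledge and the video timeline below.\n\n"
        f"Knowledge:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return call_llm(prompt)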
Based on the same inventive concept, the embodiment of the present application further provides a video question-answering device, referring to fig. 6, which is a schematic structural diagram of the video question-answering device in the embodiment of the present application, and specifically includes:
the acquiring module 601 is configured to acquire a question input by an input object, and extract a target entity in the question;
The query module 602 is configured to query, based on a target entity, a scene knowledge graph corresponding to a scene to which the target video belongs, obtain first search data, and query, based on the target entity, a time sequence knowledge graph corresponding to the target video, obtain second search data, where the time sequence knowledge graph includes an interaction relationship between candidate entities in the target video in each time period;
The processing module 603 is configured to obtain an answer to the question through the large language model LLM based on the first search data and the second search data.
In one possible embodiment, before the scene knowledge graph corresponding to the scene to which the target video belongs is queried based on the target entity to obtain the first search data and the time sequence knowledge graph corresponding to the target video is queried based on the target entity to obtain the second search data, the apparatus further includes a first generating module 604, where the first generating module 604 is configured to:
Extracting a direct relation among first candidate entities in a target document, and setting weights for the direct relation, wherein the target document contains related texts of scenes to which a target video belongs;
Extracting context proximity relations among first candidate entities in the target document, and setting weights for the context proximity relations;
And merging the direct relation and the context proximity relation among the first candidate entities to obtain the scene knowledge graph.
In one possible embodiment, before the scene knowledge graph corresponding to the scene to which the target video belongs is queried based on the target entity to obtain the first search data and the time sequence knowledge graph corresponding to the target video is queried based on the target entity to obtain the second search data, the apparatus further includes a second generating module 605, where the second generating module 605 is configured to:
Editing the picture information of the target video into time text information, wherein the time text information comprises occurrence time of each event and information of a second candidate entity corresponding to the occurrence time of each event;
and converting the time text information into a time sequence diagram according to the occurrence time of each event, to obtain the time sequence knowledge graph.
In one possible embodiment, when editing the picture information of the target video into the time text information, the second generating module 605 is further configured to:
performing target detection on the target video to obtain attribute information of each second candidate entity;
performing target tracking on each second candidate entity in the target video to obtain tracking information of each second candidate entity;
based on the attribute information and the tracking information, time text information of the target video is generated.
In a possible embodiment, when obtaining an answer to a question by the large language model LLM based on the first search data and the second search data, the processing module 603 is further configured to:
screening out, through the LLM, target search data in the first search data and the second search data that meets a correlation condition with the question;
and taking the target search data as part of a prompt for the LLM, and inputting the prompt into the LLM to obtain the answer to the question.
Based on the above embodiments, referring to fig. 7, a schematic structural diagram of an electronic device according to an embodiment of the present application is shown.
Embodiments of the present application provide an electronic device that may include a processor 710 (Central Processing Unit, CPU), a memory 720, an input device 730, an output device 740, and the like, where the input device 730 may include a keyboard, a mouse, a touch screen, and the like, and the output device 740 may include a display device, such as a liquid crystal display (Liquid Crystal Display, LCD) or a cathode ray tube (Cathode Ray Tube, CRT).
Memory 720 may include Read Only Memory (ROM) and Random Access Memory (RAM) and provides processor 710 with program instructions and data stored in memory 720. In an embodiment of the present application, the memory 720 may be used to store a program of any of the video question-answering methods in the embodiment of the present application.
The processor 710 is configured to call the program instructions stored in the memory 720 and execute any of the video question-answering methods in the embodiments of the present application in accordance with the obtained program instructions.
Based on the above embodiments, in the embodiments of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the video question-answering method in any of the above method embodiments.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.