CN119893025A - Video conference method and device, electronic equipment and storage medium - Google Patents

Video conference method and device, electronic equipment and storage medium

Info

Publication number
CN119893025A
CN119893025A
Authority
CN
China
Prior art keywords
scene
video image
target
camera
coordinate system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411157138.2A
Other languages
Chinese (zh)
Inventor
周强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhiwei Science & Technology Co ltd
Original Assignee
Shenzhen Zhiwei Science & Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhiwei Science & Technology Co ltd
Priority to CN202411157138.2A
Publication of CN119893025A
Legal status: Pending (current)

Abstract

The invention discloses a video conference method and device, an electronic device, and a storage medium, belonging to the technical field of video conferences. The video conference method comprises: acquiring, through a camera at the same fixed position, a scene image and images of the same target at at least two positions in the same scene separated by a preset distance; establishing a 3D coordinate system according to the scene image and determining scene and target parameters; correcting the scene and the 3D coordinate system corresponding to the scene according to the height value of the camera, the parameter data of the target, and the horizon plane; collecting a video image and a voice signal synchronized with the video image; determining, based on the 3D coordinate system, the position of the current speaker in the video image according to the video image and the voice signal; and labeling the identity of the current speaker according to the position of the current speaker in the video image and the corresponding voice information. This solves the problem of potentially inaccurate position recognition in video conferences and improves the overall performance and user experience of the system.

Description

Video conference method and device, electronic equipment and storage medium
Technical Field
The invention belongs to the technical field of video conferences, and particularly relates to a video conference method, a video conference device, electronic equipment and a storage medium.
Background
A video conference is a conference in which people at two or more locations talk face to face through communication devices and networks. Video conferences can be divided into point-to-point conferences and multipoint conferences, depending on the number of sites involved.
By using a video conference system, a participant can hear the sound from other meeting sites, see the images, actions and expressions of participants at those sites, and share electronic presentation content, giving the feeling of being present in person.
At present, video conferences are affected by the scene environment and by the uncertainty of voice signals, so the voiceprint features of a speaker also carry various uncertainties, which in turn affects the recognition accuracy of participants, especially in multi-participant and multi-terminal video conferences. For example, the scene environment and camera hardware may make position recognition inaccurate; there is also a camera-dependence problem, i.e. recognition accuracy may differ greatly between cameras because their imaging components and other hardware differ. Therefore, existing video conference methods and devices still leave considerable room for improvement.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects of the prior art and provide a video conference method and device.
The technical solution adopted to solve this technical problem is a video conference method comprising the following steps:
acquiring, through a camera at the same fixed position, a scene image and images of the same target at at least two different positions in the same scene separated by a preset distance;
establishing a 3D coordinate system according to the scene image, determining a horizon plane of the scene, and measuring the camera height value and the length and/or width data of a target;
correcting the scene and the 3D coordinate system corresponding to the scene according to the height value of the camera, at least two length or width data of the target, and the horizon plane;
when a participant appears in the scene, collecting a video image and a voice signal synchronized with the video image;
determining the position of the current speaker in the video image based on the 3D coordinate system and according to the video image and the voice signal;
and labeling the identity of the current speaker according to the position of the current speaker in the video image and the corresponding voice information.
Further, acquiring, through the camera at the same fixed position, the scene image and the images of the same target at at least two different positions in the same scene separated by a preset distance includes:
fixing the camera at a certain height, and acquiring a scene image through the camera;
and setting the target at a first position and a second position in sequence, and respectively acquiring images containing the complete target length and/or width.
Further, after setting the target at the first position and the second position in sequence and respectively acquiring images containing the complete target length and/or width, the method further includes:
setting the target at a third position and a fourth position in sequence and acquiring images containing the complete length and/or width of the target, wherein the first, second, third and fourth positions lie in different directions and at least two of the positions lie at the edge of the scene.
Further, establishing a 3D coordinate system according to the scene image, determining a horizon plane of the scene, and measuring the camera height value and the length and/or width data of a target includes:
establishing an internal 3D coordinate system according to a preset AR system and the scene image, and mapping the ground of the scene into a horizon plane in the image;
and measuring the height value of the camera relative to the ground of the scene and the length and/or width data of the target.
Further, correcting the scene and the 3D coordinate system corresponding to the scene according to the height value of the camera, at least two length or width data of the target, and the horizon plane includes:
determining the camera field angle according to the height value of the camera;
determining the proportional relation between the target object and its imaging according to the length or width data of the target and the imaging of the target object in the scene;
determining the mapping relation of the scene relative to each orthographic projection plane of the 3D coordinate system according to the field angle and the horizon plane;
and correcting the mapping correspondence between the scene position and the 3D coordinate position according to the proportional relation.
Further, determining the position of the current speaker in the video image based on the 3D coordinate system and according to the video image and the voice signal includes:
determining voiceprint features of the speaker according to the voice signal;
determining identity information of the speaker according to the voiceprint features;
and determining the position of the speaker in the video image using a directional microphone and the camera in the conference scene.
Further, the method further comprises:
labeling the identity of the current speaker according to the position of the current speaker in the video image and the corresponding voiceprint information, converting the voice information into text information, and adding to each piece of text information an identity tag identifying the speaker.
A video conferencing device comprising:
an image acquisition module, configured to acquire, through a camera at the same fixed position, a scene image and images of the same target at at least two different positions in the same scene separated by a preset distance;
a coordinate system generation module, configured to establish a 3D coordinate system according to the scene image, determine a horizon plane of the scene, and measure the camera height value and the length and/or width data of a target;
a data correction module, configured to correct the scene and the 3D coordinate system corresponding to the scene according to the height value of the camera, at least two length or width data of the target, and the horizon plane;
a synchronous acquisition module, configured to collect a video image and a voice signal synchronized with the video image when a participant appears in the scene;
a position determination module, configured to determine the position of the current speaker in the video image based on the 3D coordinate system and according to the video image and the voice signal;
and an identity labeling module, configured to label the identity of the current speaker according to the position of the current speaker in the video image and the corresponding voice information.
An electronic device, comprising:
one or more processors; and one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the electronic device to perform the video conference method.
A computer-readable storage medium storing a computer program that, when executed by a processor, implements the video conference method.
The beneficial effects of the invention are as follows:
A scene image, and images of the same target at at least two different positions separated by a preset distance in the same scene, are obtained through a camera at the same fixed position. A 3D coordinate system is established according to the scene image, a horizon plane of the scene is determined, and the camera height value and the length and/or width data of the target are measured. The scene and its corresponding 3D coordinate system are corrected according to the camera height value, the at least two length or width data of the target, and the horizon plane. When a participant appears in the scene, a video image and a voice signal synchronized with the video image are collected. The position of the current speaker in the video image is determined based on the 3D coordinate system and according to the video image and the voice signal, and the identity of the current speaker is labeled according to that position and the corresponding voice information. Through 3D space modeling and multi-modal information fusion, this effectively solves the problems of inaccurate position recognition in video conferences and of camera dependence, i.e. the large differences in recognition accuracy that may arise between cameras with different imaging components and other hardware, and improves the overall performance and user experience of the system.
Drawings
Fig. 1 is a flowchart illustrating the steps of a video conference method according to an embodiment of the present invention;
Fig. 2 is a block diagram of a video conference apparatus according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to Fig. 1, a flowchart of the steps of the video conference method of this embodiment is shown, comprising:
Step S100, acquiring, through a camera at the same fixed position, a scene image and images of the same target at at least two different positions in the same scene separated by a preset distance;
Step S200, establishing a 3D coordinate system according to the scene image, determining a horizon plane of the scene, and measuring the camera height value and the length and/or width data of a target;
Step S300, correcting the scene and the 3D coordinate system corresponding to the scene according to the height value of the camera, at least two length or width data of the target, and the horizon plane;
Step S400, when a participant appears in the scene, collecting a video image and a voice signal synchronized with the video image;
Step S500, determining the position of the current speaker in the video image based on the 3D coordinate system and according to the video image and the voice signal;
Step S600, labeling the identity of the current speaker according to the position of the current speaker in the video image and the corresponding voice information.
In this embodiment, a scene image and images of a target at different positions are acquired through a camera at the same fixed position, a 3D coordinate system is established, a horizon plane of the scene is determined, and the camera height value and the length and/or width data of the target are measured. This helps establish an accurate three-dimensional space model for subsequent positioning and correction. The scene and its corresponding 3D coordinate system are corrected according to the camera height value, the length or width data of the target, and the horizon plane, ensuring that a consistent spatial relationship and proportion are maintained even under different camera viewing angles and reducing recognition errors caused by hardware differences. When a participant appears in the scene, a video image and a synchronized voice signal are collected, and the position of the current speaker in the video image is determined through the 3D coordinate system by combining the video image and the voice signal. This multi-modal information fusion improves positioning accuracy, especially when voiceprint features are unclear or environmental noise is strong. The identity of the current speaker is then labeled according to the speaker's position in the video image and the corresponding voice information, which helps participants identify one another quickly and improves the interaction efficiency of the conference.
Through 3D space modeling and correction, recognition errors caused by the scene environment and hardware differences can be reduced, improving recognition accuracy in multi-person or multi-terminal video conferences. Combining visual and voice information for positioning provides a supplement when a single modality is insufficient for accurate recognition, enhancing the robustness of the system; accurate voiceprint and position recognition helps participants recognize speakers more quickly, optimizing the conference flow and the participant experience.
In an embodiment of the present application, acquiring, through the camera at the same fixed position, the scene image and the images of the same target at at least two different positions in the same scene separated by a preset distance includes:
The camera is fixed at a certain height, for example 1.5 to 1.8 meters; specifically, it can be fixed at the upper end of a large screen or beside a whiteboard. A scene image is acquired through the camera at this fixed position to serve as the background image of the scene.
A measuring target, which may be a ruler, a target of fixed length, or a person, is placed at a first position and a second position in sequence, and images containing the complete length and/or width of the target are acquired at each. For example, a measuring ruler or rod of fixed length is placed at different positions within the camera's field of view, and images are acquired at at least two positions whose horizontal distances to the camera differ, for example by 3 meters. Because the same target is imaged larger when closer to the camera than when farther away, its imaging size differs within the same scene. The two images of the same target are kept as a measurement reference for subsequent comparison.
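For concreteness, a minimal sketch of the pinhole-camera relationship that underlies this two-position comparison; the focal-length value and function name are illustrative assumptions, not part of the disclosure:

```python
def imaged_length_px(target_length_m: float, distance_m: float,
                     focal_length_px: float) -> float:
    """Pinhole model: on-sensor size (in pixels) of a target of known
    length at a given horizontal distance from the camera."""
    return focal_length_px * target_length_m / distance_m

# The same 1 m target at two positions 3 m apart images at different sizes:
near_px = imaged_length_px(1.0, 2.0, 1000.0)  # 500 px at 2 m
far_px = imaged_length_px(1.0, 5.0, 1000.0)   # 200 px at 5 m
```

Comparing the measured sizes against this ideal model is what exposes any nonlinearity that the later correction step must absorb.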
Further, after setting the target at the first position and the second position in sequence and respectively acquiring images containing the complete target length and/or width, the method further includes:
setting the target at a third position and a fourth position in sequence and acquiring images containing the complete length and/or width of the target, wherein the first, second, third and fourth positions lie in different directions and at least two of the positions lie at the edge of the scene. By setting edge positions, the scene can be divided into local areas, and only the image and audio content within the divided areas is rendered; besides not responding to content outside the areas, this saves system resources and improves the system's data-processing efficiency. The system responds only to sound within the local area: when the source of the audio signal is not within the scene, i.e. outside the divided scene in the video image, the system treats the audio as noise and masks it out.
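A minimal sketch of this masking logic, assuming the sound source has already been localized to scene coordinates; the rectangular region representation and all names are hypothetical:

```python
from typing import Optional, Tuple

Region = Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max in scene coordinates

def in_scene(source_xy: Tuple[float, float], region: Region) -> bool:
    """True if the localized sound source lies inside the divided scene area."""
    x, y = source_xy
    x0, y0, x1, y1 = region
    return x0 <= x <= x1 and y0 <= y <= y1

def filter_audio(source_xy: Tuple[float, float], region: Region,
                 audio_frame: bytes) -> Optional[bytes]:
    """Pass audio originating inside the region; treat the rest as noise and mask it."""
    return audio_frame if in_scene(source_xy, region) else None
```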
It should be noted that, because the proportional relationship of the images may be nonlinear, more positions may be set: the target is placed at these positions, corresponding images are acquired, and the scene and the 3D coordinates are corrected according to the differences between the images and the actual measurements, so that the identified positions are more accurate.
It should be noted that, because video and audio are different signals with different recognition methods, correcting the scene allows video and audio to be matched to each other more reliably in subsequent processing and recognition. Although face recognition technology is quite mature, in a video conference participants may show only a side face or be blocked by others in the video, and the constraints of conference rooms, especially smaller ones, may place participants close together, increasing recognition difficulty. Through scene correction, the position of a participant can be accurately identified and matched to that participant's voiceprint and identity information. When a sound source is detected outside the scene, its voice information can be masked; only audio within the scene is treated as valid, so the sound is unaffected by external noise. This matters particularly in non-ideal or unstable environments, which could otherwise severely degrade the conference. Audio quality in the video conference is thereby improved.
In an embodiment of the present application, establishing a 3D coordinate system according to the scene image, determining a horizon plane of the scene, and measuring the camera height value and the length and/or width data of a target includes:
The height value of the camera relative to the ground of the scene and the length and/or width data of the target are measured. Since neither the video nor the system itself knows the actual ground position, when the 3D coordinates are established, the nearest visible ground in the video is taken as the horizon, and the scene ground is formed from this nearest imaged ground.
In an embodiment of the present application, correcting the scene and the 3D coordinate system corresponding to the scene according to the height value of the camera, at least two length or width data of the target, and the horizon plane includes:
The camera field of view (FOV) is determined based on the camera's height value and is divided into a horizontal field of view (HFOV) and a vertical field of view (VFOV), which determine the camera's coverage in the horizontal and vertical directions respectively. The focal length and the size of the imaging region are the main factors affecting the field angle: the longer the focal length, the smaller the field angle and the narrower the field of view; the shorter the focal length, the larger the field angle and the wider the field of view. The horizontal field angle is determined mainly by the focal length and the width of the imaging region; the vertical field angle is determined mainly by the focal length and the height of the imaging region. The calculation formulas are HFOV = 2·arctan(w / 2f) and VFOV = 2·arctan(h / 2f), where w is the imaging-region width, h is the imaging-region height, and f is the focal length of the lens.
The proportional relation between the target object and its imaging is determined according to the length or width data of the target and the imaging of the target object in the scene. The length or width of the target's image can be measured with a tool provided by the system, and measuring the target's actual length or width yields the size correspondence between the object and its image;
The mapping relation of the scene relative to each orthographic projection plane of the 3D coordinate system is determined according to the field angle and the horizon plane, and the mapping correspondence between scene positions and 3D coordinate positions is corrected according to the proportional relation. Specifically, the correspondence is determined from the field angle and the horizon plane, the ground of the scene is taken as the plane spanned by the X and Y axes of the 3D coordinate system, and the actual scene is mapped to the 3D coordinates according to the size correspondence. With the scene mapping relation and the position mapping relation, when a participant speaks, the sound can be located more accurately. This addresses a problem of AR video conferences in which the 3D coordinate system of the scene is not corrected: camera images can be distorted, offsetting the image from the coordinate positions, so the identified position drifts and speech is easily misattributed (the idiom 'putting Zhang's hat on Li's head'), i.e. speech actually produced by participant A is identified, because of the coordinate error, as coming from participant B's seat. Through the above 3D coordinates and scene correction, the recognized position is ensured to be more accurate.
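As one way to picture the corrected mapping, a hedged sketch of back-projecting an image pixel onto the ground plane (the X-Y plane of the 3D coordinate system) using the camera height, the field angles, and the measured scale factor; the level-camera assumption and all names are illustrative, not the patent's procedure:

```python
import math
from typing import Optional, Tuple

def pixel_to_ground(u: float, v: float, img_w: int, img_h: int,
                    hfov_deg: float, vfov_deg: float,
                    cam_height_m: float, scale: float = 1.0
                    ) -> Optional[Tuple[float, float]]:
    """Back-project pixel (u, v) onto the scene ground plane, assuming the
    camera looks horizontally at height cam_height_m. `scale` is the
    empirical correction factor obtained from the reference-target images."""
    ax = math.radians((u / img_w - 0.5) * hfov_deg)  # ray's horizontal angle
    ay = math.radians((0.5 - v / img_h) * vfov_deg)  # ray's vertical angle (v grows downward)
    if ay >= 0:
        return None            # a ray at or above the horizon never meets the ground
    depth = cam_height_m / math.tan(-ay)             # distance along the ground (Y axis)
    return (depth * math.tan(ax) * scale, depth * scale)  # (X, Y) scene coordinates
```

In practice the scale factor can vary with position, which is why the method samples the target at several positions before trusting the mapping.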
In one embodiment of the present application, determining the position of the current speaker in the video image based on the 3D coordinate system and according to the video image and the voice signal includes:
determining the voiceprint features of the speaker according to the voice signal, determining the identity information of the speaker according to the voiceprint features, and determining the position of the speaker in the video image using a directional microphone and the camera in the conference scene.
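A hedged sketch of one way to fuse the two cues: the directional microphone yields an azimuth estimate, which is matched against person detections in the frame. The azimuth-to-column conversion and all names are assumptions, not the disclosed algorithm:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # x0, y0, x1, y1 in pixels

def locate_speaker(mic_azimuth_deg: float, person_boxes: List[Box],
                   img_w: int, hfov_deg: float) -> Box:
    """Pick the detected person whose horizontal image position best matches
    the directional microphone's azimuth estimate (0 deg = optical axis)."""
    # Linear small-angle approximation of the azimuth-to-column mapping.
    expected_u = img_w * (0.5 + mic_azimuth_deg / hfov_deg)
    return min(person_boxes,
               key=lambda b: abs((b[0] + b[2]) / 2 - expected_u))
```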
In this embodiment, after the multi-person video conference starts, a video image of the conference site and a voice signal synchronized with the video image may be collected in real time; a speaker is then determined based on the collected video image and voice signal, and the determined speaker's identity is marked in the video image.
The method may further include labeling the identity of the current speaker according to the position of the current speaker in the video image and the corresponding voiceprint information, converting the voice information into text information, and adding to each piece of text information an identity tag identifying the speaker.
For example, when the determined speaker's identity is marked in the video image, the relevant information of the corresponding speaker can be marked at the determined speaker position in the video image, where the relevant information includes the participant's name, sex, position, and contact information, or may simply be the participant's code number, such as A001 or A002.
After the determined speaker's identity is marked in the video image, the video image can be sent to the conference equipment of each participant for use by the participants, or to a third party, for example when a conference summary is generated. In the generated summary, each utterance is converted into corresponding text, the corresponding identity tag is added to the text according to the voiceprint information, and the text can be ordered chronologically, so that a reader can see what a given person said at a given moment.
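A minimal sketch of this summary generation: utterances are sorted chronologically and each line is prefixed with the identity tag resolved from the voiceprint. The data shape and function names are assumptions:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    timestamp_s: float   # time the segment was spoken
    speaker_id: str      # identity tag resolved from voiceprint + position, e.g. "A001"
    text: str            # speech-to-text result for the segment

def build_summary(utterances: List[Utterance]) -> str:
    """Chronologically ordered transcript with an identity tag on each line."""
    return "\n".join(
        f"[{u.timestamp_s:8.1f}s] {u.speaker_id}: {u.text}"
        for u in sorted(utterances, key=lambda u: u.timestamp_s)
    )

print(build_summary([
    Utterance(12.4, "A002", "Let's review the schedule."),
    Utterance(3.1, "A001", "Good morning, everyone."),
]))
```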
It should be noted that the approximate position of the speaker can be obtained directly through the directional microphone, and the speaker can be identified by combining voiceprint analysis with image analysis, further improving recognition accuracy.
In the present embodiment, the sound signal may be processed using a speech model, for example a deep neural network (DNN) model or a convolutional neural network (CNN) model. Voiceprint features are extracted from the acquired voice signal through the speech model to obtain a voiceprint vector, and the voiceprint vector is input into a classification model to obtain the speaker's information.
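As an illustration of that voiceprint pipeline, a toy CNN written with PyTorch that maps a log-mel spectrogram to a fixed-length voiceprint vector plus speaker logits; the architecture and layer sizes are illustrative assumptions, not the patent's model:

```python
import torch
import torch.nn as nn

class VoiceprintCNN(nn.Module):
    """Toy extractor: log-mel spectrogram -> voiceprint vector -> speaker logits."""
    def __init__(self, embed_dim: int = 128, n_speakers: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.embed = nn.Linear(32 * 4 * 4, embed_dim)      # voiceprint vector
        self.classify = nn.Linear(embed_dim, n_speakers)   # classification model

    def forward(self, spec: torch.Tensor):
        # spec: (batch, 1, n_mels, time_frames)
        x = self.features(spec).flatten(1)
        v = self.embed(x)
        return v, self.classify(v)

model = VoiceprintCNN()
voiceprint, logits = model(torch.randn(2, 1, 64, 100))  # two dummy spectrograms
```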
In an embodiment of the present application, the speaker can be identified and distinguished using an image recognition model. In this embodiment, only the lips in the collected video image are analyzed, with no need for face recognition, in order to determine whether someone is speaking; the speaker in the image is then located and labeled with identity information in combination with voiceprint recognition. For lip analysis, a CNN-LSTM model or a 3D-ConvNet model can be used. The CNN-LSTM model integrates a convolutional neural network (CNN) with a long short-term memory network (LSTM): the CNN part processes the data, and its one-dimensional result is fed into the LSTM. The 3D-ConvNet model is a convolutional neural network with 3D convolution kernels; its 3D convolution modules can extract temporal and spatial features from the video frames of the first video image and identify whether a participant is speaking.
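A matching toy 3D-ConvNet for the lip-motion check, again in PyTorch with illustrative layer sizes, classifying a short clip of the lip region as speaking or silent:

```python
import torch
import torch.nn as nn

class LipMotion3D(nn.Module):
    """Toy 3D-ConvNet: a short lip-region clip -> speaking / silent logits."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),  # joint spatio-temporal features
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(16, 2),                            # speaking vs. silent
        )

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels=3, frames, height, width)
        return self.net(clip)

logits = LipMotion3D()(torch.randn(1, 3, 16, 64, 64))  # one dummy 16-frame clip
```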
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
As shown in Fig. 2, in an embodiment of the present invention, a video conference apparatus is also disclosed, including:
an image acquisition module 100, configured to acquire, through a camera at the same fixed position, a scene image and images of the same target at at least two different positions in the same scene separated by a preset distance;
a coordinate system generation module 200, configured to establish a 3D coordinate system according to the scene image, determine a horizon plane of the scene, and measure the camera height value and the length and/or width data of a target;
a data correction module 300, configured to correct the scene and the 3D coordinate system corresponding to the scene according to the height value of the camera, at least two length or width data of the target, and the horizon plane;
a synchronous acquisition module 400, configured to collect a video image and a voice signal synchronized with the video image when a participant appears in the scene;
a position determination module 500, configured to determine the position of the current speaker in the video image based on the 3D coordinate system and according to the video image and the voice signal;
and an identity labeling module 600, configured to label the identity of the current speaker according to the position of the current speaker in the video image and the corresponding voice information.
Referring to Fig. 3, in an embodiment of the present invention, a computer device is also provided. The computer device 12 takes the form of a general-purpose computing device, and the components of the computer device 12 may include, but are not limited to, one or more processors or processing units 16, a system memory 28, and a bus 18 connecting the various system components, including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor bus, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 31 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (commonly referred to as a "hard disk drive"). Although not shown in Fig. 3, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk such as a CD-ROM, DVD-ROM, or other optical media, may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. The memory may include at least one program product having a set (e.g., at least one) of program modules 42, the program modules 42 being configured to carry out the functions of embodiments of the invention.
A program/utility 41 having a set (at least one) of program modules 42 may be stored, for example, in the memory. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules 42, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described herein.
The computer device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24, a camera, etc.), with one or more devices that enable a user to interact with the computer device 12, and/or with any devices (e.g., a network card, a modem, etc.) that enable the computer device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Moreover, the computer device 12 may also communicate with one or more networks, such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet, through the network adapter 20. As shown, the network adapter 20 communicates with the other modules of the computer device 12 over bus 18. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with the computer device 12, including, but not limited to, microcode, device drivers, redundant processing units 16, external disk drive arrays, RAID systems, tape drives, and data backup storage systems 34, among others.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example, implementing the video conference method provided by the embodiment of the present invention.
That is, when executing the program, the processing unit 16 implements the following: acquiring, through a camera at the same fixed position, a scene image and images of the same target at at least two different positions in the same scene separated by a preset distance; establishing a 3D coordinate system according to the scene image, determining a horizon plane of the scene, and measuring the camera height value and the length and/or width data of the target; correcting the scene and the 3D coordinate system corresponding to the scene according to the camera height value, the at least two length or width data of the target, and the horizon plane; collecting a video image and a voice signal synchronized with the video image when a participant appears in the scene; determining the position of the current speaker in the video image based on the 3D coordinate system and according to the video image and the voice signal; and labeling the identity of the current speaker according to the position of the current speaker in the video image and the corresponding voice information.
In an embodiment of the present application, the present application further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the video conference method as provided in all embodiments of the present application.
When executed, the program acquires, through a camera at the same fixed position, a scene image and images of the same target at at least two different positions in the same scene separated by a preset distance; establishes a 3D coordinate system according to the scene image, determines a horizon plane of the scene, and measures the camera height value and the length and/or width data of the target; corrects the scene and its corresponding 3D coordinate system according to the camera height value, the at least two length or width data of the target, and the horizon plane; collects a video image and a voice signal synchronized with the video image when a participant appears in the scene; determines the position of the current speaker in the video image according to the video image and the voice signal; and labels the identity of the current speaker according to the position of the current speaker in the video image and the corresponding voice information.
Any combination of one or more computer readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, C++, and Python, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the remote-computer case, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described by differences from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other.
It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the invention may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it is further noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or terminal device comprising that element.
The foregoing describes the principles and embodiments of the present invention in detail using specific examples, to facilitate understanding of the method and its core ideas. Meanwhile, for those skilled in the art, the specific embodiments and the scope of application may vary according to the ideas of the present invention; in summary, the content of this specification should not be construed as limiting the invention.

Claims (10)

CN202411157138.2A, priority date 2024-08-21, filing date 2024-08-21: Video conference method and device, electronic equipment and storage medium. Pending. Published as CN119893025A (en).

Priority Applications (1)

Application Number: CN202411157138.2A (CN119893025A, en) | Priority Date: 2024-08-21 | Filing Date: 2024-08-21 | Title: Video conference method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number: CN202411157138.2A (CN119893025A, en) | Priority Date: 2024-08-21 | Filing Date: 2024-08-21 | Title: Video conference method and device, electronic equipment and storage medium

Publications (1)

Publication Number: CN119893025A (en) | Publication Date: 2025-04-25

Family

ID=95425094

Family Applications (1)

Application Number: CN202411157138.2A | Status: Pending | Publication: CN119893025A (en) | Priority Date: 2024-08-21 | Filing Date: 2024-08-21 | Title: Video conference method and device, electronic equipment and storage medium

Country Status (1)

Country: CN (1) | Link: CN119893025A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113014857A (en)* | 2021-02-25 | 2021-06-22 | 游密科技(深圳)有限公司 | Control method and device for video conference display, electronic equipment and storage medium
CN113784189A (en)* | 2021-08-31 | 2021-12-10 | Oook(北京)教育科技有限责任公司 | Method, device, medium and electronic equipment for generating round table video conference
CN114125365A (en)* | 2021-11-25 | 2022-03-01 | 京东方科技集团股份有限公司 | Video conference method, device and readable storage medium
CN115484431A (en)* | 2022-08-10 | 2022-12-16 | 黑芝麻智能科技(成都)有限公司 | Video processing method, device, equipment and storage medium for video conference
CN116405633A (en)* | 2023-03-29 | 2023-07-07 | 武汉大学 | A sound source localization method and system for virtual video conferencing

Similar Documents

Publication | Title
US20210233275A1 (en) | Monocular vision tracking method, apparatus and non-transitory computer-readable storage medium
US10241990B2 (en) | Gesture based annotations
CN109902630B (en) | Attention judging method, device, system, equipment and storage medium
CN106228628B (en) | Check-in system, method and device based on face recognition
CN108895981B (en) | Three-dimensional measurement method, device, server and storage medium
CN109614934B (en) | Online teaching quality assessment parameter generation method and device
JP2019075156A (en) | Method, circuit, device, and system for registering and tracking multifactorial image characteristic and code executable by related computer
WO2018095166A1 (en) | Device control method, apparatus and system
US11989827B2 (en) | Method, apparatus and system for generating a three-dimensional model of a scene
CN112153320B (en) | Method and device for measuring size of article, electronic equipment and storage medium
CN110969045B (en) | Behavior detection method and device, electronic equipment and storage medium
CN109934150B (en) | Conference participation degree identification method, device, server and storage medium
CN110660102B (en) | Speaker recognition method, device and system based on artificial intelligence
WO2021031954A1 (en) | Object quantity determination method and apparatus, and storage medium and electronic device
CN112446347B (en) | Method and device for determining face direction, storage medium, and electronic device
CN112423191A (en) | Video call device and audio gain method
US20060152478A1 (en) | Projection of synthetic information
US20220405968A1 (en) | Method, apparatus and system for image processing
CN114549582A (en) | A kind of trajectory map generation method, device and computer readable storage medium
CN112308018A (en) | Image identification method, system, electronic equipment and storage medium
CN112017212A (en) | Training and tracking method and system of face key point tracking model
CN112348493A (en) | Intelligent conference recording system and method
CN115457594A (en) | Three-dimensional human body posture estimation method and system, storage medium and electronic equipment
CN119893025A (en) | Video conference method and device, electronic equipment and storage medium
CN114089836A (en) | Labeling method, terminal, server and storage medium

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
