RECALIBRATION SIGNALING
Field
The present application relates to apparatus and methods for recalibration signalling, but not exclusively to recalibration signalling within an audio capturing system or Immersive Voice and Audio Services (IVAS) environment.
Background
An immersive phone call captures and plays spatial audio. A capturing system configured to capture spatial audio requires a microphone array. Spatial audio capturing algorithms determine parameters based on measurements of the audio signals captured by the microphones within the microphone array. For example, from the captured audio signals the algorithms can measure the time difference of arrival of a sound source at the microphones. The measured time difference of arrival can be used to calculate the direction of arrival of the sound source if the microphone locations in the array are known.
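As an illustration of this relationship, the following is a minimal sketch (not taken from the application itself) of the far-field mapping between a measured time difference of arrival and the direction relative to a two-microphone pair; the microphone spacing, delay value and speed of sound are illustrative assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value at roughly 20 degrees C

def doa_from_tdoa(tdoa_s: float, mic_spacing_m: float) -> float:
    """Far-field direction of arrival (radians) relative to the pair axis.

    For a plane wave, the extra path to the farther microphone is
    spacing * cos(angle), so tdoa = spacing * cos(angle) / c.
    """
    cos_angle = np.clip(tdoa_s * SPEED_OF_SOUND / mic_spacing_m, -1.0, 1.0)
    return float(np.arccos(cos_angle))

# Example: a 0.2 ms delay measured across a 10 cm microphone pair
angle = doa_from_tdoa(0.0002, 0.10)
print(f"DOA: {np.degrees(angle):.1f} degrees from the array axis")
```

Note that the computation depends directly on the assumed microphone spacing, which is why a geometry change invalidates the analysis until recalibration completes.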
Should a microphone array change in some way, the capturing system is recalibrated so that the new properties of the array are obtained and the capturing algorithms are adapted based on these properties. Finding the new properties (e.g., microphone locations) is called geometry calibration. There are methods to do this, but calibration does not happen instantly. After geometry calibration the capturing algorithms have to be updated to use the new data describing the changed microphone array.
Furthermore, immersive audio codecs are being implemented supporting a multitude of operating points ranging from low bit rate operation to transparency. An example of such a codec is the Immersive Voice and Audio Services (IVAS) codec, which is being designed to be suitable for use over a communications network such as a 3GPP 4G/5G network, including use in immersive services such as immersive voice and audio for virtual reality (VR). This audio codec is expected to handle the encoding, decoding and rendering of speech, music and generic audio. It is furthermore expected to support channel-based audio and scene-based audio inputs including spatial information about the sound field and sound sources. The codec is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions.
Metadata-assisted spatial audio (MASA) is one input format defined for IVAS. It uses audio signal(s) together with corresponding spatial metadata. The spatial metadata comprises parameters which define the spatial aspects of the audio signals and which may contain, for example, directions and direct-to-total energy ratios in frequency bands. The MASA stream can be generated as part of multi-microphone audio capture analysis and processing, or it can also be obtained from other sources, such as specific spatial audio microphones (such as Ambisonics microphones), studio mixes (for example, a 5.1 audio channel mix) or other content by means of a suitable format conversion.
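For illustration only, the sketch below shows one plausible in-memory shape for a MASA-style stream, pairing transport audio with per-band spatial metadata; the class and field names are assumptions made for this sketch and are not the normative MASA definitions.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class SpatialMetadataTile:
    """Spatial parameters for one time-frequency tile (illustrative)."""
    azimuth_deg: float       # direction parameter for this band/frame
    elevation_deg: float
    direct_to_total: float   # energy ratio in [0, 1]

@dataclass
class MasaFrame:
    """One frame: transport audio plus its per-band spatial metadata."""
    transport: np.ndarray                        # (channels, samples), e.g. stereo
    tiles: list[SpatialMetadataTile] = field(default_factory=list)

frame = MasaFrame(
    transport=np.zeros((2, 960)),                # 20 ms at 48 kHz, assumed
    tiles=[SpatialMetadataTile(30.0, 0.0, 0.8)]  # one band shown
)
```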
Summary
According to a first aspect there is provided an apparatus for capturing spatial audio signals, the apparatus comprising means for: determining a spatial capture format for the apparatus, the spatial capture format being based on at least two microphones configured to capture audio signals; determining a microphone change mitigation strategy; controlling an outputting of the spatial capture format and microphone change mitigation strategy; generating, and controlling an outputting of a spatial audio signal from the captured audio signals based on the spatial capture format; determining a change in the at least two audio microphones and, based on the determined change, generating and controlling an outputting of a microphone change signal and a mitigation audio signal from the captured audio signals based on the microphone change mitigation strategy; determining a further spatial capture format for the apparatus, the further spatial capture format being based on the change in the at least two audio microphones; and generating, and controlling an outputting of a further spatial audio signal based on the further spatial capture format.
The spatial capture format may comprise at least one of: a number of microphones of the at least two microphones configured to capture audio signals; and a geometry of microphones of the at least two microphones configured to capture audio signals.
The further spatial capture format may comprise at least one of: a number of microphones of the at least two microphones configured to capture audio signals following the change in the at least two audio microphones; and a geometry of microphones of the at least two microphones configured to capture audio signals following the change in the at least two audio microphones.
The means for generating, and controlling an outputting of a spatial audio signal from the captured audio signals based on the spatial capture format may be further for: generating at least one transport audio signal based on the captured audio signals and the spatial capture format; and determining at least one metadata associated with the at least one transport audio signal, the at least one metadata based on an analysis of the captured audio signals.
The means for generating, and controlling an outputting of a further spatial audio signal based on the further spatial capture format may be further for: generating at least one further transport audio signal based on the captured audio signals and the further spatial capture format; and determining at least one further metadata associated with the at least one further transport audio signal, the at least one further metadata based on an analysis of the captured audio signals following the change in the at least two audio microphones.
The means for determining a microphone change mitigation strategy may be further for identifying at least one alternate transport audio signal format.
The at least one transport audio signal may be defined by a first transport audio signal format and the means for identifying at least one alternate transport audio signal format may be for associating the at least one alternate transport audio signal format with the first transport audio signal format.
The means for generating and controlling the outputting of the microphone change signal and the mitigation audio signal may be for selecting the alternate transport audio signal format to generate the mitigation audio signal, based on the first transport audio signal format and the microphone change mitigation strategy.
The means may be further for receiving at least one supported capture type and mitigation strategy based on the output spatial capture format and microphone change mitigation strategy, and the means for generating, and controlling the outputting of a spatial audio signal from the captured audio signals based on the spatial capture format may be further for generating, and controlling the outputting of a spatial audio signal from the captured audio signals based on the received at least one supported capture type and mitigation strategy based on the output spatial capture format and microphone change mitigation strategy.
The means may be further for receiving an acknowledgment of the output of the spatial capture format and microphone change mitigation strategy, and the means for generating, and controlling the outputting of a spatial audio signal from the captured audio signals based on the spatial capture format may be further for generating, and controlling the outputting of a spatial audio signal from the captured audio signals based on the received acknowledgment of the output of the spatial capture format and microphone change mitigation strategy.
According to a second aspect there is provided an apparatus for outputting spatial audio signals, the apparatus comprising means for: receiving a spatial capture format from a further apparatus, the spatial capture format being based on at least two microphones of the further apparatus configured to capture audio signals; receiving a microphone change mitigation strategy; receiving a spatial audio signal from the further apparatus; generating an output spatial audio signal from the received spatial audio signal based on the spatial capture format; receiving a microphone change signal configured to indicate a change in the at least two audio microphones and, based on the microphone change signal: receiving a mitigation audio signal from the further apparatus; generating the output spatial audio signal from the mitigation audio signal based on the microphone change mitigation strategy; and receiving a further spatial audio signal and a further spatial capture format from the further apparatus, the further spatial audio signal based on the further spatial capture format; and generating the output spatial audio signal from the received further spatial audio signal and based on the further spatial capture format.

The spatial capture format may comprise at least one of: a number of microphones of the at least two microphones configured to capture audio signals; and a geometry of microphones of the at least two microphones configured to capture audio signals.
The further spatial capture format may comprise at least one of: a number of microphones of the at least two microphones configured to capture audio signals following the change in the at least two audio microphones; and a geometry of microphones of the at least two microphones configured to capture audio signals following the change in the at least two audio microphones.
The means for receiving the spatial audio signal from the further apparatus may be further for: receiving at least one transport audio signal; and receiving at least one metadata associated with the at least one transport audio signal.
The means for receiving the further spatial audio signal based on the further spatial capture format may be further for: receiving at least one further transport audio signal based on the captured audio signals and the further spatial capture format; and receiving at least one further metadata associated with the at least one further transport audio signal.
The means for receiving a microphone change mitigation strategy may be further for receiving information identifying at least one alternate transport audio signal format.
The means may be further for generating and outputting to the further apparatus at least one supported capture type and mitigation strategy based on the received spatial capture format and microphone change mitigation strategy.
The means may be further for generating and outputting an acknowledgment of the receipt of the spatial capture format and microphone change mitigation strategy.
According to a third aspect there is provided a method for capturing spatial audio signals, the method comprising: determining a spatial capture format, the spatial capture format being based on at least two microphones configured to capture audio signals; determining a microphone change mitigation strategy; controlling an outputting of the spatial capture format and microphone change mitigation strategy; generating, and controlling an outputting of a spatial audio signal from the captured audio signals based on the spatial capture format; determining a change in the at least two audio microphones and, based on the determined change, generating and controlling an outputting of a microphone change signal and a mitigation audio signal from the captured audio signals based on the microphone change mitigation strategy; determining a further spatial capture format, the further spatial capture format being based on the change in the at least two audio microphones; and generating, and controlling an outputting of a further spatial audio signal based on the further spatial capture format.
The spatial capture format may comprise at least one of: a number of microphones of the at least two microphones configured to capture audio signals; and a geometry of microphones of the at least two microphones configured to capture audio signals.
The further spatial capture format may comprise at least one of: a number of microphones of the at least two microphones configured to capture audio signals following the change in the at least two audio microphones; and a geometry of microphones of the at least two microphones configured to capture audio signals following the change in the at least two audio microphones.
Generating, and controlling an outputting of a spatial audio signal from the captured audio signals based on the spatial capture format may further comprise: generating at least one transport audio signal based on the captured audio signals and the spatial capture format; and determining at least one metadata associated with the at least one transport audio signal, the at least one metadata based on an analysis of the captured audio signals.
Generating, and controlling the outputting of the further spatial audio signal based on the further spatial capture format may further comprise: generating at least one further transport audio signal based on the captured audio signals and the further spatial capture format; and determining at least one further metadata associated with the at least one further transport audio signal, the at least one further metadata based on an analysis of the captured audio signals following the change in the at least two audio microphones.
Determining a microphone change mitigation strategy may further comprise identifying at least one alternate transport audio signal format.
The at least one transport audio signal may be defined by a first transport audio signal format and identifying at least one alternate transport audio signal format may comprise associating the at least one alternate transport audio signal format with the first transport audio signal format.
Generating and controlling the outputting of the microphone change signal and the mitigation audio signal may comprise selecting the alternate transport audio signal format to generate the mitigation audio signal, based on the first transport audio signal format and the microphone change mitigation strategy.
The method may further comprise receiving at least one supported capture type and mitigation strategy based on the output spatial capture format and microphone change mitigation strategy, and generating, and controlling the outputting of the spatial audio signal from the captured audio signals based on the spatial capture format may further comprise generating, and controlling the outputting of a spatial audio signal from the captured audio signals based on the received at least one supported capture type and mitigation strategy based on the output spatial capture format and microphone change mitigation strategy.
The method may further comprise receiving an acknowledgment of the output of the spatial capture format and microphone change mitigation strategy, and generating, and controlling the outputting of the spatial audio signal from the captured audio signals based on the spatial capture format may further comprise generating, and controlling the outputting of a spatial audio signal from the captured audio signals based on the received acknowledgment of the output of the spatial capture format and microphone change mitigation strategy.
According to a fourth aspect there is provided a method for outputting spatial audio signals, the method comprising: receiving a spatial capture format from an apparatus, the spatial capture format being based on at least two microphones of the apparatus configured to capture audio signals; receiving a microphone change mitigation strategy; receiving a spatial audio signal from the apparatus; generating an output spatial audio signal from the received spatial audio signal based on the spatial capture format; receiving a microphone change signal configured to indicate a change in the at least two audio microphones and, based on the microphone change signal: receiving a mitigation audio signal from the apparatus; generating the output spatial audio signal from the mitigation audio signal based on the microphone change mitigation strategy; receiving a further spatial audio signal and a further spatial capture format from the apparatus, the further spatial audio signal based on the further spatial capture format; and generating the output spatial audio signal from the received further spatial audio signal and based on the further spatial capture format.
The spatial capture format may comprise at least one of: a number of microphones of the at least two microphones configured to capture audio signals; and a geometry of microphones of the at least two microphones configured to capture audio signals.
The further spatial capture format may comprise at least one of: a number of microphones of the at least two microphones configured to capture audio signals following the change in the at least two audio microphones; and a geometry of microphones of the at least two microphones configured to capture audio signals following the change in the at least two audio microphones.
Receiving the spatial audio signal from the apparatus may further comprise: receiving at least one transport audio signal; and receiving at least one metadata associated with the at least one transport audio signal.
Receiving the further spatial audio signal based on the further spatial capture format may further comprise: receiving at least one further transport audio signal based on the captured audio signals and the further spatial capture format; and receiving at least one further metadata associated with the at least one further transport audio signal.
Receiving a microphone change mitigation strategy may further comprise receiving information identifying at least one alternate transport audio signal format.
The method may further comprise generating and outputting to the apparatus at least one supported capture type and mitigation strategy based on the received spatial capture format and microphone change mitigation strategy.
The method may further comprise generating and outputting an acknowledgment of the receipt of the spatial capture format and microphone change mitigation strategy.
According to a fifth aspect there is provided an apparatus for capturing spatial audio signals, the apparatus comprising at least one processor and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to perform: determining a spatial capture format for the apparatus, the spatial capture format being based on at least two microphones configured to capture audio signals; determining a microphone change mitigation strategy; controlling an outputting of the spatial capture format and microphone change mitigation strategy; generating, and controlling an outputting of a spatial audio signal from the captured audio signals based on the spatial capture format; determining a change in the at least two audio microphones and, based on the determined change, generating and controlling an outputting of a microphone change signal and a mitigation audio signal from the captured audio signals based on the microphone change mitigation strategy; determining a further spatial capture format for the apparatus, the further spatial capture format being based on the change in the at least two audio microphones; and generating, and controlling an outputting of a further spatial audio signal based on the further spatial capture format.
The spatial capture format may comprise at least one of: a number of microphones of the at least two microphones configured to capture audio signals; and a geometry of microphones of the at least two microphones configured to capture audio signals.
The further spatial capture format may comprise at least one of: a number of microphones of the at least two microphones configured to capture audio signals following the change in the at least two audio microphones; and a geometry of microphones of the at least two microphones configured to capture audio signals following the change in the at least two audio microphones.
The apparatus caused to perform generating, and controlling the outputting of the spatial audio signal from the captured audio signals based on the spatial capture format may be further caused to perform: generating at least one transport audio signal based on the captured audio signals and the spatial capture format; and determining at least one metadata associated with the at least one transport audio signal, the at least one metadata based on an analysis of the captured audio signals.
The apparatus caused to perform generating, and controlling an outputting of a further spatial audio signal based on the further spatial capture format may be further caused to perform: generating at least one further transport audio signal based on the captured audio signals and the further spatial capture format; and determining at least one further metadata associated with the at least one further transport audio signal, the at least one further metadata based on an analysis of the captured audio signals following the change in the at least two audio microphones.
The apparatus caused to perform determining the microphone change mitigation strategy may be further caused to perform identifying at least one alternate transport audio signal format.
The at least one transport audio signal may be defined by a first transport audio signal format and the apparatus caused to perform identifying at least one alternate transport audio signal format may be caused to perform associating the at least one alternate transport audio signal format with the first transport audio signal format.
The apparatus caused to perform generating and controlling the outputting of the microphone change signal and the mitigation audio signal may be caused to perform selecting the alternate transport audio signal format to generate the mitigation audio signal, based on the first transport audio signal format and the microphone change mitigation strategy.
The apparatus may be further caused to perform receiving at least one supported capture type and mitigation strategy based on the output spatial capture format and microphone change mitigation strategy, and the apparatus caused to perform generating, and controlling the outputting of the spatial audio signal from the captured audio signals based on the spatial capture format may be further caused to perform generating, and controlling the outputting of the spatial audio signal from the captured audio signals based on the received at least one supported capture type and mitigation strategy based on the output spatial capture format and microphone change mitigation strategy.
The apparatus may be further caused to perform receiving an acknowledgment of the output of the spatial capture format and microphone change mitigation strategy, and the apparatus caused to perform generating, and controlling the outputting of the spatial audio signal from the captured audio signals based on the spatial capture format may be further caused to perform generating, and controlling the outputting of the spatial audio signal from the captured audio signals based on the received acknowledgment of the output of the spatial capture format and microphone change mitigation strategy.
According to a sixth aspect there is provided an apparatus for outputting spatial audio signals, the apparatus comprising at least one processor and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to perform: receiving a spatial capture format from a further apparatus, the spatial capture format being based on at least two microphones of the further apparatus configured to capture audio signals; receiving a microphone change mitigation strategy; receiving a spatial audio signal from the further apparatus; generating an output spatial audio signal from the received spatial audio signal based on the spatial capture format; receiving a microphone change signal configured to indicate a change in the at least two audio microphones and, based on the microphone change signal: receiving a mitigation audio signal from the further apparatus; generating the output spatial audio signal from the mitigation audio signal based on the microphone change mitigation strategy; and receiving a further spatial audio signal and a further spatial capture format from the further apparatus, the further spatial audio signal based on the further spatial capture format; and generating the output spatial audio signal from the received further spatial audio signal and based on the further spatial capture format.
The spatial capture format may comprise at least one of: a number of microphones of the at least two microphones configured to capture audio signals; and a geometry of microphones of the at least two microphones configured to capture audio signals.
The further spatial capture format may comprise at least one of: a number of microphones of the at least two microphones configured to capture audio signals following the change in the at least two audio microphones; and a geometry of microphones of the at least two microphones configured to capture audio signals following the change in the at least two audio microphones.
The apparatus caused to perform receiving the spatial audio signal from the further apparatus may be further caused to perform: receiving at least one transport audio signal; and receiving at least one metadata associated with the at least one transport audio signal.
The apparatus caused to perform receiving the further spatial audio signal based on the further spatial capture format may be further caused to perform: receiving at least one further transport audio signal based on the captured audio signals and the further spatial capture format; and receiving at least one further metadata associated with the at least one further transport audio signal.
The apparatus caused to perform receiving the microphone change mitigation strategy may be further caused to perform receiving information identifying at least one alternate transport audio signal format.
The apparatus may be further caused to perform generating and outputting to the further apparatus at least one supported capture type and mitigation strategy based on the received spatial capture format and microphone change mitigation strategy.
The apparatus may be further caused to perform generating and outputting an acknowledgment of the receipt of the spatial capture format and microphone change mitigation strategy.
According to a seventh aspect there is provided an apparatus for capturing spatial audio signals, the apparatus comprising: determining circuitry configured to determine a spatial capture format for the apparatus, the spatial capture format being based on at least two microphones configured to capture audio signals; determining circuitry configured to determine a microphone change mitigation strategy; controlling circuitry configured to control an outputting of the spatial capture format and microphone change mitigation strategy; generating and controlling circuitry configured to generate and control an outputting of a spatial audio signal from the captured audio signals based on the spatial capture format; determining circuitry configured to determine a change in the at least two audio microphones and, based on the determined change, to generate and control an outputting of a microphone change signal and a mitigation audio signal from the captured audio signals based on the microphone change mitigation strategy; determining circuitry configured to determine a further spatial capture format for the apparatus, the further spatial capture format being based on the change in the at least two audio microphones; and generating and controlling circuitry configured to generate and control an outputting of a further spatial audio signal based on the further spatial capture format.
According to an eighth aspect there is provided an apparatus for outputting spatial audio signals, the apparatus comprising: receiving circuitry configured to receive a spatial capture format from a further apparatus, the spatial capture format being based on at least two microphones of the further apparatus configured to capture audio signals; receiving circuitry configured to receive a microphone change mitigation strategy; receiving circuitry configured to receive a spatial audio signal from the further apparatus; generating circuitry configured to generate an output spatial audio signal from the received spatial audio signal based on the spatial capture format; receiving circuitry configured to receive a microphone change signal configured to indicate a change in the at least two audio microphones and, based on the microphone change signal: receiving circuitry configured to receive a mitigation audio signal from the further apparatus; generating circuitry configured to generate the output spatial audio signal from the mitigation audio signal based on the microphone change mitigation strategy; and receiving circuitry configured to receive a further spatial audio signal and a further spatial capture format from the further apparatus, the further spatial audio signal based on the further spatial capture format; and generating circuitry configured to generate the output spatial audio signal from the received further spatial audio signal and based on the further spatial capture format.
According to a ninth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising instructions] for causing an apparatus, for capturing spatial audio signals, to perform at least the following: determining a spatial capture format for the apparatus, the spatial capture format being based on at least two microphones configured to capture audio signals; determining a microphone change mitigation strategy; controlling an outputting of the spatial capture format and microphone change mitigation strategy; generating, and controlling an outputting of a spatial audio signal from the captured audio signals based on the spatial capture format; determining a change in the at least two audio microphones and, based on the determined change, generating and controlling an outputting of a microphone change signal and a mitigation audio signal from the captured audio signals based on the microphone change mitigation strategy; determining a further spatial capture format for the apparatus, the further spatial capture format being based on the change in the at least two audio microphones; and generating, and controlling an outputting of a further spatial audio signal based on the further spatial capture format.
According to a tenth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising instructions] for causing an apparatus, for outputting spatial audio signals, to perform at least the following: receiving a spatial capture format from a further apparatus, the spatial capture format being based on at least two microphones of the further apparatus configured to capture audio signals; receiving a microphone change mitigation strategy; receiving a spatial audio signal from the further apparatus; generating an output spatial audio signal from the received spatial audio signal based on the spatial capture format; receiving a microphone change signal configured to indicate a change in the at least two audio microphones and, based on the microphone change signal: receiving a mitigation audio signal from the further apparatus; generating the output spatial audio signal from the mitigation audio signal based on the microphone change mitigation strategy; and receiving a further spatial audio signal and a further spatial capture format from the further apparatus, the further spatial audio signal based on the further spatial capture format; and generating the output spatial audio signal from the received further spatial audio signal and based on the further spatial capture format.
According to an eleventh aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus, for capturing spatial audio signals, to perform at least the following: determining a spatial capture format for the apparatus, the spatial capture format being based on at least two microphones configured to capture audio signals; determining a microphone change mitigation strategy; controlling an outputting of the spatial capture format and microphone change mitigation strategy; generating, and controlling an outputting of a spatial audio signal from the captured audio signals based on the spatial capture format; determining a change in the at least two audio microphones and, based on the determined change, generating and controlling an outputting of a microphone change signal and a mitigation audio signal from the captured audio signals based on the microphone change mitigation strategy; determining a further spatial capture format for the apparatus, the further spatial capture format being based on the change in the at least two audio microphones; and generating, and controlling an outputting of a further spatial audio signal based on the further spatial capture format.
According to a twelfth aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus, for outputting spatial audio signals, to perform at least the following: receiving a spatial capture format from a further apparatus, the spatial capture format being based on at least two microphones of the further apparatus configured to capture audio signals; receiving a microphone change mitigation strategy; receiving a spatial audio signal from the further apparatus; generating an output spatial audio signal from the received spatial audio signal based on the spatial capture format; receiving a microphone change signal configured to indicate a change in the at least two audio microphones and, based on the microphone change signal: receiving a mitigation audio signal from the further apparatus; generating the output spatial audio signal from the mitigation audio signal based on the microphone change mitigation strategy; and receiving a further spatial audio signal and a further spatial capture format from the further apparatus, the further spatial audio signal based on the further spatial capture format; and generating the output spatial audio signal from the received further spatial audio signal and based on the further spatial capture format.
According to a thirteenth aspect there is provided an apparatus for capturing spatial audio signals, the apparatus comprising: means for determining a spatial capture format for the apparatus, the spatial capture format being based on at least two microphones configured to capture audio signals; means for determining a microphone change mitigation strategy; means for controlling an outputting of the spatial capture format and microphone change mitigation strategy; means for generating, and controlling an outputting of a spatial audio signal from the captured audio signals based on the spatial capture format; means for determining a change in the at least two audio microphones and, based on the determined change, generating and controlling an outputting of a microphone change signal and a mitigation audio signal from the captured audio signals based on the microphone change mitigation strategy; means for determining a further spatial capture format for the apparatus, the further spatial capture format being based on the change in the at least two audio microphones; and means for generating, and controlling an outputting of a further spatial audio signal based on the further spatial capture format.
According to a fourteenth aspect there is provided an apparatus for outputting spatial audio signals, the apparatus comprising: means for receiving a spatial capture format from a further apparatus, the spatial capture format being based on at least two microphones of the further apparatus configured to capture audio signals; means for receiving a microphone change mitigation strategy; means for receiving a spatial audio signal from the further apparatus; means for generating an output spatial audio signal from the received spatial audio signal based on the spatial capture format; means for receiving a microphone change signal configured to indicate a change in the at least two audio microphones and, based on the microphone change signal: receiving a mitigation audio signal from the further apparatus; generating the output spatial audio signal from the mitigation audio signal based on the microphone change mitigation strategy; and receiving a further spatial audio signal and a further spatial capture format from the further apparatus, the further spatial audio signal based on the further spatial capture format; and means for generating the output spatial audio signal from the received further spatial audio signal and based on the further spatial capture format.
According to a fifteenth aspect there is provided a computer readable medium comprising instructions for causing an apparatus, for capturing spatial audio signals, to perform at least the following: determining a spatial capture format for the apparatus, the spatial capture format being based on at least two microphones configured to capture audio signals; determining a microphone change mitigation strategy; controlling an outputting of the spatial capture format and microphone change mitigation strategy; generating, and controlling an outputting of a spatial audio signal from the captured audio signals based on the spatial capture format; determining a change in the at least two audio microphones and, based on the determined change, generating and controlling an outputting of a microphone change signal and a mitigation audio signal from the captured audio signals based on the microphone change mitigation strategy; determining a further spatial capture format for the apparatus, the further spatial capture format being based on the change in the at least two audio microphones; and generating, and controlling an outputting of a further spatial audio signal based on the further spatial capture format.
According to a sixteenth aspect there is provided a computer readable medium comprising instructions for causing an apparatus, for outputting spatial audio signals, to perform at least the following: receiving a spatial capture format from a further apparatus, the spatial capture format being based on at least two microphones of the further apparatus configured to capture audio signals; receiving a microphone change mitigation strategy; receiving a spatial audio signal from the further apparatus; generating an output spatial audio signal from the received spatial audio signal based on the spatial capture format; receiving a microphone change signal configured to indicate a change in the at least two audio microphones and, based on the microphone change signal: receiving a mitigation audio signal from the further apparatus; generating the output spatial audio signal from the mitigation audio signal based on the microphone change mitigation strategy; and receiving a further spatial audio signal and a further spatial capture format from the further apparatus, the further spatial audio signal based on the further spatial capture format; and generating the output spatial audio signal from the received further spatial audio signal and based on the further spatial capture format.
An apparatus comprising means for performing the actions of the method as described above.
An apparatus configured to perform the actions of the method as described above.
A computer program comprising program instructions for causing a computer to perform the method as described above.
A computer program product stored on a medium may cause an apparatus to perform the method as described herein.
An electronic device may comprise apparatus as described herein.
A chipset may comprise apparatus as described herein.
Embodiments of the present application aim to address problems associated with the state of the art.
Summary of the Figures
For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:
Figure 1 shows schematically a system of apparatus suitable for implementing some embodiments;
Figure 2 shows schematically a capture apparatus as shown in the system of apparatus as shown in Figure 1 according to some embodiments;
Figure 3 shows schematically a playback apparatus as shown in the system of apparatus as shown in Figure 1 according to some embodiments;
Figures 4a and 4b show a flow diagram of the operation of the example capture apparatus as shown in Figure 2 according to some embodiments;
Figure 5 shows a flow diagram of the operation of the example playback apparatus as shown in Figure 3 according to some embodiments;
Figure 6 shows a signalling flow diagram of the operation of the example capture and playback apparatus during an initialization operation according to some embodiments;
Figure 7 shows a signalling flow diagram of the operation of the example capture and playback apparatus during a call operation according to some embodiments;
Figures 8 and 9 show example apparatus configurations suitable for being employed as capture apparatus; and
Figure 10 shows a schematic view of an example device suitable for implementing the apparatus shown in previous figures.
Embodiments of the Application

The following describes in further detail suitable apparatus and possible mechanisms for spatial audio capture and playback. These are applicable to immersive communication technologies such as 3GPP Immersive Voice and Audio Services (IVAS).
As indicated above, changes to a spatial audio capture apparatus can require microphone array geometry calibration before the apparatus is able to correctly determine spatial audio signals and associated spatial metadata (or parameters).
A non-exclusive list of cases which can require calibration or re-calibration is:
- Adding of microphones, e.g., an extra peripheral is added to be used as a distributed microphone array with built-in device microphones. This can happen when a user is already in a call with their mobile device and enters a meeting room that has tabletop teleconferencing microphone devices that the phone can connect to. The device microphones and tabletop microphones will form a distributed microphone array, which can be configured to employ geometry calibration to find out the microphone locations so that the capture algorithms can adapt to the new configuration;
- Change of microphone array shape, e.g., in a foldable device when the device shape changes. When the device folding changes slightly and microphones are located in separate folding halves of the device, the pairwise distances of the microphones change; microphone array geometry calibration can be performed and the capturing algorithm must be updated to use the new data;
- Change of orientation and distance of a distributed microphone array, e.g., a teleconference microphone array is moved on a conference room table. This could happen when a user moves a tabletop microphone closer to a talking person during the call. If the moved microphone is part of a distributed microphone array, the pairwise distances of the microphones change, which requires microphone array geometry calibration and capture algorithm parameter updates;
- Initialization of a microphone array, e.g., it is used for the first time. This can happen if a new peripheral is attached or if an application is downloaded from an app store to a previously unknown device. The device can be configured to perform a full microphone array geometry calibration to get microphone locations that can be used by the spatial audio capturing algorithm.
Before the first and during subsequent microphone array geometry calibrations any capture apparatus (and the capturing algorithm) does not accurately know the microphone locations and pairwise microphone distances. As these locations and distances are used together with the timing differences to determine parameters such as the direction of arrival, when the microphone location information is incorrect the capture apparatus cannot analyse spatial sound information and cannot produce spatial audio.
The microphone array geometry calibration operation may be fast or slow depending on the circumstances. Its duration could be, e.g., seconds, tens of seconds or some minutes.
As indicated above, an application of spatial audio is an immersive voice call, which in the following is defined to be a scalable voice call that can use, e.g., types of voice for communication such as: mono audio, stereo audio, spatial audio and object audio (which can be mono, stereo, or spatial audio). Spatial audio furthermore with respect to the following disclosure can non-exclusively include formats such as: stereo + metadata (e.g., MASA); multichannel speaker signals (e.g., 5.1); or ambisonics. It is also possible to send object audio and background audio separately, where the object audio could be of one format (e.g., mono) and the background audio could be some spatial audio format. An immersive voice call can therefore be implemented between at least two client devices or apparatus. In the following the terms device and apparatus are interchangeable. A first of the at least two client devices or apparatus is arranged as a capture apparatus configured to capture audio signals from the audio environment within which it is located. These audio signals can be encoded and sent/stored. The (encoded) audio signals can be received by at least one further of the at least two client devices or apparatus arranged as a playback apparatus configured to decode and generate audio signals for suitable transducers. It would be understood that for two-directional or two-way or multi-directional communication the first client device comprises suitable playback apparatus and the further client device comprises suitable capture apparatus.
In some situations when an immersive call client sends spatial audio to another client, the receiving client is configured to handle the rendering of the audio. In other words, a suitable audio signal output is generated and passed to suitable audio transducers to produce the output for the listener. The rendering may be implemented on headphones or device speakers. The rendering is dependent on the receiving client device's properties and therefore the sending client (capture apparatus) cannot make decisions related to it.
When microphone array geometry calibration happens during capturing of spatial audio, the capturing algorithm may be able to adapt to the situation. For example, if there are microphones whose geometry has not been affected, the algorithm can adapt to the unchanged subset of the array and, when the geometry calibration finishes, adapt to use the full array. The above could happen, e.g., in a distributed microphone array comprising a mobile device with multiple microphones and a tabletop teleconferencing device also with multiple microphones. When the tabletop unit is moved, the geometry of the distributed array changes and the distance and orientation between the devices have to be determined by performing geometry calibration. During the calibration the capture apparatus can be configured to adapt by using only the mobile device microphone array or only the tabletop microphone array, which are both parts of the distributed microphone array that have remained unchanged (a sketch of this sub-array selection is given below). However, by using a limited microphone selection the quality of the audio signals captured for the call is reduced.
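The following is a hedged sketch of that sub-array fallback; the device-to-microphone grouping and index values are illustrative assumptions, not taken from the application.

```python
# While the distributed array is being recalibrated, capture can continue
# with a sub-array whose internal geometry is known to be unchanged.
MIC_GROUPS = {
    "mobile": [0, 1, 2],   # microphones built into the mobile device
    "tabletop": [3, 4],    # microphones in the tabletop unit
}

def fallback_sub_array(moved_device: str) -> list[int]:
    """Pick one intact sub-array to keep capturing with during calibration.

    Each device's own microphones keep their pairwise geometry even when
    the devices move relative to each other, so any single group is a
    valid fallback; here we simply prefer a group that did not move.
    """
    for device, mics in MIC_GROUPS.items():
        if device != moved_device:
            return mics
    return MIC_GROUPS[moved_device]  # the moved device's array is also intact

print(fallback_sub_array("tabletop"))  # -> [0, 1, 2] (mobile sub-array only)
```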
Furthermore, when the capture apparatus continues to employ the parameters that match the geometry before the change, a best case result is one where spatial features are lost from the audio and a worst case result is one where the audio becomes unintelligible.
It may be possible to configure the client devices such that, when a change is detected, they start to negotiate a new connection, i.e., decide to use mono or stereo audio instead of spatial audio. However, the negotiation will take an amount of time during which the audio quality is degraded for the aforementioned reasons. Also, when the capture device is configured with the new parameters and the playback device (client) resumes decoding audio signals captured using the new parameters, there will be some disruption to the generated audio signals. This happens because the client devices (the capture and playback apparatus) have to allocate and initialize new algorithms for the capturing and decoding respectively. Furthermore, the capture apparatus may also be configured to initialize the microphone capture (in the capturing software), which will create further disruption because the first recording (using the old microphone configuration) is stopped and then a new recording with another microphone configuration is started.
Disruptions in spatial audio capture due to changes in microphone array geometry and the resulting calibration are temporary by nature. When the microphone array geometry calibration is performed and the parameters for the capturing algorithm are updated, the capture apparatus can be configured to resume capturing spatial audio. If a disruption is caused in the immersive voice call every time the microphone array geometry is affected and every time the calibration finishes, the user experience of the call will be poor.
In some embodiments an immersive voice calling application is provided with the ability to generate and signal a mitigation strategy at the beginning of a call in order to prevent disruptions during the call due to temporary unavailability of spatial audio capturing caused by changes in microphone array geometry.
In some embodiments the following operations or steps are implemented (a sketch of this signalling flow is given below):
- Start an immersive voice call
- Client A
o Publishes its spatial audio capturing format
o Publishes its mitigation strategy for temporary outages in spatial audio capture due to changes in microphone array geometry
- Client B
o Sets up to receive and decode/render audio in the format published by client A
o Sets up the mitigation strategy for client A's audio stream
- During the call there is a change to the microphone array geometry of device A
o Device A signals that spatial audio is now unavailable and starts to send the audio stream using the mitigation strategy
o Device B has already initialized algorithms for the mitigation strategy and can start to decode/render the changed audio stream seamlessly
- After a short time device A has completed microphone array geometry calibration and is ready to resume sending spatial audio
o Device A signals that spatial audio is available again and starts to send audio without the mitigation strategy
o Device B receives information about the change and can switch seamlessly back to decoding/rendering spatial audio

In some embodiments the Client A device can be, e.g., a distributed telco microphone array with a main tabletop unit with two microphones and a satellite microphone unit with one microphone. In this example both units are on a table and the separate microphone is sometimes moved in the call to be closer to a person speaking.
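The sketch below reduces this signalling exchange to the playback side (Client B); the message, class and field names are illustrative assumptions for this sketch rather than any specified IVAS signalling.

```python
from dataclasses import dataclass
from enum import Enum, auto

class CaptureState(Enum):
    SPATIAL = auto()      # full spatial audio available
    MITIGATION = auto()   # geometry calibration in progress

@dataclass
class CaptureFormat:
    name: str             # e.g. "FOA" or "MASA" (illustrative labels)
    channels: int

@dataclass
class MitigationStrategy:
    fallback_format: CaptureFormat   # published by Client A at call setup

class PlaybackClient:
    """Client B: prepares both decode paths at setup, then switches seamlessly."""

    def setup(self, fmt: CaptureFormat, strategy: MitigationStrategy) -> None:
        # Both the primary and fallback renderers are initialized up front,
        # so no renegotiation is needed when the change signal arrives.
        self.primary, self.fallback = fmt, strategy.fallback_format
        self.state = CaptureState.SPATIAL

    def on_change_signal(self, spatial_available: bool) -> None:
        self.state = (CaptureState.SPATIAL if spatial_available
                      else CaptureState.MITIGATION)

b = PlaybackClient()
b.setup(CaptureFormat("FOA", 4), MitigationStrategy(CaptureFormat("stereo", 2)))
b.on_change_signal(spatial_available=False)  # geometry change: use fallback
b.on_change_signal(spatial_available=True)   # calibration done: resume spatial
```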
Furthermore, in some embodiments the mitigation is, e.g., that the device captures spatial audio in first order ambisonics format (four channels of audio). When the satellite microphone is moved and the system cannot produce spatial audio due to ongoing microphone array geometry calibration, the mitigation strategy is to send the unprocessed stereo signal from the tabletop unit in channels 0 and 1 of the ambisonics stream.
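For example, the channel packing for this mitigation could look like the following sketch; the exact ambisonics channel ordering and the sample count are assumptions made for illustration.

```python
import numpy as np

def pack_stereo_into_foa(stereo: np.ndarray) -> np.ndarray:
    """stereo: (2, samples) -> FOA-shaped (4, samples) mitigation stream.

    The agreed four-channel first-order ambisonics (FOA) layout is kept,
    but the unprocessed tabletop stereo pair is carried in channels 0 and 1.
    """
    foa = np.zeros((4, stereo.shape[1]), dtype=stereo.dtype)
    foa[0] = stereo[0]   # channel 0 <- left tabletop microphone
    foa[1] = stereo[1]   # channel 1 <- right tabletop microphone
    return foa           # channels 2 and 3 stay silent during calibration

mitigation_stream = pack_stereo_into_foa(np.random.randn(2, 960))
assert mitigation_stream.shape == (4, 960)
```

Keeping the channel count and layout fixed is what lets the receiver switch between the two modes without re-initializing its decoder.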
With such a mitigation strategy the receiving unit can prepare for the change and handle switching between the two formats seamlessly.
An immersive call system employing the embodiments as described in further detail herein is thus aimed at the following:
- Reducing or removing the temporary unavailability of spatial audio due to microphone array geometry calibration, because a mitigation strategy has been prepared already during the initialization of the call;
- Maintaining voice intelligibility;
- Reducing or removing interruptions in the audio caused by microphone array geometry calibration, due to the mitigation strategy.
With respect to Figure 1 is shown a schematic view of an example system 10 of apparatus within which some embodiments can be implemented.
In the example shown in Figure 1 there is shown a first client device 109 communicating to a second client device 113 via a suitable communication interface 111. This is a simplified example of a full communications system which is aimed at explaining the embodiments described herein. In this example the first client device 109 is operating as a capture device (or apparatus) and the second client device 113 is operating as a playback or rendering device (or apparatus). As described herein the first client device 109 can communicate with more than one 'second' client device (a one-to-many call) but for clarity the following example shows a one-to-one call being made.
In the example shown in Figure 1 the first client device 109 (or capture apparatus) is shown receiving audio signals from a microphone array represented by the microphones 101, 103, 105 and a further microphone 107. The combination of the microphones can in themselves be considered to form one or more arrays of microphones by switching microphones on or off or by grouping the output of the microphones in a suitable manner. The microphones can be any suitable type. In some embodiments at least one of the microphones is located on the first client device (in other words, in some embodiments the first client device comprises at least one microphone). Furthermore in some embodiments at least one of the microphones is not physically located on or connected to the first client device 109.
In such embodiments there may be a wired or wireless connection between the microphone and the first client device 109 to enable the audio signals to be received by the first client device 109.
In some embodiments at least one of the microphones can be moved relative to the others and thus alter the effective geometry of the microphone array. Similarly in some embodiments at least one of the microphones can be selected/deselected and therefore alter the effective geometry of the remaining microphones forming the microphone array.
Furthermore in the example shown in Figure 1 the second client device 113 (or playback/rendering apparatus) is shown outputting audio signals to a suitable output device or apparatus, which in this example is a headset 117. The connection 115 between the second client device 113 and the output device can be any suitable one and be unidirectional or bi-directional (for example in some embodiments the headset 117 can comprise at least one microphone for audio capture or head orientation/position sensors configured to enable the second client device to generate an output audio signal based on the head orientation/position sensor output). The output device can be any suitable output transducer, for example headphones, earphones, speakers etc.

With respect to Figure 2 is shown a schematic view of an example first client device 109 operating as a capture apparatus.
The first client device 109 in the example shown in Figure 2 is shown comprising a geometry change determiner/controller 207. The geometry change determiner/controller 207 in some embodiments is configured to control the initialization and operation of the immersive call (or more generally the supply of real-time or near-real-time spatial audio signals) and the operation of which is described in further detail herein.
Furthermore the first client device 109 is shown comprising a geometry calibrator 205 (or more generally a microphone configuration determiner). The geometry calibrator 205 is configured to determine the configuration of the microphones currently supplying audio signals to the first client device 109. The determination of microphone configuration parameters (such as relative orientation and/or distance between microphone pairs and microphone locations) is known in the art and not described herein in further detail. The output of the geometry calibrator 205 can be passed to a metadata generator 203 for assisting in generating spatial audio parameters for a spatial audio signal. Furthermore the output of the geometry calibrator 205 can, in some embodiments, be passed to the multiplexer/encoder 209 to be encoded and passed to the second client device.

The first client device 109 in some embodiments comprises a transport signal generator 201. The transport signal generator 201 is configured to receive the audio signals from the microphones 101, 103, 105 and 107 and generate an audio signal which can be passed to the multiplexer/encoder 209 to be encoded and output to the second client device 113. The transport signal generator 201 can be configured to generate transport audio signals which may be multi-channel, stereo, binaural or mono audio signals. The generation of transport audio signals can be implemented using any suitable method, for example selecting a left-right microphone pair and applying suitable processing to the signal pair, such as automatic gain control, microphone noise removal, wind noise removal, and equalization (a simple sketch is given below). The operation of the transport signal generator 201 can in some embodiments be controlled by the geometry change determiner/controller 207 and is described in further detail later.
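As a minimal illustration of transport signal generation, the sketch below selects a left-right microphone pair and applies a crude gain normalization standing in for automatic gain control; the pair indices and target level are assumptions, and the noise removal and equalization steps mentioned above are omitted.

```python
import numpy as np

def generate_transport(mics: np.ndarray, left: int = 0, right: int = 1,
                       target_rms: float = 0.1) -> np.ndarray:
    """mics: (n_mics, samples) -> stereo transport signal (2, samples)."""
    pair = mics[[left, right], :]                 # select the left-right pair
    rms = np.sqrt(np.mean(pair ** 2) + 1e-12)     # epsilon avoids divide-by-zero
    return pair * (target_rms / rms)              # crude automatic gain control

transport = generate_transport(np.random.randn(4, 960) * 0.01)
assert transport.shape == (2, 960)
```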
The first client device 109 in some embodiments comprises a metadata generator 203. The (spatial) metadata associated with the audio signals may comprise multiple parameters (such as multiple directions and, associated with each direction (or directional value), a direct-to-total ratio, spread coherence, distance, etc.) per time-frequency tile. The spatial metadata may also comprise other parameters or may be associated with other parameters which are considered to be non-directional (such as surround coherence, diffuse-to-total energy ratio, remainder-to-total energy ratio) but when combined with the directional parameters are able to be used to define the characteristics of the audio scene within which the microphones are located. For example a reasonable design choice which is able to produce a good quality output is one where the spatial metadata comprises one or more directions for each time-frequency subframe and, associated with each direction, direct-to-total ratios, spread coherence, distance values etc.
As described above, parametric spatial metadata representation can use multiple concurrent spatial directions. With MASA in IVAS, the defined maximum number of concurrent directions is two. For each concurrent direction, there may be associated parameters such as: Direction index; Direct-to-total ratio; Spread coherence; and Distance. In some embodiments other parameters such as Diffuse-to-total energy ratio; Surround coherence; and Remainder-to-total energy ratio are defined.
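By way of illustration only, the following Python sketch shows one possible in-memory grouping of these parameters per time-frequency tile; the class and field names are hypothetical and do not reproduce the MASA bitstream syntax or its quantization.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DirectionalParams:
    # One set of parameters per concurrent direction (at most two in MASA).
    direction_index: int     # quantized direction on a spherical grid
    direct_to_total: float   # in [0, 1]
    spread_coherence: float  # in [0, 1]
    distance: float

@dataclass
class TileMetadata:
    # Spatial metadata for one time-frequency tile (band k, subframe n).
    band: int
    subframe: int
    directions: List[DirectionalParams] = field(default_factory=list)
    diffuse_to_total: float = 0.0
    surround_coherence: float = 0.0
    remainder_to_total: float = 0.0
```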
The metadata generator 203 can be configured to perform spatial analysis on the input audio signals yielding suitable spatial metadata in frequency bands.
For the aforementioned input types, there exist known methods to generate suitable spatial metadata, for example directions and direct-to-total energy ratios (or similar parameters such as diffuseness, i.e., ambient-to-total ratios) in frequency bands. These methods are not detailed herein, however, some examples may comprise the performing of a suitable time-frequency transform for the input signals, and then in frequency bands estimating delay-values between microphone pairs that maximize the inter-microphone correlation, and formulating the corresponding direction value to that delay (as described in GB Patent Application Number 1619573.7 and PCT Patent Application Number PCT/FI2017/050778), and formulating a ratio parameter based on the correlation value. The direct-to-total energy ratio parameter for multi-channel captured microphone array signals can be estimated based on the normalized cross-correlation parameter cor'(k, n) between a microphone pair at band k; the value of the cross-correlation parameter lies between -1 and 1. A direct-to-total energy ratio parameter r(k, n) can be determined by comparing the normalized cross-correlation parameter to a diffuse field normalized cross-correlation parameter cor_D(k, n) as r(k, n) = (cor'(k, n) - cor_D(k, n)) / (1 - cor_D(k, n)). The direct-to-total energy ratio is explained further in PCT publication WO2017/005978 which is incorporated herein by reference.
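As a minimal Python sketch of this estimation, the formula above can be implemented as follows; the sinc-based diffuse-field coherence model (valid for omnidirectional microphones in an ideal diffuse field) and the function names are illustrative assumptions, not codec-defined behaviour.

```python
import numpy as np

def diffuse_field_correlation(freqs_hz, mic_spacing_m, speed_of_sound=343.0):
    # For an ideal diffuse field and omnidirectional microphones the
    # coherence follows sin(x)/x with x = 2*pi*f*d/c; np.sinc(y) equals
    # sin(pi*y)/(pi*y), hence the argument 2*f*d/c below.
    return np.sinc(2.0 * np.asarray(freqs_hz) * mic_spacing_m / speed_of_sound)

def direct_to_total_ratio(cor_meas, cor_diffuse, eps=1e-9):
    """Estimate r(k, n) = (cor'(k, n) - cor_D(k, n)) / (1 - cor_D(k, n)),
    clipped to [0, 1]; eps guards against division by zero near DC where
    the diffuse-field correlation approaches one."""
    r = (np.asarray(cor_meas) - cor_diffuse) / (1.0 - cor_diffuse + eps)
    return np.clip(r, 0.0, 1.0)
```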
The metadata can be of various forms and in some embodiments comprise spatial metadata and other metadata. A typical parameterization for the spatial metadata is one direction parameter in each frequency band characterized as an azimuth value φ(k, n) and an elevation value θ(k, n), and an associated direct-to-total energy ratio in each frequency band r(k, n), where k is the frequency band index and n is the temporal frame index.
In some embodiments the parameters generated may differ from frequency band to frequency band. Thus, for example in band X all of the parameters are generated and transmitted, whereas in band Y only one of the parameters is generated and transmitted, and furthermore in band Z no parameters are generated or transmitted. A practical example of this may be that for some frequency bands such as the highest band some of the parameters are not required for perceptual reasons.
In some embodiments when the audio input is an FOA signal or from a B-format microphone the metadata generator 203 can be configured to determine parameters such as an intensity vector, based on which the direction parameter is obtained, and to compare the intensity vector length to the overall sound field energy estimate to determine the ratio parameter. This method is known in the literature as Directional Audio Coding (DirAC).
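A minimal Python sketch of such a DirAC-style analysis for a single FOA time-frequency tile is given below; the energy normalization and the resulting ratio scaling differ between FOA conventions, so the exact scaling here is an assumption for illustration.

```python
import numpy as np

def dirac_direction_and_ratio(W, X, Y, Z, eps=1e-12):
    """DirAC-style analysis for one time-frequency tile of an FOA signal.

    W, X, Y, Z are complex STFT bins (in practice averaged over the tile).
    Returns (azimuth_rad, elevation_rad, ratio); illustrative only.
    """
    # Short-time intensity vector: real part of conj(W) times the dipoles.
    intensity = np.array([np.real(np.conj(W) * X),
                          np.real(np.conj(W) * Y),
                          np.real(np.conj(W) * Z)])

    # Sound arrives from the direction opposite to the intensity vector.
    doa = -intensity
    azimuth = np.arctan2(doa[1], doa[0])
    elevation = np.arctan2(doa[2], np.hypot(doa[0], doa[1]))

    # Compare intensity length with the overall sound field energy to get
    # a ratio-like parameter (1 = fully directional, 0 = fully diffuse).
    energy = 0.5 * (np.abs(W) ** 2 + np.abs(X) ** 2
                    + np.abs(Y) ** 2 + np.abs(Z) ** 2)
    ratio = np.linalg.norm(intensity) / (energy + eps)
    return azimuth, elevation, float(np.clip(ratio, 0.0, 1.0))
```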
As such the output of the metadata generator 203 is (spatial) metadata determined in frequency bands. The (spatial) metadata may involve directions and energy ratios in frequency bands but may also have any of the metadata types listed previously. The (spatial) metadata can vary over time and over frequency.
The first client device furthermore comprises a multiplexer/encoder 209. The multiplexer/encoder can be configured to receive the (spatial) metadata, transport audio signals, information from the geometry calibrator 205 and information from the geometry change determiner/controller 207 and generate a suitable call (or more generally spatial audio) bitstream to be output on the communication interface 111. The multiplexer/encoder 209, for example, could comprise an IVAS encoder, or any other suitable encoder. The multiplexer/encoder 209, in such embodiments, is configured to encode the audio signals and the metadata and information and form an IVAS bit stream. The multiplexer/encoder 209 can encode the audio signal using any suitable audio signal encoder. For example an Enhanced Voice Services (EVS) or Immersive Voice and Audio Services (IVAS) stereo core encoder implementation can be applied to the transport (audio) signals to generate suitable encoded transport audio signals. In some embodiments the multiplexer/encoder 209 is configured to quantize the determined direction parameters (azimuth and elevation or other co-ordinate systems) and encode indices identifying the quantized values, for example using entropy encoding.
With respect to Figure 3 is shown a schematic view of an example second client device 113 operating as a playback/rendering apparatus.
The second client device 113 in the example shown in Figure 3 is shown comprising a demultiplexer/decoder 301. The demultiplexer/decoder 301 is configured to receive, retrieve, or otherwise obtain the bitstream and from the bitstream generate suitable demultiplexed and decoded streams. For example decoded information from the geometry calibrator 205 and geometry change determiner/controller 207 can be passed to a mitigation controller 303, decoded audio signals to the transport signal regenerator 305 and spatial metadata to the metadata regenerator 307.
The second client device 113 can comprise a mitigation controller 303 configured to receive from the demultiplexer/decoder 301 any information from the first client device 109 with respect to the capture apparatus microphone configuration and the determination of any configuration changes. The mitigation controller 303 is configured based on this information to control the operation of the transport signal regenerator 305 and the metadata regenerator 307 and furthermore the renderer/output audio generator 309. The operation of the mitigation controller 303 is described in further detail later; however, in summary the mitigation controller 303 is configured to detect when the first client device 109 is undergoing a capture reconfiguration and configuration determination and furthermore to determine what action to perform when this is signalled.
The second client device 113 further comprises a transport signal regenerator 305 configured to obtain the decoded audio signal from the demultiplexer/decoder 301 and based on control from the mitigation controller pass suitable audio signals from which the output audio generator 309 is able to generate audio signals.
Furthermore the second client device 113 further comprises a metadata regenerator 307 configured to obtain the decoded spatial metadata from the demultiplexer/decoder 301 and based on control from the mitigation controller pass suitable metadata from which the output audio generator 309 is able to generate audio signals.
Furthermore the second client device 113 can comprise a renderer/output audio generator 309 configured to obtain the regenerated metadata, audio signals and mitigation information and generate suitable output audio signals which can be output over communications link 115 to the headset 117 or other suitable transducer means.
With respect to Figures 4a and 4b are shown flow diagrams showing an example operation of the first client device 109 shown in Figure 2. Furthermore with respect to Figure 5 is shown a flow diagram showing an example operation of the second client device 113 shown in Figure 3.
The initial operation, with respect to the first client device, is that of starting the voice call as shown in Figure 4a by step 401.
The first client device, and specifically the geometry calibrator 205, is configured to generate and publish a determined spatial capture format, which can be in the form of an array geometry calibration, as shown in Figure 4a by step 403. The first client device, and specifically the geometry change determiner/controller 207, can then be configured to generate and publish a mitigation strategy (information) for temporary outages in spatial audio capture due to changes in microphone array geometry as shown in Figure 4a by step 405. Additionally the first client device, and specifically the transport signal generator 201 and multiplexer/encoder 209, is configured to generate the (encoded) transport audio signals, and the metadata generator the (encoded) spatial metadata, as shown in Figure 4a by step 407. These are generated based on the spatial capture format information (or array geometry calibration).
The first client device, and specifically the geometry change determiner/controller, can be configured to check or determine any change of microphone array geometry as shown in Figure 4b by step 409. Where there is no change then the check or monitoring can be maintained as shown by the no change arrow.
Where there is change then the first client device, and specifically the geometry change determiner/controller 207 is configured to generate and publish information indicating that spatial audio is unavailable as shown in Figure 4b by step 411.
The transport signal generator and multiplexer/encoder can then be configured to generate and publish an audio stream according to a mitigation strategy determined and signalled to the second client device earlier as shown in Figure 4b by step 413.
Then the first client device and the geometry calibrator 205 is configured to determine and publish the new changed spatial capture format as shown in Figure 4b by step 415. In other words the array geometry is recalibrated and the second client device informed of this change, as well as the first client device transport signal generator and metadata generator being configured with this geometry change.
The new spatial capture format information can then cause the generation and publishing of the transport audio signals and spatial metadata based on the new spatial capture format information as shown by the arrow back to step 407.
The initial operation, with respect to the second client device, is that of starting the voice call as shown in Figure 5 by step 501.
The second client device is configured to receive a determined spatial capture format, which can be in the form of an array geometry calibration, as shown in Figure 5 by step 503. The spatial capture format can be employed by the metadata regenerator, the transport signal regenerator and the mitigation controller.
The second client device, and specifically the mitigation controller 303 can then be configured to receive the mitigation strategy (information) for temporary outages in spatial audio capture due to changes in microphone array geometry as shown in Figure 5 by step 505.
Additionally the second client device, and specifically the transport signal regenerator 305 is configured to receive and regenerate the transport audio signals, the metadata regenerator 307 is configured to receive and regenerate the spatial metadata. These regenerated transport audio signals and metadata can be employed by the renderer/output audio generator 309 to generate a suitable output audio signal. Thus the operation of receiving/obtaining/decoding and rendering the spatial metadata is shown in Figure 5 by step 507.
The second client device, and specifically the mitigation controller can be configured to monitor for or check whether the spatial audio is unavailable as shown by Figure 5 by step 509. This can for example be implemented by monitoring or detecting an indicator or signal from the first client device (such as described in Figure 4b by step 411).
Where spatial audio is still available then the monitoring operation is looped back on itself as shown by the 'ok' arrow.
When spatial audio is not available then the second client device is configured to receive/obtain/decode and render/output an audio stream based on the mitigation strategy controlled by the mitigation controller 303. The rendering/outputting of output audio based on the mitigation strategy is shown in Figure 5 by step 511.
Then the second client device is configured to receive/obtain a new spatial capture format with recalibration array geometry information as shown in Figure 5 by step 513.
Having received the new spatial capture format then this information can be used as the basis for the new spatial audio signal processing. For example the transport signal regenerator 305 is configured to receive and regenerate the transport audio signals based on the new spatial capture format and the metadata regenerator 307 is configured to receive and regenerate the spatial metadata also based on the new spatial capture format. The generation of spatial audio output based on the new spatial capture format information (or array geometry calibration) is shown by Figure 5 by the arrow pointing back to step 507.
With respect to Figure 6 is shown a signalling flow diagram of the interactions between the first client device 600 and second client device 602 during an initialization phase of operation of the immersive call.
In this example the first client device 600 is configured to publish the capture types (also known as spatial capture formats or capture array geometries) that it supports as shown by 601. In some embodiments these are encoder input format dependent encoding types that are transmitted over to the second client device 602 during the call. In some embodiments the capture type details are not relevant to the decoding second client device 602, and the second client device is configured to ignore this information.
Examples of encoder input formats / encoding types are binaural, surround loudspeakers (5.1, 7.1, and so on), parametric surround (stereo + meta, Metadata-assisted spatial audio (MASA)), or ambisonics of some degree (e.g., FOA). In some embodiments the information also defines any audio coding technology related to the encoding, such as AAC. Although this can be used in decoding the received data stream at the second client device, this information is not directly relevant with respect to the handling of the temporary disruptions as described herein.
Second, the second client device 602 is configured to respond by 603 with the set of supported capture types (e.g., encoder input formats) and any mitigation strategies related to those capture types. In some embodiments the mitigation strategy may be specific to a capture type (e.g., reinterpret a subset of channels of an ambisonics stream as stereo) or independent of capture type (e.g., switch to a stereo stream). In some embodiments, regardless of whether the mitigation strategies are specific to a capture type, the mitigation strategy can be declared in relation to a capture type because there may be other limitations to what mitigation strategies a client can use with a certain encoding type. These limitations for example could be related to processing power or available memory.
When the first client device 600 receives the second client device's supported encoding types and mitigation strategies, the first client device is configured to select one of these (a 'best' combination to use) as shown by 605. The selection of the (best) strategy can be determined based on any suitable selection method. For example the selections can be made by employing a pre-defined set of rules, a scoring system, or some other heuristic. Then the first client device sends information identifying the selections to the second client device by 607. When the second client device 602 receives the information (which could be called initialization information) from the first client device 600 then the second client device 602 is configured to initialize the decoding (the demultiplexer/decoder, the metadata regenerator, the transport signal regenerator and the renderer) and mitigation implementation (the mitigation controller) to be ready to use as shown by 609.
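The following hypothetical Python sketch illustrates one such scoring-based selection; the capture type names, preference scores and the shape of the far-end's declaration are invented for illustration and are not part of any signalling specification.

```python
# Hypothetical preference scores; a real client could derive these from
# processing power, battery state, or codec capabilities.
CAPTURE_PREFERENCE = {"masa": 3, "foa": 2, "stereo": 1}
MITIGATION_PREFERENCE = {"reinterpret_channels": 2, "switch_to_stereo": 1}

def select_combination(remote_supported):
    """Pick the 'best' (capture type, mitigation strategy) pair from the
    far end's declared support, using a simple additive score.

    remote_supported: dict mapping capture type -> list of mitigation
    strategies the far end supports for that capture type.
    """
    best, best_score = None, float("-inf")
    for capture, strategies in remote_supported.items():
        for strategy in strategies:
            score = (CAPTURE_PREFERENCE.get(capture, 0)
                     + MITIGATION_PREFERENCE.get(strategy, 0))
            if score > best_score:
                best, best_score = (capture, strategy), score
    return best

# Example: the far end supports MASA with either strategy, stereo with none.
print(select_combination({
    "masa": ["reinterpret_channels", "switch_to_stereo"],
    "stereo": [],
}))  # -> ('masa', 'reinterpret_channels')
```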
In some embodiments the mitigation system is initialized by default at the beginning and before it is triggered or needed. This strategy enables starting the use of the mitigation strategy without any further delay if mitigation is needed during the call.
However in some embodiments the second client device can be configured (for example when it can guarantee that mitigation can be initialized quickly enough to not cause any audio artifacts) to initialize the system for mitigation when it is needed, or at any time before then.
When initialization is completed, the second client device 602 can be configured to notify the first client device that it is able to start streaming audio for the call as shown by 611.
Then in some embodiments the first client device 600 is configured to signal to the second client device 602 that the streaming is started as shown by 613. In some embodiments the first client device 600 signals a start of streaming by simply streaming the data stream comprising the spatial audio signals.
As described previously, for an immersive audio call where the communication is two way, i.e., both client devices are both capturing and sending audio as well as decoding and playing audio, this process has to be done twice; Figure 6 only shows the part where the first client device sets up its recording and encoding, while the second client device sets up its decoding and playback. Figure 7 shows a signalling flow diagram of a communication sequence between the clients when mitigation processing is initialized or started.
In this example during a call, the first client device 700 is configured to operate as the capture device or apparatus. Thus as shown by 701 the first client device is configured to implement capture of the audio signals and encode the audio and other data (into frames as defined by IVAS) using the primary encoding format.
The encoded data (frames) is sent to the second client device as shown by 703.
The second client device 702 is configured to receive and decode the data (frames) based on the initialized primary encoding format as shown by 705.
The system (the first client device 700) determines or detects a change in the capturing microphone array that prevents continuing capture with the primary encoding format, for example a change in the microphone array geometry. As shown by 707, the mitigation process is then started.
The mitigation strategy as shown in Figure 6 was already initialized at the beginning of the call.
Thus as shown by 709 the first client device 700 is configured to continue capturing audio but using the previously signalled mitigation capture method. For example the first client device is configured to select only the input channels that are included in the mitigation strategy and encode and send these with no spatial metadata (due to the lack of confidence in the accuracy of the parameter estimation). The encoding of the mitigation-selected audio signals is also configured with the solution set by the mitigation strategy. In some embodiments a separate encoder may be used or new parameters for the existing encoder are used.
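A minimal Python sketch of such a mitigation capture path is shown below, assuming a hypothetical frame-based capture function and a pre-agreed two-channel mitigation subset; the full-array spatial analysis is stubbed out for illustration.

```python
import numpy as np

def analyse_spatial(mic_signals):
    # Placeholder for the full-array spatial analysis
    # (directions, direct-to-total ratios, etc.).
    return {"directions": [], "ratios": []}

def capture_frame(mic_signals, mitigation_active, mitigation_channels=(0, 1)):
    """Produce one frame of transport audio plus optional spatial metadata.

    mic_signals: array of shape (num_mics, frame_len).
    In mitigation mode only the pre-agreed channel subset is kept and no
    spatial metadata is emitted, since the changed array geometry makes
    the parameter estimates unreliable.
    """
    if mitigation_active:
        transport = mic_signals[list(mitigation_channels), :]
        metadata = None                          # estimates cannot be trusted
    else:
        transport = mic_signals[:2, :]           # e.g. a left-right pair
        metadata = analyse_spatial(mic_signals)  # full-array analysis
    return transport, metadata
```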
Furthermore the first client device 700 is configured to send the encoded audio signals based on the mitigation strategy to the second client device 702 as shown in Figure 7 by 711. Additionally in some embodiments information is sent with the encoded audio signals indicating that the capture and encoding is using mitigation capturing and encoding.
In such embodiments there is no further communication needed because the mitigation strategy was agreed on and initialized during call initialization.
Thus the second client device 702 as shown by 713 is configured to receive the encoded (frames) audio signals and decode the audio signals using the agreed mitigation strategy.
When it is detected that the primary (spatial audio) encoding method is available again, i.e., by receiving information that microphone array calibration is ready, a similar sequence is repeated to return to primary encoding.
In other words a first client device 700 is configured to determine the new capture (array geometry) configuration or receive the updated parameters from the microphone array calibration and initialize processing algorithms with the new data.
The first client device 700 can then start encoding with the primary capture and encoding method and send the encoded (frames) to the second client device 702 as shown by 715.
The second client device 702 can thus be configured to switch (back) to the primary encoding method and decoder and generate the output audio signals based on the primary encoding method (and in some embodiments based on the capture calibration information) seamlessly, since mitigation processing is temporary by nature. The second client device is configured to anticipate a return to primary encoding and has configured the decoder with the required resources so as not to interrupt playback. As such the second client device is configured to decode the spatial or primary encoded data as shown by 717.
With respect to Figure 8 is shown an example teleconferencing system with a distributed microphone array. There is shown a main conferencing unit 801 with two microphones 803 and 805 and a satellite microphone unit 811 comprising a satellite microphone 813. The main conferencing unit 801 is designed to be stationary during a conference. The satellite microphone unit 811 has one microphone 813 and is a wireless device that can be moved during a conference call, e.g., moved closer to a participant that is speaking. This could be helpful if the meeting space is large or noisy. Together the main conference unit 801 and satellite microphone unit 811 form a distributed microphone array with three microphones.
When the satellite microphone unit 811 is moved in relation to the main teleconferencing unit 801, the microphone array geometry changes and mitigation processing is required while the system recalibrates. There are various ways to detect if the microphone has moved.
For example, in some embodiments, where the satellite microphone unit comprises an inertial measurement unit (for example a digital gyroscope or compass), it can sense and determine a change in orientation and to some degree can also integrate changes in acceleration to recognize a translation of the device. Any suitable inertial navigation techniques can be employed to track the location of the device based on changes in acceleration.
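As an illustrative Python sketch, movement could be flagged from a single IMU sample as follows; the threshold values and the absence of noise filtering and debouncing are simplifying assumptions made for clarity.

```python
import numpy as np

def movement_detected(gyro_dps, accel_ms2,
                      gyro_thresh=2.0, accel_thresh=0.3, gravity=9.81):
    """Flag satellite-unit movement from one IMU sample.

    gyro_dps: angular rates in degrees/second (3-vector).
    accel_ms2: accelerometer reading in m/s^2 (3-vector, includes gravity).
    A real detector would low-pass filter the readings and debounce the
    decision over several samples before triggering mitigation.
    """
    rotating = np.linalg.norm(gyro_dps) > gyro_thresh
    # Deviation of the acceleration magnitude from 1 g suggests translation.
    translating = abs(np.linalg.norm(accel_ms2) - gravity) > accel_thresh
    return rotating or translating
```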
In some embodiments inertial measurement can be replaced with any suitable beacon orientation/location determination technologies or determination technologies that use acoustic measurement or radio frequency techniques such as known from Nokia HAIP or Apple AirTag.
In some embodiments where the movement of the microphone is detected in the microphone device (for example as shown in Figure 8 where the satellite microphone unit is equipped with an inertial measurement unit (IMU)), the microphone device can be configured to communicate via a suitable wireless (radio) link such as Bluetooth. In this example the satellite microphone unit is configured to send a message to the main conference unit about the movement of the satellite microphone unit.
In some other embodiments, where the movement is not determined at the microphone device (for example where the satellite microphone unit does not implement orientation/motion determination but is instead equipped with a location tag such as an Apple AirTag or RFID technology), the main conference unit is configured to track the movement and no messaging is needed between the devices.
With respect to Figure 9 is shown a further example capture device with a built-in microphone array that has changing geometry. In this example, a laptop 901 is equipped with two separate microphone groups, group A 907 and group B 917, that can together capture spatial audio. Group A 907 microphones can for example comprise a front microphone 905 directed inwards and a back microphone 903 directed outwards. Group B 917 microphones can for example comprise a left microphone 913 and a right microphone 915. The positions of these microphones are at fixed locations on the case of the laptop, but when the lid of the laptop is moved, the angle of the hinge of the laptop changes and the distance between the microphones changes.
In some embodiments it is therefore possible to detect a microphone geometry change with a mechanical sensor, for example in the hinge of the laptop shown in Figure 9. In such embodiments all of the positions of the microphones can be calculated or predetermined before the call, so there is no need for a special geometry calibration process, but there would still be mitigation applied during the time when the lid moves.
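A minimal Python sketch of such a predetermined geometry calculation is given below, assuming the hinge lies along the x-axis through the origin; the coordinate convention is an assumption chosen for illustration.

```python
import numpy as np

def lid_mic_positions(positions_closed, hinge_angle_rad):
    """Rotate lid-mounted microphone positions about the hinge.

    positions_closed: (N, 3) microphone positions in the lid-closed pose,
    with the hinge along the x-axis through the origin. Returns positions
    for the given lid opening angle, so the full array geometry can be
    looked up per hinge-sensor reading without acoustic recalibration.
    """
    c, s = np.cos(hinge_angle_rad), np.sin(hinge_angle_rad)
    rot = np.array([[1, 0, 0],
                    [0, c, -s],
                    [0, s,  c]])
    return positions_closed @ rot.T
```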
As described above the mitigation strategy control employed is configured to control the capture and encoding of audio signals during the 'mitigation' period where the reconfiguration of the capture apparatus parameters is being implemented (for example the geometry of the microphone array following a change is being determined). The mitigation period processing can be any 'other' processing that is implemented using a subset of the audio signals from microphones in a microphone array during a temporary time window when the full array cannot be used.
In some embodiments the mitigation strategy is related to the format of the data that is encoded. In other words, if a capture and encoding algorithm is capable of using a subset of microphones in an array to produce the same format of output as with the full array, then no mitigation processing as defined herein is needed. In such embodiments not only is a change in the microphone geometry determined but also the effect of the microphone geometry change on the current capture and encoding method is determined. For example in some embodiments if a microphone within the microphone array is moved but it is determined that the microphone is currently inactive or not contributing to the audio signal or the metadata determination, as it is not part of the current 'active' subset of microphones in use, then there is no mitigation period. In such embodiments a new geometry configuration determination can be implemented as a background process based on the change, or a mitigation period initiated when the moved microphone is determined to be an 'active' microphone.
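A minimal Python sketch of this decision, assuming microphones are identified by index, might look as follows; the function name and the representation of the active subset are hypothetical.

```python
def mitigation_needed(moved_mics, active_mics):
    """Mitigation is only required when a moved microphone is part of the
    subset currently driving capture and encoding; otherwise the new
    geometry can be determined as a background process."""
    return bool(set(moved_mics) & set(active_mics))

# Example: mic 3 moved but only mics 0-2 feed the spatial analysis,
# so no mitigation period is triggered.
assert mitigation_needed([3], [0, 1, 2]) is False
```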
In some embodiments when the geometry change determiner/controller 207 determines a format change, i.e., from spatial audio to stereo, the mitigation strategies can be applied.
In some embodiments the geometry change determiner/controller 207 can be configured to control the transport signal generator to generate the same transport format as the 'primary' or spatial encoding strategy but reinterpret the transport audio signal channels. An example of this processing is transmitting a stereo signal in two of the channels of an Ambisonics output. This can be understood as, e.g., maintaining a first encoder input format but changing its content (e.g., from four First-order Ambisonics component channels to two stereo channels and two empty channels) and providing a related indicator or signal to the encoder. In other words, the transport format stays the same but the data is interpreted differently.
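The following Python sketch illustrates this kind of reinterpretation under the assumption of a four-channel FOA transport layout; the channel assignment and the flag name are illustrative assumptions, not a codec-defined mapping.

```python
import numpy as np

def pack_mitigation_transport(stereo, num_foa_channels=4):
    """Keep the four-channel FOA transport layout but carry a stereo pair.

    stereo: array of shape (2, frame_len). The stereo channels occupy the
    first two slots and the remaining slots are zeroed; a flag tells the
    decoder to reinterpret the content rather than render it as FOA.
    """
    frame_len = stereo.shape[1]
    transport = np.zeros((num_foa_channels, frame_len), dtype=stereo.dtype)
    transport[0:2, :] = stereo
    reinterpret_flag = True  # signalled to the decoder alongside the frame
    return transport, reinterpret_flag
```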
This can be beneficial in avoiding artifacts in the output audio related to the switching. The codec can have the mitigation built-in and it becomes easier to guarantee the quality of the output.
In contrast, in the case where a codec would be deconstructed and a new codec constructed at the start of mitigation processing, the coordination of the two separate codecs is more difficult.
In some embodiments the geometry change determiner/controller 207 can be configured to implement a mitigation strategy which changes the transport stream to a completely different type. As the strategy has been signalled to the receiver before the capture and processing starts, the receiving client device's responsibility is to have prepared for the (potential) change and support the transition with acceptable quality.
In some embodiments the client device (and the geometry change determiner/controller 207 or mitigation controller 303) can furthermore be configured to apply further mitigation processing to make a transition between 'primary' or spatial mode and 'mitigation' mode smoother.
In such embodiments processing is employed during the transition frames from one format to another. For example when switching from ambisonics rendering to stereo playback, it is possible to use direction cues from the last frames of the ambisonics signal and apply them to the rendering of the stereo signal and then do a smooth fade to plain stereo where there are no spatial elements. In such embodiments the transition is less pronounced than normal volume crossfading. Furthermore in some embodiments there is no possibility to implement a long cross-fade since the need for switching to a mitigation strategy comes unexpectedly.
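A minimal Python sketch of such a transition is given below, assuming frame-based stereo rendering and a simple constant-power pan derived from the last known azimuth; the fade length and the panning law are illustrative choices, not a specified rendering method.

```python
import numpy as np

def transition_to_stereo(stereo_frames, last_azimuth_rad, fade_frames=10):
    """Fade from a spatially panned rendering to plain stereo.

    The last direction cue from the ambisonics stream steers a simple
    amplitude pan that is cross-faded, frame by frame, into the untouched
    stereo signal. stereo_frames: (num_frames, 2, frame_len).
    """
    # Constant-power pan gains from the final known direction.
    pan = 0.5 * (1.0 + np.sin(last_azimuth_rad))  # 0 = right, 1 = left
    g_left, g_right = np.sqrt(pan), np.sqrt(1.0 - pan)

    out = stereo_frames.copy()
    for i in range(min(fade_frames, len(out))):
        w = i / fade_frames               # 0 -> fully panned, 1 -> plain
        mono = stereo_frames[i].mean(axis=0)
        panned = np.stack([g_left * mono, g_right * mono])
        out[i] = (1.0 - w) * panned + w * stereo_frames[i]
    return out
```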
In some embodiments it is possible to delay the switch between 'primary' or spatial mode and 'mitigation' mode. In such embodiments when a client detects a change in the microphone array that necessitates mitigation processing, the client device is configured to send an indication or signal warning that mitigation processing will start after a period of time (or for example X number of frames where X is some small number). Thus while the output audio may be incorrect for the old audio type during this period, the receiving client device has time to prepare a cross fading between the 'primary' mode processing and 'mitigation' mode processing. Furthermore in some embodiments it is also possible to send both primary and mitigation processing data for a short overlapping period of time.
In such embodiments as defined herein an improvement in immersive voice call quality is aimed for in the case where there is a microphone array that can change geometry during a call. Possible usage scenarios are: telecommunications systems with separate microphones that can be moved (e.g., on a conference room table); foldable mobile devices; mobile devices with capture using device microphones and headset microphones; and a distributed telecommunication system formed from several mobile devices, e.g., several participants in a conference room where everyone places their device on the table and they form one distributed capture system.
With respect to Figure 10 is shown an example electronic device which may be used as any of the apparatus parts of the system as described above. The device may be any suitable electronics device or apparatus. For example, in some embodiments the device 1400 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc. The device may for example be configured to implement the encoder/analyser part and/or the decoder part as shown in Figure 1 or any functional block as described above.
In some embodiments the device 1400 comprises at least one processor or central processing unit 1407. The processor 1407 can be configured to execute various program codes such as the methods described herein.
In some embodiments the device 1400 comprises at least one memory 1411. In some embodiments the at least one processor 1407 is coupled to the memory 1411. The memory 1411 can be any suitable storage means. In some embodiments the memory 1411 comprises a program code section for storing program codes implementable upon the processor 1407. Furthermore, in some embodiments the memory 1411 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1407 whenever needed via the memory-processor coupling.
In some embodiments the device 1400 comprises a user interface 1405. The user interface 1405 can be coupled in some embodiments to the processor 1407. In some embodiments the processor 1407 can control the operation of the user interface 1405 and receive inputs from the user interface 1405. In some embodiments the user interface 1405 can enable a user to input commands to the device 1400, for example via a keypad. In some embodiments the user interface 1405 can enable the user to obtain information from the device 1400. For example the user interface 1405 may comprise a display configured to display information from the device 1400 to the user. The user interface 1405 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1400 and further displaying information to the user of the device 1400. In some embodiments the user interface 1405 may be the user interface for communicating.
In some embodiments the device 1400 comprises an input/output port 1409. The input/output port 1409 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1407 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
The transceiver can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver can use a suitable radio access architecture based on long term evolution advanced (LTE Advanced, LTE-A) or new radio (NR) (or can be referred to as 5G), universal mobile telecommunications system (UMTS) radio access network (UTRAN or E-UTRAN), long term evolution (LTE, the same as E-UTRA), 2G networks (legacy network technology), wireless local area network (WLAN or Wi-Fi), worldwide interoperability for microwave access (WiMAX), Bluetooth®, personal communications services (PCS), ZigBee®, wideband code division multiple access (WCDMA), systems using ultra-wideband (UWB) technology, sensor networks, mobile ad-hoc networks (MANETs), cellular internet of things (IoT) RAN and Internet Protocol multimedia subsystems (IMS), any other suitable option and/or any combination thereof.
The transceiver input/output port 1409 may be configured to receive the signals.
In some embodiments the device 1400 may be employed as at least part of the synthesis device. The input/output port 1409 may be coupled to headphones (which may be headtracked or non-tracked headphones) or similar, and to loudspeakers.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof.
For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware.
Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general-purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.