US20210074315A1 - Augmented multi-tier classifier for multi-modal voice activity detection - Google Patents

Augmented multi-tier classifier for multi-modal voice activity detection

Info

Publication number
US20210074315A1
Authority
US
United States
Prior art keywords
classifier
tier
features
classifiers
inputs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US17/101,048
Inventor
Dimitrios Dimitriadis
Eric Zavesky
Matthew Burlick
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AT&T Intellectual Property I LP
Original Assignee
AT&T Intellectual Property I LP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AT&T Intellectual Property I LP
Priority to US17/101,048
Assigned to AT&T INTELLECTUAL PROPERTY I, L.P. Assignors: DIMITRIADIS, DIMITRIOS; BURLICK, MATTHEW; ZAVESKY, ERIC
Publication of US20210074315A1
Status: Abandoned


Abstract

Voice activity in a media signal is detected in an augmented, multi-tier classifier architecture. For instance, a first voice activity indicator, detected in a first modality for a human subject, is received from a first classifier. The system then receives, from a second classifier, a second voice activity indicator detected in a second modality for the human subject, wherein the first voice activity indicator and the second voice activity indicator are based on the human subject at the same time, and wherein the first modality and the second modality are different. The system then concatenates, via a third classifier, the first voice activity indicator and the second voice activity indicator with original features of the human subject to yield a classifier output, and determines voice activity based on the classifier output.
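The data flow described in the abstract can be sketched in a few lines of code. This is only a minimal illustration under stated assumptions, not the patented implementation: the per-modality sigmoid scoring rules, the feature layouts, and the fusion weights below are hypothetical stand-ins for trained classifiers.

```python
import numpy as np

def audio_tier1(audio_feats):
    # Hypothetical first-tier audio classifier: sigmoid score on the
    # first audio feature (e.g., frame energy).
    return 1.0 / (1.0 + np.exp(-10.0 * (audio_feats[:, 0] - 0.5)))

def visual_tier1(visual_feats):
    # Hypothetical first-tier visual classifier: sigmoid score on the
    # first visual feature (e.g., mouth-region motion).
    return 1.0 / (1.0 + np.exp(-10.0 * (visual_feats[:, 0] - 0.5)))

def augmented_features(audio_feats, visual_feats):
    # Third-tier input: the ORIGINAL features concatenated with the
    # first-tier voice activity indicators, as the abstract describes.
    a = audio_tier1(audio_feats)
    v = visual_tier1(visual_feats)
    return np.hstack([audio_feats, visual_feats, a[:, None], v[:, None]])

def detect_voice_activity(audio_feats, visual_feats, w, b=-1.0):
    # Placeholder linear fusion classifier over the augmented vector;
    # a real system would learn this third-tier model from data.
    x = augmented_features(audio_feats, visual_feats)
    return (x @ w + b) > 0
```

In practice each tier would be a trained model (e.g., an SVM or a neural network) and the fusion weights `w`, `b` would be learned; this sketch only shows how the first-tier decisions are appended to the original per-modality features before the final decision.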

Description

Claims (20)

What is claimed is:
1. A method, comprising:
generating, by a system comprising a processor, via a group of first tier classifiers using respective feature vectors representing features from multiple modalities of data captured from inputs comprising audio inputs and visual inputs, respective first decision outputs of first tier classifiers of the group of first tier classifiers related to classifying the inputs as human voice activity; and
determining, by the system, via a second tier classifier using the respective feature vectors and the respective first decision outputs of the first tier classifiers, a second decision output related to classifying the inputs as the human voice activity.
2. The method of claim 1, further comprising determining, by the system, whether the inputs comprise the human voice activity based on the second decision output.
3. The method of claim 1, further comprising combining, by the system, via the second tier classifier, the respective feature vectors and the respective first decision outputs.
4. The method of claim 1, further comprising concatenating, by the system, via the second tier classifier, the respective feature vectors and the respective first decision outputs.
5. The method of claim 1, wherein the features comprise acoustic features comprising at least one of Mel-frequency cepstral coefficients, a first derivative of Mel-frequency cepstral coefficients, a second derivative of Mel-frequency cepstral coefficients, or an acoustic energy.
6. The method of claim 1, wherein the features comprise visual features comprising at least one of a parametric feature, an appearance-based feature, a dimension of a mouth region, image region intensity, a discrete cosine transformation, an image saturation, an image brightness, an image texture, a video motion, a head movement, a movement of a specific item relative to other items, a movement of the specific item relative to a background, a micro-expression on a face, an extremity of expression, a detected lighting change, a head angle relative to a camera, or a local binary pattern.
7. The method of claim 1, wherein the first tier classifiers are respectively trained for different types of the features.
8. A system, comprising:
a processor; and
a memory that stores executable instructions that, when executed by the processor, facilitate performance of operations, comprising:
generating, via first tier classifiers using respective feature vectors representing features from multiple modalities of data captured from inputs comprising audio inputs and visual inputs, respective first classifier outputs of the first tier classifiers related to classifying the inputs as human voice activity; and
determining, via a second tier classifier using the respective feature vectors and the respective first classifier outputs of the first tier classifiers, a second classifier output related to classifying the inputs as the human voice activity.
9. The system of claim 8, wherein the operations further comprise determining whether the inputs comprise the human voice activity based on the second classifier output.
10. The system of claim 8, wherein the respective first classifier outputs comprise output vectors.
11. The system of claim 10, wherein the operations further comprise concatenating, via the second tier classifier, the respective feature vectors and the output vectors.
12. The system of claim 8, wherein the features comprise acoustic features comprising at least one of Mel-frequency cepstral coefficients, a first derivative of Mel-frequency cepstral coefficients, a second derivative of Mel-frequency cepstral coefficients, or acoustic energy.
13. The system of claim 8, wherein the features comprise visual features comprising at least one of a parametric feature, an appearance-based feature, dimensions of a mouth region, image region intensity, discrete cosine transformations, image saturation, image brightness, image texture, video motion, head movement, movement of a specific item relative to other items, movement of the specific item relative to a background, micro-expressions on a person's face, extremity of expression, detected lighting changes, head angle relative to a camera, or local binary patterns.
14. The system of claim 8, wherein the first tier classifiers are respectively trained for different types of the features.
15. A non-transitory machine-readable medium, comprising executable instructions that, when executed by a processor, facilitate performance of operations, comprising:
generating, via first tier classifiers using respective feature vectors representing features from multiple modalities of data captured from inputs comprising audio inputs and visual inputs, respective first classifier outputs of the first tier classifiers related to classifying the inputs as human voice activity; and
determining, via a second tier classifier using the respective feature vectors and the respective first classifier outputs of the first tier classifiers, a second classifier output related to classifying the inputs as the human voice activity.
16. The non-transitory machine-readable medium of claim 15, wherein the operations further comprise determining whether the inputs comprise the human voice activity based on the second classifier output.
17. The non-transitory machine-readable medium of claim 15, wherein the operations further comprise combining, via the second tier classifier, the respective feature vectors and the respective first classifier outputs.
18. The non-transitory machine-readable medium of claim 15, wherein the features comprise acoustic features comprising at least one of Mel-frequency cepstral coefficients, a first derivative of Mel-frequency cepstral coefficients, a second derivative of Mel-frequency cepstral coefficients, or acoustic energy.
19. The non-transitory machine-readable medium of claim 15, wherein the features comprise visual features comprising at least one of a parametric feature, an appearance-based feature, dimensions of a mouth region, image region intensity, discrete cosine transformations, image saturation, image brightness, image texture, video motion, head movement, movement of a specific item relative to other items, movement of the specific item relative to a background, micro-expressions on a person's face, extremity of expression, detected lighting changes, head angle relative to a camera, or local binary patterns.
20. The non-transitory machine-readable medium of claim 15, wherein the first tier classifiers are respectively trained for different types of the features.
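Claims 5, 12, and 18 recite acoustic features such as acoustic energy and the first and second derivatives of cepstral features. As a hedged sketch of how the energy-and-derivatives portion of such a per-frame feature vector might be computed (not the claimed implementation; the frame length, hop size, and derivative scheme are illustrative assumptions, and the MFCC computation itself is omitted):

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    # Split a 1-D signal into overlapping frames.
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def log_energy(frames, eps=1e-10):
    # Per-frame log acoustic energy (eps guards against log(0)).
    return np.log(np.sum(frames ** 2, axis=1) + eps)

def delta(feat):
    # First-order temporal derivative across frames (central differences).
    return np.gradient(feat, axis=0)

def energy_features(x, frame_len=400, hop=160):
    # Per-frame vector: [log energy, first derivative, second derivative],
    # mirroring the "feature + delta + delta-delta" pattern in the claims.
    e = log_energy(frame_signal(x, frame_len, hop))
    return np.stack([e, delta(e), delta(delta(e))], axis=1)
```

A real front end would compute the same delta and delta-delta stacking over MFCC vectors (e.g., 13 coefficients per frame) rather than over a single energy value.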
US17/101,048 | Priority 2013-08-23 | Filed 2020-11-23 | Augmented multi-tier classifier for multi-modal voice activity detection | Abandoned | US20210074315A1 (en)

Priority Applications (1)

Application Number | Publication | Priority Date | Filing Date | Title
US17/101,048 | US20210074315A1 (en) | 2013-08-23 | 2020-11-23 | Augmented multi-tier classifier for multi-modal voice activity detection

Applications Claiming Priority (3)

Application Number | Publication | Priority Date | Filing Date | Title
US13/974,453 | US9892745B2 (en) | 2013-08-23 | 2013-08-23 | Augmented multi-tier classifier for multi-modal voice activity detection
US15/894,245 | US20180182415A1 (en) | 2013-08-23 | 2018-02-12 | Augmented multi-tier classifier for multi-modal voice activity detection
US17/101,048 | US20210074315A1 (en) | 2013-08-23 | 2020-11-23 | Augmented multi-tier classifier for multi-modal voice activity detection

Related Parent Applications (1)

Application Number | Relation | Publication | Priority Date | Filing Date | Title
US15/894,245 | Continuation | US20180182415A1 (en) | 2013-08-23 | 2018-02-12 | Augmented multi-tier classifier for multi-modal voice activity detection

Publications (1)

Publication Number | Publication Date
US20210074315A1 (en) | 2021-03-11

Family

ID=52481151

Family Applications (3)

Application Number | Status | Publication | Priority Date | Filing Date | Title
US13/974,453 | Active (2035-07-03) | US9892745B2 (en) | 2013-08-23 | 2013-08-23 | Augmented multi-tier classifier for multi-modal voice activity detection
US15/894,245 | Abandoned | US20180182415A1 (en) | 2013-08-23 | 2018-02-12 | Augmented multi-tier classifier for multi-modal voice activity detection
US17/101,048 | Abandoned | US20210074315A1 (en) | 2013-08-23 | 2020-11-23 | Augmented multi-tier classifier for multi-modal voice activity detection

Family Applications Before (2)

Application Number | Status | Publication | Priority Date | Filing Date | Title
US13/974,453 | Active (2035-07-03) | US9892745B2 (en) | 2013-08-23 | 2013-08-23 | Augmented multi-tier classifier for multi-modal voice activity detection
US15/894,245 | Abandoned | US20180182415A1 (en) | 2013-08-23 | 2018-02-12 | Augmented multi-tier classifier for multi-modal voice activity detection

Country Status (1)

Country | Link
US (3) | US9892745B2 (en)

Families Citing this family (52)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US9240184B1 (en)* | 2012-11-15 | 2016-01-19 | Google Inc. | Frame-level combination of deep neural network and gaussian mixture models
US10134423B2 (en)* | 2013-12-06 | 2018-11-20 | Tata Consultancy Services Limited | System and method to provide classification of noise data of human crowd
US9530400B2 (en)* | 2014-09-29 | 2016-12-27 | Nuance Communications, Inc. | System and method for compressed domain language identification
US10403269B2 (en) | 2015-03-27 | 2019-09-03 | Google Llc | Processing audio waveforms
US10575103B2 (en) | 2015-04-10 | 2020-02-25 | Starkey Laboratories, Inc. | Neural network-driven frequency translation
US10339921B2 (en) | 2015-09-24 | 2019-07-02 | Google Llc | Multichannel raw-waveform neural networks
US10229700B2 (en)* | 2015-09-24 | 2019-03-12 | Google Llc | Voice activity detection
CN105654949B (en)* | 2016-01-07 | 2019-05-07 | 北京云知声信息技术有限公司 | A kind of voice awakening method and device
US11166027B2 (en) | 2016-06-10 | 2021-11-02 | Apple Inc. | Content adaptation for streaming
US10771631B2 (en) | 2016-08-03 | 2020-09-08 | Dolby Laboratories Licensing Corporation | State-based endpoint conference interaction
US10224058B2 (en) | 2016-09-07 | 2019-03-05 | Google Llc | Enhanced multi-channel acoustic models
US10529320B2 (en) | 2016-12-21 | 2020-01-07 | Google Llc | Complex evolution recurrent neural networks
US10140980B2 (en) | 2016-12-21 | 2018-11-27 | Google LCC | Complex linear projection for acoustic modeling
EP3600130B1 (en)* | 2017-03-20 | 2023-07-12 | Align Technology, Inc. | Generating a virtual depiction of an orthodontic treatment of a patient
EP3602544A4 (en) | 2017-03-23 | 2020-02-05 | Joyson Safety Systems Acquisition LLC | System and method of correlating mouth images to input commands
US11416714B2 (en)* | 2017-03-24 | 2022-08-16 | Revealit Corporation | Method, system, and apparatus for identifying and revealing selected objects from video
CN108346428B (en)* | 2017-09-13 | 2020-10-02 | 腾讯科技(深圳)有限公司 | Voice activity detection and model building method, device, equipment and storage medium thereof
US10403303B1 (en)* | 2017-11-02 | 2019-09-03 | Gopro, Inc. | Systems and methods for identifying speech based on cepstral coefficients and support vector machines
CN108229336B (en)* | 2017-12-13 | 2021-06-04 | 北京市商汤科技开发有限公司 | Video recognition and training method and apparatus, electronic device, program, and medium
US11481571B2 (en) | 2018-01-12 | 2022-10-25 | Microsoft Technology Licensing, Llc | Automated localized machine learning training
US11429807B2 (en)* | 2018-01-12 | 2022-08-30 | Microsoft Technology Licensing, Llc | Automated collection of machine learning training data
US12217760B2 (en)* | 2018-04-17 | 2025-02-04 | GONGIO Ltd. | Metadata-based diarization of teleconferences
CN119105724A (en)* | 2018-05-04 | 2024-12-10 | 谷歌有限责任公司 | Hotword-free deployment of automated assistant function
US11553988B2 (en) | 2018-06-29 | 2023-01-17 | Align Technology, Inc. | Photo of a patient with new simulated smile in an orthodontic treatment review software
DE102018212902B4 (en)* | 2018-08-02 | 2024-12-19 | Bayerische Motoren Werke Aktiengesellschaft | Method for determining a digital assistant for performing a vehicle function from a plurality of digital assistants in a vehicle, computer-readable medium, system, and vehicle
WO2020043007A1 (en)* | 2018-08-27 | 2020-03-05 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Method, system, and computer-readable medium for purifying voice using depth information
US10846522B2 (en)* | 2018-10-16 | 2020-11-24 | Google Llc | Speaking classification using audio-visual data
US10861457B2 (en)* | 2018-10-26 | 2020-12-08 | Ford Global Technologies, Llc | Vehicle digital assistant authentication
US20210027772A1 (en) | 2019-07-24 | 2021-01-28 | Gong I.O Ltd. | Unsupervised automated extraction of conversation structure from recorded conversations
CN110196914B (en)* | 2019-07-29 | 2019-12-27 | 上海肇观电子科技有限公司 | Method and device for inputting face information into database
CN110648656A (en)* | 2019-08-28 | 2020-01-03 | 北京达佳互联信息技术有限公司 | Voice endpoint detection method and device, electronic equipment and storage medium
CN112487182B (en)* | 2019-09-12 | 2024-04-12 | 华为技术有限公司 | Text processing model training method, text processing method and device
US11580869B2 (en)* | 2019-09-23 | 2023-02-14 | Revealit Corporation | Computer-implemented interfaces for identifying and revealing selected objects from video
US11915123B2 (en)* | 2019-11-14 | 2024-02-27 | International Business Machines Corporation | Fusing multimodal data using recurrent neural networks
CN113139561B (en)* | 2020-01-17 | 2024-05-03 | Tcl科技集团股份有限公司 | Garbage classification method, garbage classification device, terminal equipment and storage medium
KR20210112726A (en)* | 2020-03-06 | 2021-09-15 | 엘지전자 주식회사 | Providing interactive assistant for each seat in the vehicle
JP7511374B2 (en)* | 2020-03-31 | 2024-07-05 | グローリー株式会社 | Speech activity detection device, voice recognition device, speech activity detection system, speech activity detection method, and speech activity detection program
CN111814609B (en)* | 2020-06-24 | 2023-09-29 | 厦门大学 | Micro-expression recognition method based on deep forest and convolutional neural network
CN111933175B (en)* | 2020-08-06 | 2023-10-24 | 北京中电慧声科技有限公司 | Active voice detection method and system based on noise scene recognition
CN112017633B (en)* | 2020-09-10 | 2024-04-26 | 北京地平线信息技术有限公司 | Speech recognition method, device, storage medium and electronic equipment
KR20220059629A (en)* | 2020-11-03 | 2022-05-10 | 현대자동차주식회사 | Vehicle and method for controlling thereof
WO2022125351A2 (en)* | 2020-12-09 | 2022-06-16 | Cerence Operating Company | Automotive infotainment system with spatially-cognizant applications that interact with a speech interface
US12175970B2 (en)* | 2020-12-24 | 2024-12-24 | Cerence Operating Company | Speech dialog system for multiple passengers in a car
CN112863488A (en)* | 2021-01-15 | 2021-05-28 | 广东优碧胜科技有限公司 | Voice signal processing method and device and electronic equipment
US11996114B2 (en) | 2021-05-15 | 2024-05-28 | Apple Inc. | End-to-end time-domain multitask learning for ML-based speech enhancement
CN113255556A (en)* | 2021-06-07 | 2021-08-13 | 斑马网络技术有限公司 | Multi-mode voice endpoint detection method and device, vehicle-mounted terminal and storage medium
CN113434731B (en)* | 2021-06-30 | 2024-01-19 | 平安科技(深圳)有限公司 | Music video genre classification method, device, computer equipment and storage medium
CN114420155A (en)* | 2021-11-29 | 2022-04-29 | 江苏科技大学 | Underwater sound target identification method based on multi-mode fusion
CN114049887B (en)* | 2021-12-06 | 2025-03-11 | 宁波蛙声科技有限公司 | Real-time voice activity detection method and system for audio and video conferencing
CN118056207A (en)* | 2021-12-07 | 2024-05-17 | 渊慧科技有限公司 | Large-scale retrieval for sequence generation
US11646036B1 (en)* | 2022-01-31 | 2023-05-09 | Humancore Llc | Team member identification based on psychographic categories
JP7693593B2 (en)* | 2022-03-15 | 2025-06-17 | 株式会社東芝 | Voice activity detection device, learning device, and voice activity detection program

Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20030018475A1 (en)* | 1999-08-06 | 2003-01-23 | International Business Machines Corporation | Method and apparatus for audio-visual speech detection and recognition
US7472063B2 (en)* | 2002-12-19 | 2008-12-30 | Intel Corporation | Audio-visual feature fusion and support vector machine useful for continuous speech recognition
US20100277579A1 (en)* | 2009-04-30 | 2010-11-04 | Samsung Electronics Co., Ltd. | Apparatus and method for detecting voice based on motion information
US20120278074A1 (en)* | 2008-11-10 | 2012-11-01 | Google Inc. | Multisensory speech detection

Family Cites Families (21)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US6570610B1 (en)* | 1997-09-15 | 2003-05-27 | Alan Kipust | Security system with proximity sensing for an electronic device
US20020120643A1 (en)* | 2001-02-28 | 2002-08-29 | Ibm Corporation | Audio-visual data collection system
US7321854B2 (en)* | 2002-09-19 | 2008-01-22 | The Penn State Research Foundation | Prosody based audio/visual co-analysis for co-verbal gesture recognition
EP1443498B1 (en)* | 2003-01-24 | 2008-03-19 | Sony Ericsson Mobile Communications AB | Noise reduction and audio-visual speech activity detection
US7251603B2 (en)* | 2003-06-23 | 2007-07-31 | International Business Machines Corporation | Audio-only backoff in audio-visual speech recognition system
US7269560B2 (en)* | 2003-06-27 | 2007-09-11 | Microsoft Corporation | Speech detection and enhancement using audio/video fusion
US20050197843A1 (en) | 2004-03-07 | 2005-09-08 | International Business Machines Corporation | Multimodal aggregating unit
US20050216254A1 (en) | 2004-03-24 | 2005-09-29 | Gupta Anurag K | System-resource-based multi-modal input fusion
US7430324B2 (en) | 2004-05-25 | 2008-09-30 | Motorola, Inc. | Method and apparatus for classifying and ranking interpretations for multimodal input fusion
US7742641B2 (en) | 2004-12-06 | 2010-06-22 | Honda Motor Co., Ltd. | Confidence weighted classifier combination for multi-modal identification
US20060235684A1 (en)* | 2005-04-14 | 2006-10-19 | Sbc Knowledge Ventures, Lp | Wireless device to access network-based voice-activated services using distributed speech recognition
US20070061335A1 (en)* | 2005-09-14 | 2007-03-15 | Jorey Ramer | Multimodal search query processing
US7860718B2 (en)* | 2005-12-08 | 2010-12-28 | Electronics And Telecommunications Research Institute | Apparatus and method for speech segment detection and system for speech recognition
US8589161B2 (en) | 2008-05-27 | 2013-11-19 | Voicebox Technologies, Inc. | System and method for an integrated, multi-modal, multi-device natural language voice services environment
US8326637B2 (en) | 2009-02-20 | 2012-12-04 | Voicebox Technologies, Inc. | System and method for processing multi-modal device interactions in a natural language voice services environment
US8650029B2 (en)* | 2011-02-25 | 2014-02-11 | Microsoft Corporation | Leveraging speech recognizer feedback for voice activity detection
EP2523149B1 (en) | 2011-05-11 | 2023-01-11 | Tata Consultancy Services Ltd. | A method and system for association and decision fusion of multimodal inputs
US8589167B2 (en)* | 2011-05-11 | 2013-11-19 | Nuance Communications, Inc. | Speaker liveness detection
US9214157B2 (en)* | 2011-12-06 | 2015-12-15 | At&T Intellectual Property I, L.P. | System and method for machine-mediated human-human conversation
KR20140031790A (en)* | 2012-09-05 | 2014-03-13 | 삼성전자주식회사 | Robust voice activity detection in adverse environments
US20140149177A1 (en)* | 2012-11-23 | 2014-05-29 | Ari M. Frank | Responding to uncertainty of a user regarding an experience by presenting a prior experience


Also Published As

Publication number | Publication date
US20180182415A1 (en) | 2018-06-28
US20150058004A1 (en) | 2015-02-26
US9892745B2 (en) | 2018-02-13

Similar Documents

Publication | Title
US20210074315A1 | Augmented multi-tier classifier for multi-modal voice activity detection
US11978440B2 | Wakeword detection
US12243531B2 | Determining input for speech processing engine
US9412361B1 | Configuring system operation using image data
US11545174B2 | Emotion detection using speaker baseline
JP7431291B2 | System and method for domain adaptation in neural networks using domain classifiers
JP7108144B2 | Systems and methods for domain adaptation in neural networks using cross-domain batch normalization
US11854538B1 | Sentiment detection in audio data
WO2020247231A1 | Multiple classifications of audio data
Ariav et al. | A deep architecture for audio-visual voice activity detection in the presence of transients
US11205428B1 | Deleting user data using keys
US11514900B1 | Wakeword detection
Joshi et al. | A Study of speech emotion recognition methods
JP2025510553A | Multitask Learning for Personalized Keyword Spotting
Abreha | An environmental audio-based context recognition system using smartphones
JP7178331B2 | Information processing device, information processing method and program
US11531736B1 | User authentication as a service
JP2011191542A | Voice classification device, voice classification method, and program for voice classification
Yoshida et al. | Audio-visual voice activity detection based on an utterance state transition model
US12200449B1 | User orientation estimation
Wöllmer et al. | Temporal and Situational Context Modeling for Improved Dominance Recognition in Meetings
Bregler et al. | Improving acoustic speaker verification with visual body-language features
Vildjiounaite et al. | Requirements and software framework for adaptive multimodal affect recognition
Inoue et al. | Enhanced speaker diarization with detection of backchannels using eye-gaze information in poster conversations
WO2023183664A1 | Multi-task learning for personalized keyword spotting

Legal Events

Date | Code | Title | Description
AS | Assignment

Owner name: AT&T INTELLECTUAL PROPERTY I, L.P., GEORGIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: DIMITRIADIS, DIMITRIOS; ZAVESKY, ERIC; BURLICK, MATTHEW; SIGNING DATES FROM 20130822 TO 20130823; REEL/FRAME: 054441/0278

STPP | Information on status: patent application and granting procedure in general

Free format text: APPLICATION DISPATCHED FROM PREEXAM, NOT YET DOCKETED

STPP | Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP | Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STCB | Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

