US20180182415A1 - Augmented multi-tier classifier for multi-modal voice activity detection - Google Patents

Augmented multi-tier classifier for multi-modal voice activity detection

Info

Publication number
US20180182415A1
US20180182415A1 (application US15/894,245)
Authority
US
United States
Prior art keywords
classifier
output
input
human subject
time period
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US15/894,245
Inventor
Dimitrios Dimitriadis
Eric Zavesky
Matthew Burlick
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AT&T Intellectual Property I LP
Original Assignee
AT&T Intellectual Property I LP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AT&T Intellectual Property I LP
Priority to US15/894,245 (US20180182415A1)
Publication of US20180182415A1
Assigned to AT&T INTELLECTUAL PROPERTY I, L.P. Assignment of assignors interest (see document for details). Assignors: DIMITRIADIS, DIMITRIOS; BURLICK, MATTHEW; ZAVESKY, ERIC
Priority to US17/101,048 (US20210074315A1)
Legal status: Abandoned

Abstract

Disclosed herein are systems, methods, and computer-readable storage media for detecting voice activity in a media signal in an augmented, multi-tier classifier architecture. A system configured to practice the method can receive, from a first classifier, a first voice activity indicator detected in a first modality for a human subject. Then, the system can receive, from a second classifier, a second voice activity indicator detected in a second modality for the human subject, wherein the first voice activity indicator and the second voice activity indicator are based on the human subject at the same time, and wherein the first modality and the second modality are different. The system can concatenate, via a third classifier, the first voice activity indicator and the second voice activity indicator with original features of the human subject, to yield a classifier output, and determine voice activity based on the classifier output.
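The augmented third tier described above (concatenating the two per-modality voice activity indicators with the original features and classifying the result) can be sketched as follows. Everything here is an illustrative assumption rather than the patent's implementation: the feature dimensions, the sigmoid stand-ins for the first two classifiers, and the untrained random weights of the third stage.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame inputs (dimensions are illustrative assumptions).
audio_features = rng.normal(size=13)   # e.g. 13 MFCCs for one frame
visual_features = rng.normal(size=8)   # e.g. mouth-region descriptors

def audio_classifier(feats):
    # Stand-in first-tier classifier: a speech/non-speech score in (0, 1).
    return 1.0 / (1.0 + np.exp(-feats.mean()))

def visual_classifier(feats):
    # Stand-in second-tier classifier for the visual modality.
    return 1.0 / (1.0 + np.exp(-feats.mean()))

# First- and second-tier voice activity indicators for the same time instant.
a_score = audio_classifier(audio_features)
v_score = visual_classifier(visual_features)

# Augmented third tier: concatenate both indicators with the original features.
augmented = np.concatenate(([a_score, v_score], audio_features, visual_features))

# Toy third-stage classifier (weights are random placeholders, not trained).
w = rng.normal(size=augmented.shape[0])
speech_probability = 1.0 / (1.0 + np.exp(-float(w @ augmented)))
is_speech = speech_probability > 0.5
```

The point of the augmentation is that the third classifier sees both the upstream decisions and the raw evidence, so it can learn when to trust each modality.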

Description

Claims (20)

We claim:
1. A method comprising:
receiving, from a first classifier, a first modality output detected from a first input from a human subject during a first time period;
receiving, from a second classifier, a second modality output detected from a second input from the human subject during a second time period, wherein the first time period and the second time period differ;
combining first classifier output from the first classifier with second classifier output from the second classifier to yield classifier group output; and
determining voice activity based on the classifier group output.
2. The method of claim 1, wherein an acoustic feature associated with the first input comprises one of Mel-frequency cepstral coefficients, a first derivative of Mel-frequency cepstral coefficients, a second derivative of Mel-frequency cepstral coefficients, and acoustic energy.
3. The method of claim 1, wherein a visual feature associated with the second input comprises one of a parametric feature, an appearance-based feature, dimensions of a mouth region, image region intensity, discrete cosine transformations, image saturation, image brightness, image texture, video motion, head movement, one of movement of a specific item relative to other items and movement of the specific item relative to a background, micro-expressions on a person's face, extremity of expression, detected lighting changes, head angle relative to a camera, and local binary patterns.
4. The method of claim 1, wherein original features associated with the human subject are associated with a video frame.
5. The method of claim 1, wherein the combining of the first classifier output from the first classifier with the second classifier output from the second classifier further comprises:
performing late fusion of original features associated with the human subject using weighted majority voting.
6. The method of claim 1, wherein the first input comprises voice input.
7. The method of claim 1, wherein the second input comprises visual input.
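Claim 2 names MFCCs, their first and second derivatives, and acoustic energy as candidate acoustic features. A minimal sketch of assembling such a per-frame feature vector follows; the frame counts, coefficient counts, random placeholder data, and the use of `np.gradient` for the time-derivatives are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical inputs: 100 frames of 13 MFCCs each, plus the raw sample
# frames they came from (values are random placeholders, not real audio).
mfcc = rng.normal(size=(100, 13))      # frames x coefficients
frames = rng.normal(size=(100, 400))   # frames x samples (e.g. 25 ms @ 16 kHz)

# First and second time-derivatives ("delta" and "delta-delta") of the MFCCs.
delta = np.gradient(mfcc, axis=0)
delta2 = np.gradient(delta, axis=0)

# Per-frame log acoustic energy (epsilon avoids log of zero).
log_energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)

# One acoustic feature vector per frame: 13 + 13 + 13 + 1 = 40 dimensions.
acoustic = np.hstack([mfcc, delta, delta2, log_energy[:, None]])
print(acoustic.shape)  # (100, 40)
```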
8. A system comprising:
a processor; and
a computer-readable medium having instructions which, when executed by the processor, cause the processor to perform operations comprising:
receiving, from a first classifier, a first modality output detected from a first input from a human subject during a first time period;
receiving, from a second classifier, a second modality output detected from a second input from the human subject during a second time period, wherein the first time period and the second time period differ;
combining first classifier output from the first classifier with second classifier output from the second classifier to yield classifier group output; and
determining voice activity based on the classifier group output.
9. The system of claim 8, wherein an acoustic feature associated with the first input comprises one of Mel-frequency cepstral coefficients, a first derivative of Mel-frequency cepstral coefficients, a second derivative of Mel-frequency cepstral coefficients, and acoustic energy.
10. The system of claim 8, wherein a visual feature associated with the second input comprises one of a parametric feature, an appearance-based feature, dimensions of a mouth region, image region intensity, discrete cosine transformations, image saturation, image brightness, image texture, video motion, head movement, one of movement of a specific item relative to other items and movement of the specific item relative to a background, micro-expressions on a person's face, extremity of expression, detected lighting changes, head angle relative to a camera, and local binary patterns.
11. The system of claim 8, wherein original features associated with the human subject are associated with a video frame.
12. The system of claim 8, wherein the combining of the first classifier output from the first classifier with the second classifier output from the second classifier further comprises:
performing late fusion of original features associated with the human subject using weighted majority voting.
13. The system of claim 8, wherein the first input comprises voice input.
14. The system of claim 8, wherein the second input comprises visual input.
15. A computer-readable storage device storing instructions which, when executed by a computing device, cause the computing device to perform operations comprising:
receiving, from a first classifier, a first modality output detected from a first input from a human subject during a first time period;
receiving, from a second classifier, a second modality output detected from a second input from the human subject during a second time period, wherein the first time period and the second time period differ;
combining first classifier output from the first classifier with second classifier output from the second classifier to yield classifier group output; and
determining voice activity based on the classifier group output.
16. The computer-readable storage device of claim 15, wherein an acoustic feature associated with the first input comprises one of Mel-frequency cepstral coefficients, a first derivative of Mel-frequency cepstral coefficients, a second derivative of Mel-frequency cepstral coefficients, and acoustic energy.
17. The computer-readable storage device of claim 15, wherein a visual feature associated with the second input comprises one of a parametric feature, an appearance-based feature, dimensions of a mouth region, image region intensity, discrete cosine transformations, image saturation, image brightness, image texture, video motion, head movement, one of movement of a specific item relative to other items and movement of the specific item relative to a background, micro-expressions on a person's face, extremity of expression, detected lighting changes, head angle relative to a camera, and local binary patterns.
18. The computer-readable storage device of claim 15, wherein original features associated with the human subject are associated with a video frame.
19. The computer-readable storage device of claim 15, wherein the combining of the first classifier output from the first classifier with the second classifier output from the second classifier further comprises:
performing late fusion of original features associated with the human subject using weighted majority voting.
20. The computer-readable storage device of claim 15, wherein the first input comprises voice input or wherein the second input comprises visual input.
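Claims 5, 12, and 19 name late fusion by weighted majority voting: each classifier casts a vote, and votes are tallied by classifier weight rather than by count. A minimal sketch of that voting rule follows; the three-classifier setup, the vote values, and the weights are illustrative assumptions, not values from the patent.

```python
# Late fusion by weighted majority voting over per-classifier decisions.

def weighted_majority_vote(votes, weights):
    """Return the label whose supporting classifiers carry the most weight.

    votes   -- per-classifier labels, e.g. True for "speech detected".
    weights -- per-classifier reliability weights (same length as votes).
    """
    tally = {}
    for label, weight in zip(votes, weights):
        tally[label] = tally.get(label, 0.0) + weight
    return max(tally, key=tally.get)

# Three hypothetical classifiers: audio, visual, and an augmented tier.
votes = [True, False, True]    # speech / non-speech decisions
weights = [0.5, 0.2, 0.8]      # e.g. weights derived from validation accuracy

decision = weighted_majority_vote(votes, weights)  # True: 1.3 vs 0.2
```

A low-weight dissenting classifier (here the visual one) is overruled, which is the point of weighting votes by reliability instead of counting them equally.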
US15/894,245 (priority 2013-08-23, filed 2018-02-12): Augmented multi-tier classifier for multi-modal voice activity detection, Abandoned, US20180182415A1 (en)

Priority Applications (2)

Application Number | Priority Date | Filing Date | Title
US15/894,245 (US20180182415A1) | 2013-08-23 | 2018-02-12 | Augmented multi-tier classifier for multi-modal voice activity detection
US17/101,048 (US20210074315A1) | 2013-08-23 | 2020-11-23 | Augmented multi-tier classifier for multi-modal voice activity detection

Applications Claiming Priority (2)

Application Number | Priority Date | Filing Date | Title
US13/974,453 (US9892745B2) | 2013-08-23 | 2013-08-23 | Augmented multi-tier classifier for multi-modal voice activity detection
US15/894,245 (US20180182415A1) | 2013-08-23 | 2018-02-12 | Augmented multi-tier classifier for multi-modal voice activity detection

Related Parent Applications (1)

Application Number | Relation | Priority Date | Filing Date | Title
US13/974,453 (US9892745B2) | Continuation | 2013-08-23 | 2013-08-23 | Augmented multi-tier classifier for multi-modal voice activity detection

Related Child Applications (1)

Application Number | Relation | Priority Date | Filing Date | Title
US17/101,048 (US20210074315A1) | Continuation | 2013-08-23 | 2020-11-23 | Augmented multi-tier classifier for multi-modal voice activity detection

Publications (1)

Publication Number | Publication Date
US20180182415A1 (en) | 2018-06-28

Family

ID=52481151

Family Applications (3)

Application Number | Status | Priority Date | Filing Date | Title
US13/974,453 (US9892745B2) | Active (2035-07-03) | 2013-08-23 | 2013-08-23 | Augmented multi-tier classifier for multi-modal voice activity detection
US15/894,245 (US20180182415A1) | Abandoned | 2013-08-23 | 2018-02-12 | Augmented multi-tier classifier for multi-modal voice activity detection
US17/101,048 (US20210074315A1) | Abandoned | 2013-08-23 | 2020-11-23 | Augmented multi-tier classifier for multi-modal voice activity detection

Family Applications Before (1)

Application Number | Status | Priority Date | Filing Date | Title
US13/974,453 (US9892745B2) | Active (2035-07-03) | 2013-08-23 | 2013-08-23 | Augmented multi-tier classifier for multi-modal voice activity detection

Family Applications After (1)

Application Number | Status | Priority Date | Filing Date | Title
US17/101,048 (US20210074315A1) | Abandoned | 2013-08-23 | 2020-11-23 | Augmented multi-tier classifier for multi-modal voice activity detection

Country Status (1)

Country | Link
US (3) | US9892745B2 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US10909380B2 (en)* | 2017-12-13 | 2021-02-02 | Beijing Sensetime Technology Development Co., Ltd | Methods and apparatuses for recognizing video and training, electronic device and medium
JP2021162685A (en)* | 2020-03-31 | 2021-10-11 | グローリー株式会社 | Utterance section detection device, voice recognition device, utterance section detection system, utterance section detection method, and utterance section detection program
US20230253009A1 (en)* | 2018-05-04 | 2023-08-10 | Google Llc | Hot-word free adaptation of automated assistant function(s)

Families Citing this family (49)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US9240184B1 (en)* | 2012-11-15 | 2016-01-19 | Google Inc. | Frame-level combination of deep neural network and gaussian mixture models
US10134423B2 (en)* | 2013-12-06 | 2018-11-20 | Tata Consultancy Services Limited | System and method to provide classification of noise data of human crowd
US9530400B2 (en)* | 2014-09-29 | 2016-12-27 | Nuance Communications, Inc. | System and method for compressed domain language identification
US10403269B2 (en) | 2015-03-27 | 2019-09-03 | Google Llc | Processing audio waveforms
US10575103B2 (en) | 2015-04-10 | 2020-02-25 | Starkey Laboratories, Inc. | Neural network-driven frequency translation
US10339921B2 (en) | 2015-09-24 | 2019-07-02 | Google Llc | Multichannel raw-waveform neural networks
US10229700B2 (en)* | 2015-09-24 | 2019-03-12 | Google Llc | Voice activity detection
CN105654949B (en)* | 2016-01-07 | 2019-05-07 | 北京云知声信息技术有限公司 | A kind of voice awakening method and device
US11166027B2 (en) | 2016-06-10 | 2021-11-02 | Apple Inc. | Content adaptation for streaming
US10771631B2 (en) | 2016-08-03 | 2020-09-08 | Dolby Laboratories Licensing Corporation | State-based endpoint conference interaction
US10224058B2 (en) | 2016-09-07 | 2019-03-05 | Google Llc | Enhanced multi-channel acoustic models
US10529320B2 (en) | 2016-12-21 | 2020-01-07 | Google Llc | Complex evolution recurrent neural networks
US10140980B2 (en) | 2016-12-21 | 2018-11-27 | Google LLC | Complex linear projection for acoustic modeling
EP3600130B1 (en)* | 2017-03-20 | 2023-07-12 | Align Technology, Inc. | Generating a virtual depiction of an orthodontic treatment of a patient
EP3602544A4 (en) | 2017-03-23 | 2020-02-05 | Joyson Safety Systems Acquisition LLC | System and method of correlating mouth images to input commands
US11416714B2 (en)* | 2017-03-24 | 2022-08-16 | Revealit Corporation | Method, system, and apparatus for identifying and revealing selected objects from video
CN108346428B (en)* | 2017-09-13 | 2020-10-02 | 腾讯科技(深圳)有限公司 | Voice activity detection and model building method, device, equipment and storage medium thereof
US10403303B1 (en)* | 2017-11-02 | 2019-09-03 | Gopro, Inc. | Systems and methods for identifying speech based on cepstral coefficients and support vector machines
US11481571B2 (en) | 2018-01-12 | 2022-10-25 | Microsoft Technology Licensing, Llc | Automated localized machine learning training
US11429807B2 (en)* | 2018-01-12 | 2022-08-30 | Microsoft Technology Licensing, Llc | Automated collection of machine learning training data
US12217760B2 (en)* | 2018-04-17 | 2025-02-04 | GONGIO Ltd. | Metadata-based diarization of teleconferences
US11553988B2 (en) | 2018-06-29 | 2023-01-17 | Align Technology, Inc. | Photo of a patient with new simulated smile in an orthodontic treatment review software
DE102018212902B4 (en)* | 2018-08-02 | 2024-12-19 | Bayerische Motoren Werke Aktiengesellschaft | Method for determining a digital assistant for performing a vehicle function from a plurality of digital assistants in a vehicle, computer-readable medium, system, and vehicle
WO2020043007A1 (en)* | 2018-08-27 | 2020-03-05 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Method, system, and computer-readable medium for purifying voice using depth information
US10846522B2 (en)* | 2018-10-16 | 2020-11-24 | Google Llc | Speaking classification using audio-visual data
US10861457B2 (en)* | 2018-10-26 | 2020-12-08 | Ford Global Technologies, Llc | Vehicle digital assistant authentication
US20210027772A1 (en) | 2019-07-24 | 2021-01-28 | Gong I.O Ltd. | Unsupervised automated extraction of conversation structure from recorded conversations
CN110196914B (en)* | 2019-07-29 | 2019-12-27 | 上海肇观电子科技有限公司 | Method and device for inputting face information into database
CN110648656A (en)* | 2019-08-28 | 2020-01-03 | 北京达佳互联信息技术有限公司 | Voice endpoint detection method and device, electronic equipment and storage medium
CN112487182B (en)* | 2019-09-12 | 2024-04-12 | 华为技术有限公司 | Text processing model training method, text processing method and device
US11580869B2 (en)* | 2019-09-23 | 2023-02-14 | Revealit Corporation | Computer-implemented interfaces for identifying and revealing selected objects from video
US11915123B2 (en)* | 2019-11-14 | 2024-02-27 | International Business Machines Corporation | Fusing multimodal data using recurrent neural networks
CN113139561B (en)* | 2020-01-17 | 2024-05-03 | TCL科技集团股份有限公司 | Garbage classification method, garbage classification device, terminal equipment and storage medium
KR20210112726A (en)* | 2020-03-06 | 2021-09-15 | 엘지전자 주식회사 | Providing interactive assistant for each seat in the vehicle
CN111814609B (en)* | 2020-06-24 | 2023-09-29 | 厦门大学 | Micro-expression recognition method based on deep forest and convolutional neural network
CN111933175B (en)* | 2020-08-06 | 2023-10-24 | 北京中电慧声科技有限公司 | Active voice detection method and system based on noise scene recognition
CN112017633B (en)* | 2020-09-10 | 2024-04-26 | 北京地平线信息技术有限公司 | Speech recognition method, device, storage medium and electronic equipment
KR20220059629A (en)* | 2020-11-03 | 2022-05-10 | 현대자동차주식회사 | Vehicle and method for controlling thereof
WO2022125351A2 (en)* | 2020-12-09 | 2022-06-16 | Cerence Operating Company | Automotive infotainment system with spatially-cognizant applications that interact with a speech interface
US12175970B2 (en)* | 2020-12-24 | 2024-12-24 | Cerence Operating Company | Speech dialog system for multiple passengers in a car
CN112863488A (en)* | 2021-01-15 | 2021-05-28 | 广东优碧胜科技有限公司 | Voice signal processing method and device and electronic equipment
US11996114B2 (en) | 2021-05-15 | 2024-05-28 | Apple Inc. | End-to-end time-domain multitask learning for ML-based speech enhancement
CN113255556A (en)* | 2021-06-07 | 2021-08-13 | 斑马网络技术有限公司 | Multi-mode voice endpoint detection method and device, vehicle-mounted terminal and storage medium
CN113434731B (en)* | 2021-06-30 | 2024-01-19 | 平安科技(深圳)有限公司 | Music video genre classification method, device, computer equipment and storage medium
CN114420155A (en)* | 2021-11-29 | 2022-04-29 | 江苏科技大学 | Underwater sound target identification method based on multi-mode fusion
CN114049887B (en)* | 2021-12-06 | 2025-03-11 | 宁波蛙声科技有限公司 | Real-time voice activity detection method and system for audio and video conferencing
CN118056207A (en)* | 2021-12-07 | 2024-05-17 | 渊慧科技有限公司 | Large-scale retrieval for sequence generation
US11646036B1 (en)* | 2022-01-31 | 2023-05-09 | Humancore LLC | Team member identification based on psychographic categories
JP7693593B2 (en)* | 2022-03-15 | 2025-06-17 | 株式会社東芝 | Voice activity detection device, learning device, and voice activity detection program

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20020120643A1 (en)* | 2001-02-28 | 2002-08-29 | Ibm Corporation | Audio-visual data collection system
US20140067388A1 (en)* | 2012-09-05 | 2014-03-06 | Samsung Electronics Co., Ltd. | Robust voice activity detection in adverse environments

Family Cites Families (23)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US6570610B1 (en)* | 1997-09-15 | 2003-05-27 | Alan Kipust | Security system with proximity sensing for an electronic device
US6594629B1 (en)* | 1999-08-06 | 2003-07-15 | International Business Machines Corporation | Methods and apparatus for audio-visual speech detection and recognition
US7321854B2 (en)* | 2002-09-19 | 2008-01-22 | The Penn State Research Foundation | Prosody based audio/visual co-analysis for co-verbal gesture recognition
US7472063B2 (en)* | 2002-12-19 | 2008-12-30 | Intel Corporation | Audio-visual feature fusion and support vector machine useful for continuous speech recognition
EP1443498B1 (en)* | 2003-01-24 | 2008-03-19 | Sony Ericsson Mobile Communications AB | Noise reduction and audio-visual speech activity detection
US7251603B2 (en)* | 2003-06-23 | 2007-07-31 | International Business Machines Corporation | Audio-only backoff in audio-visual speech recognition system
US7269560B2 (en)* | 2003-06-27 | 2007-09-11 | Microsoft Corporation | Speech detection and enhancement using audio/video fusion
US20050197843A1 (en) | 2004-03-07 | 2005-09-08 | International Business Machines Corporation | Multimodal aggregating unit
US20050216254A1 (en) | 2004-03-24 | 2005-09-29 | Gupta Anurag K | System-resource-based multi-modal input fusion
US7430324B2 (en) | 2004-05-25 | 2008-09-30 | Motorola, Inc. | Method and apparatus for classifying and ranking interpretations for multimodal input fusion
US7742641B2 (en) | 2004-12-06 | 2010-06-22 | Honda Motor Co., Ltd. | Confidence weighted classifier combination for multi-modal identification
US20060235684A1 (en)* | 2005-04-14 | 2006-10-19 | Sbc Knowledge Ventures, Lp | Wireless device to access network-based voice-activated services using distributed speech recognition
US20070061335A1 (en)* | 2005-09-14 | 2007-03-15 | Jorey Ramer | Multimodal search query processing
US7860718B2 (en)* | 2005-12-08 | 2010-12-28 | Electronics And Telecommunications Research Institute | Apparatus and method for speech segment detection and system for speech recognition
US8589161B2 (en) | 2008-05-27 | 2013-11-19 | Voicebox Technologies, Inc. | System and method for an integrated, multi-modal, multi-device natural language voice services environment
US9009053B2 (en)* | 2008-11-10 | 2015-04-14 | Google Inc. | Multisensory speech detection
US8326637B2 (en) | 2009-02-20 | 2012-12-04 | Voicebox Technologies, Inc. | System and method for processing multi-modal device interactions in a natural language voice services environment
KR101581883B1 (en)* | 2009-04-30 | 2016-01-11 | 삼성전자주식회사 | Speech detection apparatus and method using motion information
US8650029B2 (en)* | 2011-02-25 | 2014-02-11 | Microsoft Corporation | Leveraging speech recognizer feedback for voice activity detection
EP2523149B1 (en) | 2011-05-11 | 2023-01-11 | Tata Consultancy Services Ltd. | A method and system for association and decision fusion of multimodal inputs
US8589167B2 (en)* | 2011-05-11 | 2013-11-19 | Nuance Communications, Inc. | Speaker liveness detection
US9214157B2 (en)* | 2011-12-06 | 2015-12-15 | At&T Intellectual Property I, L.P. | System and method for machine-mediated human-human conversation
US20140149177A1 (en)* | 2012-11-23 | 2014-05-29 | Ari M. Frank | Responding to uncertainty of a user regarding an experience by presenting a prior experience

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20020120643A1 (en)* | 2001-02-28 | 2002-08-29 | Ibm Corporation | Audio-visual data collection system
US20140067388A1 (en)* | 2012-09-05 | 2014-03-06 | Samsung Electronics Co., Ltd. | Robust voice activity detection in adverse environments

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US10909380B2 (en)* | 2017-12-13 | 2021-02-02 | Beijing Sensetime Technology Development Co., Ltd | Methods and apparatuses for recognizing video and training, electronic device and medium
US20230253009A1 (en)* | 2018-05-04 | 2023-08-10 | Google Llc | Hot-word free adaptation of automated assistant function(s)
JP2021162685A (en)* | 2020-03-31 | 2021-10-11 | グローリー株式会社 | Utterance section detection device, voice recognition device, utterance section detection system, utterance section detection method, and utterance section detection program
JP7511374B2 | 2020-03-31 | 2024-07-05 | グローリー株式会社 | Speech activity detection device, voice recognition device, speech activity detection system, speech activity detection method, and speech activity detection program

Also Published As

Publication number | Publication date
US20150058004A1 (en) | 2015-02-26
US20210074315A1 (en) | 2021-03-11
US9892745B2 (en) | 2018-02-13

Similar Documents

Publication | Title
US20210074315A1 (en) | Augmented multi-tier classifier for multi-modal voice activity detection
US11978440B2 (en) | Wakeword detection
US12243531B2 (en) | Determining input for speech processing engine
US11335347B2 (en) | Multiple classifications of audio data
US11545174B2 (en) | Emotion detection using speaker baseline
US9412361B1 (en) | Configuring system operation using image data
US20190318758A1 (en) | Adjusting speed of human speech playback
US11854538B1 (en) | Sentiment detection in audio data
JP2022171662A (en) | Systems and methods for domain adaptation in neural networks using domain classifiers
US11205428B1 (en) | Deleting user data using keys
Ariav et al. | A deep architecture for audio-visual voice activity detection in the presence of transients
US11514900B1 (en) | Wakeword detection
Joshi et al. | A Study of speech emotion recognition methods
JP2025510553A (en) | Multitask Learning for Personalized Keyword Spotting
Abreha | An environmental audio-based context recognition system using smartphones
JP7178331B2 (en) | Information processing device, information processing method and program
Eyben et al. | Audiovisual vocal outburst classification in noisy acoustic conditions
US11531736B1 (en) | User authentication as a service
JP2011191542A (en) | Voice classification device, voice classification method, and program for voice classification
Yoshida et al. | Audio-visual voice activity detection based on an utterance state transition model
US12200449B1 (en) | User orientation estimation
Wöllmer et al. | Temporal and Situational Context Modeling for Improved Dominance Recognition in Meetings
Bregler et al. | Improving acoustic speaker verification with visual body-language features
Vildjiounaite et al. | Requirements and software framework for adaptive multimodal affect recognition
Inoue et al. | Enhanced speaker diarization with detection of backchannels using eye-gaze information in poster conversations

Legal Events

Date | Code | Title | Description

STPP | Information on status: patent application and granting procedure in general
Free format text: FINAL REJECTION MAILED

STPP | Information on status: patent application and granting procedure in general
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP | Information on status: patent application and granting procedure in general
Free format text: NON FINAL ACTION MAILED

STPP | Information on status: patent application and granting procedure in general
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP | Information on status: patent application and granting procedure in general
Free format text: FINAL REJECTION MAILED

AS | Assignment
Owner name: AT&T INTELLECTUAL PROPERTY I, L.P., GEORGIA
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:DIMITRIADIS, DIMITRIOS;ZAVESKY, ERIC;BURLICK, MATTHEW;SIGNING DATES FROM 20130822 TO 20130823;REEL/FRAME:051224/0502

STPP | Information on status: patent application and granting procedure in general
Free format text: NON FINAL ACTION MAILED

STPP | Information on status: patent application and granting procedure in general
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP | Information on status: patent application and granting procedure in general
Free format text: FINAL REJECTION MAILED

STCB | Information on status: application discontinuation
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

