speech recognition sample #20291


Merged

alalek merged 12 commits into opencv:master from spazewalker:master on Oct 4, 2021

Conversation

@spazewalker (Contributor) commented on Jun 21, 2021 (edited by alalek)

GSoC 2021: Speech Recognition using OpenCV AudioIO

Project details

PR details

Creating ONNX model

NVIDIA trained Jasper using FP16 precision, but OpenCV needs FP32, so the ONNX model's graph has to be converted. This is done using this script: convert_jasper_to_FP32.py. The pre-trained converted ONNX model can be found here. The original pre-trained model by NVIDIA can be found here.
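The conversion can be sketched with the onnx Python package. This is a minimal sketch, not the actual convert_jasper_to_FP32.py (a full conversion may also need to handle Cast nodes and value_info); file names are placeholders:

    import onnx
    import numpy as np
    from onnx import numpy_helper, TensorProto

    model = onnx.load('jasper_fp16.onnx')  # placeholder input path

    # Cast every FP16 weight tensor (initializer) to FP32.
    for init in model.graph.initializer:
        if init.data_type == TensorProto.FLOAT16:
            arr = numpy_helper.to_array(init).astype(np.float32)
            init.CopyFrom(numpy_helper.from_array(arr, init.name))

    # Relabel FP16 graph inputs and outputs as FP32.
    for value in list(model.graph.input) + list(model.graph.output):
        t = value.type.tensor_type
        if t.elem_type == TensorProto.FLOAT16:
            t.elem_type = TensorProto.FLOAT

    onnx.save(model, 'jasper.onnx')  # placeholder output path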

Usage

usage: speech_recognition.py [-h] --input_audio INPUT_AUDIO [--show_spectrogram] [--model MODEL] [--output OUTPUT] [--backend {0,2,3}] [--target {0,1,2}]

This script runs Jasper Speech recognition model

optional arguments:
  -h, --help            show this help message and exit
  --input_audio INPUT_AUDIO
                        Path to input audio file. OR Path to a txt file with relative path to multiple audio files in different lines (default: None)
  --show_spectrogram    Whether to show a spectrogram of the input audio. (default: False)
  --model MODEL         Path to the onnx file of Jasper. default="jasper.onnx" (default: jasper.onnx)
  --output OUTPUT       Path to file where recognized audio transcript must be saved. Leave this to print on console. (default: None)
  --backend {0,2,3}     Select a computation backend: 0: automatically (by default), 2: OpenVINO Inference Engine, 3: OpenCV Implementation (default: 0)
  --target {0,1,2}      Select a target device: 0: CPU target (by default), 1: OpenCL, 2: OpenCL FP16 (default: 0)
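For example, a hypothetical single-file run (the audio file name is a placeholder):

    python speech_recognition.py --input_audio sample.wav --show_spectrogram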

Todo

  • Use AudioIO instead of soundfile.
  • Check performance.

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under the Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on code under GPL or another license that is incompatible with OpenCV.
  • The PR is proposed to the proper branch.
  • There is a reference to the original bug report and related work.
  • There are accuracy tests, performance tests and test data in the opencv_extra repository, if applicable.
    The patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake.
force_builders=Docs

l-bat and alalek reacted with thumbs up emoji
@l-bat added the GSoC label on Jun 21, 2021
@spazewalker changed the title from "speech recognition sample added.(initial commit)" to "speech recognition sample" on Jun 22, 2021
@l-bat (Contributor) commented:

Please add a description at the beginning of the sample, as in

'''
You can download the converted pb model from https://www.dropbox.com/s/qag9vzambhhkvxr/lip_jppnet_384.pb?dl=0
or convert the model yourself.
Follow these steps if you want to convert the original model yourself:
To get original .meta pre-trained model download https://drive.google.com/file/d/1BFVXgeln-bek8TCbRjN6utPAgRE0LJZg/view
For correct convert .meta to .pb model download original repository https://github.com/Engineering-Course/LIP_JPPNet
Change script evaluate_parsing_JPPNet-s2.py for human parsing
1. Remove preprocessing to create image_batch_origin:
with tf.name_scope("create_inputs"):
...
Add
image_batch_origin = tf.placeholder(tf.float32, shape=(2, None, None, 3), name='input')
2. Create input
image = cv2.imread(path/to/image)
image_rev = np.flip(image, axis=1)
input = np.stack([image, image_rev], axis=0)
3. Hardcode image_h and image_w shapes to determine output shapes.
We use default INPUT_SIZE = (384, 384) from evaluate_parsing_JPPNet-s2.py.
parsing_out1 = tf.reduce_mean(tf.stack([tf.image.resize_images(parsing_out1_100, INPUT_SIZE),
tf.image.resize_images(parsing_out1_075, INPUT_SIZE),
tf.image.resize_images(parsing_out1_125, INPUT_SIZE)]), axis=0)
Do similarly with parsing_out2, parsing_out3
4. Remove postprocessing. Last net operation:
raw_output = tf.reduce_mean(tf.stack([parsing_out1, parsing_out2, parsing_out3]), axis=0)
Change:
parsing_ = sess.run(raw_output, feed_dict={'input:0': input})
5. To save model after sess.run(...) add:
input_graph_def = tf.get_default_graph().as_graph_def()
output_node = "Mean_3"
output_graph_def = tf.graph_util.convert_variables_to_constants(sess, input_graph_def, output_node)
output_graph = "LIP_JPPNet.pb"
with tf.gfile.GFile(output_graph, "wb") as f:
f.write(output_graph_def.SerializeToString())
'''

  1. Explain how to get the FP32 ONNX model from the pre-trained model.
  2. Provide a link to the converted model.
spazewalker and alalek reacted with thumbs up emoji

if __name__ == '__main__':

    # Computation backends supported by layers
    backends = (cv.dnn.DNN_BACKEND_DEFAULT, cv.dnn.DNN_BACKEND_OPENCV)
Contributor:

Could you try forwarding the net with OpenVINO (cv.dnn.DNN_BACKEND_INFERENCE_ENGINE)?

Contributor (Author):

I tried. It gave this error: error: (-213: The function/feature is not implemented) Unknown backend identifier in function 'cv::dnn::dnn4_v20210301::wrapMat'
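For reference, selecting the OpenVINO backend in a cv.dnn script comes down to the standard API calls below (a minimal sketch; the model path is a placeholder):

    import cv2 as cv

    net = cv.dnn.readNetFromONNX('jasper.onnx')  # placeholder path
    net.setPreferableBackend(cv.dnn.DNN_BACKEND_INFERENCE_ENGINE)
    net.setPreferableTarget(cv.dnn.DNN_TARGET_CPU)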


parser = argparse.ArgumentParser(description='This script runs Jasper Speech recognition model',
                                 formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--input_audio', type=str, help='Path to input audio file.')
Contributor:

Do we need to specify supported audio formats?

Contributor:

I think we need to add required=True.

spazewalker reacted with thumbs up emoji
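The suggested change is a one-liner in the sample's argument parser:

    parser.add_argument('--input_audio', type=str, required=True,
                        help='Path to input audio file.')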
Contributor (Author):

Ultimately we need to use AudioIO, so should I add the formats supported there? I suppose mp3, wav and mp4 are supported.

@alalek (Member) commented:

@spazewalker Could you please check if the approach from #20558 works for this case?

@spazewalker (Contributor, Author) commented:

> @spazewalker Could you please check if the approach from #20558 works for this case?

@alalek Just tested it. It works for this case.

alalek reacted with thumbs up emoji
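For context, the approach from #20558 reads audio through cv.VideoCapture; roughly like the sketch below (assuming an OpenCV build with audio support; the file name is a placeholder):

    import cv2 as cv
    import numpy as np

    params = [cv.CAP_PROP_AUDIO_STREAM, 0,       # open the first audio stream
              cv.CAP_PROP_VIDEO_STREAM, -1,      # ignore video streams
              cv.CAP_PROP_AUDIO_DATA_DEPTH, cv.CV_32F]
    cap = cv.VideoCapture('input.wav', cv.CAP_ANY, params)  # placeholder file
    base = int(cap.get(cv.CAP_PROP_AUDIO_BASE_INDEX))

    chunks = []
    while cap.grab():
        ok, frame = cap.retrieve(None, base)  # audio samples for this frame
        if ok:
            chunks.append(frame.ravel())
    audio = np.concatenate(chunks) if chunks else np.empty(0, np.float32)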

@spazewalker marked this pull request as ready for review on August 22, 2021 16:57
@spazewalker marked this pull request as a draft on August 22, 2021 16:58

support for multiple files at once
Co-authored-by: Liubov Batanina <piccione-mail@yandex.ru>
fix whitespaces
@alalek (Member) commented:

Let's merge it with the soundfile workaround.

@spazewalker Please move the PR to "Ready for review" if it is ready for merging.

@alalek (Member) commented:

> "Ready for review"

@spazewalker Ping. Or let us know if you want to improve something else.

@spazewalker (Contributor, Author) commented:

@alalek I'm actually waiting for #19721 to get merged. I think videoio will replace soundfile.

alalek reacted with thumbs up emoji

@alalek (Member) left a comment:

Thank you 👍

@spazewalker marked this pull request as ready for review on October 3, 2021 06:33
@alalek merged commit 4938765 into opencv:master on Oct 4, 2021
@alalek mentioned this pull request on Oct 15, 2021
a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request on Mar 30, 2023:

speech recognition sample
* speech recognition sample added.(initial commit)
* fixed typos, removed plt
* trailing whitespaces removed
* masking removed and using opencv for displaying spectrogram
* description added
* requested changes and add opencl fp16 target
* parenthesis and halide removed
* workaround 3d matrix issue
* handle multi channel audio
* support for multiple files at once
* suggested changes
* fix whitespaces

Reviewers

@l-bat left review comments

@alalek approved these changes

Assignees

@alalek

Projects

None yet

Milestone

4.5.4

Development

Successfully merging this pull request may close these issues.

3 participants

@spazewalker @l-bat @alalek
