speech recognition sample #20291


Merged

alalek merged 12 commits into opencv:master from spazewalker:master on Oct 4, 2021

Conversation

@spazewalker (Contributor) commented on Jun 21, 2021 (edited by alalek)

GSoC 2021: Speech Recognition using OpenCV AudioIO

Project details

PR details

Creating ONNX model

NVIDIA trained Jasper using FP16 precision, but OpenCV needs FP32, so the ONNX model's graph has to be converted. This is done using this script: convert_jasper_to_FP32.py. The pre-trained converted ONNX model can be found here. The original pre-trained model by NVIDIA can be found here.
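The conversion can be sketched with the onnx Python package. This is a minimal sketch, not the actual convert_jasper_to_FP32.py (a full conversion may also need to handle Cast nodes and value_info); file names are placeholders:

    import onnx
    import numpy as np
    from onnx import numpy_helper, TensorProto

    model = onnx.load('jasper_fp16.onnx')  # placeholder input path

    # Cast every FP16 weight tensor (initializer) to FP32.
    for init in model.graph.initializer:
        if init.data_type == TensorProto.FLOAT16:
            arr = numpy_helper.to_array(init).astype(np.float32)
            init.CopyFrom(numpy_helper.from_array(arr, init.name))

    # Relabel FP16 graph inputs and outputs as FP32.
    for value in list(model.graph.input) + list(model.graph.output):
        t = value.type.tensor_type
        if t.elem_type == TensorProto.FLOAT16:
            t.elem_type = TensorProto.FLOAT

    onnx.save(model, 'jasper.onnx')  # placeholder output path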

Usage

usage: speech_recognition.py [-h] --input_audio INPUT_AUDIO [--show_spectrogram] [--model MODEL] [--output OUTPUT] [--backend {0,2,3}] [--target {0,1,2}]

This script runs Jasper Speech recognition model

optional arguments:
  -h, --help            show this help message and exit
  --input_audio INPUT_AUDIO
                        Path to input audio file. OR Path to a txt file with relative path to multiple audio files in different lines (default: None)
  --show_spectrogram    Whether to show a spectrogram of the input audio. (default: False)
  --model MODEL         Path to the onnx file of Jasper. default="jasper.onnx" (default: jasper.onnx)
  --output OUTPUT       Path to file where recognized audio transcript must be saved. Leave this to print on console. (default: None)
  --backend {0,2,3}     Select a computation backend: 0: automatically (by default), 2: OpenVINO Inference Engine, 3: OpenCV Implementation (default: 0)
  --target {0,1,2}      Select a target device: 0: CPU target (by default), 1: OpenCL, 2: OpenCL FP16 (default: 0)
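For example, a hypothetical single-file run (the audio file name is a placeholder):

    python speech_recognition.py --input_audio sample.wav --show_spectrogram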

Todo

  • Use AudioIO instead of soundfile.
  • Check performance.

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under the Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on code under GPL or another license that is incompatible with OpenCV.
  • The PR is proposed to the proper branch.
  • There is a reference to the original bug report and related work.
  • There are accuracy tests, performance tests and test data in the opencv_extra repository, if applicable.
    The patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake.
force_builders=Docs

l-bat and alalek reacted with thumbs up emoji
@l-bat added the GSoC label on Jun 21, 2021
@spazewalker changed the title from "speech recognition sample added.(initial commit)" to "speech recognition sample" on Jun 22, 2021
@l-bat (Contributor) commented:

Please add a description at the beginning of the sample, as in

'''
You can download the converted pb model from https://www.dropbox.com/s/qag9vzambhhkvxr/lip_jppnet_384.pb?dl=0
or convert the model yourself.
Follow these steps if you want to convert the original model yourself:
To get original .meta pre-trained model download https://drive.google.com/file/d/1BFVXgeln-bek8TCbRjN6utPAgRE0LJZg/view
For correct convert .meta to .pb model download original repository https://github.com/Engineering-Course/LIP_JPPNet
Change script evaluate_parsing_JPPNet-s2.py for human parsing
1. Remove preprocessing to create image_batch_origin:
with tf.name_scope("create_inputs"):
...
Add
image_batch_origin = tf.placeholder(tf.float32, shape=(2, None, None, 3), name='input')
2. Create input
image = cv2.imread(path/to/image)
image_rev = np.flip(image, axis=1)
input = np.stack([image, image_rev], axis=0)
3. Hardcode image_h and image_w shapes to determine output shapes.
We use default INPUT_SIZE = (384, 384) from evaluate_parsing_JPPNet-s2.py.
parsing_out1 = tf.reduce_mean(tf.stack([tf.image.resize_images(parsing_out1_100, INPUT_SIZE),
tf.image.resize_images(parsing_out1_075, INPUT_SIZE),
tf.image.resize_images(parsing_out1_125, INPUT_SIZE)]), axis=0)
Do similarly with parsing_out2, parsing_out3
4. Remove postprocessing. Last net operation:
raw_output = tf.reduce_mean(tf.stack([parsing_out1, parsing_out2, parsing_out3]), axis=0)
Change:
parsing_ = sess.run(raw_output, feed_dict={'input:0': input})
5. To save model after sess.run(...) add:
input_graph_def = tf.get_default_graph().as_graph_def()
output_node = "Mean_3"
output_graph_def = tf.graph_util.convert_variables_to_constants(sess, input_graph_def, output_node)
output_graph = "LIP_JPPNet.pb"
with tf.gfile.GFile(output_graph, "wb") as f:
f.write(output_graph_def.SerializeToString())
'''

  1. Explain how to get the FP32 ONNX model from the pre-trained model.
  2. Provide a link to the converted model.
spazewalker and alalek reacted with thumbs up emoji

if __name__ == '__main__':

    # Computation backends supported by layers
    backends = (cv.dnn.DNN_BACKEND_DEFAULT, cv.dnn.DNN_BACKEND_OPENCV)
Contributor:

Could you try forwarding the net with OpenVINO (cv.dnn.DNN_BACKEND_INFERENCE_ENGINE)?

Contributor (Author):

I tried. It gave this error: error: (-213: The function/feature is not implemented) Unknown backend identifier in function 'cv::dnn::dnn4_v20210301::wrapMat'
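For reference, selecting the OpenVINO backend in a cv.dnn script comes down to the standard API calls below (a minimal sketch; the model path is a placeholder):

    import cv2 as cv

    net = cv.dnn.readNetFromONNX('jasper.onnx')  # placeholder path
    net.setPreferableBackend(cv.dnn.DNN_BACKEND_INFERENCE_ENGINE)
    net.setPreferableTarget(cv.dnn.DNN_TARGET_CPU)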


parser = argparse.ArgumentParser(description='This script runs Jasper Speech recognition model',
                                 formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--input_audio', type=str, help='Path to input audio file.')
Contributor:

Do we need to specify supported audio formats?

Contributor:

I think we need to add required=True.

spazewalker reacted with thumbs up emoji
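The suggested change is a one-liner in the sample's argument parser:

    parser.add_argument('--input_audio', type=str, required=True,
                        help='Path to input audio file.')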
Contributor (Author):

Ultimately we need to use AudioIO, so should I add the formats supported there? I suppose mp3, wav and mp4 are supported.

@alalek (Member) commented:

@spazewalker Could you please check if the approach from #20558 works for this case?

@spazewalker (Contributor, Author) commented:

> @spazewalker Could you please check if the approach from #20558 works for this case?

@alalek Just tested it. It works for this case.

alalek reacted with thumbs up emoji
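For context, the approach from #20558 reads audio through cv.VideoCapture; roughly like the sketch below (assuming an OpenCV build with audio support; the file name is a placeholder):

    import cv2 as cv
    import numpy as np

    params = [cv.CAP_PROP_AUDIO_STREAM, 0,       # open the first audio stream
              cv.CAP_PROP_VIDEO_STREAM, -1,      # ignore video streams
              cv.CAP_PROP_AUDIO_DATA_DEPTH, cv.CV_32F]
    cap = cv.VideoCapture('input.wav', cv.CAP_ANY, params)  # placeholder file
    base = int(cap.get(cv.CAP_PROP_AUDIO_BASE_INDEX))

    chunks = []
    while cap.grab():
        ok, frame = cap.retrieve(None, base)  # audio samples for this frame
        if ok:
            chunks.append(frame.ravel())
    audio = np.concatenate(chunks) if chunks else np.empty(0, np.float32)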

@spazewalker marked this pull request as ready for review on August 22, 2021 16:57
@spazewalker marked this pull request as a draft on August 22, 2021 16:58

support for multiple files at once
Co-authored-by: Liubov Batanina <piccione-mail@yandex.ru>
fix whitespaces
@alalek (Member) commented:

Let's merge it with the soundfile workaround.

@spazewalker Please move the PR to "Ready for review" if it is ready for merging.

@alalek (Member) commented:

> "Ready for review"

@spazewalker Ping. Or let us know if you want to improve something else.

@spazewalker (Contributor, Author) commented:

@alalek I'm actually waiting for #19721 to get merged. I think videoio will replace soundfile.

alalek reacted with thumbs up emoji

@alalek (Member) left a comment:

Thank you 👍

@spazewalker marked this pull request as ready for review on October 3, 2021 06:33
@alalek merged commit 4938765 into opencv:master on Oct 4, 2021
@alalek mentioned this pull request on Oct 15, 2021
a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request on Mar 30, 2023:

speech recognition sample
* speech recognition sample added.(initial commit)
* fixed typos, removed plt
* trailing whitespaces removed
* masking removed and using opencv for displaying spectrogram
* description added
* requested changes and add opencl fp16 target
* parenthesis and halide removed
* workaround 3d matrix issue
* handle multi channel audio
* support for multiple files at once
* suggested changes
* fix whitespaces

Reviewers

@l-bat left review comments

@alalek approved these changes

Assignees

@alalek

Projects

None yet

Milestone

4.5.4

Development

Successfully merging this pull request may close these issues.

3 participants

@spazewalker @l-bat @alalek
