Mohamad Albaker Kawtharani

Speech-to-Image

Abstract

In this blog, we demonstrate how several deep learning models can complement each other to form a speech-to-image solution. We build on the Stable Diffusion model and use Amazon Transcribe and Amazon Translate to shape the pipeline. Once an mp3 recording is uploaded to S3, a Lambda function calls Amazon Transcribe to transcribe the recording (speech-to-text), then calls Amazon Translate to translate the transcribed text (text-to-text), and finally invokes the Hugging Face model deployed on SageMaker to generate an image from the translated text (text-to-image), which is uploaded back to a dedicated sub-bucket.

Some Results

(Two example images generated by the pipeline.)

Overview

Text-to-image in artificial intelligence is the generation of images from a given prompt. It requires a high level of both natural language processing and image generation. Interpreting text to generate images is a major field in AI and still a work in progress, since collecting and processing real images is expensive.

(Architecture diagram: an mp3 upload to S3 triggers a Lambda function that calls Amazon Transcribe, Amazon Translate, and the SageMaker endpoint, then writes the generated images back to S3.)


Once an mp3 recording is uploaded to the S3 bucket, a Lambda function calls Amazon Transcribe to transcribe the recording (speech-to-text), then calls Amazon Translate to translate the transcribed text (text-to-text), and finally invokes the Hugging Face model deployed on SageMaker to generate an image from the translated text (text-to-image), which is uploaded back to a dedicated sub-bucket.
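
The upload-triggered flow relies on an S3 event notification that invokes the Lambda function whenever a new object lands in the bucket. As a minimal sketch of that wiring (the bucket name, Lambda ARN, and the `.mp3` suffix filter below are illustrative assumptions, not values from this project), the trigger could be configured with boto3 like this:

import boto3

s3 = boto3.client('s3')

# hypothetical bucket name and Lambda ARN -- replace with your own resources
s3.put_bucket_notification_configuration(
    Bucket='speech-to-image-bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:speech-to-image',
            'Events': ['s3:ObjectCreated:*'],
            # only fire when an mp3 recording is uploaded
            'Filter': {'Key': {'FilterRules': [{'Name': 'suffix', 'Value': '.mp3'}]}}
        }]
    })

The Lambda function also needs a resource-based permission allowing S3 to invoke it (for example via the Lambda add_permission API), and the same trigger can equally be set up from the console.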

Stable Diffusion

The algorithm used is Stable Diffusion from Hugging Face, a data science platform that provides tools for building, training, and deploying ML models based on open-source code and technologies. Stable Diffusion is a latent diffusion model, a kind of generative neural network developed by Stability AI. It was trained on pairs of images and captions taken from LAION-5B, a dataset of 5.85 billion CLIP-filtered image-text pairs. As a latent text-to-image diffusion model, it can generate photo-realistic images from almost any text input, letting anyone produce striking imagery within seconds.
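
Before deploying to SageMaker, it can help to try the model locally. Here is a minimal sketch using the diffusers library; the model id "runwayml/stable-diffusion-v1-5" and the use of a CUDA GPU are assumptions, and any Stable Diffusion checkpoint you have access to works the same way:

import torch
from diffusers import StableDiffusionPipeline

# load a Stable Diffusion checkpoint (assumed model id) in half precision
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe = pipe.to("cuda")  # assumes a CUDA-capable GPU is available

# generate one image from a text prompt and save it
image = pipe("a watercolor painting of a lighthouse at sunset").images[0]
image.save("lighthouse.png")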

The Lambda Function

The Lambda function is developed to call the deep learning models and chain their responses together. Three main models are executed in this Lambda; the code for each model call is shown in detail in the sections that follow.

The Lambda function code is as follows:

import json
import boto3
from transcribe import transcribe_mp3
from translate import translate_text
from hf_model import generate_images
from upload2s3 import upload_file

client = boto3.client('s3')

def lambda_handler(event, context):
    # parse out the bucket & file name from the event handler
    for record in event['Records']:
        file_bucket = record['s3']['bucket']['name']
        file_name = record['s3']['object']['key']
        object_url = 'https://s3.amazonaws.com/{0}/{1}'.format(file_bucket, file_name)

        transcribed_text = transcribe_mp3(file_name, object_url)
        translated_text = translate_text(transcribed_text)
        generated_images = generate_images(translated_text, 2)

        for i, img in enumerate(generated_images):
            img_name = f'{translated_text.replace(" ", "_").replace(".", "")}-{i}.jpeg'
            img_path = "/tmp/" + img_name
            img.save(img_path)
            upload_file(file_name=img_path,
                        bucket='',  # enter the s3 bucket name
                        object_name='result/' + img_name)

    return "lambda handled Successfully!"
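
The upload_file helper imported from upload2s3 is not shown in the post. A minimal sketch of what it could look like, following the standard boto3 upload pattern (the module layout and argument names simply mirror the import above; this is an assumption, not the author's original file):

import boto3
from botocore.exceptions import ClientError

s3_client = boto3.client('s3')

def upload_file(file_name, bucket, object_name=None):
    # upload a local file to S3 under the given key
    if object_name is None:
        object_name = file_name
    try:
        s3_client.upload_file(file_name, bucket, object_name)
    except ClientError as e:
        print(e)
        return False
    return True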

Transcribe Model

For this purpose, Amazon Transcribe was used. It is an automatic speech recognition service that uses deep learning models to convert audio to text.

The code uses the boto3 library, which allows us to access AWS services. It is as follows:

import json, boto3
from urllib.request import urlopen
import time

transcribe = boto3.client('transcribe')

def transcribe_mp3(file_name, object_url):
    job_name = file_name.replace('/', '')[:10]
    response = transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        LanguageCode='ar-AE',
        MediaFormat='mp3',
        Media={'MediaFileUri': object_url})
    # poll until the transcription job finishes
    while True:
        status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
        if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
            break
        print('In Progress')
        time.sleep(5)
    load_url = urlopen(status['TranscriptionJob']['Transcript']['TranscriptFileUri'])
    load_json = json.dumps(json.load(load_url))
    text = str(json.loads(load_json)['results']['transcripts'][0]['transcript'])
    return text

Translate Model

In order to make the application accessible to users around the world, Amazon Translate was used. It is a neural machine translation service that uses deep learning models to deliver fast, high-quality language translation. With this integration, the user can record the prompt in any language, and Amazon Translate will convert it to English, as the image-generation model requires. The implementation uses the boto3 library to call the AWS Translate service: the transcribed text is passed in, translated to English, and returned as the prompt variable.
The code used is as follows:

import boto3, json

translate = boto3.client('translate')

def translate_text(text):
    result = translate.translate_text(
        Text=text,
        SourceLanguageCode="auto",
        TargetLanguageCode="en")
    prompt = result["TranslatedText"]
    return prompt
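
As a quick usage sketch (the Arabic sentence and its translation are illustrative examples, not taken from the post), the function auto-detects the source language and returns the English prompt:

# roughly "a cat riding a bicycle in the street"
prompt = translate_text("قطة تركب دراجة في الشارع")
print(prompt)  # expected: an English sentence such as "A cat riding a bicycle in the street"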

Hugging Face Model

To deploy the Hugging Face model, you can simply follow the steps provided by Phil Schmid in his blog post "Stable Diffusion on Amazon SageMaker". Once the model is deployed, note the SageMaker endpoint name, which we will invoke from our Lambda code.
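
For reference, that deployment essentially boils down to creating a HuggingFaceModel from a packaged model.tar.gz and calling deploy(). The sketch below assumes the SageMaker Python SDK; the S3 path to the model archive, the framework versions, and the instance type are placeholders, so follow Phil Schmid's post for the exact values:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

# model_data points at a model.tar.gz containing the Stable Diffusion weights
# and a custom inference script -- the path and versions below are placeholders
huggingface_model = HuggingFaceModel(
    model_data="s3://<your-bucket>/stable-diffusion/model.tar.gz",
    role=role,
    transformers_version="4.17",
    pytorch_version="1.10",
    py_version="py38",
)

# deploy to a GPU instance and note the endpoint name for the Lambda code
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
)
print(predictor.endpoint_name)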

The code to invoke the SageMaker Endpoint is as follows:

import os, io, boto3, json, csv
from io import BytesIO
import base64
from PIL import Image

ENDPOINT_NAME = ''  # enter your ENDPOINT_NAME
runtime = boto3.client('sagemaker-runtime')

# helper decoder
def decode_base64_image(image_string):
    base64_image = base64.b64decode(image_string)
    buffer = BytesIO(base64_image)
    return Image.open(buffer)

def generate_images(prompt, num_images_per_prompt):
    data = {"inputs": prompt, "num_images_per_prompt": num_images_per_prompt}
    payload = json.dumps(data, indent=2).encode('utf-8')
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType='application/json',
        Body=payload)
    response_decoded = json.loads(response['Body'].read().decode())
    decoded_images = [decode_base64_image(image) for image in response_decoded["generated_images"]]
    return decoded_images
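
As a quick usage sketch (the prompt text is an illustrative example), generate_images returns PIL images that can be saved directly:

# generate two images for one prompt and save them locally
images = generate_images("an oil painting of a fishing boat at dawn", 2)
for i, img in enumerate(images):
    img.save(f"/tmp/fishing-boat-{i}.jpeg")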

Hope you enjoyed it!

Github Repo
