Aryan Kargwal

Image Generation using Janus 1.3B🔮

Code: Click Me
YouTube: Click Me

Today, we’re diving into something exciting: Janus 1.3B, one of the smallest truly multimodal LLMs that is still genuinely competent. What sets Janus apart is that, despite its small size, it delivers strong results in both natural language processing and image generation. It is a good example of where AI is heading: smaller models that remain versatile and multimodal.


Janus 1.3B

So, what exactly is Janus 1.3B? At its core, Janus is a vision-language model (VLM) designed to handle both textual and visual data. With just 1.3 billion parameters, Janus is significantly smaller than most of the other LLMs and multimodal models we’ve discussed on the channel. But don’t let its size fool you: it can perform both text and image generation, making it a powerful tool despite its compact footprint.

Unlike most models, which specialize in one area or need large architectures to function effectively across multiple domains, Janus achieves this multimodal functionality at a much smaller scale. This is a meaningful step toward making AI more efficient, accessible, and, most importantly, scalable.


How Does Janus Work?

Let’s start with its architecture. Janus processes text understanding, multimodal understanding, and visual generation through independent encoding methods that eventually feed into a unified autoregressive transformer. This design allows it to handle different types of input—text, images, or a combination of both—in a highly efficient manner.


Here’s the breakdown of how it all works:

  1. Text Understanding: Janus employs the built-in tokenizer from its underlying LLM. This tokenizer converts text into discrete IDs (tokens), which are then transformed into feature representations. The LLM processes these features the same way as any other text-based model (a minimal sketch of this step follows after the list).

  2. Multimodal Understanding: For image processing, Janus integrates SigLIP, a powerful vision encoder that extracts high-dimensional semantic features from images. These features are flattened from a 2D grid into a 1D sequence and passed through an understanding adaptor. This adaptor maps the image features into the input space of the LLM, ensuring that image and text data are represented in a way the model can process together.

  3. Image Generation: To generate images, Janus utilizes a Vector Quantization (VQ) tokenizer, which represents images as sequences of discrete IDs. These ID sequences are flattened and passed through a generation adaptor, which maps them into the LLM’s input space, allowing Janus to generate image content from a text description. A specialized image prediction head is trained for this task, while Janus relies on the LLM’s existing text prediction head for text-based tasks.
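
To make the text path concrete, here is a minimal sketch of tokenization using a generic Hugging Face tokenizer as a stand-in. Janus ships its own tokenizer, so `gpt2` below is purely illustrative:

```python
from transformers import AutoTokenizer

# Stand-in tokenizer for illustration; Janus uses its underlying LLM's own.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Text -> discrete token IDs, which the LLM then embeds into features
ids = tokenizer("A corgi surfing a wave", return_tensors="pt").input_ids
print(ids.shape)  # (1, number_of_tokens)
```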

Once the inputs, whether text, image, or both, are converted into feature sequences, Janus concatenates them into a unified multimodal feature sequence. This sequence is then fed into the LLM for processing, making it capable of generating text and images based on the input it receives.
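
Here is a minimal sketch of that fusion step in PyTorch. Every module name below (`text_embed`, `siglip_encoder`, `understanding_adaptor`) is a hypothetical stand-in for the components described above, not the actual Janus API:

```python
import torch

def build_multimodal_sequence(text_ids, image,
                              text_embed, siglip_encoder, understanding_adaptor):
    # Text path: token IDs -> LLM input embeddings, shape (1, T, d_model)
    text_features = text_embed(text_ids)

    # Image path: SigLIP yields a 2D grid of patch features, (1, H, W, d_vision)
    patch_grid = siglip_encoder(image)

    # Flatten the 2D grid into a 1D sequence: (1, H * W, d_vision)
    patch_seq = patch_grid.flatten(1, 2)

    # The understanding adaptor projects vision features into the LLM's space
    image_features = understanding_adaptor(patch_seq)  # (1, H * W, d_model)

    # Concatenate into one unified sequence for the autoregressive transformer
    return torch.cat([image_features, text_features], dim=1)
```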


Janus Multi-Modal Performance

Now, let’s talk performance. Despite its relatively small size of 1.3 billion parameters, Janus is competitive across several multimodal tasks. It excels in Visual Question Answering (VQA) benchmarks, COCO Captioning, and Image-Text Retrieval.

(Figure: Janus multimodal benchmark results)

Janus is designed to handle real-world multimodal applications where parameter efficiency is critical. While larger models might outperform Janus on tasks that require deep reasoning over complex text or high-resolution images, Janus hits a sweet spot by balancing efficiency and performance for general-purpose multimodal applications.


How to Use Janus for Multi-Modal Integration

Now, let’s see how to use the model for multimodal inference. Below is an example of how to set up the generate_answer function, which takes an image and a question as inputs.

```python
def generate_answer(image_path, question):
    # Load the VL-GPT model, tokenizer, and visual language chat processor
    model = load_vl_gpt_model()
    tokenizer = load_tokenizer()
    vl_chat_processor = load_vl_chat_processor()

    # Define the conversation structure
    conversation = f"{question} [image: {image_path}]"

    # Prepare the image for processing
    image = preprocess_image(image_path)

    # Prepare inputs for the model
    inputs = vl_chat_processor.process(image, conversation)

    # Generate input embeddings
    input_embeddings = model.get_embeddings(inputs)

    # Generate an answer using the VL-GPT model
    answer = model.generate(input_embeddings)

    return decode_answer(answer)
```

In this code, we load the necessary components, prepare the image and question for processing, and generate a response that combines visual context with the posed question.
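
Assuming the placeholder loaders above are wired up to real weights, a call would look like this (the image path and question are, of course, just examples):

```python
# Hypothetical usage of the sketch above
answer = generate_answer("beach_photo.jpg", "How many dogs are in this picture?")
print(answer)
```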


Janus Image Generation

Finally, let’s examine Janus’ image generation capabilities. While it’s not as large as dedicated models like DALL-E 2 or Stable Diffusion, Janus still creates high-quality images from textual inputs with an incredibly compact model.

(Figure: sample images generated by Janus)

As mentioned, Janus uses the VQ tokenizer to represent images as sequences of discrete tokens. At generation time, the unified transformer predicts these image tokens autoregressively from the text prompt, and the VQ decoder then maps the completed token sequence back into pixels. The result? Images that are coherent and contextually accurate, especially for more straightforward or abstract prompts.

How to Use Janus for Image Generation

The process starts with tokenizing the prompt using the vl_chat_processor. This converts the text into numerical representations that the model can understand.

```python
def generate_image(prompt, num_tokens=576):
    # num_tokens is the length of the image token sequence; 576 (a 24x24
    # VQ grid) is an illustrative default
    # Tokenize the prompt
    tokenized_prompt = vl_chat_processor.tokenize(prompt)

    # Create initial embeddings from the tokens
    initial_embeddings = model.create_embeddings(tokenized_prompt)

    # Generate image tokens iteratively (autoregressive decoding)
    image_tokens = []
    for _ in range(num_tokens):
        token = model.generate_next_token(initial_embeddings)
        image_tokens.append(token)
        initial_embeddings = model.update_embeddings(initial_embeddings, token)

    # Decode the tokens into an image via the VQ detokenizer
    image = decode_image(image_tokens)

    # Save the image to disk
    save_image(image, "output_image.jpg")
```

This code illustrates generating an image based on a text prompt using Janus. It showcases the iterative process of generating image tokens while ensuring relevance to the original prompt.
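
A hypothetical call, assuming the placeholder helpers above are implemented:

```python
# Writes output_image.jpg to the working directory
generate_image("a watercolor painting of a lighthouse at dusk")
```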


Conclusion

So there you have it—Janus 1.3B, a small but compelling multimodal model that punches well above its weight. Its ability to handle text understanding, multimodal reasoning, and image generation in such a compact framework is a testament to the efficiency of its design.

For those interested in multimodal AI that can be deployed in real-world applications without massive computational power, Janus is a model you should watch.
