jina-ai/serve

☁️ Build multimodal AI applications with cloud-native stack
Jina-serve is a framework for building and deploying AI services that communicate via gRPC, HTTP and WebSockets. Scale your services from local development to production while focusing on your core logic.
- Native support for all major ML frameworks and data types
- High-performance service design with scaling, streaming, and dynamic batching
- LLM serving with streaming output
- Built-in Docker integration and Executor Hub
- One-click deployment to Jina AI Cloud
- Enterprise-ready with Kubernetes and Docker Compose support
Comparison with FastAPI
Key advantages over FastAPI:
- DocArray-based data handling with native gRPC support
- Built-in containerization and service orchestration
- Seamless scaling of microservices
- One-command cloud deployment
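As an illustration of the protocol point, the same Executor can be exposed over gRPC, HTTP, or WebSockets by switching a single argument. A minimal sketch, assuming the StableLM Executor defined later in this README:

```python
from jina import Deployment

# The protocol is a constructor argument; 'grpc', 'http' and 'websocket' are supported.
grpc_service = Deployment(uses=StableLM, protocol='grpc', port=12345)
http_service = Deployment(uses=StableLM, protocol='http', port=8080)
```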
Install via pip:

```bash
pip install jina
```

See the guides for Apple Silicon and Windows.
Three main layers:
- Data: BaseDoc and DocList for input/output
- Serving: Executors process Documents, Gateway connects services
- Orchestration: Deployments serve Executors, Flows create pipelines
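The three layers show up in even the smallest service. A minimal sketch (the Greeting schema and Echo Executor below are illustrative only, not part of the later examples):

```python
from docarray import BaseDoc, DocList            # Data layer: document schemas
from jina import Executor, requests, Deployment  # Serving and Orchestration layers


class Greeting(BaseDoc):
    text: str


class Echo(Executor):
    # Serving layer: an Executor exposes endpoints that process Documents
    @requests
    def echo(self, docs: DocList[Greeting], **kwargs) -> DocList[Greeting]:
        for doc in docs:
            doc.text = f'echo: {doc.text}'
        return docs


# Orchestration layer: a Deployment serves the Executor; a Flow would chain several
with Deployment(uses=Echo, port=12346) as dep:
    dep.block()
```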
Let's create a gRPC-based AI service using StableLM:
```python
from jina import Executor, requests
from docarray import DocList, BaseDoc
from transformers import pipeline


class Prompt(BaseDoc):
    text: str


class Generation(BaseDoc):
    prompt: str
    text: str


class StableLM(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.generator = pipeline(
            'text-generation', model='stabilityai/stablelm-base-alpha-3b'
        )

    @requests
    def generate(self, docs: DocList[Prompt], **kwargs) -> DocList[Generation]:
        generations = DocList[Generation]()
        prompts = docs.text
        llm_outputs = self.generator(prompts)
        for prompt, output in zip(prompts, llm_outputs):
            # the pipeline returns a list of candidates per prompt; take the first one
            generations.append(
                Generation(prompt=prompt, text=output[0]['generated_text'])
            )
        return generations
```
Deploy with Python or YAML:
```python
from jina import Deployment
from executor import StableLM

dep = Deployment(uses=StableLM, timeout_ready=-1, port=12345)

with dep:
    dep.block()
```
```yaml
jtype: Deployment
with:
  uses: StableLM
  py_modules:
    - executor.py
  timeout_ready: -1
  port: 12345
```
Use the client:
```python
from jina import Client
from docarray import DocList
from executor import Prompt, Generation

prompt = Prompt(text='suggest an interesting image generation prompt')
client = Client(port=12345)
response = client.post('/', inputs=[prompt], return_type=DocList[Generation])
```
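The response comes back as a DocList of the declared return type, so the generated fields can be read directly, for example:

```python
for generation in response:
    print(generation.prompt, '->', generation.text)
```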
Chain services into a Flow:
```python
from jina import Flow

flow = Flow(port=12345).add(uses=StableLM).add(uses=TextToImage)

with flow:
    flow.block()
```
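TextToImage is not defined in the snippet above. A hypothetical sketch of such an Executor, assuming Hugging Face diffusers and the Generation schema defined earlier (the checkpoint name and module path are assumptions):

```python
# Hypothetical sketch only: one way TextToImage could be implemented.
import numpy as np
from jina import Executor, requests
from docarray import DocList
from docarray.documents import ImageDoc
from diffusers import StableDiffusionPipeline

from executor import Generation  # the schema produced by StableLM above


class TextToImage(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Any diffusers text-to-image checkpoint works; this one is an assumption
        self.pipe = StableDiffusionPipeline.from_pretrained(
            'CompVis/stable-diffusion-v1-4'
        )

    @requests
    def generate_image(self, docs: DocList[Generation], **kwargs) -> DocList[ImageDoc]:
        images = DocList[ImageDoc]()
        for doc in docs:
            # Use the text produced by the previous Executor as the image prompt
            pil_image = self.pipe(doc.text).images[0]
            images.append(ImageDoc(tensor=np.array(pil_image)))
        return images
```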
Boost throughput with built-in features:
- Replicas for parallel processing
- Shards for data partitioning
- Dynamic batching for efficient model inference
Example scaling a Stable Diffusion deployment:
```yaml
jtype: Deployment
with:
  uses: TextToImage
  timeout_ready: -1
  py_modules:
    - text_to_image.py
  env:
    CUDA_VISIBLE_DEVICES: RR
  replicas: 2
  uses_dynamic_batching:
    /default:
      preferred_batch_size: 10
      timeout: 200
```
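The same scaling options can also be set from Python when constructing the Deployment. A minimal sketch (the module name follows py_modules above; dynamic batching is omitted here):

```python
from jina import Deployment
from text_to_image import TextToImage  # assumed module name, matching the YAML

dep = Deployment(
    uses=TextToImage,
    timeout_ready=-1,
    replicas=2,                          # parallel copies of the Executor
    env={'CUDA_VISIBLE_DEVICES': 'RR'},  # round-robin GPU assignment, as in the YAML
)

with dep:
    dep.block()
```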
Package and share Executors via Executor Hub:
- Structure your Executor:
```text
TextToImage/
├── executor.py
├── config.yml
├── requirements.txt
```
- Configure:
```yaml
# config.yml
jtype: TextToImage
py_modules:
  - executor.py
metas:
  name: TextToImage
  description: Text to Image generation Executor
```
- Push to Hub:
```bash
jina hub push TextToImage
```
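Once pushed, the Executor can be referenced by name instead of a local class. The URI scheme below is an assumption based on Hub documentation and may differ by Jina version:

```python
from jina import Deployment

# 'jinahub://' pulls the Executor source from Executor Hub;
# 'jinahub+docker://' would run it as a container instead (assumed schemes).
dep = Deployment(uses='jinahub://TextToImage', timeout_ready=-1)

with dep:
    dep.block()
```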
Export a Flow to Kubernetes and apply the generated resources:

```bash
jina export kubernetes flow.yml ./my-k8s
kubectl apply -R -f my-k8s
```
Or generate a Docker Compose configuration and run it:

```bash
jina export docker-compose flow.yml docker-compose.yml
docker-compose up
```
Deploy to Jina AI Cloud with a single command:

```bash
jina cloud deploy jcloud-flow.yml
```
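The jcloud-flow.yml referenced above is a regular Flow YAML. A hypothetical sketch (the executor URI is a placeholder, not part of this README):

```yaml
# jcloud-flow.yml (hypothetical sketch)
jtype: Flow
executors:
  - name: stablelm
    uses: jinahub+docker://StableLM  # assumed Hub URI; Jina AI Cloud runs containerized Executors
```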
Enable token-by-token streaming for responsive LLM applications:
- Define schemas:
```python
from docarray import BaseDoc


class PromptDocument(BaseDoc):
    prompt: str
    max_tokens: int


class ModelOutputDocument(BaseDoc):
    token_id: int
    generated_text: str
```
- Initialize service:
```python
import torch
from jina import Executor, requests
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# module-level tokenizer used by the streaming endpoint below
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')


class TokenStreamingExecutor(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.model = GPT2LMHeadModel.from_pretrained('gpt2')
```
- Implement streaming:
```python
class TokenStreamingExecutor(Executor):
    ...

    @requests(on='/stream')
    async def task(self, doc: PromptDocument, **kwargs) -> ModelOutputDocument:
        input = tokenizer(doc.prompt, return_tensors='pt')
        input_len = input['input_ids'].shape[1]

        for _ in range(doc.max_tokens):
            output = self.model.generate(**input, max_new_tokens=1)
            if output[0][-1] == tokenizer.eos_token_id:
                break
            yield ModelOutputDocument(
                token_id=output[0][-1],
                generated_text=tokenizer.decode(
                    output[0][input_len:], skip_special_tokens=True
                ),
            )
            input = {
                'input_ids': output,
                'attention_mask': torch.ones(1, len(output[0])),
            }
```
- Serve and use:
```python
# Server
from jina import Deployment

with Deployment(uses=TokenStreamingExecutor, port=12345, protocol='grpc') as dep:
    dep.block()
```

```python
# Client
import asyncio
from jina import Client


async def main():
    client = Client(port=12345, protocol='grpc', asyncio=True)
    async for doc in client.stream_doc(
        on='/stream',
        inputs=PromptDocument(prompt='what is the capital of France ?', max_tokens=10),
        return_type=ModelOutputDocument,
    ):
        print(doc.generated_text)


asyncio.run(main())
```
Jina-serve is backed by Jina AI and licensed under Apache-2.0.