jina-ai/serve

☁️ Build multimodal AI applications with a cloud-native stack
Jina-serve is a framework for building and deploying AI services that communicate via gRPC, HTTP and WebSockets. Scale your services from local development to production while focusing on your core logic.
- Native support for all major ML frameworks and data types
- High-performance service design with scaling, streaming, and dynamic batching
- LLM serving with streaming output
- Built-in Docker integration and Executor Hub
- One-click deployment to Jina AI Cloud
- Enterprise-ready with Kubernetes and Docker Compose support
Comparison with FastAPI
Key advantages over FastAPI:
- DocArray-based data handling with native gRPC support
- Built-in containerization and service orchestration
- Seamless scaling of microservices
- One-command cloud deployment
```bash
pip install jina
```
See the guides for Apple Silicon and Windows.
Three main layers:
- Data: BaseDoc and DocList for input/output (see the sketch after this list)
- Serving: Executors process Documents, Gateway connects services
- Orchestration: Deployments serve Executors, Flows create pipelines
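As a quick illustration of the data layer, here is a minimal sketch of defining a schema and a typed DocList; the field names are illustrative, not part of the repository:

```python
from docarray import BaseDoc, DocList


# A schema is just a BaseDoc subclass with typed fields
class Prompt(BaseDoc):
    text: str


# DocList is a typed container of documents; Executors receive and return these
docs = DocList[Prompt]([Prompt(text='a cat'), Prompt(text='a dog')])
print(docs.text)  # field access is vectorized across the list -> ['a cat', 'a dog']
```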
Let's create a gRPC-based AI service using StableLM:
```python
from jina import Executor, requests
from docarray import DocList, BaseDoc
from transformers import pipeline


class Prompt(BaseDoc):
    text: str


class Generation(BaseDoc):
    prompt: str
    text: str


class StableLM(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.generator = pipeline(
            'text-generation', model='stabilityai/stablelm-base-alpha-3b'
        )

    @requests
    def generate(self, docs: DocList[Prompt], **kwargs) -> DocList[Generation]:
        generations = DocList[Generation]()
        prompts = docs.text
        llm_outputs = self.generator(prompts)
        for prompt, output in zip(prompts, llm_outputs):
            generations.append(Generation(prompt=prompt, text=output))
        return generations
```
Deploy with Python or YAML:
```python
from jina import Deployment
from executor import StableLM

dep = Deployment(uses=StableLM, timeout_ready=-1, port=12345)

with dep:
    dep.block()
```
```yaml
jtype: Deployment
with:
  uses: StableLM
  py_modules:
    - executor.py
  timeout_ready: -1
  port: 12345
```
Use the client:
```python
from jina import Client
from docarray import DocList
from executor import Prompt, Generation

prompt = Prompt(text='suggest an interesting image generation prompt')
client = Client(port=12345)
response = client.post('/', inputs=[prompt], return_type=DocList[Generation])
```
Chain services into a Flow:
```python
from jina import Flow

flow = Flow(port=12345).add(uses=StableLM).add(uses=TextToImage)

with flow:
    flow.block()
```
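The Flow above references a second Executor, TextToImage, which is not shown in this section. A hypothetical sketch of such an Executor, assuming the diffusers Stable Diffusion pipeline, the Generation schema defined earlier, and docarray's ImageDoc type:

```python
import numpy as np
from docarray import DocList
from docarray.documents import ImageDoc
from jina import Executor, requests

from executor import Generation  # the Generation schema defined above


class TextToImage(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        from diffusers import StableDiffusionPipeline

        # Any Stable Diffusion checkpoint usable with diffusers would do here
        self.pipe = StableDiffusionPipeline.from_pretrained(
            'stabilityai/stable-diffusion-2-1'
        )

    @requests
    def generate(self, docs: DocList[Generation], **kwargs) -> DocList[ImageDoc]:
        images = DocList[ImageDoc]()
        for doc in docs:
            # Use the text produced by StableLM as the image prompt
            pil_image = self.pipe(doc.text).images[0]
            images.append(ImageDoc(tensor=np.array(pil_image)))
        return images
```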
Boost throughput with built-in features:
- Replicas for parallel processing
- Shards for data partitioning
- Dynamic batching for efficient model inference
Example scaling a Stable Diffusion deployment:
```yaml
jtype: Deployment
with:
  uses: TextToImage
  timeout_ready: -1
  py_modules:
    - text_to_image.py
  env:
    CUDA_VISIBLE_DEVICES: RR
  replicas: 2
  uses_dynamic_batching:
    /default:
      preferred_batch_size: 10
      timeout: 200
```
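Replicas and GPU assignment can also be set when constructing a Deployment in Python; a minimal sketch, assuming the Executor module `text_to_image.py` from the YAML above (dynamic batching is shown only in the YAML form here):

```python
from jina import Deployment
from text_to_image import TextToImage

# Two parallel replicas of the Executor; CUDA_VISIBLE_DEVICES=RR round-robins the GPUs
dep = Deployment(
    uses=TextToImage,
    timeout_ready=-1,
    replicas=2,
    env={'CUDA_VISIBLE_DEVICES': 'RR'},
)

with dep:
    dep.block()
```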
Share your Executor via Executor Hub:
- Structure your Executor:
```
TextToImage/
├── executor.py
├── config.yml
├── requirements.txt
```
- Configure:
```yaml
# config.yml
jtype: TextToImage
py_modules:
  - executor.py
metas:
  name: TextToImage
  description: Text to Image generation Executor
```
- Push to Hub:
```bash
jina hub push TextToImage
```
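Once pushed, the Executor can be referenced straight from the Hub in a Flow or Deployment. A sketch assuming the `jinaai://<namespace>/<executor>` Hub scheme, with `my-org` as a placeholder namespace:

```python
from jina import Flow

# Pull the Executor from Executor Hub by reference instead of importing it locally
flow = Flow().add(uses='jinaai://my-org/TextToImage')
```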
Deploy a Flow to Kubernetes:
```bash
jina export kubernetes flow.yml ./my-k8s
kubectl apply -R -f my-k8s
```
Or use Docker Compose:
```bash
jina export docker-compose flow.yml docker-compose.yml
docker-compose up
```
Deploy to Jina AI Cloud with a single command:
```bash
jina cloud deploy jcloud-flow.yml
```
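The `jcloud-flow.yml` referenced above is an ordinary Flow YAML. A minimal, hypothetical sketch, assuming the containerized `jinaai+docker://` Hub scheme and the placeholder `my-org` namespace:

```yaml
# jcloud-flow.yml (illustrative sketch, not taken from the JCloud docs)
jtype: Flow
executors:
  - uses: jinaai+docker://my-org/TextToImage
```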
Enable token-by-token streaming for responsive LLM applications:
- Define schemas:
```python
from docarray import BaseDoc


class PromptDocument(BaseDoc):
    prompt: str
    max_tokens: int


class ModelOutputDocument(BaseDoc):
    token_id: int
    generated_text: str
```
- Initialize service:
```python
import torch
from jina import Executor, requests
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')


class TokenStreamingExecutor(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.model = GPT2LMHeadModel.from_pretrained('gpt2')
```
- Implement streaming:
```python
    @requests(on='/stream')
    async def task(self, doc: PromptDocument, **kwargs) -> ModelOutputDocument:
        input = tokenizer(doc.prompt, return_tensors='pt')
        input_len = input['input_ids'].shape[1]
        for _ in range(doc.max_tokens):
            output = self.model.generate(**input, max_new_tokens=1)
            if output[0][-1] == tokenizer.eos_token_id:
                break
            yield ModelOutputDocument(
                token_id=output[0][-1],
                generated_text=tokenizer.decode(
                    output[0][input_len:], skip_special_tokens=True
                ),
            )
            input = {
                'input_ids': output,
                'attention_mask': torch.ones(1, len(output[0])),
            }
```
- Serve and use:
```python
import asyncio

from jina import Client, Deployment

# Server
with Deployment(uses=TokenStreamingExecutor, port=12345, protocol='grpc') as dep:
    dep.block()


# Client
async def main():
    client = Client(port=12345, protocol='grpc', asyncio=True)
    async for doc in client.stream_doc(
        on='/stream',
        inputs=PromptDocument(prompt='what is the capital of France ?', max_tokens=10),
        return_type=ModelOutputDocument,
    ):
        print(doc.generated_text)


asyncio.run(main())
```
Jina-serve is backed by Jina AI and licensed under Apache-2.0.