jina-ai/serve

☁️ Build multimodal AI applications with cloud-native stack
Jina-serve is a framework for building and deploying AI services that communicate via gRPC, HTTP and WebSockets. Scale your services from local development to production while focusing on your core logic.
- Native support for all major ML frameworks and data types
- High-performance service design with scaling, streaming, and dynamic batching
- LLM serving with streaming output
- Built-in Docker integration and Executor Hub
- One-click deployment to Jina AI Cloud
- Enterprise-ready with Kubernetes and Docker Compose support
Comparison with FastAPI
Key advantages over FastAPI:
- DocArray-based data handling with native gRPC support
- Built-in containerization and service orchestration
- Seamless scaling of microservices
- One-command cloud deployment
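As an illustration of the protocol point, the same Executor can be exposed over gRPC, HTTP, or WebSockets by switching a single argument. A minimal sketch, assuming the StableLM Executor defined later in this README:

```python
from jina import Deployment

# The protocol is a constructor argument; 'grpc', 'http' and 'websocket' are supported.
grpc_service = Deployment(uses=StableLM, protocol='grpc', port=12345)
http_service = Deployment(uses=StableLM, protocol='http', port=8080)
```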
Install via pip:

```bash
pip install jina
```

See the guides for Apple Silicon and Windows.
Three main layers:
- Data: BaseDoc and DocList for input/output
- Serving: Executors process Documents, Gateway connects services
- Orchestration: Deployments serve Executors, Flows create pipelines
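The three layers show up in even the smallest service. A minimal sketch (the Greeting schema and Echo Executor below are illustrative only, not part of the later examples):

```python
from docarray import BaseDoc, DocList            # Data layer: document schemas
from jina import Executor, requests, Deployment  # Serving and Orchestration layers


class Greeting(BaseDoc):
    text: str


class Echo(Executor):
    # Serving layer: an Executor exposes endpoints that process Documents
    @requests
    def echo(self, docs: DocList[Greeting], **kwargs) -> DocList[Greeting]:
        for doc in docs:
            doc.text = f'echo: {doc.text}'
        return docs


# Orchestration layer: a Deployment serves the Executor; a Flow would chain several
with Deployment(uses=Echo, port=12346) as dep:
    dep.block()
```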
Let's create a gRPC-based AI service using StableLM:
```python
from jina import Executor, requests
from docarray import DocList, BaseDoc
from transformers import pipeline


class Prompt(BaseDoc):
    text: str


class Generation(BaseDoc):
    prompt: str
    text: str


class StableLM(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.generator = pipeline(
            'text-generation', model='stabilityai/stablelm-base-alpha-3b'
        )

    @requests
    def generate(self, docs: DocList[Prompt], **kwargs) -> DocList[Generation]:
        generations = DocList[Generation]()
        prompts = docs.text
        llm_outputs = self.generator(prompts)
        for prompt, output in zip(prompts, llm_outputs):
            # the pipeline returns a list of candidates per prompt; take the first one
            generations.append(
                Generation(prompt=prompt, text=output[0]['generated_text'])
            )
        return generations
```
Deploy with Python or YAML:
```python
from jina import Deployment
from executor import StableLM

dep = Deployment(uses=StableLM, timeout_ready=-1, port=12345)

with dep:
    dep.block()
```
```yaml
jtype: Deployment
with:
  uses: StableLM
  py_modules:
    - executor.py
  timeout_ready: -1
  port: 12345
```
Use the client:
```python
from jina import Client
from docarray import DocList
from executor import Prompt, Generation

prompt = Prompt(text='suggest an interesting image generation prompt')
client = Client(port=12345)
response = client.post('/', inputs=[prompt], return_type=DocList[Generation])
```
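The response comes back as a DocList of the declared return type, so the generated fields can be read directly, for example:

```python
for generation in response:
    print(generation.prompt, '->', generation.text)
```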
Chain services into a Flow:
```python
from jina import Flow

flow = Flow(port=12345).add(uses=StableLM).add(uses=TextToImage)

with flow:
    flow.block()
```
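TextToImage is not defined in the snippet above. A hypothetical sketch of such an Executor, assuming Hugging Face diffusers and the Generation schema defined earlier (the checkpoint name and module path are assumptions):

```python
# Hypothetical sketch only: one way TextToImage could be implemented.
import numpy as np
from jina import Executor, requests
from docarray import DocList
from docarray.documents import ImageDoc
from diffusers import StableDiffusionPipeline

from executor import Generation  # the schema produced by StableLM above


class TextToImage(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Any diffusers text-to-image checkpoint works; this one is an assumption
        self.pipe = StableDiffusionPipeline.from_pretrained(
            'CompVis/stable-diffusion-v1-4'
        )

    @requests
    def generate_image(self, docs: DocList[Generation], **kwargs) -> DocList[ImageDoc]:
        images = DocList[ImageDoc]()
        for doc in docs:
            # Use the text produced by the previous Executor as the image prompt
            pil_image = self.pipe(doc.text).images[0]
            images.append(ImageDoc(tensor=np.array(pil_image)))
        return images
```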
Boost throughput with built-in features:
- Replicas for parallel processing
- Shards for data partitioning
- Dynamic batching for efficient model inference
Example scaling a Stable Diffusion deployment:
```yaml
jtype: Deployment
with:
  uses: TextToImage
  timeout_ready: -1
  py_modules:
    - text_to_image.py
  env:
    CUDA_VISIBLE_DEVICES: RR
  replicas: 2
  uses_dynamic_batching:
    /default:
      preferred_batch_size: 10
      timeout: 200
```
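The same scaling options can also be set from Python when constructing the Deployment. A minimal sketch (the module name follows py_modules above; dynamic batching is omitted here):

```python
from jina import Deployment
from text_to_image import TextToImage  # assumed module name, matching the YAML

dep = Deployment(
    uses=TextToImage,
    timeout_ready=-1,
    replicas=2,                          # parallel copies of the Executor
    env={'CUDA_VISIBLE_DEVICES': 'RR'},  # round-robin GPU assignment, as in the YAML
)

with dep:
    dep.block()
```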
Package and share Executors via Executor Hub:
- Structure your Executor:
```text
TextToImage/
├── executor.py
├── config.yml
├── requirements.txt
```
- Configure:
```yaml
# config.yml
jtype: TextToImage
py_modules:
  - executor.py
metas:
  name: TextToImage
  description: Text to Image generation Executor
```
- Push to Hub:
```bash
jina hub push TextToImage
```
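Once pushed, the Executor can be referenced by name instead of a local class. The URI scheme below is an assumption based on Hub documentation and may differ by Jina version:

```python
from jina import Deployment

# 'jinahub://' pulls the Executor source from Executor Hub;
# 'jinahub+docker://' would run it as a container instead (assumed schemes).
dep = Deployment(uses='jinahub://TextToImage', timeout_ready=-1)

with dep:
    dep.block()
```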
Export a Flow to Kubernetes and apply the generated resources:

```bash
jina export kubernetes flow.yml ./my-k8s
kubectl apply -R -f my-k8s
```
Or generate a Docker Compose configuration and run it:

```bash
jina export docker-compose flow.yml docker-compose.yml
docker-compose up
```
Deploy to Jina AI Cloud with a single command:

```bash
jina cloud deploy jcloud-flow.yml
```
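The jcloud-flow.yml referenced above is a regular Flow YAML. A hypothetical sketch (the executor URI is a placeholder, not part of this README):

```yaml
# jcloud-flow.yml (hypothetical sketch)
jtype: Flow
executors:
  - name: stablelm
    uses: jinahub+docker://StableLM  # assumed Hub URI; Jina AI Cloud runs containerized Executors
```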
Enable token-by-token streaming for responsive LLM applications:
- Define schemas:
```python
from docarray import BaseDoc


class PromptDocument(BaseDoc):
    prompt: str
    max_tokens: int


class ModelOutputDocument(BaseDoc):
    token_id: int
    generated_text: str
```
- Initialize service:
```python
import torch
from jina import Executor, requests
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# module-level tokenizer used by the streaming endpoint below
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')


class TokenStreamingExecutor(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.model = GPT2LMHeadModel.from_pretrained('gpt2')
```
- Implement streaming:
```python
class TokenStreamingExecutor(Executor):
    ...

    @requests(on='/stream')
    async def task(self, doc: PromptDocument, **kwargs) -> ModelOutputDocument:
        input = tokenizer(doc.prompt, return_tensors='pt')
        input_len = input['input_ids'].shape[1]

        for _ in range(doc.max_tokens):
            output = self.model.generate(**input, max_new_tokens=1)
            if output[0][-1] == tokenizer.eos_token_id:
                break
            yield ModelOutputDocument(
                token_id=output[0][-1],
                generated_text=tokenizer.decode(
                    output[0][input_len:], skip_special_tokens=True
                ),
            )
            input = {
                'input_ids': output,
                'attention_mask': torch.ones(1, len(output[0])),
            }
```
- Serve and use:
```python
# Server
from jina import Deployment

with Deployment(uses=TokenStreamingExecutor, port=12345, protocol='grpc') as dep:
    dep.block()
```

```python
# Client
import asyncio
from jina import Client


async def main():
    client = Client(port=12345, protocol='grpc', asyncio=True)
    async for doc in client.stream_doc(
        on='/stream',
        inputs=PromptDocument(prompt='what is the capital of France ?', max_tokens=10),
        return_type=ModelOutputDocument,
    ):
        print(doc.generated_text)


asyncio.run(main())
```
Jina-serve is backed by Jina AI and licensed under Apache-2.0.