mlc-ai/web-llm


High-Performance In-Browser LLM Inference Engine.

Documentation | Blogpost | Paper | Examples

Overview

WebLLM is a high-performance in-browser LLM inference engine that brings language model inference directly onto web browsers with hardware acceleration. Everything runs inside the browser with no server support and is accelerated with WebGPU.

WebLLM is fully compatible with the OpenAI API. That is, you can use the same OpenAI API on any open-source model locally, with functionality including streaming, JSON mode, function calling (WIP), etc.

This opens up many opportunities to build AI assistants for everyone while preserving privacy and benefiting from GPU acceleration.

You can use WebLLM as a base npm package and build your own web application on top of it by following the examples below. This project is a companion project of MLC LLM, which enables universal deployment of LLMs across hardware environments.

Key Features

  • In-Browser Inference: WebLLM is a high-performance, in-browser language model inference engine that leverages WebGPU for hardware acceleration, enabling powerful LLM operations directly within web browsers without server-side processing.

  • Full OpenAI API Compatibility: Seamlessly integrate your app with WebLLM using OpenAI API with functionalities such as streaming, JSON-mode, logit-level control, seeding, and more.

  • Structured JSON Generation: WebLLM supports state-of-the-art JSON mode structured generation, implemented in the WebAssembly portion of the model library for optimal performance. Check the WebLLM JSON Playground on HuggingFace to try generating JSON output with a custom JSON schema.

  • Extensive Model Support: WebLLM natively supports a range of models including Llama 3, Phi 3, Gemma, Mistral, Qwen (通义千问), and many others, making it versatile for various AI tasks. For the complete supported model list, check MLC Models.

  • Custom Model Integration: Easily integrate and deploy custom models in MLC format, allowing you to adapt WebLLM to specific needs and scenarios, enhancing flexibility in model deployment.

  • Plug-and-Play Integration: Easily integrate WebLLM into your projects using package managers like NPM and Yarn, or directly via CDN, complete with comprehensive examples and a modular design for connecting with UI components.

  • Streaming & Real-Time Interactions: Supports streaming chat completions, allowing real-time output generation which enhances interactive applications like chatbots and virtual assistants.

  • Web Worker & Service Worker Support: Optimize UI performance and manage the lifecycle of models efficiently by offloading computations to separate worker threads or service workers.

  • Chrome Extension Support: Extend the functionality of web browsers through custom Chrome extensions using WebLLM, with examples available for building both basic and advanced extensions.

Built-in Models

Check the complete list of available models on MLC Models. WebLLM supports a subset of these models, and the list can be accessed at prebuiltAppConfig.model_list.

Here are the primary families of models currently supported:

  • Llama: Llama 3, Llama 2, Hermes-2-Pro-Llama-3
  • Phi: Phi 3, Phi 2, Phi 1.5
  • Gemma: Gemma-2B
  • Mistral: Mistral-7B-v0.3, Hermes-2-Pro-Mistral-7B, NeuralHermes-2.5-Mistral-7B, OpenHermes-2.5-Mistral-7B
  • Qwen (通义千问): Qwen2 0.5B, 1.5B, 7B

If you need more models, request a new model by opening an issue, or check Custom Models for how to compile and use your own models with WebLLM.
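As a minimal sketch of browsing such a list, the snippet below filters model records by a substring of their ID. The `ModelEntry` type and the `findModelIds` helper are hypothetical, and `sampleList` merely stands in for `prebuiltAppConfig.model_list`; only the `model_id` field name is taken from the app config shown later in this README, and the third ID is made up for illustration.

```typescript
// Hypothetical minimal shape of one entry; the real records carry more fields.
interface ModelEntry {
  model_id: string;
}

// Return the IDs whose name contains `keyword`, ignoring case.
function findModelIds(modelList: ModelEntry[], keyword: string): string[] {
  const needle = keyword.toLowerCase();
  return modelList
    .map((entry) => entry.model_id)
    .filter((id) => id.toLowerCase().includes(needle));
}

// Illustrative stand-in for prebuiltAppConfig.model_list.
const sampleList: ModelEntry[] = [
  { model_id: "Llama-3.1-8B-Instruct-q4f32_1-MLC" },
  { model_id: "MyLlama-3b-v1-q4f32_0" },
  { model_id: "Example-Phi-q4f16_1" }, // made-up ID
];

console.log(findModelIds(sampleList, "llama"));
```

In an application, the same helper could be applied to the real list to populate a model picker before calling the engine factory.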

Jumpstart with Examples

Learn how to use WebLLM to integrate large language models into your application and generate chat completions through this simple Chatbot example:

Example Chatbot on JSFiddle | Example Chatbot on Codepen

For an advanced example of a larger, more complicated project, check WebLLM Chat.

More examples for different use cases are available in the examples folder.

Get Started

WebLLM offers a minimalist and modular interface to access the chatbot in the browser. The package is designed to hook into any UI component.

Installation

Package Manager

    # npm
    npm install @mlc-ai/web-llm
    # yarn
    yarn add @mlc-ai/web-llm
    # or pnpm
    pnpm install @mlc-ai/web-llm

Then import the module in your code.

    // Import everything
    import * as webllm from "@mlc-ai/web-llm";
    // Or only import what you need
    import { CreateMLCEngine } from "@mlc-ai/web-llm";

CDN Delivery

Thanks to jsdelivr.com, WebLLM can be imported directly through a URL and works out of the box on cloud development platforms like jsfiddle.net, Codepen.io, and Scribbler:

    import * as webllm from "https://esm.run/@mlc-ai/web-llm";

It can also be dynamically imported as:

    const webllm = await import("https://esm.run/@mlc-ai/web-llm");

Create MLCEngine

Most operations in WebLLM are invoked through the MLCEngine interface. You can create an MLCEngine instance and load the model by calling the CreateMLCEngine() factory function.

(Note that loading a model requires downloading it, which can take a significant amount of time on the very first run, before anything is cached. You should handle this asynchronous call properly.)

    import { CreateMLCEngine } from "@mlc-ai/web-llm";
    // Callback function to update model loading progress
    const initProgressCallback = (initProgress) => {
      console.log(initProgress);
    };
    const selectedModel = "Llama-3.1-8B-Instruct-q4f32_1-MLC";
    const engine = await CreateMLCEngine(
      selectedModel,
      { initProgressCallback: initProgressCallback }, // engineConfig
    );
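Because the first load can take a while, it may help to show progress as a readable status line instead of logging the raw report. The sketch below assumes the report carries a `progress` fraction between 0 and 1 and a `text` message; that shape, the `ProgressReport` type, and the `formatProgress` helper are assumptions for illustration, so check the type exported by the package before relying on them.

```typescript
// Assumed shape of the report passed to initProgressCallback.
interface ProgressReport {
  progress: number; // assumed to be a fraction in [0, 1]
  text: string; // assumed human-readable status message
}

// Turn a report into a short status line suitable for a label or a log.
function formatProgress(report: ProgressReport): string {
  // Clamp defensively in case the value falls outside [0, 1].
  const fraction = Math.min(1, Math.max(0, report.progress));
  const percent = Math.round(fraction * 100);
  return `[${percent}%] ${report.text}`;
}

// A callback with this signature could be supplied as initProgressCallback.
const initProgressCallback = (report: ProgressReport): void => {
  console.log(formatProgress(report));
};

initProgressCallback({ progress: 0.42, text: "Fetching model shards" });
```

Keeping the formatting in a pure function makes it easy to reuse the same text in a progress bar, a toast, or a log line.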

Under the hood, this factory function first creates an engine instance (synchronous) and then loads the model (asynchronous). You can also perform these two steps separately in your application.

    import { MLCEngine } from "@mlc-ai/web-llm";
    // This is a synchronous call that returns immediately
    const engine = new MLCEngine({
      initProgressCallback: initProgressCallback,
    });
    // This is an asynchronous call and can take a long time to finish
    await engine.reload(selectedModel);

Chat Completion

After successfully initializing the engine, you can invoke chat completions using OpenAI-style chat APIs through the engine.chat.completions interface. For the full list of parameters and their descriptions, check the Full OpenAI Compatibility section below and the OpenAI API reference.

(Note: The model parameter is not supported and will be ignored here. Instead, call CreateMLCEngine(model) or engine.reload(model), as shown in Create MLCEngine above.)

    const messages = [
      { role: "system", content: "You are a helpful AI assistant." },
      { role: "user", content: "Hello!" },
    ];
    const reply = await engine.chat.completions.create({
      messages,
    });
    console.log(reply.choices[0].message);
    console.log(reply.usage);
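For a multi-turn chat, one common approach is to keep the conversation in an array and append each reply before sending the next request. The sketch below assumes the OpenAI-style convention that every request carries the whole conversation; the `ChatMessage` type and the `appendMessage` helper are hypothetical, and `replyText` is a placeholder rather than real model output.

```typescript
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

// Return a new conversation with one more message; the input is not mutated.
function appendMessage(
  turns: readonly ChatMessage[],
  role: Role,
  content: string,
): ChatMessage[] {
  return [...turns, { role, content }];
}

let conversation: ChatMessage[] = [
  { role: "system", content: "You are a helpful AI assistant." },
];
conversation = appendMessage(conversation, "user", "Hello!");

// In an application this would come from reply.choices[0].message.content.
const replyText = "Hi! How can I help?"; // placeholder, not real model output
conversation = appendMessage(conversation, "assistant", replyText);
conversation = appendMessage(conversation, "user", "Tell me a joke.");

// `conversation` is what the next request would pass as `messages`.
console.log(conversation.map((m) => m.role).join(" -> "));
```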

Streaming

WebLLM also supports streaming chat completion generation. To use it, simply pass stream: true to the engine.chat.completions.create call.

    const messages = [
      { role: "system", content: "You are a helpful AI assistant." },
      { role: "user", content: "Hello!" },
    ];
    // Chunks is an AsyncGenerator object
    const chunks = await engine.chat.completions.create({
      messages,
      temperature: 1,
      stream: true, // <-- Enable streaming
      stream_options: { include_usage: true },
    });
    let reply = "";
    for await (const chunk of chunks) {
      reply += chunk.choices[0]?.delta.content || "";
      console.log(reply);
      if (chunk.usage) {
        console.log(chunk.usage); // only last chunk has usage
      }
    }
    const fullReply = await engine.getMessage();
    console.log(fullReply);
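The accumulation logic in the loop above can be factored into a small pure function, which keeps UI code simple and is easy to test without a model. The `StreamChunk` and `StreamState` types and both helpers below are hypothetical; the chunk shape only mirrors the fields used in the example above and is an assumption about what the stream yields.

```typescript
// Assumed minimal shape of one streamed chunk (only the fields used above).
interface StreamChunk {
  choices: { delta: { content?: string | null } }[];
  usage?: { prompt_tokens: number; completion_tokens: number };
}

interface StreamState {
  text: string;
  usage?: StreamChunk["usage"];
}

// Fold one chunk into the accumulated state without mutating the old state.
function applyChunk(state: StreamState, chunk: StreamChunk): StreamState {
  const piece = chunk.choices[0]?.delta.content ?? "";
  return {
    text: state.text + piece,
    usage: chunk.usage ?? state.usage,
  };
}

// Drain any async iterable of chunks, such as the one returned when stream is true.
async function collectStream(
  chunks: AsyncIterable<StreamChunk>,
): Promise<StreamState> {
  let state: StreamState = { text: "" };
  for await (const chunk of chunks) {
    state = applyChunk(state, chunk);
  }
  return state;
}

// Synchronous demo with hand-written chunks instead of a real stream.
const demoChunks: StreamChunk[] = [
  { choices: [{ delta: { content: "Hel" } }] },
  { choices: [{ delta: { content: "lo" } }] },
  { choices: [], usage: { prompt_tokens: 3, completion_tokens: 2 } },
];
console.log(demoChunks.reduce<StreamState>(applyChunk, { text: "" }));
```

With a real engine, `collectStream` would be given the value returned by the streaming call, and `applyChunk` could also drive incremental UI updates.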

Advanced Usage

Using Workers

You can put the heavy computation in a worker script to optimize your application performance. To do so, you need to:

  1. Create a handler in the worker thread that communicates with the frontend while handling the requests.
  2. Create a Worker Engine in your main application, which under the hood sends messages to the handler in the worker thread.

For detailed implementations of different kinds of Workers, check the following sections.

Dedicated Web Worker

WebLLM comes with API support for WebWorker, so you can hook the generation process into a separate worker thread and keep its computation from disrupting the UI.

We create a handler in the worker thread that communicates with the frontend while handling the requests.

    // worker.ts
    import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";
    // A handler that resides in the worker thread
    const handler = new WebWorkerMLCEngineHandler();
    self.onmessage = (msg: MessageEvent) => {
      handler.onmessage(msg);
    };

In the main logic, we create a WebWorkerMLCEngine that implements the same MLCEngineInterface. The rest of the logic remains the same.

    // main.ts
    import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";
    async function main() {
      // Use a WebWorkerMLCEngine instead of MLCEngine here
      const engine = await CreateWebWorkerMLCEngine(
        new Worker(new URL("./worker.ts", import.meta.url), {
          type: "module",
        }),
        selectedModel,
        { initProgressCallback }, // engineConfig
      );
      // everything else remains the same
    }

Use Service Worker

WebLLM comes with API support for ServiceWorker, so you can hook the generation process into a service worker to avoid reloading the model on every page visit and to improve your application's offline experience.

(Note: A service worker's life cycle is managed by the browser, and it can be killed at any time without notifying the web app. ServiceWorkerMLCEngine tries to keep the service worker thread alive by periodically sending heartbeat events, but your application should also include proper error handling. Check keepAliveMs and missedHeatbeat in ServiceWorkerMLCEngine for more details.)
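To make the heartbeat idea concrete, here is a minimal sketch of how an application might track unanswered keep-alive pings and decide when to recover. This is not the ServiceWorkerMLCEngine implementation; the `HeartbeatMonitor` class, its method names, and the `maxMissed` threshold are all hypothetical.

```typescript
// Hypothetical monitor; it only illustrates the counting logic.
class HeartbeatMonitor {
  private missed = 0;
  private readonly maxMissed: number;

  constructor(maxMissed: number) {
    this.maxMissed = maxMissed;
  }

  // Call each time a keep-alive ping is sent to the worker.
  onPingSent(): void {
    this.missed += 1;
  }

  // Call each time the worker answers with a heartbeat.
  onHeartbeat(): void {
    this.missed = 0;
  }

  // True once the number of pings sent since the last heartbeat exceeds maxMissed.
  isLikelyDead(): boolean {
    return this.missed > this.maxMissed;
  }
}

const monitor = new HeartbeatMonitor(2);
monitor.onPingSent();
monitor.onHeartbeat(); // the worker answered, so the counter resets
monitor.onPingSent();
monitor.onPingSent();
monitor.onPingSent(); // three pings in a row without an answer
console.log(monitor.isLikelyDead());
```

In a page, `onPingSent` would typically be driven by a timer, and a positive `isLikelyDead()` result could trigger re-registering the worker or reloading the model.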

We create a handler in the worker thread that communicates with the frontend while handling the requests.

    // sw.ts
    import { ServiceWorkerMLCEngineHandler } from "@mlc-ai/web-llm";
    let handler: ServiceWorkerMLCEngineHandler;
    self.addEventListener("activate", function (event) {
      handler = new ServiceWorkerMLCEngineHandler();
      console.log("Service Worker is ready");
    });

Then in the main logic, we register the service worker and create the engine using the CreateServiceWorkerMLCEngine function. The rest of the logic remains the same.

    // main.ts
    import { MLCEngineInterface, CreateServiceWorkerMLCEngine } from "@mlc-ai/web-llm";
    if ("serviceWorker" in navigator) {
      navigator.serviceWorker.register(
        new URL("sw.ts", import.meta.url), // worker script
        { type: "module" },
      );
    }
    const engine: MLCEngineInterface = await CreateServiceWorkerMLCEngine(
      selectedModel,
      { initProgressCallback }, // engineConfig
    );

You can find a complete example of running WebLLM in a service worker in examples/service-worker.

Chrome Extension

You can also find examples of building Chrome extensions with WebLLM in examples/chrome-extension and examples/chrome-extension-webgpu-service-worker. The latter leverages a service worker, so the extension is persistent in the background. Additionally, you can explore WebLLM Assistant, another full Chrome extension project that leverages WebLLM.

Full OpenAI Compatibility

WebLLM is designed to be fully compatible with the OpenAI API. Thus, besides building a simple chatbot, you can also use the following functionality with WebLLM:

  • streaming: return output as chunks in real time in the form of an AsyncGenerator.
  • json-mode: efficiently ensure output is in JSON format; see the OpenAI Reference for more.
  • seed-to-reproduce: use seeding to ensure reproducible output with the seed field.
  • function-calling (WIP): function calling with the tools and tool_choice fields (preliminary support), or manual function calling without tools or tool_choice (keeps the most flexibility).
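As a sketch of how JSON mode and seeding could be combined, the request object below uses the `response_format` and `seed` fields from the OpenAI chat completions convention, and a small helper parses the reply defensively. The prompt, the seed value, the `ParseResult` type, and `parseJsonReply` are illustrative assumptions; the sample string passed to the helper is hand-written, not model output.

```typescript
// Request fields follow the OpenAI chat completions convention; values are illustrative.
const jsonRequest = {
  messages: [
    { role: "user", content: "Reply with a JSON object holding a `city` field." },
  ],
  response_format: { type: "json_object" }, // ask for JSON-mode output
  seed: 42, // fixed seed for reproducible sampling
};

type ParseResult =
  | { ok: true; value: unknown }
  | { ok: false; error: string };

// Parse the reply text without letting a malformed reply throw.
function parseJsonReply(content: string | null | undefined): ParseResult {
  if (!content) {
    return { ok: false, error: "empty reply" };
  }
  try {
    return { ok: true, value: JSON.parse(content) };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}

console.log(Object.keys(jsonRequest));
console.log(parseJsonReply('{"city": "Paris"}'));
```

In an application, `jsonRequest` would be passed to the chat completions call and the content of the first choice handed to `parseJsonReply`.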

Custom Models

WebLLM works as a companion project of MLC LLM and supports custom models in MLC format. It reuses the model artifacts and build flow of MLC LLM. To compile and use your own models with WebLLM, check the MLC LLM documentation on how to compile and deploy new model weights and libraries to WebLLM.

Here, we go over the high-level idea. There are two elements of the WebLLM package that enable new models and weight variants.

  • model: Contains a URL to model artifacts, such as weights and metadata.
  • model_lib: A URL to the WebAssembly library (i.e. wasm file) that contains the executables to accelerate the model computations.

Both are customizable in WebLLM.

    import { CreateMLCEngine } from "@mlc-ai/web-llm";
    async function main() {
      const appConfig = {
        "model_list": [
          {
            "model": "/url/to/my/llama",
            "model_id": "MyLlama-3b-v1-q4f32_0",
            "model_lib": "/url/to/myllama3b.wasm",
          },
        ],
      };
      // override default
      const chatOpts = {
        "repetition_penalty": 1.01,
      };
      // Load the custom model with a chat option override and app config.
      // Under the hood, it will load the model from "/url/to/my/llama"
      // and cache it in the browser cache.
      // The chat will also load the model library from "/url/to/myllama3b.wasm",
      // assuming that it is compatible with the model at "/url/to/my/llama".
      const engine = await CreateMLCEngine(
        "MyLlama-3b-v1-q4f32_0",
        { appConfig }, // engineConfig
        chatOpts,
      );
    }

In many cases, we only want to supply a new model weight variant, not necessarily a new model (e.g. NeuralHermes-Mistral can reuse Mistral's model library). For examples of how a model library can be shared by different model variants, see webllm.prebuiltAppConfig.
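A minimal sketch of that sharing pattern: two records with different weights point at the same `model_lib`. All URLs and model IDs below are hypothetical, as are the `ModelRecord` type and the `groupByLib` helper; only the three field names come from the app config shown above.

```typescript
interface ModelRecord {
  model: string; // URL of the weights and metadata
  model_id: string; // identifier used to select the model
  model_lib: string; // URL of the compiled wasm library
}

// Hypothetical library URL shared by both weight variants.
const sharedLib = "/url/to/mistral-7b.wasm";

const appConfig: { model_list: ModelRecord[] } = {
  model_list: [
    { model: "/url/to/mistral", model_id: "MyMistral-7b-q4f16_1", model_lib: sharedLib },
    { model: "/url/to/neuralhermes", model_id: "MyNeuralHermes-7b-q4f16_1", model_lib: sharedLib },
  ],
};

// Group model IDs by the library they load, to see which variants share one.
function groupByLib(records: ModelRecord[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const record of records) {
    const ids = groups.get(record.model_lib) ?? [];
    ids.push(record.model_id);
    groups.set(record.model_lib, ids);
  }
  return groups;
}

console.log(groupByLib(appConfig.model_list));
```

Sharing a library this way only makes sense when the variants have the same architecture and quantization as the model the library was compiled for.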

Build WebLLM Package From Source

NOTE: You don't need to build from source unless you would like to modify the WebLLM package. To use the npm package, simply follow Get Started or any of the examples instead.

To build from source, simply run:

    npm install
    npm run build

Then, to test the effects of your code change in an example, open examples/get-started/package.json and replace the version "^0.2.78" of "@mlc-ai/web-llm" with the path ../.. so that the example uses your local build.

Then run:

    cd examples/get-started
    npm install
    npm start

Note that sometimes you need to switch between file:../.. and ../.. to make npm recognize new changes. In the worst case, you can run:

    cd examples/get-started
    rm -rf node_modules dist package-lock.json .parcel-cache
    npm install
    npm start

In case you need to build TVMjs from source

WebLLM's runtime largely depends on TVMjs: https://github.com/apache/tvm/tree/main/web

While it is also available as an npm package (https://www.npmjs.com/package/@mlc-ai/web-runtime), you can build it from source if needed by following the steps below.

  1. Install emscripten. It is an LLVM-based compiler that compiles C/C++ source code to WebAssembly.

    • Follow the installation instructions to install the latest emsdk.
    • Source emsdk_env.sh with source path/to/emsdk_env.sh, so that emcc is reachable from PATH and the emcc command works.

    You can verify the installation by running emcc in a terminal.

    Note: We recently found that using the latest emcc version may cause issues at runtime. For now, use ./emsdk install 3.1.56 instead of ./emsdk install latest as a workaround. The error may look like:

    Init error, LinkError: WebAssembly.instantiate(): Import #6 module="wasi_snapshot_preview1" function="proc_exit": function import requires a callable

  2. In ./package.json, change "@mlc-ai/web-runtime": "0.18.0-dev2", to "@mlc-ai/web-runtime": "file:./tvm_home/web",.

  3. Set up the necessary environment

    Prepare all the necessary dependencies for the web build:

    ./scripts/prep_deps.sh

    In this step, if $TVM_SOURCE_DIR is not defined in the environment, the script executes the following line to build the tvmjs dependency:

    git clone https://github.com/mlc-ai/relax 3rdparty/tvm-unity --recursive

    This clones the current HEAD of mlc-ai/relax. However, it may not always be the correct branch or commit to clone. To build a specific npm version from source, refer to the version bump PR, which states which branch (i.e. mlc-ai/relax or apache/tvm) and which commit the current WebLLM version depends on. For instance, version 0.2.52, according to its version bump PR #521, is built by checking out the commit https://github.com/apache/tvm/commit/e6476847753c80e054719ac47bc2091c888418b6 in apache/tvm, rather than the HEAD of mlc-ai/relax.

    In addition, --recursive is required. Otherwise, you may encounter errors like fatal error: 'dlpack/dlpack.h' file not found.

  4. Build WebLLM Package

    npm run build
  5. Validate some of the sub-packages

    You can then go to the subfolders in examples to validate some of the sub-packages. We use Parcel v2 for bundling. However, Parcel is sometimes not very good at tracking parent directory changes. When you make a change in the WebLLM package, try editing the package.json of the subfolder and saving it, which will trigger Parcel to rebuild.

Acknowledgement

This project is initiated by members from CMU Catalyst, UW SAMPL, SJTU, OctoML, and the MLC community. We would love to continue developing and supporting the open-source ML community.

This project is only possible thanks to the shoulders of the open-source ecosystems that we stand on. We want to thank the Apache TVM community and developers of the TVM Unity effort. The open-source ML community members made these models publicly available. PyTorch and Hugging Face communities make these models accessible. We would like to thank the teams behind Vicuna, SentencePiece, LLaMA, and Alpaca. We also would like to thank the WebAssembly, Emscripten, and WebGPU communities. Finally, thanks to Dawn and WebGPU developers.

Citation

If you find this project to be useful, please cite:

    @misc{ruan2024webllmhighperformanceinbrowserllm,
      title={WebLLM: A High-Performance In-Browser LLM Inference Engine},
      author={Charlie F. Ruan and Yucheng Qin and Xun Zhou and Ruihang Lai and Hongyi Jin and Yixin Dong and Bohan Hou and Meng-Shiun Yu and Yiyan Zhai and Sudeep Agarwal and Hangrui Cao and Siyuan Feng and Tianqi Chen},
      year={2024},
      eprint={2412.15803},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2412.15803},
    }
