Commit

Merge branch 'mlc-ai:main' into main
JackWeiw authored Jun 3, 2024
2 parents 6834d0a + 46ee63a commit e1826d6
Showing 327 changed files with 9,712 additions and 9,557 deletions.
2 changes: 1 addition & 1 deletion 3rdparty/tvm
Submodule tvm updated 157 files
147 changes: 21 additions & 126 deletions README.md
@@ -1,13 +1,23 @@
[discord-url]: https://discord.gg/9Xpy2HGBuD
<div align="center">

# MLC LLM

[Documentation](https://llm.mlc.ai/docs) | [Blog](https://blog.mlc.ai/) | [Discord][discord-url]
[![Installation](https://img.shields.io/badge/docs-latest-green)](https://llm.mlc.ai/docs/)
[![License](https://img.shields.io/badge/license-apache_2-blue)](https://github.com/mlc-ai/mlc-llm/blob/main/LICENSE)
[![Join Discord](https://img.shields.io/badge/Join-Discord-7289DA?logo=discord&logoColor=white)](https://discord.gg/9Xpy2HGBuD)
[![Related Repository: WebLLM](https://img.shields.io/badge/Related_Repo-WebLLM-fafbfc?logo=github)](https://github.com/mlc-ai/web-llm/)

**M**achine **L**earning **C**ompilation for **L**arge **L**anguage **M**odels (MLC LLM) is a high-performance universal deployment solution that allows native deployment of any large language model, with native APIs and compiler acceleration. The mission of this project is to enable everyone to develop, optimize, and deploy AI models natively on everyone's devices with ML compilation techniques.
**Universal LLM Deployment Engine with ML Compilation**

**Universal deployment.** MLC LLM supports the following platforms and hardware:
[Get Started](https://llm.mlc.ai/docs/get_started/quick_start) | [Documentation](https://llm.mlc.ai/docs) | [Blog](https://blog.mlc.ai/)

</div>

## About

MLC LLM is a machine learning compiler and high-performance deployment engine for large language models. The mission of this project is to enable everyone to develop, optimize, and deploy AI models natively on everyone's platforms. 

<div align="center">
<table style="width:100%">
<thead>
<tr>
@@ -48,125 +58,16 @@
</tr>
</tbody>
</table>
</div>

MLC LLM compiles and runs code on MLCEngine -- a unified high-performance LLM inference engine across the above platforms. MLCEngine provides an OpenAI-compatible API available through the REST server, Python, JavaScript, iOS, and Android, all backed by the same engine and compiler that we keep improving with the community.

## Quick Start

This section provides quick start examples of the chat CLI, Python API, and REST server in MLC LLM.
We use the 4-bit quantized 8B Llama-3 model for demonstration purposes.
The pre-quantized Llama-3 weights are available at https://huggingface.co/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC.
You can also try out the unquantized Llama-3 model by replacing `q4f16_1` with `q0f16` in the examples below.
Please visit our [documentation](https://llm.mlc.ai/docs/index.html) for a detailed quick start and introduction.

### Installation

MLC LLM is available via [pip](https://llm.mlc.ai/docs/install/mlc_llm.html#install-mlc-packages).
It is recommended to install it in an isolated conda virtual environment.

To verify the installation, activate your virtual environment and run:

```bash
python -c "import mlc_llm; print(mlc_llm.__path__)"
```

You should see the installation path of the MLC LLM Python package.

### Chat CLI

We can try out the chat CLI in MLC LLM with the 4-bit quantized 8B Llama-3 model.

```bash
mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
```

This command may take 1-2 minutes to run the first time.
It then launches a chat interface where you can enter your prompt and chat with the model.

```
You can use the following special commands:
/help print the special commands
/exit quit the cli
/stats print out the latest stats (token/sec)
/reset restart a fresh chat
/set [overrides] override settings in the generation config. For example,
`/set temperature=0.5;max_gen_len=100;stop=end,stop`
Note: Separate stop words in the `stop` option with commas (,).
Multi-line input: Use escape+enter to start a new line.
user: What's the meaning of life
assistant:
What a profound and intriguing question! While there's no one definitive answer, I'd be happy to help you explore some perspectives on the meaning of life.
The concept of the meaning of life has been debated and...
```
## Get Started

### Python API

We can run the Llama-3 model with the chat completion Python API of MLC LLM.
You can save the code below into a Python file and run it.

```python
from mlc_llm import MLCEngine

# Create engine
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

# Run chat completion with the OpenAI-compatible API.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print("\n")

engine.terminate()
```

**The Python API of `mlc_llm.MLCEngine` fully aligns with the OpenAI API**.
You can use MLCEngine in the same way you use
[OpenAI's Python package](https://github.com/openai/openai-python?tab=readme-ov-file#usage)
for both synchronous and asynchronous generation.

If you would like to do concurrent asynchronous generation, you can use `mlc_llm.AsyncMLCEngine` instead.
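
For concurrent streaming generation, a minimal `AsyncMLCEngine` sketch is shown below. It assumes `AsyncMLCEngine` exposes the same OpenAI-style `chat.completions.create` interface as the synchronous example above; treat it as an illustrative sketch rather than a verbatim snippet from the documentation.

```python
import asyncio

from mlc_llm import AsyncMLCEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"


async def main():
    # Create the async engine (same model string as the synchronous example).
    engine = AsyncMLCEngine(model)

    # Stream a chat completion; the awaited call yields an async iterator of chunks.
    async for response in await engine.chat.completions.create(
        messages=[{"role": "user", "content": "What is the meaning of life?"}],
        model=model,
        stream=True,
    ):
        for choice in response.choices:
            print(choice.delta.content, end="", flush=True)
    print()

    engine.terminate()


asyncio.run(main())
```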

### REST Server

We can launch a REST server to serve the 4-bit quantized Llama-3 model for OpenAI chat completion requests.
The server provides full OpenAI API compatibility.

```bash
mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
```

The server listens at `http://127.0.0.1:8000` by default, and you can use `--host` and `--port`
to set a different host and port.
When the server is ready (showing `INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)`),
open a new shell and send a cURL request with the following command:

```bash
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "model": "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
        "messages": [
            {"role": "user", "content": "Hello! Our project is MLC LLM. What is the name of our project?"}
        ]
      }' \
  http://127.0.0.1:8000/v1/chat/completions
```
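
Because the server exposes OpenAI-compatible endpoints, you can also query it from Python. The sketch below assumes the official `openai` Python package (v1+) is installed and the server is running at the default address; the `api_key` value is a placeholder for deployments that do not enforce authentication.

```python
from openai import OpenAI

# Point the standard OpenAI client at the local MLC LLM REST server.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
    messages=[
        {"role": "user", "content": "Hello! Our project is MLC LLM. What is the name of our project?"}
    ],
)
print(response.choices[0].message.content)
```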

## Universal Deployment APIs

MLC LLM provides multiple sets of APIs across platforms and environments. These include
* [Python API](https://llm.mlc.ai/docs/deploy/python_engine.html)
* [OpenAI-compatible REST API](https://llm.mlc.ai/docs/deploy/rest.html)
* [C++ API](https://llm.mlc.ai/docs/deploy/cli.html)
* [JavaScript API](https://llm.mlc.ai/docs/deploy/javascript.html) and [Web LLM](https://github.com/mlc-ai/web-llm)
* [Swift API for iOS App](https://llm.mlc.ai/docs/deploy/ios.html)
* [Java API and Android App](https://llm.mlc.ai/docs/deploy/android.html)
Please visit our [documentation](https://llm.mlc.ai/docs/) to get started with MLC LLM.
- [Installation](https://llm.mlc.ai/docs/install/mlc_llm)
- [Quick start](https://llm.mlc.ai/docs/get_started/quick_start)
- [Introduction](https://llm.mlc.ai/docs/get_started/introduction)

## Citation

@@ -231,10 +132,4 @@ The underlying techniques of MLC LLM include:
```
</details>

## Links

- You might want to check out our online public [Machine Learning Compilation course](https://mlc.ai) for a systematic
walkthrough of our approaches.
- [WebLLM](https://webllm.mlc.ai/) is a companion project using MLC LLM's WebGPU and WebAssembly backend.
- [WebStableDiffusion](https://websd.mlc.ai/) is a companion project for diffusion models with the WebGPU backend.

62 changes: 37 additions & 25 deletions android/MLCChat/app/src/main/java/ai/mlc/mlcchat/AppViewModel.kt
@@ -1,6 +1,7 @@
package ai.mlc.mlcchat

import ai.mlc.mlcllm.ChatModule
import ai.mlc.mlcllm.MLCEngine
import ai.mlc.mlcllm.OpenAIProtocol
import android.app.Application
import android.content.ClipData
import android.content.ClipboardManager
@@ -21,6 +22,8 @@ import java.nio.channels.Channels
import java.util.UUID
import java.util.concurrent.Executors
import kotlin.concurrent.thread
import ai.mlc.mlcllm.OpenAIProtocol.ChatCompletionMessage
import kotlinx.coroutines.*

class AppViewModel(application: Application) : AndroidViewModel(application) {
val modelList = emptyList<ModelState>().toMutableStateList()
@@ -502,14 +505,14 @@ class AppViewModel(application: Application) : AndroidViewModel(application) {
private var modelChatState = mutableStateOf(ModelChatState.Ready)
@Synchronized get
@Synchronized set
private val backend = ChatModule()
private val engine = MLCEngine()
private var modelLib = ""
private var modelPath = ""
private val executorService = Executors.newSingleThreadExecutor()

private val viewModelScope = CoroutineScope(Dispatchers.Main + Job())
private fun mainResetChat() {
executorService.submit {
callBackend { backend.resetChat() }
callBackend { engine.reset() }
viewModelScope.launch {
clearHistory()
switchToReady()
@@ -551,7 +554,7 @@ class AppViewModel(application: Application) : AndroidViewModel(application) {
val stackTrace = e.stackTraceToString()
val errorMessage = e.localizedMessage
appendMessage(
MessageRole.Bot,
MessageRole.Assistant,
"MLCChat failed\n\nStack trace:\n$stackTrace\n\nError message:\n$errorMessage"
)
switchToFailed()
@@ -604,7 +607,7 @@ class AppViewModel(application: Application) : AndroidViewModel(application) {

private fun mainTerminateChat(callback: () -> Unit) {
executorService.submit {
callBackend { backend.unload() }
callBackend { engine.unload() }
viewModelScope.launch {
clearHistory()
switchToReady()
@@ -644,11 +647,8 @@ class AppViewModel(application: Application) : AndroidViewModel(application) {
Toast.makeText(application, "Initialize...", Toast.LENGTH_SHORT).show()
}
if (!callBackend {
backend.unload()
backend.reload(
modelConfig.modelLib,
modelPath
)
engine.unload()
engine.reload(modelPath, modelConfig.modelLib)
}) return@submit
viewModelScope.launch {
Toast.makeText(application, "Ready to chat", Toast.LENGTH_SHORT).show()
@@ -662,19 +662,31 @@ class AppViewModel(application: Application) : AndroidViewModel(application) {
switchToGenerating()
executorService.submit {
appendMessage(MessageRole.User, prompt)
appendMessage(MessageRole.Bot, "")
if (!callBackend { backend.prefill(prompt) }) return@submit
while (!backend.stopped()) {
if (!callBackend {
backend.decode()
val newText = backend.message
viewModelScope.launch { updateMessage(MessageRole.Bot, newText) }
}) return@submit
if (modelChatState.value != ModelChatState.Generating) return@submit
}
val runtimeStats = backend.runtimeStatsText()
appendMessage(MessageRole.Assistant, "")
viewModelScope.launch {
report.value = runtimeStats
val channel = engine.chat.completions.create(
messages = listOf(
ChatCompletionMessage(
role = OpenAIProtocol.ChatCompletionRole.user,
content = prompt
)
),
stream_options = OpenAIProtocol.StreamOptions(include_usage = true)
)
var texts = ""
for (response in channel) {
if (!callBackend {
val finalsage = response.usage
if (finalsage != null) {
report.value = (finalsage.extra?.asTextLabel()?:"")
} else {
if (response.choices.size > 0) {
texts += response.choices[0].delta.content?.asText().orEmpty()
}
}
updateMessage(MessageRole.Assistant, texts)
});
}
if (modelChatState.value == ModelChatState.Generating) switchToReady()
}
}
@@ -722,7 +734,7 @@ enum class ModelChatState {
}

enum class MessageRole {
Bot,
Assistant,
User
}

@@ -757,4 +769,4 @@ data class ParamsRecord(

data class ParamsConfig(
@SerializedName("records") val paramsRecords: List<ParamsRecord>
)
)
@@ -136,7 +136,7 @@ fun ChatView(
@Composable
fun MessageView(messageData: MessageData) {
SelectionContainer {
if (messageData.role == MessageRole.Bot) {
if (messageData.role == MessageRole.Assistant) {
Row(
horizontalArrangement = Arrangement.Start,
modifier = Modifier.fillMaxWidth()
10 changes: 5 additions & 5 deletions android/MLCChat/mlc-package-config.json
@@ -3,13 +3,13 @@
"model_list": [
{
"model": "HF://mlc-ai/gemma-2b-it-q4f16_1-MLC",
"model_id": "gemma-2b-q4f16_1",
"model_id": "gemma-2b-q4f16_1-MLC",
"estimated_vram_bytes": 3000000000
},
{
"model": "HF://mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC",
"estimated_vram_bytes": 4348727787,
"model_id": "Llama-2-7b-chat-hf-q4f16_1",
"model_id": "Llama-2-7b-chat-hf-q4f16_1-MLC",
"overrides": {
"context_window_size": 768,
"prefill_chunk_size": 256
@@ -18,12 +18,12 @@
{
"model": "HF://mlc-ai/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC",
"estimated_vram_bytes": 1948348579,
"model_id": "RedPajama-INCITE-Chat-3B-v1-q4f16_1"
"model_id": "RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC"
},
{
"model": "HF://mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC",
"estimated_vram_bytes": 4275453296,
"model_id": "Mistral-7B-Instruct-v0.2-q4f16_1",
"model_id": "Mistral-7B-Instruct-v0.2-q4f16_1-MLC",
"overrides": {
"sliding_window_size": 768,
"prefill_chunk_size": 256
@@ -32,7 +32,7 @@
{
"model": "HF://mlc-ai/phi-2-q4f16_1-MLC",
"estimated_vram_bytes": 2036816936,
"model_id": "phi-2-q4f16_1"
"model_id": "phi-2-q4f16_1-MLC"
}
]
}