-
Hi, I followed the tutorial and tried to start the Tabby server on my CPU-only Windows machine, but the startup process got stuck with the following command-line output:

2024-09-24T09:14:25.573705Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: warning: not compiled with GPU offload support, --gpu-layers option will be ignored
2024-09-24T09:14:25.573875Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: warning: see main README.md for information on enabling GPU BLAS support
2024-09-24T09:14:25.573987Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\Users\h30058272\.tabby\models\TabbyML\Nomic-Embed-Text\ggml/model.gguf (version GGUF V3 (latest))
2024-09-24T09:14:25.574124Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
2024-09-24T09:14:25.574251Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 0: general.architecture str = nomic-bert
2024-09-24T09:14:25.574369Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5
2024-09-24T09:14:25.574446Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12
2024-09-24T09:14:25.574494Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048
2024-09-24T09:14:25.574540Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768
2024-09-24T09:14:25.574587Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072
2024-09-24T09:14:25.574633Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12
2024-09-24T09:14:25.574680Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000
2024-09-24T09:14:25.574775Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 8: general.file_type u32 = 7
2024-09-24T09:14:25.574839Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false
2024-09-24T09:14:25.574890Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1
2024-09-24T09:14:25.574939Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000
2024-09-24T09:14:25.575015Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2
2024-09-24T09:14:25.575063Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101
2024-09-24T09:14:25.575109Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102
2024-09-24T09:14:25.575156Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 15: tokenizer.ggml.model str = bert
2024-09-24T09:14:25.575205Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
2024-09-24T09:14:25.575251Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
2024-09-24T09:14:25.575297Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2024-09-24T09:14:25.575343Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100
2024-09-24T09:14:25.575389Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102
2024-09-24T09:14:25.575435Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
2024-09-24T09:14:25.575480Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 22: general.quantization_version u32 = 2
2024-09-24T09:14:25.575526Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - type f32: 51 tensors
2024-09-24T09:14:25.575572Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - type q8_0: 61 tensors
2024-09-24T09:14:25.575617Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_vocab: special tokens cache size = 5
2024-09-24T09:14:25.575663Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_vocab: token to piece cache size = 0.2032 MB
2024-09-24T09:14:25.575710Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: format = GGUF V3 (latest)
2024-09-24T09:14:25.575758Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: arch = nomic-bert
2024-09-24T09:14:25.575809Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: vocab type = WPM
2024-09-24T09:14:25.575881Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_vocab = 30522
2024-09-24T09:14:25.575938Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_merges = 0
2024-09-24T09:14:25.575983Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: vocab_only = 0
2024-09-24T09:14:25.576028Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_ctx_train = 2048
2024-09-24T09:14:25.576073Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_embd = 768
2024-09-24T09:14:25.576124Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_layer = 12
2024-09-24T09:14:25.576169Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_head = 12
2024-09-24T09:14:25.576214Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_head_kv = 12
2024-09-24T09:14:25.576259Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_rot = 64
2024-09-24T09:14:25.576303Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_swa = 0
2024-09-24T09:14:25.576398Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_embd_head_k = 64
2024-09-24T09:14:25.576469Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_embd_head_v = 64
2024-09-24T09:14:25.576512Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_gqa = 1
2024-09-24T09:14:25.576553Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_embd_k_gqa = 768
2024-09-24T09:14:25.576594Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_embd_v_gqa = 768
2024-09-24T09:14:25.576635Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: f_norm_eps = 1.0e-12
2024-09-24T09:14:25.576675Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: f_norm_rms_eps = 0.0e+00
2024-09-24T09:14:25.576716Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: f_clamp_kqv = 0.0e+00
2024-09-24T09:14:25.576757Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
2024-09-24T09:14:25.576865Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: f_logit_scale = 0.0e+00
2024-09-24T09:14:25.576920Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_ff = 3072
2024-09-24T09:14:25.576982Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_expert = 0
2024-09-24T09:14:25.577032Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_expert_used = 0
2024-09-24T09:14:25.577079Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: causal attn = 0
2024-09-24T09:14:25.577123Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: pooling type = 1
2024-09-24T09:14:25.577167Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: rope type = 2
2024-09-24T09:14:25.577211Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: rope scaling = linear
2024-09-24T09:14:25.577255Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: freq_base_train = 1000.0
2024-09-24T09:14:25.577299Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: freq_scale_train = 1
2024-09-24T09:14:25.577386Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_ctx_orig_yarn = 2048
2024-09-24T09:14:25.577431Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: rope_finetuned = unknown
2024-09-24T09:14:25.577475Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: ssm_d_conv = 0
2024-09-24T09:14:25.577519Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: ssm_d_inner = 0
2024-09-24T09:14:25.577564Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: ssm_d_state = 0
2024-09-24T09:14:25.577609Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: ssm_dt_rank = 0
2024-09-24T09:14:25.577653Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: model type = 137M
2024-09-24T09:14:25.577696Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: model ftype = Q8_0
2024-09-24T09:14:25.577740Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: model params = 136.73 M
2024-09-24T09:14:25.577784Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: model size = 138.65 MiB (8.51 BPW)
2024-09-24T09:14:25.577829Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: general.name = nomic-embed-text-v1.5
2024-09-24T09:14:25.577873Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: BOS token = 101 '[CLS]'
2024-09-24T09:14:25.577917Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: EOS token = 102 '[SEP]'
2024-09-24T09:14:25.577961Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: UNK token = 100 '[UNK]'
2024-09-24T09:14:25.578005Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: SEP token = 102 '[SEP]'
2024-09-24T09:14:25.578049Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: PAD token = 0 '[PAD]'
2024-09-24T09:14:25.578092Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: CLS token = 101 '[CLS]'
2024-09-24T09:14:25.578136Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: MASK token = 103 '[MASK]'
2024-09-24T09:14:25.578180Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: LF token = 0 '[PAD]'
2024-09-24T09:14:25.578224Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: max token length = 21
2024-09-24T09:14:25.578269Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_tensors: ggml ctx size = 0.05 MiB
2024-09-24T09:14:25.578332Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_tensors: CPU buffer size = 138.65 MiB
2024-09-24T09:14:25.578377Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: .......................................................
2024-09-24T09:14:25.578421Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_new_context_with_model: n_ctx = 4096
2024-09-24T09:14:25.578465Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_new_context_with_model: n_batch = 2048
2024-09-24T09:14:25.578509Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_new_context_with_model: n_ubatch = 2048
2024-09-24T09:14:25.578553Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_new_context_with_model: flash_attn = 0
2024-09-24T09:14:25.578596Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_new_context_with_model: freq_base = 1000.0
2024-09-24T09:14:25.578640Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_new_context_with_model: freq_scale = 1
2024-09-24T09:14:25.578684Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_kv_cache_init: CPU KV buffer size = 144.00 MiB
2024-09-24T09:14:25.578728Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_new_context_with_model: KV self size = 144.00 MiB, K (f16): 72.00 MiB, V (f16): 72.00 MiB
2024-09-24T09:14:25.578772Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_new_context_with_model: CPU output buffer size = 0.00 MiB
2024-09-24T09:14:25.578816Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 260.02 MiB
2024-09-24T09:14:25.578908Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_new_context_with_model: CPU compute buffer size = 260.02 MiB
2024-09-24T09:14:25.578956Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_new_context_with_model: graph nodes = 453
2024-09-24T09:14:25.579002Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_new_context_with_model: graph splits = 1
⠋ 4.838 s Starting...2024-09-24T09:14:27.074277Z DEBUG hyper_util::client::legacy::connect::http: C:\Users\runneradmin\.cargo\registry\src\index.crates.io-6f17d22bba15001f\hyper-util-0.1.5\src\client\legacy\connect\http.rs:634: connected to 127.0.0.1:30888
⠋ 7.255 s Starting...2024-09-24T09:14:29.491996Z DEBUG reqwest::connect: C:\Users\runneradmin\.cargo\registry\src\index.crates.io-6f17d22bba15001f\reqwest-0.12.4\src\connect.rs:497: starting new connection: http://127.0.0.1:30888/
2024-09-24T09:14:29.492134Z DEBUG hyper_util::client::legacy::connect::http: C:\Users\runneradmin\.cargo\registry\src\index.crates.io-6f17d22bba15001f\hyper-util-0.1.5\src\client\legacy\connect\http.rs:631: connecting to 127.0.0.1:30888
⠙ 7.335 s Starting...2024-09-24T09:14:29.522809Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:98: llama-server <embedding> exited with status code -1073741819, args: `Command { std: "C:\\Users\\h30058272\\Downloads\\dist\\tabby_x86_64-windows-msvc\\llama-server.exe" "-m" "C:\\Users\\h30058272\\.tabby\\models\\TabbyML\\Nomic-Embed-Text\\ggml/model.gguf" "--cont-batching" "--port" "30888" "-np" "1" "--log-disable" "--ctx-size" "4096" "-ngl" "9999" "--embedding" "--ubatch-size" "4096", kill_on_drop: true }`

I found a similar discussion, #2936, and enabled DEBUG logging, but the output from my machine seems somewhat different from the one reported there.
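Side note for anyone reading along: exit status -1073741819 is the signed form of the Windows NTSTATUS code 0xC0000005 (STATUS_ACCESS_VIOLATION), meaning the child process crashed with an access violation rather than failing an ordinary startup check. You can confirm the hex value with plain integer formatting in PowerShell; nothing here is Tabby-specific:

'0x{0:X8}' -f -1073741819   # prints 0xC0000005 (STATUS_ACCESS_VIOLATION)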
-
Hi - this is a known issue fixed in #3152 (original issue: #3150). It'll be part of the 0.18 release - you might give rc3 a try to see if it fixes the issue for you: https://github.com/TabbyML/tabby/releases/tag/v0.18.0-rc.3
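If you want to confirm which build you're on after swapping in the rc3 binaries, the CLI should report its version (this assumes the standard --version flag that clap-based CLIs like tabby expose):

.\tabby.exe --version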
-
After some inspection, I noticed that this happens because llama-server.exe fails when Rust launches it as a subprocess. So I ran the command below to test whether I could start llama-server from PowerShell directly. It turns out that this embedding llama-server exits quietly after printing the log below:

PS C:\Users\xxx\Downloads\dist\tabby_x86_64-windows-msvc> .\llama-server.exe -m "C:\\Users\\xxx\\.tabby\\models\\TabbyML\\Nomic-Embed-Text\\ggml\\model.gguf" --cont-batching --port 30888 -np 1 --ctx-size 1024 --embedding --ubatch-size 4096 --log-enable
INFO [ main] build info | tid="12136" timestamp=1727417626 build=1 commit="5ef07e2"
INFO [ main] system info | tid="12136" timestamp=1727417626 n_threads=10 n_threads_batch=-1 total_threads=20 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\\Users\\xxx\\.tabby\\models\\TabbyML\\Nomic-Embed-Text\\ggml\\model.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = nomic-bert
llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5
llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12
llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048
llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768
llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072
llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 7
llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false
llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1
llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000
llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 15: tokenizer.ggml.model str = bert
llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 51 tensors
llama_model_loader: - type q8_0: 61 tensors
llm_load_vocab: special tokens cache size = 5
llm_load_vocab: token to piece cache size = 0.2032 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = nomic-bert
llm_load_print_meta: vocab type = WPM
llm_load_print_meta: n_vocab = 30522
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 768
llm_load_print_meta: n_layer = 12
llm_load_print_meta: n_head = 12
llm_load_print_meta: n_head_kv = 12
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 768
llm_load_print_meta: n_embd_v_gqa = 768
llm_load_print_meta: f_norm_eps = 1.0e-12
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 3072
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 0
llm_load_print_meta: pooling type = 1
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 137M
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 136.73 M
llm_load_print_meta: model size = 138.65 MiB (8.51 BPW)
llm_load_print_meta: general.name = nomic-embed-text-v1.5
llm_load_print_meta: BOS token = 101 '[CLS]'
llm_load_print_meta: EOS token = 102 '[SEP]'
llm_load_print_meta: UNK token = 100 '[UNK]'
llm_load_print_meta: SEP token = 102 '[SEP]'
llm_load_print_meta: PAD token = 0 '[PAD]'
llm_load_print_meta: CLS token = 101 '[CLS]'
llm_load_print_meta: MASK token = 103 '[MASK]'
llm_load_print_meta: LF token = 0 '[PAD]'
llm_load_print_meta: max token length = 21
llm_load_tensors: ggml ctx size = 0.05 MiB
llm_load_tensors: CPU buffer size = 138.65 MiB
.......................................................
llama_new_context_with_model: n_ctx = 1024
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 2048
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 36.00 MiB
llama_new_context_with_model: KV self size = 36.00 MiB, K (f16): 18.00 MiB, V (f16): 18.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.00 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 74.01 MiB
llama_new_context_with_model: CPU compute buffer size = 74.01 MiB
llama_new_context_with_model: graph nodes = 453
llama_new_context_with_model: graph splits = 1
INFO [ init] initializing slots | tid="12136" timestamp=1727417626 n_slots=1
INFO [ init] new slot | tid="12136" timestamp=1727417626 id_slot=0 n_ctx_slot=1024
INFO [ main] model loaded | tid="12136" timestamp=1727417626
INFO [ main] chat template | tid="12136" timestamp=1727417626 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
INFO [ main] HTTP server listening | tid="12136" timestamp=1727417626 hostname="127.0.0.1" port="30888" n_threads_http="19"

Do you have any idea why this happens?
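Since it exits silently, one generic PowerShell check worth running right after it dies is the native exit code of the last process; if it matches the supervisor log above, the standalone run is hitting the same access violation:

$LASTEXITCODE                 # native exit code of the last program run, e.g. -1073741819
'0x{0:X8}' -f $LASTEXITCODE   # the same value as an NTSTATUS hex code, e.g. 0xC0000005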
-
Finally, after a bit of hacking, I found a solution for this issue:
.\tabby.exe serve --model StarCoder-1B --chat-model Qwen2-1.5B-Instruct
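A quick way to verify the server actually came up with these models (assuming the default port, 8080, since no --port was passed):

Invoke-WebRequest http://localhost:8080 -UseBasicParsing | Select-Object StatusCode   # expect StatusCode : 200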