-
Hi, I followed the tutorial and tried to start the Tabby server on my CPU-only Windows machine, but the startup process got stuck with the following command-line output:

2024-09-24T09:14:25.573705Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: warning: not compiled with GPU offload support, --gpu-layers option will be ignored
2024-09-24T09:14:25.573875Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: warning: see main README.md for information on enabling GPU BLAS support
2024-09-24T09:14:25.573987Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\Users\h30058272\.tabby\models\TabbyML\Nomic-Embed-Text\ggml/model.gguf (version GGUF V3 (latest))
2024-09-24T09:14:25.574124Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
2024-09-24T09:14:25.574251Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 0: general.architecture str = nomic-bert
2024-09-24T09:14:25.574369Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5
2024-09-24T09:14:25.574446Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12
2024-09-24T09:14:25.574494Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048
2024-09-24T09:14:25.574540Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768
2024-09-24T09:14:25.574587Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072
2024-09-24T09:14:25.574633Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12
2024-09-24T09:14:25.574680Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000
2024-09-24T09:14:25.574775Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 8: general.file_type u32 = 7
2024-09-24T09:14:25.574839Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false
2024-09-24T09:14:25.574890Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1
2024-09-24T09:14:25.574939Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000
2024-09-24T09:14:25.575015Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2
2024-09-24T09:14:25.575063Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101
2024-09-24T09:14:25.575109Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102
2024-09-24T09:14:25.575156Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 15: tokenizer.ggml.model str = bert
2024-09-24T09:14:25.575205Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
2024-09-24T09:14:25.575251Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
2024-09-24T09:14:25.575297Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2024-09-24T09:14:25.575343Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100
2024-09-24T09:14:25.575389Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102
2024-09-24T09:14:25.575435Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
2024-09-24T09:14:25.575480Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv 22: general.quantization_version u32 = 2
2024-09-24T09:14:25.575526Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - type f32: 51 tensors
2024-09-24T09:14:25.575572Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - type q8_0: 61 tensors
2024-09-24T09:14:25.575617Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_vocab: special tokens cache size = 5
2024-09-24T09:14:25.575663Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_vocab: token to piece cache size = 0.2032 MB
2024-09-24T09:14:25.575710Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: format = GGUF V3 (latest)
2024-09-24T09:14:25.575758Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: arch = nomic-bert
2024-09-24T09:14:25.575809Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: vocab type = WPM
2024-09-24T09:14:25.575881Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_vocab = 30522
2024-09-24T09:14:25.575938Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_merges = 0
2024-09-24T09:14:25.575983Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: vocab_only = 0
2024-09-24T09:14:25.576028Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_ctx_train = 2048
2024-09-24T09:14:25.576073Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_embd = 768
2024-09-24T09:14:25.576124Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_layer = 12
2024-09-24T09:14:25.576169Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_head = 12
2024-09-24T09:14:25.576214Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_head_kv = 12
2024-09-24T09:14:25.576259Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_rot = 64
2024-09-24T09:14:25.576303Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_swa = 0
2024-09-24T09:14:25.576398Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_embd_head_k = 64
2024-09-24T09:14:25.576469Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_embd_head_v = 64
2024-09-24T09:14:25.576512Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_gqa = 1
2024-09-24T09:14:25.576553Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_embd_k_gqa = 768
2024-09-24T09:14:25.576594Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_embd_v_gqa = 768
2024-09-24T09:14:25.576635Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: f_norm_eps = 1.0e-12
2024-09-24T09:14:25.576675Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: f_norm_rms_eps = 0.0e+00
2024-09-24T09:14:25.576716Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: f_clamp_kqv = 0.0e+00
2024-09-24T09:14:25.576757Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
2024-09-24T09:14:25.576865Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: f_logit_scale = 0.0e+00
2024-09-24T09:14:25.576920Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_ff = 3072
2024-09-24T09:14:25.576982Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_expert = 0
2024-09-24T09:14:25.577032Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_expert_used = 0
2024-09-24T09:14:25.577079Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: causal attn = 0
2024-09-24T09:14:25.577123Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: pooling type = 1
2024-09-24T09:14:25.577167Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: rope type = 2
2024-09-24T09:14:25.577211Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: rope scaling = linear
2024-09-24T09:14:25.577255Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: freq_base_train = 1000.0
2024-09-24T09:14:25.577299Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: freq_scale_train = 1
2024-09-24T09:14:25.577386Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_ctx_orig_yarn = 2048
2024-09-24T09:14:25.577431Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: rope_finetuned = unknown
2024-09-24T09:14:25.577475Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: ssm_d_conv = 0
2024-09-24T09:14:25.577519Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: ssm_d_inner = 0
2024-09-24T09:14:25.577564Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: ssm_d_state = 0
2024-09-24T09:14:25.577609Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: ssm_dt_rank = 0
2024-09-24T09:14:25.577653Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: model type = 137M
2024-09-24T09:14:25.577696Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: model ftype = Q8_0
2024-09-24T09:14:25.577740Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: model params = 136.73 M
2024-09-24T09:14:25.577784Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: model size = 138.65 MiB (8.51 BPW)
2024-09-24T09:14:25.577829Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: general.name = nomic-embed-text-v1.5
2024-09-24T09:14:25.577873Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: BOS token = 101 '[CLS]'
2024-09-24T09:14:25.577917Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: EOS token = 102 '[SEP]'
2024-09-24T09:14:25.577961Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: UNK token = 100 '[UNK]'
2024-09-24T09:14:25.578005Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: SEP token = 102 '[SEP]'
2024-09-24T09:14:25.578049Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: PAD token = 0 '[PAD]'
2024-09-24T09:14:25.578092Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: CLS token = 101 '[CLS]'
2024-09-24T09:14:25.578136Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: MASK token = 103 '[MASK]'
2024-09-24T09:14:25.578180Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: LF token = 0 '[PAD]'
2024-09-24T09:14:25.578224Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: max token length = 21
2024-09-24T09:14:25.578269Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_tensors: ggml ctx size = 0.05 MiB
2024-09-24T09:14:25.578332Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_tensors: CPU buffer size = 138.65 MiB
2024-09-24T09:14:25.578377Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: .......................................................
2024-09-24T09:14:25.578421Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_new_context_with_model: n_ctx = 4096
2024-09-24T09:14:25.578465Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_new_context_with_model: n_batch = 2048
2024-09-24T09:14:25.578509Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_new_context_with_model: n_ubatch = 2048
2024-09-24T09:14:25.578553Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_new_context_with_model: flash_attn = 0
2024-09-24T09:14:25.578596Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_new_context_with_model: freq_base = 1000.0
2024-09-24T09:14:25.578640Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_new_context_with_model: freq_scale = 1
2024-09-24T09:14:25.578684Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_kv_cache_init: CPU KV buffer size = 144.00 MiB
2024-09-24T09:14:25.578728Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_new_context_with_model: KV self size = 144.00 MiB, K (f16): 72.00 MiB, V (f16): 72.00 MiB
2024-09-24T09:14:25.578772Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_new_context_with_model: CPU output buffer size = 0.00 MiB
2024-09-24T09:14:25.578816Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 260.02 MiB
2024-09-24T09:14:25.578908Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_new_context_with_model: CPU compute buffer size = 260.02 MiB
2024-09-24T09:14:25.578956Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_new_context_with_model: graph nodes = 453
2024-09-24T09:14:25.579002Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_new_context_with_model: graph splits = 1
⠋ 4.838 s Starting...2024-09-24T09:14:27.074277Z DEBUG hyper_util::client::legacy::connect::http: C:\Users\runneradmin\.cargo\registry\src\index.crates.io-6f17d22bba15001f\hyper-util-0.1.5\src\client\legacy\connect\http.rs:634: connected to 127.0.0.1:30888
⠋ 7.255 s Starting...2024-09-24T09:14:29.491996Z DEBUG reqwest::connect: C:\Users\runneradmin\.cargo\registry\src\index.crates.io-6f17d22bba15001f\reqwest-0.12.4\src\connect.rs:497: starting new connection: http://127.0.0.1:30888/
2024-09-24T09:14:29.492134Z DEBUG hyper_util::client::legacy::connect::http: C:\Users\runneradmin\.cargo\registry\src\index.crates.io-6f17d22bba15001f\hyper-util-0.1.5\src\client\legacy\connect\http.rs:631: connecting to 127.0.0.1:30888
⠙ 7.335 s Starting...2024-09-24T09:14:29.522809Z WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:98: llama-server <embedding> exited with status code -1073741819, args: `Command { std: "C:\\Users\\h30058272\\Downloads\\dist\\tabby_x86_64-windows-msvc\\llama-server.exe" "-m" "C:\\Users\\h30058272\\.tabby\\models\\TabbyML\\Nomic-Embed-Text\\ggml/model.gguf" "--cont-batching" "--port" "30888" "-np" "1" "--log-disable" "--ctx-size" "4096" "-ngl" "9999" "--embedding" "--ubatch-size" "4096", kill_on_drop: true }`

I found a similar discussion, #2936, and enabled DEBUG logging, but the output from my machine seems somewhat different from the one reported there.
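Side note for anyone reading along: exit status -1073741819 is the signed form of the Windows NTSTATUS code 0xC0000005 (STATUS_ACCESS_VIOLATION), meaning the child process crashed with an access violation rather than failing an ordinary startup check. You can confirm the hex value with plain integer formatting in PowerShell; nothing here is Tabby-specific:

'0x{0:X8}' -f -1073741819   # prints 0xC0000005 (STATUS_ACCESS_VIOLATION)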
-
Hi - this is a known issue fixed in #3152 (original issue: #3150). It'll be part of the 0.18 release - you might give rc3 a try to see if it fixes the issue for you: https://github.com/TabbyML/tabby/releases/tag/v0.18.0-rc.3
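If you want to confirm which build you're on after swapping in the rc3 binaries, the CLI should report its version (this assumes the standard --version flag that clap-based CLIs like tabby expose):

.\tabby.exe --version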
-
After some inspection, I noticed that this happens because llama-server.exe fails when Rust launches it as a subprocess. So I ran the command below to test whether I could start llama-server from PowerShell directly. It turns out that this embedding llama-server exits quietly after printing the log below:

PS C:\Users\xxx\Downloads\dist\tabby_x86_64-windows-msvc> .\llama-server.exe -m "C:\\Users\\xxx\\.tabby\\models\\TabbyML\\Nomic-Embed-Text\\ggml\\model.gguf" --cont-batching --port 30888 -np 1 --ctx-size 1024 --embedding --ubatch-size 4096 --log-enable
INFO [ main] build info | tid="12136" timestamp=1727417626 build=1 commit="5ef07e2"
INFO [ main] system info | tid="12136" timestamp=1727417626 n_threads=10 n_threads_batch=-1 total_threads=20 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\\Users\\xxx\\.tabby\\models\\TabbyML\\Nomic-Embed-Text\\ggml\\model.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = nomic-bert
llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5
llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12
llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048
llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768
llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072
llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 7
llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false
llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1
llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000
llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 15: tokenizer.ggml.model str = bert
llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 51 tensors
llama_model_loader: - type q8_0: 61 tensors
llm_load_vocab: special tokens cache size = 5
llm_load_vocab: token to piece cache size = 0.2032 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = nomic-bert
llm_load_print_meta: vocab type = WPM
llm_load_print_meta: n_vocab = 30522
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 768
llm_load_print_meta: n_layer = 12
llm_load_print_meta: n_head = 12
llm_load_print_meta: n_head_kv = 12
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 768
llm_load_print_meta: n_embd_v_gqa = 768
llm_load_print_meta: f_norm_eps = 1.0e-12
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 3072
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 0
llm_load_print_meta: pooling type = 1
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 137M
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 136.73 M
llm_load_print_meta: model size = 138.65 MiB (8.51 BPW)
llm_load_print_meta: general.name = nomic-embed-text-v1.5
llm_load_print_meta: BOS token = 101 '[CLS]'
llm_load_print_meta: EOS token = 102 '[SEP]'
llm_load_print_meta: UNK token = 100 '[UNK]'
llm_load_print_meta: SEP token = 102 '[SEP]'
llm_load_print_meta: PAD token = 0 '[PAD]'
llm_load_print_meta: CLS token = 101 '[CLS]'
llm_load_print_meta: MASK token = 103 '[MASK]'
llm_load_print_meta: LF token = 0 '[PAD]'
llm_load_print_meta: max token length = 21
llm_load_tensors: ggml ctx size = 0.05 MiB
llm_load_tensors: CPU buffer size = 138.65 MiB
.......................................................
llama_new_context_with_model: n_ctx = 1024
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 2048
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 36.00 MiB
llama_new_context_with_model: KV self size = 36.00 MiB, K (f16): 18.00 MiB, V (f16): 18.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.00 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 74.01 MiB
llama_new_context_with_model: CPU compute buffer size = 74.01 MiB
llama_new_context_with_model: graph nodes = 453
llama_new_context_with_model: graph splits = 1
INFO [ init] initializing slots | tid="12136" timestamp=1727417626 n_slots=1
INFO [ init] new slot | tid="12136" timestamp=1727417626 id_slot=0 n_ctx_slot=1024
INFO [ main] model loaded | tid="12136" timestamp=1727417626
INFO [ main] chat template | tid="12136" timestamp=1727417626 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
INFO [ main] HTTP server listening | tid="12136" timestamp=1727417626 hostname="127.0.0.1" port="30888" n_threads_http="19"

Do you have any idea why this happens?
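Since it exits silently, one generic PowerShell check worth running right after it dies is the native exit code of the last process; if it matches the supervisor log above, the standalone run is hitting the same access violation:

$LASTEXITCODE                 # native exit code of the last program run, e.g. -1073741819
'0x{0:X8}' -f $LASTEXITCODE   # the same value as an NTSTATUS hex code, e.g. 0xC0000005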
-
Finally, after a bit of hacking, I found a solution for this issue:
.\tabby.exe serve --model StarCoder-1B --chat-model Qwen2-1.5B-Instruct
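A quick way to verify the server actually came up with these models (assuming the default port, 8080, since no --port was passed):

Invoke-WebRequest http://localhost:8080 -UseBasicParsing | Select-Object StatusCode   # expect StatusCode : 200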