
[Bug]: Milvus Exits Suddenly During Data Ingestion #36645

Open
1 task done
AhmedAl-Zanam opened this issue Oct 4, 2024 · 7 comments
Assignees
Labels
kind/bug Issues or changes related to a bug triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@AhmedAl-Zanam

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: v2.4.12
- Deployment mode (standalone or cluster): standalone
- MQ type (rocksmq, pulsar, or kafka): none
- SDK version (e.g. pymilvus v2.0.0rc2): 2.4.0
- OS (Ubuntu or CentOS): Ubuntu
- CPU/Memory: Intel Xeon Gold 6248, 16 cores / 540 GB
- GPU: none
- Others:

Current Behavior

We are experiencing an issue where Milvus exits suddenly while we are ingesting data. This unexpected termination disrupts our data processing pipeline and affects the overall stability of our system. Despite assigning substantial RAM and many CPU cores to the Milvus server, the problem persists.

Expected Behavior

Milvus should handle the data ingestion process without exiting unexpectedly.

Steps To Reproduce

1. Start the Milvus server.
2. Begin ingesting a large dataset into Milvus (a minimal ingestion sketch follows below).
3. Observe the server behavior during the ingestion process.
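
A minimal sketch of step 2 with pymilvus 2.4.x; the endpoint, collection name, vector dimension, and batch sizes are illustrative assumptions rather than details from this report:

```python
# Sustained bulk ingestion against a local standalone Milvus.
# Collection name, dim, and batch counts are assumptions for illustration.
import random
from pymilvus import MilvusClient, DataType

client = MilvusClient(uri="http://localhost:19530")

schema = MilvusClient.create_schema(auto_id=True)
schema.add_field("id", DataType.INT64, is_primary=True)
schema.add_field("vec", DataType.FLOAT_VECTOR, dim=768)
client.create_collection("ingest_repro", schema=schema)

# Keep inserting batches; the reported crash happens partway through a run like this.
for _ in range(1000):
    rows = [{"vec": [random.random() for _ in range(768)]} for _ in range(5000)]
    client.insert(collection_name="ingest_repro", data=rows)
```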

Milvus Log

_milvus23-standalone_logs (1).txt

Anything else?

The issue occurs consistently during large data ingestions.
We have verified that the etcd service is running and accessible.
Network connectivity between Milvus and etcd appears to be stable.
We have assigned a significant amount of RAM and CPU cores to the Milvus server, but the problem remains.

@AhmedAl-Zanam AhmedAl-Zanam added kind/bug Issues or changes related to a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 4, 2024
@xiaofan-luan
Collaborator


From the log you offered:

  1. the clock offset seems to be huge (> 30s); a quick clock-sync check is sketched below
  2. there is an etcd session timeout
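
A quick way to rule out a host-clock problem behind point 1; this assumes an Ubuntu host with systemd, plus chrony if installed. Note the offset can also appear when the Milvus process itself stalls (for example on slow I/O) rather than from real clock skew:

```python
# Check whether the host clock is actually skewed; assumes an Ubuntu host.
# If the clock is synchronized, the "jet-lag" more likely means the process stalled.
import subprocess

# "System clock synchronized: yes" here rules out plain NTP misconfiguration.
subprocess.run(["timedatectl", "status"])

# If chrony is installed, this prints the measured offset from the NTP sources.
subprocess.run(["chronyc", "tracking"])
```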

@xiaofan-luan
Collaborator

["Slow etcd operation save"] ["time spent"=14.360227492s] [key=by-dev/kv/gid/timestamp]

and etcd is slow too.
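
That 14.36s save can be checked directly against etcd. A small probe sketch, assuming the default standalone endpoint on localhost:2379 and an etcdctl v3 binary on the PATH:

```python
# Time a single small etcd put; it should take milliseconds, not seconds.
# The endpoint is an assumption for a default standalone deployment.
import subprocess
import time

ENDPOINT = "http://localhost:2379"

start = time.monotonic()
subprocess.run(
    ["etcdctl", "--endpoints", ENDPOINT, "put", "milvus-latency-probe", "x"],
    check=True,
)
print(f"etcd put took {time.monotonic() - start:.3f}s")

# etcd's built-in benchmark reports put latency and throughput percentiles.
subprocess.run(["etcdctl", "--endpoints", ENDPOINT, "check", "perf"], check=True)
```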

@xiaofan-luan
Collaborator

Can you confirm:

  1. etcd is deployed on an SSD volume? (a disk-latency check is sketched below)
  2. etcd CPU and memory usage were healthy at that time?
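
A sketch of the disk check behind question 1, using the fio invocation etcd's documentation recommends for validating WAL fsync latency; the test directory is an assumption and should point into the volume backing etcd's data dir:

```python
# Run etcd's recommended fio disk test inside the etcd data volume.
# /var/lib/etcd-fio-test is an assumed path; point it at your etcd volume.
import os
import subprocess

test_dir = "/var/lib/etcd-fio-test"
os.makedirs(test_dir, exist_ok=True)

subprocess.run(
    [
        "fio", "--rw=write", "--ioengine=sync", "--fdatasync=1",
        f"--directory={test_dir}", "--size=22m", "--bs=2300",
        "--name=etcd-disk-check",
    ],
    check=True,
)
# On an SSD-backed volume, fdatasync p99 should be well under 10ms; values in
# the hundreds of ms (or seconds, as in the log above) point at slow storage.
```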

@yanliang567
Contributor

/assign @AhmedAl-Zanam
/unassign

@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 8, 2024
@colanyanya

colanyanya commented Nov 8, 2024

I ran into a similar problem.

Environment

- Milvus version: v2.4.5
- Deployment mode (standalone or cluster): standalone
- MQ type (rocksmq, pulsar, or kafka): none
- SDK version (e.g. pymilvus v2.0.0rc2): pymilvus 2.4.8, langchain_milvus 0.1.5
- OS (Ubuntu or CentOS): Ubuntu
- CPU/Memory: Intel(R) Xeon(R) Platinum, 8 cores / 16 GB (Alibaba Cloud)
- GPU: none
- Others:

What happened

1. A small RAG knowledge base with a single loaded collection of roughly 5 million vectors and an IVF_SQ8 index, running in Docker; memory usage is around 1.2 GB.
2. During queries, memory also stays steady at about 1.2 GB (on Windows WSL Docker it drops to roughly 300 MB when idle, while on Ubuntu it stays at 1.2 GB; I am not sure why).
3. While idle, the Milvus Docker container suddenly exits.

Here are the first few abnormal entries from the docker log:

[2024/11/07 11:55:48.225 +00:00] [WARN] [tso/tso.go:178] ["clock offset is huge, check network latency and clock skew"] [jet-lag=1m19.711448623s] [prev-physical=2024/11/07 11:54:28.514 +00:00] [now=2024/11/07 11:55:48.225 +00:00]
[2024/11/07 11:55:48.594 +00:00] [INFO] [datacoord/policy.go:338] ["node channel count is not much larger than average, skip reallocate"] [nodeID=9] [channelCount=3] [channelCountPerNode=3]
[2024/11/07 11:55:48.733 +00:00] [INFO] [observers/target_observer.go:463] ["Update readable segment version"] [collectionID=453695194926738545] [channelName=by-dev-rootcoord-dml_2_453695194926738545v0] [nodeID=9] [oldVersion=1730980457263365908] [newVersion=1730980467300108282]
[2024/11/07 11:55:48.738 +00:00] [WARN] [sessionutil/session_util.go:530] ["session keepalive channel closed"]
[2024/11/07 11:55:48.738 +00:00] [INFO] [sessionutil/session_util.go:538] ["keepAlive channel close caused by etcd, try to KeepAliveOnce"] [serverName=indexnode]
[2024/11/07 11:55:48.740 +00:00] [WARN] [sessionutil/session_util.go:530] ["session keepalive channel closed"]
[2024/11/07 11:55:48.740 +00:00] [INFO] [sessionutil/session_util.go:538] ["keepAlive channel close caused by etcd, try to KeepAliveOnce"] [serverName=rootcoord]
[2024/11/07 11:55:48.740 +00:00] [WARN] [sessionutil/session_util.go:530] ["session keepalive channel closed"]

milvus_log_20241107.txt
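
One way to catch the stall behind the "session keepalive channel closed" lines is to poll etcd health while the system is idle, then correlate any latency spike with the next exit. A sketch, again assuming the default standalone endpoint:

```python
# Poll etcd health every few seconds and log response latency, so a future
# "keepalive channel closed" event can be matched against an etcd stall.
import datetime
import subprocess
import time

while True:
    start = time.monotonic()
    result = subprocess.run(
        ["etcdctl", "--endpoints", "http://localhost:2379", "endpoint", "health"],
        capture_output=True, text=True,
    )
    elapsed = time.monotonic() - start
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    print(f"{stamp} rc={result.returncode} latency={elapsed:.2f}s {result.stdout.strip()}")
    time.sleep(5)
```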

@xiaofan-luan
Collaborator

@colanyanya
I suspect etcd stopped working.
How did you deploy? Is Milvus running in a separate container? (a quick container triage sketch follows)
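
A quick triage sketch for a docker-compose standalone deployment; the container names follow Milvus's default docker-compose.yml and are assumptions if they were renamed:

```python
# Confirm the milvus/etcd/minio containers are running and whether any was
# OOM-killed, then tail etcd's log for errors around the crash time.
import subprocess

for name in ("milvus-standalone", "milvus-etcd", "milvus-minio"):
    subprocess.run([
        "docker", "inspect", "-f",
        "{{.Name}}: {{.State.Status}} OOMKilled={{.State.OOMKilled}}",
        name,
    ])

subprocess.run(["docker", "logs", "--tail", "100", "milvus-etcd"])
```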

@xiaofan-luan
Collaborator

> I ran into a similar problem. […]

Make sure etcd was still working at that time and that it is deployed on SSDs.
