
[Bug]: Milvus Exits Suddenly During Data Ingestion #36645

Open
1 task done
AhmedAl-Zanam opened this issue Oct 4, 2024 · 7 comments
Assignees
Labels
kind/bug Issues or changes related to a bug triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@AhmedAl-Zanam

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: v2.4.12
- Deployment mode (standalone or cluster): standalone
- MQ type (rocksmq, pulsar, or kafka): none
- SDK version (e.g. pymilvus v2.0.0rc2): 2.4.0
- OS (Ubuntu or CentOS): Ubuntu
- CPU/Memory: Intel Xeon Gold 6248, 16 cores / 540 GB
- GPU: none
- Others:

Current Behavior

We are experiencing an issue where Milvus exits suddenly while we are ingesting data. This unexpected termination disrupts our data processing pipeline and affects the overall stability of our system. Despite assigning substantial RAM and many CPU cores to the Milvus server, the problem persists.

Expected Behavior

Milvus should handle the data ingestion process without exiting unexpectedly.

Steps To Reproduce

1. Start the Milvus server.
2. Begin ingesting a large dataset into Milvus (a minimal ingestion sketch follows below).
3. Observe the server behavior during the ingestion process.
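
A minimal sketch of step 2 with pymilvus 2.4.x; the endpoint, collection name, vector dimension, and batch sizes are illustrative assumptions rather than details from this report:

```python
# Sustained bulk ingestion against a local standalone Milvus.
# Collection name, dim, and batch counts are assumptions for illustration.
import random
from pymilvus import MilvusClient, DataType

client = MilvusClient(uri="http://localhost:19530")

schema = MilvusClient.create_schema(auto_id=True)
schema.add_field("id", DataType.INT64, is_primary=True)
schema.add_field("vec", DataType.FLOAT_VECTOR, dim=768)
client.create_collection("ingest_repro", schema=schema)

# Keep inserting batches; the reported crash happens partway through a run like this.
for _ in range(1000):
    rows = [{"vec": [random.random() for _ in range(768)]} for _ in range(5000)]
    client.insert(collection_name="ingest_repro", data=rows)
```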

Milvus Log

_milvus23-standalone_logs (1).txt

Anything else?

The issue occurs consistently during large data ingestions.
We have verified that the etcd service is running and accessible.
Network connectivity between Milvus and etcd appears to be stable.
We have assigned a significant amount of RAM and CPU cores to the Milvus server, but the problem remains.

@AhmedAl-Zanam AhmedAl-Zanam added kind/bug Issues or changes related to a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 4, 2024
@xiaofan-luan
Collaborator


From the log you offered:

  1. the clock offset seems to be huge (> 30s); a quick clock-sync check is sketched below
  2. there is an etcd session timeout
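
A quick way to rule out a host-clock problem behind point 1; this assumes an Ubuntu host with systemd, plus chrony if installed. Note the offset can also appear when the Milvus process itself stalls (for example on slow I/O) rather than from real clock skew:

```python
# Check whether the host clock is actually skewed; assumes an Ubuntu host.
# If the clock is synchronized, the "jet-lag" more likely means the process stalled.
import subprocess

# "System clock synchronized: yes" here rules out plain NTP misconfiguration.
subprocess.run(["timedatectl", "status"])

# If chrony is installed, this prints the measured offset from the NTP sources.
subprocess.run(["chronyc", "tracking"])
```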

@xiaofan-luan
Collaborator

["Slow etcd operation save"] ["time spent"=14.360227492s] [key=by-dev/kv/gid/timestamp]

and etcd is slow too.
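
That 14.36s save can be checked directly against etcd. A small probe sketch, assuming the default standalone endpoint on localhost:2379 and an etcdctl v3 binary on the PATH:

```python
# Time a single small etcd put; it should take milliseconds, not seconds.
# The endpoint is an assumption for a default standalone deployment.
import subprocess
import time

ENDPOINT = "http://localhost:2379"

start = time.monotonic()
subprocess.run(
    ["etcdctl", "--endpoints", ENDPOINT, "put", "milvus-latency-probe", "x"],
    check=True,
)
print(f"etcd put took {time.monotonic() - start:.3f}s")

# etcd's built-in benchmark reports put latency and throughput percentiles.
subprocess.run(["etcdctl", "--endpoints", ENDPOINT, "check", "perf"], check=True)
```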

@xiaofan-luan
Collaborator

Can you confirm:

  1. etcd is deployed on an SSD volume? (a disk-latency check is sketched below)
  2. etcd CPU and memory usage were healthy at that time?
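
A sketch of the disk check behind question 1, using the fio invocation etcd's documentation recommends for validating WAL fsync latency; the test directory is an assumption and should point into the volume backing etcd's data dir:

```python
# Run etcd's recommended fio disk test inside the etcd data volume.
# /var/lib/etcd-fio-test is an assumed path; point it at your etcd volume.
import os
import subprocess

test_dir = "/var/lib/etcd-fio-test"
os.makedirs(test_dir, exist_ok=True)

subprocess.run(
    [
        "fio", "--rw=write", "--ioengine=sync", "--fdatasync=1",
        f"--directory={test_dir}", "--size=22m", "--bs=2300",
        "--name=etcd-disk-check",
    ],
    check=True,
)
# On an SSD-backed volume, fdatasync p99 should be well under 10ms; values in
# the hundreds of ms (or seconds, as in the log above) point at slow storage.
```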

@yanliang567
Contributor

/assign @AhmedAl-Zanam
/unassign

@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 8, 2024
@colanyanya

colanyanya commented Nov 8, 2024

I ran into a similar problem.

Environment

- Milvus version: v2.4.5
- Deployment mode (standalone or cluster): standalone
- MQ type (rocksmq, pulsar, or kafka): none
- SDK version (e.g. pymilvus v2.0.0rc2): pymilvus 2.4.8, langchain_milvus 0.1.5
- OS (Ubuntu or CentOS): Ubuntu
- CPU/Memory: Intel(R) Xeon(R) Platinum, 8 cores / 16 GB (Alibaba Cloud)
- GPU: none
- Others:

What happened

1. A small RAG knowledge base with a single loaded collection of roughly 5 million vectors and an IVF_SQ8 index, running in Docker; memory usage is around 1.2 GB.
2. During queries, memory also stays steady at about 1.2 GB (on Windows WSL Docker it drops to roughly 300 MB when idle, while on Ubuntu it stays at 1.2 GB; I am not sure why).
3. While idle, the Milvus Docker container suddenly exits.

Here are the first few abnormal entries from the docker log:

[2024/11/07 11:55:48.225 +00:00] [WARN] [tso/tso.go:178] ["clock offset is huge, check network latency and clock skew"] [jet-lag=1m19.711448623s] [prev-physical=2024/11/07 11:54:28.514 +00:00] [now=2024/11/07 11:55:48.225 +00:00]
[2024/11/07 11:55:48.594 +00:00] [INFO] [datacoord/policy.go:338] ["node channel count is not much larger than average, skip reallocate"] [nodeID=9] [channelCount=3] [channelCountPerNode=3]
[2024/11/07 11:55:48.733 +00:00] [INFO] [observers/target_observer.go:463] ["Update readable segment version"] [collectionID=453695194926738545] [channelName=by-dev-rootcoord-dml_2_453695194926738545v0] [nodeID=9] [oldVersion=1730980457263365908] [newVersion=1730980467300108282]
[2024/11/07 11:55:48.738 +00:00] [WARN] [sessionutil/session_util.go:530] ["session keepalive channel closed"]
[2024/11/07 11:55:48.738 +00:00] [INFO] [sessionutil/session_util.go:538] ["keepAlive channel close caused by etcd, try to KeepAliveOnce"] [serverName=indexnode]
[2024/11/07 11:55:48.740 +00:00] [WARN] [sessionutil/session_util.go:530] ["session keepalive channel closed"]
[2024/11/07 11:55:48.740 +00:00] [INFO] [sessionutil/session_util.go:538] ["keepAlive channel close caused by etcd, try to KeepAliveOnce"] [serverName=rootcoord]
[2024/11/07 11:55:48.740 +00:00] [WARN] [sessionutil/session_util.go:530] ["session keepalive channel closed"]

milvus_log_20241107.txt
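
One way to catch the stall behind the "session keepalive channel closed" lines is to poll etcd health while the system is idle, then correlate any latency spike with the next exit. A sketch, again assuming the default standalone endpoint:

```python
# Poll etcd health every few seconds and log response latency, so a future
# "keepalive channel closed" event can be matched against an etcd stall.
import datetime
import subprocess
import time

while True:
    start = time.monotonic()
    result = subprocess.run(
        ["etcdctl", "--endpoints", "http://localhost:2379", "endpoint", "health"],
        capture_output=True, text=True,
    )
    elapsed = time.monotonic() - start
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    print(f"{stamp} rc={result.returncode} latency={elapsed:.2f}s {result.stdout.strip()}")
    time.sleep(5)
```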

@xiaofan-luan
Collaborator

@colanyanya
I suspect etcd stopped working.
How did you deploy? Is Milvus running in a separate container? (a quick container triage sketch follows)
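
A quick triage sketch for a docker-compose standalone deployment; the container names follow Milvus's default docker-compose.yml and are assumptions if they were renamed:

```python
# Confirm the milvus/etcd/minio containers are running and whether any was
# OOM-killed, then tail etcd's log for errors around the crash time.
import subprocess

for name in ("milvus-standalone", "milvus-etcd", "milvus-minio"):
    subprocess.run([
        "docker", "inspect", "-f",
        "{{.Name}}: {{.State.Status}} OOMKilled={{.State.OOMKilled}}",
        name,
    ])

subprocess.run(["docker", "logs", "--tail", "100", "milvus-etcd"])
```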

@xiaofan-luan
Collaborator

> I ran into a similar problem. […]

Make sure etcd was still working at that time and that it is deployed on SSDs.
