[Feature]: Stop Milvus being a joke #36703

eaxis · 2024-10-08T21:51:30Z

Is there an existing issue for this?

I have searched the existing issues

Is your feature request related to a problem? Please describe.

Hey guys, I have recently switched from Chroma to Milvus thinking that Milvus is a more mature, serious and production-ready solution... and I'm absolutely disappointed - just wasted my time.

For the context, I have ~80k records in 2 collections defined by the following schema:

		// version v2.4.12
		schema := entity.NewSchema().
			WithName(collectionName).
			WithField(entity.NewField().WithName("id").WithDataType(entity.FieldTypeVarChar).WithMaxLength(100).WithIsPrimaryKey(true).WithIsAutoID(false)).
			WithField(entity.NewField().WithName("document").WithDataType(entity.FieldTypeVarChar).WithMaxLength(50000)).
			WithField(entity.NewField().WithName("vectors").WithDataType(entity.FieldTypeFloatVector).WithDim(2000)).
			WithField(entity.NewField().WithName("tag").WithDataType(entity.FieldTypeVarChar).WithMaxLength(100)).
			WithField(entity.NewField().WithName("timestamp").WithDataType(entity.FieldTypeInt64)).
			WithField(entity.NewField().WithName("source").WithDataType(entity.FieldTypeVarChar).WithMaxLength(100))

and Milvus takes ~77GB on the SSD to store that content. Meanwhile Chroma takes ~3 GB and pgvector takes ~2 GB.

With that in mind, could you please clarify the following points:

Do you really think your users have infinite disk space / money to store everything you want? Some links:

Version 2.2.8: The disk space usage is too large #24051

The next thing is collection management (e.g. load / unload). This seems like a pretty internal operation - would you consider making it optional, with the ability to assign a default strategy? Like 'load if not loaded' or something like that. I'm building a multi-threaded application, and managing a third party's internal state by hand is crazy.

Last thing: Do you really think your users don't need to count stored documents? The fact that Milvus can't even provide an exact number of managed documents is amusing and absurd at the same time.

Please note, I in no way mean to offend the maintainers, but bringing a small(/medium?) sized project with Milvus under the hood to a production level seems to be a joke.

Describe the solution you'd like.

No response

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

xiaofan-luan · 2024-10-09T04:59:41Z

Hi eaxis,
Thanks for the feedback. Here are some things you need to understand.

Disk usage
It depends heavily on how you deploy Milvus. We recommend deploying in 3-replica mode; however, if you deploy in standalone mode, everything has only one replica.

There may be many other reasons for this storage expansion, for example:

The WAL (RocksMQ if you are using standalone) costs 1x the data size on disk.
Raw data on MinIO costs 1x (for disaster recovery it costs 3x).
Index data on MinIO costs another 1x.
If you flush manually, it triggers frequent compaction, and all of that takes extra disk space until garbage collection happens. (Be careful about manual flush! Don't do it unless you know what you are doing.)
Garbage collection usually completes within a couple of days (you can tune it to reduce the garbage collection time).

Also, we believe that disk is the cheapest resource, so it doesn't make much sense to worry about it too much. Memory and CPU are definitely more expensive resources. v2.2.8 is a very old version (released 2 years ago) and we haven't seen any such reports recently.
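
As a rough sanity check on the multipliers listed above, the sketch below estimates an upper bound on the on-disk size for the schema in this issue. It assumes 80k rows in total, float32 vectors, every document at the 50,000-byte VarChar limit, and an index about as large as the raw data. None of these figures are measured, and the variable names are illustrative.

```python
# Back-of-envelope estimate; every constant below is an assumption taken from
# the schema in the issue or from the 1x multipliers quoted in the reply.
ROWS = 80_000                       # assumed total across both collections
DIM = 2_000                         # vector dimension from the schema
BYTES_PER_FLOAT32 = 4
MAX_DOC_BYTES = 50_000              # worst case: every document hits the limit
SMALL_FIELDS_BYTES = 3 * 100 + 8    # id, tag, source (max 100 each) + int64 timestamp

vector_bytes = ROWS * DIM * BYTES_PER_FLOAT32
doc_bytes = ROWS * MAX_DOC_BYTES
small_bytes = ROWS * SMALL_FIELDS_BYTES
raw_bytes = vector_bytes + doc_bytes + small_bytes

# WAL (1x) + raw data (1x) + index (assumed 1x, likely an overestimate).
steady_state_bytes = 3 * raw_bytes

GB = 1000 ** 3
print(f"vectors:      {vector_bytes / GB:.2f} GB")
print(f"documents:    {doc_bytes / GB:.2f} GB")
print(f"raw total:    {raw_bytes / GB:.2f} GB")
print(f"steady state: {steady_state_bytes / GB:.2f} GB")
```

Under these assumptions the steady-state total comes to roughly 14 GB, so WAL, raw data, and index alone would not account for ~77 GB. Segments left behind by compaction and not yet garbage collected would be the more likely place to look.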

Collection management
You can create the collection, create the index, and load it before you insert data and search. The ability to load gives you the option to reduce memory consumption if a collection is not needed for now. If you want a quick setup, the easiest way is:

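from pymilvus import MilvusClient

# Connect to a local Milvus instance on the default port.
client = MilvusClient(
    uri="http://localhost:19530"
)

# Quick setup: only a collection name and the vector dimension are given.
# In this mode Milvus is documented to create the index and load the
# collection for you, so no separate load call is needed.
client.create_collection(
    collection_name="quick_setup",
    dimension=256
)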
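
For the 'load if not loaded' strategy asked about in the issue, a thin wrapper on the application side is one option. The following is a minimal sketch, not part of pymilvus: `ensure_loaded` and `FakeClient` are hypothetical names, and it assumes the client exposes `get_load_state` (returning a dict with a `"state"` entry) and `load_collection`, as `pymilvus.MilvusClient` does in recent releases.

```python
import threading

_load_lock = threading.Lock()  # serializes the check-then-load across threads


def ensure_loaded(client, collection_name):
    """Load the collection only if it is not already loaded.

    Returns True if a load call was issued, False if it was already loaded.
    """
    with _load_lock:
        state = client.get_load_state(collection_name=collection_name)["state"]
        # pymilvus reports an enum here; compare by name so this sketch
        # does not need to import it (a plain string also works).
        if getattr(state, "name", str(state)) == "Loaded":
            return False
        client.load_collection(collection_name=collection_name)
        return True


class FakeClient:
    """Stand-in so the sketch runs without a Milvus server (hypothetical)."""

    def __init__(self):
        self.loaded = set()
        self.load_calls = 0

    def get_load_state(self, collection_name):
        return {"state": "Loaded" if collection_name in self.loaded else "NotLoad"}

    def load_collection(self, collection_name):
        self.load_calls += 1
        self.loaded.add(collection_name)


fake = FakeClient()
first = ensure_loaded(fake, "quick_setup")   # issues a load
second = ensure_loaded(fake, "quick_setup")  # already loaded, no-op
print(first, second, fake.load_calls)
```

With a real deployment, a `MilvusClient` instance would be passed in place of `FakeClient`. The lock only protects callers within one process; it does not coordinate across processes.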

Count
We do offer a count operation; see https://milvus.io/docs/get-and-scalar-query.md

I think you probably need a little patience to learn the details of Milvus. It is designed for scale and performance (you will see this if you do some profiling, or once you have more than 10M entities). The largest deployment we have holds more than 10B. But I admit its API is not as straightforward as Chroma's or other databases' because of its complicated functionality and architecture.

xiaofan-luan · 2024-10-09T05:03:40Z

Please also check https://zilliz.com/, which is our cloud offering. It's fully managed by us, and 80k entities is covered by our free tier.

BTW, I didn't notice you said you only have 80k entities. There is no way to reach 80GB of storage no matter what you do; 80k is too small an amount of data. So I guess you need to check your code:

Avoid flushing on every write (this is most likely what happened).
How large is your document field? Enabling compression might help reduce the document size, but we don't do it by default. This could be something we can improve.
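
The issue does not show the insert path, so the following is only an illustration of the 'avoid flushing on every write' advice: a minimal sketch that sends rows in batches through `insert` and never calls flush itself, leaving that to Milvus. `insert_in_batches`, `batch_size`, and `RecordingClient` are hypothetical names; a real `pymilvus.MilvusClient` would be passed instead of the stand-in.

```python
from itertools import islice


def insert_in_batches(client, collection_name, rows, batch_size=1000):
    """Insert rows in fixed-size batches without any explicit flush.

    `rows` is any iterable of dicts keyed by field name.
    Returns the number of rows sent.
    """
    it = iter(rows)
    total = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return total
        client.insert(collection_name=collection_name, data=batch)
        total += len(batch)


class RecordingClient:
    """Stand-in that records batch sizes instead of talking to Milvus."""

    def __init__(self):
        self.batch_sizes = []

    def insert(self, collection_name, data):
        self.batch_sizes.append(len(data))


demo = RecordingClient()
sent = insert_in_batches(demo, "docs", ({"id": str(i)} for i in range(2500)))
print(sent, demo.batch_sizes)
```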

xiaofan-luan · 2024-10-09T07:00:39Z

/assign @eaxis

eaxis · 2024-10-10T08:50:48Z

3. we do offer count operation

Yeah, that's correct, thank you for your patience. The thing is that Attu, the Milvus UI, has a warning that the number of records in a collection may not be accurate because of the way Milvus works, so I hadn't looked into that in detail.

However, I opted for pgvector, which seemed more stable and predictable to me. Wish you luck!

xiaofan-luan · 2024-10-10T14:39:04Z

Sure, for 80k vectors, pgvector is a great option! Enjoy it.

For anyone who finds this issue through search: Attu uses meta info to show the count, so it is not accurate. The count operation is accurate but usually takes more resources to execute.
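
The difference between the two counts can be sketched as follows. This assumes the client behaves like `pymilvus.MilvusClient`, where `get_collection_stats` returns a dict with `row_count` (metadata-based, and presumably close to what Attu displays) and `query` with `output_fields=["count(*)"]` returns the exact count. The helper names and `StubClient` are hypothetical, and the stub's numbers are made up for the demo.

```python
def approximate_count(client, collection_name):
    """Cheap, metadata-based row count; may lag behind recent writes and deletes."""
    stats = client.get_collection_stats(collection_name=collection_name)
    return int(stats["row_count"])


def exact_count(client, collection_name):
    """Exact count via a count(*) query; costs more to execute."""
    result = client.query(
        collection_name=collection_name,
        output_fields=["count(*)"],
    )
    return int(result[0]["count(*)"])


class StubClient:
    """Stand-in returning canned values so the sketch runs without a server."""

    def get_collection_stats(self, collection_name):
        return {"row_count": 80_123}

    def query(self, collection_name, output_fields):
        return [{"count(*)": 80_000}]


stub = StubClient()
print(approximate_count(stub, "docs"), exact_count(stub, "docs"))
```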

eaxis added the kind/feature Issues related to feature request from users label Oct 8, 2024

eaxis assigned xiaofan-luan Oct 8, 2024

sre-ci-robot assigned eaxis Oct 9, 2024

xiaofan-luan unassigned eaxis and xiaofan-luan Oct 9, 2024

xiaofan-luan closed this as completed Oct 10, 2024
