[Feature]: Stop Milvus being a joke #36703

Closed
1 task done
eaxis opened this issue Oct 8, 2024 · 5 comments
Labels
kind/feature Issues related to feature request from users

Comments

@eaxis

eaxis commented Oct 8, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Is your feature request related to a problem? Please describe.

Hey guys, I have recently switched from Chroma to Milvus thinking that Milvus is a more mature, serious and production ready solution... and I'm absolutely disappointed - just wasted my time.

For the context, I have ~80k records in 2 collections defined by the following schema:

		// version v2.4.12
		schema := entity.NewSchema().
			WithName(collectionName).
			WithField(entity.NewField().WithName("id").WithDataType(entity.FieldTypeVarChar).WithMaxLength(100).WithIsPrimaryKey(true).WithIsAutoID(false)).
			WithField(entity.NewField().WithName("document").WithDataType(entity.FieldTypeVarChar).WithMaxLength(50000)).
			WithField(entity.NewField().WithName("vectors").WithDataType(entity.FieldTypeFloatVector).WithDim(2000)).
			WithField(entity.NewField().WithName("tag").WithDataType(entity.FieldTypeVarChar).WithMaxLength(100)).
			WithField(entity.NewField().WithName("timestamp").WithDataType(entity.FieldTypeInt64)).
			WithField(entity.NewField().WithName("source").WithDataType(entity.FieldTypeVarChar).WithMaxLength(100))

and Milvus takes ~77GB on the SSD to store that content. Meanwhile Chroma takes ~3 GB and pgvector takes ~2 GB.

With that in mind, could you please clarify the following points:

Do you really think your users have infinite disk space / money to store everything you want? Some links:

The next thing is collection management (e.g. load / unload). This seems like a fairly internal operation; would you consider making it optional, with the ability to assign a default strategy such as 'load if not loaded'? I'm building a multi-threaded application, and controlling a third party's internal state is crazy.

Last thing: Do you really think your users don't need to count stored documents? The fact that Milvus can't even provide an exact number of managed documents is amusing and absurd at the same time.

Please note, I in no way mean to offend maintainers, but bringing a small(/medium?) sized project with Milvus under the hood to a production level seems to be a joke

Describe the solution you'd like.

No response

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

@eaxis eaxis added the kind/feature Issues related to feature request from users label Oct 8, 2024
@xiaofan-luan
Collaborator

Hi eaxis,
Thanks for the feedback. Here are some things you need to understand:

  1. Disk usage
    It depends heavily on how you deploy Milvus. We recommend deploying in 3-replica mode; however, if you deploy in standalone mode, everything has only one replica.

There may be many other reasons for this storage expansion, for example:

  1. The WAL (RocksMQ if you are using standalone) costs 1x the data size on disk.
  2. Raw data on MinIO costs 1x (with disaster recovery it costs 3x).
  3. Index data on MinIO costs another 1x.
  4. If you flush manually, it triggers frequent compaction, and all of that takes extra disk space until garbage collection happens. (Be careful with manual flush! Don't do it unless you know what you are doing.)
  5. Garbage collection usually runs after a couple of days (you can tune this to shorten the garbage collection time).

Also, we believe that disk is the cheapest resource, so it doesn't make much sense to worry too much about it. Memory and CPU are definitely more expensive resources. V2.2.8 is a very old version (released 2 years ago) and we haven't seen any such reports recently.

  2. Collection management
    You can create the collection, create the index and load it before you insert data and search. Load gives you the option of reducing memory consumption when a collection is not needed for now. If you want a quick setup, the easiest way is:

		from pymilvus import MilvusClient, DataType

		client = MilvusClient(
		    uri="http://localhost:19530"
		)

		client.create_collection(
		    collection_name="quick_setup",
		    dimension=256
		)

  3. Count
    We do offer a count operation; see https://milvus.io/docs/get-and-scalar-query.md

I think you probably need a little patience to learn the details of Milvus. It is designed for scale and performance (you will see this if you do some profiling, or once you have more than 10M entities). The largest deployment we have holds more than 10B. But I admit its API is not as straightforward as Chroma's or other databases' because of its complex functionality and architecture.
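
The 'load if not loaded' strategy asked for in the issue can be approximated on the client side. The sketch below is one possible wrapper, not a Milvus feature: the class name `LoadOnce` is hypothetical, it assumes a client exposing `load_collection(collection_name)` as pymilvus's `MilvusClient` does, and it only tracks loads made through this wrapper in the current process (it will not notice a collection released elsewhere).

```python
import threading


class LoadOnce:
    """Thread-safe helper that loads each collection at most once.

    `client` is assumed to behave like pymilvus.MilvusClient, i.e. to
    offer load_collection(collection_name). This is an illustration,
    not part of the Milvus SDK.
    """

    def __init__(self, client):
        self._client = client
        self._loaded = set()
        self._lock = threading.Lock()

    def ensure_loaded(self, collection_name: str) -> None:
        # The lock serializes concurrent callers so that only the first
        # one issues load_collection() for a given collection name.
        with self._lock:
            if collection_name in self._loaded:
                return
            self._client.load_collection(collection_name)
            self._loaded.add(collection_name)
```

Worker threads would then call `ensure_loaded("my_collection")` before searching, instead of each tracking load state themselves.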

@xiaofan-luan
Collaborator

Please also check https://zilliz.com/, which is our cloud offering. It is fully managed by us, and 80k entities is covered by our free tier.

BTW, I didn't notice you said you only have 80k entities. There is no way to reach 80GB of storage no matter what you do; 80k is too small an amount of data. So I guess you need to check your code:

  1. Avoid flushing on every write (this is most likely what happened).
  2. How large is your document field? Enabling compression might help reduce the document size, but we don't do it by default. This could be something we can improve.
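
As a rough sanity check on the numbers in this thread, the raw payload implied by the schema can be estimated by hand. The sketch below assumes 80k entities in total, 4 bytes per float32 vector component, and every `document` field filled to its 50,000-byte maximum (an upper bound; real documents are probably smaller). The small `id`, `tag`, `source` and `timestamp` fields are ignored, and the 3x multiplier simply adds up the WAL, raw data and index copies listed earlier; it is an illustration, not a measured figure.

```python
# Back-of-the-envelope storage estimate for the schema in this issue.
# All inputs are assumptions taken from the thread, not measurements.

NUM_ENTITIES = 80_000        # "~80k records" (assumed to be the total)
VECTOR_DIM = 2000            # WithDim(2000)
BYTES_PER_FLOAT32 = 4        # FloatVector components are 32-bit floats
MAX_DOCUMENT_BYTES = 50_000  # WithMaxLength(50000), worst case per row
COPIES = 3                   # WAL + raw data + index, roughly 1x each

GB = 1000 ** 3


def estimate_bytes(num_entities: int = NUM_ENTITIES) -> dict:
    """Return an upper-bound size estimate in bytes for each component."""
    vectors = num_entities * VECTOR_DIM * BYTES_PER_FLOAT32
    documents = num_entities * MAX_DOCUMENT_BYTES
    raw = vectors + documents
    return {
        "vectors": vectors,
        "documents_max": documents,
        "raw_max": raw,
        "with_copies_max": raw * COPIES,
    }


if __name__ == "__main__":
    for name, size in estimate_bytes().items():
        print(f"{name}: {size / GB:.2f} GB")
```

Under these assumptions the worst case comes to about 14 GB, well below the reported ~77 GB, which is consistent with the suggestion that frequent flushes and not-yet-collected compaction garbage account for the rest.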

@xiaofan-luan
Collaborator

/assign @eaxis

@eaxis
Author

eaxis commented Oct 10, 2024

3. we do offer count operation

Yeah, that's correct, thank you for your patience. The thing is that Attu, which is Milvus's UI, shows a warning that the number of records in a collection may not be accurate because of the way Milvus works, so I hadn't looked into that in detail.

However, I opted for pgvector which seemed more stable and predictable to me, wish you luck!

@xiaofan-luan
Collaborator

Sure, for 80k vectors, pgvector is a great option! Enjoy it.

For anyone who finds this issue through search: Attu uses meta info to show the count, so it is not accurate. The count operation is accurate but usually takes more resources to execute.
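
A minimal sketch of the exact count mentioned here, assuming pymilvus's `MilvusClient.query` with `output_fields=["count(*)"]` as described in the docs linked earlier in the thread. The helper name `exact_count` is hypothetical, and it accepts any object with a compatible `query` method so it can be tried without a running server.

```python
def exact_count(client, collection_name: str) -> int:
    """Return the exact number of entities via a count(*) query.

    `client` is expected to behave like pymilvus.MilvusClient, whose
    query() returns a list of dicts such as [{"count(*)": 80000}].
    """
    rows = client.query(
        collection_name=collection_name,
        filter="",                  # empty filter: count every entity
        output_fields=["count(*)"],
    )
    return int(rows[0]["count(*)"])


# Usage against a real server (assumes the quick_setup collection exists):
#   from pymilvus import MilvusClient
#   client = MilvusClient(uri="http://localhost:19530")
#   print(exact_count(client, "quick_setup"))
```

Unlike the figure shown in Attu, this runs an actual query, so it costs more to execute on large collections.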
