[Feat]Copy-free save and load for cuckoo hashtable #243

Open

Lifann wants to merge 7 commits into tensorflow:master from Lifann:indep-table-save

Member

Lifann commented May 16, 2022 (edited)

Description

Brief Description of the PR:
Dynamic embedding tables can be too large to fit within memory limits, and saving or loading them through the traditional TensorFlow checkpoint mechanism consumes a lot of memory.
This PR provides a way to save dynamic embedding tables to files and load them back without copying the full volume of data.
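
The idea can be illustrated with a minimal Python sketch (not the PR's C++ implementation): pairs are streamed through a bounded buffer into separate key and value files, so peak memory depends on the buffer size rather than on the table size. The function names, the int64/float32 record layout, the `.values` suffix, and the rename from a temporary file are assumptions for illustration; only the `.keys` and `.keys.tmp` suffixes appear in the reviewed diff.

```python
import os
import struct


def _flush(key_file, value_file, keys, values):
    """Write the buffered keys (int64) and values (float32), then empty the buffers."""
    key_file.write(struct.pack(f"<{len(keys)}q", *keys))
    value_file.write(struct.pack(f"<{len(values)}f", *values))
    keys.clear()
    values.clear()


def save_to_file(items, dim, filepath, buffer_items=1024):
    """Stream (key, vector) pairs to <filepath>.keys and <filepath>.values.

    At most `buffer_items` pairs are held in memory at a time.
    Returns the number of pairs written.
    """
    key_tmp, value_tmp = filepath + ".keys.tmp", filepath + ".values.tmp"
    keys, values, count = [], [], 0
    with open(key_tmp, "wb") as key_file, open(value_tmp, "wb") as value_file:
        for key, vector in items:
            if len(vector) != dim:
                raise ValueError(f"expected {dim} values, got {len(vector)}")
            keys.append(key)
            values.extend(vector)
            count += 1
            if len(keys) == buffer_items:
                _flush(key_file, value_file, keys, values)
        if keys:
            _flush(key_file, value_file, keys, values)
    # Assumption: publish the files only after both were written completely.
    os.replace(key_tmp, filepath + ".keys")
    os.replace(value_tmp, filepath + ".values")
    return count


def load_from_file(dim, filepath, buffer_items=1024):
    """Yield (key, vector) pairs, reading at most `buffer_items` pairs per chunk."""
    with open(filepath + ".keys", "rb") as key_file, \
         open(filepath + ".values", "rb") as value_file:
        while True:
            key_bytes = key_file.read(8 * buffer_items)
            if not key_bytes:
                break  # also covers an empty table: nothing is yielded
            n = len(key_bytes) // 8
            keys = struct.unpack(f"<{n}q", key_bytes)
            flat = struct.unpack(f"<{n * dim}f", value_file.read(4 * n * dim))
            for i, key in enumerate(keys):
                yield key, list(flat[i * dim:(i + 1) * dim])
```

Because `load_from_file` is a generator, a caller can insert each chunk into the table as it arrives instead of materializing every key and value first.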

Type of change

Checklist:

I've properly formatted my code according to the guidelines
- By running yapf
- By running clang-format
This PR addresses an already submitted issue for TensorFlow Recommenders-Addons
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works

How Has This Been Tested?

Yes

Lifann requested a review from rhdong as a code owner

May 16, 2022 06:10

google-cla bot commented May 16, 2022

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

For more information, open the CLA check for this pull request.

Lifann force-pushed the indep-table-save branch 2 times, most recently from 3fdfd57 to 1d4fe07 Compare

May 16, 2022 06:15

rhdong reviewed

tensorflow_recommenders_addons/dynamic_embedding/core/kernels/cuckoo_hashtable_op.cc

                   return table_->export_values(ctx, value_dim);
                 }
+                Status Save(OpKernelContext* ctx, const string filepath,
+                            const size_t buffer_size) {
+                  int64 value_dim = value_shape_.dim_size(0);

Member

rhdong May 26, 2022

int64_t

Member Author

Lifann Jun 10, 2022

tensorflow::int64 is returned from dim_size and also used in table_->save()

rhdong reviewed

tensorflow_recommenders_addons/dynamic_embedding/core/kernels/cuckoo_hashtable_op.cc

+                Status Load(OpKernelContext* ctx, const string filepath,
+                            const size_t buffer_size) {
+                  int64 value_dim = value_shape_.dim_size(0);

Member

rhdong May 26, 2022 (edited)

int64_t

rhdong reviewed

tensorflow_recommenders_addons/dynamic_embedding/core/kernels/cuckoo_hashtable_op.cc Outdated

                   return table_->export_values(ctx, value_dim);
                 }
+                Status Save(OpKernelContext* ctx, const string filepath,

Member

rhdong May 26, 2022

SaveToFile might be a better name, since it leaves room for future extensions.

Member Author

Lifann Jun 10, 2022

Accept.

rhdong reviewed

tensorflow_recommenders_addons/dynamic_embedding/core/kernels/cuckoo_hashtable_op.cc Outdated

+                Status Save(OpKernelContext* ctx, const string filepath,
+                            const size_t buffer_size) {
+                  int64 value_dim = value_shape_.dim_size(0);
+                  return table_->save(ctx, value_dim, filepath, buffer_size);

Member

rhdong May 26, 2022

SaveToFile might be a better name, since it leaves room for future extensions.

Member Author

Lifann Jun 10, 2022

Accept

rhdong reviewed

tensorflow_recommenders_addons/dynamic_embedding/core/kernels/cuckoo_hashtable_op.cc Outdated

+                  return table_->save(ctx, value_dim, filepath, buffer_size);
+                }
+                Status Load(OpKernelContext* ctx, const string filepath,

Member

rhdong May 26, 2022

LoadFromFile might be a better name, since it leaves room for future extensions.

Member Author

Lifann Jun 10, 2022

Accept

rhdong reviewed

tensorflow_recommenders_addons/dynamic_embedding/core/kernels/cuckoo_hashtable_op.cc Outdated

+                Status Load(OpKernelContext* ctx, const string filepath,
+                            const size_t buffer_size) {
+                  int64 value_dim = value_shape_.dim_size(0);
+                  return table_->load(ctx, value_dim, filepath, buffer_size);

Member

rhdong May 26, 2022

LoadFromFile

rhdong reviewed

tensorflow_recommenders_addons/dynamic_embedding/core/kernels/lookup_impl/lookup_table_op_gpu.h Outdated

@@ -49,6 +49,84 @@ struct ValueArray : public ValueArrayBase<V> {
               template <class T>
               using ValueType = ValueArrayBase<T>;
+              template <typename T>
+              class HostFileBuffer {

Member

rhdong May 26, 2022

This looks like a duplicate of HostFileBuffer in lookup_table_op_cpu.h. If so, I recommend moving both to tensorflow_recommenders_addons/dynamic_embedding/core/utils/host_file_buffer.h

Member Author

Lifann Jun 10, 2022

Accept

rhdong reviewed

tensorflow_recommenders_addons/dynamic_embedding/core/ops/cuckoo_hashtable_ops.cc (resolved)

rhdong reviewed

tensorflow_recommenders_addons/dynamic_embedding/python/ops/cuckoo_hashtable_ops.py

@@ -338,6 +338,53 @@ def export(self, name=None):
                           self.resource_handle, self._key_dtype, self._value_dtype)
                   return keys, values
+                def save(self, filepath, buffer_size=4194304, name=None):

Member

rhdong May 26, 2022

save_to_file would be better, because save_to_hdfs is possible in the future.

Member Author

Lifann Jun 10, 2022

Accept

rhdong reviewed

tensorflow_recommenders_addons/dynamic_embedding/python/ops/cuckoo_hashtable_ops.py

+                          value_dtype=self._value_dtype,
+                          buffer_size=buffer_size)
+                def load(self, filepath, buffer_size=4194304, name=None):

Member

rhdong May 26, 2022 (edited)

load_from_file or load_from_localfile

rhdong force-pushed the indep-table-save branch from 1d4fe07 to 9df45e4 Compare

June 7, 2022 03:45

rhdong changed the title ~~Copy-free save and load for cuckoo hashtable~~ [Feat]Copy-free save and load for cuckoo hashtable

Contributor

acmore commented Jun 9, 2022 (edited by Lifann)

This feature is very useful. Looking forward to it.

May I ask how to use it? In our case, we use Estimator and save a checkpoint every epoch. Should we write a custom saver to save the tables manually? If so, how can we get all the tables to save?

Currently it only supports usage like save_op = table.save(path). In eager mode, this is Pythonic and simple. In graph mode, the op has to be managed on the graph, for example through a SessionRunHook or by running it as a sub-branch of the graph.

acmore reviewed

tensorflow_recommenders_addons/dynamic_embedding/core/kernels/cuckoo_hashtable_op_gpu.cu.cc

-                  size_t new_max_size = max_size_;
+                  size_t capacity = table_->get_capacity();
+                  size_t cur_size = table_->get_size(stream);

Contributor

acmore Jun 9, 2022

According to our profiling results, this kernel is expensive. I suggest caching the last size and checking whether the table needs to expand given the number of keys expected to be added. If it does not, we could add the expected count to the cached size and return. This would reduce the number of calls to get_size.

Member Author

Lifann Jun 10, 2022 (edited)

A cached size lags behind the real-time size. Save and load are usually low-frequency operations, so the gap between the recorded size and the real-time size may become large.
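
The trade-off in this exchange can be sketched in Python with a hypothetical stand-in table (the real kernel is CUDA C++). Every class and method name here is an illustrative assumption, not the PR's API. The guard treats the cached size as a pessimistic upper bound and only pays for a real size query when that bound crosses the load threshold. It does not address the staleness concern when other code paths, such as a load or a removal, change the table behind the guard's back.

```python
class CountingTable:
    """Stand-in for a GPU table whose size query is expensive."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.keys = set()
        self.size_queries = 0  # how often the expensive query ran

    def get_size(self):
        self.size_queries += 1
        return len(self.keys)

    def insert(self, new_keys):
        self.keys.update(new_keys)


class CachedSizeGuard:
    """Decides whether to expand, using a cached upper bound on the size."""

    def __init__(self, table, load_factor=0.75):
        self.table = table
        self.load_factor = load_factor
        self.cached_size = table.get_size()

    def reserve(self, expected):
        """Make room for `expected` keys; return True if the table was expanded."""
        limit = self.table.capacity * self.load_factor
        if self.cached_size + expected <= limit:
            # Cheap path: assume every expected key is new.
            self.cached_size += expected
            return False
        # The bound crossed the threshold: refresh it with the real size.
        needed = self.table.get_size() + expected
        grew = False
        while needed > self.table.capacity * self.load_factor:
            self.table.capacity *= 2
            grew = True
        self.cached_size = needed
        return grew
```

If the same keys are inserted repeatedly, the cached bound overestimates the size, and the refresh on the slow path corrects it without expanding the table.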

acmore reviewed

tensorflow_recommenders_addons/dynamic_embedding/core/kernels/lookup_impl/lookup_table_op_cpu.h Outdated


		~HostFileBuffer() { Close(); }

		void Put(const T value) {

Contributor

acmore Jun 9, 2022

May I ask whether this works when T is tstring? I remember that the key is sometimes a tstring.

Member Author

Lifann Jun 10, 2022

I haven't used a string key type. I think it can be solved by abstracting the host buffer class and using template specialization.
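
A minimal Python sketch of why string keys need separate handling: fixed-width keys can be written as one contiguous block, while variable-length keys need some framing, here a 4-byte length prefix. The function names and the on-disk layout are assumptions for illustration and are not taken from the PR. In C++ the two cases would map onto the template specialization mentioned above.

```python
import io
import struct


def write_int64_keys(stream, keys):
    """Fixed-width keys: the whole buffer is written as one contiguous block."""
    stream.write(struct.pack(f"<{len(keys)}q", *keys))


def read_int64_keys(stream):
    """Read back every int64 key remaining in the stream."""
    data = stream.read()
    return list(struct.unpack(f"<{len(data) // 8}q", data))


def write_string_keys(stream, keys):
    """Variable-length keys: each record carries a 4-byte length prefix."""
    for key in keys:
        payload = key.encode("utf-8")
        stream.write(struct.pack("<I", len(payload)))
        stream.write(payload)


def read_string_keys(stream):
    """Read length-prefixed records until the stream is exhausted."""
    keys = []
    while True:
        header = stream.read(4)
        if not header:
            break
        (length,) = struct.unpack("<I", header)
        keys.append(stream.read(length).decode("utf-8"))
    return keys
```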

acmore reviewed

tensorflow_recommenders_addons/dynamic_embedding/core/kernels/lookup_impl/lookup_table_op_cpu.h Outdated

+                  size_t key_buffer_size = buffer_size;
+                  string key_tmpfile = filepath + ".keys.tmp";
+                  string key_file = filepath + ".keys";
+                  auto key_buffer = HostFileBuffer<K>(ctx, key_tmpfile, key_buffer_size,

Contributor

acmore Jun 9, 2022 (edited)

Would it be better to abstract the file out of the HostFileBuffer class? We depend heavily on HDFS and would like to add HDFS support based on your work. Thanks.

Member Author

Lifann Jun 10, 2022

Accept
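
One way to picture the requested abstraction, as a Python sketch rather than the PR's C++ design: the buffer only depends on an object with a `write(bytes)` method, so a local file, an in-memory stream, or a remote client can be plugged in. `HostBuffer` and `CountingSink` are hypothetical names, and `CountingSink` merely stands in for something like an HDFS writer.

```python
import io
import struct


class HostBuffer:
    """Bounded buffer of int64 values that flushes to any object with write(bytes)."""

    def __init__(self, sink, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._sink = sink
        self._capacity = capacity
        self._items = []

    def put(self, value):
        """Buffer one value and flush when the buffer is full."""
        self._items.append(value)
        if len(self._items) >= self._capacity:
            self.flush()

    def flush(self):
        """Write any buffered values to the sink as little-endian int64."""
        if self._items:
            self._sink.write(struct.pack(f"<{len(self._items)}q", *self._items))
            self._items.clear()


class CountingSink:
    """Hypothetical stand-in for a remote writer such as an HDFS client."""

    def __init__(self):
        self.chunks = []

    def write(self, data):
        self.chunks.append(bytes(data))
        return len(data)
```

The same `HostBuffer` works unchanged with `io.BytesIO()` or a file opened in binary mode, which is the point of keeping the storage backend outside the buffer class.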

acmore reviewed

tensorflow_recommenders_addons/dynamic_embedding/core/kernels/lookup_impl/lookup_table_op_gpu.h

+                  size_t total_keys = 0;
+                  size_t total_values = 0;
+                  while (nkeys > 0) {
+                    nkeys = key_buffer.Fill();

Contributor

acmore Jun 9, 2022

Is it possible that nkeys is 0?

Member Author

Lifann Jun 10, 2022

If nkeys is 0, then it leaves an empty file.

Lifann force-pushed the indep-table-save branch 4 times, most recently from dfc5667 to cf6c24f Compare

June 10, 2022 14:36

Lifann force-pushed the indep-table-save branch from 3578760 to 5feb26c Compare

July 1, 2022 03:39

rhdong and others added 4 commits

July 1, 2022 15:01


          [fix] CI fail for protobuf update.

78c4e5d

- Also include the horovod compile fail on macOS(They was caused by the same reason)


          [fix] Fixed segment fault due to lambda may capture the thread_context reference with error address when high concurrency and server disconnection.

35cf2ef


          [refactor] clean some warnings

d262b69


          fix(comment): Fix error in comment.

ac8ec80

Lifann and others added 3 commits

July 1, 2022 15:01


          Bugfix: rehashing in GPU hashtable is not enough when meeting large insert

097bd21


          Add ops of ExportToFile and ImportFromFile without full volume copying

c9acc63


          [fix] stopping the Bazel in dev docker to update automatically .

b374514

Lifann force-pushed the indep-table-save branch from ce1b182 to b374514 Compare

July 1, 2022 07:02

MoFHeka reviewed

...ommenders_addons/dynamic_embedding/core/kernels/redis_impl/redis_cluster_connection_pool.hpp

@@ -172,8 +172,6 @@ class RedisWrapper<RedisInstance, K, V,
                     } catch (const std::exception &err) {
                       LOG(ERROR) << "RedisHandler error in PipeExecRead for slices "
                                  << hkey.data() << " -- " << err.what();
-                      error_ptr = std::current_exception();

This comment was marked as resolved.

Contributor

PWZER commented Mar 13, 2023

This is a good solution, and we want to solve this problem too. When can it be merged?

Collaborator

MoFHeka commented Mar 14, 2023

Try this: https://github.com/tensorflow/recommenders-addons/blob/master/docs/api_docs/tfra/dynamic_embedding/FileSystemSaver.md
