
Bugs using keras.src.utils.split_dataset on tf.data.Dataset loaded using tf.data.experimental.make_csv_dataset on versions v3.4.0+ #20538

Open
sfenu-3 opened this issue Nov 22, 2024 · 6 comments

@sfenu-3

sfenu-3 commented Nov 22, 2024

Bug description:

We've noticed two bugs that appear when using `split_dataset` on tf.data datasets loaded with `tf.data.experimental.make_csv_dataset`, for Keras versions 3.4.0 onward. On calling `split_dataset` on such a dataset, one of two things happens: either the `split_dataset` call hangs indefinitely, or the output train and test data have their column names shuffled.

Tested Keras versions: 3.5.0, 3.6.0. Tested TensorFlow version: 2.18.

Steps to reproduce:

```python
from keras.src.utils import split_dataset
import tensorflow as tf
import pandas as pd

data_dict = {
    'a': [1.] * 10,
    'b': [20.] * 10,
    'c': [300.] * 10,
    'd': [4000.] * 10,
}

df = pd.DataFrame(data_dict)

valid_dataset = tf.data.Dataset.from_tensor_slices(dict(df))
print("Dataframe dataset sample: ", [e for e in valid_dataset.take(1)])
train, test = split_dataset(valid_dataset, left_size=0.5, seed=1)
print("Train dataset sample: ", [e for e in train.take(1)])

df.to_csv('bug_report_test_data.csv', index=False)

invalid_dataset = tf.data.experimental.make_csv_dataset('bug_report_test_data.csv', batch_size=1)
print("CSV dataset sample: ", [e for e in invalid_dataset.take(1)])
train, test = split_dataset(invalid_dataset, left_size=0.5, seed=1)
print("Train dataset sample: ", [e for e in train.take(1)])
```
In the first case, `split_dataset` works as expected. In the second case, the `split_dataset` call will either hang indefinitely or the column names will get reassigned, e.g. `{'d': [1.], 'b': [300.], 'c': [4000.], 'a': [20.]}`.

Reverting the function `_restore_dataset_from_list` in `keras.src.utils.dataset_utils` back to its version 3.3.3 implementation resolves the issue.

@fchollet
Member

I can reproduce the issue. @hertschuh, this appears to be related to the fix you provided at some point to make `split_dataset` work with deeply nested structures. Maybe we should introduce two separate cases, deeply nested vs. not, and handle each case separately.

@sibyjackgrove

@fchollet Thank you for looking into the issue. I face the same issue with Keras 3.6 when using `split_dataset` on tf.data dict datasets.

@hertschuh
Collaborator

@sfenu-3

This

```python
invalid_dataset = tf.data.experimental.make_csv_dataset('bug_report_test_data.csv', batch_size=1)
```

hangs for a very simple reason, which is that it generates an infinitely looping dataset by default.
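
The hang can be illustrated with a plain-Python analogue (an assumption about the mechanism, not Keras's actual code): with the default `num_epochs=None`, `make_csv_dataset` repeats the file forever, so any operation that must exhaust the dataset, as `split_dataset` does to materialize the splits, never returns.

```python
import itertools

rows = [1.0, 20.0, 300.0, 4000.0]

# Analogue of num_epochs=None: the data repeats forever.
infinite = itertools.cycle(rows)

# Analogue of num_epochs=1: a single pass over the file.
finite = iter(rows)

# Exhausting the finite iterator terminates normally...
assert list(finite) == [1.0, 20.0, 300.0, 4000.0]

# ...but list(infinite) would never return. A bounded slice shows
# the repetition instead:
print(list(itertools.islice(infinite, 6)))  # [1.0, 20.0, 300.0, 4000.0, 1.0, 20.0]
```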

If you replace with:

```python
invalid_dataset = tf.data.experimental.make_csv_dataset('bug_report_test_data.csv', batch_size=1, shuffle=False, num_epochs=1)
```

Then it works as expected. I checked with Keras 3.3.3 and it hangs in the same way.

> or the column names will get reassigned like `{'d': [1.], 'b': [300.], 'c': [4000.], 'a': [20.]}`

Do you have an example of that?

@hertschuh
Collaborator

> @fchollet Thank you for looking into the issue. I face the same issue with Keras 3.6 when using `split_dataset` on tf.data dict datasets.

Which issue? The hanging or the key / value mismatch?

Do you have an example?

Thanks!

@sibyjackgrove

@hertschuh I was referring to the key/value mismatch. I am creating my dataset using:

```python
dataset = tf.data.experimental.make_csv_dataset(
    file_pattern=csv_filepattern,
    batch_size=1,
    select_columns=columns_selected,
    header=header,
    num_epochs=1,
    shuffle=False,
    num_rows_for_inference=1000,
)
train_dataset, test_dataset = keras.utils.split_dataset(dataset, left_size=train_fraction)
```

It works fine with Keras 3.3.3 but produces the key/value mismatch with Keras 3.6.

@hertschuh
Collaborator

The root cause is that our tree utils have a bug with `OrderedDict`, which is what I'm in the process of fixing in #20481.

The bug didn't appear in 3.3.3 and earlier because we were not using tree until the refactor in #19911.
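
The key/value scramble can be mimicked in plain Python (a hypothetical sketch of the failure mode, not the actual Keras tree code): if a structure is flattened in one key order but repacked in another, values land under the wrong keys. `make_csv_dataset` yields an `OrderedDict` keyed in column order, which need not be alphabetical.

```python
from collections import OrderedDict

def flatten_in_insertion_order(d):
    # Hypothetical flatten: values in the dict's insertion order.
    return list(d.values())

def pack_by_sorted_keys(template, flat_values):
    # Hypothetical repack: keys sorted alphabetically, as tree utils
    # conventionally do for plain dicts.
    return OrderedDict(zip(sorted(template), flat_values))

# CSV columns arrive in file order, which need not be alphabetical.
features = OrderedDict([("b", 20.0), ("d", 4000.0), ("a", 1.0), ("c", 300.0)])

flat = flatten_in_insertion_order(features)   # [20.0, 4000.0, 1.0, 300.0]
rebuilt = pack_by_sorted_keys(features, flat)

print(rebuilt["a"])  # 20.0 -- but 'a' was 1.0: the values got reassigned
```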

@hertschuh self-assigned this Nov 26, 2024