
Bugs using keras.src.utils.split_dataset on tf.data.Dataset loaded using tf.data.experimental.make_csv_dataset on versions v3.4.0+ #20538

Open
sfenu-3 opened this issue Nov 22, 2024 · 6 comments

@sfenu-3

sfenu-3 commented Nov 22, 2024

Bug description:

We've noticed two bugs that appear when using `split_dataset` on tf.data datasets loaded with `tf.data.experimental.make_csv_dataset`, for Keras versions 3.4.0 onward. On calling `split_dataset` on such a dataset, one of two things happens: either the `split_dataset` call hangs indefinitely, or the output train and test data have their column names shuffled.

Tested Keras versions: 3.5.0, 3.6.0. Tested TensorFlow version: 2.18.

Steps to reproduce:

```python
from keras.src.utils import split_dataset
import tensorflow as tf
import pandas as pd

data_dict = {
    'a': [1.] * 10,
    'b': [20.] * 10,
    'c': [300.] * 10,
    'd': [4000.] * 10,
}

df = pd.DataFrame(data_dict)

valid_dataset = tf.data.Dataset.from_tensor_slices(dict(df))
print("Dataframe dataset sample: ", [e for e in valid_dataset.take(1)])
train, test = split_dataset(valid_dataset, left_size=0.5, seed=1)
print("Train dataset sample: ", [e for e in train.take(1)])

df.to_csv('bug_report_test_data.csv', index=False)

invalid_dataset = tf.data.experimental.make_csv_dataset('bug_report_test_data.csv', batch_size=1)
print("CSV dataset sample: ", [e for e in invalid_dataset.take(1)])
train, test = split_dataset(invalid_dataset, left_size=0.5, seed=1)
print("Train dataset sample: ", [e for e in train.take(1)])
```
In the first case, `split_dataset` works as expected. In the second case, the `split_dataset` call will either hang indefinitely or the column names will get reassigned, e.g. `{'d': [1.], 'b': [300.], 'c': [4000.], 'a': [20.]}`.

Reverting the function `_restore_dataset_from_list` in `keras.src.utils.dataset_utils` back to its version 3.3.3 implementation resolves the issue.

@fchollet
Member

I can reproduce the issue. @hertschuh, this appears to be related to the fix you provided at some point to make `split_dataset` work with deeply nested structures. Maybe we should introduce two separate cases, deeply nested vs. not, and handle each case separately.

@sibyjackgrove

@fchollet Thank you for looking into the issue. I face the same issue with Keras 3.6 when using `split_dataset` on tf.data dict datasets.

@hertschuh
Collaborator

@sfenu-3

This

```python
invalid_dataset = tf.data.experimental.make_csv_dataset('bug_report_test_data.csv', batch_size=1)
```

hangs for a very simple reason, which is that it generates an infinitely looping dataset by default.
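
The hang can be illustrated with a plain-Python analogue (an assumption about the mechanism, not Keras's actual code): with the default `num_epochs=None`, `make_csv_dataset` repeats the file forever, so any operation that must exhaust the dataset, as `split_dataset` does to materialize the splits, never returns.

```python
import itertools

rows = [1.0, 20.0, 300.0, 4000.0]

# Analogue of num_epochs=None: the data repeats forever.
infinite = itertools.cycle(rows)

# Analogue of num_epochs=1: a single pass over the file.
finite = iter(rows)

# Exhausting the finite iterator terminates normally...
assert list(finite) == [1.0, 20.0, 300.0, 4000.0]

# ...but list(infinite) would never return. A bounded slice shows
# the repetition instead:
print(list(itertools.islice(infinite, 6)))  # [1.0, 20.0, 300.0, 4000.0, 1.0, 20.0]
```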

If you replace with:

```python
invalid_dataset = tf.data.experimental.make_csv_dataset('bug_report_test_data.csv', batch_size=1, shuffle=False, num_epochs=1)
```

Then it works as expected. I checked with Keras 3.3.3 and it hangs in the same way.

> or the column names will get reassigned like `{'d': [1.], 'b': [300.], 'c': [4000.], 'a': [20.]}`

Do you have an example of that?

@hertschuh
Collaborator

> @fchollet Thank you for looking into the issue. I face the same issue with Keras 3.6 when using `split_dataset` on tf.data dict datasets.

Which issue? The hanging or the key / value mismatch?

Do you have an example?

Thanks!

@sibyjackgrove

@hertschuh I was referring to the key/value mismatch. I am creating my dataset using:

```python
dataset = tf.data.experimental.make_csv_dataset(
    file_pattern=csv_filepattern,
    batch_size=1,
    select_columns=columns_selected,
    header=header,
    num_epochs=1,
    shuffle=False,
    num_rows_for_inference=1000,
)
train_dataset, test_dataset = keras.utils.split_dataset(dataset, left_size=train_fraction)
```

It works fine with Keras 3.3.3 but produces the key/value mismatch with Keras 3.6.

@hertschuh
Collaborator

The root cause is that our tree utils have a bug with `OrderedDict`, which is what I'm in the process of fixing in #20481.

The bug didn't appear in 3.3.3 and earlier because we were not using tree until the refactor in #19911.
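
The key/value scramble can be mimicked in plain Python (a hypothetical sketch of the failure mode, not the actual Keras tree code): if a structure is flattened in one key order but repacked in another, values land under the wrong keys. `make_csv_dataset` yields an `OrderedDict` keyed in column order, which need not be alphabetical.

```python
from collections import OrderedDict

def flatten_in_insertion_order(d):
    # Hypothetical flatten: values in the dict's insertion order.
    return list(d.values())

def pack_by_sorted_keys(template, flat_values):
    # Hypothetical repack: keys sorted alphabetically, as tree utils
    # conventionally do for plain dicts.
    return OrderedDict(zip(sorted(template), flat_values))

# CSV columns arrive in file order, which need not be alphabetical.
features = OrderedDict([("b", 20.0), ("d", 4000.0), ("a", 1.0), ("c", 300.0)])

flat = flatten_in_insertion_order(features)   # [20.0, 4000.0, 1.0, 300.0]
rebuilt = pack_by_sorted_keys(features, flat)

print(rebuilt["a"])  # 20.0 -- but 'a' was 1.0: the values got reassigned
```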

@hertschuh self-assigned this Nov 26, 2024