Bugs using keras.src.utils.split_dataset on tf.data.Dataset loaded using tf.data.experimental.make_csv_dataset on versions v3.4.0+ #20538
Comments
I can reproduce the issue. @hertschuh, this appears to be related to the fix you provided at some point to use |
@fchollet Thank you for looking into the issue. I face the same issue with Keras 3.6 when using |
This `invalid_dataset = tf.data.experimental.make_csv_dataset('bug_report_test_data.csv', batch_size=1)` hangs for a very simple reason: by default, it generates an infinitely looping dataset. If you replace it with `invalid_dataset = tf.data.experimental.make_csv_dataset('bug_report_test_data.csv', batch_size=1, shuffle=False, num_epochs=1)`, then it works as expected. I checked with Keras 3.3.3 and it hangs in the same way.
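As a rough illustration of the point above, a dataset that repeats forever can be caught before it is handed to a function that has to iterate over all of it. The helper name `assert_finite` is hypothetical and not part of Keras or TensorFlow. The sketch uses `tf.data.Dataset.range(...).repeat()` as a stand-in for a looping dataset; whether the default `make_csv_dataset` output reports an infinite (rather than unknown) cardinality is an assumption that was not verified here.

```python
import tensorflow as tf


def assert_finite(dataset: tf.data.Dataset) -> tf.data.Dataset:
    """Hypothetical guard: raise early instead of iterating forever.

    Datasets whose cardinality is unknown pass this check, so it only
    catches datasets that TensorFlow itself reports as infinite.
    """
    if dataset.cardinality().numpy() == tf.data.INFINITE_CARDINALITY:
        raise ValueError(
            "Dataset repeats indefinitely; build it with a finite number "
            "of epochs (for example num_epochs=1) before splitting."
        )
    return dataset


# Stand-ins for a finite dataset and an endlessly repeating one.
finite = tf.data.Dataset.range(10)
endless = tf.data.Dataset.range(10).repeat()

assert_finite(finite)  # passes silently
try:
    assert_finite(endless)
except ValueError as err:
    print(err)
```

A check like this only turns a silent hang into an explicit error; the actual remedy is still to build the dataset with `num_epochs=1`, as described in the comment.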
Do you have an example of that?
Which issue? The hanging or the key / value mismatch? Do you have an example? Thanks!
@hertschuh I was referring to the key/value mismatch. I am creating my dataset using:
It works fine with Keras 3.3.3 but produces a key/value mismatch with Keras 3.6.
Bug description:
We've noticed two bugs that appear when using split_dataset on TensorFlow datasets loaded with tf.data.experimental.make_csv_dataset, for Keras versions 3.4.0 onward. When split_dataset is called on such a dataset, one of two things happens: either the call hangs indefinitely, or the output train and test data have their column names shuffled.
Tested Keras versions: 3.5.0, 3.6.0. Tested TensorFlow version: 2.18.
Steps to reproduce:
```python
from keras.src.utils import split_dataset
import tensorflow as tf
import pandas as pd

data_dict = {
    'a': [1.] * 10,
    'b': [20.] * 10,
    'c': [300.] * 10,
    'd': [4000.] * 10
}
df = pd.DataFrame(data_dict)

# Case 1: dataset built directly from the DataFrame (works as expected).
valid_dataset = tf.data.Dataset.from_tensor_slices(dict(df))
print("Dataframe dataset sample: ", [e for e in valid_dataset.take(1)])
train, test = split_dataset(valid_dataset, left_size=0.5, seed=1)
print("Train dataset sample: ", [e for e in train.take(1)])

# Case 2: dataset loaded from CSV (hangs or shuffles the column names).
df.to_csv('bug_report_test_data.csv', index=False)
invalid_dataset = tf.data.experimental.make_csv_dataset('bug_report_test_data.csv', batch_size=1)
print("CSV dataset sample: ", [e for e in invalid_dataset.take(1)])
train, test = split_dataset(invalid_dataset, left_size=0.5, seed=1)
print("Train dataset sample: ", [e for e in train.take(1)])
```
In the first case, split_dataset works as expected. In the second case, the split_dataset call either hangs indefinitely or returns data whose column names have been reassigned, for example {'d': [1], 'b': [300], 'c': [4000], 'a': [20]}.
Reverting the function _restore_dataset_from_list in keras.src.utils.dataset_utils to its version 3.3.3 implementation resolves the issue.
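To make the column-name mismatch easier to spot, a small check can compare the values seen under each key before and after the split. This is only a diagnostic sketch in plain Python: the function name `find_column_mismatches` is hypothetical, rows are assumed to be dictionaries mapping column names to scalar values, and the sample rows simply mirror the values from this report.

```python
from typing import Dict, Iterable, List, Mapping, Set


def find_column_mismatches(
    original_rows: Iterable[Mapping[str, float]],
    split_rows: Iterable[Mapping[str, float]],
) -> List[str]:
    """Return the keys whose values in split_rows never occurred under
    the same key in original_rows (hypothetical diagnostic helper)."""
    # Collect every value observed for each column in the original data.
    seen: Dict[str, Set[float]] = {}
    for row in original_rows:
        for key, value in row.items():
            seen.setdefault(key, set()).add(float(value))

    # Flag columns of the split whose values are foreign to that column.
    mismatched: Set[str] = set()
    for row in split_rows:
        for key, value in row.items():
            if float(value) not in seen.get(key, set()):
                mismatched.add(key)
    return sorted(mismatched)


original = [{"a": 1.0, "b": 20.0, "c": 300.0, "d": 4000.0}] * 10
good_split = original[:5]
bad_split = [{"d": 1.0, "b": 300.0, "c": 4000.0, "a": 20.0}]

print(find_column_mismatches(original, good_split))  # []
print(find_column_mismatches(original, bad_split))   # ['a', 'b', 'c', 'd']
```

With a tf.data.Dataset, rows of this shape could probably be obtained from `dataset.unbatch().as_numpy_iterator()`, assuming the dataset is finite and each element is a dictionary of scalars.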