Shuffling can cause (tiny) information loss #564

Closed
timcallow opened this issue Aug 2, 2024 · 1 comment · Fixed by #607
@timcallow (Contributor)

I was looking into issue #509 and noticed a slightly weird feature of the data shuffling function.

Essentially, the shuffled snapshots are created under the assumption that the total number of grid points (summed across all original snapshots) divided by the number of shuffled snapshots is an integer (lines 524-546 of data_shuffler.py); the dimensions of the shuffled snapshots are based on this quotient.

However, when the shuffling is actually performed, each shuffled snapshot is populated by taking (number of grid points per original snapshot) / (number of shuffled snapshots) points from each original snapshot (lines 157-179). This is not necessarily an integer: consider, for example, building 3 shuffled snapshots from snapshots on a 200x200x200 grid.
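To make the mismatch concrete, here is a back-of-the-envelope sketch (plain Python, not the actual data_shuffler.py code; the choice of 3 original snapshots and the variable names are purely for illustration):

```python
# Minimal sketch of the two calculations; assumes 3 original snapshots for illustration.
points_per_snapshot = 200 * 200 * 200      # 8,000,000 grid points per original snapshot
n_original = 3
n_shuffled = 3

# Dimensioning step (cf. lines 524-546): total points / number of shuffled
# snapshots -- an integer by construction.
total_points = n_original * points_per_snapshot
shuffled_size = total_points // n_shuffled                      # 8,000,000 rows per shuffled snapshot

# Population step (cf. lines 157-179): each original snapshot contributes
# points_per_snapshot / n_shuffled rows, truncated to an integer.
per_snapshot_contribution = points_per_snapshot // n_shuffled   # 2,666,666

rows_filled = n_original * per_snapshot_contribution            # 7,999,998
print(shuffled_size - rows_filled)   # 2 -> two rows per shuffled snapshot are never written
```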

The result is that, if the number of shuffled snapshots does not divide each of the original snapshot grid sizes, some of the vectors in the final shuffled data (a very small number) are left completely zero. Here is one such example:

[Screenshot from 2024-08-02 showing zero vectors in the shuffled data]

Given the typical grid sizes we work with, I very much doubt that a handful of zero vectors will affect the neural network training. But I have a couple of questions:

  • Is this a known feature?
  • Is it necessary to take exactly 1/nth of each original snapshot for each of the final mixed snapshots? Given the grid sizes we work with, there would be statistically no difference if we first concatenated the input vectors into a single big array and then simply shuffled that.

In my opinion, it would be better to do as described above: it simplifies the code, ensures there are no stray zero vectors in the final training set, and would make the solution to #509 easier.
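For what it's worth, a minimal numpy sketch of what I have in mind (function and array names are made up, and this version keeps everything in memory):

```python
import numpy as np

def shuffle_by_concatenation(snapshots, n_shuffled, seed=None):
    """snapshots: list of (n_points_i, n_features) arrays."""
    rng = np.random.default_rng(seed)
    # Concatenate everything, apply one global permutation, then split.
    data = np.concatenate(snapshots, axis=0)
    data = data[rng.permutation(data.shape[0])]
    # np.array_split handles totals that are not exact multiples of n_shuffled,
    # so no rows ever need to be padded with zeros.
    return np.array_split(data, n_shuffled, axis=0)
```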

What do you think @RandomDefaultUser? Am I missing something here?

@RandomDefaultUser (Member)

Hi @timcallow, I believe the original reason for taking exactly 1/nth is that the shuffling is realized via numpy memmaps. We don't load all the snapshots into memory, because that would be quite an overhead; instead we load 1/nth of each of the n snapshots, i.e. one snapshot's worth of data at a time. As you have explained, that quite clearly leads to a problem.
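Roughly, the idea looks like this (only a sketch to illustrate the approach, not the actual implementation; file names and shapes are made up):

```python
import numpy as np

n_shuffled = 3
snapshot_files = ["snapshot0.npy", "snapshot1.npy", "snapshot2.npy"]

for i in range(n_shuffled):
    chunks = []
    for fname in snapshot_files:
        data = np.load(fname, mmap_mode="r")        # memory-mapped, not read into RAM yet
        slice_len = data.shape[0] // n_shuffled     # truncated if not divisible -> zero rows later
        chunks.append(np.asarray(data[i * slice_len:(i + 1) * slice_len]))
    # At this point only about one snapshot's worth of data is in memory.
    shuffled_snapshot = np.concatenate(chunks, axis=0)
    # ... shuffle shuffled_snapshot and write it to disk ...
```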
