Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Load FOF catalogues & Cosmology Metadata Changes #190

Merged
merged 39 commits into from
Sep 19, 2024
Merged

Conversation

robjmcgibbon
Copy link
Collaborator

Allow swiftsimio to load FOF files. I've made the SWIFTDataset class more generic, so it now has a filetype attribute (currently can be snapshot or FOF). Instead of storing an int for each particle type it now saves the names of the groups within the hdf5 that we want to load ('PartType{i}' for snapshots, and 'Groups' for fof files).

This seems to work for loading both snapshots and fof files. I'll have a look at the tests next.

Comment on lines -7 to +33
0: "gas",
1: "dark_matter",
2: "boundary",
3: "sinks",
4: "stars",
5: "black_holes",
6: "neutrinos",
"PartType0": "gas",
"PartType1": "dark_matter",
"PartType2": "boundary",
"PartType3": "sinks",
"PartType4": "stars",
"PartType5": "black_holes",
"PartType6": "neutrinos",
}

particle_name_class = {
0: "Gas",
1: "DarkMatter",
2: "Boundary",
3: "Sinks",
4: "Stars",
5: "BlackHoles",
6: "Neutrinos",
"PartType0": "Gas",
"PartType1": "DarkMatter",
"PartType2": "Boundary",
"PartType3": "Sinks",
"PartType4": "Stars",
"PartType5": "BlackHoles",
"PartType6": "Neutrinos",
}

particle_name_text = {
0: "Gas",
1: "Dark Matter",
2: "Boundary",
3: "Sinks",
4: "Stars",
5: "Black Holes",
6: "Neutrinos",
"PartType0": "Gas",
"PartType1": "Dark Matter",
"PartType2": "Boundary",
"PartType3": "Sinks",
"PartType4": "Stars",
"PartType5": "Black Holes",
"PartType6": "Neutrinos",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you change this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed present_particle_types to return "PartType{i}" rather than integers

return [f"PartType{i}" for i in types]

So I could either change this file, or else I'd have to convert back to an integer when getting the particle name

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we not able to remove these for modern snapshots? As they contain the particle type names and fields named appropriately..? I'd like to use those so we don't have to add/handle additional or renamed particle types manually.

@robjmcgibbon
Copy link
Collaborator Author

I'm now loading SOAP catalogues as well. The groups which we want to load are stored in the SOAP metadata. A slight complication is that we have subgroups, i.e. the 200c halos properties are stored in the group /SO/200_crit, and the x projection of the 100kpc aperture are stored in the group /ProjectedAperture/100kpc/projx. It would be nice to try and preserve this structure so we would access the group with projected_aperture.100kpc.proj_x, but I don't know of any way in python to have attributes that start with numbers, and I don't want to go with projected_aperture.kpc100

Masking seems to work for the SOAP catalogues. The FOF catalogues aren't sorted so we can't apply masks to them.

@robjmcgibbon
Copy link
Collaborator Author

SOAP contains some additional metadata for each dataset that we need to read in.

  • SOAP files are first generated and then compressed by a separate script. This means we save the type of compression to apply to the dataset, and whether the compression has been applied.
  • We want all the arrays in SOAP to be the same length (the number of subhalos). However, we don't calculate all quantities for all objects, e.g. we have some quantities that are only calculated for subhalos with 100 bound gas particles. This means the arrays can have a large number of zeros. I'm now reading in the mask applied and adding that information to the array description.

@kyleaoman
Copy link
Member

Hi Rob, little thing that I noticed trying to experiment a bit with this, if I install normally (pip install swiftsimio/) I get:

>>> import swiftsimio as sw
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/cosma/home/durham/dc-oman1/.virtualenv/ssio/lib/python3.12/site-packages/swiftsimio/__init__.py", line 1, in <module>
    from .reader import *
  File "/cosma/home/durham/dc-oman1/.virtualenv/ssio/lib/python3.12/site-packages/swiftsimio/reader.py", line 14, in <module>
    from swiftsimio import metadata
  File "/cosma/home/durham/dc-oman1/.virtualenv/ssio/lib/python3.12/site-packages/swiftsimio/metadata/__init__.py", line 9, in <module>
    from .soap import soap_types
ModuleNotFoundError: No module named 'swiftsimio.metadata.soap'

If I install via link (pip install -e swiftsimio/), I don't have this problem. Should just be a matter of including in the pyproject.toml file, I think.

@JBorrow
Copy link
Member

JBorrow commented Jul 18, 2024

Gotta add the submodule to the list in the pyproject.toml

@kyleaoman
Copy link
Member

When you get to the stage of looking at the masking, could I make a request? I know that it was planned to be able to apply spatial masks like we do with particle data. Could we also have an option to read a user-defined slice of the subhalo table (might have to handle subhalo and fof tables separately? can't remember the structure in SOAP hdf5 files and the sample seems to have moved/vanished). It's possible that there's nothing extra to be done since with a spatial_only=False mask you can do virtually anything, just want to flag it as something to check. My use case is that I want to be able to specify that I'm interested in subhalo 12345 (using a mask makes sense for this), and then when I do:

data = sw.load(...)
data.bound_subhalo.enclose_radius  # or whatever other field

I would like to read only the single row of interest. I said slice above because there could be use cases (hypothetically) to read a range or other slice, and it's the same amount of work to implement this as masking to a single row. Like I said I don't think much is needed because the masking tools just about support this already, but I guess the code needs to understand what's an array of subhalo data vs fof data vs anything else.

@robjmcgibbon
Copy link
Collaborator Author

When you get to the stage of looking at the masking, could I make a request? I know that it was planned to be able to apply spatial masks like we do with particle data. Could we also have an option to read a user-defined slice of the subhalo table (might have to handle subhalo and fof tables separately? can't remember the structure in SOAP hdf5 files and the sample seems to have moved/vanished). It's possible that there's nothing extra to be done since with a spatial_only=False mask you can do virtually anything, just want to flag it as something to check. My use case is that I want to be able to specify that I'm interested in subhalo 12345 (using a mask makes sense for this), and then when I do:

data = sw.load(...)
data.bound_subhalo.enclose_radius  # or whatever other field

I would like to read only the single row of interest. I said slice above because there could be use cases (hypothetically) to read a range or other slice, and it's the same amount of work to implement this as masking to a single row. Like I said I don't think much is needed because the masking tools just about support this already, but I guess the code needs to understand what's an array of subhalo data vs fof data vs anything else.

Sorry about moving the example catalogue, they're now at /cosma8/data/dp004/dc-mcgi1/FLAMINGO/Runs/L1000N1800/DMO_FIDUCIAL/SOAP/HBTplus

I can make a new method for SWIFTMask called something like constrain_index. I don't think it should matter if it's a FOF/SOAP/snapshot file, they should all work the same way.

@kyleaoman
Copy link
Member

Makes sense. In EAGLE we had /FOF/... datasets and /Subhalo/... datasets coexisting in the same file, couldn't remember offhand if this was the case in SOAP, which would require a bit more logic, but looking now, it's not, so it should indeed be a minimal addition. Thanks!

@robjmcgibbon
Copy link
Collaborator Author

@JBorrow update on the current status of this in case you have time to look it over before the meeting next week

This change adds an attribute to the SWIFTMetadata class called "filetype" which can be either "snapshot", "FOF", or "SOAP". I haven't changed much when it comes to reading in the files, everything has just been made more generic so instead of requiring the data to be in groups named "PartTypeX", the groups can have any names. Masking appears to be working for the "SOAP" filetype, it's not supported for "FOF" since SWIFT does not output spatially sorted catalogues.

SWIFT has added two new attributes to it's units metadata - one to indicate if a value is stored as physical, another to indicate if a value can be transformed between comoving and physical. This was motivated by quantities such as BirthDensities, which it doesn't make sense to convert. These attributes are now read in when reading a file, and the cosmo_array has a valid_transform attribute, which throws an error if someones tries to convert something they shouldn't.

Things to be done:

  • Add tests for the "FOF" and "SOAP" filetypes
  • Add documentation for the "FOF" and "SOAP" filetypes
  • Implement a writer for "FOF" and "SOAP" (not sure this is needed, other than being helpful for tests)
  • Decide what to do with swiftsimio functionality that only make sense for snapshots (e.g. visualisation)

@JBorrow
Copy link
Member

JBorrow commented Sep 12, 2024

I am not sure that there's any functionality that only makes sense for snapshots. Visualisation is still useful with halo catalogues; for instance I use that in a hacky way right now to create cross spectra with haloes...

Copy link
Member

@JBorrow JBorrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would really like to see a comprehensive test suite here to make sure we're not missing anything.

if filetype == "snapshot":
return SWIFTSnapshotWriter(*args, **kwargs)
# TODO implement other writers
# elif filetype == '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it likely that anyone will want to write their own catalogue files with swiftsimio? Ideally we would just leave this feature alone until we are ready to make that change.

Given that people have to choose their 'file type' anyway, why don't we just make them choose the class they want to instantiate? We don't need to abstract this away.

@@ -128,7 +139,7 @@ def _unpack_cell_metadata(self):

def constrain_mask(
self,
ptype: str,
group_name: str,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, but in general I don't like _name as an addition to the variable name. Given that we know its type is string, group is obviously a name.

Comment on lines -7 to +33
0: "gas",
1: "dark_matter",
2: "boundary",
3: "sinks",
4: "stars",
5: "black_holes",
6: "neutrinos",
"PartType0": "gas",
"PartType1": "dark_matter",
"PartType2": "boundary",
"PartType3": "sinks",
"PartType4": "stars",
"PartType5": "black_holes",
"PartType6": "neutrinos",
}

particle_name_class = {
0: "Gas",
1: "DarkMatter",
2: "Boundary",
3: "Sinks",
4: "Stars",
5: "BlackHoles",
6: "Neutrinos",
"PartType0": "Gas",
"PartType1": "DarkMatter",
"PartType2": "Boundary",
"PartType3": "Sinks",
"PartType4": "Stars",
"PartType5": "BlackHoles",
"PartType6": "Neutrinos",
}

particle_name_text = {
0: "Gas",
1: "Dark Matter",
2: "Boundary",
3: "Sinks",
4: "Stars",
5: "Black Holes",
6: "Neutrinos",
"PartType0": "Gas",
"PartType1": "Dark Matter",
"PartType2": "Boundary",
"PartType3": "Sinks",
"PartType4": "Stars",
"PartType5": "Black Holes",
"PartType6": "Neutrinos",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we not able to remove these for modern snapshots? As they contain the particle type names and fields named appropriately..? I'd like to use those so we don't have to add/handle additional or renamed particle types manually.

Comment on lines +5 to +18
# Describes the conversion of hdf5 groups to names
def get_soap_name_underscore(group: str) -> str:
soap_name_underscores = {
"BoundSubhalo": "bound_subhalo",
"InputHalos": "input_halos",
"InclusiveSphere": "inclusive_sphere",
"ExclusiveSphere": "exclusive_sphere",
"SO": "spherical_overdensity",
"SOAP": "soap",
"ProjectedAperture": "projected_aperture",
}
split_name = group.split("/")
split_name[0] = soap_name_underscores[split_name[0]]
return "_".join(name.lower() for name in split_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we do this and not handle the conversion in the same way as field names?

"""

# TODO:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is TODO

Comment on lines 1026 to 1031
try:
# SOAP catalogues can be compressed/uncompressed
is_compressed = dataset.attrs["Is Compressed"]
except KeyError:
is_compressed = True
try:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_compressed = dataset.attrs.get("Is Compressed", True)

Comment on lines 1033 to 1038
except AttributeError:
# Compression is saved as str not bytes
comp = dataset.attrs["Lossy compression filter"]
except KeyError:
# Can't load compression string!
comp = "No compression info available"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Comment on lines 1078 to 1082
try:
physical = dataset.attrs["Value stored as physical"][0] == 1
except:
physical = False
return physical
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't like this at all; never have non-specific exceptions. Can you do:

return bool(dataset.attrs.get("Value stored as physical", [1])[0])

Comment on lines 1095 to 1102
def get_valid_transform(dataset):
try:
valid_transform = (
dataset.attrs["Property can be converted to comoving"][0] == 1
)
except:
valid_transform = True
return valid_transform
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same concern

Comment on lines 1516 to 1521
if group_metadata.metadata.filetype == "snapshot":
group_nice_name = metadata.particle_types.particle_name_class[group]
elif group_metadata.metadata.filetype == "FOF":
group_nice_name = "FOFGroups"
elif group_metadata.metadata.filetype == "SOAP":
group_nice_name = metadata.soap_types.get_soap_name_nice(group)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why if we wrote all this code are we doing this three different ways for three catalogues that we wrote..?

@JBorrow JBorrow changed the title Load FOF catalogues Load FOF catalogues & Cosmology Metadata Changes Sep 19, 2024
@JBorrow JBorrow marked this pull request as ready for review September 19, 2024 08:34
@JBorrow JBorrow merged commit ac05b8d into master Sep 19, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants