Saving Event Based Data #216
To give more context on handling sparse data in Python, there are two main libraries that can be used:
In rosettasciio, there is precedent for reading event-based data in the Velox EDS and JEOL EDS readers. For the Velox EDS data, the stream of events is converted to a COO sparse array (pydata-sparse). As there is no support for sparse arrays in HyperSpy, we use the pragmatic approach of converting the sparse array to a dense array for processing, and this can be done lazily, though there are some caveats for the Velox EDS case. My general impressions are:
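As an illustration of that pragmatic approach (a sketch with hypothetical event columns and sizes, not RosettaSciIO's actual reader code): build a pydata-sparse COO array from an event list, then densify it chunk by chunk with Dask so the full dense array never has to exist in memory at once.

```python
import numpy as np
import sparse
import dask.array as da

# Hypothetical event list: one (frame, y, x) triple per detected electron.
rng = np.random.default_rng(0)
n_events, n_frames, h, w = 10_000, 64, 256, 256
frames = rng.integers(0, n_frames, n_events)
ys = rng.integers(0, h, n_events)
xs = rng.integers(0, w, n_events)

# Build a COO array; duplicate events at the same coordinate are summed.
events = sparse.COO(
    coords=np.stack([frames, ys, xs]),
    data=np.ones(n_events, dtype=np.uint32),
    shape=(n_frames, h, w),
)

# Wrap in Dask and densify lazily, a few frames per chunk.
lazy = da.from_array(events, chunks=(8, h, w))
dense_lazy = lazy.map_blocks(lambda b: b.todense(), dtype=np.uint32)
print(dense_lazy.sum().compute())  # total electron count == n_events
```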
A pragmatic approach would be to have some reader/tooling in rosettasciio that can return a sparse array (starting with the COO format?). This would allow people to start using and experimenting with it. Tagging some more people who may be interested and/or have experience on this topic: @kociak, @matkraj.
Hi, in LiberTEM since 0.11 we have support for a range of sparse and dense array backends: https://libertem.github.io/LiberTEM/udf/advanced.html#sparse. Under the hood, https://github.com/LiberTEM/sparseconverter/ takes care of detecting backends and converting data between them.

For 4D STEM, and likely also spectroscopic data, the CSR format is much better than COO since it allows O(1) slicing of the navigation dimension, while canonical (sorted) COO is O(log(n)) and unsorted COO requires a full scan. In LiberTEM we've implemented a simple raw format with a TOML descriptor and three raw binary files for CSR. You can already create lazy signals from LiberTEM datasets with both sparse and dense array chunks: https://github.com/LiberTEM/LiberTEM/blob/master/examples/sparse_udf.ipynb, section "Dask and HyperSpy integration". Since parallelization and chunking depend on slicing, I would strongly recommend storing data as CSR and not COO.

By using LiberTEM as a file reader (https://libertem.github.io/LiberTEM/dask.html#load-datasets-as-dask-arrays) you can already work with event-based data in HyperSpy: it is just converted to dense on the fly, since the analysis methods in HyperSpy don't generally work with sparse data.

Ensuring that different algorithms receive data in the formats they support, while at the same time working with all file formats, was one of the trickier parts of sparse support in LiberTEM. We introduced a signaling system where processing routines and datasets indicate which array backends they support natively. From that, an execution plan is derived with the necessary conversions. With that we can gradually roll out native sparse support for datasets, detectors and algorithms (UDFs). If you are interested in sparse data for HyperSpy I can only invite you to make use of that infrastructure! As you may know, we added pervasive Dask array support in LiberTEM for such integrations: https://libertem.github.io/LiberTEM/dask.html
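As a small sketch of why CSR navigation slicing is O(1) (synthetic data, plain scipy.sparse rather than LiberTEM's raw CSR reader): the indptr array points directly at each flattened scan position's events, so extracting one frame needs no search or scan.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 4D STEM stack flattened to 2D: rows = scan positions,
# columns = flattened detector pixels.
scan_h, scan_w, det_h, det_w = 16, 16, 128, 128
rng = np.random.default_rng(1)
dense = (rng.random((scan_h * scan_w, det_h * det_w)) > 0.999).astype(np.uint8)
stack = csr_matrix(dense)

# O(1) slicing of one scan position: indptr[i] and indptr[i + 1]
# delimit that row's entries directly.
i = scan_h * scan_w // 2
start, stop = stack.indptr[i], stack.indptr[i + 1]
frame = np.zeros(det_h * det_w, dtype=np.uint8)
frame[stack.indices[start:stop]] = stack.data[start:stop]
frame = frame.reshape(det_h, det_w)
```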
...I should add, the Amsterdam Scientific Instruments CheeTah T3 camera can already save data in the CSR file format that LiberTEM supports, as a result of our cooperation. As you can see, the format is really simple and based on an established standard, so it is straightforward to open it with all kinds of software. You can create a …
@uellue Thanks for the detailed information! It seems like LiberTEM is once again ahead of the curve :)
The CSR format is usually only applied to 2-dimensional data, correct? In this case you would still have a fully dense navigation array? Are you using GCXS or something like a dense object array of CSR arrays?

Edit: I actually looked at the example given: you are flattening the data into a 2D array where one axis is the flattened navigation dimension and the other is the flattened signal dimension. Interesting, is there any performance difference between this and a GCXS-type implementation?
Have you given any thought to plain binary versus some sort of zip-like container storage such as HDF5 or Zarr? I can see benefits to each method, but would there be any specific reason against packaging things as one "file"?
I'd assume that it's not in a canonical form because the data comes in as a stream ordered by the time each event occurred? Out of curiosity, what happens to the time information? Are you just integrating over some time frame to create the data? Is the intensity information stored as well?
This is another reminder that I need to increase the …
Thank you for bringing this up. In addition to what @uellue mentioned about using CSR/GCXS for storing STEM data, I have a few other comments:
Is intensity here the "time over threshold" supported by some detectors, or is this a number of counted electrons from a time bin or similar? Having access to the ToT for each event is valuable, as it can enable a form of super-resolution in post-processing, just as one example. There are not only "pixel" events; on some detectors, generic TDC events are also supported (think: some electrical signal on an input of the detector board went high/low). That would be a secondary set of columns.
On sorted data, you can do better than …
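Presumably this refers to binary search on the sorted coordinates, matching the O(log(n)) claim above. A minimal sketch (synthetic data) of finding the span of one navigation row in canonically sorted COO indices with np.searchsorted:

```python
import numpy as np

# Hypothetical row indices of a canonically sorted COO array.
rows = np.array([0, 0, 2, 2, 2, 5, 7, 7])

# Binary search for the span of row 2: O(log n), no full scan needed.
start = np.searchsorted(rows, 2, side="left")
stop = np.searchsorted(rows, 2, side="right")
print(rows[start:stop])  # -> [2 2 2]
```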
Can you clarify why chunks would need to be read twice?
I think this would best be deferred to the application. For example, in LiberTEM the decision is taken at runtime: if the user's code supports a compatible sparse format, the most efficient conversion from the source format into the supported format is performed (which can be "zero-cost", for example when converting between the same or very similar formats in different libraries). If sparse computation is not supported, the data is "densified" in small chunks.
As a real-world example, one of the RAW CSR data sets I have access to, shape …

Often in CSR there are runs of many very similar, if not identical, numbers, which would benefit strongly from run-length encoding (which is used internally, for example, in LZ4 or Zstd), and/or similar techniques like bit-packing. I'm not sure if there are similar patterns in x/y/t/v-style data; that's certainly something to look into. So, it does make sense to look into compression, if only for storage efficiency.
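To illustrate why CSR index arrays are so compressible (a synthetic sketch, with stdlib zlib standing in for LZ4/Zstd): a monotonic indptr whose deltas are small and similar compresses far better than random integers of the same size.

```python
import zlib
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical indptr: monotonically increasing, with small similar deltas,
# as produced by frames with roughly constant event counts.
deltas = rng.poisson(40, size=100_000)
indptr = np.concatenate([[0], np.cumsum(deltas)]).astype(np.int64)
raw = indptr.tobytes()

# Random 62-bit integers of the same size, for comparison.
noise = rng.integers(0, 2**62, size=indptr.size).astype(np.int64).tobytes()

print(len(zlib.compress(raw)) / len(raw))      # small ratio: compresses well
print(len(zlib.compress(noise)) / len(noise))  # ~1.0: barely compresses
```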
I think we realistically want to be able to save any kind of reasonable data, so things like ToT or some electrical signal are what I was thinking. The number of counted electrons could also be relevant, as there are multiple counting methods which operate on some integrating detector and result in a sparse dataset.
Ahh, yes you are correct, I don't think many people are using something like a bubble sort in practice :)
It kind of depends on the scheme. If you wanted to compress every 1000 events you might have:

Chunk 1: 0.0 sec - 0.8 sec
Chunk 2: 0.8 sec - … (crossing the 1.0 sec boundary)

So if you want to create a 1-second frame rate, you would have to read chunk 1 once and chunk 2 twice.
I think this seems reasonable. To start, it would be good to identify a "best practice" for saving the data and then build the analysis tools around that. Most of the conversions should be fairly cheap.
Thanks for trying that! Storage efficiency and I/O bandwidth limitations would both benefit from compression.
LiberTEM uses a flat navigation dimension internally and folds back to the actual shape only at the end, so that part actually works to our advantage! Many operations, such as virtual detectors (dot product), sum, standard deviation etc., already process the data with a flat sig dimension, so there is no change in that case either. If a UDF wants a folded sig dimension it can request GCXS. The internal data structure of GCXS is the same as CSR/CSC; it just has metadata on which axes are folded how, meaning the conversion could be O(1). The performance of the scipy.sparse.csr_matrix dot product is much, much better than that of sparse.pydata.org. For that reason we use it where possible. However, I think that's just an implementation difference and they could be nearly equally fast.
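As a sketch of that layout (synthetic data, not LiberTEM code): with flat navigation rows and flat signal columns, a virtual detector is a single CSR dot product.

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Hypothetical flattened stack: rows = scan positions, cols = detector pixels.
scan_h, scan_w, det_h, det_w = 32, 32, 64, 64
stack = sparse_random(scan_h * scan_w, det_h * det_w, density=0.01,
                      format="csr", dtype=np.float32, random_state=3)

# A virtual annular detector is just a mask over the flat signal axis...
yy, xx = np.mgrid[0:det_h, 0:det_w]
r = np.hypot(yy - det_h / 2, xx - det_w / 2)
mask = ((r > 10) & (r < 25)).astype(np.float32).ravel()

# ...and the whole virtual image is one sparse-dense dot product,
# folded back to the scan shape only at the end.
virtual_image = (stack @ mask).reshape(scan_h, scan_w)
```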
Being able to interact with the files directly, without involving any 3rd party libraries, gives the best portability and flexibility -- reading and writing with direct I/O or liburing, memory mapping, even embedded systems etc. With three files you can write the data sequentially without knowing the final size. Also, it removes free parameters such as chunking, internal paths etc, meaning one can write a simple file reader that will perform consistently with all files. I don't like the variability you can have with HDF5 and other containers that can be pretty difficult to support. With HDF5 it is "new data source, new troubles".
Exactly, the events are scrambled a bit. ASI sorts them by frame index, which is required for CSR, but not by column index, to keep up with the data rate. Also, they don't aggregate events, i.e. if two electrons hit the same pixel you get 2x "1" with the same column index.
Yes, there are methods in scipy.sparse for that, but of course that is extra work, so processing as-is is probably better. Didn't benchmark it, though! As you can imagine, I spent quite some time ironing out all the rough patches with these conversions... If you want to convert between formats, I'd recommend having a look at sparseconverter.
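For reference, a sketch of the scipy.sparse cleanup methods alluded to, applied to a tiny hypothetical stream with duplicate, unsorted entries like those described above:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Two electrons hitting pixel 7 of frame 0 arrive as two separate "1"
# entries, with column indices not sorted within the row.
data = np.array([1, 1, 1], dtype=np.uint32)
indices = np.array([7, 3, 7])   # unsorted, with duplicate column 7
indptr = np.array([0, 3])       # a single row (frame)
m = csr_matrix((data, indices, indptr), shape=(1, 16))

m.sum_duplicates()  # aggregate repeated coordinates: pixel 7 -> 2
m.sort_indices()    # ensure column indices are sorted within each row
print(m.indices, m.data)  # -> [3 7] [1 2]
```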
I understood the …
Ah yeah, if you add compression into the mix, you might run into these read-amplification issues, that's true. Nowadays there are also seekable formats. If they are efficient enough with small "frame sizes" (their term for chunks), they may be an option to allow more fine-grained slicing with less read amplification, and to find offsets for slicing efficiently using e.g. binary search. If you really only put 1k events into a chunk, it would just make sense to eat the read-amplification cost, though :)
Do you think there can be a single "best practice"? Such a format would need to be completely lossless and contain all information that was recorded, possibly including information used by the vendor's software. Otherwise, I'm concerned that this might result in having to store both the original data "as read from the detector" in a potentially vendor-specific format and the converted "best practice" format. (This is already happening with some non-sparse vendor formats, meaning we need to archive two sets of potentially large data if we want to a) be FAIR with an open format, and b) still have all metadata and be able to use the vendor's software to load it again.)
We have a benchmark to put numbers to that. Take these with a grain of salt, as they are mostly used for deciding which conversion "path" to take if there is no direct conversion possible. Also, these conversions handle quite small chunks of sparse data; converting a whole data set might paint a different picture.
I did some quick calculations of file size for two detectors versus the exposure rate (in events per pixel per "frame"), assuming that a total of 20 events per pixel are accumulated during the acquisition (more or less total exposure should simply scale linearly). I used four ideas for encoding: (1) binary frame-based encoding, (2) an optimized run-length encoding scheme I've been considering, (3) frame-based CSR (where the ROW_INDEX array represents frames, and the COLUMN_INDEX array represents the pixel index within each full 2D frame), and (4) row-based CSR (where the ROW_INDEX array represents rows within each frame, and the COLUMN_INDEX array represents the column within each row).
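As a rough illustration of this kind of calculation (my own simplified size model, not the exact one used for the numbers above), the crossover between frame-based and row-based CSR comes from a trade-off: row-based CSR needs many more ROW_INDEX pointers, but its COLUMN_INDEX entries only have to address one detector row, so they can be narrower.

```python
def csr_bytes_per_frame(h, w, rate, row_based, index_bytes, ptr_bytes=8):
    """Rough size model: one COLUMN_INDEX entry per event plus one
    ROW_INDEX pointer per CSR row; values omitted for binary events."""
    nnz = h * w * rate              # expected events in one frame
    n_rows = h if row_based else 1  # detector rows vs. whole frame as one row
    return nnz * index_bytes + n_rows * ptr_bytes

h, w = 512, 512
for rate in (0.001, 0.01, 0.025):
    # Frame-based: indices address h*w pixels (4 bytes); row-based: w (2 bytes).
    fb = csr_bytes_per_frame(h, w, rate, row_based=False, index_bytes=4)
    rb = csr_bytes_per_frame(h, w, rate, row_based=True, index_bytes=2)
    print(f"rate={rate}: frame-based {fb / 1e3:.1f} kB, "
          f"row-based {rb / 1e3:.1f} kB")
```

With these assumptions, frame-based CSR wins at a rate of 0.001 while row-based wins at 0.01 and above, matching the crossover described above.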
For sparse electron microscopy data, we typically see a per-frame sparsity of ~0.001 to ~0.025. A lower exposure rate means that the frames are of limited utility because they are signal-starved. A higher exposure rate means that coincidence loss is significant, limiting data quality. Obviously, the file format should accommodate lower or higher exposure rates, but I'm interested in looking at the typical use cases.

Absent additional post-encoding compression, the run-length encoding (RLE) scheme results in the smallest file sizes. However, RLE seems best suited for binary data, and it is not clear how to extend it to include intensity, size, or other data about each event. Additionally, parallelizing input/output is not as straightforward as it is for CSR.

Frame-based CSR (which is what I think @uellue was advocating) is certainly much smaller than storing raw frames. However, it is noticeably larger than row-based CSR for all but the very lowest exposure rates. Therefore, in the practical cases I'm considering, row-based CSR seems best. But if the exposure rate is very low, it's better to group multiple rows or use entire frames instead of rows. Therefore, the file format should enable everything from row-based CSR up to frame-based CSR. I'm envisioning a format that consists of the following combination of files (or these files could be grouped into a single HDF5 file, and additional compression (e.g., ZIP) could be added):
Initially, the optional additional files can probably be ignored during analysis, but they will be useful for adding new features in the future. For example, our low-voltage (SEM) cameras record the energy of every detected electron. We would like to save the energy of each detection event along with its coordinates, so that users can perform a posteriori energy filtering.
Okay, I think I'm starting to get the scope of the problem, but someone tell me if I am wrong:

Data Acquisition and Storage: There are potentially 2 (maybe 3) different types of data coming off of cameras.
Data Processing
Compression and Storage
Let me know if I have anything wrong! I might mess around and see how this might look using …
Describe the functionality you would like to see.
This is something that I've talked about offline a couple of times, but it would be good to have a more serious discussion on it and get ahead of it before it is too late :). It would be nice to develop a standard for saving event-based data. This is usually saved in something like 4 columns: x, y, intensity, time, but there are potentially other ways that this can be saved.
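For concreteness, a sketch of that four-column layout as a NumPy structured array (field names and dtypes are illustrative assumptions, not a proposed standard):

```python
import numpy as np

# One record per detected event: x, y, intensity (e.g. ToT), timestamp.
event_dtype = np.dtype([
    ("x", np.uint16),
    ("y", np.uint16),
    ("intensity", np.uint16),
    ("time", np.uint64),  # e.g. nanoseconds since acquisition start
])

events = np.zeros(3, dtype=event_dtype)
events[0] = (10, 20, 5, 1_000)
events[1] = (10, 20, 3, 1_250)  # same pixel hit again, later
events[2] = (99, 7, 1, 2_000)
```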
For example, you could have a counted dataset where you have a grouping of [x, y] positions and some associated time, etc. Note that this would be equivalent to the ragged-array storage that we currently have implemented.
If you want to create an image or a spectrum, you would just integrate the events over some time period.
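A minimal sketch of that integration step, binning events (using the hypothetical record layout above) into one dense frame per time window:

```python
import numpy as np

def integrate(events, t_start, t_stop, shape):
    """Sum all events with t_start <= time < t_stop into a dense frame."""
    sel = events[(events["time"] >= t_start) & (events["time"] < t_stop)]
    frame = np.zeros(shape, dtype=np.uint32)
    # np.add.at accumulates correctly when several events share a pixel.
    np.add.at(frame, (sel["y"], sel["x"]), sel["intensity"])
    return frame

# frame = integrate(events, 0, 1_000_000, shape=(256, 256))
```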
Describe the context
This type of data doesn't really conform very well to things like HDF5 or Zarr. We could segment it into something like 1-second chunks or groups of 1,000,000 events (64 bits × 4 columns × 1,000,000 events → ~32 MB), which would make the recreation much faster. For compression's sake you probably want some sort of chunk ID which records the start time and the end time of each chunk. This is how Parquet handles larger column-based arrays.
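A sketch of that chunk-ID idea (hypothetical layout): keep per-chunk start/end times in a small index, then binary-search it to find the chunks overlapping a requested time slice.

```python
import numpy as np

# Hypothetical chunk index: chunk i covers [starts[i], ends[i]) seconds.
starts = np.array([0.0, 0.8, 1.7, 2.4])
ends   = np.array([0.8, 1.7, 2.4, 3.3])

def chunks_for_window(t0, t1):
    """Indices of all chunks overlapping the half-open window [t0, t1)."""
    first = np.searchsorted(ends, t0, side="right")
    last = np.searchsorted(starts, t1, side="left")
    return range(first, last)

print(list(chunks_for_window(0.0, 1.0)))  # -> [0, 1]: chunk 2 spans the cut
```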
As far as recreation goes, this is a pretty fun lazy-computation problem. Each integrated image can be created lazily after you slice in time. You can re-slice without penalty, as you haven't actually created the dataset.
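A sketch of that lazy recreation with Dask, reusing the hypothetical integrate helper from above: each frame is a delayed task, so re-slicing in time just builds a new graph without reading or integrating anything.

```python
import dask
import dask.array as da
import numpy as np

def lazy_frames(events, t0, t1, period, shape):
    """One delayed dense frame per time window; nothing is computed yet."""
    edges = np.arange(t0, t1, period)
    frames = [
        da.from_delayed(
            dask.delayed(integrate)(events, a, a + period, shape),
            shape=shape, dtype=np.uint32,
        )
        for a in edges
    ]
    return da.stack(frames)

# stack = lazy_frames(events, 0, 10**9, 10**6, (256, 256))
# stack[5].compute()  # only the sixth frame is actually integrated
```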
Additional information
A couple of key points:
I feel like this might make more sense in rosettasciio than elsewhere (such as hyperspy), as it would be good to handle the integration (or lack thereof) at load time rather than later. There is also the question of saving and operating on the data: do we keep it as event-based data, or is the dataset converted to an integrated dataset at some point during the analysis?
Actionable Items:
@ericpre @francisco-dlp @sk1p @uellue @cophus @bbammes