
Support multiple file formats for the raw data #5

Open
rossant opened this issue Jun 14, 2019 · 11 comments

@rossant (Contributor) commented Jun 14, 2019

No description provided.

@yger (Contributor) commented Jun 19, 2019

I don't really know what the best way to proceed is here. On one hand, there is neo, a Python package meant to read/write various file formats quickly and efficiently. On the other hand, we have also written numerous wrappers on our side, close/similar to the ones you'll find in neo but lighter, for the internal needs of SpyKING CIRCUS. Since neo is more structured, maybe that is the way to go? It would be amazing if phy could display several native/proprietary file formats, as numerous users struggle simply to export their data to raw binary...

@rossant (Contributor, Author) commented Jun 19, 2019

Could you point me to the code of your wrappers?

@yger (Contributor) commented Jun 19, 2019

The code is here: https://github.com/spyking-circus/spyking-circus/tree/master/circus/files
I am not saying this is optimal; I am not as good a coder as you, so the system should clearly be refactored. Basically, there is a DataFile object exposing read/write methods, implemented by subclasses for each proprietary file format. It gets a little more complicated because we can virtually concatenate files, or recordings within the same file (HDF5). Neo does the same thing, but with a slightly better-documented structure. The two overlap heavily, and for quite a while I have told myself that maybe SC should take neo as a dependency, in order to centralize all the wrappers once and for all. I would be very interested in your opinion too.
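The wrapper pattern described here (a common DataFile base class with per-format subclasses) can be sketched roughly as follows. This is an illustration only; the class names and signatures are hypothetical, not the actual SpyKING CIRCUS code:

```python
import numpy as np


class DataFile:
    """Format-agnostic reader: subclasses implement the I/O for one
    file format (hypothetical names, not the real SC classes)."""

    def read(self, t_start, t_stop):
        """Return samples t_start..t_stop as an (n, n_channels) array."""
        raise NotImplementedError

    def write(self, t_start, chunk):
        raise NotImplementedError


class RawBinaryFile(DataFile):
    """Raw binary layout is a flat (n_samples, n_channels) block, so
    parameters like dtype and channel count must be given externally
    (they are not stored in the file)."""

    def __init__(self, path, n_channels, dtype='int16'):
        # memmap the whole file, then view it as (n_samples, n_channels).
        self._data = np.memmap(path, dtype=dtype, mode='r').reshape(-1, n_channels)

    def read(self, t_start, t_stop):
        return np.asarray(self._data[t_start:t_stop])
```

A subclass for, say, an HDF5-based format would read the channel count and dtype from the file's own metadata instead of taking them as constructor arguments.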

@samuelgarcia

@rossant:
Hi Cyril,
you should seriously have a look at the neo readers.

There are two levels of reading in neo:

  • neo.io: the legacy layer, which returns neo objects (AnalogSignal, SpikeTrain, Event, ...)
  • neo.rawio: the low-level layer, which lazily accesses buffers directly

https://github.com/NeuralEnsemble/python-neo

https://neo.readthedocs.io/en/latest/rawio.html

Reading ephys formats has been reimplemented in many places (circus, neo, spikeextractors, and many individual wrappers for particular formats). It is a total dispersion of energy.

Neo has a strong two-level API that supports multiple blocks, multiple segments, signals with different sample rates, events, epochs, spikes, and waveforms.
Neo includes many formats.

I really think Pierre should move all the wrappers into neo, and Cyril, you should use neo.rawio.
I have been telling Pierre this for some years now.
Maybe one day it will happen. :)

Also note that lazy reading has recently been incorporated into neo.io as well, so that is also a solution you could use.

@yger (Contributor) commented Jun 21, 2019

I think moving to the Neo wrappers would, for us, be the solution. I have just never managed to find the time, but this is an open issue with SC :-) One day, it will happen.

@rossant (Contributor, Author) commented Jun 21, 2019

Thanks @yger and @samuelgarcia, I'll have a look at this soon. I agree that we should reuse the same code as much as possible. For phy, what I'll need is a function with the following signature:

`read_raw_data(data_files, n_channels_dat=None, dtype=None, offset=None, sample_rate=None)`

which returns a single memmapped NumPy array with (virtual) shape (n_samples, n_channels) (or an object polymorphic to it) that can be sliced efficiently in time, where n_samples is the total number of time samples across the entire recording, and n_channels is the total number of channels (across all shanks if there are multiple probes). I say "virtual" because phy supports multiple raw data files that have the same characteristics and are simply split in time.

I already have this function for raw binary files. Parameters like n_channels, dtype, offset, and sample_rate cannot be obtained from raw binary files unless a specific format with a header is used; that's why I need them as optional parameters to this read_raw_data() function. For more complex file formats, I suppose the readers can extract these values from the file header.

Can neo be used to write such a wrapper function?
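For the raw binary case, such a function could be sketched as below. This is only an illustration of the idea, not phy's actual implementation: it returns a list of per-file memmaps plus the total sample count rather than the single virtually-concatenated array described above.

```python
import numpy as np


def read_raw_data(data_files, n_channels_dat=None, dtype=None,
                  offset=None, sample_rate=None):
    """Sketch: memmap each raw binary file as (n_samples, n_channels).

    For raw binary, dtype/n_channels_dat/offset must be supplied by
    the caller; a reader for a richer format would parse them from
    the file header instead. Returns (list of arrays, total samples).
    """
    dtype = np.dtype(dtype or np.int16)
    arrays = []
    for path in data_files:
        # offset is a byte offset into the file (e.g. to skip a header).
        a = np.memmap(path, dtype=dtype, mode='r', offset=offset or 0)
        arrays.append(a.reshape(-1, n_channels_dat))
    n_samples = sum(a.shape[0] for a in arrays)
    return arrays, n_samples
```

Wrapping the returned list in an object that presents one `(n_samples, n_channels)` time axis would give the "virtual" concatenation mentioned above.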

@yger (Contributor) commented Jun 21, 2019

Yes. So I think we should all converge on neo as a dependency and centralize all the individual wrappers there. Neo can do what you want, I think, and expose such functions. The problem is that different file formats require different inputs (sampling rate, number of channels, ...): some have everything in the header, some only partial information. Not a big deal; in phy you just need to know how to handle this in params.py, I guess.

@samuelgarcia

I think it should be easy to make an object proxy between neo.rawio and this function.
This object would virtually concatenate all the signals.

But why don't you use the neo.rawio API directly, which explicitly supports multiple segments and lazy reading, instead of this virtual object inside phy?

Note that in your case the offset can change from one file to another, so this function could lead to problems.
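The object proxy suggested here could look roughly like the sketch below. The reader interface is a hypothetical stand-in loosely modelled on neo.rawio's chunk-based reading (not the real API): the proxy exposes a flat `(n_samples, n_channels)` shape and time slicing over several underlying segments.

```python
import numpy as np


class RawIOProxy:
    """Virtually concatenate a reader's segments along the time axis
    and expose an array-like interface (shape + time slicing).
    `reader.get_chunk(seg, i_start, i_stop)` is a hypothetical method,
    not neo's actual signature."""

    def __init__(self, reader, segment_sizes, n_channels):
        self._reader = reader
        self._sizes = list(segment_sizes)
        # Cumulative start sample of each segment in the virtual recording.
        self._starts = np.cumsum([0] + self._sizes)
        self.shape = (int(self._starts[-1]), n_channels)

    def __getitem__(self, sl):
        # Only plain time slices are supported in this sketch.
        start, stop, _ = sl.indices(self.shape[0])
        parts = []
        for seg, (o, size) in enumerate(zip(self._starts[:-1], self._sizes)):
            lo, hi = max(start - o, 0), min(stop - o, size)
            if lo < hi:
                parts.append(self._reader.get_chunk(seg, lo, hi))
        return np.concatenate(parts, axis=0)
```

A slice that crosses a segment boundary transparently gathers chunks from both segments, which is exactly the "virtual concatenation" phy needs.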

@samuelgarcia

Few formats need parameters as input, except raw binary.

@rossant (Contributor, Author) commented Jun 21, 2019

These parameters would only be used for the raw binary format, which is what we use at the moment. For other file formats, these parameters would be None and simply discarded, since the values would be parsed from the files themselves.

The course of action I described is the least-effort path for me, since I wouldn't have to change anything in phy. The virtual concatenation object we already have works very well for us, and ideally we'd use it uniformly for all file formats.

What's the difference between lazy reading and memmap?

@samuelgarcia

Some files do not store the signal as one contiguous block, so we construct the blocks on the fly.
It's like what memmap does, but with many, many more lines of code.
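A toy illustration of the difference: when the samples are split into non-contiguous blocks inside the file, a plain memmap slice no longer works, and the reader must gather the right byte ranges on the fly. The file layout below is invented for the example; real formats store the block table in their headers.

```python
import numpy as np


def lazy_read(path, block_offsets, block_samples, n_channels,
              i_start, i_stop, dtype=np.int16):
    """Read samples [i_start, i_stop) from a file whose data is split
    into non-contiguous blocks (block_offsets are byte offsets,
    block_samples are per-block sample counts). Only the blocks that
    overlap the requested range are touched, like a memmap slice,
    but with explicit bookkeeping."""
    dtype = np.dtype(dtype)
    starts = np.cumsum([0] + list(block_samples))  # start sample of each block
    out = []
    with open(path, 'rb') as f:
        for off, size, s0 in zip(block_offsets, block_samples, starts[:-1]):
            lo, hi = max(i_start - s0, 0), min(i_stop - s0, size)
            if lo < hi:
                f.seek(off + lo * n_channels * dtype.itemsize)
                buf = f.read((hi - lo) * n_channels * dtype.itemsize)
                out.append(np.frombuffer(buf, dtype=dtype).reshape(-1, n_channels))
    return np.concatenate(out, axis=0)
```

With a contiguous file this collapses to a single seek-and-read, which is why a memmap suffices for raw binary but not for chunked formats.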


3 participants