skyfield calculations may be overly accurate for requirements at the cost of computation #4
Upon closer look, while the lines pointed out above do need addressing, the skyfield library seems generally quite slow for what we are asking it to do. Both the _t_since_sunset and _sun_elevation functions are slow for large datasets, with the latter taking the crown. These lines also cause speed issues: nowcastlib/nowcastlib/pipeline/features/generate.py Lines 182 to 186 in f054704
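To see where the time goes, the two halves of such a call can be timed separately. This is a minimal, self-contained sketch (not the actual nowcastlib code): it contrasts materialising one Python datetime object per timestamp, as the profiled functions do, against staying inside numpy. The function names here are hypothetical stand-ins.

```python
import timeit
import pandas as pd

# An evenly spaced timeline, like the datasets discussed in this issue.
index = pd.date_range("2021-01-01", periods=10_000, freq="min", tz="UTC")

def via_python_datetimes():
    # to_pydatetime() builds one Python datetime object per timestamp,
    # which is the kind of per-element cost suspected above.
    return index.to_pydatetime()

def via_numpy_values():
    # Staying with the underlying numpy array avoids per-element objects.
    return index.values

t_objects = timeit.timeit(via_python_datetimes, number=10)
t_array = timeit.timeit(via_numpy_values, number=10)
print(f"python datetimes: {t_objects:.4f}s, numpy array: {t_array:.4f}s")
```

On a timeline of millions of timestamps, the per-object path grows linearly in both time and memory, which is consistent with the slowdowns reported below.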
Skyfield was selected because the popular library pyephem suggests it. Further attention should be dedicated to determining whether
I'll be interested to hear which component in that call has the higher cost. Is there a way you can produce that series of times, I wonder, without creating n separate Python datetime objects? If the times are evenly spaced on the clock or calendar there are other approaches in Skyfield for creating the time array. In the meantime, if you split your call into two lines with a temporary variable:
You could then separately measure the cost of the Pandas operation and the cost of the Skyfield operation.
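The split Brandon suggests might look like the sketch below. The combined call being split is hypothetical (the actual nowcastlib code lives at the permalinks above); the point is that each line can then be profiled on its own, and that evenly spaced times can be built without n separate datetime objects at all.

```python
import pandas as pd

# An evenly spaced timeline standing in for the real dataset's index.
index = pd.date_range("2021-01-01", periods=1_000, freq="min", tz="UTC")

# Hypothetical split of a combined call into two measurable steps:
datetimes = index.to_pydatetime()  # step 1: the Pandas operation

# step 2 would be the Skyfield operation, e.g. (not run here):
#   t = timescale.from_datetimes(datetimes)
#
# Because the samples are evenly spaced, Skyfield can also build the
# whole Time array vectorised, skipping the datetime objects entirely,
# by passing a sequence to ts.utc:
#   t = timescale.utc(2021, 1, 1, 0, range(1_000))
print(len(datetimes))
```

With the call split this way, a line profiler such as %lprun attributes the cost to each step individually instead of to the single combined line.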
@brandon-rhodes Hi, thanks for commenting! Wasn't expecting the library creator to come and help 😅

The times are indeed evenly spaced, so this could be done, as you say, with other approaches from skyfield. That was just a quick and dirty solution to avoid digging too deeply into the documentation. I've updated my second comment with the complete section of code causing problems.

Evaluating line by line as you suggested sheds some light on which operations are more costly. You can see my experimentation in this notebook, where I use %lprun to profile the same function with different input time series. While for small time series the pandas operation is more costly than skyfield's, skyfield's cost dominates as the series grows. In the notebook, the time series I try are of length 1000, 10000 and 100000. I wasn't able to try 1000000, since execution would not complete. For this particular project, time series of that size (millions of timestamps) are what we are concerned with.

Thanks again for the interest! I hope this satisfies your curiosity :). We would really appreciate any suggestions on how to do this in a smarter way. We're trying to get the sun's altitude at each timestamp for millions of timestamps, with the timestamps evenly spaced (a regular sample rate, basically). We are also interested in determining the most recent sunset for each timestamp (code here), although this has been less problematic.
My guess is that most of the time is invested in computing precession and nutation — the precise angle at which the Earth's poles point at any given moment. The standard nutation series, in particular, has 687 terms of 5 coefficients each that need to be evaluated at every point on your timeline, so if you have one million times, that's something like 5×687×10⁶ multiplies, plus all the necessary adds to put them together. What kind of accuracy are you after? The full series is only useful if you're going for observatory-grade angles suitable for pointing a radio telescope (which is the accuracy Skyfield aims for by default). Since your times are so close together, my guess is that we could evaluate nutation and precession daily, or only a few times a day, and linearly interpolate between them, to get better speed. But that will introduce error, so we first need to know how exact an answer you need on the Sun's position in the sky. What will you be doing with the angles? That can help establish tolerances.
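The "evaluate rarely, interpolate in between" idea can be sketched with numpy. The function below is a hypothetical smooth stand-in for a slowly varying quantity like a nutation angle, not the real 687-term series; the point is that daily coarse samples plus linear interpolation cover a million fine timestamps while exposing exactly how much error the shortcut introduces.

```python
import numpy as np

def slow_angle(t_days):
    # Hypothetical slowly varying angle (radians); in the real case this
    # evaluation would be the expensive precession/nutation computation.
    return 0.1 * np.sin(2 * np.pi * t_days / 365.25)

# One million closely spaced sample times spanning about a week.
t_fine = np.linspace(0.0, 7.0, 1_000_000)

# Evaluate the expensive function only once per day...
t_coarse = np.arange(0.0, 8.0, 1.0)
angle_coarse = slow_angle(t_coarse)

# ...and linearly interpolate onto the fine timeline: 8 expensive
# evaluations instead of 1,000,000.
angle_interp = np.interp(t_fine, t_coarse, angle_coarse)

# The error the shortcut introduces, to compare against the tolerance
# the application actually needs.
max_err = np.max(np.abs(angle_interp - slow_angle(t_fine)))
print(f"max interpolation error: {max_err:.2e} rad")
```

For a smooth function, linear interpolation error scales with the square of the coarse spacing, so halving the spacing roughly quarters the error — a knob to turn once the required tolerance is known.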
We are using the angles as additional input data to provide to machine learning models to be used for turbulence forecasting.
I have asked my colleagues what sort of accuracy we are after and will update this post
This seems like a good idea, and depending on our tolerance it may be sufficient to go this route. I imagine this will be the case. Thank you again for all the help, let us know if you have any more comments 😊
Have not analyzed big-O performance, but it is slow enough to be a nuisance for larger datasets, especially since this calculation needs to be repeated across train/test sets and perhaps even across folds.
The following lines need addressing:
nowcastlib/nowcastlib/pipeline/process/postprocess/generate.py
Lines 130 to 133 in b24e0d8