Let's add an option to turn off aborts and error messages on errors. #340

edwardhartnett · 2023-03-10T22:21:22Z

edwardhartnett
Mar 10, 2023
Maintainer

Right now we have aborts when there is an error. This is inconvenient for the calling application.

How about we add an option to change this behavior? If the option is used, the library will print nothing and will not abort, but will instead return an error code.

Obviously this would touch a lot of code, and would have to be done very carefully. But it is absolutely required if we are going to create a V2 API, and also required to reasonably test all error behavior. It's also required for the C API to get any traction.
Calling abort on error is a holdover practice from ancient times. We can change this behavior in a backward compatible way.

edwardhartnett · 2023-03-11T23:18:36Z

edwardhartnett
Mar 11, 2023
Maintainer Author

If we change the bort() subroutine, so that it checks the new option and only aborts if it should, then we have a pretty easy fix for a lot of the problem.

For example we could then test error conditions for all our other code.

The problem is that we would be using the same error code for all errors. In order to have library-wide error codes, we would need a new bort(), one which took an integer argument, which would be an error code. Then there needs to be a library function which returns the associated error message.

What would also be nice is if we could also handle the existing error messages. We want to turn them off, but save them in case the user wants to get them. For that we could have an internal buffer that contains the last error message (or all spaces). Then there could be a function to get that. (And if we were doing that we might as well have an integer buffer which stores the last error number.)

In that case, the user would get the general bort return code for errors, but could then check the error message buffer for any message.
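
A minimal sketch of that idea, assuming module-level state. Every name here (bort_state, set_abort_on_error, report_error, last_error_message, last_error_code, clear_error, GENERIC_BORT_ERROR) is hypothetical and not part of NCEPLIBS-bufr; report_error stands in for bort(), and the 256-character buffer length is an arbitrary choice.

```fortran
! Sketch only: every name below is hypothetical, not NCEPLIBS-bufr API.
module bort_state
  implicit none
  private
  public :: set_abort_on_error, report_error, last_error_message, &
            last_error_code, clear_error

  integer, parameter, public :: GENERIC_BORT_ERROR = 1
  logical :: abort_on_error = .true.      ! default keeps the current behavior
  character(len=256) :: last_msg = ' '    ! all spaces means "no error"
  integer :: last_code = 0                ! 0 means "no error"

contains

  ! Turn aborting on or off for the whole library.
  subroutine set_abort_on_error(flag)
    logical, intent(in) :: flag
    abort_on_error = flag
  end subroutine set_abort_on_error

  ! Stand-in for bort(): abort as before, or save the message and return.
  subroutine report_error(str)
    character(len=*), intent(in) :: str
    if (abort_on_error) then
      write (*, '(a)') trim(str)
      error stop 1
    end if
    last_msg = str
    last_code = GENERIC_BORT_ERROR
  end subroutine report_error

  ! Return the saved message (all spaces if there was no error).
  function last_error_message() result(msg)
    character(len=256) :: msg
    msg = last_msg
  end function last_error_message

  ! Return the saved error number (0 if there was no error).
  integer function last_error_code()
    last_error_code = last_code
  end function last_error_code

  ! Reset the buffers, for example before the next library call.
  subroutine clear_error()
    last_msg = ' '
    last_code = 0
  end subroutine clear_error

end module bort_state

program demo_bort_state
  use bort_state
  implicit none

  call set_abort_on_error(.false.)
  call report_error('some problem')
  if (last_error_code() /= 0) then
    print '(a,i0,2a)', 'error ', last_error_code(), ': ', &
          trim(last_error_message())
  end if
  call clear_error()
end program demo_bort_state
```

With the flag left at its default the stand-in still prints and stops, so existing applications would see no change.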

0 replies

edwardhartnett · 2023-03-12T15:09:10Z

edwardhartnett
Mar 12, 2023
Maintainer Author

What if we add an optional argument to bort(), called err_code?

Then, as time goes by, we can come up with a library-wide list of error codes, and return the appropriate code from each bort.
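
Here is a rough sketch of what the optional argument could look like. The names bort_sketch, bort_rc, BORT_GENERIC and BORT_BAD_UNIT and the code values are made up for illustration, and the abort branch is left out for brevity. One caveat I am sure of: Fortran optional arguments need an explicit interface, so this assumes the routine lives in a module (or has an interface block) rather than being a plain external subroutine.

```fortran
! Sketch only: names and code values are hypothetical.
module bort_codes
  implicit none
  integer, parameter :: BORT_GENERIC = 1    ! used when no code is given
  integer, parameter :: BORT_BAD_UNIT = 2   ! example of a specific code
  integer :: bort_rc = 0                    ! 0 means "no error"
  character(len=256) :: bort_errmsg = ' '
contains
  ! Record the message and an error code; the code is optional so that
  ! existing one-argument calls keep compiling unchanged.
  subroutine bort_sketch(str, err_code)
    character(len=*), intent(in) :: str
    integer, intent(in), optional :: err_code
    bort_errmsg = str
    if (present(err_code)) then
      bort_rc = err_code
    else
      bort_rc = BORT_GENERIC
    end if
  end subroutine bort_sketch
end module bort_codes

program demo_bort_codes
  use bort_codes
  implicit none
  call bort_sketch('old-style call')              ! no code: generic value
  print '(a,i0)', 'rc = ', bort_rc
  call bort_sketch('unit not open', BORT_BAD_UNIT) ! specific code
  print '(a,i0)', 'rc = ', bort_rc
end program demo_bort_codes
```

Call sites could then be converted to specific codes one at a time, while the rest keep returning the generic value.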

0 replies

edwardhartnett · 2023-05-13T14:58:35Z

edwardhartnett
May 13, 2023
Maintainer Author

From @jack-woollen

Thinking about how to thread bort return codes through the lib, I come up with the following thoughts, just for grins.

if bort is a gateway to threading the return codes, it could update a globally scoped bort_rc value along with an error string, and then return. The caller would then also return. Each caller in the return path would check the global bort_rc value and also return if it is set, all the way back to the original caller, who could do whatever they want, but would probably just print the string and stop.
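
To make the return path concrete, here is a small sketch with a three-level call chain. All names (propagate_demo, bort_sketch, bort_str, low, mid, top, steps_run) are invented for the example; only bort_rc comes from the description above. The counter steps_run records how much work ran after the failing call, which should be none.

```fortran
! Sketch only: a toy call chain, not NCEPLIBS-bufr code.
module propagate_demo
  implicit none
  integer :: bort_rc = 0              ! globally scoped return code
  character(len=80) :: bort_str = ' ' ! globally scoped error string
  integer :: steps_run = 0            ! work done after the failing call
contains
  ! Stand-in for bort(): record the error and return to the caller.
  subroutine bort_sketch(str)
    character(len=*), intent(in) :: str
    bort_rc = 1
    bort_str = str
  end subroutine bort_sketch

  subroutine low(n)
    integer, intent(in) :: n
    if (n < 0) then
      call bort_sketch('low: negative n')
      return                          ! return added after the bort call
    end if
    steps_run = steps_run + 1
  end subroutine low

  subroutine mid(n)
    integer, intent(in) :: n
    call low(n)
    if (bort_rc /= 0) return          ! if test added after an internal call
    steps_run = steps_run + 1         ! skipped once an error is set
  end subroutine mid

  subroutine top(n)
    integer, intent(in) :: n
    call mid(n)
    if (bort_rc /= 0) return
    steps_run = steps_run + 1
  end subroutine top
end module propagate_demo

program demo_propagate
  use propagate_demo
  implicit none
  call top(-1)
  if (bort_rc /= 0) then
    print '(2a)', 'error: ', trim(bort_str)
    print '(a,i0)', 'steps run after the error: ', steps_run
  end if
end program demo_propagate
```

Removing the if test in mid or top lets the statements after the call run on bad state, which is the risk with any caller that does not check.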

There are about 1800 internal calls, not counting calls to bort. There are about 550 calls to bort. Each call to bort would require a return statement added after it. Each other call would need an if test added wrapping or following it to check the return code. So adding 1800+550 return statements is required. But what seems like a bigger problem to me is 1800 new if tests added throughout the lib, each potentially being executed multiple times. Performance can be an issue here.

Maybe there's a better way to do this?

0 replies

edwardhartnett · 2023-05-13T15:10:19Z

edwardhartnett
May 13, 2023
Maintainer Author

I don't think we have to worry about performance because:

1. An if statement is very fast.
2. More importantly, this will only occur on error conditions, which are not happening over and over again.

As you very correctly practice in your netCDF code: you must always check the return value. This always adds an if-statement, but that's OK, it's worth it.

Right now we have code like:

   if (something) then
      call bort("some problem")
   endif

And bort is like this:

      SUBROUTINE BORT(STR)

      CHARACTER*(*) STR

      CALL ERRWRT(' ')
      CALL ERRWRT('***********BUFR ARCHIVE LIBRARY ABORT**************')
      CALL ERRWRT(STR)
      CALL ERRWRT('***********BUFR ARCHIVE LIBRARY ABORT**************')
      CALL ERRWRT(' ')

      CALL BORT_EXIT

      END

So we change to:

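      SUBROUTINE BORT(STR)

      CHARACTER*(*) STR

      CALL ERRWRT(' ')
      CALL ERRWRT('***********BUFR ARCHIVE LIBRARY ABORT**************')
      CALL ERRWRT(STR)
      CALL ERRWRT('***********BUFR ARCHIVE LIBRARY ABORT**************')
      CALL ERRWRT(' ')
      ! bort_flag and bort_errmsg are new shared variables (for example
      ! in a module). The constant compared against bort_flag needs a
      ! name other than BORT_EXIT, which is already the abort routine.
      if (bort_flag .eq. BORT_EXIT) then
         CALL BORT_EXIT
      else
         bort_errmsg = STR
         return
      endif

      END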

Then we change calls to it like this:

   if (something) then
      call bort("some problem")
      return
   endif

If the function/subroutine already has a return code in its parameters, we can use that. If not, callers can check BORT_ERRSTR (the saved error string, called bort_errmsg in the sketch above) to see if it is empty. If it's not, there was an error.

So all NCEPLIBS-bufr calls would need to be like:

call ufbsomething(blah1, blah2)
if (BORT_ERRSTR .ne. '') stop 10

Ideally we would have an error number, instead of a string. That's a further elaboration on this idea that would require adding a parameter to bort().

3 replies

jack-woollen May 14, 2023
Collaborator

Actually almost all user entries to the bufrlib already have return codes. But I think you're missing the roughly 1800 internal calls in the bufrlib. Each one of those would need:

call something(blah1, blah2)
if (BORT_ERRSTR .ne. '') return

I'm not saying this can't be done. I am saying it's an awful lot of work, and a lot of overhead, just to be able to call bufrlib routines from netcdf, or for any cosmetic purpose.

edwardhartnett May 14, 2023
Maintainer Author

All other libraries, including all other NCEPLIBS libraries, return error codes and do not crash the calling application.

NCEPLIBS-bufr must meet that same requirement. If the other libraries can all do it, I'm convinced that NCEPLIBS-bufr can do it too.

jack-woollen May 15, 2023
Collaborator

@edwardhartnett You wrote:
"Take a closer look at the code I've written. While an extra return statement is needed after each bort() call, there's no need to check the contents of BORT_ERRSTR everywhere, just at the top level."

If the bort happens 5 or 6 subroutines deep from the user app, your code will return 1 level and then what? If every internal call does not check BORT_ERRSTR after the callee returns, the results become unpredictable. It doesn't return to the user app without traversing all the callers in between it and the callee that hit bort. I know you know that! Every routine in the return path must immediately return to its caller after a bort is hit. Otherwise the problem flagged by the bort call can generate seg faults on the way back up the return path, for example. That doesn't seem like a genius move to me, as regards your following statement.

edwardhartnett · 2023-05-14T15:39:42Z

edwardhartnett
May 14, 2023
Maintainer Author

Passing return codes from library functions is not cosmetic. It is basic functionality that is expected from every library. Crashing the calling application is a serious bug that makes NCEPLIBS-bufr unusable to most other applications.

When we do something in our library which no one else does, one of the following is true:

1. We are software geniuses who figured out a better way to program than everyone else.
2. We are not.

Choice 2 is far more likely.

In a fully tested code base I would expect this to take a couple of days, or a week at most. I will get back to this in a few months, after I have made some more GRIB progress. I have no doubt that I can make these changes quickly and easily, once the library is fully tested.

0 replies

jbathegit · 2023-06-02T20:53:12Z

jbathegit
Jun 2, 2023
Maintainer

I think any solution also needs to take into account that an application code can have multiple Fortran logical units attached to the library at any given time. So a bort-like error condition could occur which impacts one logical unit but not the others, and therefore any bort_flag or bort_errmsg values should be indexed to or otherwise keyed to a particular logical unit. This would allow an application code to make a more informed decision about what to do, and this would also make sense from a user's standpoint, since pretty much every routine that a user ever calls is already passing in such a logical unit number anyway as one of the arguments.

This is the basic approach I took in PR #509, using the internal module stcode and the user-callable subroutine igetsc to gracefully exit from an error that previously caused an abort in subroutine usrtpl. In particular, note how the internal iscodes array and the input argument to igetsc are both keyed to a specific logical unit, which in turn allows there to be different status codes for each logical unit rather than multiple ones simultaneously overwriting or otherwise tripping over each other between multiple logical units.

I realize this isn't exactly the same as what's being discussed above to route all error processing through subroutine bort, but it's the same basic concept, so let's please keep it in mind whenever we get around to tackling this issue on a library-wide basis. In particular, let's plan to make any bort_flag or bort_errmsg values specific to each individual Fortran logical unit. Also, if we do end up deciding to route everything through subroutine bort, then let's please also go back and retrofit the solution that I implemented in #509, so that we have the same consistent approach for all error processing throughout the library.
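
For illustration only, here is a simplified sketch of status codes keyed to a logical unit. This is not the actual code from PR #509: the names (unit_status_sketch, set_status, get_status, slot_for, MAX_UNITS) and the slot table are assumptions made for this example.

```fortran
! Sketch only: hypothetical names, not the stcode/igetsc implementation.
module unit_status_sketch
  implicit none
  private
  public :: set_status, get_status
  integer, parameter :: MAX_UNITS = 32
  integer :: units(MAX_UNITS) = -1   ! logical unit held in each slot, -1 = free
  integer :: codes(MAX_UNITS) = 0    ! status code per slot, 0 = no error
contains
  ! Return the slot for lunit, claiming a free one if needed; 0 if full.
  integer function slot_for(lunit) result(i)
    integer, intent(in) :: lunit
    do i = 1, MAX_UNITS
      if (units(i) == lunit) return
    end do
    do i = 1, MAX_UNITS
      if (units(i) == -1) then
        units(i) = lunit
        return
      end if
    end do
    i = 0
  end function slot_for

  ! Record a status code against one logical unit only.
  subroutine set_status(lunit, code)
    integer, intent(in) :: lunit, code
    integer :: i
    i = slot_for(lunit)
    if (i > 0) codes(i) = code
  end subroutine set_status

  ! Fetch the status code for one logical unit (0 if none recorded).
  integer function get_status(lunit)
    integer, intent(in) :: lunit
    integer :: i
    i = slot_for(lunit)
    get_status = 0
    if (i > 0) get_status = codes(i)
  end function get_status
end module unit_status_sketch

program demo_unit_status
  use unit_status_sketch
  implicit none
  call set_status(11, 3)                        ! error on unit 11 only
  print '(a,i0)', 'unit 11: ', get_status(11)
  print '(a,i0)', 'unit 12: ', get_status(12)   ! other units are unaffected
end program demo_unit_status
```

An application could then keep working with unit 12 while deciding separately what to do about unit 11.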

0 replies
