Exception Handling - Functional Specification

by Marco Romanini 14th Oct 1998

Description

It has become evident that the engine must be protected from badly behaved code, both in user functions and in the engine itself. As customers migrate from Classic server to Super Server, they are noticing one of the fundamental differences: in the Classic model, if the server crashed, only the user who caused the crash was disconnected, while in Super Server all users are dropped and the server core dumps/GPFs.

Using C++ language exception handling we are able to trap hardware exceptions such as SEGV / Access Violation. By doing this we should be able to prevent the server from crashing in some cases, and hopefully handle the situation gracefully.

User interface/Usability

In UNIX we have the ability to define special signal handlers for thread-specific signals such as SIGSEGV, SIGFPE, SIGILL, and SIGBUS. In Windows we can trap many different exceptions such as ACCESS_VIOLATION, STACK_OVERFLOW, ARRAY_BOUNDS_EXCEEDED, DATATYPE_MISALIGNMENT, FLT_DENORMAL_OPERAND, FLT_DIVIDE_BY_ZERO, FLT_INEXACT_RESULT, FLT_INVALID_OPERATION, FLT_OVERFLOW, FLT_STACK_CHECK, FLT_UNDERFLOW, INT_DIVIDE_BY_ZERO, and INT_OVERFLOW. Using signal handlers on UNIX and try-except blocks on Windows, we are able to trap any of these signals/exceptions.
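As an illustration of the UNIX half of the paragraph above, the following is a minimal sketch of installing one handler for the four signals named there, using POSIX sigaction. The names record_signal, install_fault_handlers, and g_last_signal are hypothetical and are not taken from the InterBase sources; the handler only records the signal number, which is a placeholder for whatever the engine would actually do.

```cpp
#include <cassert>
#include <csignal>
#include <signal.h>  // POSIX sigaction

// Last signal seen by the handler. volatile sig_atomic_t is the object
// type that may safely be written from inside a signal handler.
static volatile sig_atomic_t g_last_signal = 0;

// Hypothetical handler: it only records which signal arrived.
extern "C" void record_signal(int signo) {
    g_last_signal = signo;
}

using FaultHandler = void (*)(int);

// Installs `handler` for SIGSEGV, SIGFPE, SIGILL and SIGBUS.
// Returns true if every sigaction call succeeded.
bool install_fault_handlers(FaultHandler handler) {
    assert(handler != nullptr);
    const int signals[] = {SIGSEGV, SIGFPE, SIGILL, SIGBUS};

    struct sigaction sa = {};
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);  // block no additional signals in the handler
    sa.sa_flags = 0;

    for (int signo : signals) {
        if (sigaction(signo, &sa, nullptr) != 0) {
            return false;
        }
    }
    return true;
}
```

Calling install_fault_handlers(record_signal) once at startup would be enough for this sketch. Returning normally from the handler is only well defined when the signal was sent with raise() or kill(); after a real hardware fault the handler must not simply return.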

However, the question we are now faced with is: Once we know that someone has done something VERY bad, what do we do about it?

There are two kinds of signals/exceptions: recoverable and non-recoverable. Most signals/exceptions are classified as non-recoverable since it can NOT be guaranteed that no damage was done to memory. The only signal/exception we can classify as recoverable is Stack Overflow. This is because the OS tells us that we have attempted to exceed the stack boundary, but the attempt was not carried out, so nothing was damaged. In this case it is OK to unwind the stack and continue executing.

Therefore, we will do three things:

  1. Catch and handle recoverable signals/exceptions, for now only stack overflow. This means that once we detect this signal/exception we will use our current error handling mechanism to inform the client of the situation. This of course includes terminating the request.
  2. Catch, log, and do NOT handle all signals/exceptions in user code. This means that we will be able to use the InterBase logging mechanism to state that user function name X has caused a critical error in the server, and therefore the server has died.
  3. Catch, log, and do NOT handle all other signals/exceptions. This means that once we detect these signals/exceptions, presumably in our code, we will use the InterBase logging mechanism to state that this happened, but we will not do anything about it, thus allowing the OS to handle it in the usual manner (i.e. Access Violation, core dump, etc.). The information provided here will not be very useful for debugging or resolving the problem. However, this must be the case, since the only way to give useful information about the problem is to give out symbols along with our server. This is of course unacceptable.
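
The three rules above can be summarized as a small decision function. This is only a sketch: the enum and function names are hypothetical, and the text does not say which rule applies when a stack overflow occurs inside a user function, so the sketch assumes the user-code rule (rule 2) takes precedence in that case.

```cpp
// What kind of signal/exception was trapped.
enum class Fault { StackOverflow, Other };

// Where it was raised.
enum class Origin { UserFunction, Engine };

// What the server does about it.
enum class Action {
    ReportAndTerminateRequest,  // rule 1: inform the client, end the request
    LogUserFunctionAndDie,      // rule 2: log the user function name, server dies
    LogAndDie                   // rule 3: log, then let the OS handle it
};

constexpr Action decide(Fault fault, Origin origin) {
    // Assumption: any fault in user code falls under rule 2,
    // even a stack overflow.
    if (origin == Origin::UserFunction) {
        return Action::LogUserFunctionAndDie;
    }
    // Stack overflow is the only fault treated as recoverable.
    if (fault == Fault::StackOverflow) {
        return Action::ReportAndTerminateRequest;
    }
    return Action::LogAndDie;
}
```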

Requirements and Constraints

One requirement is that we remove the arbitrary limitation of 750 as the maximum number of recursions of a stored procedure. This will allow us to trap the stack overflow and stop the recursion at that time, rather than at the arbitrary point of 750.

The constraint for this feature is that the server will still crash. We are not implementing this to prevent crashes, except for the stack overflow case. This only allows us to point out to the user that they caused the crash, if indeed that is the case. This behavior may not be considered too friendly by most users; however, as discussed above, this is the ONLY thing we can do safely.

Migration issues

In previous versions of the product, we stopped recursions of stored procedures after 750 calls; now we will be able to continue until we actually run out of stack. The actual limit may now be higher or lower, depending on how much stack each call takes up. This is a much more realistic approach to the problem, but users will have to be informed of this change.
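
To see why the new limit can land on either side of 750, consider a rough estimate of the depth reached before the stack is exhausted. The stack size and per-call frame sizes below are made-up numbers chosen for illustration only; they are not measurements of the server, and the function name is hypothetical.

```cpp
#include <cstddef>

// Approximate number of nested calls that fit in `stack_bytes`
// when each call consumes `frame_bytes` of stack.
constexpr std::size_t max_recursion_depth(std::size_t stack_bytes,
                                          std::size_t frame_bytes) {
    return frame_bytes == 0 ? 0 : stack_bytes / frame_bytes;
}

// Hypothetical 1 MiB stack.
constexpr std::size_t kStackBytes = 1024 * 1024;

// 2 KiB per call: 512 calls fit, fewer than the old limit of 750.
constexpr std::size_t kDeepFrames = max_recursion_depth(kStackBytes, 2048);

// 512 bytes per call: 2048 calls fit, more than the old limit of 750.
constexpr std::size_t kShallowFrames = max_recursion_depth(kStackBytes, 512);
```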

As far as user code is concerned, this marks the third behavior users will have seen, counting InterBase V5.0. In InterBase V5.0 and earlier, if there was a signal/exception in user code we would crash. In InterBase V5.5, we added the necessary signal/exception handling to trap this situation, report it to the user, and continue executing. This looks very good to users, but it was determined that it was too risky to continue executing without knowing the extent of the damage done by the signal/exception. Therefore, we are now proposing to trap the signal/exception, log it to our log file, and abort execution of the server. This is not as well behaved as InterBase V5.5, but it is the safest approach.
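
The proposed "trap, log, abort" behavior can be sketched on UNIX as a handler that writes a message, restores the default disposition, and re-raises the signal so that the OS terminates the process in the usual manner. This is an assumption about how it might be done, not the engine's actual code: the function names are hypothetical, standard error stands in for the InterBase log file, and the scenario runs in a forked child so that the caller can observe the outcome.

```cpp
#include <cassert>
#include <csignal>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical handler: log, then hand the signal back to the OS.
// Only async-signal-safe calls (write, signal, raise) are used.
extern "C" void log_and_die(int signo) {
    static const char msg[] = "fatal signal caught, server terminating\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;
    signal(signo, SIG_DFL);  // restore the default action
    raise(signo);            // delivered with the default action
}

// Runs the scenario in a child process. Returns the number of the signal
// that killed the child, 0 if it exited normally, or -1 on error.
int run_crashing_child(int signo) {
    assert(signo > 0);
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        struct sigaction sa = {};
        sa.sa_handler = log_and_die;
        sigemptyset(&sa.sa_mask);
        sigaction(signo, &sa, nullptr);
        raise(signo);  // stands in for a real fault
        _exit(0);      // reached only if the signal did not kill the child
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

With this approach the parent of the crashed process still sees a death by the original signal, so core dumps and exit status behave as they would without the handler; the only addition is the log entry.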