The Chequered Career of nBackup

By Helen Borrie

The incremental backup utility toolset nBackup is not a friendly beast. An operational misstep can break a database. Under load, in some conditions, it can be a database-killer without any human intervention, especially under Firebird v.2.1 or v.2.0.

There's a proviso, though. A lot of the problem code was eliminated by a thorough revamp for Firebird 2.5. The threadability of attachments in v.2.5 is also thought to have reduced some of the instability of nBackup. The improvements were not backported to the v.2.1 or 2.0 series because doing so was structurally infeasible.

This article is not a “How-to” for using nBackup. The Firebird Project documentation for proper use of this toolset is excellent and can be read or downloaded from the Documentation pages at the Firebird web site.

How nBackup Should Play

The SQL command:

ALTER DATABASE <database> BEGIN BACKUP

sets a state flag in the database header to STALLED. It causes the engine to divert all subsequent page writes to a difference file, or “delta file”. Despite what one might assume, it does not start a backup and it should NOT be used when one is about to start a backup. Its purpose is to allow subsequent safe copying of the main database file, either by a file system utility such as a compression tool or a backup application or by the nBackup code itself. However, the user should never wrap the SQL commands around any call to the nBackup executable: the physical backup code always executes the code beneath BEGIN BACKUP and END BACKUP itself when it is required.

To reiterate: It's essential to realise that no backup starts until the engine actually receives a command-line call to nBackup. The SQL statement is not required from the user when starting a backup: the nBackup command does this for you when called with the -B[ackup] switch and the appropriate parameters.
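As a rough illustration of what such a call looks like, the Python sketch below only assembles the argument lists for a level-0 (full) and a level-1 (incremental) backup; it does not run anything. The helper name nbackup_backup_args, the paths and the credentials are invented for the example. The -B, -U and -P switches are the documented ones, but do check them against the nBackup documentation for your Firebird version.

```python
def nbackup_backup_args(level, database, backup_file, user=None, password=None):
    """Build the argument list for an nBackup backup at the given level."""
    if level < 0:
        raise ValueError("backup level must be 0 or greater")
    args = ["nbackup"]
    if user is not None:
        args += ["-U", user]
    if password is not None:
        args += ["-P", password]
    # -B <level> <database> <backup file>; level 0 is a full backup.
    args += ["-B", str(level), database, backup_file]
    return args


# Hypothetical paths and credentials, for illustration only.
full = nbackup_backup_args(0, "/data/employee.fdb", "/backup/employee_0.nbk",
                           user="SYSDBA", password="masterkey")
incremental = nbackup_backup_args(1, "/data/employee.fdb", "/backup/employee_1.nbk",
                                  user="SYSDBA", password="masterkey")
print(" ".join(full))
print(" ".join(incremental))
```

Note that neither list contains a BEGIN BACKUP or END BACKUP statement: the -B call takes care of the state changes itself.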

The Delta File

By default, the delta (or “difference”) file is created in the same directory as the database, and its name is the full file name of the database with the suffix “.delta” appended. For example, if the database is EMPLOYEE.FDB then the difference file will be EMPLOYEE.FDB.delta.

DDL syntax exists to customise the location and/or file name of the delta file, using the DIFFERENCE FILE clauses in CREATE DATABASE and ALTER DATABASE.
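For readers who script their DDL, here is a minimal Python sketch that composes the ALTER DATABASE statements for setting and dropping a custom delta location. The function name and the path are hypothetical, and the statement text follows the DIFFERENCE FILE syntax as I understand it from the Firebird language reference, so verify it before relying on it.

```python
def add_difference_file_sql(delta_path):
    """Return DDL that sets delta_path as the delta file location."""
    quoted = delta_path.replace("'", "''")  # double single quotes for SQL
    return f"ALTER DATABASE ADD DIFFERENCE FILE '{quoted}'"


# Dropping the clause returns the database to the default delta location.
DROP_DIFFERENCE_FILE_SQL = "ALTER DATABASE DROP DIFFERENCE FILE"

# Hypothetical path, for illustration only.
print(add_difference_file_sql("/fastdisk/deltas/employee.delta"))
print(DROP_DIFFERENCE_FILE_SQL)
```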

Once the valid nBackup call is submitted the state flag will change to STALLED, the database file will be locked and the delta file should appear. From now until the completion of the backup, all changes in the database pages are written out to the delta.

Mapping

nBackup writes pages of all types into the delta, not just data pages but index, transaction and page inventory types. Each time it writes out a page, it also writes and updates mappings in the delta to pages in the database and, if applicable, to pages in the delta itself. Mappings will be used to merge the delta pages into the database once the backup is complete.

Merging

When nBackup is ready to merge the delta file with the stalled database, it changes the status flag in the database header from STALLED to MERGE (BACKUP MERGE in gstat -h attributes) and starts to merge the pages from the delta back into the database. As it writes and overwrites the modified pages in the database, it updates the mapping tables in memory and also updates the modified pages in the delta file.

End of a Backup

When nBackup has completed the backup, it calls:

ALTER DATABASE <database> END BACKUP

This command changes the status flag from MERGE to NORMAL (no backup state listed in gstat -h attributes). From there on, all page writes go to the database. The engine is then meant to delete what is left of the delta and life goes on.
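The whole cycle described above can be pictured with a toy model. The Python sketch below is emphatically not how the engine is implemented: the class and its dictionaries are invented for illustration, the merge is treated as a single uninterrupted step and the mapping tables are left out altogether. It shows only the sequence of states and where page writes land in each of them.

```python
class ToyDatabase:
    """Toy model of the NORMAL -> STALLED -> MERGE -> NORMAL cycle."""

    def __init__(self):
        self.state = "NORMAL"
        self.main_pages = {}   # page number -> contents in the main file
        self.delta_pages = {}  # page number -> contents diverted to the delta

    def begin_backup(self):
        if self.state != "NORMAL":
            raise RuntimeError(f"cannot begin a backup in state {self.state}")
        self.state = "STALLED"

    def write_page(self, page_no, contents):
        # In NORMAL state writes reach the main file; otherwise the delta.
        if self.state == "NORMAL":
            self.main_pages[page_no] = contents
        else:
            self.delta_pages[page_no] = contents

    def read_page(self, page_no):
        # The newest version wins: look in the delta first.
        if page_no in self.delta_pages:
            return self.delta_pages[page_no]
        return self.main_pages[page_no]

    def end_backup(self):
        if self.state != "STALLED":
            raise RuntimeError(f"cannot end a backup in state {self.state}")
        self.state = "MERGE"
        self.main_pages.update(self.delta_pages)  # copy changed pages back
        self.delta_pages.clear()                  # the delta is discarded
        self.state = "NORMAL"


db = ToyDatabase()
db.write_page(1, "original")
db.begin_backup()
db.write_page(1, "changed during backup")
print(db.state, db.main_pages, db.delta_pages)
db.end_backup()
print(db.state, db.main_pages, db.delta_pages)
```

While the toy database is STALLED, the main file still holds the original page, which is what makes it safe to copy.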

How nBackup Can Misbehave

Traditionally, one of Firebird's great features has always been that it was almost impossible to corrupt a database, either through an untoward action by the engine or by a bad API call. For those determined to break a database, it could be achieved in three ways:

  1. restoring a backup over a running database
  2. allowing users to log in during a restore—a practice that has been blocked since Firebird 2.0
  3. making a file-system copy of a database whilst users were logged in

Altering the system tables directly could do it—although this is no longer possible in Firebird 3—and, on old versions of Firebird, a database could be corrupted on Windows if Forced Writes was disabled.

The nBackup tools brought a whole bunch of new ways to corrupt databases. Some can occur as the result of faults in the code beneath the tools and are more likely in older versions than in versions 2.5+. Others occur from human error.

Humans vs nBackup

We look first at some of the ways humans have found to do it. With the application of common sense and adequate technical understanding, they are avoidable.

Moving or Deleting the Delta File

A delta file is always a “live wire”, even if the Firebird server is not running. Trigger-happy technicians, looking to free up disk space, have been known to discover a .delta file during a cleanup and to delete or move it because they did not know what it was and assumed it was garbage of some kind. This is not a sensible approach to disk-saving. If a delta file exists, then it is hanging off a database that is not in state NORMAL. If it is missing when someone tries to connect to a database in state STALLED or MERGE, connection is impossible, even by a privileged user.

If the delta can be returned intact to its rightful home, without any attempts to correct the apparent corruption, the database may be connectable. An operator should then proceed with what is required to merge the delta with the database and return the state to NORMAL. If not, then all changes that have occurred since the file date on the database file are lost and a physical recovery will be needed to retrieve the data from the last time the database was in NORMAL state.
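One cheap safeguard is to make any disk clean-up script refuse to touch delta files. The following Python sketch separates a list of clean-up candidates into those that may be deleted and those that must be left alone. It assumes the default “.delta” suffix; a delta that has been renamed with a DIFFERENCE FILE clause would slip through, so treat it as a starting point only. The function name and file names are hypothetical.

```python
from pathlib import Path


def split_cleanup_candidates(paths):
    """Separate files that look like delta files from ordinary candidates."""
    deletable, protected = [], []
    for path in map(Path, paths):
        # Assumption: delta files carry the default ".delta" suffix.
        if path.suffix.lower() == ".delta":
            protected.append(path)
        else:
            deletable.append(path)
    return deletable, protected


# Hypothetical file names, for illustration only.
deletable, protected = split_cleanup_candidates(
    ["/data/old_report.tmp", "/data/employee.fdb.delta", "/data/EMPLOYEE.DELTA"]
)
print("may delete:", [str(p) for p in deletable])
print("leave alone:", [str(p) for p in protected])
```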

Including the Delta File in an External Backup

nBackup provides a tool to “freeze” a database so that it may be safely backed up by an external disk backup utility. The idea is to allow users to continue working during this type of file-copying or compressing operation, when arbitrary sector locks applied by the external operation would otherwise interfere with Firebird's write operations and risk corruption. A delta file is used to store page changes during the freeze, in the same way as when nBackup is executing a full or incremental backup.

If the external application has been set to back up not just the database but also the delta file, its arbitrary sector locking can obstruct the page writes to the delta and corruption is a likely result.
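The freeze workflow can be sketched as an ordered plan of commands in which the delta file never appears as something to be copied. The Python below only builds the plan; it runs nothing. The -L (lock), -N (unlock) and -F (fix-up of the copy) switches are the documented nBackup ones as far as I know, “cp” is merely a stand-in for whatever external tool is in use, and the function name, paths and credentials are hypothetical.

```python
def frozen_copy_plan(database, copy_target, user, password):
    """Return the ordered commands for copying a database while it is frozen."""
    auth = ["-U", user, "-P", password]
    return [
        ["nbackup", *auth, "-L", database],  # freeze: writes now go to the delta
        ["cp", database, copy_target],       # copy the main database file ONLY
        ["nbackup", *auth, "-N", database],  # unfreeze: merge the delta back
        ["nbackup", "-F", copy_target],      # mark the copy as a normal database
    ]


# Hypothetical paths and credentials, for illustration only.
for step in frozen_copy_plan("/data/employee.fdb", "/backup/employee_copy.fdb",
                             "SYSDBA", "masterkey"):
    print(" ".join(step))
```

The fix-up step matters only if the copy is ever to be opened as a database: it was copied while flagged as frozen and still carries that flag.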

Crashing the Machine

Corruption is not supposed to happen if the server machine crashes, or is shut down arbitrarily, whilst a database is not in the NORMAL state and Forced Writes is on. When the machine is restarted, the likelihood that the engine can restore the status quo and resume doing what it was doing when interrupted should be just as good as when the crash happens in NORMAL state. In the versions prior to v.2.5, unfortunately, there were bugs with synchronization and Forced Writes that could put the delta at risk of corruption in these circumstances. There are no “nBackup tricks”, either at the command-line or via SQL, that can fix a broken delta file.

After Firebird restarts, a capable DBA can attempt to log in in exclusive mode, from isql or an equivalent tool, and allow the engine to attempt to re-establish the hook between the database and the delta file. The login will appear to “hang” for a period, sometimes several minutes, so fill the Thermos with coffee, be patient and hope the outcome is good. If the backup state is already MERGE, the engine simply restarts the merge operation. In isql, once the cursor reappears beside the SQL> line marker without reporting errors, the database will be in good shape to continue.
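Before attempting that login, it helps to know which backup state the header is recording. The Python sketch below classifies the state from the text printed by gstat -h. The “backup merge” attribute is the one mentioned earlier in this article; the “backup lock” label for the STALLED state is my assumption about gstat's wording, and the sample header lines are made up, so compare them with real output from your own server.

```python
def backup_state_from_gstat(header_text):
    """Classify the backup state from the text printed by `gstat -h`."""
    for line in header_text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("attributes"):
            attributes = stripped[len("attributes"):].lower()
            if "backup merge" in attributes:
                return "MERGE"
            if "backup lock" in attributes:  # assumed label for STALLED
                return "STALLED"
            return "NORMAL"  # no backup state listed
    return "UNKNOWN"  # no Attributes line was found


# Made-up header fragment, for illustration only.
sample = "\tPage size\t\t4096\n\tAttributes\t\tforce write, backup merge\n"
print(backup_state_from_gstat(sample))
```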

If the interruption happened at a point where the backup state was just changing from STALLED to MERGE, it is possible, in the v.2.0 or v.2.1 versions, that both the database and the delta are broken, perhaps by a stray lock, a mismatch or a duplication of page identifiers that the engine discovers when trying to reconstruct the memory table that stores the mappings. It is a problem that seems more likely to occur on a Classic server, where the crash could have crucial effects on the updating of non-shared resources. A log-in attempt might look something like this:

C:\programs\firebird\Firebird_2_1\bin>isql
Use CONNECT or CREATE DATABASE to specify a database
SQL> connect 'fatso:blobby' user sysdba password 'masterkey';
Statement failed, SQLCODE = -902
internal gds software consistency check (Duplicated item in allocation table detected)
-internal gds software consistency check (Error: can't actualize alloc table)
-internal gds software consistency check (Error: lock allocation table on read)
SQL> _

The actual error(s) returned will depend on the damage that occurred. The engine cannot recover from this situation and professional help will be needed to recover a working database.

Forgetting About a Delta

Using the SQL command:

ALTER DATABASE <database> BEGIN BACKUP

changes the database state to STALLED and creates a delta file. From here on, all changes go to the delta. Nothing else. We have seen cases where someone did just that, in the belief that he was initiating some kind of magical replication scheme that would just go on indefinitely—no more gbak backups, no more maintenance to do, for ever. In one case, the homecoming happened about 15 months later, the day the delta ate the last of the disk space and became horribly corrupted. In such situations of administrative neglect, typically there are no backups to cover the gap. It is a long and expensive task to retrieve data from a dead delta and merge it manually with an old database, with no guarantee that all or any data are retrievable.
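A forgotten delta is easy to spot if somebody is looking. As a minimal sketch, the Python function below lists the delta files in a directory that have grown beyond a chosen size, so that a scheduled job could raise an alarm long before the disk fills. The function name and the threshold are arbitrary, and it again assumes the default “.delta” suffix.

```python
from pathlib import Path


def oversized_deltas(directory, max_bytes):
    """Return (path, size) for each *.delta file larger than max_bytes."""
    found = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix.lower() == ".delta":
            size = path.stat().st_size
            if size > max_bytes:
                found.append((path, size))
    return found


# Scan the current directory with an arbitrary 10 GB threshold.
for path, size in oversized_deltas(".", 10 * 1024 ** 3):
    print(f"WARNING: {path} has grown to {size} bytes")
```

Size is only one signal. A delta of any size that outlives the backup it belongs to deserves a second look.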

Fate vs nBackup

nBackup's existence prior to Firebird 2.0 was private: first as a commissioned feature in a large company's custom-built versions of Firebird and, later, incorporated in a Firebird 1.5 fork by Red Soft Corp., Russia, that became Red Database. The company that originally commissioned the code donated it to the Firebird Project, in accordance with open source licensing, for inclusion in Firebird 2.0.

Over its public life since 2006, nBackup has earnt a bad reputation, often deserved, sadly.

Firebird 2.0.x

Initially, nBackup as presented in Firebird 2.0 really did not work at all reliably. On a Windows Classic server, it would not delete the delta file after a backup completed successfully. In fact, on Windows servers it would not work at all in interactive mode. It could not back up a database that had been created in a recent session. Under load, when locking or backing up a database, it would apply a page-level deadlock, resulting in a bugcheck.

Fixes in v.2.0.1 did not end the woe. Persistent bugs continued to corrupt databases, especially under high-load conditions. Faulty merge logic sometimes broke databases or left them in an erroneous “locked” state that could not be reconciled. Unexpected deadlocks continued to occur during backups and merges. Some of these problems were supposedly fixed for the 2.0.4 sub-release.

Further repair was necessary, right up to the final v.2.0.6 sub-release. The “Forced Writes” setting had not been respected for writes to the delta and backing up a large database would hog I/O resources, bringing production work to a standstill. The cause of the I/O problem was that the backup polluted the file system cache and pushed the system into swapping.

Direct I/O for nBackup

As a workaround, direct (uncached) I/O was introduced as an option for nBackup, albeit at a substantial cost to backup performance. The trap is that nobody can predict which read mode (cached vs uncached) is faster for a particular operating system and database. It takes some experimentation with the -d on|off switch to figure that out. The Firebird engineers arrived at different defaults for Windows and POSIX systems, as the filesystem caching issue mostly affects Windows users.
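That experimentation can be scripted. The Python sketch below builds a level-0 backup command with direct I/O switched on or off and provides a small helper that times any command. The placement of the -D switch ahead of -B is my assumption about what nBackup accepts, credentials are left out for brevity, and the function names and paths are hypothetical. Try it on a test copy of the database, with a different target file for each run.

```python
import subprocess
import time


def direct_io_args(database, backup_file, direct_io):
    """Build a level-0 backup command with direct I/O on or off."""
    mode = "ON" if direct_io else "OFF"
    return ["nbackup", "-D", mode, "-B", "0", database, backup_file]


def time_command(args):
    """Run a command and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(args, check=True)
    return time.perf_counter() - start


# Hypothetical paths; the commands are printed here, not executed.
cached = direct_io_args("/data/test_copy.fdb", "/backup/cached_0.nbk", False)
uncached = direct_io_args("/data/test_copy.fdb", "/backup/uncached_0.nbk", True)
print(" ".join(cached))
print(" ".join(uncached))
```

Passing each list to time_command gives a first comparison. Bear in mind that the first run warms the file system cache, so it is wise to repeat the runs, alternate their order and measure under a realistic load.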

Firebird 2.1.x

In the initial v.2.1 release, nBackup was considered completely broken. Amongst its problems was faulty support for backing up databases on raw POSIX devices. To create a delta file, it should have enforced an explicit DIFFERENCE file path, since a raw device is not a file in a directory! Instead, it would assume some nonsensical “local directory” as the default location for the delta file, with ugly consequences.

The v.2.1.1 sub-release had nBackup working, at least. However, it soon appeared that something was seriously wrong with its database locking: database files would grow while writes were supposedly going to the delta. The bug was fixed for the v.2.1.2 sub-release. Firebird 2.1.2 had a short life and was followed rapidly by v.2.1.3, so it was not discovered in time that the “Forced Writes” setting had not been respected for writes to the delta. That was fixed for v.2.1.4 and backported to the final sub-release of v.2.0.x, but had to be revisited for the v.2.1.5 sub-release and the Firebird 3 betas. Also fixed for v.2.1.4 was a resurfaced bug, whereby the delta file was being left behind after the end of a backup.

Two more potentially corrupting bugs showed up for fixing in v.2.1.5. Somehow, a regression had led to delta file pages not getting flushed to disk in the revamped Firebird 2.5 nBackup code and the same bug was found in the v.2.1.4 code. The v.2.5.1 fix was backported to v.2.1.5. The v.2.1.4 nBackup code was discovered attempting to read beyond the end of file when the backup state was STALLED, so the fix for this appeared in v.2.1.5, as well.

Firebird 2.5.x

A major revamp of nBackup and the physical backup code in the engine, performed by a Red Soft engineer for the v.2.5.0 release, improved synchronization in the page cache manager and corrected lock ordering to avoid deadlocks. Lock caching was added to reduce contention and the GlobalRWLock class was partly rewritten to improve stability under load. Those improvements, combined with the extra robustness brought to Firebird 2.5 by threadable client connections, led to an expectation of greater reliability and fewer opportunities for corrupting databases.

The recurrence of the problems flushing delta pages to disk has already been mentioned. The implementation of nBackup on POSIX platforms threw up several platform-specific bugs that were fixed in the v.2.5.1 sub-release. The process would throw a segmentation fault (in other words, crash) on Ctrl-C and leave the database locked and the delta file continuing to grow. It was temperamental about a missing firebird.conf and would fail if the requesting user was not root or Owner. As late as v.2.5.3, it was writing its error messages to stdout instead of stderr.

Despite these niggles, the v.2.5+ implementation appears to be meeting expectations. We have not been asked to fix a single database broken by nBackup under v.2.5.x.

Firebird 3.0.x

Since its earliest times, the nBackup toolset has been implemented with a rather idiosyncratic set of switches, out of keeping with those used in Firebird's other command-line tools and inadequately validated at input time. Some work was done during the Firebird 3 alpha workup to clean things up.

Is nBackup Good to Go?

In theory, nBackup has a lot going for it. It provides the internal code to maintain backups of physical pages incrementally without grossly disrupting production work. It can solve a fundamental issue for administrators of large databases, for whom frequent gbak backups just cannot fit around production demands.

In practice, nBackup's adolescence has been lengthy and beset with elements of danger for caring database custodians. Using versions lower than v.2.5 presents more risk than benefit. Considering both the v.2.0 and 2.1 server series are now out of development, nBackup on those versions will never improve beyond the status quo. Dmitry Yemanov, Firebird's project lead, recommends avoiding it totally on the 2.0.x and earlier 2.1.x versions and using it on v.2.1.6 or v.2.1.7 only when there are no active concurrent connections.

With Firebird 2.5+, the message is “Proceed with caution” if the decision is made to activate nBackup in software deployments, particularly where the housekeeping of Firebird servers is to be administered by users who are not technically expert in the ways of nBackup.

nBackup Has Limitations

nBackup backs up and keeps account of physical changes to database pages, page by page. It has no knowledge of what types of pages it is storing, nor of the validity (or otherwise) of the data, nor whether the page contents are active, interesting or just garbage. A database restored from nBackup backups will be nothing more and nothing less than a physical copy of the database as it was the last time the highest-level backup completed. As a fallback from physical calamity that is relatively economical on resources, particularly for very large databases, nBackup in its mature state is good enough.

nBackup does not work with multi-file databases. If you are preparing a multi-file database to implement nBackup, it will be necessary to back it up with gbak and restore it as a single-file database.
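The conversion amounts to a gbak backup followed by a gbak restore to a single target file. The Python sketch below only assembles the two command lines. The function name, paths and credentials are hypothetical; -b, -c, -user and -password are gbak's usual switches, but check them against your version's documentation.

```python
def single_file_conversion_plan(primary_file, backup_file, new_database,
                                user, password):
    """Return the gbak commands to rebuild a multi-file database as one file."""
    auth = ["-user", user, "-password", password]
    return [
        # Back up the multi-file database by naming its primary file.
        ["gbak", "-b", *auth, primary_file, backup_file],
        # Restore into a brand new database consisting of a single file.
        ["gbak", "-c", *auth, backup_file, new_database],
    ]


# Hypothetical paths and credentials, for illustration only.
for step in single_file_conversion_plan("/data/big_part1.fdb", "/backup/big.fbk",
                                        "/data/big_single.fdb",
                                        "SYSDBA", "masterkey"):
    print(" ".join(step))
```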

In contrast, the traditional gbak backup has more purposes than just backing up. It is a logical, not a physical, backup tool, reading every scrap of data in a snapshot and outputting everything to a file (or files) in a special compressed text format that is, by default, transportable across platforms. In so doing, unless instructed otherwise with the -[no_]g[arbage] switch, it cleans as much garbage as it can from the database that it is backing up. It never backs up garbage. It reads multi-file databases and can back them up as one or multiple backup files. Restoring from a gbak backup creates a brand new database, which can be single- or multiple-file, and reconstructs metadata and data using inserts and updates in a native query language.

Why You Still Need gbak

nBackup and gbak are not interchangeable. You can have one without the other, provided the one you have is gbak! A periodic restore from a gbak backup is required for several essential housekeeping tasks, including retrieval of unused disk space, defragmentation, obliteration of stubborn old garbage and resets of transaction counters and object version numbers.