Brief Description
Index pages have never been garbage-collected by InterBase. Indices would expand as records are added, but would not shrink as records are deleted. For customers who delete a large number of records, this causes three major problems:
- Space utilization: empty index pages were not freed, so their space could not be reused. Though database pages are not returned to the operating system, they should be available for reuse within the database.
- Performance: the more "full" an index is, the more efficient it is. The fewer the pages in an index, the fewer the page fetches required to traverse the index, and the more likely that index pages will be held in cache.
- Unlimited growth: when indexing a key which steadily increases, like a record id or current date, records will be added to the end of the index and deleted from the beginning. Since pages were added at the end of the index and not deleted from the beginning, indices would grow without limit.
For InterBase 5.0, functionality has been added to garbage-collect index pages. This feature not only releases empty index pages, it merges less-used pages to keep the index as small as practical. While this feature was added at the urging of one particular customer, it is a very useful feature that should improve index efficiency markedly for all customers. Further, it will allow use of indices in situations where it was not previously possible.
Feature Description
Functional Changes
B-trees:
B-Tree indices are arranged in a tree with multiple levels. The leaf level is
called level 0, with level 1 above it, and so on. An index can have up to 16
levels, depending on the size of the table. Each node at the leaf level
holds two values: a pointer to the corresponding record in the database
and the key value of that record. Each node at level 1 and above holds
a pointer to a page at the level below, along with the key value of the
first key on that lower page.
Pages have a variable number of nodes. The page does not contain a count of nodes on the page. Instead, the end of the page is marked by an END_BUCKET marker. This is a special node with a pointer value of (-1). The key value of the node is the same as the key value of the first node on the next page, at least initially.
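As an illustration, a page and its nodes can be pictured with C structures along the following lines. This is only a sketch: the field names and layout are illustrative and do not match the actual on-disk structures.

    /* Illustrative sketch of an index page and node; the names and
     * layout are simplified, not the actual on-disk structures. */
    #include <stdint.h>

    #define END_BUCKET (-1)      /* pointer value of the end-of-page node */

    typedef struct btree_node {
        int32_t number;          /* record number (level 0) or page number
                                    (level 1 and above); -1 = END_BUCKET */
        uint8_t prefix;          /* bytes shared with the previous key
                                    (prefix compression) */
        uint8_t length;          /* length of the key bytes that follow */
        uint8_t key[1];          /* variable-length key value */
    } btree_node;

    typedef struct btree_page {
        int32_t sibling;         /* right sibling page */
        int32_t left_sibling;    /* left sibling page */
        uint16_t length;         /* bytes in use on this page */
        uint8_t level;           /* 0 = leaf */
        uint8_t nodes[1];        /* packed btree_node entries, ending
                                    with an END_BUCKET node */
    } btree_page;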
When retrieving via the index, a particular key value is found in the following manner (a code sketch follows the list):
- The top level page is fetched.
- The key values of the nodes are compared to the search key until the first key is found which is greater than or equal to the search key.
- If the page is above level 0, we fetch the page pointed to by the previous node and repeat the comparison there.
- If the page is level 0, we look for the first node of equal key value and fetch the corresponding record.
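The sketch below renders this descent in C. It assumes the btree_page layout sketched above; fetch_page(), first_node(), next_node(), and compare_key() are hypothetical helpers, and prefix compression is ignored for clarity.

    /* Hypothetical helpers assumed by this sketch. */
    btree_page *fetch_page(int32_t page_number);
    btree_node *first_node(btree_page *page);
    btree_node *next_node(btree_node *node);
    int compare_key(const btree_node *node, const uint8_t *key, int len);

    /* Sketch of index retrieval: returns the record number for an exact
     * match at the leaf level, or -1 if the key is not present. */
    int32_t index_lookup(int32_t top_page, const uint8_t *key, int len)
    {
        btree_page *page = fetch_page(top_page);

        for (;;) {
            btree_node *node = first_node(page);
            btree_node *prior = node;

            /* Scan for the first key >= the search key, stopping at
             * the END_BUCKET node that ends the page. */
            while (node->number != END_BUCKET
                   && compare_key(node, key, len) < 0) {
                prior = node;
                node = next_node(node);
            }

            if (page->level == 0)        /* leaf: exact match or miss */
                return (node->number != END_BUCKET
                        && compare_key(node, key, len) == 0)
                    ? node->number : -1;

            /* Above level 0: descend via the previous node's pointer. */
            page = fetch_page(prior->number);
        }
    }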
Strategy:
Index pages will be garbage-collected when they are 25% full or less. A page is
garbage-collected by merging it with its left sibling page, provided there is
adequate room. The pages will not be merged if the combined page does not allow
room for a maximum-length key (256 bytes) to be inserted. The idea is to leave
an adequate margin to prevent thrashing between splitting pages and combining
them together. If pages were merged when they fell below 50% full and
were allowed to form a 100% full page, then every time a key was
inserted a page would split, and every time a key was deleted a page
would be garbage-collected.
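A minimal sketch of these two thresholds, assuming the page layout sketched earlier (the names and arithmetic are illustrative):

    /* Illustrative threshold tests; names and arithmetic are a sketch. */
    #define MAX_KEY_LENGTH 256   /* maximum key length, in bytes */

    /* A page becomes a garbage-collection candidate at 25% full or less. */
    int is_gc_candidate(const btree_page *page, int page_size)
    {
        return page->length <= page_size / 4;
    }

    /* Merge only if the combined page leaves room for a maximum-length
     * key, as a margin against split/merge thrashing. */
    int merge_leaves_margin(int combined_length, int page_size)
    {
        return page_size - combined_length >= MAX_KEY_LENGTH;
    }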
The result should be that the index in general will be more full than in previous releases. It is difficult to quantify how much more full the index will be, since it will depend on usage.
However, at a minimum there will be very few empty pages. There should also be few pages less than 25% full, since that requires a near-empty page to sit next to a near-full left sibling, which seems unlikely in practice.
There are two phases to the garbage collection process: first we make various checks to ensure that garbage collection will be possible, then we update the pages involved.
Check Phase
There are four pages involved in garbage collection: the pointer page at the next higher level, the left sibling page, the page to be garbage-collected, and the right sibling page. The pages are fetched in that order to prevent deadlocking with other processes that might be fetching the same pages simultaneously. We first check that the index structures look correct and that the page to be garbage-collected is still below the minimum threshold (i.e., that no one has added records to it in the meantime).
Next we make a check to see if the page to be garbage-collected is the first one on its parent. The first page on any pointer page cannot be garbage-collected. This is because the page to the left is on a different parent page, which means that moving records to the left would change the range of records falling within that parent page. This is not valid because the END_BUCKET of the new parent will be less than the merged keys, so the keys would not be found during retrieval. Though the page cannot be immediately garbage-collected, this is not a cause for concern. At some point its parent page will be merged with the page to the left, and then the page can be garbage-collected. This allows the index to be fully garbage-collected when all records are deleted.
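In outline, and reusing the illustrative helpers from the earlier sketches, the check phase looks something like this (the free-space computation is sketched below):

    /* Outline of the check phase; routine and field names are
     * illustrative, and error handling is omitted. */
    int gc_checks(int32_t parent_no, int32_t left_no,
                  int32_t page_no, int32_t right_no, int page_size)
    {
        /* Fetch in a fixed order -- parent, left sibling, target, right
         * sibling -- so concurrent processes cannot deadlock. */
        btree_page *parent = fetch_page(parent_no);
        btree_page *left   = fetch_page(left_no);
        btree_page *page   = fetch_page(page_no);
        btree_page *right  = fetch_page(right_no);
        (void) right;        /* fetched (and held) for the update phase */
        (void) left;

        /* Re-check the threshold: another process may have added
         * records since we decided to garbage-collect. */
        if (!is_gc_candidate(page, page_size))
            return 0;

        /* The first page on a pointer page cannot be garbage-collected:
         * its left sibling hangs off a different parent. */
        if (first_node(parent)->number == page_no)
            return 0;

        /* The free-space check on the left sibling comes next; see the
         * combined-length sketch below. */
        return 1;
    }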
Next we check whether the left sibling page has adequate free space to combine the two pages. If the combined page does not have 256 bytes of free space, the pages will not be combined. The length of the combined page is computed as follows (see the sketch after this list):
- Add the lengths of the two pages.
- Subtract the overhead of one page.
- Subtract the length of the END_BUCKET of the left page.
- Subtract the length of the prefix compression, if any, of the first node on the right page.
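A sketch of that computation, reusing the illustrative structures above (PAGE_OVERHEAD, last_node(), and node_length() are assumed names):

    /* Sketch of the combined-length computation from the list above. */
    #define PAGE_OVERHEAD 24     /* illustrative size of one page header */

    btree_node *last_node(btree_page *page);     /* the END_BUCKET node */
    int node_length(const btree_node *node);     /* total bytes of a node */

    int combined_length(btree_page *left, btree_page *right)
    {
        return left->length + right->length
            - PAGE_OVERHEAD                 /* one page header goes away */
            - node_length(last_node(left))  /* left's END_BUCKET is dropped */
            - first_node(right)->prefix;    /* prefix compression of the
                                               right page's first node */
    }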
Update Phase
First we delete the pointer to the page from its parent. This operation could cause the parent page itself to fall below the minimum threshold for garbage collection.
Next the right sibling page is updated. At present its left sibling is the page to be garbage-collected, and it must be updated to point to the garbage-collected page's left sibling.
The left sibling is updated to have the right sibling page as its new right sibling. Then any records remaining on the garbage-collected page are copied to the left sibling, as well as the END_BUCKET marker. The former END_BUCKET marker on the left sibling is deleted in the process.
Finally, we release the garbage-collected page to the free list so that it will be available for reuse within the database. Pages are not released to the operating system.
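Putting the four steps together, the update phase might be sketched as follows (helper names are illustrative):

    /* Sketch of the update phase; helper names are illustrative. */
    void delete_parent_node(btree_page *parent, int32_t page_no);
    void move_nodes(btree_page *to, btree_page *from);
    void release_page(int32_t page_no);

    void gc_update(btree_page *parent, btree_page *left, btree_page *page,
                   btree_page *right, int32_t page_no)
    {
        /* 1. Remove the parent's pointer to the doomed page.  This may
         *    drop the parent itself below the garbage-collection
         *    threshold. */
        delete_parent_node(parent, page_no);

        /* 2. The right sibling's back pointer skips the doomed page. */
        right->left_sibling = page->left_sibling;

        /* 3. The left sibling's forward pointer skips it too; then any
         *    remaining nodes plus the END_BUCKET marker move to the left
         *    page, replacing its former END_BUCKET. */
        left->sibling = page->sibling;
        move_nodes(left, page);

        /* 4. Release the page to the free list for reuse within the
         *    database; pages are never returned to the operating system. */
        release_page(page_no);
    }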
On-disk Integrity
The order in which pages are written to disk is very important to maintain on-disk integrity. Index structures on disk must be valid at all times, so if there is a failure during writes we can recover from it. Thus we set up precedence relationships to make sure that pages are written in the correct order.
The parent is given the highest precedence so that it will be written first. This is to make sure that the parent will not be pointing to a page that has been released. The right sibling has next highest precedence, so that its left sibling will not be a released page. The left sibling is next so that its right sibling will not be a released page. Finally we release the page itself, making sure that the page inventory page will be written after all the other pages, so that the page will not be reused until there is no longer any possibility that someone will fetch it.
The right sibling page comes before the left sibling because it is okay to have the right sibling point two pages to the left but not okay for the left sibling to point two pages to the right. This is because of the "fault-tolerant" algorithm for traversing leaf pages. When moving right-to-left we allow for the possibility that the left sibling pointer may be stale, and refetch pages to the right until we get to the real left sibling.
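These constraints might be expressed with a hypothetical write_before(a, b) helper, meaning page a must reach disk before page b:

    /* Sketch of the write-ordering constraints; write_before() is a
     * hypothetical stand-in for the engine's precedence mechanism. */
    void write_before(void *first, void *second);

    void gc_write_order(btree_page *parent, btree_page *right,
                        btree_page *left, void *page_inventory_page)
    {
        write_before(parent, right);  /* the parent must never point to a
                                         freed page */
        write_before(right, left);    /* a stale left-sibling pointer is
                                         tolerated by the fault-tolerant
                                         leaf traversal */
        write_before(left, page_inventory_page);
                                      /* the page inventory page goes last,
                                         so the freed page cannot be reused
                                         while any on-disk page still
                                         references it */
    }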
Merging Nonleaf Pages
Deleting a key from the index is handled by recursive descent down through the index page levels. When a page is garbage-collected, its parent page may fall below the threshold for garbage collection. If so, the parent page (and possibly its parent page and so on) will be garbage-collected at a higher level in the recursion. In this case, we refetch the parent page and return a status to the calling routine that the parent needs to be garbage-collected.
Interestingly, non-leaf pages can never go empty because the leftmost child page cannot be garbage-collected. They can only be merged with their left sibling.
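A rough sketch of this recursion follows. remove_leaf_node() and garbage_collect() are named in the implementation notes below, but the signatures shown here, the status protocol, and the other helpers are illustrative, not the actual btr.c interfaces.

    /* Rough sketch of recursive key deletion with upward-propagating
     * garbage-collection status; reuses earlier illustrative helpers. */
    enum gc_status { GC_OK, GC_NEEDED };

    btree_page *find_child(btree_page *page, const uint8_t *key, int len);
    void remove_leaf_node(btree_page *page, const uint8_t *key, int len);
    void garbage_collect(btree_page *page);

    enum gc_status delete_key(btree_page *page, const uint8_t *key,
                              int len, int page_size)
    {
        if (page->level > 0) {
            btree_page *child = find_child(page, key, len);
            if (delete_key(child, key, len, page_size) == GC_NEEDED)
                garbage_collect(child);  /* removes a node from *page*;
                                            the real code refetches the
                                            parent page here */
        } else {
            remove_leaf_node(page, key, len);
        }

        /* Tell the caller whether this page now needs collecting. */
        return is_gc_candidate(page, page_size) ? GC_NEEDED : GC_OK;
    }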
Minimizing Depth
A record key is deleted by recursing down through the index levels to the leaf level. If necessary, the page is garbage-collected and the key deleted at the next level up, and so on until we get back up to the top level.
At this point it's possible the top-level page will contain just one node. In this case, the index has an unnecessary extra level. This level is removed by changing the pointer on the Index Root Page to point to the next level down. (There is an Index Root Page for each table, which points to the top-level index page for each of the table's indices.) We can then release the former top-level page to the free list, making sure that the updated Index Root Page has higher precedence so that the page on disk will not point to a freed page.
The effect of this is to minimize the index depth as the size of the table decreases. If all the records are deleted (and purged) from the table, the tree will be reduced to the minimum size index, which is two levels: the top-level page plus one leaf page. This is the minimum size index to prevent thrashing between adding and removing a level for small tables.
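A sketch of the level-removal loop, reusing earlier illustrative helpers (the index_root_page layout and names are assumptions; the loop stops at two levels, the minimum index size):

    /* Sketch of depth reduction; names are illustrative. */
    typedef struct index_root_page {
        int32_t irt_root[1];     /* top-level page number per index */
    } index_root_page;

    int node_count(btree_page *page);   /* nodes excluding END_BUCKET */

    void trim_top_level(index_root_page *root, int index_id,
                        btree_page *top, int32_t top_no)
    {
        /* Stop at level 1: the minimum index is two levels, a top-level
         * page plus one leaf page, to prevent add/remove thrashing. */
        while (top->level > 1 && node_count(top) == 1) {
            int32_t child_no = first_node(top)->number;

            /* Repoint the Index Root Page at the next level down; it
             * must reach disk before the old top page is freed (see the
             * precedence rules above). */
            root->irt_root[index_id] = child_no;
            release_page(top_no);

            top_no = child_no;
            top = fetch_page(top_no);
        }
    }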
Implementation
Proposed Code changes:
Technical description:
Index garbage collection is defined as the ability of an index to
dynamically decrease in size as records are deleted and index pages
become empty.
The technical implementation was done by adding a routine called garbage_collect() in btr.c. The method used to garbage-collect an index page is as follows: the remove_leaf_node() function detects that an index page is empty and calls garbage_collect() to remove it. The parent page, the left sibling page, and the right sibling page are modified to remove all pointers to the empty page, then the page is placed on the free list. The left sibling's forward pointer is adjusted to point to the right sibling, and the right sibling's back pointer is adjusted to point to the left sibling.
When the top level index page contains only one pointer, then the top level page is removed. This allows the index to dynamically shrink in depth when the number of nodes is minimized.
When a page is filled by adding a node to the end of the page, the page is split by creating an empty page to its right. The old method for growing the index would have been to place half the nodes on each page--the current page and the new page. This allowed each page to be at least half full and left room for the addition of data in each half. However, in some cases this method is inefficient. The current method causes a completely full index to be created for a monotonically increasing key (such as a key generated by gen_id). The index therefore uses half the space and half the lookup time for operations like ORDER BY, GROUP BY, and DISTINCT.
For non-monotonically increasing keys, where data is inserted into the middle of a page rather than at the end, the old method still applies: a new page is added and the existing data is split between the current page and the new page, with half the records stored on each.
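The split policy can be summarized with this sketch (helper names are illustrative):

    /* Sketch of the page-split policy; names are illustrative. */
    int insert_is_at_end(btree_page *page, const uint8_t *key, int len);
    void add_empty_right_sibling(btree_page *page);
    void split_in_half(btree_page *page);

    void split_page(btree_page *page, const uint8_t *key, int len)
    {
        if (insert_is_at_end(page, key, len))
            /* Monotonically increasing keys: leave the full page as-is
             * and start an empty page to the right, so the index packs
             * to roughly 100% full. */
            add_empty_right_sibling(page);
        else
            /* Other keys: classic split, half the nodes move to the new
             * page so both halves have room for further insertions. */
            split_in_half(page);
    }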
Future features: Changes were also made to guarantee that scrolling through the index pages will work. This means that databases created after the next release will automatically work when scrolling is introduced. Conversely, scrolling will not be possible on databases with an ODS version less than 8.2.
Miscellaneous Code changes:
A fundamental change to index structure was made to cause pages with
duplicate keys to be propagated up to the parent. This was necessary to
eliminate a bug in which pages were placed out of order on the pointer
level above. It will cause non-leaf levels to be larger for indices
with many duplicates, but will result in faster processing time for
many lookups. This change required a minor ODS version change.
As a result of this change, index garbage collection will only occur on databases which were built with the new version of InterBase. For migrating customers, that means a gbak restore of their existing database must be done. The ODS version for InterBase 5.0 is currently 8.2. The product will detect a version 8.0 or 8.1 ODS and will not do index garbage collection. In the future, if the ODS version is increased, InterBase will still detect that the original database was not built using ODS 8.2, and no garbage collection of index pages will be done.
Issues
Migration
ANYONE WHO WANTS TO TAKE ADVANTAGE OF INDEX GARBAGE COLLECTION HAS TO CREATE A NEW DATABASE OR BACKUP/RESTORE THEIR DATABASE WITH THIS KIT.