Sort Merge Join Algorithm - Design Specification

High-Level Design

This is a design specification for the reimplementation of a phase of the sort-merge join algorithm. The current algorithm has a main memory cost that is proportional to the largest equivalence group in the sort streams being merged. Every join strategy employed by a database engine should have an upper bound on the amount of main memory required to execute a query; the join algorithm should also be capable of operating over large data sets while consuming a finite amount of main memory.

Some of our customers have experienced unrecoverable server memory growth due to the memory requirements of the current sort-merge join algorithm, which has resulted in excessive page faulting, poor thread scheduling, or server crashes.

The test cases that we have been working with come from a specific customer and create several nested sort/merge joins which collectively try to consume 60MB in the server.

As a result of this customer's experiences, it is necessary to address this deficiency of the current sort-merge join algorithm for inclusion in InterBase V6.0.

High-Level Algorithm

The proposed change in the algorithm has to do with the phase which performs a cross product over a set of equivalence groups in the sort streams which are being merged. The sort streams have been sorted by a key based on common join terms. In addition to the join terms, each sort record also consists of the columns in the select list of this river as formed by the optimizer.

For example, the following diagram depicts two sort streams which are being merged. Each uppercase letter below represents the sort key of the join terms. Assume that each sort record is 1KB.

                                        |-------- 50,000 repetitions -----------|          |----- 10 ----|
stream 1: AAABBCCCCDDDDDDDDDDDD...DDDDDDDDDEEEEGGGG...GGGYYY

                          |------ 10 ------|     |--------- 25,000 -----------|
stream 2: BBBBBDDDD...DDDDFFGGGGGGG...GGGGGGGGZZZZZZ

When the "D" equivalence group is processed, every sort record for both streams is read into main memory and inserted into a linked list. This would require over 50MB of main memory (50,000 x 1KB + 10 x 1KB). The cross product of this equivalence group contributes 500,000 records (50,000 x 10) to the sort/merge output. Each set of equivalence groups is processed in a similar manner until one of the input sort streams is exhausted.
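To make the cost asymmetry concrete, the old algorithm's per-group costs can be sketched as below. This is a hypothetical illustration; the function names are not part of the engine.

```c
#include <stdio.h>

/* Memory held by the old linked-list scheme for one equivalence group:
   every record from both streams is resident at once. */
static unsigned long sketch_group_memory(unsigned long n1, unsigned long n2,
                                         unsigned long record_size)
{
    return (n1 + n2) * record_size;   /* bytes pinned in main memory */
}

/* Records the group contributes to the sort/merge output. */
static unsigned long sketch_group_output(unsigned long n1, unsigned long n2)
{
    return n1 * n2;                   /* full cross product */
}
```

For the "D" group (50,000 and 10 records of 1KB each), memory grows with the combined group size while the output is the 500,000-record cross product.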

The proposed change is to write large equality groups to an external merge file in the same way that an external sort algorithm writes sort runs to an external file to minimize the memory consumption of the sort. The merge file is written using fixed size blocks of MERGE_BLOCK_SIZE (currently 64KB). If all records in the equivalence group can fit in a single merge block, there is no need to write the block to a merge file.
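A minimal sketch of the block-buffering idea follows, assuming fixed-length sort records. The names are illustrative, and the spill function here only counts writes; the real code would go through the SORT I/O routines.

```c
#include <string.h>

#define MERGE_BLOCK_SIZE 65536

/* Hypothetical, simplified stand-in for the merge file block structure. */
typedef struct sketch_mfb {
    unsigned long record_size;        /* fixed sort record length */
    unsigned long blocking_factor;    /* whole records per merge block */
    unsigned long records_in_block;   /* records buffered so far */
    unsigned long blocks_written;     /* blocks spilled to the merge file */
    unsigned char block_data[MERGE_BLOCK_SIZE];
} sketch_mfb;

/* Stand-in for the real merge-file write; here we only count spills. */
static void sketch_spill_block(sketch_mfb *mfb)
{
    mfb->blocks_written++;
    mfb->records_in_block = 0;
}

/* Append one sort record to the buffer; spill the block when it fills.
   If a group never overflows one block, nothing is written to disk. */
static void sketch_put_record(sketch_mfb *mfb, const unsigned char *rec)
{
    if (mfb->records_in_block == mfb->blocking_factor)
        sketch_spill_block(mfb);
    memcpy(mfb->block_data + mfb->records_in_block * mfb->record_size,
           rec, mfb->record_size);
    mfb->records_in_block++;
}
```

With 1KB records (blocking factor 64), a group of 50 records stays entirely in the buffer and never touches the merge file, matching the single-block optimization described above.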

Additionally, an optimization pass is performed on each set of equivalence groups to compute the optimal join order of the merge streams based on the cardinality of the merge blocks of each merge (equivalence) file. In the example above, it is clear that the optimal join order for the "D" equivalence group is (outer stream 1, inner stream 2) while for the "G" equivalence group it is (outer stream 2, inner stream 1).

For the "D" equivalence group, stream 2 can fit in memory within a single merge block, so it makes sense to let stream 1 drive the join and repeatedly iterate over stream 2's merge records. As the outer stream, the 50MB stream 1 is only read once sequentially. If stream 2 drove the join, then the 50MB stream 1 merge file would have to be sequentially read 10 times (once for each merge record in stream 2).
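For a two-way merge, the join-order decision reduces to a cardinality comparison; a hypothetical sketch (the function name is illustrative):

```c
/* Pick the outer (driving) stream of a two-way merge by equivalence-group
   cardinality.  Returns 0 if stream 0 should drive the join, 1 otherwise.
   The larger group is read once sequentially as the outer; the smaller
   group, more likely to fit in a single in-memory merge block, is the
   inner stream that gets rescanned. */
static int sketch_pick_outer(unsigned long equal_records_0,
                             unsigned long equal_records_1)
{
    return (equal_records_0 >= equal_records_1) ? 0 : 1;
}
```

With the example cardinalities, the "D" group picks stream 1 as the outer (50,000 vs. 10) and the "G" group picks stream 2 (25,000 vs. 10), matching the optimal orders noted above.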

Alternative designs were considered, which in some respects, were superior to this proposal. Unfortunately, they depended on functionality which has not been enabled in the product. This functionality includes scrollable cursors and bookmarks -- features which could have been used to quickly reposition inner merge streams and obviate the need for the complete formation of equivalence groups.

Detailed Design

Internal Data Structures

The sort merge record block (typedef struct smr {} *SMR) has been made obsolete and removed from jrd/blk.h and jrd/rse.h. This structure formed the basis for the linked list of sort merge records.

/* Sort merge record blocks (hold raw sort records) */

typedef struct smr {
    struct blk smr_header;
    struct smr *smr_next;  /* Next record in group */
#ifdef SCROLLABLE_CURSORS
    struct smr *smr_previous;  /* previous record for scrolling backward */
#endif
    UCHAR smr_data [1];  /* Raw sort data (unmapped) */
} *SMR;

A merge file block structure is embedded in the impure merge rsb.

/* Merge (equivalence) file block */

typedef struct mfb {
     struct sfb *mfb_sfb;     /* merge file uses SORT I/O routines */
     ULONG mfb_equal_records; /* equality group cardinality */
     ULONG mfb_record_size;   /* matches sort map length */
     ULONG mfb_current_block; /* current merge block in buffer */
     ULONG mfb_block_size;    /* merge block I/O size */
     ULONG mfb_blocking_factor; /* merge equality records per block */
     UCHAR *mfb_block_data;   /* merge block I/O buffer */
} *MFB;

#define MERGE_BLOCK_SIZE 65536
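The blocking factor follows directly from the merge block size and the sort record length; a hedged sketch, assuming whole records per block (no record spans a block boundary):

```c
/* Hypothetical derivation of mfb_blocking_factor: how many whole sort
   records of a given length fit in one merge block.  Any remainder is
   unused slack at the end of the block. */
static unsigned long sketch_blocking_factor(unsigned long block_size,
                                            unsigned long record_size)
{
    return block_size / record_size;
}
```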

Detailed Algorithm

The raw sort data from a sort substream is moved to consecutive slots in a merge block buffer. When the block buffer fills, the block is written to a merge file. This replaces the linked list of sort merge records. In effect, the sort merge record conceptually becomes a position in the merge equivalence file.

Repositioning an inner merge stream just entails reading merge block 0 into the block buffer.
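Since a merge record is now just a position in the equivalence file, locating any record (or repositioning to the first one) is pure arithmetic; a hypothetical sketch with illustrative names:

```c
/* Map a zero-based record number within an equivalence group to its
   merge block number and byte offset inside that block. */
typedef struct sketch_pos {
    unsigned long block;    /* which merge block holds the record */
    unsigned long offset;   /* byte offset of the record in that block */
} sketch_pos;

static sketch_pos sketch_locate(unsigned long record_no,
                                unsigned long blocking_factor,
                                unsigned long record_size)
{
    sketch_pos p;
    p.block  = record_no / blocking_factor;
    p.offset = (record_no % blocking_factor) * record_size;
    return p;
}
```

Under this scheme, repositioning the inner stream is just locating record 0, which always resolves to block 0 at offset 0.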

New/Affected Modules

  • jrd/rse.c
  • jrd/rse.h
  • jrd/blk.h

Testing Considerations

The previous sort-merge join algorithm did not optimize the join order between the equivalence groups of the sort streams. The new behavior may change the order in which rows from the sort-merge join are produced and presented to the user.

For the Windows platform, the temporary file prefix has changed from "ib" to either "ib_sort_" or "ib_merge_" to distinguish which role a temporary file is playing. For Unix, temporary sort files already had a "gds_sort_" prefix, to which we have added a "gds_merge_" prefix for temporary merge files.

The tester should ensure that temporary merge files are cleaned up (deleted) on normal and abnormal client termination -- the behavior should be identical to temporary sort file cleanup. This cleanup includes closing server file descriptors associated with the merge files.

The merge block buffers (64KB) use the same virtual memory allocation mechanism that sort buffers (128KB) use. When the merge stream is closed, the memory for the merge buffers is released back to the OS and not to an internal memory heap in the process. This virtual memory mechanism is only used by Superserver platforms which can support it (i.e., Solaris, HP-UX, and NT).

Note that abnormal server termination always leaves temporary files around, whether sort or merge, unless a reboot script has been created to perform this chore. Any such scripts which delete temporary files "ib*" or "gds*" should still work with the new naming conventions.