Understanding the Firebird Optimizer and it's Plans
By IBPhoenix
What is an optimizer?
The SQL language is designed to define requests for data manipulation, and to specify the final representation of results, but by design it does not support the definition of how the request should be handled by the database system. It’s up to database system to decide the best (i.e. fastest) method. This decision process is called optimization, and takes place as the last step of statement compilation. The part of the database engine responsible for statement optimization is called “the optimizer”.
What does the optimizer do?
The compilation of a request statement in Firebird is a multi-step process. The engine itself operates on statements written in BLR (Binary Language Representation), so if the statement is written in SQL (or GDML -Groton Data Manipulation Language, the original data manipulation language for InterBase) or any other language, it must be first translated to BLR. When the BLR statement is compiled, the engine creates RSE (Record Selection Expression) blocks for it, that are then processed by the optimizer into an RSB (Record Source Block) tree. These internal structures are then used by the execution engine to perform the request.
The purpose of optimizer in Firebird is to decide:
- How relevant rows are selected, i.e. analyze direct and indirect filter conditions.
- How data is read from the data stream (table, view, stored procedure). The access method could either be in storage order or out of storage order.
- Join order, if two or more data streams must be joined.
- How particular streams should be joined (Loop, Merge-sort or cross-product).
To find the best answers to such questions, optimizers need to use several strategies:
Heuristic
The optimizer uses a set of hard-written rules that are applied to the statement and data collected about tables, indices, data distribution, phase of the moon etc. and then chooses the access route defined for the rules that have been met.
Cost-based
The optimizer evaluates the “cost” of many different methods for data retrieval and picks the “cheapest” one. The cost is usually defined as number of I/O operations.
Cop-Out
The user can pick or influence the optimizer’s decision. The Firebird optimizer actually uses all three strategies.
Some History
In the beginning, InterBase had a purely cost-based optimizer. This strategy is relatively simple to implement, and usually delivers very good results. Around InterBase version 3, the optimizer was extended with a few heuristic rules to allow pruning of the join order tree (the original optimizer evaluated all possible combinations. It allowed a join of 17 tables in 30 seconds, but unfortunately the actual optimization itself took 12.5 hours). Borland also changed the optimizer, mostly during version 4 development. Unfortunately, these changes led to some anomalies in the optimizer’s behaviour and the production of sub-optimal execution plans for some complex requests. Borland “fixed” this problem by adding the specification an explicit optimization execution plan.
Firebird has implemented many improvements and fixes to the optimizer since.
Selection of relevant rows
Each request can process rows from one or more data streams, but not all rows from these streams must be processed. Relevant rows are qualified by the filter criterion that is directly defined in the statement, or by indirect criteria that are derived from request. The Optimizer uses these selection criteria (called Booleans), along with information about the indices defined for the data stream, to decide about the row retrieval method. This retrieval method could be:
- Natural scan, where each row in stream is read and evaluated for relevance.
- Index search, where candidate rows are found by one or more indices and then read and evaluated.
When an index search is used, it doesn’t mean that all relevant rows are found just by looking at the index, even when all the filter criteria supplied could be matched to index expressions. An index node in Firebird doesn’t contain a transaction number, so rows found by an index might not be visible within the current transaction. Effectively indices are used to narrow down the set of rows that must be evaluated. Row ids (db_keys) of candidate rows found in the index are stored into a data structure called a sparse bitmap. This structure could be a normal bitmap, however a sparse bit map stores large bit arrays more effectively. Firebird creates one bitmap for each index used and, when necessary, performs bitwise logical operations on these bitmaps to form a final bitmap of candidate rows.
Stream access method
Rows from data streams are usually read in their storage order, so each data page is read only once. Firebird uses this approach even when an index search is used to narrow the set of rows. Because the bitmap created from the index scan contains ordered row ids that simply map to the data pages that must be processed. The index is always processed in full before any data page is read.
When a statement contains a request for ordering the data stream (ORDER BY or GROUP BY clause) and the ordering specification maps to an index, it’s possible to read rows in the order defined by this index to get the ordered result set without sorting. Reading rows in index order is also called reading in Out of storage order. While this method will save the engine from sorting, it may result in multiple accesses to the same data pages, which may or may not result in more I/O operations.
Join order
Joins can be always performed in more than one way. The Optimizer evaluates possible join orders and picks the one that has the longest path (join path is defined as a sequence of joins) and the lowest estimated cost of data retrieval (the amount of I/O operations and processing). The cost is computed using the cardinality of the relation, index selectivity and total boolean selectivity. When the optimizer computes the join cost, it also considers the cost of the join method.
Join method
Each join run processes only two data streams. In the context of a join, source data streams are called streams, while derived, joined streams are called rivers. The joining process starts at streams, that are then merged into rivers, and these rivers are merged into an ultimate river.
Source data streams are joined using loops over the left stream, with special processing for outer joins. Rivers are joined using merge-sort (whenever possible) or by cross product.
Optimization, step by step
The optimizer starts with an RSE (Record Selection) block created by the parser, along with an initial Record Source Block (RSB), which simply lists all the tables to be joined.
Examining the node tree that is part of the RSE block, the optimizer creates an RSB tree. Each RSB node describes a join, a boolean restriction, or a type of retrieval from a table.
The final structure describes the joins to be made, and in which order they should be made.
Table 1 – Main RSB types
RSB Type Purpose BOOLEAN Describes a restriction CROSS Describes a join DBKEY Retrieval of particular record FIRST,SKIP Retrieval/skipping of first n records INDEXED Retrieval via an index MERGE Retrieve using sort merge PROJECT Projections SEQUENTIAL Simply retrieve records as stored SORT Perform a sort on records UNION Do an union of the described relations AGGREGATE Perform an aggregate operation NAVIGATE Walk navigational index LEFT_CROSS Describes a left outer join PROCEDURE Stored procedure EXT_SEQUENTIAL External sequential access EXT_INDEXED External indexed access EXT_DBKEY External db_key access
The execution plan available through the isql commands SET PLAN or SET PLANONLY is simply a subset of the RSB tree created by the optimizer, that has been converted to more human-readable format.
The execution plan always has a format:
PLAN (plan_definition)
Where plan_definition consists of blocks enclosed in parentheses. Each block represents a data stream and the method for its processing. Blocks could be nested to represent the tree structure for joining of individual streams. Each block includes stream identification (for example table name) and access method (INDEX and ORDER with index name, or keyword NATURAL for sequential scan), and finally, descriptions of the operation performed on the streams (MERGE, JOIN, SORT).
Decomposition
As its first step, the optimizer looks at the RSE to identify key elements and to create more, derived elements. In this step optimizer will
- Convert a BETWEEN into (a greater than or equal) AND (a less than or equal) nodes.
- Turn a LIKE into a STARTING WITH and a LIKE, if it starts with anything other than a pattern-matching character.
- Create more conditions by distributing equalities and inequalities, i.e. if (A == B) and (A $ C) (B $ C) for any $.
- From the conjunctions X(A,B) and A=C, deduce the conjunct X(C,B).
Processing data streams
For all joins, the optimizer looks for indexed relationships between the tables in a join and tries to combine join fields using an index. Indexed streams are then joined in sequences to form “rivers”. Streams and rivers are processed in a loop, in which the optimizer:
- Finds the river with the most streams, or use the one with the lowest estimated cost (i.e. find the best join order).
2. If there are streams left that are not part of the river, form another best-case river, repeating until there are no streams left. As the last step, the optimizer merges rivers. The key here is the algorithm identifying best join order.
The best join order
If a plan is specified, the order is already present within the streams vector, so the optimizer must settle for the one order as described by explicit plan. In this case the optimizer just verifies that the path specified by explicit plan really exists. The optimizer looks for the best join order in a loop, considering each data stream in turn as the leftmost stream in the join order. The best order from each consideration is then determined and placed into the opt block. So at the end of this loop, the opt block holds the best order.
The best join order for each particular, leftmost stream is computed in a recursive loop (for the next joined stream) as follows:
- If the plan was specified, check that this order matches the order provided by user.
- If the plan was not specified, compute delta and total estimate cost to fetch this stream.
- If the partial order is either longer than any previous partial order, or the same length and cheaper, save this order as “best”.
- Mark this stream as “used” in the sense that it is already included in this particular proposed stream ordering.
- For complex joins, there is a danger that optimizer would spend a lot of time recursing in this routine, so it needs to prune the combinations. Based on experimentation, the cost of recursion becomes significant at about a 7-table join. Therefore, at this point, the optimizer makes a simplifying assumption that if it has already seen a join order that has lower cost than this one, it is time to stop looking for a cheaper one.
- First, the optimizer handles any streams that have direct unique indexed relationships to this stream. If there aren’t any, it next handles any streams that have direct indexed relationships. As the last step, if there were no direct relationships, it looks for indirect relationships.
Use of indices
Firebird uses indices in three ways to speed up queries:
- If a filter condition, or part of it, could be matched to an index expression, Firebird creates a sparse bitmap of relevant data pages and rows. All processing is then narrowed to data pages marked in this bitmap. When more than one bitmap is used, Firebird uses bitwise logical operations to merge bitmaps into one. By using bitmaps, the engine can process data pages very efficiently.
- Whenever possible, indices are also used to join data streams. The mechanism is basically the same as the one used for filter conditions, but index scan and bitmap building is done for each processed row in the left stream in the join loop.
- When a request for stream ordering matches the index key for the source stream, Firebird uses this index to navigate the rows read from the stream.
Decisions about using of indices, result from matching index keys with conditional elements of the statement (booleans).
Indices can be considered only for conditions that compare the full length or the starting part of column values. Therefore, indices are not used for the CONTAINING predicate, a LIKE predicate that begins with a pattern-matching character or for transformation functions (EXTRACT, SUBSTR, UDF etc.). However, Firebird is able to use more than one index for optimizing different parts of a filter condition, or use the same index for more than one part of the condition.
Example:
SELECT … WHERE NAME = ‘John’ AND BIRTHDATE = ‘22.6.1968’;
For different combinations of available indices, Firebird will:
- INDEX(NAME) Index could be used to optimize the condition on NAME. The condition on BIRTHDAY will be evaluated by normal comparison.
- INDEX(NAME), INDEX(BIRTHDAY) Both indices could be used to optimize the whole condition.
- INDEX(BIRTHDAY,NAME) Because conditions are in an AND relation, it’s possible to use this index to optimize the whole condition.
- INDEX(NAME,BIRTHDAY) Same as above, since the order of columns in an index key is irrelevant.
- INDEX(NAME,SURNAME) Index could be used for condition on NAME.
- INDEX(SURNAME,NAME) Index cannot be used. Column SURNAME is not used in the condition and column NAME is not at the beginning of index key.
- INDEX(NAME,SURNAME), INDEX(NAME) Both indices could be used to optimize condition on NAME. In this case, Firebird must decide which one is better to use.
In cases where more than one index matches the same boolean condition, the optimizer picks the one with better selectivity. Selectivity is a number that represents the ratio between the total count of index keys versus the count of duplicate key values, and is used together with table cardinality to compute the estimated number of data pages that must be read. However, when a condition matches only part of index key, index selectivity doesn’t necessarily correspond to the actual efficiency of the index for that particular condition. For example INDEX(NAME, SURNAME) would contain fewer duplicates than INDEX(NAME). Were the condition to be defined only on NAME, then the real selectivity of both indices would be the same.
TIP:
If it’s not necessary to speed up sorting or for referential integrity, it’s preferable to define simple indices instead of composite ones. Firebird is able to combine indices if necessary, and simple indices are much more flexible. More than that, keys of composite indices often overlap, and thus force the optimizer to decide which one is better for a particular case. Although optimizer will do its best, it’s not guaranteed that it will always pick the best one.
On the other hand, where you often use complex conditions over multiple columns, it is better to define a composite index over these columns, because composite indices are more efficient for composite conditions (the engine can process one index instead of multiple ones) and usually have better selectivity. Searching by index doesn’t automatically mean better performance. Because Firebird must read additional pages (the index) in addition to data pages, using an index is of benefit only if the total number of processed pages would be less than a natural scan. The weight of an index is evaluated by the optimizer as part of the data retrieval cost computation. The outcome of such an evaluation may be that the optimizer ultimately decides not to use the index. The cost is a function of the estimated cardinality of the relation, index selectivity, and total boolean selectivity.
Execution plan - examples
The easiest way to read the execution plan and understand the optimization logic is to use some typical examples. The following examples use isql and the standard Firebird sample database EMPLOYEE.GDB. For simplicity, only indices listed are those that are defined for the tables used.
Table JOB:
RDB$PRIMARY2 (JOB_CODE,JOB_GRADE,JOB_COUNTRY) RDB$FOREIGN3 (JOB_COUNTRY)–REFERENCES COUNTRY (COUNTRY); DESCENDING MAXSALX (JOB_COUNTRY,MAX_SALARY) MINSALX (JOB_COUNTRY,MIN_SALARY)
Table COUNTRY:
RDB$PRIMARY1 (COUNTRY)
Table PROJECT:
RDB$PRIMARY12 (PROJ_ID) UNIQUE RDB$11 (PROJ_NAME) RDB$FOREIGN13 (TEAM_LEADER)– REFERENCES EMPLOYEE (EMP_NO) UNIQUE PRODTYPEX (PRODUCT,PROJ_NAME)
Table EMPLOYEE:
RDB$PRIMARY7 (EMP_NO) RDB$FOREIGN8 (DEPT_NO)–REFERENCES DEPARTMENT (DEPT_NO) RDB$FOREIGN9 (JOB_CODE,JOB_GRADE,JOB_COUNTRY)–REFERENCES JOB (JOB_CODE,JOB_GRADE,JOB_COUNTRY) NAMEX (LAST_NAME,FIRST_NAME)
Table PROJ_DEPT_BUDGET:
RDB$PRIMARY17 (FISCAL_YEAR,PROJ_ID,DEPT_NO) RDB$FOREIGN18 (DEPT_NO)–REFERENCES DEPARTMENT (DEPT_NO) RDB$FOREIGN19 (PROJ_ID)–REFERENCES PROJECT (PROJ_ID)
Example 1: A Simple SELECT:
SET PLANONLY ON; SELECT * FROM JOB ; PLAN (JOB NATURAL)
This plan is really simple:
Only one table is processed – JOB. Data stream is processed by a sequential read – keyword NATURAL.
Now let’s look at this same SELECT statement, but with different filter conditions in WHERE clause.
Example 2: Filter condition on non-indexed column:
SELECT * FROM JOB WHERE MIN_SALARY > 50000 ; PLAN (JOB NATURAL)
Plan is the same as before, because there is no index on MIN_SALARY.
Example 3: Filter condition on indexed column:
SELECT * FROM JOB WHERE JOB_CODE ='Admin' ; PLAN (JOB INDEX (RDB$PRIMARY2))
Keyword INDEX in the execution plan indicates that the index is used to narrow the read only to candidate rows. There is no index defined on JOB_CODE itself, but JOB_CODE is at the start of the key in the primary index (RDB$PRIMARY2), so this index is used.
Example 4: Filter condition on column with more than one index:
SELECT * FROM JOB WHERE JOB_COUNTRY ='Japan' ; PLAN (JOB INDEX (MINSALX))
Column JOB_COUNTRY is at the beginning of three indices: the foreign key to table COUNTRY, MAXSALX and MINSALX. The pptimizer picked the MAXSALX index because it has the best selectivity.
Note: Index selectivity is stored in the system table RDB$INDICES, and you can get it by issuing the following statement:
SELECT RDB$INDEX_NAME,RDB$STATISTICS FROM RDB$INDICES ;
Selectivity is a number between 0 and 1. A lower number means better selectivity.
Example 5: Filter condition on more columns with indices:
SELECT * FROM JOB WHERE JOB_COUNTRY ='Japan' AND JOB_CODE ='Admin' ; PLAN (JOB INDEX (RDB$PRIMARY2,MINSALX))
When a filter condition is defined on multiple columns that match to multiple indices, Firebird merges bitmaps produced from each index by appropriate bitwise AND or OR operation into single bitmap.
Example 6: Output sorting I:
SELECT * FROM JOB WHERE JOB_GRADE =2 ORDER BY MIN_SALARY; PLAN SORT ((JOB NATURAL))
Inner parenthesis represents the access to the data stream (JOB), while outer parenthesis defines the sorting operation – SORT (data_stream). The table is read sequentially, because there isn’t an index on JOB_GRADE, neither does this column appear at the beginning of any index. By the same rule, there are no defined indices that could be used for MIN_SALARY.
Example 7: Output sorting II:
SELECT *FROM JOB WHERE JOB_COUNTRY ='Japan ' ORDER BY MAX_SALARY; PLAN SORT ((JOB INDEX (MINSALX)))
Data access is optimized by index MINSALX, but sorting over MAX_SALARY by navigational index is not possible.
Example 8: Output sorting III:
SELECT * FROM JOB WHERE JOB_CODE ='Admin' ORDER BY JOB_CODE; PLAN (JOB ORDER RDB$PRIMARY2)
In this case, the sort is optimized by a navigational index (keyword ORDER). When a navigational index is used, it’s not possible to decipher from the plan if other indices are used for filter conditions. In this particular case, the filter condition is optimized also by RDB$PRIMARY2.
Example 9: Join of two tables (Inner join):
SELECT * FROM JOB JOIN COUNTRY ON JOB.JOB_COUNTRY = COUNTRY.COUNTRY; PLAN JOIN (JOB NATURAL, COUNTRY INDEX (RDB$PRIMARY1))
Stream joins are indicated by JOIN (list of joined streams), and it’s realized as a loop over the first (left) stream with a lookup in the second (right) stream. Because this join doesn’t contain filter conditions, it’s necessary to process all rows from first table.
The optimizer picked up the loop over JOB table, which will be processed sequentially, and joined it with the corresponding rows from COUNTRY using index RDB$PRIMARY1. This index is obviously the best bet, as the index for the primary key has unique key values, so the total number of processed rows would be minimal. The second factor in this decision is the number of rows in both tables (after filtering). In this particular case (JOB contains 31 rows, whilst COUNTRY contains 14 rows), the difference in size between both tables in favour of COUNTRY was not significant enough to get rid of the benefit of using the unique index.
If the statement had defined the join the other way around, it would have had no impact on the final decision taken by optimizer:
SELECT * FROM COUNTRY JOIN JOB ON COUNTRY.COUNTRY = JOB.JOB_COUNTRY; PLAN JOIN (JOB NATURAL, COUNTRY INDEX (RDB$PRIMARY1))
But a different situation will occur on slightly different statement:
SELECT * FROM DEPARTMENT D JOIN PROJECT P ON D.MNGR_NO = P.TEAM_LEADER; PLAN JOIN (D NATURAL, P INDEX (RDB$FOREIGN13))
In this case the optimizer picked up the loop over PROJECT table, because neither candidate index is unique, so the size ratio between the two tables (DEPARTMENT has 21 rows, PROJECT has 6 rows) does matter here.
Example 10: Join of two tables (outer join):
SELECT * FROM COUNTRY C LEFT OUTER JOIN JOB J ON C.COUNTRY = J.JOB_COUNTRY; PLAN JOIN (COUNTRY NATURAL, JOB INDEX (MINSALX))
The one-sided outer join doesn’t allow the optimizer to pick the leftmost table for the join, because all the rows from the outer table must be processed. But there are multiple indices on the JOB table that are applicable to JOB_COUNTRY, so the optimizer has to choose one. In this case the MINSALX index is chosen due to its selectivity, although selectivity doesn’t really matter in this particular case.
Example 11: Join with filter condition:
SELECT * FROM JOB JOIN COUNTRY ON JOB.JOB_COUNTRY = COUNTRY.COUNTRY WHERE JOB.JOB_CODE ='Admin'; PLAN JOIN (JOB INDEX (RDB$PRIMARY2), COUNTRY INDEX (RDB$PRIMARY1))
This statement differs from the one in example 9 only by the additional filter condition on the JOB_CODE column. The join order remains the same, but the table JOB is filtered by index before the join:
SELECT * FROM JOB JOIN COUNTRY ON JOB.JOB_COUNTRY = COUNTRY.COUNTRY WHERE JOB.JOB_COUNTRY ='Japan'; PLAN JOIN (COUNTRY INDEX (RDB$PRIMARY1), JOB INDEX (MINSALX,MAXSALX))
Again the same join, but this time with a filter condition on JOB_COUNTR, that happens to be a foreign key. This time the optimizer completely changed the join order, because JOB_COUNTRY in table JOB is a foreign key in a table that is involved in a join. Thanks to distributed equivalencies (do you remember when I talked about them?) this condition is also applicable to the COUNTRY table. Because the condition applied to the COUNTRY table matches the primary key, and thus evaluates to a single row, it’s better to “loop” over this one record and perform a lookup in the JOB table.
JOB table is also processed via index, actually twice
- First, the index is used for filter condition.
2. Second, the index is used for lookup in join loop. Both lookups are over the same column – JOB_COUNTRY, but it’s really performed twice. The first lookup is for the filter condition and the resultant bitmap persists over the processing of the whole statement. The second lookup is performed for each COUNTRY row processed in the join loop (actually, only one). The fact that the optimizer picks two different indices (MINSALX and MAXSALX) for each lookup is one of the optimizer’s mysteries.
Example 12: Join of three tables:
SELECT * FROM PROJECT JOIN EMPLOYEE ON PROJECT.TEAM_LEADER = EMPLOYEE.EMP_NO JOIN JOB ON EMPLOYEE.JOB_CODE = JOB.JOB_CODE WHERE JOB.JOB_COUNTRY = 'Japan'; PLAN JOIN (JOB INDEX (MINSALX), EMPLOYEE INDEX (RDB$FOREIGN9), PROJECT INDEX (RDB$FOREIGN13))
When more than two tables are joined, the join order has a great impact on the speed of joins. In this case, the statement defines the join as EMPLOYEE->PROJECT->JOB. Because of the filter condition on the JOB table, the optimizer changed the order to JOB->EMPLOYEE->PROJECT.
Note: The optimizer doesn’t know the real size of data sets after filtering; it performs a crude guess based on table size, index selectivity, type of condition and other factors.
For the same join but without the filter condition, the join order would be different:
SELECT * FROM PROJECT JOIN EMPLOYEE ON PROJECT.TEAM_LEADER = EMPLOYEE.EMP_NO JOIN JOB ON EMPLOYEE.JOB_CODE = JOB.JOB_CODE; PLAN JOIN (PROJECT NATURAL, EMPLOYEE INDEX (RDB$PRIMARY7), JOB INDEX (RDB$PRIMARY2))
As you can see, in this case the optimizer rightly favoured the order with lookups using unique indices.
Example 13: Complex join of four tables
The strengths and weaknesses of any optimizer show up in complex queries. For example, say we want to list all projects with budgets for individual departments working on projects in 1994, but just for those projects whose team leaders work in countries with a dollar currency. It’s a pretty wild request that would require a complex query.
All data we want to list are taken from joining PROJECT->PROJ_DEPT_BUDGET, but we must define an additional PROJECT->EMPLOYEE->COUNTRY join for filter condition.
One way to write this request would be:
SELECT P.PROJ_ID,P.PROJ_NAME,P.PRODUCT,PB.DEPT_NO,PB.PROJECTED_BUDGET FROM PROJECT P JOIN PROJ_DEPT_BUDGET PB ON P.PROJ_ID = PB.PROJ_ID JOIN (PROJECT P2 JOIN EMPLOYEE E ON P2.TEAM_LEADER = E.EMP_NO JOIN COUNTRY C ON E.JOB_COUNTRY = C.COUNTRY)ON P.PROJ_ID = P2.PROJ_ID WHERE C.CURRENCY ='Dollar' AND PB.FISCAL_YEAR = 1994; PLAN JOIN (PB INDEX (RDB$PRIMARY17), P2 INDEX (RDB$PRIMARY12), E INDEX (RDB$PRIMARY7), C INDEX (RDB$PRIMARY1), P INDEX (RDB$PRIMARY12))
The optimizer did a good job with join order here, but didn’t detect the double use of the PROJECT table as P and P2, in order to reduce it to a single use. The next statement performs the same query, but the join is specified differently, without double use of PROJECT table:
SELECT P.PROJ_ID,P.PROJ_NAME,P.PRODUCT,PB.DEPT_NO,PB.PROJECTED_BUDGET FROM (PROJECT P JOIN EMPLOYEE E ON P.TEAM_LEADER = E.EMP_NO JOIN COUNTRY C ON E.JOB_COUNTRY = C.COUNTRY) JOIN PROJ_DEPT_BUDGET PB ON P.PROJ_ID = PB.PROJ_ID WHERE C.CURRENCY ='Dollar' AND PB.FISCAL_YEAR = 1994; PLAN JOIN (PB INDEX (RDB$PRIMARY17), P INDEX (RDB$PRIMARY12), E INDEX (RDB$PRIMARY7), C INDEX (RDB$PRIMARY1))
As you can see, the new execution plan is optimal in the same way as before, but this time without doubling the processing of the PROJECT table.
Example 14: Sorting and aggregation
The next statement lists all projects with departmental budgets for individual years:
SELECT P.PROJ_NAME,P.PRODUCT,PB.DEPT_NO,PB.FISCAL_YEAR,PB.PROJECTED_BUDGET FROM PROJECT P JOIN PROJ_DEPT_BUDGET PB ON P.PROJ_ID = PB.PROJ_ID ORDER BY P.PRODUCT, P.PROJ_NAME; PLAN SORT (JOIN (PB NATURAL, P INDEX (RDB$PRIMARY12)))
According to this plan, the engine joins the source streams into one, then sorts that. Sorting could be optimized by an index. In this particular case, the ORDER BY clause matches the PRODTYPEX index, but this index cannot be used, because table PROJECT is not the controlling (leftmost) table for the join.
In next example, the situation is completely different:
SELECT P.PROJ_NAME,P.PRODUCT,PB.DEPT_NO,PB.FISCAL_YEAR,PB.PROJECTED_BUDGET FROM PROJECT P JOIN PROJ_DEPT_BUDGET PB ON P.PROJ_ID = PB.PROJ_ID ORDER BY PB.FISCAL_YEAR; PLAN JOIN (PB ORDER RDB$PRIMARY17, P INDEX (RDB$PRIMARY12))
Ordering is defined on the indexed column FISCAL_YEAR, which is part of controlling table for the join. Thus it is possible to perform the join loop with a navigational index to get a sorted list without sorting. The use of navigational index is indicated by the ORDER keyword in the plan.
TIP:
If you want to force the optimizer not to use the navigational index and yet perform the sort without an explicit execution plan, you can add a meaningless operation to one of the output columns that would otherwise be used for ordering. For example:
SELECT EMP_NO, LAST_NAME, FIRST_NAME || '' FROM EMPLOYEE ORDER BY 2,3
Requests with aggregates containing a GROUP BY clause are basically the same as requests with sorting, because they are processed in two steps:
- Creation of the basic resulting data set and separation into groups, using a sort.
2. Aggregation calculations on groups. The next example lists all projects and their budgets for year 1994. To get the project’s budget, we need to use an aggregate function (SUM), because project budgets detail for each department is stored in the PROJ_DEPT_BUDGET table:
SELECT P.PROJ_NAME, P.PRODUCT, SUM (PB.PROJECTED_BUDGET) FROM PROJECT P JOIN PROJ_DEPT_BUDGET PB ON P.PROJ_ID = PB.PROJ_ID WHERE PB.FISCAL_YEAR = 1994 GROUP BY P.PRODUCT, P.PROJ_NAME; PLAN SORT (JOIN (PB INDEX (RDB$PRIMARY17), P INDEX (RDB$PRIMARY12)))
The execution plan addresses just the first part of the statement processing: building the basic data stream and sorting it.
Example 15: Unions
Because a query with a UNION clause contains two separate select statements, two separate execution plans are created.
The next statement specifies a union listing aggregated budgets for projects in 1994 along with detailed budgets for individual departments:
SELECT P.PROJ_NAME, P.PRODUCT, PB.FISCAL_YEAR,SUM(PB.PROJECTED_BUDGET) FROM PROJECT P JOIN PROJ_DEPT_BUDGET PB ON P.PROJ_ID = PB.PROJ_ID WHERE PB.FISCAL_YEAR = 1994 GROUP BY P.PROJ_NAME, P.PRODUCT, PB.FISCAL_YEAR UNION ALL SELECT P.PROJ_NAME, P.PRODUCT, PB.FISCAL_YEAR, PB.PROJECTED_BUDGET FROM PROJECT P JOIN PROJ_DEPT_BUDGET PB ON P.PROJ_ID = PB.PROJ_ID WHERE PB.FISCAL_YEAR = 1994 ; PLAN SORT (JOIN (PB INDEX (RDB$PRIMARY17), P INDEX (RDB$PRIMARY12))) PLAN JOIN (PB INDEX (RDB$PRIMARY17), P INDEX (RDB$PRIMARY12))
If the same statement used UNION instead of UNION ALL, it would have the same execution plan, although it would require additional sorts of both streams to eliminate duplicate rows.
Example 16: Subqueries
The next query lists project team leaders who are also department managers:
SELECT P.PROJ_NAME, E.FULL_NAME, E.PHONE_EXT FROM PROJECT P JOIN EMPLOYEE E ON P.TEAM_LEADER = E.EMP_NO WHERE EMPLOYEE.EMP_NO IN (SELECT D.MNGR_NO FROM DEPARTMENT D); PLAN (D INDEX (RDB$FOREIGN10)) PLAN JOIN (E NATURAL, P INDEX (RDB$FOREIGN13))
Both the main select and the subquery have their own execution plans. The subquery is filtered using the index for the foreign key (MNGR_NO) for individual rows from the join of EMPLOYEE and PROJECT. For comparison, the same query could be realised without using the subquery:
SELECT P.PROJ_NAME, E.FULL_NAME, E.PHONE_EXT FROM PROJECT P JOIN EMPLOYEE E ON P.TEAM_LEADER = E.EMP_NO JOIN DEPARTMENT D ON E.EMP_NO = D.MNGR_NO; PLAN JOIN (D NATURAL, E INDEX (RDB$PRIMARY7), P INDEX (RDB$FOREIGN13))
As you can see, the plan for this statement is more optimal. I hope you noted the lesson about subqueries vs. joins here.
Once more the same query, but this time with a correlated subquery:
SELECT P.PROJ_NAME, E.FULL_NAME, E.PHONE_EXT FROM PROJECT P JOIN EMPLOYEE E ON P.TEAM_LEADER = E.EMP_NO WHERE EXISTS (SELECT D.* FROM DEPARTMENT D WHERE D.MNGR_NO = E.EMP_NO); PLAN (D INDEX (RDB$FOREIGN10)) PLAN JOIN (E NATURAL, P INDEX (RDB$FOREIGN13))
Optimizer improvements in Firebird
As was mentioned earlier, Firebird contains many enhancements and fixes to the optimizer, more details of these changes can be found in the relevant release notes.