Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Database

External SQL Rewriters

Richard To and Cara Pang

, May 01, 2002


May02: External SQL Rewriters

Richard is the technical director and Cara a senior engineer at LECCO Technology. They can be contacted at [email protected] and [email protected], respectively.


Database query optimizers are resident components in relational database management system (RDBMS) servers. Their function is to optimize Structured Query Language (SQL) statements by generating alternative execution plans to find the one with the least estimated cost. Typically, when a SQL statement is sent to the RDBMS server, it is parsed and presented to the query optimizer. The query optimizer then rewrites the query and evaluates its expression to produce alternative execution plans. A cost estimation is calculated for each alternative execution plan and the plan with the least cost executes that SQL statement.

This approach has two intrinsic deficiencies:

  • A lack of alternative execution plans due to limited plan space or a result of the SQL statement syntax itself.
  • Imprecise cost estimation, which is highly dependent on the accuracy of the data dictionary — even a small miscalculation in a single step may result in a high degree of inaccuracy in the final estimated cost.

So while it's unrealistic to expect an optimizer to find the very best plan in every case, most optimizers aim to find a relatively good execution plan or, more importantly, to avoid the worst plan.

Today's commercial query optimizers are complex pieces of software with many closely guarded details, estimated to represent 40 to 50 man-years of development. As database architectures become more sophisticated and powerful, query optimizers are facing new challenges to accommodate database features such as data warehousing, parallel query, and distributed query. And, as new database features are implemented and databases become bigger, a tradeoff is required since increasing query optimizer intelligence leads to an increase in the database performance overhead.

A New SQL Optimization Concept

It is not unusual that experienced programmers can sometimes boost the performance of a problematic SQL statement up to thousands of times faster. With deep knowledge of the database characteristics and by reconstructing the SQL statement syntax, developers can help the query optimizer find a better execution plan — one that cannot be generated by the query optimizer alone. The issue now is whether an algorithm can accomplish the same optimization. Whether we can implement an algorithm that can perform the same result is a heuristic question.

Figure 1 shows the concept of an algorithm we call "External SQL Rewriter" (ESR), which rewrites SQL statements based on existing database characteristics to produce semantically equivalent SQL with new syntax. The advantages of using an ESR are:

  • It will work on a nonreal-time optimization basis. The time to rewrite and optimize a SQL statement is theoretically unlimited, so it can try more alternatives than the database query optimizer can. The amount of time saved may include days spent on trying to increase the performance of a single critical SQL statement by exhaustive searching of alternative execution plans.
  • SQL performance gains can be realized without overloading the database query optimizer with extra statistics, hence making it both a simpler and safer way to improve the overall database performance.

  • An ESR engine is far more flexible compared with the internal query optimizer of an RDBMS. SQL rewrite knowledge can be easily extended and searching space can be expanded via hardware upgrades without any limitations and drawbacks.

For an ESR to optimize SQL statements, it must not be hindered by the same restrictions as a database query optimizer. In addition, it must have intelligent SQL rewrite capability and knowledge of the database characteristics to provide alternative SQL statements. Using the approach presented in Figure 1, SQL statements can be input directly to the ESR or captured through examining the source files for problematic SQL statements. Upon receiving a SQL statement, a recursive SQL transformation algorithm is applied to rewrite the source SQL and produce every possible semantically equivalent alternative. The recursive SQL transformation engine has built-in artificial intelligence and SQL transformation rules to reconstruct the SQL statement according to the user's database characteristics.

The ESR uses a set of quotas to control the searching level of its transformation rules. These user-definable rules give users control of the search space. If they aren't satisfied with the result or can't find a better SQL alternative within the defined quotas, they can increase the quotas to enlarge the searching space until no further SQL alternatives can be found. Another advantage is that the ESR eliminates internal SQL statements with duplicate execution plans, which not only reduces the human trial and error effort but also guarantees that the rewritten SQL statements will have unique execution plans and performance.

Although an ESR can enlarge its plan space and generate more alternatives than a database query optimizer, it encounters the same problem with inaccurate cost estimation without physically running each rewritten SQL statement — there is no other way to tell which execution plan is the best among all alternatives. The most practical solution, then, is to test run all SQL alternatives. Users can specify when to start the test and how long to run the test. Automatic termination can be provided for unnecessarily long run-time SQL statements. While this solution abandons the use of the estimated cost as the guideline for SQL performance, it still uses it as a test-run order.

Recursive SQL Transformation

It's well known that different SQL syntax causes a database query optimizer to generate different execution plans. An ESR solution must provide the same ability to simulate human SQL rewriting capability — and that's exactly what the new Recursive SQL Transformation technology does. It simulates human SQL transformation techniques by incorporating a set of transformation rules for transforming SQL statements on a piece-by-piece basis. Each transformation rule is independent from each other — like a capsule — and the capsule can only be opened when all necessary conditions are satisfied. This guarantees the semantic equivalence of the rewritten SQL statements. Since any SQL statement transformed by one rule will have a new SQL syntax that may satisfy the requirements of another rule, this transformation can be carried out in a recursive manner.

Listing One shows how two built-in searching rules work together. One rule transforms an IN condition to an EXISTS condition and another rule does the reverse, changing an EXISTS condition to an IN condition. If you consider rewriting the SQL statement, each rule applied to the SQL statement will satisfy another rule. SQL statements with different syntax can be produced by such a set of rules and the order of the rules being fired can introduce different SQL alternatives. In this example, the source SQL has gone through two rules fired in a recursive manner, and if the recursive transformation is not stopped, the loop will continue infinitely, as in Figure 2. There are four unique syntax SQL statements generated by the two transformation rules marked by the solid boxes. If each of these four SQL statements ends up with a new execution plan, there are potentially three different performance SQL statements that can be used as a benchmark to the source SQL.

Implementing a transformation rule actually requires a more complicated control, in that it has to check whether a set operator (UNION, MINUS, or INTERSECT) is in the subquery, whether multiple items are in the select list, and so on. But the beauty of those rules is that they are self protected because the transformation and conditional checking are encapsulated into one rule capsule to prevent generating incorrect SQL statements. In our ESR, many rules are incorporated and the rules can handle extremely complicated situations. Some transformation rules can even be applied endlessly to transform a SQL statement to another semantically equivalent one without limitation, so quotas must be engaged to control the searching space.

A Real-World Example

Take a look at the following SQL statements in Listing Two and read them into an ESR connected to an Oracle 8.1.7 database, where Table Employee has 4,000,671 records, Department has 430 records, and Grade has 61 records. In Listing Three, the SQL statement has the execution plan with an estimated cost of 24,847. The cost estimation is calculated by the database optimizer using the access methods, statistics, and estimated selectivity of the execution plan; how the cost estimation is calculated may differ from one database to another. Generally, cost estimation has no specific unit of measurement and has no meaning if viewed individually or if compared with different SQL statements. The value of this figure is irrelevant to the performance of the SQL statement; only the relative magnitude of it compared with the alternative execution plans of the same SQL statement is meaningful to estimate which execution plan is most likely to run more efficiently.

If you put this into the ESR to see how it generates semantically equivalent SQL, with alternative syntax and execution plans, Figure 3 shows that the ESR has rewritten 240 semantically equivalent SQL alternatives within 30 seconds. Out of 240 rewrites, 14 unique execution plans are generated by the Oracle database, which means that most SQL rewrites generated by the ESR have the same execution plans.

Look at these 14 representative SQL alternatives in Table 1, where the SQL statements are ordered by Oracle estimated cost, lowest cost listed first. After physical benchmarking of all alternative SQL, the elapsed times can be collected. It's obvious that while SQL3 has a low estimated Oracle cost of 1, it actually has the longest elapsed time (>625.86 seconds) and was terminated manually. This example shows that using the cost-estimate measure alone is not enough — the ESR optimizer is needed in predicting SQL1 and SQL2 to have better performance. The same phenomenon occurred with the SQL9 and SQL10 statements. Both had relatively lower costs than SQL11, SQL12, SQL13, and SQL14, but their elapsed time and resource consumption were much higher than the latter four.

Figure 4 shows the Oracle cost estimation for this example and the benchmark test results, with detailed statistics. Within the ESR preset quota searching space, the best SQL alternative is found to be SQL2, along with its new execution plan. Since the ESR algorithm runs on a trial and error basis, it doesn't provide any explanation mechanism to tell why the new syntax is better than the original SQL statement for this specific environment. (Users may have to spend some time studying why the new syntax can cause the database optimizer to pick up another execution plan and why the new plan has better performance.) But at this point, all that users have to do is copy and paste SQL2 to the source program and compile it to gain a tremendous performance improvement. Listing Four shows the new syntax, while Listing Five presents the new execution plan with a cost estimate of 1.

Conclusion

After years of hard work to simplify the SQL optimization process, people are still suffering from bad SQL performance despite claims by some commercial database vendors that their query optimizers have enhanced intelligence. While no one should underestimate the intelligence of today's built-in query optimizers, extending their use only buries users with performance overhead. This is because the more intelligent the query optimizers are, the less user control is provided to influence the behavior of their query optimizers. Using an external SQL rewriter algorithm in collaboration with the built-in query optimizer can relieve the performance overload.

The real importance of ESR technology is that no human involvement is required throughout the whole optimization process. The ESR algorithm searches every possible way to rewrite the original SQL and trigger the database optimizer to generate alternative execution plans. ESRs are extendible: They can grow as user needs change, by incorporating more SQL transformation rules to handle additional database features and more complex SQL statements. Theoretically, there are no size limitations. And, as database vendors are more open to having users influence their database query optimizers, ESR offers more control over the optimization strategy and make optimizers even more useful.

DDJ

Listing One

SELECT * 
 FROM A
 WHERE A.C1 IN (SELECT B.C1 
                 FROM B
                 WHERE EXISTS (SELECT 'x' 
                                FROM C
                                WHERE B.C2 = C.C2 ))

Back to Article

Listing Two

SELECT * 
  FROM grade A 
 WHERE A.grd_id IN (SELECT B.emp_grade 
                      FROM employee B 
                     WHERE EXISTS (SELECT 'x' 
                                     FROM department C
                                    WHERE B.emp_dept = C.dpt_id))

Back to Article

Listing Three

EXECUTION PLAN :
[-] SELECT STATEMENT CHOOSE Cost = 
                         24847 (Cost = 24847 Cardinality = 61 Bytes = 3111 )
    [-] HASH JOIN (Cost = 24847 Cardinality = 61 Bytes = 3111 )
        [-] VIEW SYS.VW_NSO_1 (Cost = 24845 Cardinality = 45 Bytes = 585 )
            [-] SORT UNIQUE (Cost = 24845 Cardinality = 45 Bytes = 495 )
                [-] NESTED LOOPS 
                        (Cost = 6072 Cardinality = 3999700 Bytes = 43996700 )
                    [-] PARTITION RANGE ALL ( Partition Start = 
                                     1 Partition Stop = 14 Partition ID = 5)
                        [ ] TABLE ACCESS FULL SQLEXP.EMPLOYEE ANALYZED 
                                 (Cost = 6072 Cardinality = 3999700 Bytes = 
                                      27997900 Partition Start = 1 Partition 
                                                   Stop = 14 Partition ID = 5)
                    [ ] INDEX UNIQUE SCAN SQLEXP.DPT_ID_INX ANALYZED 
                                           ( Cardinality = 430 Bytes = 1720 )
        [ ] TABLE ACCESS FULL SQLEXP.GRADE ANALYZED 
                                     (Cost = 1 Cardinality = 61 Bytes = 2318 )

Back to Article

Listing Four

EXECUTION PLAN :
[-] SELECT STATEMENT CHOOSE Cost = 1 (Cost = 1 Cardinality = 4 Bytes = 152 )
    [-] FILTER ( )
        [ ] TABLE ACCESS FULL SQLEXP.GRADE ANALYZED 
                                      (Cost = 1 Cardinality = 4 Bytes = 152 )
        [-] NESTED LOOPS (Cost = 92709 Cardinality = 88883 Bytes = 977713 )
            [-] TABLE ACCESS BY GLOBAL INDEX ROWID SQLEXP.EMPLOYEE ANALYZED 
                              (Cost = 3826 Cardinality = 88883 Bytes = 622181 
                               Partition Start = ROW LOCATION Partition Stop =
                               ROW LOCATION Partition ID = 4)
                [ ] INDEX RANGE SCAN SQLEXP.EMP_GRADE_IDX ANALYZED 
                                            (Cost = 225 Cardinality = 88883 )
            [ ] INDEX FULL SCAN SQLEXP.DPT_ID_INX ANALYZED 
                                    (Cost = 1 Cardinality = 430 Bytes = 1720 )

Back to Article

Listing Five

SELECT * 
 FROM grade A 
 WHERE EXISTS (SELECT 'X' 
                 FROM employee b 
                 WHERE B.emp_dept IN (SELECT C.dpt_id || '' 
FROM department c) 
                 AND B.emp_grade = A.grd_id)

Back to Article


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.