Tags: DatabaseOptimization, MySQL, QueryOptimizer, IndexSelectivity, BTree
Section 1: The Frustration of Ignored Indexes
If you are a backend engineer or database administrator, you have likely encountered this incredibly frustrating scenario: your application is suffering from slow database response times, so you identify the problematic query, analyze the WHERE clause, and deploy a brand new index. You confidently expect the database to use this index and drastically reduce the query execution time. However, when you check the execution plan, you are greeted with the dreaded type: ALL—the database optimizer has completely ignored your carefully crafted index and opted for a massive, inefficient Full Table Scan instead [1].
In a moment of desperation, many developers reach for what seems like the easiest solution: modifying the SQL query to include a FORCE INDEX or USE INDEX hint. By explicitly telling the database engine which index to use, the query suddenly executes using the intended B-Tree structure, and performance appears to improve. Problem solved, right?
Not quite. While FORCE INDEX might act as a temporary bandage, it almost always introduces dangerous technical debt that masks a fundamental flaw in your database schema, index design, or query structure. Relational database management systems (RDBMS) possess highly sophisticated, mathematics-driven Query Optimizers. When the optimizer chooses to ignore an index, it usually has a logical, cost-based reason for doing so.
In this comprehensive guide, we will dive deep into the architectural and mathematical reasons why query optimizers reject indexes. We will explore why relying on FORCE INDEX is a dangerous practice, and more importantly, we will outline the exact engineering principles you must follow to design indexes that the optimizer will naturally and eagerly choose. Whether you are using MySQL, PostgreSQL, Oracle, or even NoSQL document stores, mastering these underlying computer science concepts will permanently elevate your database tuning skills.
Section 2: The Mathematical Reality of the Cost-Based Optimizer
To understand why your index is being ignored, you must first understand how the database makes decisions. Modern relational databases utilize a Cost-Based Optimizer (CBO). Before a query is executed, the CBO generates multiple potential execution plans, calculates a mathematical cost for each plan, and selects the path with the lowest overall cost [2], [3].
The cost is not measured in milliseconds; rather, it is a calculated metric primarily driven by the most expensive operation in computer architecture: Disk I/O.
To find data using a standard B+-Tree secondary index, the database must traverse the tree from the root node down to the leaf nodes. In a secondary index, the leaf nodes do not contain the actual row data; instead, they contain pointers (or primary key values) that tell the database where the actual row resides on the physical disk [4].
The mathematical cost formula for fetching records via a secondary B+-Tree index can be simplified as follows:

Cost ≈ (h + n) × (t_S + t_T)

Let us break down exactly what these variables represent:
• h: The height of the index tree (the depth of the B+-Tree).
• n: The number of records that match the query condition and must be fetched from the physical disk.
• t_T: The block transfer time (the time it takes to read a block of data from the disk into memory).
• t_S: The disk seek time (the physical time it takes for the disk head to move to the correct location).
The most dangerous variable in this equation is n. Every time the database finds a matching entry in the secondary index, it must perform a separate, random disk lookup to fetch the full row data. This is known as Random I/O. Disk seek time (t_S) is incredibly slow compared to memory operations.
If your query matches 100,000 rows, the database must perform 100,000 random disk seeks. As n grows larger, the accumulated n × (t_S + t_T) cost grows with every matching row and quickly dwarfs every other term in the plan.
At a certain tipping point—usually when a query needs to fetch more than 15% to 20% of the total rows in a table—the optimizer's mathematical model determines that performing thousands of random disk seeks is vastly more expensive than simply reading the entire table from start to finish. Reading the table sequentially is known as Sequential I/O. With Sequential I/O, the disk head moves in one continuous motion, sweeping up large contiguous blocks of data at high speeds [5].
Therefore, when the optimizer ignores your index, it is not making a mistake. It has calculated that the Random I/O penalty of using your index is mathematically heavier than the Sequential I/O cost of a Full Table Scan.
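The trade-off above is easy to make concrete with a little arithmetic. The following sketch plugs illustrative textbook HDD figures into the simplified formula (the numbers are assumptions for demonstration, not measurements from any real system) and finds the point where the index path becomes the more expensive plan:

```python
# Back-of-envelope sketch of the cost model. All hardware figures below are
# illustrative textbook HDD values, not measurements from any real system.
T_SEEK = 4.0        # t_S: average random seek time, ms
T_TRANSFER = 0.1    # t_T: time to transfer one block, ms
TREE_HEIGHT = 4     # h: B+-Tree depth
TOTAL_ROWS = 10_000_000
ROWS_PER_BLOCK = 100
TOTAL_BLOCKS = TOTAL_ROWS // ROWS_PER_BLOCK

def index_cost(n):
    """Cost of (h + n) random block accesses, each paying a seek plus a transfer."""
    return (TREE_HEIGHT + n) * (T_SEEK + T_TRANSFER)

def full_scan_cost():
    """One initial seek, then every block streamed sequentially."""
    return T_SEEK + TOTAL_BLOCKS * T_TRANSFER

# Find the smallest n at which the index plan becomes the worse choice.
crossover = next(n for n in range(TOTAL_ROWS)
                 if index_cost(n) > full_scan_cost())
print(f"full table scan: {full_scan_cost():,.0f} ms")
print(f"index path loses once n exceeds {crossover:,} matching rows "
      f"({100 * crossover / TOTAL_ROWS:.3f}% of the table)")
```

Note that with these raw-disk numbers the crossover sits far below the 15% to 20% heuristic; real optimizers also credit buffer-pool caching and read-ahead, which pushes the practical threshold much higher, but the underlying shape of the trade-off is the same.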
Section 3: The Hidden Dangers of FORCE INDEX
Understanding the mathematical cost model makes it easier to see why deploying FORCE INDEX is a flawed architectural decision. When you hardcode an index hint into your application code, you strip the Cost-Based Optimizer of its autonomy. You are essentially telling the database, "I do not care about the mathematical cost; use this path no matter what."
This introduces severe, long-term risks to your system:
1. Dynamic Data Distribution: Data is never static. An index that is highly efficient when your table has 10,000 rows might become a catastrophic bottleneck when the table grows to 100 million rows. If a specific customer suddenly generates an enormous amount of data, querying their records via the forced index could trigger millions of random disk seeks, crippling your database server. Because you hardcoded FORCE INDEX, the optimizer is bound to execute the disastrous plan, leading to sudden, hard-to-diagnose system outages [6], [7].
2. Engine Upgrades: Database engines are constantly evolving. An upgrade from MySQL 5.7 to MySQL 8.0, or PostgreSQL 12 to 15, brings major improvements to the optimizer's algorithms (such as Index Condition Pushdown or improved hash joins) [8]. By forcing indexes, you prevent your application from benefiting from these under-the-hood engine enhancements.
3. Schema Changes: If a future developer drops or renames the index you forced, your queries will instantly fail, causing hard application crashes.
Optimizer hints should only be used as a temporary emergency patch during a live production incident, or in highly specialized data warehouse queries where the data distribution is mathematically guaranteed to never change. In all other scenarios, the correct approach is to engineer your indexes so the optimizer naturally chooses them.
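For reference, the hint itself is a one-line change, which is exactly what makes it so tempting. Here is a minimal runnable sketch using Python's built-in sqlite3 module: SQLite has no FORCE INDEX, so its closest analog, INDEXED BY, stands in, and the MySQL form is shown in a comment. The table and index names are invented for illustration:

```python
import sqlite3

# Hypothetical illustration of index-hint syntax. The MySQL form would be:
#   SELECT * FROM orders FORCE INDEX (idx_status) WHERE status = 'SHIPPED';
# SQLite's closest analog, INDEXED BY, is demonstrated below.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("CREATE INDEX idx_status ON orders (status)")
conn.executemany("INSERT INTO orders (status) VALUES (?)",
                 [("SHIPPED",), ("PENDING",), ("SHIPPED",)])

# INDEXED BY pins the plan to the named index; the planner no longer chooses.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders INDEXED BY idx_status "
    "WHERE status = 'SHIPPED'"
).fetchall()
print(plan[0][3])  # the plan step is pinned to idx_status, whatever the cost
```

The ease of that one-line pin is precisely the danger: the hint survives in the codebase long after the data distribution that justified it has changed.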
Section 4: Engineering Indexes the Optimizer Will Love
If we cannot rely on FORCE INDEX, how do we convince the optimizer to use our indexes? The answer lies in manipulating the variables in the cost equation to make the index path undeniably cheaper than the full table scan.
4.1 Mastering Index Selectivity
The primary reason an optimizer rejects an index is poor selectivity. Index Selectivity is a statistical metric that represents the ratio of distinct (unique) values to the total number of rows in the table [9].
Selectivity = Distinct_Values / Total_Records
Selectivity ranges from a value approaching 0 (terrible) to 1 (perfect). A unique primary key, such as a Social Security Number or a User UUID, has a selectivity of 1. Every lookup points to exactly one row, meaning the n in our cost formula is 1. The optimizer loves highly selective indexes because the Random I/O cost is negligible.
Conversely, consider a status column in an e-commerce orders table with only three possible values: PENDING, SHIPPED, and CANCELLED. If you have 10 million orders, and 3 million are SHIPPED, the selectivity is incredibly low (3 / 10,000,000). If you write a query WHERE status = 'SHIPPED', the optimizer knows it will have to fetch 3 million rows. The cost of 3 million random disk seeks vastly outweighs a sequential table scan, so the index is ignored.
The Fix: Never create single-column indexes on low-selectivity columns like booleans, genders, or basic statuses. Instead, combine them with highly selective columns to create multicolumn (composite) indexes.
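The selectivity ratio can be measured directly with one query. This sketch uses Python's built-in sqlite3 module and an invented orders table purely for illustration; the same SELECT COUNT(DISTINCT ...) approach works verbatim against MySQL:

```python
import sqlite3

# Selectivity = distinct values / total rows, computed straight from the data.
# Toy dataset: a low-cardinality status column (invented for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("PENDING",)] * 20 + [("SHIPPED",)] * 70 + [("CANCELLED",)] * 10,
)
distinct, total = conn.execute(
    "SELECT COUNT(DISTINCT status), COUNT(*) FROM orders"
).fetchone()
selectivity = distinct / total
print(f"selectivity = {distinct}/{total} = {selectivity}")  # 3/100 = 0.03
```

A ratio this close to zero tells you, before you ever create the index, that a single-column index on status will rarely be chosen on its own.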
4.2 The Art of Multicolumn Index Ordering
When creating composite indexes, the order in which you declare the columns is paramount. A B-Tree composite index sorts data hierarchically based on the order of the columns defined in the index definition.
If you have a query like:
SELECT * FROM users
WHERE organization_id = 55 AND last_login_date > '2023-01-01';
</gr-replace>
You must follow two golden rules for column ordering:
1. Most Selective First: Place the column that filters out the most rows at the very beginning of the index [10]. In the example above, organization_id is likely much more selective than a broad date range.
2. Equality Before Range (the ESR Rule): Always place columns queried with equality operators (=, IN) before columns queried with range operators (>, <, BETWEEN, LIKE).
Why is the ESR rule so critical? Because of how B-Trees function. As soon as the database encounters a range condition in the index, it can no longer use the subsequent columns in the index for rapid B-Tree traversal; it must sequentially scan the remaining leaf nodes to filter the data [11], [12]. By placing equality conditions first, you isolate a tiny, specific slice of the B-Tree before applying the range scan, drastically reducing the dataset size.
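Here is a minimal runnable sketch of the rule using Python's sqlite3 module (SQLite's EXPLAIN QUERY PLAN output differs from MySQL's EXPLAIN, but the composite B-Tree behavior is the same); the table, column, and index names are invented:

```python
import sqlite3

# ESR sketch: equality column first, range column second in the composite index.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    organization_id INTEGER,
    last_login_date TEXT)""")
conn.execute(
    "CREATE INDEX idx_org_login ON users (organization_id, last_login_date)")
conn.executemany(
    "INSERT INTO users (organization_id, last_login_date) VALUES (?, ?)",
    [(org, f"2023-0{m}-15") for org in range(1, 6) for m in range(1, 10)],
)
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users "
    "WHERE organization_id = 3 AND last_login_date > '2023-01-01'"
).fetchall()
for row in plan:
    print(row[3])  # a SEARCH step using idx_org_login for both predicates
```

Because the equality column leads the index, the planner jumps straight to the organization_id = 3 slice of the B-Tree and only then applies the date range within it.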
4.3 Eradicating Random I/O with Covering Indexes
The single most powerful optimization technique in relational databases is the Covering Index.
Recall our mathematical cost formula: Cost ≈ (h + n) × (t_S + t_T). The massive performance penalty comes from the database having to leave the B-Tree index to fetch the actual row data from the physical disk blocks.
But what if the database never had to leave the index?
If an index contains all the columns required by your SELECT, WHERE, and ORDER BY clauses, it is considered a "Covering Index." Because all the necessary data is already present inside the B-Tree's leaf nodes, the database completely skips the row lookup phase [6]. The per-row t_S (random disk seek) portion of the cost equation is effectively reduced to zero.
When you run an EXPLAIN on your query, a successful Covering Index will display Extra: Using index [6], [13]. This is the holy grail of query optimization.
For example, if you frequently run:
SELECT first_name, last_name FROM employees WHERE department_id = 10;
Instead of just indexing department_id, create a composite index on (department_id, first_name, last_name). The query will execute in a fraction of a millisecond because it never touches the underlying table data.
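The effect is easy to observe. This sketch uses Python's sqlite3 module with a hypothetical schema; SQLite announces the optimization as "USING COVERING INDEX" in its plan, where MySQL would report "Extra: Using index":

```python
import sqlite3

# Covering-index sketch: the query reads only indexed columns, so the
# planner never has to visit the table rows at all.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    department_id INTEGER,
    first_name TEXT,
    last_name TEXT)""")
conn.execute("CREATE INDEX idx_dept_names "
             "ON employees (department_id, first_name, last_name)")
conn.execute("INSERT INTO employees (department_id, first_name, last_name) "
             "VALUES (10, 'Ada', 'Lovelace')")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT first_name, last_name "
    "FROM employees WHERE department_id = 10"
).fetchall()
print(plan[0][3])  # SEARCH ... USING COVERING INDEX idx_dept_names ...
```

Add a column the index does not contain to the SELECT list and the "COVERING" label disappears, because the engine must fall back to per-row lookups.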
4.4 Conquering the Filesort Penalty
Another major reason queries perform terribly, even when hitting an index, is post-retrieval sorting. If your query includes an ORDER BY clause, the database will attempt to use the natural sorted order of the B-Tree index to return the results.
However, if your ORDER BY columns do not align perfectly with your index structure, the database must fetch all matching rows, place them into a dedicated memory buffer defined by the sort_buffer_size variable, and sort them manually. If the result set is larger than the sort_buffer_size memory allocation, the database is forced to write chunks of data to temporary files on the physical disk, sort them, and merge them back together [8].
This disastrously slow operation is flagged in the EXPLAIN output as Extra: Using filesort.
The Fix: Align your composite index order with your sorting requirements. If you query WHERE category_id = 5 ORDER BY created_at DESC, your index must be strictly ordered as (category_id, created_at). This allows the database to instantly slice down to category_id = 5 and read the records sequentially, as they are already physically sorted by created_at inside the B-Tree.
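The before-and-after is visible in the plan output. In this sketch with Python's sqlite3 module (invented table and index names), SQLite flags the manual sort as "USE TEMP B-TREE FOR ORDER BY", playing the role of MySQL's "Using filesort":

```python
import sqlite3

# Filesort sketch: the same query is planned with and without an index that
# matches the ORDER BY. SQLite's "USE TEMP B-TREE FOR ORDER BY" marks the
# explicit sort step (MySQL's equivalent flag is "Using filesort").
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, "
             "category_id INTEGER, created_at TEXT)")
query = ("SELECT * FROM posts WHERE category_id = 5 "
         "ORDER BY created_at DESC")

def needs_sort_step():
    plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    return any("TEMP B-TREE" in row[3] for row in plan)

before = needs_sort_step()  # no index yet: an explicit sort step is required
conn.execute("CREATE INDEX idx_cat_created ON posts (category_id, created_at)")
after = needs_sort_step()   # the index already delivers rows in sorted order
print(before, after)  # True False
```

The second plan has no sort step because the B-Tree is read in (reverse) index order, exactly as described above.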
4.5 Avoiding Hidden Type Conversions and Non-Sargable Queries
Sometimes you build a perfect index, but the optimizer still ignores it because of how the SQL is written. A predicate must be "sargable" (from "Search ARGument ABLE") for the optimizer to traverse the B-Tree.
If you wrap an indexed column inside a function, you instantly destroy the optimizer's ability to use the index [14].
Bad: WHERE YEAR(created_at) = 2023
Because the database must apply the YEAR() function to every single row in the table to evaluate the condition, it is forced to perform a Full Table Scan.
Good: WHERE created_at >= '2023-01-01 00:00:00' AND created_at < '2024-01-01 00:00:00' (a half-open range; unlike <= '2023-12-31 23:59:59', it also catches timestamps with fractional seconds in the final second of the year)
Similarly, beware of implicit type casting. If you have an indexed VARCHAR column called account_number, but you query it using an integer (WHERE account_number = 12345), the database must silently convert all strings in the column to integers to compare them. This breaks the B-Tree traversal and triggers a full scan. Always match your query parameter data types exactly to your schema column definitions.
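Both predicate shapes can be compared side by side. A sketch with Python's sqlite3 module and invented names follows; SQLite has no YEAR() function, so strftime() plays the role of the index-killing wrapper:

```python
import sqlite3

# Sargability sketch: a function wrapped around the indexed column forces a
# scan, while the equivalent plain range predicate allows an index search.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, "
             "created_at TEXT, payload TEXT)")
conn.execute("CREATE INDEX idx_created ON events (created_at)")

def first_step(sql):
    """Return the first step of the query plan as text."""
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][3]

bad = first_step("SELECT * FROM events "
                 "WHERE strftime('%Y', created_at) = '2023'")
good = first_step("SELECT * FROM events WHERE created_at >= '2023-01-01' "
                  "AND created_at < '2024-01-01'")
print(bad)   # a full SCAN step: the wrapped column hides the index
print(good)  # a SEARCH step over idx_created for the half-open date range
```

The column data never changed; only the shape of the predicate decided whether the B-Tree was usable.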
Section 5: Understanding InnoDB Architecture (Clustered vs. Secondary Indexes)
If you are using MySQL's default InnoDB storage engine, the penalty for ignored indexes is uniquely severe due to its underlying physical architecture.
InnoDB utilizes a Clustered Index architecture. In a clustered index, the physical rows of the table are stored directly inside the leaf nodes of the Primary Key B+-Tree [15]. Because the table is the index, retrieving data via the Primary Key is breathtakingly fast.
However, this means that all other indexes you create are Secondary Indexes. The leaf nodes of a secondary index do not contain physical disk pointers; they contain the Primary Key value of the matching row [4], [16].
Therefore, every time InnoDB uses a secondary index, it must perform a "double lookup." It first traverses the secondary index to find the Primary Key value, and then it must traverse the Primary Key Clustered Index to find the actual row data. This double-traversal mechanism makes the Random I/O penalty significantly higher in InnoDB than in engines like MyISAM (which uses physical byte offsets). This architectural reality reinforces exactly why maximizing Index Selectivity and utilizing Covering Indexes are absolutely mandatory when working with InnoDB workloads.
Section 6: Universal Principles Across Different Database Systems
The frustrating experience of an ignored index is not isolated to MySQL. The mathematical principles of Cost-Based Optimization govern almost every major database system in existence.
• PostgreSQL: PostgreSQL uses a highly advanced Query Planner that heavily relies on table statistics. If your indexes are being ignored in Postgres, it is often because the planner's statistics are outdated, tricking the optimizer into thinking a sequential scan is cheaper. Running the VACUUM ANALYZE command updates the statistical metadata, accurately reflecting the index selectivity and often fixing the execution plan immediately [17].
• Oracle: Oracle Database utilizes an incredibly complex CBO that relies on Histograms. Histograms allow Oracle to understand data skew (e.g., knowing that the value "ACTIVE" occurs 99% of the time, while "SUSPENDED" occurs 1% of the time). Oracle will dynamically choose a Full Table Scan when querying "ACTIVE", but flawlessly switch to an Index Scan when querying "SUSPENDED".
• MongoDB & NoSQL: Even schema-less document stores like MongoDB utilize B-Tree structures for their indexes. When you run an .explain("executionStats") command in MongoDB and see a COLLSCAN (Collection Scan), it means the index was ignored. The exact same ESR (Equality, Sort, Range) rule that dictates MySQL composite index ordering strictly applies to MongoDB compound indexes.
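The statistics-refresh idea mentioned for PostgreSQL is universal and can be sketched in miniature with Python's sqlite3 module: Postgres uses VACUUM ANALYZE, MySQL uses ANALYZE TABLE, and SQLite's ANALYZE (shown here, with an invented schema) records per-index statistics in the sqlite_stat1 table, which its planner then consults:

```python
import sqlite3

# Statistics-refresh sketch: after ANALYZE, the planner has concrete numbers
# for row count and average rows per distinct key instead of guesses.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v INTEGER)")
conn.execute("CREATE INDEX idx_v ON t (v)")
conn.executemany("INSERT INTO t (v) VALUES (?)",
                 [(i % 10,) for i in range(1000)])  # 10 distinct values
conn.execute("ANALYZE")
stats = conn.execute(
    "SELECT idx, stat FROM sqlite_stat1 WHERE tbl = 't'").fetchall()
print(stats)  # the stat string starts with the total row count
```

However the engine spells the command, the lesson is the same: a cost-based optimizer is only as good as the statistics it is fed.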
Section 7: Conclusion and Actionable Checklist
When a database query optimizer ignores your index, it is not acting maliciously; it is performing a highly calculated, mathematical cost analysis. Forcing the database's hand with FORCE INDEX bypasses this mathematical reality, creating fragile systems that will inevitably buckle under the weight of dynamic data growth and evolving infrastructure.
Instead of fighting the optimizer, you must collaborate with it. By minimizing the burden of Random I/O and eliminating memory-crushing sort operations, you guarantee that the optimizer will naturally select your heavily optimized paths.
Before you deploy your next database schema update, ensure you pass this architectural checklist:
• Does the column I am indexing have high Index Selectivity, or am I trying to index a low-cardinality boolean flag?
• In my composite indexes, have I placed the most selective columns first?
• Have I followed the ESR rule, ensuring all Equality conditions are listed before Range conditions?
• Can I add one or two more columns to my index to create a Covering Index and completely eliminate row data lookups?
• Are my ORDER BY operations triggering a filesort and exceeding the sort_buffer_size, or is my index naturally sorting the data for the engine?
• Is my query entirely Sargable, free of implicit type casting and column-wrapping functions?
By mastering these deep computer science principles, you transition from simply guessing at index creation to engineering mathematically sound, high-performance database architectures that will flawlessly scale to billions of rows.
References
[1] B. Schwartz, P. Zaitsev, and V. Tkachenko, High Performance MySQL, 3rd ed., O'Reilly Media, 2012 — invoking and interpreting EXPLAIN.
[2] A. Silberschatz, H. F. Korth, and S. Sudarshan, Database System Concepts, 7th ed., McGraw-Hill, 2020.
[3] Database System Concepts, 7th ed. — the optimizer's search for the least-cost evaluation plan.
[4] High Performance MySQL, 3rd ed. — Ch. 5: Indexing for High Performance, B+Tree index structure.
[5] High Performance MySQL, 3rd ed. — storage engine characteristics.
[6] High Performance MySQL, 3rd ed. — Ch. 5: covering indexes and "Extra: Using index".
[7] High Performance MySQL, 3rd ed. — Ch. 5: extending an index to cover a query.
[8] High Performance MySQL, 3rd ed. — Ch. 5: sorting with indexes.
[9] High Performance MySQL, 3rd ed. — Ch. 5: prefix indexes and index selectivity.
[10] High Performance MySQL, 3rd ed. — Ch. 5: choosing composite index column order.
[11] High Performance MySQL, 3rd ed. — Ch. 5: avoiding multiple range conditions.
[12] High Performance MySQL, 3rd ed. — Ch. 5: IN() lists and composite index coverage.
[13] High Performance MySQL, 3rd ed. — rewriting non-SELECT statements for EXPLAIN.
[14] High Performance MySQL, 3rd ed. — Appendix D: Using EXPLAIN.
[15] High Performance MySQL, 3rd ed. — Ch. 5: clustered indexes.
[16] High Performance MySQL, 3rd ed. — Ch. 5: InnoDB vs. MyISAM data layout.
[17] Database System Concepts, 7th ed.
