My current query:
SELECT
eod.date,
SUM(eod.stable_price) AS sumStablePrice
FROM
item_eod eod
WHERE
eod.date >= CURRENT_DATE - INTERVAL 90 DAY
GROUP BY
eod.date
ORDER BY
eod.date DESC;
Introduction
Query optimization is a cornerstone of efficient database management, especially for large datasets. Today, I explored how to optimize a costly GROUP BY query in MySQL by adding an appropriate index. This blog dives into the details of my learning process, highlighting the impact of indexing on performance.
Understanding the Problem
The original query sums a column (stable_price) for each date, restricted to the last 90 days by a filter on the date column. The execution plan revealed inefficiencies:
- Execution Plan:
-> Sort: eod.date DESC (actual time=2725.176..2725.183 rows=91 loops=1)
-> Table scan on <temporary> (actual time=2725.123..2725.136 rows=91 loops=1)
-> Aggregate using temporary table (actual time=2725.121..2725.121 rows=91 loops=1)
-> Filter: (eod.date >= DATE'2024-09-25') (cost=232440.49 rows=761190) (actual time=12.863..1637.540 rows=2272445 loops=1)
-> Table scan on eod (cost=232440.49 rows=2283800) (actual time=0.067..1353.590 rows=2296891 loops=1)
This execution plan highlights the following:
- A full table scan is performed, and rows are filtered by date only afterwards.
- The query uses a temporary table for aggregation, followed by a sort, which increases execution time.
- The total time taken is 2725 ms, indicating inefficiency.
Current Index: item_id_date_idx (item_id, date)
- Index Fields: item_id, date
- Order: Ascending by default (no DESC).
- Execution Plan:
  - Aggregate Using Temporary Table: MySQL builds a temporary table to perform the aggregation.
  - Table Scan: The query reads the entire table (about 2.3 million rows) and only then applies the filter date >= '2024-09-25'. The filter keeps 2,272,445 of the 2,296,891 rows read, so it removes very little.
  - Sort: Results are sorted by eod.date DESC after aggregation.
- Performance Characteristics:
  - Sorting: The sort runs on only the 91 aggregated rows, so it adds almost nothing. The time goes to the steps below it.
  - Table Scan and Aggregation: The full scan takes about 1.35 s, and building the temporary table for the aggregation takes roughly another 1.1 s (2725 ms total, minus the 1638 ms at which filtering finishes).
  - Effectiveness of Index: The index on (item_id, date) is not used by this plan. Because date is its second column, it cannot serve a range filter on date alone.
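For anyone reproducing this, a plan like the one above can be obtained roughly as follows. This is a sketch that assumes MySQL 8.0.18 or later, where EXPLAIN ANALYZE is available. Keep in mind that EXPLAIN ANALYZE actually executes the query, so it takes as long as the query itself.

```sql
-- List the indexes that currently exist on the table.
SHOW INDEX FROM item_eod;

-- Execute the query and print the plan with measured timings per step.
EXPLAIN ANALYZE
SELECT
    eod.date,
    SUM(eod.stable_price) AS sumStablePrice
FROM item_eod eod
WHERE eod.date >= CURRENT_DATE - INTERVAL 90 DAY
GROUP BY eod.date
ORDER BY eod.date DESC;
```

In the output, each actual time=a..b pair is in milliseconds: a is the time until the step returned its first row and b the time until it returned its last row, both including the steps nested beneath it.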
Why GROUP BY Can Be Costly
When MySQL processes a GROUP BY query:
- Rows are filtered based on the WHERE clause.
- The result set is grouped and aggregated.
- Sorting may occur if the results are not already ordered.
Without an optimized index, the query must:
- Read all matching rows from the table.
- Use temporary tables and perform sorting/aggregation in memory or on disk, which is costly.
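One way to check whether a query takes this costly path is sketched below. It relies on the Extra column of plain EXPLAIN and on the session status counters for internal temporary tables. The counters can also move for unrelated statements in the same session, so I would treat them as a rough signal rather than an exact measurement.

```sql
-- Plain EXPLAIN does not run the query. In the Extra column,
-- "Using temporary" and "Using filesort" point to the costly path.
EXPLAIN
SELECT eod.date, SUM(eod.stable_price) AS sumStablePrice
FROM item_eod eod
WHERE eod.date >= CURRENT_DATE - INTERVAL 90 DAY
GROUP BY eod.date
ORDER BY eod.date DESC;

-- Session counters for internal temporary tables.
-- Compare the values before and after running the query itself.
SHOW SESSION STATUS LIKE 'Created_tmp%';
```

If Created_tmp_disk_tables grows while the query runs, the temporary table did not fit in memory and was written to disk.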
Optimizing with a New Index
To address the inefficiencies, I created a new index:
CREATE INDEX idx_date_itemid_stableprice ON item_eod (date, item_id, stable_price);
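- Structure:
  - date is the leading column, enabling an efficient range scan for date >= '2024-09-25' and returning rows already ordered by date for the GROUP BY.
  - item_id is part of the index but is not referenced by this query, so it does not help with grouping here. The query groups by date only.
  - stable_price ensures that the index covers the query, avoiding the need to read table rows.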
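After creating the index, it is worth confirming that it has the intended column order and that the optimizer actually chooses it. A minimal sketch of such a check follows. The ANALYZE TABLE step is optional and only refreshes the statistics the optimizer works from.

```sql
-- Optional: refresh index statistics for the table.
ANALYZE TABLE item_eod;

-- Confirm the column order of the new index
-- (Seq_in_index 1, 2, 3 should be date, item_id, stable_price).
SHOW INDEX FROM item_eod
WHERE Key_name = 'idx_date_itemid_stableprice';

-- Check that the optimizer picks it: the key column should name the
-- index, and Extra should include "Using index" (a covering read).
EXPLAIN
SELECT eod.date, SUM(eod.stable_price) AS sumStablePrice
FROM item_eod eod
WHERE eod.date >= CURRENT_DATE - INTERVAL 90 DAY
GROUP BY eod.date
ORDER BY eod.date DESC;
```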
Improved Execution Plan
After adding the new index, the execution plan showed significant improvements:
- New Index: idx_date_itemid_stableprice (date, item_id, stable_price)
- Execution Plan:
-> Group aggregate: sum(eod.stable_price) (cost=344939.51 rows=93) (actual time=35.802..1544.939 rows=91 loops=1)
-> Filter: (eod.date >= DATE'2024-09-25') (cost=230749.51 rows=1141900) (actual time=0.043..1301.919 rows=2272445 loops=1)
-> Covering index range scan on eod using idx_date_itemid_stableprice over ('2024-09-25' <= date) (reverse) (cost=230749.51 rows=1141900) (actual time=0.042..1054.120 rows=2272445 loops=1)
Key Improvements:
- Index Fields: date, item_id, stable_price
- Order: Ascending by default, but scanned in reverse to match the ORDER BY eod.date DESC in the query.
- Execution Plan:
  - Covering Index Scan: Uses the new index efficiently, reading entries directly in the order required (date in descending order).
  - Filter Early: The condition date >= '2024-09-25' defines the range of the index scan, so entries outside the range are never read.
  - Group Aggregate: SUM(stable_price) is computed directly over the ordered index entries without creating a temporary table.
- Performance Characteristics:
  - Reduced Sorting Overhead: Since the index has date as the leading column, the query does not require additional sorting.
  - Efficient Range Scan: The range scan on date reads only matching entries. In this run nearly every row matched (2,272,445 rows), so the scan itself improved only modestly (about 1054 ms versus 1354 ms); most of the gain comes from avoiding the temporary table.
  - Covering Index: All columns the query reads (date and stable_price) are in the index, eliminating the need to access the table rows.
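To compare the two plans side by side on the same server, one option is an index hint that hides the new index from the optimizer for a single statement. This is a sketch rather than the method used for the numbers above; running each statement a few times helps reduce the effect of caching on the timings.

```sql
-- Baseline: ask the optimizer to ignore the new index for this statement.
EXPLAIN ANALYZE
SELECT eod.date, SUM(eod.stable_price) AS sumStablePrice
FROM item_eod eod IGNORE INDEX (idx_date_itemid_stableprice)
WHERE eod.date >= CURRENT_DATE - INTERVAL 90 DAY
GROUP BY eod.date
ORDER BY eod.date DESC;

-- Default behaviour: the optimizer is free to use the new index.
EXPLAIN ANALYZE
SELECT eod.date, SUM(eod.stable_price) AS sumStablePrice
FROM item_eod eod
WHERE eod.date >= CURRENT_DATE - INTERVAL 90 DAY
GROUP BY eod.date
ORDER BY eod.date DESC;
```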
Comparison of Key Metrics
Aspect | Current Index (item_id_date_idx) | New Index (idx_date_itemid_stableprice) |
---|---|---|
Filter Application | Applied late, after table scan. | Applied early, during index scan. |
Sorting | Performed after scanning rows. | Eliminated due to index order. |
Aggregation | Requires a temporary table. | Directly on indexed data. |
Scan Type | Full table scan (expensive). | Index range scan (efficient). |
Execution Time | 2725 ms | 1545 ms (about 43% less time). |
I/O and CPU Overhead | Higher due to full scans and sorting. | Lower due to optimized index usage. |
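The execution-time row can be checked with a little arithmetic, here written as a query using the two totals from the plans above (2725 ms and 1545 ms):

```sql
SELECT
    2725 - 1545                           AS ms_saved,        -- absolute saving
    ROUND((2725 - 1545) * 100 / 2725, 1)  AS pct_less_time,   -- relative saving
    ROUND(2725 / 1545, 2)                 AS speedup_factor;  -- old time / new time
```

This gives 1180 ms saved, about 43.3% less time, and a speedup factor of about 1.76.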
Key Takeaways
- Choosing the Right Index:
  - Ensure the leading column in the index matches the filtering condition.
  - Include columns required for grouping and aggregation to make the index covering.
- Understanding Execution Plans:
  - Use EXPLAIN or EXPLAIN ANALYZE to identify bottlenecks like full table scans, temporary tables, or sorting operations.
- Impact of Indexing:
  - Proper indexing can dramatically reduce query execution time and resource usage.
- Additional Optimizations:
  - If the dataset is very large, consider partitioning by date to further improve performance.
  - Pre-aggregate data in a summary table for repetitive queries.
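As an illustration of the summary-table idea, here is a minimal sketch. The table item_eod_daily_summary, its column names, and the DECIMAL(20, 4) type are all assumptions of mine; the real type should match whatever stable_price uses. How and when the refresh runs (for example from a scheduled job) is also left open.

```sql
-- Hypothetical summary table: one row per day.
CREATE TABLE item_eod_daily_summary (
    summary_date     DATE NOT NULL PRIMARY KEY,
    sum_stable_price DECIMAL(20, 4)  -- assumed type, adjust to stable_price
);

-- Refresh the most recent days. REPLACE overwrites any existing row
-- with the same summary_date, so the statement can be re-run safely.
REPLACE INTO item_eod_daily_summary (summary_date, sum_stable_price)
SELECT eod.date, SUM(eod.stable_price)
FROM item_eod eod
WHERE eod.date >= CURRENT_DATE - INTERVAL 2 DAY
GROUP BY eod.date;

-- The 90-day report then reads at most one small row per day.
SELECT summary_date, sum_stable_price
FROM item_eod_daily_summary
WHERE summary_date >= CURRENT_DATE - INTERVAL 90 DAY
ORDER BY summary_date DESC;
```

The trade-off is freshness: the report only reflects item_eod as of the last refresh, so this fits data that changes once per day better than data that is updated continuously.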
Conclusion
Optimizing queries in MySQL often involves understanding how the database processes queries and designing indexes that align with query patterns. By analyzing execution plans and experimenting with index structures, you can achieve significant performance gains. In this case, the new index made a costly GROUP BY query considerably more efficient, cutting the execution time by about 43% (from 2725 ms to 1545 ms) and eliminating the temporary table and the extra sort.
If you have similar optimization experiences or questions, feel free to share in the comments!
Reference: GROUP BY Optimization