Optimizing GROUP BY in MySQL: Insights from Real-Life Query Analysis

My current query:

SELECT
    eod.date,
    SUM(eod.stable_price) AS sumStablePrice
FROM
    item_eod eod
WHERE
    eod.date >= CURRENT_DATE - INTERVAL 90 DAY
GROUP BY
    eod.date
ORDER BY
    eod.date DESC;

Introduction

Query optimization is a cornerstone of efficient database management, especially for large datasets. Today, I explored how to optimize a costly GROUP BY query in MySQL by adding an appropriate index. This blog dives into the details of my learning process, highlighting the impact of indexing on performance.


Understanding the Problem

The original query calculates the sum of one column (stable_price) per day, grouped by date and restricted to the last 90 days. The execution plan revealed inefficiencies:

  • Execution Plan:
-> Sort: eod.date DESC  (actual time=2725.176..2725.183 rows=91 loops=1)
    -> Table scan on <temporary>  (actual time=2725.123..2725.136 rows=91 loops=1)
        -> Aggregate using temporary table  (actual time=2725.121..2725.121 rows=91 loops=1)
            -> Filter: (eod.date >= DATE'2024-09-25')  (cost=232440.49 rows=761190) (actual time=12.863..1637.540 rows=2272445 loops=1)
                -> Table scan on eod  (cost=232440.49 rows=2283800) (actual time=0.067..1353.590 rows=2296891 loops=1)

This execution plan highlights the following:

  • A full table scan is performed: all 2,296,891 rows are read and the date filter is evaluated against each of them.
  • The query aggregates into a temporary table and then sorts the grouped result.
  • The total time taken is about 2725 ms.

Current Index: item_id_date_idx (item_id, date)

  • Index Fields: item_id, date
  • Order: Ascending order by default (no DESC).
  • Execution Plan:
    • Aggregate Using Temporary Table: The execution plan creates a temporary table to perform the aggregation.
    • Table Scan: The query scans the entire table and checks date >= '2024-09-25' on every row. The filter cannot limit which rows are read; it only discards rows after they have been read.
    • Sort: The 91 aggregated rows are sorted by eod.date DESC at the end.
  • Performance Characteristics:
    • Sorting: The sort runs on only 91 rows and finishes in a fraction of a millisecond, so it is not the bottleneck.
    • Scan and Aggregation: Scanning and filtering the table takes about 1.6 s of the 2.7 s total, and aggregating the 2,272,445 matching rows into the temporary table accounts for most of the remaining 1.1 s.
    • Effectiveness of Index: The index on (item_id, date) is not used at all. Because date is its second column, a range condition on date alone gives the optimizer nothing to seek on, and it chose a full table scan instead.
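A plan in this tree format, with "actual time" figures, is what EXPLAIN ANALYZE prints (available in MySQL 8.0.18 and later). A minimal sketch of how it can be reproduced, using the table and query from this post:

```sql
-- List the indexes that currently exist on the table.
SHOW INDEX FROM item_eod;

-- Execute the query and print the plan with estimated costs and actual timings.
-- Note: EXPLAIN ANALYZE really runs the statement, so it takes as long as the query itself.
EXPLAIN ANALYZE
SELECT
    eod.date,
    SUM(eod.stable_price) AS sumStablePrice
FROM
    item_eod eod
WHERE
    eod.date >= CURRENT_DATE - INTERVAL 90 DAY
GROUP BY
    eod.date
ORDER BY
    eod.date DESC;
```

Plain EXPLAIN (without ANALYZE) shows only the estimated plan and does not execute the query, which is the safer choice on a busy production server.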

Why GROUP BY Can Be Costly

When MySQL processes a GROUP BY query:

  1. Rows are filtered based on the WHERE clause.
  2. The result set is grouped and aggregated.
  3. Sorting may occur if the results are not already ordered.

Without an optimized index, the query must:

  • Read all matching rows from the table.
  • Use temporary tables and perform sorting/aggregation in memory or on disk, which is costly.

Optimizing with a New Index

To address the inefficiencies, I created a new index:

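-- The index is created on the table itself (item_eod); eod is only the alias used in the query.
CREATE INDEX idx_date_itemid_stableprice ON item_eod (date, item_id, stable_price);
  • Structure:
    • date is the leading column, so the condition date >= '2024-09-25' becomes a range scan on the index, and rows come out already ordered by date, which is exactly what GROUP BY date and ORDER BY date need.
    • item_id is not referenced by this query (the grouping is on date, not item_id), so it does not help here. An index on (date, stable_price) alone would also cover this query.
    • stable_price makes the index covering for this query, avoiding the need to read table rows.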
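Since the query itself only reads date and stable_price, a narrower covering index is one variation worth testing. The following is a sketch under that assumption, not the index used for the measurements in this post; the name idx_date_stableprice is hypothetical.

```sql
-- Hypothetical alternative: cover only the two columns the query reads.
-- A narrower index has smaller entries, but whether it is measurably faster
-- has to be checked on the real data.
CREATE INDEX idx_date_stableprice ON item_eod (date, stable_price);

-- Check which index the optimizer picks once both exist (estimated plan only).
EXPLAIN FORMAT=TREE
SELECT
    eod.date,
    SUM(eod.stable_price) AS sumStablePrice
FROM
    item_eod eod
WHERE
    eod.date >= CURRENT_DATE - INTERVAL 90 DAY
GROUP BY
    eod.date
ORDER BY
    eod.date DESC;
```

Keeping item_id in the index may still be worthwhile if other queries filter or group on it after date; every extra index also adds write overhead, so only one of the two should normally be kept.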

Improved Execution Plan

After adding the new index, the execution plan showed significant improvements:

  • New Index: idx_date_itemid_stableprice (date, item_id, stable_price)
  • Execution Plan:
-> Group aggregate: sum(eod.stable_price)  (cost=344939.51 rows=93) (actual time=35.802..1544.939 rows=91 loops=1)
    -> Filter: (eod.date >= DATE'2024-09-25')  (cost=230749.51 rows=1141900) (actual time=0.043..1301.919 rows=2272445 loops=1)
        -> Covering index range scan on eod using idx_date_itemid_stableprice over ('2024-09-25' <= date) (reverse)  (cost=230749.51 rows=1141900) (actual time=0.042..1054.120 rows=2272445 loops=1)
Key Improvements:

Index Fields: date, item_id, stable_price
Order: Ascending by default but used in reverse (DESC) order in the query.
Execution Plan:

  • Covering Index Scan: Uses the new index and reads it in reverse, so rows arrive in the order required (date descending).
  • Filter Early: The condition date >= '2024-09-25' is used as the range bound of the index scan, so only matching index entries are read. In this data set the saving from the range itself is small, because 2,272,445 of the roughly 2.3 million rows fall inside the 90-day window.
  • Group Aggregate: Aggregates (SUM(stable_price)) are computed as a streaming pass over the ordered index entries, without creating a temporary table.

Performance Characteristics:

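  • No Separate Sort: Since date is the leading column and the index is scanned in reverse, the groups are produced already in date DESC order and the plan contains no sort step.
  • Range Scan: The range scan replaces the full table scan. Most of the gain comes from reading compact index entries instead of full rows and from dropping the temporary table, not from skipping rows, since almost all rows match the date range.
  • Covering Index: Both columns the query reads (date and stable_price) are in the index, eliminating the need to access the table rows. item_id is also in the index but is not needed by this query.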
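To compare the two plans without dropping the new index, an index hint can be used to make the optimizer ignore it for a single statement. This is a sketch for side-by-side testing; the timings it produces depend on how much of the data is already cached, so each variant is best run more than once.

```sql
-- "Before" plan: tell the optimizer not to use the new index for this statement.
EXPLAIN ANALYZE
SELECT
    eod.date,
    SUM(eod.stable_price) AS sumStablePrice
FROM
    item_eod eod IGNORE INDEX (idx_date_itemid_stableprice)
WHERE
    eod.date >= CURRENT_DATE - INTERVAL 90 DAY
GROUP BY
    eod.date
ORDER BY
    eod.date DESC;

-- "After" plan: same query with no hint, so the optimizer is free to pick the new index.
EXPLAIN ANALYZE
SELECT
    eod.date,
    SUM(eod.stable_price) AS sumStablePrice
FROM
    item_eod eod
WHERE
    eod.date >= CURRENT_DATE - INTERVAL 90 DAY
GROUP BY
    eod.date
ORDER BY
    eod.date DESC;
```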

Comparison of Key Metrics

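Aspect               | Current Index (item_id_date_idx)                     | New Index (idx_date_itemid_stableprice)
---------------------+------------------------------------------------------+-------------------------------------------------
Filter Application   | Checked on every row of a full table scan.           | Used as the range bound of the index scan.
Sorting              | Separate sort of the 91 grouped rows (negligible).   | Eliminated; the index delivers date order.
Aggregation          | Requires a temporary table.                          | Streamed directly over the index entries.
Scan Type            | Full table scan (index not used).                    | Covering index range scan.
Execution Time       | 2725 ms                                              | 1545 ms (about 43% less time).
I/O and CPU Overhead | Higher: full rows read, temporary table built.       | Lower: only index entries read, no temporary table.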
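The percentage in the Execution Time row follows directly from the two measured totals (2725 ms and 1545 ms). As a quick check of the arithmetic:

```sql
-- Time saved, relative reduction, and speedup factor from the two measurements.
SELECT
    2725 - 1545                          AS saved_ms,      -- 1180
    ROUND((2725 - 1545) / 2725 * 100, 1) AS pct_less_time, -- about 43.3
    ROUND(2725 / 1545, 2)                AS speedup;       -- about 1.76
```

In other words, the query now takes a little over half of its original time, which is a 1.76x speedup rather than a full 2x.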

Key Takeaways

  1. Choosing the Right Index:
    • Ensure the leading column in the index matches the filtering condition.
    • Include columns required for grouping and aggregation to make the index covering.
  2. Understanding Execution Plans:
    • Use EXPLAIN or EXPLAIN ANALYZE to identify bottlenecks like full table scans, temporary tables, or sorting operations.
  3. Impact of Indexing:
    • Proper indexing can dramatically reduce query execution time and resource usage.
  4. Additional Optimizations:
    • If the dataset is very large, consider partitioning by date to further improve performance.
    • Pre-aggregate data in a summary table for repetitive queries.
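The summary-table idea from the last point could look roughly like the sketch below. The table name item_eod_daily_sum, the DECIMAL(20, 4) type, and the two-day refresh window are all assumptions for illustration; the real column type should match stable_price, and the refresh would typically be run from a scheduled job.

```sql
-- Hypothetical summary table: one row per day.
CREATE TABLE item_eod_daily_sum (
    `date`           DATE NOT NULL,
    sum_stable_price DECIMAL(20, 4) NULL,  -- assumed type; adjust to match stable_price
    PRIMARY KEY (`date`)
);

-- Refresh the most recent days. REPLACE overwrites any row whose
-- primary key (the date) already exists, so the job can be re-run safely.
REPLACE INTO item_eod_daily_sum (`date`, sum_stable_price)
SELECT
    eod.date,
    SUM(eod.stable_price)
FROM
    item_eod eod
WHERE
    eod.date >= CURRENT_DATE - INTERVAL 2 DAY
GROUP BY
    eod.date;

-- The 90-day report then reads about 91 small rows instead of millions.
SELECT
    `date`,
    sum_stable_price AS sumStablePrice
FROM
    item_eod_daily_sum
WHERE
    `date` >= CURRENT_DATE - INTERVAL 90 DAY
ORDER BY
    `date` DESC;
```

The trade-off is freshness: the summary is only as current as its last refresh, so this fits reports that tolerate a short delay.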

Conclusion

Optimizing queries in MySQL often involves understanding how the database processes queries and designing indexes that align with query patterns. By analyzing execution plans and experimenting with index structures, you can achieve significant performance gains. In this case, the new index replaced a full table scan and a temporary table with a covering index range scan, cutting the execution time of a costly GROUP BY query by about 43% (from 2725 ms to 1545 ms).

If you have similar optimization experiences or questions, feel free to share in the comments!

Reference: GROUP BY Optimization
