Tuesday, September 12, 2006

MySQL "group-wise limiting"

Time to really geek out. I've run into this class of problem a bunch of times over the years and I've never seen an answer that I like.


I've got a query that I'm using GROUP BY for. Is is possible to limit the rows per group.


Every SQL dialect that I've seen makes it possible to limit the rows in a results set. Usually something like:



mysql> select * from sometable limit 10;



and you'll get only the first ten rows. However, when you introduce grouping it all falls apart. Consider this schema representing (denormalized) grades for different students from two classes:



create temporary table groupwise_select(
id int not null primary key auto_increment,
class_id int not null,
name varchar(255) not null,
grade float not null);
insert into groupwise_select(class_id, name, grade) values(1, 'Andy', 100.0);
insert into groupwise_select(class_id, name, grade) values(1, 'Andy', 98.0);
insert into groupwise_select(class_id, name, grade) values(1, 'Andy', 99.0);
insert into groupwise_select(class_id, name, grade) values(1, 'Andy', 97.0);
insert into groupwise_select(class_id, name, grade) values(1, 'Bob', 97.0);
insert into groupwise_select(class_id, name, grade) values(1, 'Bob', 98.0);
insert into groupwise_select(class_id, name, grade) values(1, 'Bob', 65.0);
insert into groupwise_select(class_id, name, grade) values(1, 'Henry', 32.0);
insert into groupwise_select(class_id, name, grade) values(1, 'Henry', 33.0);
insert into groupwise_select(class_id, name, grade) values(1, 'Henry', 34.0);
insert into groupwise_select(class_id, name, grade) values(1, 'Henry', 35.0);
insert into groupwise_select(class_id, name, grade) values(1, 'Henry', 100.0);
insert into groupwise_select(class_id, name, grade) values(1, 'Amelia', 73.0);
insert into groupwise_select(class_id, name, grade) values(2, 'Valerie', 100.0);
insert into groupwise_select(class_id, name, grade) values(2, 'Valerie', 100.0);
insert into groupwise_select(class_id, name, grade) values(2, 'Valerie', 100.0);
insert into groupwise_select(class_id, name, grade) values(2, 'Valerie', 100.0);
insert into groupwise_select(class_id, name, grade) values(2, 'Valerie', 100.0);
insert into groupwise_select(class_id, name, grade) values(2, 'Hans', 13.0);
insert into groupwise_select(class_id, name, grade) values(2, 'Hans', 14.0);
insert into groupwise_select(class_id, name, grade) values(2, 'Hans', 15.0);
insert into groupwise_select(class_id, name, grade) values(2, 'Hans', 16.0);
insert into groupwise_select(class_id, name, grade) values(2, 'Oscar', 99.0);
insert into groupwise_select(class_id, name, grade) values(2, 'Oscar', 99.0);
insert into groupwise_select(class_id, name, grade) values(2, 'Oscar', 99.0);
insert into groupwise_select(class_id, name, grade) values(2, 'Oscar', 104.0);



Your assignment is to return the two best students (highest average) from each class. OK, great. Lets start with a simple grouping query to return the average for each student


mysql> select class_id, name, avg(grade) as average from groupwise_select group by name, class_id order by class_id, average desc;
+----------+---------+-----------------+
| class_id | name | average |
+----------+---------+-----------------+
| 1 | Andy | 98.5 |
| 1 | Bob | 86.666666666667 |
| 1 | Amelia | 73 |
| 1 | Henry | 46.8 |
| 2 | Oscar | 100.25 |
| 2 | Valerie | 100 |
| 2 | Hans | 14.5 |
+----------+---------+-----------------+
7 rows in set (0.01 sec)



Good we now know that the four students we need to return as the answer are Andy (98.5), Bob(86.7), Oscar(100.25), and Valerie(100). This should be easy let's just use a LIMIT



mysql> select class_id, name, avg(grade) from groupwise_select group by name, class_id LIMIT 2;
+----------+------+-----------------+
| class_id | name | average |
+----------+------+-----------------+
| 1 | Andy | 98.5 |
| 1 | Bob | 86.666666666667 |
+----------+------+-----------------+
2 rows in set (0.00 sec)



Hmmm... That only returned the first two rows from our previous query. Yep, that's because the LIMIT is applied as the last thing the query does. Now you'll probably bang your head against the wall trying all different variations of GROUP BY, ORDER BY, and LIMIT at this point and ultimately you'll end up doing the LIMITing in application code instead of SQL concluding that it's just not possible.

However, there is a solution! It uses MySQL's user defined variables and sub-selects. Let's jump to the answer:



mysql> set @idx=0; set @cur_class=0;
mysql> SELECT class_id, name, average FROM (
SELECT r.class_id, r.name, r.grade,
IF(@cur_class != r.class_id, @idx:=1, @idx:=@idx+1) AS row_index,
IF(@cur_class != r.class_id, @cur_class:=r.class_id, 0) AS discard
FROM (SELECT class_id, name, avg(grade) AS average FROM
groupwise_select GROUP BY name, class_id
ORDER BY class_id, average DESC) AS r
HAVING row_index <= 2) AS r2;

+----------+---------+------------------+
| class_id | name | average |
+----------+---------+------------------+
| 1 | Andy | 98.5 |
| 1 | Bob | 86.6666666666667 |
| 2 | Oscar | 100.25 |
| 2 | Valerie | 100 |
+----------+---------+------------------+
4 rows in set (0.01 sec)



Awesome! So how does that work. There are probably a few key insights here:

  1. sub-selects must evaluate in their entirety before the outer-select (in this case our query returns the averages for ALL of the students sorted first by class_id and then by average)

  2. The true and false expression parts of the IF() function can be used to set user-defined variables

  3. IF() function expressions included in a SELECT statement are evaluated left to right

  4. You can't use a WHERE clause on a computed column (like row_index) but you can include it in the HAVING clause (I don't presume to really understand that)



So given those insights we first retrieve ALL of the GROUPed averages for the students order by class_id and then average, second we number the results for each class, and last we trim out the rows that have a row number less than or equal to our target (2 students per class). Voilla! Group-wise limiting.

4 comments:

  1. Hah! I still have all of those "technotes" on my desktop ;) "Semi-good"? I'm not sure if it gets any better than those days. I still wonder how that work environment emerged and whether I'll ever get to work like that again.

    ReplyDelete
  2. P.S. I checked and it works if one of the students took both classes as well. For example give 'Oscar' a 104 average in class 1 and he shows up at the top of the class 1 results as well.

    ReplyDelete
  3. Hi, just wondering if there's such a thing as GROUP BY LIMIT?

    ReplyDelete
  4. Smart Pad... Nope there is no "GROUP BY LIMIT"

    ReplyDelete