Fast group rank() function

debugcn 投稿 Dev

ParoX

There are various ways people try to emulate MSSQL RANK() or ROW_NUMBER() functions in MySQL, but all of them I've tried so far are slow.

I have a table that looks like this:

CREATE TABLE ratings
    (`id` int, `category` varchar(1), `rating` int)
;

INSERT INTO ratings
    (`id`, `category`, `rating`)
VALUES
    (3, '*', 54),
    (4, '*', 45),
    (1, '*', 43),
    (2, '*', 24),
    (2, 'A', 68),
    (3, 'A', 43),
    (1, 'A', 12),
    (3, 'B', 22),
    (4, 'B', 22),
    (4, 'C', 44)
;

Except it has 220,000 records. There are about 90,000 unique id's.

I wanted to rank the id's first by looking at the categories which were not * where a higher rating is a lower rank.

SELECT g1.id,
       g1.category,
       g1.rating,
       Count(*) AS rank
FROM ratings AS g1
JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
AND g1.category = g2.category
WHERE g1.category != '*'
GROUP BY g1.id,
         g1.category,
         g1.rating
ORDER BY g1.category,
         rank

Output:

id  category    rating  rank
2   A   68  1
3   A   43  2
1   A   12  3
4   B   22  1
3   B   22  2
4   C   44  1

Then I wanted to take the smallest rank an id had, and average that with the rank they have within the * category. Giving a total query of:

SELECT X1.id,
       (X1.rank + X2.minrank) / 2 AS OverallRank
FROM
  (SELECT g1.id,
          g1.category,
          g1.rating,
          Count(*) AS rank
   FROM ratings AS g1
   JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
   AND g1.category = g2.category
   WHERE g1.category = '*'
   GROUP BY g1.id,
            g1.category,
            g1.rating
   ORDER BY g1.category,
            rank) X1
JOIN
  (SELECT id,
          Min(rank) AS MinRank
   FROM
     (SELECT g1.id,
             g1.category,
             g1.rating,
             Count(*) AS rank
      FROM ratings AS g1
      JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
      AND g1.category = g2.category
      WHERE g1.category != '*'
      GROUP BY g1.id,
               g1.category,
               g1.rating
      ORDER BY g1.category,
               rank) X
   GROUP BY id) X2 ON X1.id = X2.id
ORDER BY overallrank

Giving me

id  OverallRank
3   1.5000
4   1.5000
2   2.5000
1   3.0000

This query is correct and the output I want, but it just hangs on my real table of 220,000 records. How can I optimize it? My real table has an index on id,rating and category and id,category

Edit:

Result of SHOW CREATE TABLE ratings:

CREATE TABLE `rating` (
     `id` int(11) NOT NULL,
     `category` varchar(255) NOT NULL,
     `rating` int(11) NOT NULL DEFAULT '1500',
     `rd` int(11) NOT NULL DEFAULT '350',
     `vol` float NOT NULL DEFAULT '0.06',
     `wins` int(11) NOT NULL,
     `losses` int(11) NOT NULL,
     `streak` int(11) NOT NULL DEFAULT '0',
     PRIMARY KEY (`streak`,`rd`,`id`,`category`),
     UNIQUE KEY `id_category` (`id`,`category`),
     KEY `rating` (`rating`,`rd`),
     KEY `streak_idx` (`streak`),
     KEY `category_idx` (`category`),
     KEY `id_rating_idx` (`id`,`rating`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

The PRIMARY KEY is the most common use case of queries to this table, that is why it's the clustered key. It's worth noting that the server is a raid 10 of SSDs with a 9GB/s FIO random read. So I don't suspect the indices not being clustered will affect much.

Output of (select count(distinct category) from ratings) is 50

In the interest that this could be how the data is or an oversight on me, I am included the export of the entire table. It is only 200KB zipped: https://www.dropbox.com/s/p3iv23zi0uzbekv/ratings.zip?dl=0

The first query takes 27 seconds to run

Paul Spiegel

You can use temporary tables with an AUTO_INCREMENT column to generate ranks (row number).

For example - to generate ranks for the '*' category:

drop temporary table if exists tmp_main_cat_rank;
create temporary table tmp_main_cat_rank (
    rank int unsigned auto_increment primary key,
    id int NOT NULL
) engine=memory
    select null as rank, id
    from ratings r
    where r.category = '*'
    order by r.category, r.rating desc, r.id desc;

This runs in something like 30 msec. While your approach with the selfjoin takes 45 seconds on my machine. Even with a new index on (category, rating, id) it still takes 14 seconds to run.

To generate ranks per group (per category) is a bit more complicated. We can still use an AUTO_INCREMENT column, but will need to calculate and substract an offset per category:

drop temporary table if exists tmp_pos;
create temporary table tmp_pos (
    pos int unsigned auto_increment primary key,
    category varchar(50) not null,
    id int NOT NULL
) engine=memory
    select null as pos, category, id
    from ratings r
    where r.category <> '*'
    order by r.category, r.rating desc, r.id desc;

drop temporary table if exists tmp_cat_offset;
create temporary table tmp_cat_offset engine=memory
    select category, min(pos) - 1 as `offset`
    from tmp_pos
    group by category;

select t.id, min(t.pos - o.offset) as min_rank
from tmp_pos t
join tmp_cat_offset o using(category)
group by t.id

This runs in about 220 msec. The selfjoin solution takes 42 sec or 13 sec with the new index.

Now you just need to combine the last query with the first temp table, to get your final result:

select t1.id, (t1.min_rank + t2.rank) / 2 as OverallRank
from (
    select t.id, min(t.pos - o.offset) as min_rank
    from tmp_pos t
    join tmp_cat_offset o using(category)
    group by t.id
) t1
join tmp_main_cat_rank t2 using(id);

Overall runtime is ~280 msec without an additional index and ~240 msec with an index on (category, rating, id).

A note to the selfjoin approach: It's an elegant solution and performs fine with a small group size. It's fast with an average group size <= 2. It can be acceptable for a group size of 10. But you have an average group size 447 (count(*) / count(distinct category)). That means every row is joined with 447 other rows (on average). You can see the impact by removing the group by clause:

SELECT Count(*)
FROM ratings AS g1
JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
AND g1.category = g2.category
WHERE g1.category != '*'

The result is more than 10M rows.

However - with an index on (category, rating, id) your query runs in 33 seconds on my machine.

この記事はインターネットから収集されたものであり、転載の際にはソースを示してください。

侵害の場合は、連絡してください[email protected]

編集2021-05-30

コメントを追加

サインイン

分類Dev

Related 関連記事

記事

Fast group rank() function

Fast group rank() function

Pandas -- rank in different orders with a group

group by rank continuous date by pandas

MySQL 8 : RANK function in UPDATE query impossible?

group_by operation in dplyr vs data.table for fast implementation

NLS Function By Group

MYSQL query with group function

Using ifelse function by group

SQL-複数のWHEREおよびGROUP BYを指定したRANK（）

What's a fast ideal hash function for int32?

Array intersection as aggregate function for group by

Having trouble with Group By function in SQL

R dplyr chaining group by into function

Use RankX function to calculate date rank per Fiscal Year per Entity ID

'not a single-group group function' in oracle when using LISTAGG

Why I got Not a single-group group function in sql

Excel Rank vs SQL Rank

groupByKey in Haskell - How to group items in a list by a function?

Keep Column which is not in aggregate function in group by statement

Conditional group by with window function in Snowflake query

MongoDB Aggregation - Possible to have an OR operator in a group function?

Access Group By Month Names Using Format Function

in python why if rank: is faster than if rank != 0:

Rank based on several variables

OracleSQL-DENSE_RANK

Rank with Ties in R

Numpy rank 1 arrays

Rank a particlar column

How to rank a column with a condition

Setting rank based on query