Fast group rank() function

ParoX

There are various ways people try to emulate MSSQL RANK() or ROW_NUMBER() functions in MySQL, but all of them I've tried so far are slow.

I have a table that looks like this:

CREATE TABLE ratings
    (`id` int, `category` varchar(1), `rating` int)
;

INSERT INTO ratings
    (`id`, `category`, `rating`)
VALUES
    (3, '*', 54),
    (4, '*', 45),
    (1, '*', 43),
    (2, '*', 24),
    (2, 'A', 68),
    (3, 'A', 43),
    (1, 'A', 12),
    (3, 'B', 22),
    (4, 'B', 22),
    (4, 'C', 44)
;

Except it has 220,000 records. There are about 90,000 unique id's.

I wanted to rank the id's first by looking at the categories which were not * where a higher rating is a lower rank.

SELECT g1.id,
       g1.category,
       g1.rating,
       Count(*) AS rank
FROM ratings AS g1
JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
AND g1.category = g2.category
WHERE g1.category != '*'
GROUP BY g1.id,
         g1.category,
         g1.rating
ORDER BY g1.category,
         rank

Output:

id  category    rating  rank
2   A   68  1
3   A   43  2
1   A   12  3
4   B   22  1
3   B   22  2
4   C   44  1

Then I wanted to take the smallest rank an id had, and average that with the rank they have within the * category. Giving a total query of:

SELECT X1.id,
       (X1.rank + X2.minrank) / 2 AS OverallRank
FROM
  (SELECT g1.id,
          g1.category,
          g1.rating,
          Count(*) AS rank
   FROM ratings AS g1
   JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
   AND g1.category = g2.category
   WHERE g1.category = '*'
   GROUP BY g1.id,
            g1.category,
            g1.rating
   ORDER BY g1.category,
            rank) X1
JOIN
  (SELECT id,
          Min(rank) AS MinRank
   FROM
     (SELECT g1.id,
             g1.category,
             g1.rating,
             Count(*) AS rank
      FROM ratings AS g1
      JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
      AND g1.category = g2.category
      WHERE g1.category != '*'
      GROUP BY g1.id,
               g1.category,
               g1.rating
      ORDER BY g1.category,
               rank) X
   GROUP BY id) X2 ON X1.id = X2.id
ORDER BY overallrank

Giving me

id  OverallRank
3   1.5000
4   1.5000
2   2.5000
1   3.0000

This query is correct and the output I want, but it just hangs on my real table of 220,000 records. How can I optimize it? My real table has an index on id,rating and category and id,category

Edit:

Result of SHOW CREATE TABLE ratings:

CREATE TABLE `rating` (
     `id` int(11) NOT NULL,
     `category` varchar(255) NOT NULL,
     `rating` int(11) NOT NULL DEFAULT '1500',
     `rd` int(11) NOT NULL DEFAULT '350',
     `vol` float NOT NULL DEFAULT '0.06',
     `wins` int(11) NOT NULL,
     `losses` int(11) NOT NULL,
     `streak` int(11) NOT NULL DEFAULT '0',
     PRIMARY KEY (`streak`,`rd`,`id`,`category`),
     UNIQUE KEY `id_category` (`id`,`category`),
     KEY `rating` (`rating`,`rd`),
     KEY `streak_idx` (`streak`),
     KEY `category_idx` (`category`),
     KEY `id_rating_idx` (`id`,`rating`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

The PRIMARY KEY is the most common use case of queries to this table, that is why it's the clustered key. It's worth noting that the server is a raid 10 of SSDs with a 9GB/s FIO random read. So I don't suspect the indices not being clustered will affect much.

Output of (select count(distinct category) from ratings) is 50

In the interest that this could be how the data is or an oversight on me, I am included the export of the entire table. It is only 200KB zipped: https://www.dropbox.com/s/p3iv23zi0uzbekv/ratings.zip?dl=0

The first query takes 27 seconds to run

Paul Spiegel

You can use temporary tables with an AUTO_INCREMENT column to generate ranks (row number).

For example - to generate ranks for the '*' category:

drop temporary table if exists tmp_main_cat_rank;
create temporary table tmp_main_cat_rank (
    rank int unsigned auto_increment primary key,
    id int NOT NULL
) engine=memory
    select null as rank, id
    from ratings r
    where r.category = '*'
    order by r.category, r.rating desc, r.id desc;

This runs in something like 30 msec. While your approach with the selfjoin takes 45 seconds on my machine. Even with a new index on (category, rating, id) it still takes 14 seconds to run.

To generate ranks per group (per category) is a bit more complicated. We can still use an AUTO_INCREMENT column, but will need to calculate and substract an offset per category:

drop temporary table if exists tmp_pos;
create temporary table tmp_pos (
    pos int unsigned auto_increment primary key,
    category varchar(50) not null,
    id int NOT NULL
) engine=memory
    select null as pos, category, id
    from ratings r
    where r.category <> '*'
    order by r.category, r.rating desc, r.id desc;

drop temporary table if exists tmp_cat_offset;
create temporary table tmp_cat_offset engine=memory
    select category, min(pos) - 1 as `offset`
    from tmp_pos
    group by category;

select t.id, min(t.pos - o.offset) as min_rank
from tmp_pos t
join tmp_cat_offset o using(category)
group by t.id

This runs in about 220 msec. The selfjoin solution takes 42 sec or 13 sec with the new index.

Now you just need to combine the last query with the first temp table, to get your final result:

select t1.id, (t1.min_rank + t2.rank) / 2 as OverallRank
from (
    select t.id, min(t.pos - o.offset) as min_rank
    from tmp_pos t
    join tmp_cat_offset o using(category)
    group by t.id
) t1
join tmp_main_cat_rank t2 using(id);

Overall runtime is ~280 msec without an additional index and ~240 msec with an index on (category, rating, id).

A note to the selfjoin approach: It's an elegant solution and performs fine with a small group size. It's fast with an average group size <= 2. It can be acceptable for a group size of 10. But you have an average group size 447 (count(*) / count(distinct category)). That means every row is joined with 447 other rows (on average). You can see the impact by removing the group by clause:

SELECT Count(*)
FROM ratings AS g1
JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
AND g1.category = g2.category
WHERE g1.category != '*'

The result is more than 10M rows.

However - with an index on (category, rating, id) your query runs in 33 seconds on my machine.

この記事はインターネットから収集されたものであり、転載の際にはソースを示してください。

侵害の場合は、連絡してください[email protected]

編集
0

コメントを追加

0

関連記事

分類Dev

Pandas -- rank in different orders with a group

分類Dev

group by rank continuous date by pandas

分類Dev

MySQL 8 : RANK function in UPDATE query impossible?

分類Dev

group_by operation in dplyr vs data.table for fast implementation

分類Dev

NLS Function By Group

分類Dev

MYSQL query with group function

分類Dev

Using ifelse function by group

分類Dev

SQL-複数のWHEREおよびGROUP BYを指定したRANK()

分類Dev

What's a fast ideal hash function for int32?

分類Dev

Array intersection as aggregate function for group by

分類Dev

Having trouble with Group By function in SQL

分類Dev

R dplyr chaining group by into function

分類Dev

Use RankX function to calculate date rank per Fiscal Year per Entity ID

分類Dev

'not a single-group group function' in oracle when using LISTAGG

分類Dev

Why I got Not a single-group group function in sql

分類Dev

Excel Rank vs SQL Rank

分類Dev

groupByKey in Haskell - How to group items in a list by a function?

分類Dev

Keep Column which is not in aggregate function in group by statement

分類Dev

Conditional group by with window function in Snowflake query

分類Dev

MongoDB Aggregation - Possible to have an OR operator in a group function?

分類Dev

Access Group By Month Names Using Format Function

分類Dev

in python why if rank: is faster than if rank != 0:

分類Dev

Rank based on several variables

分類Dev

OracleSQL-DENSE_RANK

分類Dev

Rank with Ties in R

分類Dev

Numpy rank 1 arrays

分類Dev

Rank a particlar column

分類Dev

How to rank a column with a condition

分類Dev

Setting rank based on query