There are various ways people try to emulate MSSQL RANK() or ROW_NUMBER() functions in MySQL, but all of them I've tried so far are slow.
I have a table that looks like this:
CREATE TABLE ratings
(`id` int, `category` varchar(1), `rating` int)
;
INSERT INTO ratings
(`id`, `category`, `rating`)
VALUES
(3, '*', 54),
(4, '*', 45),
(1, '*', 43),
(2, '*', 24),
(2, 'A', 68),
(3, 'A', 43),
(1, 'A', 12),
(3, 'B', 22),
(4, 'B', 22),
(4, 'C', 44)
;
Except it has 220,000 records. There are about 90,000 unique id's.
I wanted to rank the id's first by looking at the categories which were not *
where a higher rating is a lower rank.
SELECT g1.id,
g1.category,
g1.rating,
Count(*) AS rank
FROM ratings AS g1
JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
AND g1.category = g2.category
WHERE g1.category != '*'
GROUP BY g1.id,
g1.category,
g1.rating
ORDER BY g1.category,
rank
Output:
id category rating rank
2 A 68 1
3 A 43 2
1 A 12 3
4 B 22 1
3 B 22 2
4 C 44 1
Then I wanted to take the smallest rank an id had, and average that with the rank they have within the * category. Giving a total query of:
SELECT X1.id,
(X1.rank + X2.minrank) / 2 AS OverallRank
FROM
(SELECT g1.id,
g1.category,
g1.rating,
Count(*) AS rank
FROM ratings AS g1
JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
AND g1.category = g2.category
WHERE g1.category = '*'
GROUP BY g1.id,
g1.category,
g1.rating
ORDER BY g1.category,
rank) X1
JOIN
(SELECT id,
Min(rank) AS MinRank
FROM
(SELECT g1.id,
g1.category,
g1.rating,
Count(*) AS rank
FROM ratings AS g1
JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
AND g1.category = g2.category
WHERE g1.category != '*'
GROUP BY g1.id,
g1.category,
g1.rating
ORDER BY g1.category,
rank) X
GROUP BY id) X2 ON X1.id = X2.id
ORDER BY overallrank
Giving me
id OverallRank
3 1.5000
4 1.5000
2 2.5000
1 3.0000
This query is correct and the output I want, but it just hangs on my real table of 220,000 records. How can I optimize it? My real table has an index on id,rating
and category
and id,category
Edit:
Result of SHOW CREATE TABLE ratings
:
CREATE TABLE `rating` (
`id` int(11) NOT NULL,
`category` varchar(255) NOT NULL,
`rating` int(11) NOT NULL DEFAULT '1500',
`rd` int(11) NOT NULL DEFAULT '350',
`vol` float NOT NULL DEFAULT '0.06',
`wins` int(11) NOT NULL,
`losses` int(11) NOT NULL,
`streak` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`streak`,`rd`,`id`,`category`),
UNIQUE KEY `id_category` (`id`,`category`),
KEY `rating` (`rating`,`rd`),
KEY `streak_idx` (`streak`),
KEY `category_idx` (`category`),
KEY `id_rating_idx` (`id`,`rating`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
The PRIMARY KEY
is the most common use case of queries to this table, that is why it's the clustered key. It's worth noting that the server is a raid 10 of SSDs with a 9GB/s FIO random read. So I don't suspect the indices not being clustered will affect much.
Output of (select count(distinct category) from ratings)
is 50
In the interest that this could be how the data is or an oversight on me, I am included the export of the entire table. It is only 200KB zipped: https://www.dropbox.com/s/p3iv23zi0uzbekv/ratings.zip?dl=0
The first query takes 27 seconds to run
You can use temporary tables with an AUTO_INCREMENT column to generate ranks (row number).
For example - to generate ranks for the '*' category:
drop temporary table if exists tmp_main_cat_rank;
create temporary table tmp_main_cat_rank (
rank int unsigned auto_increment primary key,
id int NOT NULL
) engine=memory
select null as rank, id
from ratings r
where r.category = '*'
order by r.category, r.rating desc, r.id desc;
This runs in something like 30 msec. While your approach with the selfjoin takes 45 seconds on my machine. Even with a new index on (category, rating, id)
it still takes 14 seconds to run.
To generate ranks per group (per category) is a bit more complicated. We can still use an AUTO_INCREMENT column, but will need to calculate and substract an offset per category:
drop temporary table if exists tmp_pos;
create temporary table tmp_pos (
pos int unsigned auto_increment primary key,
category varchar(50) not null,
id int NOT NULL
) engine=memory
select null as pos, category, id
from ratings r
where r.category <> '*'
order by r.category, r.rating desc, r.id desc;
drop temporary table if exists tmp_cat_offset;
create temporary table tmp_cat_offset engine=memory
select category, min(pos) - 1 as `offset`
from tmp_pos
group by category;
select t.id, min(t.pos - o.offset) as min_rank
from tmp_pos t
join tmp_cat_offset o using(category)
group by t.id
This runs in about 220 msec. The selfjoin solution takes 42 sec or 13 sec with the new index.
Now you just need to combine the last query with the first temp table, to get your final result:
select t1.id, (t1.min_rank + t2.rank) / 2 as OverallRank
from (
select t.id, min(t.pos - o.offset) as min_rank
from tmp_pos t
join tmp_cat_offset o using(category)
group by t.id
) t1
join tmp_main_cat_rank t2 using(id);
Overall runtime is ~280 msec without an additional index and ~240 msec with an index on (category, rating, id)
.
A note to the selfjoin approach: It's an elegant solution and performs fine with a small group size. It's fast with an average group size <= 2. It can be acceptable for a group size of 10. But you have an average group size 447 (count(*) / count(distinct category)
). That means every row is joined with 447 other rows (on average). You can see the impact by removing the group by clause:
SELECT Count(*)
FROM ratings AS g1
JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
AND g1.category = g2.category
WHERE g1.category != '*'
The result is more than 10M rows.
However - with an index on (category, rating, id)
your query runs in 33 seconds on my machine.
この記事はインターネットから収集されたものであり、転載の際にはソースを示してください。
侵害の場合は、連絡してください[email protected]
コメントを追加