Performance comparison of DISTINCT and GROUP BY in MySQL

2021-02-12 14:00:00
动力节点

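MySQL is one of the most popular relational databases today. A relational database stores data in separate tables rather than putting everything into one big repository, which improves speed and flexibility. In MySQL, DISTINCT removes duplicate rows, and GROUP BY also removes duplicates as a side effect of grouping. So which of the two is more efficient at deduplication? This article compares the performance of distinct and group by.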

I. Test procedure

Prepare a test table:

    CREATE TABLE `test_test` (
      `id` int(11) NOT NULL auto_increment,
      `num` int(11) NOT NULL default 0,
      PRIMARY KEY (`id`)
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=1;

Create a stored procedure that inserts 100,000 rows into the table:

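    -- in the mysql command-line client, switch the delimiter so the
    -- semicolons inside the body do not end the statement early
    delimiter //
    create procedure p_test(pa int)
    begin
      -- max_num tracks the row count so the table is capped at 100000 rows
      declare max_num int default 0;
      declare i int default 0;
      declare rand_num int;
      select count(id) into max_num from test_test;
      while i < pa and max_num < 100000 do
        -- small random integer, roughly in the 0 to 100 range
        select cast(rand()*100 as unsigned) into rand_num;
        insert into test_test(num) values (rand_num);
        set max_num = max_num + 1;
        set i = i + 1;
      end while;
    end //
    delimiter ;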

Call the stored procedure to insert the data:

call p_test(100000);
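
Before timing anything, it can help to confirm what the procedure actually produced, since the comparison depends on how duplicated `num` is. The query below is a minimal sketch; the aliases `total_rows` and `distinct_values` are illustrative names, not part of the original test.

```sql
-- Total rows inserted and how many different num values they contain.
select count(*)            as total_rows,
       count(distinct num) as distinct_values
from test_test;
```

If the table started empty, `total_rows` should be 100000, while `distinct_values` should be only around a hundred, because `rand()*100` is reduced to a small integer. In other words, this is a heavily duplicated data set.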

Run the test (no index):

    select distinct num from test_test;
    select num from test_test group by num;

    [SQL] select distinct num from test_test;
    Affected rows: 0
    Time: 0.078ms

    [SQL] select num from test_test group by num;
    Affected rows: 0
    Time: 0.031ms

II. Create an index on the num column

ALTER TABLE `test_test` ADD INDEX `num_index` (`num`) ;

Query again:

    select distinct num from test_test;
    select num from test_test group by num;

    [SQL] select distinct num from test_test;
    Affected rows: 0
    Time: 0.000ms

    [SQL] select num from test_test group by num;
    Affected rows: 0
    Time: 0.000ms

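At this point the times are too small for the client to show: a reading of 0.000 seconds has no precision left, so the two queries cannot be told apart.

So we switch to the command line and use profiling: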

    mysql> set profiling=1;
    mysql> select distinct(num) from test_test;
    mysql> select num from test_test group by num;
    mysql> show profiles;
    +----------+------------+----------------------------------------+
    | Query_ID | Duration   | Query                                  |
    +----------+------------+----------------------------------------+
    |        1 | 0.00072550 | select distinct(num) from test_test    |
    |        2 | 0.00071650 | select num from test_test group by num |
    +----------+------------+----------------------------------------+
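
The two durations differ by only a few microseconds, so a per-stage breakdown can show where the time goes. The sketch below assumes it runs in the same session, with profiling still on and Query_ID values 1 and 2 as listed above. `show profile` is deprecated in newer MySQL releases in favor of the Performance Schema, so treat this as a quick check rather than a benchmarking method.

```sql
-- Stage-by-stage timing for the two profiled statements;
-- 1 and 2 are the Query_ID values reported by show profiles.
show profile for query 1;
show profile for query 2;
```

Running each query several times and comparing the spread of the durations gives a fairer picture than a single measurement.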

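After adding the index, distinct was about 107 times faster than without it (0.078 / 0.00072550).

After adding the index, group by was about 43 times faster than without it (0.031 / 0.00071650).

Both ratios read the client's unindexed timings as seconds, which is how the 0.000 reading is described above, even though the client output labels them ms.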
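
The two ratios can be rechecked directly from the reported figures. The sketch below uses MySQL as a calculator and takes the unindexed client timings as seconds, which is an assumption about the client's units; the aliases are illustrative.

```sql
-- Unindexed time divided by indexed time for each query.
select 0.078 / 0.00072550 as distinct_speedup,
       0.031 / 0.00071650 as group_by_speedup;
```

This gives roughly 107.5 and 43.3, in line with the 107x and 43x figures quoted.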

Now compare distinct and group by with each other:

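In both runs, with and without the index, group by came out faster than distinct, so on these numbers group by is the suggested choice. With the index, however, the gap is only 0.000009 s (0.00072550 vs 0.00071650), about 1%, which a single run cannot distinguish from measurement noise.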
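
One way to see why the two queries time so closely is to compare their execution plans. This is a minimal sketch, not output captured from the test above. With `num_index` in place, one would typically expect both plans to be satisfied from the index alone, but that expectation is an assumption and depends on the MySQL version and storage engine.

```sql
-- Compare the plans chosen for the two deduplicating queries.
explain select distinct num from test_test;
explain select num from test_test group by num;
```

If the two plans match, any remaining timing difference is more likely run-to-run variation than a real property of either keyword.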

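For comparison, Hive (not MySQL) treats the two differently. By default, Hive translates distinct into a single global reduce task that performs the deduplication, so its parallelism is 1. group by, in contrast, is translated into a grouped aggregation that runs as multiple reduce tasks in parallel, each of which aggregates (deduplicates) the groups in the portion of data it receives.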
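
The Hive behaviour described above is most often discussed for the `count(distinct ...)` form. Assuming that is the case meant here, the commonly suggested rewrite looks like the sketch below. It reuses the `test_test` table purely for illustration, and `t` is an arbitrary subquery alias; how Hive actually plans either statement depends on its version and settings.

```sql
-- Form usually described as funnelling all rows into one reducer.
select count(distinct num) from test_test;

-- Rewrite: deduplicate with group by first, then count the groups.
select count(*) from (select num from test_test group by num) t;
```

Both statements return the same count; the second simply lets the deduplication step run as a grouped aggregation.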

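From the two experiments above we can draw this conclusion: in a table with a high proportion of duplicate values, such as this one (100,000 rows holding only about a hundred different num values), DISTINCT can noticeably improve query efficiency. In a table with few duplicates, DISTINCT can instead badly hurt it, although that case was not measured here. So DISTINCT does not always reduce efficiency, but you do need to judge how duplicated your data is beforehand.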
