MySQL NOT IN, LEFT JOIN ... IS NULL, and NOT EXISTS: notes on efficiency

2020-01-19 00:02:54

Source: reprinted

Contributor: community member

Efficiency comparison: NOT IN, JOIN with IS NULL, and NOT EXISTS

Statement 1: select count(*) from A where A.a not in (select a from B)

Statement 2: select count(*) from A left join B on A.a = B.a where B.a is null

Statement 3: select count(*) from A where not exists (select a from B where A.a = B.a)
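
The three statements can be compared on a toy data set. The following is a minimal sketch, assuming SQLite (through Python's standard sqlite3 module) as a stand-in for MySQL or SQL Server; the table contents are invented for illustration. It also shows the one case where the statements disagree: as soon as B.a contains a NULL, NOT IN matches nothing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (a INTEGER);
    CREATE TABLE B (a INTEGER);
    INSERT INTO A VALUES (1), (2), (3), (4);
    INSERT INTO B VALUES (2), (4);
""")

# The three statements from the text, with the column qualified for clarity.
STATEMENTS = {
    "not_in": "SELECT COUNT(*) FROM A WHERE A.a NOT IN (SELECT B.a FROM B)",
    "left_join": "SELECT COUNT(*) FROM A LEFT JOIN B ON A.a = B.a WHERE B.a IS NULL",
    "not_exists": "SELECT COUNT(*) FROM A WHERE NOT EXISTS "
                  "(SELECT B.a FROM B WHERE A.a = B.a)",
}

def run_all(connection):
    """Run every statement and return {name: count}."""
    return {name: connection.execute(sql).fetchone()[0]
            for name, sql in STATEMENTS.items()}

without_null = run_all(conn)

# Add a NULL to B and run the same statements again.
conn.execute("INSERT INTO B VALUES (NULL)")
with_null = run_all(conn)

print(without_null)
print(with_null)
```

Without the NULL all three counts are 2. With the NULL in B, the NOT IN count drops to 0 while the other two stay at 2, so the forms are interchangeable only when the compared column cannot be NULL.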

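I have known for a long time that these three statements produce the same result (strictly, only when column a contains no NULLs in either table, since NOT IN treats NULL differently), but I had never looked closely at how their performance compares. My impression had always been that statement 2 is the fastest.
Today at work I had to purge a database with tens of millions of rows, deleting more than twenty million of them, and the job relied heavily on the logic these three statements implement. I started with statement 1: the run took 1 hour 32 minutes and the log file grew to 21 GB. The time was acceptable, but the disk usage was a real problem. So I replaced every statement 1 with statement 2, expecting it to be faster. Instead, after more than 40 minutes the first batch of 50,000 rows had still not been deleted, and SQL Server crashed, which surprised me. I then ran the query on its own against a table of nearly ten million rows: statement 1 took 4 seconds, while statement 2 took 18 seconds, a large gap. Statement 3 performed about the same as statement 1.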

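The second form is a serious pitfall and should be avoided wherever possible. The first and third forms are essentially almost the same.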

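Assuming the buffer pool is large enough, form 2 has the following disadvantages compared with form 1:
(1) The left join itself is more expensive (it needs more resources to process the intermediate result set it produces).
(2) The intermediate result set of the left join is never smaller than table A.
(3) Form 2 must additionally filter the intermediate result of the left join with the is null condition, whereas form 1 does its filtering while the two sets are being joined, so this cost is extra.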
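
Point (2) can be checked by counting rows. The sketch below assumes SQLite via Python's sqlite3 module and invented sample data; it counts logical rows only and says nothing about what a particular optimizer actually materializes (an engine may rewrite the query into an anti-join).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (a INTEGER);
    CREATE TABLE B (a INTEGER);
    INSERT INTO A VALUES (1), (2), (3);
    INSERT INTO B VALUES (2), (2), (2);
""")
# B holds the value 2 three times, so the A row with a = 2 joins to three B rows.

def count(sql):
    """Return the single COUNT(*) value produced by sql."""
    return conn.execute(sql).fetchone()[0]

rows_in_a = count("SELECT COUNT(*) FROM A")
# Logical size of the left join before the IS NULL filter is applied.
joined_rows = count("SELECT COUNT(*) FROM A LEFT JOIN B ON A.a = B.a")
# Rows that survive the IS NULL filter, i.e. the final answer of form 2.
surviving_rows = count(
    "SELECT COUNT(*) FROM A LEFT JOIN B ON A.a = B.a WHERE B.a IS NULL"
)

print(rows_in_a, joined_rows, surviving_rows)
```

Here A has 3 rows, the unfiltered join has 5, and only 2 survive the filter: every A row appears at least once in the join, and duplicates in B multiply the matched rows.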

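Taken together, these three points make a noticeable difference on very large data sets (mainly in memory and CPU cost). I suspect that the buffer pool was already saturated during the original poster's test. In that case the extra work of form 2 has to spill into disk-backed virtual memory, and because paging in SQL Server involves slow I/O, the gap becomes even more pronounced.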

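As for the oversized log file, that is normal, since so many rows are being deleted. Depending on what the database is used for, consider setting the recovery model to simple, or truncate the log after the delete finishes and then shrink the file.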

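I once wrote a script that did unconditional deletes on this same database, removing all rows from the larger tables. The customer did not allow truncate table, for fear of damaging the existing schema, so delete was the only option. I hit the same log-growth problem then and solved it by deleting in batches: set rowcount @chunk on SQL Server 2000 and delete top (@chunk) on SQL Server 2005. That not only cut the deletion time considerably but also reduced the log growth to about 1 GB.
This time, however, the cleanup has to be conditional, i.e. delete A from A where ... with a condition at the end. I tried batched deletion again, but it no longer helped.
Do you happen to know why?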
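
For reference, the shape of a conditional batched delete can be sketched as follows. This is an illustration only, assuming SQLite via Python's sqlite3 module: SQLite has no DELETE TOP, so the chunk is selected with a LIMIT subquery on rowid, whereas on SQL Server the loop body would be a delete top (@chunk) ... where ... statement repeated until no rows are affected. The function name, the tables, and the NOT EXISTS condition are hypothetical, and the sketch does not explain why batching stopped helping in the case described above.

```python
import sqlite3

def delete_unmatched_in_chunks(conn, chunk_size):
    """Delete rows of A with no match in B, at most chunk_size rows per transaction.

    Returns the number of batches that actually deleted something.
    """
    batches = 0
    while True:
        cur = conn.execute(
            """
            DELETE FROM A
            WHERE rowid IN (
                SELECT A.rowid FROM A
                WHERE NOT EXISTS (SELECT 1 FROM B WHERE B.a = A.a)
                LIMIT ?
            )
            """,
            (chunk_size,),
        )
        conn.commit()  # one short transaction per chunk keeps each log burst small
        if cur.rowcount == 0:
            break
        batches += 1
    return batches

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (a INTEGER);
    CREATE TABLE B (a INTEGER);
""")
conn.executemany("INSERT INTO A VALUES (?)", [(i,) for i in range(1, 11)])
conn.executemany("INSERT INTO B VALUES (?)", [(i,) for i in range(2, 11, 2)])
conn.commit()

batches = delete_unmatched_in_chunks(conn, chunk_size=2)
remaining = [row[0] for row in conn.execute("SELECT a FROM A ORDER BY a")]
print(batches, remaining)
```

With ten rows in A and the even values in B, the five odd rows are removed in three batches (2, 2, and 1 rows). Note that every batch re-evaluates the condition against the remaining rows, so the cost of the condition itself is paid once per chunk.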

MySQL NOT IN vs. LEFT JOIN: notes on efficiency

First, the purpose of this SQL is to find the rows that are in set a but not in set b.
The NOT IN form:

select add_tb.RUID
from (select distinct RUID
      from UserMsg
      where SubjectID = 12
        and CreateTime >= '2009-8-14 15:30:00'
        and CreateTime <= '2009-8-17 16:00:00'
     ) add_tb
where add_tb.RUID not in (select distinct RUID
                          from UserMsg
                          where SubjectID = 12
                            and CreateTime < '2009-8-14 15:30:00'
                         )

The LEFT JOIN form, with both sides as derived tables:

select a.RUID, b.RUID
from (select distinct RUID
      from UserMsg
      where SubjectID = 12
        and CreateTime >= '2009-8-14 15:30:00'
        and CreateTime <= '2009-8-17 16:00:00'
     ) a
left join (select distinct RUID
           from UserMsg
           where SubjectID = 12
             and CreateTime < '2009-8-14 15:30:00'
          ) b on a.RUID = b.RUID
where b.RUID is null

The LEFT JOIN form as a direct self-join:

select distinct a.RUID
from UserMsg a
left join UserMsg b
       on a.RUID = b.RUID
      and b.SubjectID = 12
      and b.CreateTime < '2009-8-14 15:30:00'
where a.SubjectID = 12
  and a.CreateTime >= '2009-8-14 15:30:00'
  and a.CreateTime <= '2009-8-17 16:00:00'
  and b.RUID is null;

The NOT EXISTS form:

select distinct a.RUID
from UserMsg a
where a.SubjectID = 12
  and a.CreateTime >= '2009-8-14 15:30:00'
  and a.CreateTime <= '2009-8-17 16:00:00'
  and not exists (select 1
                  from UserMsg
                  where SubjectID = 12
                    and CreateTime < '2009-8-14 15:30:00'
                    and RUID = a.RUID
                 )

A LEFT JOIN variant without the SubjectID filter, joining the derived table directly to UserMsg:

select a.RUID, b.RUID
from (select distinct RUID
      from UserMsg
      where CreateTime >= '2009-8-14 15:30:00'
        and CreateTime <= '2009-8-17 16:00:00'
     ) a
left join UserMsg b
       on a.RUID = b.RUID
      and b.CreateTime < '2009-8-14 15:30:00'
where b.RUID is null;
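
To confirm that the NOT IN, LEFT JOIN, and NOT EXISTS forms with the SubjectID filter select the same users, they can be run against a small sample. The following is a minimal sketch, assuming SQLite via Python's sqlite3 module rather than MySQL; the sample rows are invented, the timestamps are zero-padded because SQLite compares them as text, and the NOT IN query is written in a simplified shape without the derived table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE UserMsg (RUID INTEGER, SubjectID INTEGER, CreateTime TEXT)")
conn.executemany(
    "INSERT INTO UserMsg VALUES (?, ?, ?)",
    [
        (1, 12, "2009-08-10 09:00:00"),  # user 1 posted before the window...
        (1, 12, "2009-08-15 10:00:00"),  # ...and inside it, so is not new
        (2, 12, "2009-08-15 11:00:00"),  # user 2 first appears inside the window
        (3, 12, "2009-08-16 12:00:00"),  # user 3 is also new
        (3, 12, "2009-08-16 13:00:00"),  # second message, exercises DISTINCT
        (4, 12, "2009-08-01 08:00:00"),  # user 4 posted only before the window
        (5, 99, "2009-08-15 14:00:00"),  # different subject, ignored
    ],
)

params = {"start": "2009-08-14 15:30:00", "end": "2009-08-17 16:00:00"}

NOT_IN = """
    SELECT DISTINCT RUID FROM UserMsg
    WHERE SubjectID = 12 AND CreateTime >= :start AND CreateTime <= :end
      AND RUID NOT IN (SELECT RUID FROM UserMsg
                       WHERE SubjectID = 12 AND CreateTime < :start)
"""
LEFT_JOIN = """
    SELECT DISTINCT a.RUID FROM UserMsg a
    LEFT JOIN UserMsg b
      ON a.RUID = b.RUID AND b.SubjectID = 12 AND b.CreateTime < :start
    WHERE a.SubjectID = 12 AND a.CreateTime >= :start AND a.CreateTime <= :end
      AND b.RUID IS NULL
"""
NOT_EXISTS = """
    SELECT DISTINCT a.RUID FROM UserMsg a
    WHERE a.SubjectID = 12 AND a.CreateTime >= :start AND a.CreateTime <= :end
      AND NOT EXISTS (SELECT 1 FROM UserMsg b
                      WHERE b.SubjectID = 12 AND b.CreateTime < :start
                        AND b.RUID = a.RUID)
"""

def new_users(sql):
    """Return the sorted RUIDs selected by one query form."""
    return sorted(row[0] for row in conn.execute(sql, params))

results = {
    "not_in": new_users(NOT_IN),
    "left_join": new_users(LEFT_JOIN),
    "not_exists": new_users(NOT_EXISTS),
}
print(results)
```

All three forms return users 2 and 3 on this sample. Agreement on a toy table says nothing about relative speed; that has to be measured on the real data and engine.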