原文章地址:http://www.cnblogs.com/5211314jackrose/p/5818124.html
事務(wù)處理是DBMS中最關(guān)鍵的技術(shù),對SQLite也一樣,它涉及到并發(fā)控制,以及故障恢復(fù)等等。在數(shù)據(jù)庫中使用事務(wù)可以保證數(shù)據(jù)的統(tǒng)一和完整性,同時(shí)也可以提高效率。假設(shè)需要在一張表內(nèi)一次插入20個(gè)人的名字才算是操作成功,那么在不使用事務(wù)的情況下,如果插入過程中出現(xiàn)異常或者在插入過程中出現(xiàn)一些其他數(shù)據(jù)庫操作的話,就很有可能影響了操作的完整性。所以事務(wù)可以很好地解決這樣的情況,首先事務(wù)是可以把啟動(dòng)事務(wù)過程中的所有操作視為事務(wù)的過程。等到所有過程執(zhí)行完畢后,我們可以根據(jù)操作是否成功來決定事務(wù)是否進(jìn)行提交或者回滾。提交事務(wù)后會(huì)一次性把所有數(shù)據(jù)提交到數(shù)據(jù)庫,如果回滾了事務(wù)就會(huì)放棄這次的操作,而對原來表的數(shù)據(jù)不進(jìn)行更改。
SQLite中分別以BEGIN、COMMIT和ROLLBACK啟動(dòng)、提交和回滾事務(wù)。見如下示例:

@try{ char *errorMsg; if (sqlite3_exec(_database, "BEGIN", NULL, NULL, &errorMsg)==SQLITE_OK) { NSLog(@”啟動(dòng)事務(wù)成功”); sqlite3_free(errorMsg); sqlite3_stmt *statement; if (sqlite3_PRepare_v2(_database, [@"insert into persons(name) values(?);" UTF8String], -1, &statement, NULL)==SQLITE_OK) { //綁定參數(shù) const char *text=[@”張三” cStringUsingEncoding:NSUTF8StringEncoding]; sqlite3_bind_text(statement, index, text, strlen(text), SQLITE_STATIC); if (sqlite3_step(statement)!=SQLITE_DONE) { sqlite3_finalize(statement); } } if (sqlite3_exec(_database, "COMMIT", NULL, NULL, &errorMsg)==SQLITE_OK) { NSLog(@”提交事務(wù)成功”); } sqlite3_free(errorMsg); } else{ sqlite3_free(errorMsg); } } @catch(NSException *e){ char *errorMsg; if (sqlite3_exec(_database, "ROLLBACK", NULL, NULL, &errorMsg)==SQLITE_OK) { NSLog(@”回滾事務(wù)成功”); } sqlite3_free(errorMsg); } @finally{ } 
在SQLite中,如果沒有為當(dāng)前的SQL命令(SELECT除外)顯示的指定事務(wù),那么SQLite會(huì)自動(dòng)為該操作添加一個(gè)隱式的事務(wù),以保證該操作的原子性和一致性。當(dāng)然,SQLite也支持顯示的事務(wù),其語法與大多數(shù)關(guān)系型數(shù)據(jù)庫相比基本相同。見如下示例:

sqlite> BEGIN TRANSACTION;sqlite> INSERT INTO testtable VALUES(1);sqlite> INSERT INTO testtable VALUES(2); sqlite> COMMIT TRANSACTION; --顯示事務(wù)被提交,數(shù)據(jù)表中的數(shù)據(jù)也發(fā)生了變化。sqlite> SELECT COUNT(*) FROM testtable;COUNT(*)----------2sqlite> BEGIN TRANSACTION;sqlite> INSERT INTO testtable VALUES(1);sqlite> ROLLBACK TRANSACTION; --顯示事務(wù)被回滾,數(shù)據(jù)表中的數(shù)據(jù)沒有發(fā)生變化。sqlite> SELECT COUNT(*) FROM testtable;COUNT(*)----------2
Page Cache之事務(wù)處理——SQLite原子提交的實(shí)現(xiàn)
下面通過具體示例來分析SQLite原子提交的實(shí)現(xiàn)(基于Version 3.3.6的代碼):
CREATE TABLE episodes( id integer primary key,name text, cid int);insert into episodes(name,cid) values("cat",1); --插入一條記錄 它經(jīng)過編譯器處理后生成的虛擬機(jī)代碼如下:

sqlite> explain insert into episodes(name,cid) values("cat",1);0|Trace|0|0|0|explain insert into episodes(name,cid) values("cat",1);|00|1|Goto|0|12|0||00|2|SetNumColumns|0|3|0||00|3|OpenWrite|0|2|0||00|4|NewRowid|0|2|0||00|5|Null|0|3|0||00|6|String8|0|4|0|cat|00|7|Integer|1|5|0||00|8|MakeRecord|3|3|6|dad|00|9|Insert|0|6|2|episodes|0b|10|Close|0|0|0||00|11|Halt|0|0|0||00|12|Transaction|0|1|0||00|13|VerifyCookie|0|1|0||00|14|Transaction|1|1|0||00|15|VerifyCookie|1|0|0||00|16|TableLock|0|2|1|episodes|00|17|Goto|0|2|0||00|
1、初始狀態(tài)(Initial State) 當(dāng)一個(gè)數(shù)據(jù)庫連接第一次打開時(shí),狀態(tài)如圖所示。圖中最右邊(“Disk”標(biāo)注)表示保存在存儲(chǔ)設(shè)備中的內(nèi)容。每個(gè)方框代表一個(gè)扇區(qū)。藍(lán)色的塊表示這個(gè)扇區(qū)保存了原始數(shù)據(jù)。圖中中間區(qū)域是操作系統(tǒng)的磁盤緩沖區(qū)。開始的時(shí)候,這些緩存是還沒有被使用,因此這些方框是空白的。圖中左邊區(qū)域顯示SQLite用戶進(jìn)程的內(nèi)存。因?yàn)檫@個(gè)數(shù)據(jù)庫連接剛剛打開,所以還沒有任何數(shù)據(jù)記錄被讀入,所以這些內(nèi)存也是空的。

2、獲取讀鎖(Acquiring A Read Lock) 在SQLite寫數(shù)據(jù)庫之前,它必須先從數(shù)據(jù)庫中讀取相關(guān)信息。比如,在插入新的數(shù)據(jù)時(shí),SQLite會(huì)先從sqlite_master表中讀取數(shù)據(jù)庫模式(相當(dāng)于數(shù)據(jù)字典),以便編譯器對INSERT語句進(jìn)行分析,確定數(shù)據(jù)插入的位置。在進(jìn)行讀操作之前,必須先獲取數(shù)據(jù)庫的共享鎖(shared lock),共享鎖允許兩個(gè)或更多的連接在同一時(shí)刻讀取數(shù)據(jù)庫。但是共享鎖不允許其它連接對數(shù)據(jù)庫進(jìn)行寫操作。 shared lock存在于操作系統(tǒng)磁盤緩存,而不是磁盤本身。文件鎖的本質(zhì)只是操作系統(tǒng)的內(nèi)核數(shù)據(jù)結(jié)構(gòu),當(dāng)操作系統(tǒng)崩潰或掉電時(shí),這些內(nèi)核數(shù)據(jù)也會(huì)隨之消失。

3、讀取數(shù)據(jù) 一旦得到shared lock,就可以進(jìn)行讀操作。如圖所示,數(shù)據(jù)先由OS從磁盤讀取到OS緩存,然后再由OS移到用戶進(jìn)程空間。一般來說,數(shù)據(jù)庫文件分為很多頁,而一次讀操作只讀取一小部分頁面。如圖,從8個(gè)頁面讀取3個(gè)頁面。
4、獲取Reserved Lock 在對數(shù)據(jù)進(jìn)行修改操作之前,先要獲取數(shù)據(jù)庫文件的Reserved Lock,Reserved Lock和shared lock的相似之處在于,它們都允許其它進(jìn)程對數(shù)據(jù)庫文件進(jìn)行讀操作。Reserved Lock和Shared Lock可以共存,但是只能是一個(gè)Reserved Lock和多個(gè)Shared Lock——多個(gè)Reserved Lock不能共存。所以,在同一時(shí)刻,只能進(jìn)行一個(gè)寫操作。 Reserved Lock意味著當(dāng)前進(jìn)程(連接)想修改數(shù)據(jù)庫文件,但是還沒開始修改操作,所以其它的進(jìn)程可以讀數(shù)據(jù)庫,但不能寫數(shù)據(jù)庫。
5、創(chuàng)建恢復(fù)日志(Creating A Rollback Journal File) 在對數(shù)據(jù)庫進(jìn)行寫操作之前,SQLite先要?jiǎng)?chuàng)建一個(gè)單獨(dú)的日志文件,然后把要修改的頁面的原始數(shù)據(jù)寫入日志。回滾日志包含一個(gè)日志頭(圖中的綠色)——記錄數(shù)據(jù)庫文件的原始大小。所以即使數(shù)據(jù)庫文件大小改變了,我們?nèi)灾罃?shù)據(jù)庫的原始大小。 從OS的角度來看,當(dāng)一個(gè)文件創(chuàng)建時(shí),大多數(shù)OS(Windows、linux、Mac OS X)不會(huì)向磁盤寫入數(shù)據(jù),新創(chuàng)建的文件此時(shí)位于磁盤緩存中,之后才會(huì)真正寫入磁盤。如圖,日志文件位于OS磁盤緩存中,而不是位于磁盤。

以上5步的實(shí)現(xiàn)代碼:

//事務(wù)指令的實(shí)現(xiàn)//p1為數(shù)據(jù)庫文件的索引號(hào)--0為main database;1為temporary tables使用的文件//p2不為0,一個(gè)寫事務(wù)開始case OP_Transaction: { //數(shù)據(jù)庫的索引號(hào) int i = pOp->p1; //指向數(shù)據(jù)庫對應(yīng)的btree Btree *pBt; assert( i>=0 && i<db->nDb ); assert( (p->btreeMask & (1<<i))!=0 ); //設(shè)置btree指針 pBt = db->aDb[i].pBt; if( pBt ){ //從這里btree開始事務(wù),主要給文件加鎖,并設(shè)置btree事務(wù)狀態(tài) rc = sqlite3BtreeBeginTrans(pBt, pOp->p2); if( rc==SQLITE_BUSY ){ p->pc = pc; p->rc = rc = SQLITE_BUSY; goto vdbe_return; } if( rc!=SQLITE_OK && rc!=SQLITE_READONLY /* && rc!=SQLITE_BUSY */ ){ goto abort_due_to_error; } } break;}//開始一個(gè)事務(wù),如果第二個(gè)參數(shù)不為0,則一個(gè)寫事務(wù)開始,否則是一個(gè)讀事務(wù)//如果wrflag>=2,一個(gè)exclusive事務(wù)開始,此時(shí)別的連接不能訪問數(shù)據(jù)庫int sqlite3BtreeBeginTrans(Btree *p, int wrflag){ BtShared *pBt = p->pBt; int rc = SQLITE_OK; btreeIntegrity(p); /* If the btree is already in a write-transaction, or it ** is already in a read-transaction and a read-transaction ** is requested, this is a no-op. */ //如果b-tree處于一個(gè)寫事務(wù);或者處于一個(gè)讀事務(wù),一個(gè)讀事務(wù)又請求,則返回SQLITE_OK if( p->inTrans==TRANS_WRITE || (p->inTrans==TRANS_READ && !wrflag) ){ return SQLITE_OK; } /* Write transactions are not possible on a read-only database */ //寫事務(wù)不能訪問只讀數(shù)據(jù)庫 if( pBt->readOnly && wrflag ){ return SQLITE_READONLY; } /* If another database handle has already opened a write transaction ** on this shared-btree structure and a second write transaction is ** requested, return SQLITE_BUSY. */ //如果數(shù)據(jù)庫已存在一個(gè)寫事務(wù),則該寫事務(wù)請求時(shí)返回SQLITE_BUSY if( pBt->inTransaction==TRANS_WRITE && wrflag ){ return SQLITE_BUSY; } do { //如果數(shù)據(jù)庫對應(yīng)btree的第一個(gè)頁面還沒讀進(jìn)內(nèi)存 //則把該頁面讀進(jìn)內(nèi)存,數(shù)據(jù)庫也相應(yīng)的加read lock if( pBt->pPage1==0 ){ //加read lock,并讀頁面到內(nèi)存 rc = lockBtree(pBt); } if( rc==SQLITE_OK && wrflag ){ //對數(shù)據(jù)庫文件加RESERVED_LOCK鎖 rc = sqlite3pager_begin(pBt->pPage1->aData, wrflag>1); if( rc==SQLITE_OK ){ rc = newDatabase(pBt); } } if( rc==SQLITE_OK ){ if( wrflag ) pBt->inStmt = 0; }else{ unlockBtreeIfUnused(pBt); } }while( rc==SQLITE_BUSY && pBt->inTransaction==TRANS_NONE && sqlite3InvokeBusyHandler(pBt->pBusyHandler) ); if( rc==SQLITE_OK ){ if( p->inTrans==TRANS_NONE ){ //btree的事務(wù)數(shù)加1 pBt->nTransaction++; } //設(shè)置btree事務(wù)狀態(tài) p->inTrans = (wrflag?TRANS_WRITE:TRANS_READ); if( p->inTrans>pBt->inTransaction ){ pBt->inTransaction = p->inTrans; } } btreeIntegrity(p); return rc;}/***獲取數(shù)據(jù)庫的寫鎖,發(fā)生以下情況時(shí)去除寫鎖:** * sqlite3pager_commit() is called.** * sqlite3pager_rollback() is called.** * sqlite3pager_close() is called.** * sqlite3pager_unref() is called to on every outstanding page.**pData指向數(shù)據(jù)庫的打開的頁面,此時(shí)并不修改,僅僅只是獲取**相應(yīng)的pager,檢查它是否處于read-lock狀態(tài)**如果打開的不是臨時(shí)文件,則打開日志文件.**如果數(shù)據(jù)庫已經(jīng)處于寫狀態(tài),則do nothing*/int sqlite3pager_begin(void *pData, int exFlag){ PgHdr *pPg = DATA_TO_PGHDR(pData); Pager *pPager = pPg->pPager; int rc = SQLITE_OK; assert( pPg->nRef>0 ); assert( pPager->state!=PAGER_UNLOCK ); //pager已經(jīng)處于share狀態(tài) if( pPager->state==PAGER_SHARED ){ assert( pPager->aInJournal==0 ); if( MEMDB ){ pPager->state = PAGER_EXCLUSIVE; pPager->origDbSize = pPager->dbSize; }else{ //對文件加 RESERVED_LOCK rc = sqlite3OsLock(pPager->fd, RESERVED_LOCK); if( rc==SQLITE_OK ){ //設(shè)置pager的狀態(tài) pPager->state = PAGER_RESERVED; if( exFlag ){ rc = pager_wait_on_lock(pPager, EXCLUSIVE_LOCK); } } if( rc!=SQLITE_OK ){ return rc; } pPager->dirtyCache = 0; TRACE2("TRANSACTION %d/n", PAGERID(pPager)); //使用日志,不是臨時(shí)文件,則打開日志文件 if( pPager->useJournal && !pPager->tempFile ){ //為pager打開日志文件,pager應(yīng)該處于RESERVED或EXCLUSIVE狀態(tài) //會(huì)向日志文件寫入header rc = pager_open_journal(pPager); } } } return rc;}//創(chuàng)建日志文件,pager應(yīng)該處于RESERVED或EXCLUSIVE狀態(tài)static int pager_open_journal(Pager *pPager){ int rc; assert( !MEMDB ); assert( pPager->state>=PAGER_RESERVED ); assert( pPager->journalOpen==0 ); assert( pPager->useJournal ); assert( pPager->aInJournal==0 ); sqlite3pager_pagecount(pPager); //日志文件頁面位圖 pPager->aInJournal = sqliteMalloc( pPager->dbSize/8 + 1 ); if( pPager->aInJournal==0 ){ rc = SQLITE_NOMEM; goto failed_to_open_journal; } //打開日志文件 rc = sqlite3OsOpenExclusive(pPager->zJournal, &pPager->jfd, pPager->tempFile); //日志文件的位置指針 pPager->journalOff = 0; pPager->setMaster = 0; pPager->journalHdr = 0; if( rc!=SQLITE_OK ){ goto failed_to_open_journal; } /*一般來說,OS此時(shí)創(chuàng)建的文件位于磁盤緩存,并沒有實(shí)際 **存在于磁盤,下面三個(gè)操作就是為了把結(jié)果寫入磁盤,而對于 **windows系統(tǒng)來說,并沒有提供相應(yīng)API,所以實(shí)際上沒有意義. */ //fullSync操作對windows沒有意義 sqlite3OsSetFullSync(pPager->jfd, pPager->full_fsync); sqlite3OsSetFullSync(pPager->fd, pPager->full_fsync); /* Attempt to open a file descriptor for the directory that contains a file. **This file descriptor can be used to fsync() the directory **in order to make sure the creation of a new file is actually written to disk. */ sqlite3OsOpenDirectory(pPager->jfd, pPager->zDirectory); pPager->journalOpen = 1; pPager->journalStarted = 0; pPager->needSync = 0; pPager->alwaysRollback = 0; pPager->nRec = 0; if( pPager->errCode ){ rc = pPager->errCode; goto failed_to_open_journal; } pPager->origDbSize = pPager->dbSize; //寫入日志文件的header--24個(gè)字節(jié) rc = writeJournalHdr(pPager); if( pPager->stmtAutoopen && rc==SQLITE_OK ){ rc = sqlite3pager_stmt_begin(pPager); } if( rc!=SQLITE_OK && rc!=SQLITE_NOMEM ){ rc = pager_unwritelock(pPager); if( rc==SQLITE_OK ){ rc = SQLITE_FULL; } } return rc;failed_to_open_journal: sqliteFree(pPager->aInJournal); pPager->aInJournal = 0; if( rc==SQLITE_NOMEM ){ /* If this was a malloc() failure, then we will not be closing the pager ** file. So delete any journal file we may have just created. Otherwise, ** the system will get confused, we have a read-lock on the file and a ** mysterious journal has appeared in the filesystem. */ sqlite3OsDelete(pPager->zJournal); }else{ sqlite3OsUnlock(pPager->fd, NO_LOCK); pPager->state = PAGER_UNLOCK; } return rc;}/*寫入日志文件頭**journal header的格式如下:** - 8 bytes: 標(biāo)志日志文件的魔數(shù)** - 4 bytes: 日志文件中記錄數(shù)** - 4 bytes: Random number used for page hash.** - 4 bytes: 原來數(shù)據(jù)庫的大小(kb)** - 4 bytes: 扇區(qū)大小512byte*/static int writeJournalHdr(Pager *pPager){ //日志文件頭 char zHeader[sizeof(aJournalMagic)+16]; int rc = seekJournalHdr(pPager); if( rc ) return rc; pPager->journalHdr = pPager->journalOff; if( pPager->stmtHdrOff==0 ){ pPager->stmtHdrOff = pPager->journalHdr; } //設(shè)置文件指針指向header之后 pPager->journalOff += JOURNAL_HDR_SZ(pPager); /* FIX ME: ** ** Possibly for a pager not in no-sync mode, the journal magic should not ** be written until nRec is filled in as part of next syncJournal(). ** ** Actually maybe the whole journal header should be delayed until that ** point. Think about this. */ memcpy(zHeader, aJournalMagic, sizeof(aJournalMagic)); /* The nRec Field. 0xFFFFFFFF for no-sync journals. */ put32bits(&zHeader[sizeof(aJournalMagic)], pPager->noSync ? 0xffffffff : 0); /* The random check-hash initialiser */ sqlite3Randomness(sizeof(pPager->cksumInit), &pPager->cksumInit); put32bits(&zHeader[sizeof(aJournalMagic)+4], pPager->cksumInit); /* The initial database size */ put32bits(&zHeader[sizeof(aJournalMagic)+8], pPager->dbSize); /* The assumed sector size for this process */ put32bits(&zHeader[sizeof(aJournalMagic)+12], pPager->sectorSize); //寫入文件頭 rc = sqlite3OsWrite(pPager->jfd, zHeader, sizeof(zHeader)); /* The journal header has been written successfully. Seek the journal ** file descriptor to the end of the journal header sector. */ if( rc==SQLITE_OK ){ rc = sqlite3OsSeek(pPager->jfd, pPager->journalOff-1); if( rc==SQLITE_OK ){ rc = sqlite3OsWrite(pPager->jfd, "/000", 1); } } return rc;}
其實(shí)現(xiàn)過程如下圖所示:

6、修改位于用戶進(jìn)程空間的頁面(Changing Database Pages In User Space) 頁面的原始數(shù)據(jù)寫入日志之后,就可以修改頁面了——位于用戶進(jìn)程空間。每個(gè)數(shù)據(jù)庫連接都有自己私有的空間,所以頁面的變化只對該連接可見,而對其它連接的數(shù)據(jù)仍然是磁盤緩存中的數(shù)據(jù)。從這里可以明白一件事:一個(gè)進(jìn)程在修改頁面數(shù)據(jù)的同時(shí),其它進(jìn)程可以繼續(xù)進(jìn)行讀操作。圖中的紅色表示修改的頁面。

7、日志文件刷入磁盤(Flushing The Rollback Journal File To Mass Storage) 接下來把日志文件的內(nèi)容刷入磁盤,這對于數(shù)據(jù)庫從意外中恢復(fù)來說是至關(guān)重要的一步。而且這通常也是一個(gè)耗時(shí)的操作,因?yàn)榇疟PI/O速度很慢。 這個(gè)步驟不只把日志文件刷入磁盤那么簡單,它的實(shí)現(xiàn)實(shí)際上分成兩步:首先把日志文件的內(nèi)容刷入磁盤(即頁面數(shù)據(jù));然后把日志文件中頁面的數(shù)目寫入日志文件頭,再把header刷入磁盤(這一過程在代碼中清晰可見)。

代碼如下:

/***Sync日志文件,保證所有的臟頁面寫入磁盤日志文件*/static int syncJournal(Pager *pPager){ PgHdr *pPg; int rc = SQLITE_OK; /* Sync the journal before modifying the main database ** (assuming there is a journal and it needs to be synced.) */ if( pPager->needSync ){ if( !pPager->tempFile ){ assert( pPager->journalOpen ); /* assert( !pPager->noSync ); // noSync might be set if synchronous ** was turned off after the transaction was started. Ticket #615 */#ifndef NDEBUG { /* Make sure the pPager->nRec counter we are keeping agrees ** with the nRec computed from the size of the journal file. */ i64 jSz; rc = sqlite3OsFileSize(pPager->jfd, &jSz); if( rc!=0 ) return rc; assert( pPager->journalOff==jSz ); }#endif { /* Write the nRec value into the journal file header. If in ** full-synchronous mode, sync the journal first. This ensures that ** all data has really hit the disk before nRec is updated to mark ** it as a candidate for rollback. */ if( pPager->fullSync ){ TRACE2("SYNC journal of %d/n", PAGERID(pPager)); //首先保證臟頁面中所有的數(shù)據(jù)都已經(jīng)寫入日志文件 rc = sqlite3OsSync(pPager->jfd, 0); if( rc!=0 ) return rc; } rc = sqlite3OsSeek(pPager->jfd, pPager->journalHdr + sizeof(aJournalMagic)); if( rc ) return rc; //頁面的數(shù)目寫入日志文件 rc = write32bits(pPager->jfd, pPager->nRec); if( rc ) return rc; rc = sqlite3OsSeek(pPager->jfd, pPager->journalOff); if( rc ) return rc; } TRACE2("SYNC journal of %d/n", PAGERID(pPager)); rc = sqlite3OsSync(pPager->jfd, pPager->full_fsync); if( rc!=0 ) return rc; pPager->journalStarted = 1; } pPager->needSync = 0; /* Erase the needSync flag from every page. */ //清除needSync標(biāo)志位 for(pPg=pPager->pAll; pPg; pPg=pPg->pNextAll){ pPg->needSync = 0; } pPager->pFirstSynced = pPager->pFirst; }#ifndef NDEBUG /* If the Pager.needSync flag is clear then the PgHdr.needSync ** flag must also be clear for all pages. Verify that this ** invariant is true. */ else{ for(pPg=pPager->pAll; pPg; pPg=pPg->pNextAll){ assert( pPg->needSync==0 ); } assert( pPager->pFirstSynced==pPager->pFirst ); }#endif return rc;}
8、獲取排斥鎖(Obtaining An Exclusive Lock) 在對數(shù)據(jù)庫文件進(jìn)行修改之前(注:這里不是內(nèi)存中的頁面),我們必須得到數(shù)據(jù)庫文件的排斥鎖(Exclusive Lock)。得到排斥鎖的過程可分為兩步:首先得到Pending lock;然后Pending lock升級到exclusive lock。 Pending lock允許其它已經(jīng)存在的Shared lock繼續(xù)讀數(shù)據(jù)庫文件,但是不允許產(chǎn)生新的shared lock,這樣做目的是為了防止寫操作發(fā)生餓死情況。一旦所有的shared lock完成操作,則pending lock升級到exclusive lock。

9、修改的頁面寫入文件(Writing Changes To The Database File) 一旦得到exclusive lock,其它的進(jìn)程就不能進(jìn)行讀操作,此時(shí)就可以把修改的頁面寫回?cái)?shù)據(jù)庫文件,但是通常OS都把結(jié)果暫時(shí)保存到磁盤緩存中,直到某個(gè)時(shí)刻才會(huì)真正把結(jié)果寫入磁盤。

以上2步的實(shí)現(xiàn)代碼:

//把所有的臟頁面寫入數(shù)據(jù)庫//到這里開始獲取EXCLUSIVEQ鎖,并將頁面寫回操作系統(tǒng)文件static int pager_write_pagelist(PgHdr *pList){ Pager *pPager; int rc; if( pList==0 ) return SQLITE_OK; pPager = pList->pPager; /* At this point there may be either a RESERVED or EXCLUSIVE lock on the ** database file. If there is already an EXCLUSIVE lock, the following ** calls to sqlite3OsLock() are no-ops. ** ** Moving the lock from RESERVED to EXCLUSIVE actually involves going ** through an intermediate state PENDING. A PENDING lock prevents new ** readers from attaching to the database but is unsufficient for us to ** write. The idea of a PENDING lock is to prevent new readers from ** coming in while we wait for existing readers to clear. ** ** While the pager is in the RESERVED state, the original database file ** is unchanged and we can rollback without having to playback the ** journal into the original database file. Once we transition to ** EXCLUSIVE, it means the database file has been changed and any rollback ** will require a journal playback. */ //加EXCLUSIVE_LOCK鎖 rc = pager_wait_on_lock(pPager, EXCLUSIVE_LOCK); if( rc!=SQLITE_OK ){ return rc; } while( pList ){ assert( pList->dirty ); rc = sqlite3OsSeek(pPager->fd, (pList->pgno-1)*(i64)pPager->pageSize); if( rc ) return rc; /* If there are dirty pages in the page cache with page numbers greater ** than Pager.dbSize, this means sqlite3pager_truncate() was called to ** make the file smaller (presumably by auto-vacuum code). Do not write ** any such pages to the file. */ if( pList->pgno<=pPager->dbSize ){ char *pData = CODEC2(pPager, PGHDR_TO_DATA(pList), pList->pgno, 6); TRACE3("STORE %d page %d/n", PAGERID(pPager), pList->pgno); //寫入文件 rc = sqlite3OsWrite(pPager->fd, pData, pPager->pageSize); TEST_INCR(pPager->nWrite); }#ifndef NDEBUG else{ TRACE3("NOSTORE %d page %d/n", PAGERID(pPager), pList->pgno); }#endif if( rc ) return rc; //設(shè)置dirty pList->dirty = 0;#ifdef SQLITE_CHECK_PAGES pList->pageHash = pager_pagehash(pList);#endif //指向下一個(gè)臟頁面 pList = pList->pDirty; } return SQLITE_OK;}
10、修改結(jié)果刷入存儲(chǔ)設(shè)備(Flushing Changes To Mass Storage) 為了保證修改結(jié)果真正寫入磁盤,這一步必不可少。對于數(shù)據(jù)庫存的完整性,這一步也是關(guān)鍵的一步。由于要進(jìn)行實(shí)際的I/O操作,所以和第7步一樣,將花費(fèi)較多的時(shí)間。

以上幾步實(shí)現(xiàn)代碼如下(以上幾步是在函數(shù)sqlite3BtreeSync()--btree.c中調(diào)用的):

//同步btree對應(yīng)的數(shù)據(jù)庫文件//該函數(shù)返回之后,只需要提交寫事務(wù),刪除日志文件int sqlite3BtreeSync(Btree *p, const char *zMaster){ int rc = SQLITE_OK; if( p->inTrans==TRANS_WRITE ){ BtShared *pBt = p->pBt; Pgno nTrunc = 0;#ifndef SQLITE_OMIT_AUTOVACUUM if( pBt->autoVacuum ){ rc = autoVacuumCommit(pBt, &nTrunc); if( rc!=SQLITE_OK ){ return rc; } }#endif //調(diào)用pager進(jìn)行sync rc = sqlite3pager_sync(pBt->pPager, zMaster, nTrunc); } return rc;}//把pager所有臟頁面寫回文件int sqlite3pager_sync(Pager *pPager, const char *zMaster, Pgno nTrunc){ int rc = SQLITE_OK; TRACE4("DATABASE SYNC: File=%s zMaster=%s nTrunc=%d/n", pPager->zFilename, zMaster, nTrunc); /* If this is an in-memory db, or no pages have been written to, or this ** function has already been called, it is a no-op. */ //pager不處于PAGER_SYNCED狀態(tài),dirtyCache為1, //則進(jìn)行sync操作 if( pPager->state!=PAGER_SYNCED && !MEMDB && pPager->dirtyCache ){ PgHdr *pPg; assert( pPager->journalOpen ); /* If a master journal file name has already been written to the ** journal file, then no sync is required. This happens when it is ** written, then the process fails to upgrade from a RESERVED to an ** EXCLUSIVE lock. The next time the process tries to commit the ** transaction the m-j name will have already been written. */ if( !pPager->setMaster ){ //pager修改計(jì)數(shù) rc = pager_incr_changecounter(pPager); if( rc!=SQLITE_OK ) goto sync_exit;#ifndef SQLITE_OMIT_AUTOVACUUM if( nTrunc!=0 ){ /* If this transaction has made the database smaller, then all pages ** being discarded by the truncation must be written to the journal ** file. */ Pgno i; void *pPage; int iSkip = PAGER_MJ_PGNO(pPager); for( i=nTrunc+1; i<=pPager->origDbSize; i++ ){ if( !(pPager->aInJournal[i/8] & (1<<(i&7))) && i!=iSkip ){ rc = sqlite3pager_get(pPager, i, &pPage); if( rc!=SQLITE_OK ) goto sync_exit; rc = sqlite3pager_write(pPage); sqlite3pager_unref(pPage); if( rc!=SQLITE_OK ) goto sync_exit; } } }#endif rc = writeMasterJournal(pPager, zMaster); if( rc!=SQLITE_OK ) goto sync_exit; //sync日志文件 rc = syncJournal(pPager); if( rc!=SQLITE_OK ) goto sync_exit; }#ifndef SQLITE_OMIT_AUTOVACUUM if( nTrunc!=0 ){ rc = sqlite3pager_truncate(pPager, nTrunc); if( rc!=SQLITE_OK ) goto sync_exit; }#endif /* Write all dirty pages to the database file */ pPg = pager_get_all_dirty_pages(pPager); //把所有臟頁面寫回操作系統(tǒng)文件 rc = pager_write_pagelist(pPg); if( rc!=SQLITE_OK ) goto sync_exit; /* Sync the database file. */ //sync數(shù)據(jù)庫文件 if( !pPager->noSync ){ rc = sqlite3OsSync(pPager->fd, 0); } pPager->state = PAGER_SYNCED; }else if( MEMDB && nTrunc!=0 ){ rc = sqlite3pager_truncate(pPager, nTrunc); }sync_exit: return rc;}
接下來的過程如下圖所示:
11、刪除日志文件(Deleting The Rollback Journal) 一旦更改寫入設(shè)備,日志文件將會(huì)被刪除,這是事務(wù)真正提交的時(shí)刻。如果在這之前系統(tǒng)發(fā)生崩潰,就會(huì)進(jìn)行恢復(fù)處理,使得數(shù)據(jù)庫和沒發(fā)生改變一樣;如果在這之后系統(tǒng)發(fā)生崩潰,表明所有的更改都已經(jīng)寫入磁盤。SQLite就是根據(jù)日志存在情況決定是否對數(shù)據(jù)庫進(jìn)行恢復(fù)處理。刪除文件本質(zhì)上不是一個(gè)原子操作,但是從用戶進(jìn)程的角度來看是一個(gè)原子操作,所以一個(gè)事務(wù)看起來是一個(gè)原子操作。 在許多系統(tǒng)中,刪除文件也是一個(gè)高代價(jià)的操作。作為優(yōu)化,SQLite可以配置成把日志文件的長度截為0或者把日志文件頭清零。

12、釋放鎖(Releasing The Lock) 作為原子提交的最后一步,釋放排斥鎖使得其它進(jìn)程可以開始訪問數(shù)據(jù)庫。 下圖中,我們指明了當(dāng)鎖被釋放的時(shí)候用戶空間所擁有的信息已經(jīng)被清空了。對于老版本的SQLite可以這么認(rèn)為,但最新的SQLite會(huì)保存些用戶空間的緩存不會(huì)被清空,可能下一個(gè)事務(wù)開始的時(shí)候,這些數(shù)據(jù)剛好可以用上。重新利用這些內(nèi)存要比再次從操作系統(tǒng)磁盤緩存或者硬盤中讀取輕松和快捷得多。在再次使用這些數(shù)據(jù)之前,我們必須先取得一個(gè)共享鎖,同時(shí)我們還不得不去檢查一下,保證還沒有其他進(jìn)程在我們擁有共享鎖之前對數(shù)據(jù)庫文件進(jìn)行了修改。數(shù)據(jù)庫文件的第一頁中有一個(gè)計(jì)數(shù)器,數(shù)據(jù)庫文件每做一次修改,這個(gè)計(jì)數(shù)器就會(huì)增長一下。我們可以通過檢查這個(gè)計(jì)數(shù)器就可得知是否有其他進(jìn)程修改過數(shù)據(jù)庫文件。如果數(shù)據(jù)庫文件已經(jīng)被修改過了,那么用戶內(nèi)存空間的緩存就不得不清空,并重新讀入。大多數(shù)情況下,這種情況不大會(huì)發(fā)生,因此用戶空間的內(nèi)存緩存將是有效的,這對于性能提高來說作用是顯著的。

以上2步(以上2步是在sqlite3BtreeCommit()--btree.c函數(shù)中實(shí)現(xiàn)的)代碼如下:

//提交事務(wù),至此一個(gè)事務(wù)完成.主要做兩件事://刪除日志文件,釋放數(shù)據(jù)庫文件的寫鎖int sqlite3BtreeCommit(Btree *p){ BtShared *pBt = p->pBt; btreeIntegrity(p); /* If the handle has a write-transaction open, commit the shared-btrees ** transaction and set the shared state to TRANS_READ. */ if( p->inTrans==TRANS_WRITE ){ int rc; assert( pBt->inTransaction==TRANS_WRITE ); assert( pBt->nTransaction>0 ); //調(diào)用pager,提交事務(wù) rc = sqlite3pager_commit(pBt->pPager); if( rc!=SQLITE_OK ){ return rc; } pBt->inTransaction = TRANS_READ; pBt->inStmt = 0; } unlockAllTables(p); /* If the handle has any kind of transaction open, decrement the transaction ** count of the shared btree. If the transaction count reaches 0, set ** the shared state to TRANS_NONE. The unlockBtreeIfUnused() call below ** will unlock the pager. */ if( p->inTrans!=TRANS_NONE ){ pBt->nTransaction--; if( 0==pBt->nTransaction ){ pBt->inTransaction = TRANS_NONE; } }}//提交事務(wù),主要調(diào)用pager_unwritelock()函數(shù)int sqlite3pager_commit(Pager *pPager){ int rc; PgHdr *pPg; if( pPager->errCode ){ return pPager->errCode; } if( pPager->state<PAGER_RESERVED ){ return SQLITE_ERROR; } TRACE2("COMMIT %d/n", PAGERID(pPager)); if( MEMDB ){ pPg = pager_get_all_dirty_pages(pPager); while( pPg ){ clearHistory(PGHDR_TO_HIST(pPg, pPager)); pPg->dirty = 0; pPg->inJournal = 0; pPg->inStmt = 0; pPg->needSync = 0; pPg->pPrevStmt = pPg->pNextStmt = 0; pPg = pPg->pDirty; } pPager->pDirty = 0;#ifndef NDEBUG for(pPg=pPager->pAll; pPg; pPg=pPg->pNextAll){ PgHistory *pHist = PGHDR_TO_HIST(pPg, pPager); assert( !pPg->alwaysRollback ); assert( !pHist->pOrig ); assert( !pHist->pStmt ); }#endif pPager->pStmt = 0; pPager->state = PAGER_SHARED; return SQLITE_OK; } if( pPager->dirtyCache==0 ){ /* Exit early (without doing the time-consuming sqlite3OsSync() calls) ** if there have been no changes to the database file. */ assert( pPager->needSync==0 ); rc = pager_unwritelock(pPager); pPager->dbSize = -1; return rc; } assert( pPager->journalOpen ); rc = sqlite3pager_sync(pPager, 0, 0); //刪除文件,釋放寫鎖 if( rc==SQLITE_OK ){ rc = pager_unwritelock(pPager); pPager->dbSize = -1; } return rc;}//對數(shù)據(jù)庫加read lock,刪除日志文件static int pager_unwritelock(Pager *pPager){ PgHdr *pPg; int rc; assert( !MEMDB ); if( pPager->state<PAGER_RESERVED ){ return SQLITE_OK; } sqlite3pager_stmt_commit(pPager); if( pPager->stmtOpen ){ sqlite3OsClose(&pPager->stfd); pPager->stmtOpen = 0; } if( pPager->journalOpen ){ //關(guān)閉日志文件 sqlite3OsClose(&pPager->jfd); pPager->journalOpen = 0; //刪除日志文件 sqlite3OsDelete(pPager->zJournal); sqliteFree( pPager->aInJournal ); pPager->aInJournal = 0; for(pPg=pPager->pAll; pPg; pPg=pPg->pNextAll){ pPg->inJournal = 0; pPg->dirty = 0; pPg->needSync = 0;#ifdef SQLITE_CHECK_PAGES pPg->pageHash = pager_pagehash(pPg);#endif } pPager->pDirty = 0; pPager->dirtyCache = 0; pPager->nRec = 0; }else{ assert( pPager->aInJournal==0 ); assert( pPager->dirtyCache==0 || pPager->useJournal==0 ); } //釋放寫鎖,加讀鎖 rc = sqlite3OsUnlock(pPager->fd, SHARED_LOCK); pPager->state = PAGER_SHARED; pPager->origDbSize = 0; pPager->setMaster = 0; pPager->needSync = 0; pPager->pFirstSynced = pPager->pFirst; return rc;}
下圖可進(jìn)一步描述該過程:

其中sqlite3BtreeSync()和sqlite3BtreeCommit()是如何被調(diào)用的?
一般來說,事務(wù)提交方式為自動(dòng)提交的話,在虛擬機(jī)中的OP_Halt指令實(shí)現(xiàn)提交事務(wù),相關(guān)代碼如下:

//虛擬機(jī)停機(jī)指令case OP_Halt: { /* no-push */ p->pTos = pTos; p->rc = pOp->p1; p->pc = pc; p->errorAction = pOp->p2; if( pOp->p3 ){ sqlite3SetString(&p->zErrMsg, pOp->p3, (char*)0); } //設(shè)置虛擬機(jī)狀態(tài)SQLITE_MAGIC_RUN 為 SQLITE_MAGIC_HALT, //并提交事務(wù) rc = sqlite3VdbeHalt(p); assert( rc==SQLITE_BUSY || rc==SQLITE_OK ); if( rc==SQLITE_BUSY ){ p->rc = SQLITE_BUSY; return SQLITE_BUSY; } return p->rc ? SQLITE_ERROR : SQLITE_DONE;}//當(dāng)虛擬機(jī)要停機(jī)時(shí),調(diào)用該函數(shù),如果VDBE改變了數(shù)據(jù)庫且為自動(dòng)//提交模式,則提交這些改變int sqlite3VdbeHalt(Vdbe *p){ sqlite3 *db = p->db; int i; int (*xFunc)(Btree *pBt) = 0; /* Function to call on each btree backend */ int isSpecialError; /* Set to true if SQLITE_NOMEM or IOERR */ /* This function contains the logic that determines if a statement or ** transaction will be committed or rolled back as a result of the ** execution of this virtual machine. ** ** Special errors: ** ** If an SQLITE_NOMEM error has occured in a statement that writes to ** the database, then either a statement or transaction must be rolled ** back to ensure the tree-structures are in a consistent state. A ** statement transaction is rolled back if one is open, otherwise the ** entire transaction must be rolled back. ** ** If an SQLITE_IOERR error has occured in a statement that writes to ** the database, then the entire transaction must be rolled back. The ** I/O error may have caused garbage to be written to the journal ** file. Were the transaction to continue and eventually be rolled ** back that garbage might end up in the database file. ** ** In both of the above cases, the Vdbe.errorAction variable is ** ignored. If the sqlite3.autoCommit flag is false and a transaction ** is rolled back, it will be set to true. ** ** Other errors: ** ** No error: ** */ if( sqlite3MallocFailed() ){ p->rc = SQLITE_NOMEM; } if( p->magic!=VDBE_MAGIC_RUN ){ /* Already halted. Nothing to do. */ assert( p->magic==VDBE_MAGIC_HALT ); return SQLITE_OK; } //釋放虛擬機(jī)中所有的游標(biāo) closeAllCursors(p); checkActiveVdbeCnt(db); /* No commit or rollback needed if the program never started */ if( p->pc>=0 ){ /* Check for one of the special errors - SQLITE_NOMEM or SQLITE_IOERR */ isSpecialError = ((p->rc==SQLITE_NOMEM || p->rc==SQLITE_IOERR)?1:0); if( isSpecialError ){ /* This loop does static analysis of the query to see which of the ** following three categories it falls into: ** ** Read-only ** Query with statement journal ** Query without statement journal ** ** We could do something more elegant than this static analysis (i.e. ** store the type of query as part of the compliation phase), but ** handling malloc() or IO failure is a fairly obscure edge case so ** this is probably easier. Todo: Might be an opportunity to reduce ** code size a very small amount though */ int isReadOnly = 1; int isStatement = 0; assert(p->aOp || p->nOp==0); for(i=0; i<p->nOp; i++){ switch( p->aOp[i].opcode ){ case OP_Transaction: isReadOnly = 0; break; case OP_Statement: isStatement = 1; break; } } /* If the query was read-only, we need do no rollback at all. Otherwise, ** proceed with the special handling. */ if( !isReadOnly ){ if( p->rc==SQLITE_NOMEM && isStatement ){ xFunc = sqlite3BtreeRollbackStmt; }else{ /* We are forced to roll back the active transaction. Before doing ** so, abort any other statements this handle currently has active. */ sqlite3AbortOtherActiveVdbes(db, p); sqlite3RollbackAll(db); db->autoCommit = 1; } } } /* If the auto-commit flag is set and this is the only active vdbe, then ** we do either a commit or rollback of the current transaction. ** ** Note: This block also runs if one of the special errors handled ** above has occured. */ //如果自動(dòng)提交事務(wù),則提交事務(wù) if( db->autoCommit && db->activeVdbeCnt==1 ){ if( p->rc==SQLITE_OK || (p->errorAction==OE_Fail && !isSpecialError) ){ /* The auto-commit flag is true, and the vdbe program was ** successful or hit an 'OR FAIL' constraint. This means a commit ** is required. */ //提交事務(wù) int rc = vdbeCommit(db); if( rc==SQLITE_BUSY ){ return SQLITE_BUSY; }else if( rc!=SQLITE_OK ){ p->rc = rc; sqlite3RollbackAll(db); }else{ sqlite3CommitInternalChanges(db); } }else{ sqlite3RollbackAll(db); } }else if( !xFunc ){ if( p->rc==SQLITE_OK || p->errorAction==OE_Fail ){ xFunc = sqlite3BtreeCommitStmt; }else if( p->errorAction==OE_Abort ){ xFunc = sqlite3BtreeRollbackStmt; }else{ sqlite3AbortOtherActiveVdbes(db, p); sqlite3RollbackAll(db); db->autoCommit = 1; } } /* If xFunc is not NULL, then it is one of sqlite3BtreeRollbackStmt or ** sqlite3BtreeCommitStmt. Call it once on each backend. If an error occurs ** and the return code is still SQLITE_OK, set the return code to the new ** error value. */ assert(!xFunc || xFunc==sqlite3BtreeCommitStmt || xFunc==sqlite3BtreeRollbackStmt ); for(i=0; xFunc && i<db->nDb; i++){ int rc; Btree *pBt = db->aDb[i].pBt; if( pBt ){ rc = xFunc(pBt); if( rc && (p->rc==SQLITE_OK || p->rc==SQLITE_CONSTRAINT) ){ p->rc = rc; sqlite3SetString(&p->zErrMsg, 0); } } } /* If this was an INSERT, UPDATE or DELETE and the statement was committed, ** set the change counter. */ if( p->changeCntOn && p->pc>=0 ){ if( !xFunc || xFunc==sqlite3BtreeCommitStmt ){ sqlite3VdbeSetChanges(db, p->nChange); }else{ sqlite3VdbeSetChanges(db, 0); } p->nChange = 0; } /* Rollback or commit any schema changes that occurred. */ if( p->rc!=SQLITE_OK && db->flags&SQLITE_InternChanges ){ sqlite3ResetInternalSchema(db, 0); db->flags = (db->flags | SQLITE_InternChanges); } } /* We have successfully halted and closed the VM. Record this fact. */ if( p->pc>=0 ){ db->activeVdbeCnt--; } p->magic = VDBE_MAGIC_HALT; checkActiveVdbeCnt(db); return SQLITE_OK;}//提交事務(wù),主要調(diào)用://sqlite3BtreeSync()--同步btree, sqlite3BtreeCommit()---提交事務(wù)static int vdbeCommit(sqlite3 *db){ int i; int nTrans = 0; /* Number of databases with an active write-transaction */ int rc = SQLITE_OK; int needXcommit = 0; for(i=0; i<db->nDb; i++){ Btree *pBt = db->aDb[i].pBt; if( pBt && sqlite3BtreeIsInTrans(pBt) ){ needXcommit = 1; if( i!=1 ) nTrans++; } } /* If there are any write-transactions at all, invoke the commit hook */ if( needXcommit && db->xCommitCallback ){ sqlite3SafetyOff(db); rc = db->xCommitCallback(db->pCommitArg); sqlite3SafetyOn(db); if( rc ){ return SQLITE_CONSTRAINT; } } /* The simple case - no more than one database file (not counting the ** TEMP database) has a transaction active. There is no need for the ** master-journal. ** ** If the return value of sqlite3BtreeGetFilename() is a zero length ** string, it means the main database is :memory:. In that case we do ** not support atomic multi-file commits, so use the simple case then ** too. */ //簡單的情況,只有一個(gè)數(shù)據(jù)庫文件,不需要master-journal if( 0==strlen(sqlite3BtreeGetFilename(db->aDb[0].pBt)) || nTrans<=1 ){ for(i=0; rc==SQLITE_OK && i<db->nDb; i++){ Btree *pBt = db->aDb[i].pBt; if( pBt ){ //同步btree rc = sqlite3BtreeSync(pBt, 0); } } /* Do the commit only if all databases successfully synced */ //commite事務(wù) if( rc==SQLITE_OK ){ for(i=0; i<db->nDb; i++){ Btree *pBt = db->aDb[i].pBt; if( pBt ){ sqlite3BtreeCommit(pBt); } } } } /* The complex case - There is a multi-file write-transaction active. ** This requires a master journal file to ensure the transaction is ** committed atomicly. */#ifndef SQLITE_OMIT_DISKIO else{ int needSync = 0; char *zMaster = 0; /* File-name for the master journal */ char const *zMainFile = sqlite3BtreeGetFilename(db->aDb[0].pBt); OsFile *master = 0; /* Select a master journal file name */ do { u32 random; sqliteFree(zMaster); sqlite3Randomness(sizeof(random), &random); zMaster = sqlite3MPrintf("%s-mj%08X", zMainFile, random&0x7fffffff); if( !zMaster ){ return SQLITE_NOMEM; } }while( sqlite3OsFileExists(zMaster) ); /* Open the master journal. */ rc = sqlite3OsOpenExclusive(zMaster, &master, 0); if( rc!=SQLITE_OK ){ sqliteFree(zMaster); return rc; } /* Write the name of each database file in the transaction into the new ** master journal file. If an error occurs at this point close ** and delete the master journal file. All the individual journal files ** still have 'null' as the master journal pointer, so they will roll ** back independently if a failure occurs. */ for(i=0; i<db->nDb; i++){ Btree *pBt = db->aDb[i].pBt; if( i==1 ) continue; /* Ignore the TEMP database */ if( pBt && sqlite3BtreeIsInTrans(pBt) ){ char const *zFile = sqlite3BtreeGetJournalname(pBt); if( zFile[0]==0 ) continue; /* Ignore :memory: databases */ if( !needSync && !sqlite3BtreeSyncDisabled(pBt) ){ needSync = 1; } rc = sqlite3OsWrite(master, zFile, strlen(zFile)+1); if( rc!=SQLITE_OK ){ sqlite3OsClose(&master); sqlite3OsDelete(zMaster); sqliteFree(zMaster); return rc; } } } /* Sync the master journal file. Before doing this, open the directory ** the master journal file is store in so that it gets synced too. */ zMainFile = sqlite3BtreeGetDirname(db->aDb[0].pBt); rc = sqlite3OsOpenDirectory(master, zMainFile); if( rc!=SQLITE_OK || (needSync && (rc=sqlite3OsSync(master,0))!=SQLITE_OK) ){ sqlite3OsClose(&master); sqlite3OsDelete(zMaster); sqliteFree(zMaster); return rc; } /* Sync all the db files involved in the transaction. The same call ** sets the master journal pointer in each individual journal. If ** an error occurs here, do not delete the master journal file. ** ** If the error occurs during the first call to sqlite3BtreeSync(), ** then there is a chance that the master journal file will be ** orphaned. But we cannot delete it, in case the master journal ** file name was written into the journal file before the failure ** occured. */ for(i=0; i<db->nDb; i++){ Btree *pBt = db->aDb[i].pBt; if( pBt && sqlite3BtreeIsInTrans(pBt) ){ rc = sqlite3BtreeSync(pBt, zMaster); if( rc!=SQLITE_OK ){ sqlite3OsClose(&master); sqliteFree(zMaster); return rc; } } } sqlite3OsClose(&master); /* Delete the master journal file. This commits the transaction. After ** doing this the directory is synced again before any individual ** transaction files are deleted. */ rc = sqlite3OsDelete(zMaster); assert( rc==SQLITE_OK ); sqliteFree(zMaster); zMaster = 0; rc = sqlite3OsSyncDirectory(zMainFile); if( rc!=SQLITE_OK ){ /* This is not good. The master journal file has been deleted, but ** the directory sync failed. There is no completely safe course of ** action from here. The individual journals contain the name of the ** master journal file, but there is no way of knowing if that ** master journal exists now or if it will exist after the Operating ** system crash that may follow the fsync() failure. */ return rc; } /* All files and directories have already been synced, so the following ** calls to sqlite3BtreeCommit() are only closing files and deleting ** journals. If something goes wrong while this is happening we don't ** really care. The integrity of the transaction is already guaranteed, ** but some stray 'cold' journals may be lying around. Returning an ** error code won't help matters. */ for(i=0; i<db->nDb; i++){ Btree *pBt = db->aDb[i].pBt; if( pBt ){ sqlite3BtreeCommit(pBt); } } }#endif return rc;}
Page Cache之并發(fā)控制
pager層是SQLite實(shí)現(xiàn)最為核心的模塊,它具有四大功能:I/O、頁面緩存、并發(fā)控制和日志恢復(fù)。而這些功能不僅是上層Btree的基礎(chǔ),而且對系統(tǒng)的性能和健壯性有至關(guān)重要的影響。其中并發(fā)控制和日志恢復(fù)是事務(wù)處理實(shí)現(xiàn)的基礎(chǔ)。SQLite并發(fā)控制的機(jī)制非常簡單——即封鎖機(jī)制;另外,它的查詢優(yōu)化機(jī)制也非常簡單——基于索引。這一切使得整個(gè)SQLite的實(shí)現(xiàn)變得簡單,同時(shí)變得很小,保證其運(yùn)行速度非常快,所以特別適合嵌入式設(shè)備。SQLite是基于鎖來實(shí)現(xiàn)并發(fā)控制的,其鎖機(jī)制實(shí)現(xiàn)得非常簡單而巧妙。
SQLite的并發(fā)控制機(jī)制是采用加鎖的方式,實(shí)現(xiàn)簡單,也非常巧妙,如下圖所示:

1、RESERVED LOCK RESERVED鎖意味著進(jìn)程將要對數(shù)據(jù)庫進(jìn)行寫操作。某一時(shí)刻只能有一個(gè)RESERVED Lock,但是RESERVED鎖和SHARED鎖可以共存,而且可以對數(shù)據(jù)庫加新的SHARED鎖。 為什么要用RESERVED鎖? 主要是出于并發(fā)性的考慮。由于SQLite只有庫級排斥鎖(EXCLUSIVE LOCK),如果寫事務(wù)一開始就上EXCLUSIVE鎖,然后再進(jìn)行實(shí)際的數(shù)據(jù)更新,寫磁盤操作,這會(huì)使得并發(fā)性大大降低。而SQLite一旦得到數(shù)據(jù)庫的RESERVED鎖,就可以對緩存中的數(shù)據(jù)進(jìn)行修改,而與此同時(shí),其它進(jìn)程可以繼續(xù)進(jìn)行讀操作。直到真正需要寫磁盤時(shí)才對數(shù)據(jù)庫加EXCLUSIVE鎖。
2、PENDING LOCK PENDING LOCK意味著進(jìn)程已經(jīng)完成緩存中的數(shù)據(jù)修改,并想立即將更新寫入磁盤。它將等待此時(shí)已經(jīng)存在的讀鎖事務(wù)完成,但是不允許對數(shù)據(jù)庫加新的SHARED LOCK(這與RESERVED LOCK相區(qū)別)。 為什么要有PENDING LOCK? 主要是為了防止出現(xiàn)寫?zhàn)I死的情況。由于寫事務(wù)先要獲取RESERVED LOCK,所以可能一直產(chǎn)生新的SHARED LOCK,使得寫事務(wù)發(fā)生餓死的情況。
3、加鎖機(jī)制的具體實(shí)現(xiàn)
SQLite在pager層獲取鎖的函數(shù)如下:

//獲取一個(gè)文件的鎖,如果忙則重復(fù)該操作,//直到busy回調(diào)函數(shù)返回flase,或者成功獲得鎖static int pager_wait_on_lock(Pager *pPager, int locktype){ int rc; assert( PAGER_SHARED==SHARED_LOCK ); assert( PAGER_RESERVED==RESERVED_LOCK ); assert( PAGER_EXCLUSIVE==EXCLUSIVE_LOCK ); if( pPager->state>=locktype ){ rc = SQLITE_OK; }else{ //重復(fù)直到獲得鎖 do { rc = sqlite3OsLock(pPager->fd, locktype); }while( rc==SQLITE_BUSY && sqlite3InvokeBusyHandler(pPager->pBusyHandler) ); if( rc==SQLITE_OK ){ //設(shè)置pager的狀態(tài) pPager->state = locktype; } } return rc;}
Windows下具體的實(shí)現(xiàn)如下:

static int winLock(OsFile *id, int locktype){ int rc = SQLITE_OK; /* Return code from subroutines */ int res = 1; /* Result of a windows lock call */ int newLocktype; /* Set id->locktype to this value before exiting */ int gotPendingLock = 0;/* True if we acquired a PENDING lock this time */ winFile *pFile = (winFile*)id; assert( pFile!=0 ); TRACE5("LOCK %d %d was %d(%d)/n", pFile->h, locktype, pFile->locktype, pFile->sharedLockByte); /* If there is already a lock of this type or more restrictive on the ** OsFile, do nothing. Don't use the end_lock: exit path, as ** sqlite3OsEnterMutex() hasn't been called yet. */ //當(dāng)前的鎖>=locktype,則返回 if( pFile->locktype>=locktype ){ return SQLITE_OK; } /* Make sure the locking sequence is correct */ assert( pFile->locktype!=NO_LOCK || locktype==SHARED_LOCK ); assert( locktype!=PENDING_LOCK ); assert( locktype!=RESERVED_LOCK || pFile->locktype==SHARED_LOCK ); /* Lock the PENDING_LOCK byte if we need to acquire a PENDING lock or ** a SHARED lock. If we are acquiring a SHARED lock, the acquisition of ** the PENDING_LOCK byte is temporary. */ newLocktype = pFile->locktype; /*兩種情況: (1)如果當(dāng)前文件處于無鎖狀態(tài)(獲取讀鎖--讀事務(wù) **和寫事務(wù)在最初階段都要經(jīng)歷的階段), **(2)處于RESERVED_LOCK,且請求的鎖為EXCLUSIVE_LOCK(寫事務(wù)) **則對執(zhí)行加PENDING_LOCK */ /////////////////////(1)/////////////////// if( pFile->locktype==NO_LOCK || (locktype==EXCLUSIVE_LOCK && pFile->locktype==RESERVED_LOCK) ){ int cnt = 3; //加pending鎖 while( cnt-->0 && (res = LockFile(pFile->h, PENDING_BYTE, 0, 1, 0))==0 ){ /* Try 3 times to get the pending lock. The pending lock might be ** held by another reader process who will release it momentarily. */ TRACE2("could not get a PENDING lock. cnt=%d/n", cnt); Sleep(1); } //設(shè)置為gotPendingLock為1,使和在后面要釋放PENDING鎖 gotPendingLock = res; } /* Acquire a shared lock */ /*獲取shared lock **此時(shí),事務(wù)應(yīng)該持有PENDING鎖,而PENDING鎖作為事務(wù)從UNLOCKED到 **SHARED_LOCKED的一個(gè)過渡,所以事務(wù)由PENDING->SHARED **此時(shí),實(shí)際上鎖處于兩個(gè)狀態(tài):PENDING和SHARED, **直到后面釋放PENDING鎖后,才真正處于SHARED狀態(tài) */ ////////////////(2)///////////////////////////////////// if( locktype==SHARED_LOCK && res ){ assert( pFile->locktype==NO_LOCK ); res = getReadLock(pFile); if( res ){ newLocktype = SHARED_LOCK; } } /* Acquire a RESERVED lock */ /*獲取RESERVED **此時(shí)事務(wù)持有SHARED_LOCK,變化過程為SHARED->RESERVED。 **RESERVED鎖的作用就是為了提高系統(tǒng)的并發(fā)性能 */ ////////////////////////(3)///////////////////////////////// if( locktype==RESERVED_LOCK && res ){ assert( pFile->locktype==SHARED_LOCK ); //加RESERVED鎖 res = LockFile(pFile->h, RESERVED_BYTE, 0, 1, 0); if( res ){ newLocktype = RESERVED_LOCK; } } /* Acquire a PENDING lock */ /*獲取PENDING鎖 **此時(shí)事務(wù)持有RESERVED_LOCK,且事務(wù)申請EXCLUSIVE_LOCK **變化過程為:RESERVED->PENDING。 **PENDING狀態(tài)只是唯一的作用就是防止寫?zhàn)I死. **讀事務(wù)不會(huì)執(zhí)行該代碼,但是寫事務(wù)會(huì)執(zhí)行該代碼, **執(zhí)行該代碼后gotPendingLock設(shè)為0,后面就不會(huì)釋放PENDING鎖。 */ //////////////////////////////(4)//////////////////////////////// if( locktype==EXCLUSIVE_LOCK && res ){ //這里沒有實(shí)際的加鎖操作,只是把鎖的狀態(tài)改為PENDING狀態(tài) newLocktype = PENDING_LOCK; //設(shè)置了gotPendingLock,后面就不會(huì)釋放PENDING鎖了, //相當(dāng)于加了PENDING鎖,實(shí)際上是在開始處加的PENDING鎖 gotPendingLock = 0; } /* Acquire an EXCLUSIVE lock */ /*獲取EXCLUSIVE鎖 **當(dāng)一個(gè)事務(wù)執(zhí)行該代碼時(shí),它應(yīng)該滿足以下條件: **(1)鎖的狀態(tài)為:PENDING (2)是一個(gè)寫事務(wù) **變化過程:PENDING->EXCLUSIVE */ /////////////////////////(5)/////////////////////////////////////////// if( locktype==EXCLUSIVE_LOCK && res ){ assert( pFile->locktype>=SHARED_LOCK ); res = unlockReadLock(pFile); TRACE2("unreadlock = %d/n", res); res = LockFile(pFile->h, SHARED_FIRST, 0, SHARED_SIZE, 0); if( res ){ newLocktype = EXCLUSIVE_LOCK; }else{ TRACE2("error-code = %d/n", GetLastError()); } } /* If we are holding a PENDING lock that ought to be released, then ** release it now. */ /*此時(shí)事務(wù)在第2步中獲得PENDING鎖,它將申請SHARED_LOCK(第3步,和圖形相對照), **而在之前它已經(jīng)獲取了PENDING鎖, **所以在這里它需要釋放PENDING鎖,此時(shí)鎖的變化為:PENDING->SHARED */ //////////////////////////(6)///////////////////////////////////// if( gotPendingLock && locktype==SHARED_LOCK ){ UnlockFile(pFile->h, PENDING_BYTE, 0, 1, 0); } /* Update the state of the lock has held in the file descriptor then ** return the appropriate result code. */ if( res ){ rc = SQLITE_OK; }else{ TRACE4("LOCK FAILED %d trying for %d but got %d/n", pFile->h, locktype, newLocktype); rc = SQLITE_BUSY; } //在這里設(shè)置文件鎖的狀態(tài) pFile->locktype = newLocktype; return rc;}
在幾個(gè)關(guān)鍵的部位標(biāo)記數(shù)字。
(I)對于一個(gè)讀事務(wù)會(huì)的完整經(jīng)過:語句序列:(1)——>(2)——>(6)相應(yīng)的狀態(tài)真正的變化過程為:UNLOCKED→PENDING(1)→PENDING、SHARED(2)→SHARED(6)→UNLOCKED
(II)對于一個(gè)寫事務(wù)完整經(jīng)過:第一階段:語句序列:(1)——>(2)——>(6)狀態(tài)變化:UNLOCKED→PENDING(1)→PENDING、SHARED(2)→SHARED(6)。此時(shí)事務(wù)獲得SHARED LOCK。第二個(gè)階段:語句序列:(3)此時(shí)事務(wù)獲得RESERVED LOCK。第三個(gè)階段:事務(wù)執(zhí)行修改操作。第四個(gè)階段:語句序列:(1)——>(4)——>(5)狀態(tài)變化為:RESERVED→ RESERVED 、PENDING(1)→PENDING(4)→EXCLUSIVE(5)。此時(shí)事務(wù)獲得排斥鎖,就可以進(jìn)行寫磁盤操作了。
注:在上面的過程中,由于(1)的執(zhí)行,使得某些時(shí)刻SQLite處于兩種狀態(tài),但它持續(xù)的時(shí)間很短,從某種程度上來說可以忽略,但是為了把問題說清楚,在這里描述了這一微妙而巧妙的過程。
4、SQLite的死鎖問題 SQLite的加鎖機(jī)制會(huì)不會(huì)出現(xiàn)死鎖? 這是一個(gè)很有意思的問題,對于任何采取加鎖作為并發(fā)控制機(jī)制的DBMS都得考慮這個(gè)問題。有兩種方式處理死鎖問題:(1)死鎖預(yù)防(deadlock prevention)(2)死鎖檢測(deadlock detection)與死鎖恢復(fù)(deadlock recovery)。SQLite采取了第一種方式,如果一個(gè)事務(wù)不能獲取鎖,它會(huì)重試有限次(這個(gè)重試次數(shù)可以由應(yīng)用程序運(yùn)行預(yù)先設(shè)置,默認(rèn)為1次)——這實(shí)際上是基本鎖超時(shí)的機(jī)制。如果還是不能獲取鎖,SQLite返回SQLITE_BUSY錯(cuò)誤給應(yīng)用程序,應(yīng)用程序此時(shí)應(yīng)該中斷,之后再重試;或者中止當(dāng)前事務(wù)。雖然基于鎖超時(shí)的機(jī)制簡單,容易實(shí)現(xiàn),但是它的缺點(diǎn)也是明顯的——資源浪費(fèi)。
5、事務(wù)類型(Transaction Types) 既然SQLite采取了這種機(jī)制,所以應(yīng)用程序得處理SQLITE_BUSY錯(cuò)誤,先來看一個(gè)會(huì)產(chǎn)生SQLITE_BUSY錯(cuò)誤的例子:

所以應(yīng)用程序應(yīng)該盡量避免產(chǎn)生死鎖,那么應(yīng)用程序如何做可以避免死鎖的產(chǎn)生呢? 答案就是為你的程序選擇正確合適的事務(wù)類型。 SQLite有三種不同的事務(wù)類型,這不同于鎖的狀態(tài)。事務(wù)可以從DEFERRED、IMMEDIATE或者EXCLUSIVE,一個(gè)事務(wù)的類型在BEGIN命令中指定:
BEGIN [ DEFERRED | IMMEDIATE | EXCLUSIVE ] TRANSACTION; 一個(gè)deferred事務(wù)不獲取任何鎖,直到它需要鎖的時(shí)候,而且BEGIN語句本身也不會(huì)做什么事情——它開始于UNLOCK狀態(tài);默認(rèn)情況下是這樣的。如果僅僅用BEGIN開始一個(gè)事務(wù),那么事務(wù)就是DEFERRED的,同時(shí)它不會(huì)獲取任何鎖,當(dāng)對數(shù)據(jù)庫進(jìn)行第一次讀操作時(shí),它會(huì)獲取SHARED LOCK;同樣,當(dāng)進(jìn)行第一次寫操作時(shí),它會(huì)獲取RESERVED LOCK。 由BEGIN開始的Immediate事務(wù)會(huì)試著獲取RESERVED LOCK。如果成功,BEGIN IMMEDIATE保證沒有別的連接可以寫數(shù)據(jù)庫。但是,別的連接可以對數(shù)據(jù)庫進(jìn)行讀操作,但是RESERVED LOCK會(huì)阻止其它的連接BEGIN IMMEDIATE或者BEGIN EXCLUSIVE命令,SQLite會(huì)返回SQLITE_BUSY錯(cuò)誤。這時(shí)你就可以對數(shù)據(jù)庫進(jìn)行修改操作,但是你不能提交,當(dāng)你COMMIT時(shí),會(huì)返回SQLITE_BUSY錯(cuò)誤,這意味著還有其它的讀事務(wù)沒有完成,得等它們執(zhí)行完后才能提交事務(wù)。 Exclusive事務(wù)會(huì)試著獲取對數(shù)據(jù)庫的EXCLUSIVE鎖。這與IMMEDIATE類似,但是一旦成功,EXCLUSIVE事務(wù)保證沒有其它的連接,所以就可對數(shù)據(jù)庫進(jìn)行讀寫操作了。 上面那個(gè)例子的問題在于兩個(gè)連接最終都想寫數(shù)據(jù)庫,但是他們都沒有放棄各自原來的鎖,最終,shared鎖導(dǎo)致了問題的出現(xiàn)。如果兩個(gè)連接都以BEGIN IMMEDIATE開始事務(wù),那么死鎖就不會(huì)發(fā)生。在這種情況下,在同一時(shí)刻只能有一個(gè)連接進(jìn)入BEGIN IMMEDIATE,其它的連接就得等待。BEGIN IMMEDIATE和BEGIN EXCLUSIVE通常被寫事務(wù)使用。就像同步機(jī)制一樣,它防止了死鎖的產(chǎn)生。 基本的準(zhǔn)則是:如果你在使用的數(shù)據(jù)庫沒有其它的連接,用BEGIN就足夠了。但是,如果你使用的數(shù)據(jù)庫在其它的連接也要對數(shù)據(jù)庫進(jìn)行寫操作,就得使用BEGIN IMMEDIATE或BEGIN EXCLUSIVE開始你的事務(wù)。
新聞熱點(diǎn)
疑難解答
圖片精選
網(wǎng)友關(guān)注