Installing tesserocr for Python: Problems Encountered and How to Solve Them

2020-01-04 13:35:38

Source: reposted

Contributed by: a reader

Installing and configuring Tesseract

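When writing crawlers in Python you will inevitably run into CAPTCHAs of all kinds, the simplest being the plain image CAPTCHA. So what do you do when you hit one? This is where OCR comes in. OCR (Optical Character Recognition) is the process of scanning characters and translating their shapes into electronic text. tesserocr is a Python OCR library that wraps Tesseract, so before installing tesserocr we need to install Tesseract itself.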

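Download: https://digi.bib.uni-mannheim.de/tesseract/ . You can pick a stable build, one without "dev" in its name. I downloaded 3.05.01, which is fairly old by now and not especially accurate; readers may prefer the newer 3.05.02, which should recognize text better.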

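After downloading, run the installer and click through it. Near the end, tick the "Additional language data (download)" option, which adds the language packs that let the OCR engine handle other languages, then continue until the installation finishes.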

My install path is: G:/Program Files (x86)/Tesseract-OCR

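Once the installation finishes, configure the environment variables: open the environment variable settings and add the Tesseract install directory (G:/Program Files (x86)/Tesseract-OCR in my case) to Path. With that, Tesseract is installed and configured.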
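A quick way to confirm the Path change took effect is to look the executable up from Python. The following is a minimal sketch, not part of the original post; the helper names `find_tesseract`, `tesseract_version`, and `report` are made up for illustration. Open a new cmd window (or restart PyCharm) before trying it, since already-running programs keep the old Path.

```python
import shutil
import subprocess


def find_tesseract(executable="tesseract"):
    """Return the full path of the executable found on PATH, or None."""
    return shutil.which(executable)


def tesseract_version(path):
    """Return the first line printed by `tesseract --version`."""
    proc = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some builds print the version to stderr
        universal_newlines=True,
    )
    lines = proc.stdout.splitlines()
    return lines[0] if lines else ""


def report():
    """Print whether tesseract is reachable through PATH."""
    path = find_tesseract()
    if path is None:
        print("tesseract not found on PATH; recheck the environment variable")
    else:
        print("found:", path)
        print(tesseract_version(path))


report()
```

If this prints the "not found" message, the Path entry is probably missing or misspelled.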

Installing the tesserocr library

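At first I simply ran pip install tesserocr in cmd, and unfortunately it failed. I did not take a screenshot of my own error, so I cannot show it, but it was the kind of failure that suggests a missing compiler. Does that mean downloading Visual C++? There is a better fix: download the matching prebuilt .whl file.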

Download: https://github.com/simonflueckiger/tesserocr-windows_build/releases . Be sure to download the build that matches your versions.

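My Tesseract is 3.5.1 (the 3.05.01 installed above), so I downloaded the build for that version. Readers should pick whichever matches their own setup.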

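I saved tesserocr-2.2.2-cp36-cp36m-win_amd64.whl to the root of drive G: and ran pip install G:/tesserocr-2.2.2-cp36-cp36m-win_amd64.whl in cmd. That failed as well, with a message saying the .whl file could not be installed. It turned out the wheel package was not installed.

So I installed it with pip install wheel.

After that, running pip install G:/tesserocr-2.2.2-cp36-cp36m-win_amd64.whl again started the installation.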
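Picking the "matching" wheel comes down to the tags in the file name: in tesserocr-2.2.2-cp36-cp36m-win_amd64.whl, cp36 means CPython 3.6 and win_amd64 means 64-bit Windows. The sketch below is an illustration added here, not something from the original post; `wheel_tags` and `wheel_matches` are hypothetical helper names, and the check only compares the Python and platform tags.

```python
import struct
import sys


def wheel_tags():
    """Return (python_tag, windows_platform_tag) for the running interpreter."""
    python_tag = "cp{}{}".format(sys.version_info.major, sys.version_info.minor)
    bits = struct.calcsize("P") * 8  # pointer size in bits: 32 or 64
    platform_tag = "win_amd64" if bits == 64 else "win32"
    return python_tag, platform_tag


def wheel_matches(filename, python_tag, platform_tag):
    """Check the Python and platform tags embedded in a wheel file name."""
    if not filename.endswith(".whl"):
        return False
    parts = filename[: -len(".whl")].split("-")
    if len(parts) < 5:
        return False
    # Layout: name-version[-build]-python_tag-abi_tag-platform_tag
    return parts[-3] == python_tag and parts[-1] == platform_tag


py_tag, plat_tag = wheel_tags()
print("look for a wheel tagged", py_tag, "and", plat_tag)
```

Run it with the same interpreter that pip belongs to; a PyCharm project with its own interpreter may report different tags.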

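Exhausting, but it was finally installed. Then, when I used the tesserocr library in PyCharm, it reported an error again. Why? After some searching on Baidu I got it resolved.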

It turned out that a command has to be run in PyCharm's Terminal (it was shown in a screenshot that is not reproduced here).

If it still reports an error, one more step is needed: copy the tessdata folder from the Tesseract-OCR directory into the Scripts folder of your Python installation.

That should complete the installation for good.
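Copying tessdata works because tesserocr then finds the language files next to the interpreter. An alternative that avoids the copy is to tell tesserocr where the data lives through the `path` argument of `tesserocr.image_to_text`. The sketch below is an assumption-laden illustration rather than the author's method: `has_traineddata` and `ocr_with_data_path` are hypothetical helpers, and what `path` must point at (the tessdata folder itself, or its parent directory ending in a slash) depends on the Tesseract version, so try both if the first fails.

```python
from pathlib import Path


def has_traineddata(tessdata_dir, lang="eng"):
    """Return True if <tessdata_dir>/<lang>.traineddata exists."""
    return (Path(tessdata_dir) / (lang + ".traineddata")).is_file()


def ocr_with_data_path(image_path, tessdata_dir, data_path=None):
    """Run tesserocr on an image using an explicit data location.

    data_path defaults to tessdata_dir; depending on the Tesseract
    version it may need to be the parent directory instead (assumption).
    """
    # Imported lazily so the helper above works without tesserocr installed.
    import tesserocr
    from PIL import Image

    if not has_traineddata(tessdata_dir):
        raise FileNotFoundError(
            "eng.traineddata not found in {}".format(tessdata_dir)
        )
    with Image.open(image_path) as image:
        return tesserocr.image_to_text(image, path=str(data_path or tessdata_dir))
```

With the install path used above, a call might look like ocr_with_data_path('test.jpg', 'G:/Program Files (x86)/Tesseract-OCR/tessdata').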

PyCharm finally stops complaining, so let's see how well it recognizes two test images, test.jpg and image.png.

Code:

    from PIL import Image
    import tesserocr

    # Recognize the first test image
    imag = Image.open('test.jpg')
    print(tesserocr.image_to_text(imag))

    # Recognize the second test image
    imag1 = Image.open('image.png')
    print(tesserocr.image_to_text(imag1))

In the output, 762408 was misread as 162408, which is disappointing. The version I installed is probably just too old.

Those are the problems I ran into while installing tesserocr and how I solved them. Alternatively, you can install the pytesseract library.

Installing pytesseract

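Installing pytesseract is far more convenient than installing tesserocr. It did not raise a single error for me: run pip install pytesseract and you are done. You can also search for the package in PyCharm and install it from there, which is even less work.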
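One caveat worth keeping in mind: pytesseract is a thin wrapper that runs the tesseract executable, so the Tesseract install and Path setup from earlier still matter. If the executable is not on Path, pytesseract lets you point at it through `pytesseract.pytesseract.tesseract_cmd`. The sketch below shows one way to do that; `resolve_tesseract_cmd` and `ocr` are hypothetical helper names, and the candidate path in the usage note is simply the install directory from this article.

```python
import os
import shutil


def resolve_tesseract_cmd(candidates=(), executable="tesseract"):
    """Return the executable from PATH, else the first existing candidate."""
    found = shutil.which(executable)
    if found:
        return found
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


def ocr(image_path, candidates=()):
    """Recognize text with pytesseract, locating tesseract first."""
    # Imported lazily so the resolver can be used without pytesseract installed.
    import pytesseract
    from PIL import Image

    cmd = resolve_tesseract_cmd(candidates)
    if cmd is None:
        raise RuntimeError("tesseract executable not found")
    pytesseract.pytesseract.tesseract_cmd = cmd
    with Image.open(image_path) as im:
        return pytesseract.image_to_string(im)
```

For example: ocr('test.jpg', ['G:/Program Files (x86)/Tesseract-OCR/tesseract.exe']).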

Let's look at the recognition results on the same two images.

Code:

    import pytesseract
    from PIL import Image

    # Recognize the first test image
    im = Image.open('test.jpg')
    print(pytesseract.image_to_string(im))

    # Recognize the second test image
    im1 = Image.open('image.png')
    print(pytesseract.image_to_string(im1))

The output is the same as with tesserocr, so I recommend using pytesseract.

CAPTCHA recognition

I opened the Zhihu login page, downloaded a CAPTCHA image (saved as 3.jpg), and tried to recognize it.

The code:

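    import pytesseract
    from PIL import Image
    import tesserocr

    # Simple attempt: feed the raw image to tesserocr (very poor result)
    image = Image.open('3.jpg')
    result = tesserocr.image_to_text(image)
    print(result)

    # Fuller attempt: grayscale and binarize first (still not great)
    image1 = Image.open('3.jpg')
    image1 = image1.convert('L')  # convert to grayscale
    threshold = 127
    # Lookup table: gray levels below the threshold become 0, the rest 1
    table = []
    for i in range(256):
        if i < threshold:
            table.append(0)
        else:
            table.append(1)
    image2 = image1.point(table, '1')
    image2.show()  # display the binarized image
    result = pytesseract.image_to_string(image2)
    print(result)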

Result: both attempts failed to recognize it.
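A fixed threshold of 127 is a guess that may not suit a particular CAPTCHA's contrast. One common alternative, which is not what the code above does, is Otsu's method: pick the threshold that best separates the dark and light pixels in the image's own histogram. The sketch below is a plain-Python version under that assumption; `otsu_threshold`, `threshold_table`, and `binarize` are hypothetical names, and there is no guarantee it rescues this particular image.

```python
def otsu_threshold(hist):
    """Return a threshold T so that gray levels < T form the dark class.

    hist is a 256-bin histogram, e.g. from PIL's Image.histogram() on an
    'L' mode image. T maximizes the between-class variance (Otsu's method).
    """
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    weight_dark = 0
    sum_dark = 0
    best_level = 0
    best_variance = -1.0
    for level in range(256):
        weight_dark += hist[level]
        sum_dark += level * hist[level]
        if weight_dark == 0:
            continue  # no dark pixels yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance = variance
            best_level = level
    return best_level + 1


def threshold_table(threshold):
    """Build the 256-entry lookup table used by Image.point(table, '1')."""
    return [0 if level < threshold else 1 for level in range(256)]


def binarize(image_path):
    """Grayscale an image and binarize it with an Otsu threshold."""
    from PIL import Image  # imported lazily; only needed for real images

    with Image.open(image_path) as im:
        gray = im.convert("L")
    return gray.point(threshold_table(otsu_threshold(gray.histogram())), "1")
```

The result of binarize('3.jpg') can be passed to pytesseract.image_to_string in place of image2 above.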