Using the jieba Chinese Word Segmentation Tool in Python: A Worked Example [Classic Case]

2019-11-25 16:14:54


Source: reprinted

Contributed by: site user

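This article shows, by example, how to use jieba, a Chinese word segmentation tool for Python. It is shared here for reference; the details follow.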

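Installation and basic usage of jieba were covered in the earlier article "Problems Encountered While Using the jieba Chinese Word Segmentation Tool in Python and Their Solutions". This article is closer to a real application: it reads Chinese text from a file, then uses jieba to segment it and tag each word with its part of speech (POS).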

The example code is as follows:

#coding=utf-8
import jieba
import jieba.posseg as pseg
import time

t1 = time.time()
f = open("t_with_splitter.txt", "r")  # read the input text
string = f.read().decode("utf-8")
words = pseg.cut(string)  # segment the text
result = ""  # accumulates the final output
for w in words:
    result += str(w.word) + "/" + str(w.flag)  # append the POS tag to each word
f = open("t_with_POS_tag.txt", "w")  # save the result to another file
f.write(result)
f.close()
t2 = time.time()
print("Segmentation and POS tagging finished in " + str(t2 - t1) + " seconds.")  # report the result

The file t_with_splitter.txt contains the following text:

武林網是國內專業的網站建設資源、腳本編程學習類網站，提供asp、php、asp.net、javascript、jquery、vbscript、dos批處理、網頁制作、網絡編程、網站建設等編程資料。

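Running this on Python 2.7.9 produces an error (shown as a screenshot in the original post). The failure is most likely a UnicodeEncodeError: pseg.cut returns unicode strings, and in Python 2 str(w.word) tries to encode them with the default ASCII codec, which cannot represent Chinese characters.

After consulting related material, the fix turned out to be adding the following at the top of the script: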

import sys
reload(sys)
sys.setdefaultencoding("utf-8")
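The encoding issue behind this fix can be illustrated without jieba. The sketch below is an assumption about what the error involved, not a reproduction of the original traceback: it shows that the ASCII codec cannot encode Chinese text while UTF-8 can. The helper name `try_encode` is hypothetical, and the sample string is taken from the input file.

```python
# Sample text taken from the input file; any non-ASCII string behaves the same way.
text = u"網站建設"


def try_encode(s, codec):
    """Return the encoded bytes, or None if the codec cannot represent `s`."""
    try:
        return s.encode(codec)
    except UnicodeEncodeError:
        return None


# ASCII covers only code points 0-127, so encoding Chinese characters fails.
ascii_result = try_encode(text, "ascii")

# UTF-8 can encode every Unicode code point, so this succeeds.
utf8_result = try_encode(text, "utf-8")
```

Changing the default encoding makes the implicit conversions in Python 2 behave like the second call. Encoding explicitly, as done here, avoids relying on a process-wide setting.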

The final code should be:

#coding=utf-8
import jieba
import jieba.posseg as pseg
import time
import sys

reload(sys)
sys.setdefaultencoding("utf-8")

t1 = time.time()
f = open("t_with_splitter.txt", "r")  # read the input text
string = f.read().decode("utf-8")
words = pseg.cut(string)  # segment the text
result = ""  # accumulates the final output
for w in words:
    result += str(w.word) + "/" + str(w.flag)  # append the POS tag to each word
f = open("t_with_POS_tag.txt", "w")  # save the result to another file
f.write(result)
f.close()
t2 = time.time()
print("Segmentation and POS tagging finished in " + str(t2 - t1) + " seconds.")  # report the result

With this change the script runs successfully.
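The script above is specific to Python 2: the built-in `reload` and `sys.setdefaultencoding` are not available in Python 3. As a rough sketch of how the same task might look on Python 3, the code below passes an explicit encoding to `open` instead of changing the default. The function names `format_tagged` and `tag_file` and the `sep` parameter are hypothetical additions, not part of the original article or of jieba. The default `sep=""` matches the original output, which joins the tokens without a separator.

```python
import time


def format_tagged(pairs, sep=""):
    """Join (word, flag) pairs into 'word/flag' tokens separated by `sep`."""
    return sep.join("{}/{}".format(word, flag) for word, flag in pairs)


def tag_file(src_path, dst_path, sep=""):
    """Segment and POS-tag a UTF-8 text file with jieba and save the result.

    Returns the elapsed time in seconds.
    """
    # Imported here so that format_tagged stays usable without jieba installed.
    import jieba.posseg as pseg

    start = time.time()
    # Explicit encodings replace the sys.setdefaultencoding workaround.
    with open(src_path, "r", encoding="utf-8") as src:
        text = src.read()
    pairs = [(w.word, w.flag) for w in pseg.cut(text)]
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(format_tagged(pairs, sep))
    return time.time() - start


# Example call (requires jieba and the input file described in the article):
# elapsed = tag_file("t_with_splitter.txt", "t_with_POS_tag.txt")
# print("Segmentation and POS tagging finished in {:.2f} seconds.".format(elapsed))
```

Passing `sep=" "` puts a space between tokens, which may make the tagged output easier to read and to split again later.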