Python example: extracting words from text and counting word frequency

2020-01-04 13:44:53

Source: reposted

Contributed by: site user

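I use these text operations often, so I am summarizing them here. More will be added over time.

Operations:

strip_html(cls, text): remove HTML tags from the text

separate_words(cls, text, min_lenth=3): split the text into lowercase words, keeping only words longer than min_lenth characters

get_words_frequency(cls, words_list): count how often each word occurs and return the counts as a dict

Source code: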

import re


class DocProcess(object):

    @classmethod
    def strip_html(cls, text):
        """
        Delete html tags in text.
        text is String
        """
        new_text = " "
        is_html = False  # True while we are between "<" and ">"
        for character in text:
            if character == "<":
                is_html = True
            elif character == ">":
                is_html = False
                new_text += " "  # replace the whole tag with one space
            elif is_html is False:
                new_text += character
        return new_text

    @classmethod
    def separate_words(cls, text, min_lenth=3):
        """
        Separate text into words in list.
        Only words longer than min_lenth characters are kept.
        """
        # Split on runs of non-word characters (the pattern must be \W+, not //W+).
        splitter = re.compile(r"\W+")
        return [s.lower() for s in splitter.split(text) if len(s) > min_lenth]

    @classmethod
    def get_words_frequency(cls, words_list):
        """
        Get frequency of words in words_list.
        return a dict.
        """
        num_words = {}
        for word in words_list:
            num_words[word] = num_words.get(word, 0) + 1
        return num_words
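To show how the three operations chain together, here is a minimal, self-contained sketch. It is not the class above: the functions are hypothetical stand-ins written for this illustration. It assumes simple HTML where a regular expression is enough to strip tags, and it swaps the hand-written counting loop for collections.Counter from the standard library. The parameter is spelled min_length here, while the class above uses min_lenth.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the three operations, written as plain
# functions so that this block runs on its own.
TAG_RE = re.compile(r"<[^>]*>")  # assumption: no ">" inside attribute values
SPLIT_RE = re.compile(r"\W+")    # runs of non-word characters


def strip_html(text):
    """Replace each HTML tag with a single space."""
    return TAG_RE.sub(" ", text)


def separate_words(text, min_length=3):
    """Return lowercased words that are longer than min_length characters."""
    return [w.lower() for w in SPLIT_RE.split(text) if len(w) > min_length]


def get_words_frequency(words):
    """Count occurrences of each word. Counter is a dict subclass."""
    return Counter(words)


sample = "<p>Python makes text processing simple. Text processing with Python!</p>"
words = separate_words(strip_html(sample))
freq = get_words_frequency(words)

# The three most frequent words with their counts.
print(freq.most_common(3))
```

Because Counter is a dict subclass, the result can be used wherever the plain dict from get_words_frequency is expected, and most_common saves sorting the counts by hand. For real-world HTML with scripts, comments, or entities, a proper HTML parser would be a safer choice than either the character loop or this regular expression.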

