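I use these text operations a lot, so here is a summary. More will be added over time.
Operations:
strip_html(cls, text) removes HTML tags
separate_words(cls, text, min_lenth=3) splits text into lowercase words, keeping only words longer than min_lenth characters
get_words_frequency(cls, words_list) counts how often each word occurs
Source code: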
    import re


    class DocProcess(object):

        @classmethod
        def strip_html(cls, text):
            """
            Delete html tags in text.
            text is String
            """
            new_text = ""
            is_html = False  # True while scanning inside a <...> tag
            for character in text:
                if character == "<":
                    is_html = True
                elif character == ">":
                    is_html = False
                    # Replace each tag with a space so adjacent words stay apart.
                    new_text += " "
                elif is_html is False:
                    new_text += character
            return new_text

        @classmethod
        def separate_words(cls, text, min_lenth=3):
            """
            Separate text into words in list.
            """
            # Split on runs of non-word characters.
            splitter = re.compile(r"\W+")
            # Keep only words strictly longer than min_lenth, lowercased.
            return [s.lower() for s in splitter.split(text) if len(s) > min_lenth]

        @classmethod
        def get_words_frequency(cls, words_list):
            """
            Get frequency of words in words_list.
            return a dict.
            """
            num_words = {}
            for word in words_list:
                num_words[word] = num_words.get(word, 0) + 1
            return num_words
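For comparison, the same three steps (strip tags, split into words, count) can be sketched with only the standard library's `re` and `collections.Counter`. This is a minimal alternative, not the class above: the names `TAG_RE`, `WORD_SPLIT_RE`, and `word_frequencies` are made up for illustration, and the tag regex assumes simple markup with no `>` inside attribute values.

```python
import re
from collections import Counter

# Hypothetical helpers for this sketch; they are not part of DocProcess.
TAG_RE = re.compile(r"<[^>]*>")     # one HTML tag, assuming no ">" inside attributes
WORD_SPLIT_RE = re.compile(r"\W+")  # runs of non-word characters


def word_frequencies(html, min_lenth=3):
    """Strip tags, split into lowercase words, and count them."""
    # Replace each tag with a space so neighbouring words do not merge.
    text = TAG_RE.sub(" ", html)
    # Keep only words strictly longer than min_lenth, as separate_words does.
    words = [w.lower() for w in WORD_SPLIT_RE.split(text) if len(w) > min_lenth]
    # Counter is a dict subclass, so it can be used like the dict returned above.
    return Counter(words)


sample = "<p>Python makes text processing easy. <b>Python</b> is fun!</p>"
freq = word_frequencies(sample)
print(freq.most_common(3))
```

Because the default threshold is 3 and the comparison is strict, three-letter words such as "fun" are dropped. `Counter.most_common(n)` gives the top n words directly, which the plain dict version would need a sort for.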