Parsing and Building Data in Python

2020-01-04 16:11:30

Source: reposted

Contributed by: site user

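Python can parse the common data formats through a variety of libraries. A csv file stores tabular data as plain text, with fields separated by a delimiter character, usually a comma. xml (Extensible Markup Language) looks much like HTML but is mainly used to structure documents and data, and is commonly used to transfer data. json is a lightweight data-interchange format, more compact than xml yet comparably expressive; in essence it is a string in a specific format. Microsoft Excel is a spreadsheet application for data processing, statistical analysis, and decision support; its file formats are xls and xlsx. The sections below show how to parse and build each of these formats in Python with simple code, so that this mixed bag of data can be handled in one place.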

Parsing and building csv with Python

The csv module in the standard library provides the functions reader() and writer() for basic reading and writing of csv data.

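import csv

# Read every row of an existing csv file; each row is a list of strings.
with open('readtest.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        print(row)

# Write one row, then several rows at once.
with open('writetest.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['one', 'test'])
    writer.writerows([['some', 'iterable'], ['of', 'rows']])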

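reader() returns an iterator over the rows, and writer() writes one row with writerow() or several rows with writerows(). Both accept a dialect parameter that selects a set of formatting conventions (not a character encoding); the default is 'excel', which separates fields with commas. The delimiter parameter sets the single character used to separate fields and also defaults to a comma.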
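
The delimiter parameter described above can be illustrated with a minimal sketch. It assumes an in-memory io.StringIO buffer in place of a real file, and the sample rows are made up for illustration.

```python
import csv
import io

# An in-memory text buffer stands in for a file opened with newline=''.
buffer = io.StringIO(newline='')

# Write two rows, separating fields with a semicolon instead of a comma.
writer = csv.writer(buffer, delimiter=';')
writer.writerow(['name', 'note'])
writer.writerow(['python', 'a;b'])  # a field containing the delimiter gets quoted

written = buffer.getvalue()

# Read the rows back with the same delimiter.
buffer.seek(0)
rows = list(csv.reader(buffer, delimiter=';'))
print(rows)
```

The reader and the writer must agree on the delimiter; reading this data with the default comma would return each line as a single field.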

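In Python 3, the file object csvfile should be opened with newline='' so that line endings embedded in quoted fields are interpreted correctly on read and no extra carriage return is added on write. In Python 2, csvfile must instead be opened in binary mode ('b'); otherwise, on platforms such as Windows, certain bytes (0x1A) are treated as an end-of-file (EOF) marker and the file is read incompletely.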

In addition, the DictReader and DictWriter classes in the csv module read and write rows as dictionaries.

import csv

# Read rows as dictionaries keyed by the header line.
with open('readtest.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row['first_test'], row['last_test'])

# Write a header line followed by dictionary rows.
with open('writetest.csv', 'w', newline='') as csvfile:
    fieldnames = ['first_test', 'last_test']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({'first_test': 'hello', 'last_test': 'world'})
    writer.writerow({'first_test': 'Hello', 'last_test': 'World'})
    # Or write both rows in one call:
    # writer.writerows([{'first_test': 'hello', 'last_test': 'world'},
    #                   {'first_test': 'Hello', 'last_test': 'World'}])

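DictReader() yields each row as a dictionary (an OrderedDict in Python 3.6 and 3.7, a plain dict from Python 3.8), so fields can be accessed by key. The keys come from the fieldnames parameter, which defaults to the values in the first row read.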
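
As a small sketch of the fieldnames parameter, the example below reads header-less data and supplies the keys explicitly. The in-memory buffer and its two sample rows are assumptions made for illustration.

```python
import csv
import io

# Header-less csv data; an in-memory buffer stands in for a real file.
data = io.StringIO("hello,world\r\nHello,World\r\n", newline='')

# With fieldnames supplied, the first line is treated as data, not as a header.
reader = csv.DictReader(data, fieldnames=['first_test', 'last_test'])
rows = [dict(row) for row in reader]
print(rows)
```

If fieldnames were omitted here, the first line ('hello', 'world') would be consumed as the header and only one data row would remain.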

DictWriter() requires the fieldnames parameter to define the keys; writeheader() writes the key names as a header row, and writerow() or writerows() writes one or several rows of dictionary data.

Parsing and building xml with Python

The xml.etree.ElementTree module in the standard library reads and writes xml data using Element and ElementTree.

from xml.etree.ElementTree import Element, ElementTree

# Root node: <language name="python">
root = Element('language')
root.set('name', 'python')

# Four child nodes, each with its own text.
direction1 = Element('direction')
direction2 = Element('direction')
direction3 = Element('direction')
direction4 = Element('direction')
direction1.text = 'Web'
direction2.text = 'Spider'
direction3.text = 'BigData'
direction4.text = 'AI'

root.append(direction1)
root.append(direction2)
root.append(direction3)
root.append(direction4)
# Or add all four children in one call:
# root.extend([direction1, direction2, direction3, direction4])

# Wrap the root in a tree and write it to disk.
tree = ElementTree(root)
tree.write('xmltest.xml')

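When writing an xml file, Element() builds a node, set() sets an attribute and its value, append() adds a child node, extend() adds a group of child nodes from any iterable of elements (a plain list is enough; chain() from itertools is only needed to join several such iterables), the text attribute sets the text value, ElementTree() builds a tree from the root node, and write() writes the xml file.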

import xml.etree.ElementTree as ET

tree = ET.parse('xmltest.xml')
# Or build an empty tree and parse the file into it:
# from xml.etree.ElementTree import ElementTree
# tree = ElementTree()
# tree.parse('xmltest.xml')
root = tree.getroot()

tag = root.tag        # node name
attrib = root.attrib  # dictionary of attributes
text = root.text      # node text

direction1 = root.find('direction')        # first matching child
direction2 = root[1]                       # second child, by index
directions = root.findall('.//direction')  # all matches at any depth

for direction in root.findall('direction'):
    print(direction.text)
for direction in root.iter('direction'):
    print(direction.text)

root.remove(direction2)

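When reading an xml file, ET.parse() reads the file and returns a tree; alternatively, ElementTree() builds an empty tree and its parse() method reads the file into it (note that this parse() returns the root element, not the tree). getroot() returns the root node, whose children can be accessed by index. tag gives the node name, attrib the dictionary of attributes, and text the node text. find() returns the first node matching a name and findall() returns all matching nodes; given a bare tag name, both search only the direct children of the current node, but both also accept a limited xpath expression, such as './/direction', to reach deeper nodes. iter() creates a tree iterator that walks the current node and all of its descendants and yields every node matching the name. remove() removes the given child node.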
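
The difference in search depth between these methods shows up only when nodes are nested. The following sketch assumes a small made-up document in which one direction node sits inside a hypothetical group node.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one <direction> is a direct child, one is nested.
xml_text = """
<language name="python">
  <direction>Web</direction>
  <group>
    <direction>AI</direction>
  </group>
</language>
"""
root = ET.fromstring(xml_text)

first = root.find('direction').text                          # first direct child
direct = [d.text for d in root.findall('direction')]         # direct children only
deep = [d.text for d in root.findall('.//direction')]        # any depth, via xpath
iterated = [d.text for d in root.iter('direction')]          # any depth, via iterator

print(first, direct, deep, iterated)
```

With a bare tag name, findall() misses the nested node, while the './/direction' path and iter() both reach it.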

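In addition, xml data can be parsed and built with xml.sax and xml.dom.minidom. sax is event-driven; dom parses the xml data into a tree in memory and manipulates the xml through operations on that tree. ElementTree is a lightweight dom with a simple, efficient API: it is easy to use, fast, and light on memory, but its output is not pretty-printed and has to be formatted manually.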
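
One possible way to do that formatting, sketched here as an assumption rather than the article's own method, is to pass the ElementTree output through xml.dom.minidom and call toprettyxml(). On Python 3.9 and later, xml.etree.ElementTree.indent() is an alternative.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Build a small tree with ElementTree.
root = ET.Element('language', {'name': 'python'})
for label in ('Web', 'AI'):
    child = ET.SubElement(root, 'direction')
    child.text = label

# ElementTree serializes everything on one line.
compact = ET.tostring(root, encoding='unicode')

# Re-parse with minidom and emit an indented version.
pretty = minidom.parseString(compact).toprettyxml(indent='  ')
print(pretty)
```

This re-parses the whole document, so it trades some speed and memory for readable output; for large files that cost may matter.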

Parsing and building json with Python

The json module in the standard library provides the functions dumps() and loads() for basic reading and writing of json data.

>>> import json
>>> json.dumps(['foo', {'bar': ('baz', None, 1.0, 2)}])
'["foo", {"bar": ["baz", null, 1.0, 2]}]'
>>> json.loads('["foo", {"bar":["baz", null, 1.0, 2]}]')
['foo', {'bar': ['baz', None, 1.0, 2]}]

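json.dumps() serializes obj to a json-formatted str, and json.loads() does the reverse. dumps() takes ensure_ascii (default True), which controls whether non-ASCII characters are escaped; separators, for example separators=(',', ':'), which sets the item separator and the key separator; and sort_keys (default False), which controls whether dictionary keys are sorted in the output.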
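
The effect of these three parameters can be seen side by side in a short sketch; the sample dictionary is made up for illustration.

```python
import json

data = {'b': '中文', 'a': [1, 2]}

# Defaults: non-ASCII characters escaped, ', ' and ': ' separators, keys unsorted.
default = json.dumps(data)

# Keep non-ASCII characters, drop the spaces, and sort the keys.
compact = json.dumps(data, ensure_ascii=False, separators=(',', ':'), sort_keys=True)

print(default)
print(compact)
```

Both strings decode back to the same object with json.loads(); the parameters change only the textual form.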

In addition, the functions dump() and load() in the json module read and write json data through files.

import json

# Serialize an object to a file.
with open('jsontest.json', 'w') as jsonfile:
    json.dump(['foo', {'bar': ('baz', None, 1.0, 2)}], jsonfile)

# Deserialize it back from the file.
with open('jsontest.json') as jsonfile:
    data = json.load(jsonfile)
print(data)

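They do the same job as dumps() and loads() but with a different interface: they are combined with file operations and take a file object as an additional argument.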
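
A round trip through dump() and load() can be sketched without touching the disk, assuming an io.StringIO buffer as the file object. It also shows that a tuple comes back as a list, because json has only one array type.

```python
import io
import json

original = ['foo', {'bar': ('baz', None, 1.0, 2)}]

# Any object with write()/read() works as the file argument.
buffer = io.StringIO()
json.dump(original, buffer)

# Rewind, then read the data back.
buffer.seek(0)
restored = json.load(buffer)
print(restored)
```

Code that depends on tuples therefore needs to convert them back after loading.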

Parsing and building excel with Python

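Excel data is read and written with the third-party libraries xlwt and xlrd, installed through pip. Note that xlwt writes only the older xls format, and xlrd 2.0 and later reads only xls.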

import xlwt

wbook = xlwt.Workbook(encoding='utf-8')
wsheet = wbook.add_sheet('sheet1')
wsheet.write(0, 0, 'Hello World')
wbook.save('exceltest.xls')

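When writing excel data, xlwt.Workbook() creates a workbook, with the encoding parameter setting the character encoding; add_sheet() adds a worksheet; write() writes a value into the cell at the given row and column; and save() saves the workbook.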

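import xlrd

rbook = xlrd.open_workbook('exceltest.xls')
rsheet = rbook.sheets()[0]
# rsheet = rbook.sheet_by_index(0)
# rsheet = rbook.sheet_by_name('sheet1')

nr = rsheet.nrows               # number of rows
nc = rsheet.ncols               # number of columns
rv = rsheet.row_values(0)       # values of the first row
cv = rsheet.col_values(0)       # values of the first column
cell = rsheet.cell_value(0, 0)  # value of the cell at row 0, column 0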

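When reading excel data, xlrd.open_workbook() opens the workbook. A worksheet can be obtained in three ways: by list index on sheets(), by sheet index with sheet_by_index(), or by sheet name with sheet_by_name(). nrows gives the number of rows, ncols the number of columns, row_values() returns the list of values in a row, col_values() returns the list of values in a column, and cell_value() returns the value of the cell at the given row and column.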
