A Detailed Look at the Dynamics of DOM Methods in Python

2019-11-25 17:46:24

Source: reprinted

Contributed by: a site user

The Document Object Model

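The xml.dom module is probably the most powerful tool a Python programmer has for working with XML documents. Unfortunately, the documentation provided by the XML-SIG is still rather sparse. The language-neutral DOM specification from the W3C fills part of this gap, but Python programmers are better served by a quick-start guide to DOM that is specific to Python. This article aims to be such a guide. As in the previous column, some of the samples use the quotations.dtd file, which is available together with the archive of code samples for this article.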

It is worth understanding exactly what DOM is. The official explanation is quite good:

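"The Document Object Model is a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents. The document can be further processed, and the results of that processing can be incorporated back into the presented page." (World Wide Web Consortium DOM Working Group)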

DOM converts an XML document into a tree (or forest) representation. The World Wide Web Consortium (W3C) specification gives the DOM version of an HTML table as an example.

[Figure: DOM tree representation of an HTML table, from the W3C specification]

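As the figure above shows, DOM takes a more abstract view and defines a set of methods for traversing, pruning, reorganizing, outputting and otherwise manipulating the tree. Working this way is more convenient than working with the linear representation of an XML document.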
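To make the tree view concrete, here is a minimal sketch that prints a document as an indented outline. It assumes the standard library's xml.dom.minidom as a stand-in for the DOM implementation discussed in this article; the outline() helper and the sample table are made up for illustration.

```python
from xml.dom import minidom

# Hypothetical sample data: a small, well-formed HTML-style table
TABLE = ("<table><tbody>"
         "<tr><td>row 1, cell 1</td><td>row 1, cell 2</td></tr>"
         "<tr><td>row 2, cell 1</td><td>row 2, cell 2</td></tr>"
         "</tbody></table>")

def outline(node, depth=0):
    """Return one indented line per element or non-blank text node."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + "<" + node.tagName + ">")
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        lines.append("  " * depth + repr(node.data.strip()))
    # Recurse into the children; text nodes simply have none
    for child in node.childNodes:
        lines.extend(outline(child, depth + 1))
    return lines

doc = minidom.parseString(TABLE)
print("\n".join(outline(doc.documentElement)))
```

Each level of indentation corresponds to one level of the tree, so the nesting of table, tbody, tr and td is visible at a glance.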

Converting HTML to XML

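Valid HTML is almost, but not quite, valid XML. There are two main differences: XML tags are case-sensitive, and every XML tag requires an explicit close, either as a closing tag (which is optional for some HTML tags) or in the self-closing form, for example <img src="X.png" />. A simple example of using xml.dom is converting HTML to XML with the HtmlBuilder() class.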
try_dom1.py

"""Convert a valid HTML document to XML
  USAGE: python try_dom1.py < infile.html > outfile.xml
"""
import sys
from xml.dom import core
from xml.dom.html_builder import HtmlBuilder

# Construct an HtmlBuilder object and feed the data to it
b = HtmlBuilder()
b.feed(sys.stdin.read())

# Get the newly-constructed document object
doc = b.document

# Output it as XML
print doc.toxml()

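The HtmlBuilder() class readily implements the functions of the basic xml.dom.builder template it inherits from, and its source is worth studying. Even if we implemented the template functions ourselves, however, the outline of a DOM program would look similar. In the general case, we build a DOM instance by some means and then operate on that instance. The .toxml() method of a DOM instance is a simple way to produce a string representation of it (in the case above, we just print it once it has been generated).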
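The same build-then-serialize outline can be sketched with the standard library's xml.dom.minidom, which also provides a .toxml() method. This is an assumption-laden stand-in, not the HtmlBuilder() class used above: minidom parses only well-formed XML, not arbitrary HTML, and normalize_xml() is a hypothetical helper name.

```python
from xml.dom import minidom

def normalize_xml(xml_text):
    """Parse well-formed XML text and serialize its root element again."""
    doc = minidom.parseString(xml_text)   # build the DOM instance
    try:
        return doc.documentElement.toxml()  # string form of the tree
    finally:
        doc.unlink()  # release the tree once we are done with it

if __name__ == "__main__":
    print(normalize_xml('<p>An image: <img src="X.png"></img></p>'))
```

Note that the childless img element comes back in the self-closing form, since the serializer writes any element without children that way.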

Converting Python Objects to XML

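Python programmers can gain considerable power and generality by exporting arbitrary Python objects as XML instances. This lets us handle Python objects in the ways we are used to, and leaves us the choice of whether instance attributes end up as tags in the generated XML. In just a few lines (derived from the building.py example), we can convert "native" Python objects into DOM objects, recursing into any attributes that are themselves contained objects.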
try_dom2.py

"""Build a DOM instance from scratch, write it to XML
  USAGE: python try_dom2.py > outfile.xml
"""
import types
from xml.dom import core
from xml.dom.builder import Builder

# Recursive function to build DOM instance from Python instance
def object_convert(builder, inst):
    # Put entire object inside an elem w/ same name as the class.
    builder.startElement(inst.__class__.__name__)
    for attr in inst.__dict__.keys():
        if attr[0] == '_':    # Skip internal attributes
            continue
        value = getattr(inst, attr)
        if type(value) == types.InstanceType:
            # Recursively process subobjects
            object_convert(builder, value)
        else:
            # Convert anything else to string, put it in an element
            builder.startElement(attr)
            builder.text(str(value))
            builder.endElement(attr)
    builder.endElement(inst.__class__.__name__)

if __name__ == '__main__':
    # Create container classes
    class quotations: pass
    class quotation: pass

    # Create an instance, fill it with hierarchy of attributes
    inst = quotations()
    inst.title = "Quotations file (not quotations.dtd conformant)"
    inst.quot1 = quot1 = quotation()
    quot1.text = """'"is not a quine" is not a quine' is a quine"""
    quot1.source = "Joshua Shagam, kuro5hin.org"
    inst.quot2 = quot2 = quotation()
    quot2.text = "Python is not a democracy. Voting doesn't help. " + \
                 "Crying may..."
    quot2.source = "Guido van Rossum, comp.lang.python"

    # Create the DOM Builder
    builder = Builder()
    object_convert(builder, inst)
    print builder.document.toxml()

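The function object_convert() has some limitations. For example, the procedure above cannot produce an XML document that conforms to quotations.dtd: #PCDATA text cannot be placed directly inside a quotation class, only inside one of its attributes (such as .text). A simple workaround is to have object_convert() treat an attribute with a special name, such as .PCDATA, in a special way. The conversion to DOM could be made cleverer in various ways, but the beauty of the approach is that we can start with whole Python objects and convert them to XML documents in a straightforward manner.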
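The .PCDATA workaround might look like the following sketch. It is written against the standard library's xml.dom.minidom rather than the Builder class used above, so the function name object_to_element() and the sample objects are assumptions for illustration, and the output is not claimed to conform to quotations.dtd.

```python
from xml.dom import minidom

def object_to_element(doc, inst):
    """Build an element named after inst's class from its public attributes."""
    elem = doc.createElement(type(inst).__name__)
    for name, value in vars(inst).items():
        if name.startswith("_"):
            continue  # skip internal attributes
        if name == "PCDATA":
            # Special case: the text goes directly inside the element
            elem.appendChild(doc.createTextNode(str(value)))
        elif hasattr(value, "__dict__"):
            # Recursively convert sub-objects
            elem.appendChild(object_to_element(doc, value))
        else:
            # Anything else becomes <name>text</name>
            child = doc.createElement(name)
            child.appendChild(doc.createTextNode(str(value)))
            elem.appendChild(child)
    return elem

# Hypothetical container classes and data
class quotations: pass
class quotation: pass

inst = quotations()
inst.quot1 = quotation()
inst.quot1.source = "Guido van Rossum, comp.lang.python"
inst.quot1.PCDATA = "Python is not a democracy."

doc = minidom.Document()
doc.appendChild(object_to_element(doc, inst))
print(doc.documentElement.toxml())
```

Reserving one attribute name keeps the converter generic. The cost is that an object cannot use PCDATA as an ordinary attribute name.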

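It is also worth noting that elements at the same level of the generated XML document appear in no obvious order. For example, on the author's system, with a particular version of Python, the second quotation defined in the source appears first in the output. This order can change between versions and systems. The attributes of a Python object are not kept in a fixed order, so this behavior makes sense. It is what we would want for data related to a database system, but clearly not for an article marked up as XML (unless we want to update William Burroughs' "cut-up" method).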

Converting XML Documents to Python Objects

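Generating a Python object from an XML document is just as easy as the reverse process. In most cases the xml.dom methods are all we need. In some cases, however, it is better to handle objects generated from XML documents with the same techniques we use for all "generic" Python objects. In the following code, for example, the function pyobj_printer() might be one we already use for handling arbitrary Python objects.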
try_dom3.py

"""Read in a DOM instance, convert it to a Python object"""
from xml.dom.utils import FileReader

class PyObject: pass

def pyobj_printer(py_obj, level=0):
    """Return a "deep" string description of a Python object"""
    from string import join, split
    import types
    descript = ''
    for membname in dir(py_obj):
        member = getattr(py_obj, membname)
        if type(member) == types.InstanceType:
            descript = descript + (' '*level) + '{'+membname+'}\n'
            descript = descript + pyobj_printer(member, level+3)
        elif type(member) == types.ListType:
            descript = descript + (' '*level) + '['+membname+']\n'
            for i in range(len(member)):
                descript = descript+(' '*level)+str(i+1)+': '+ \
                           pyobj_printer(member[i], level+3)
        else:
            descript = descript + membname+'='
            descript = descript + join(split(str(member)[:50]))+'...\n'
    return descript

def pyobj_from_dom(dom_node):
    """Converts a DOM tree to a "native" Python object"""
    py_obj = PyObject()
    py_obj.PCDATA = ''
    for node in dom_node.get_childNodes():
        if node.name == '#text':
            py_obj.PCDATA = py_obj.PCDATA + node.value
        elif hasattr(py_obj, node.name):
            getattr(py_obj, node.name).append(pyobj_from_dom(node))
        else:
            setattr(py_obj, node.name, [pyobj_from_dom(node)])
    return py_obj

# Main test
dom_obj = FileReader("quotes.xml").document
py_obj = pyobj_from_dom(dom_obj)
if __name__ == "__main__":
    print pyobj_printer(py_obj)

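The focus here should be on the function pyobj_from_dom(), and in particular on the xml.dom method .get_childNodes(), which does the real work. In pyobj_from_dom() we extract all the text that sits directly between tags and put it in the reserved attribute .PCDATA. For each nested tag we encounter, we create a new attribute whose name matches the tag and assign a list to it, so that it can hold tags that occur several times within the parent block. Using a list also preserves the order in which the tags were encountered in the XML document.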
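The same conversion can be sketched with the standard library's xml.dom.minidom, where the child list is the .childNodes attribute and node kinds are tested with .nodeType. The choice of minidom and the inline sample document are assumptions for illustration; the logic mirrors pyobj_from_dom() above.

```python
from xml.dom import minidom

class PyObject:
    pass

def pyobj_from_dom(dom_node):
    """Convert a DOM tree to a "native" Python object."""
    py_obj = PyObject()
    py_obj.PCDATA = ""
    for node in dom_node.childNodes:
        if node.nodeType == node.TEXT_NODE:
            # Text directly inside this tag accumulates in .PCDATA
            py_obj.PCDATA += node.data
        elif node.nodeType == node.ELEMENT_NODE:
            if hasattr(py_obj, node.tagName):
                # Repeated tag: append to the existing list
                getattr(py_obj, node.tagName).append(pyobj_from_dom(node))
            else:
                # First occurrence: start a new list
                setattr(py_obj, node.tagName, [pyobj_from_dom(node)])
    return py_obj

# Hypothetical sample document
XML = ("<quotations>"
       "<quotation>First <source>A. Author</source></quotation>"
       "<quotation>Second <source>B. Author</source></quotation>"
       "</quotations>")

py_obj = pyobj_from_dom(minidom.parseString(XML))
print(py_obj.quotations[0].quotation[1].source[0].PCDATA)
```

Because every nested tag becomes a list, even a tag that occurs only once is reached with an index such as [0].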

Besides using our old generic pyobj_printer() function (or a more complex and robust one), we can access the elements of py_obj with normal attribute notation.
Python interactive session

>>> from try_dom3 import *
>>> py_obj.quotations[0].quotation[3].source[0].PCDATA
'Guido van Rossum, '

Rearranging a DOM Tree

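A great advantage of DOM is that it lets programmers operate on an XML document in a nonlinear fashion. Each block enclosed in matching open/close tags is simply a "node" in the DOM tree. Although nodes are maintained in a list-like fashion that preserves order information, there is nothing special or immutable about that order. We can easily snip off a node and graft it onto another position in the DOM tree (even at a different level, if the DTD allows it), or add new nodes, delete existing ones, and so on.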
try_dom4.py

"""Manipulate the arrangement of nodes in a DOM object"""
from try_dom3 import *

#-- Var 'doc' will hold the single <quotations> "trunk"
doc = dom_obj.get_childNodes()[0]

#-- Pull off all the nodes into a Python list
# (each node is a <quotation> block, or a whitespace text node)
nodes = []
while 1:
    try: node = doc.removeChild(doc.get_childNodes()[0])
    except: break
    nodes.append(node)

#-- Reverse the order of the quotations using a list method
# (we could also perform more complicated operations on the list:
# delete elements, add new ones, sort on complex criteria, etc.)
nodes.reverse()

#-- Fill 'doc' back up with our rearranged nodes
for node in nodes:
    # if second arg is None, insert is to end of list
    doc.insertBefore(node, None)

#-- Output the manipulated DOM
print dom_obj.toxml()

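If we treated the XML document as just a text file, or used a sequence-oriented module (such as xmllib or xml.sax), rearranging the quotation nodes as the few lines above do would be a problem requiring real thought. With DOM, the problem is as simple as any other operation we might perform on a Python list.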
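For comparison, the same rearrangement can be sketched with the standard library's xml.dom.minidom, which offers removeChild() and insertBefore() with the same meaning. The choice of minidom and the inline sample document are assumptions for illustration.

```python
from xml.dom import minidom

# Hypothetical sample document
XML = ("<quotations>"
       "<quotation>one</quotation>"
       "<quotation>two</quotation>"
       "<quotation>three</quotation>"
       "</quotations>")

dom_obj = minidom.parseString(XML)
root = dom_obj.documentElement   # the single <quotations> "trunk"

# Pull all child nodes off the trunk into a Python list
nodes = []
while root.firstChild is not None:
    nodes.append(root.removeChild(root.firstChild))

# Any list operation works here; we simply reverse the order
nodes.reverse()

# Re-attach the nodes; a reference child of None appends at the end
for node in nodes:
    root.insertBefore(node, None)

print(root.toxml())
```

Testing .firstChild for None replaces the bare try/except loop of try_dom4.py, so unrelated errors are not silently swallowed.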
