Building a Quick and Dirty XML Parser

2019-11-18 16:23:24


Contributed by: a site user

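XML is a popular data format today. Its strengths are that it is human-readable, self-describing, and easy to use. Unfortunately, Java-based XML parsers tend to be large: Sun's jaxp.jar and parser.jar, for example, are each about 1.4 MB. If your program has to run where memory is limited, such as a J2ME environment, or where bandwidth is limited, such as an applet, these large packages are not a good choice.
Note: all the code needed for this article can be downloaded from:
http://www.matrix.org.cn/down_view.asp?id=67
The code for QDParser follows: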
package qdxml;
import java.io.*;
import java.util.*;

/** Quick and Dirty XML parser.  This parser is, like the SAX parser,
    an event based parser, but with much less functionality.  */
public class QDParser {
  private static int popMode(Stack st) {
    if(!st.empty())
      return ((Integer)st.pop()).intValue();
    else
      return PRE;
  }
  private final static int
    TEXT = 1,
    ENTITY = 2,
    OPEN_TAG = 3,
    CLOSE_TAG = 4,
    START_TAG = 5,
    ATTRIBUTE_LVALUE = 6,
    ATTRIBUTE_EQUAL = 9,
    ATTRIBUTE_RVALUE = 10,
    QUOTE = 7,
    IN_TAG = 8,
    SINGLE_TAG = 12,
    COMMENT = 13,
    DONE = 11,
    DOCTYPE = 14,
    PRE = 15,
    CDATA = 16;
  public static void parse(DocHandler doc,Reader r) throws Exception {
    Stack st = new Stack();
    int depth = 0;
    int mode = PRE;
    int c = 0;
    int quotec = '"';
    depth = 0;
    StringBuffer sb = new StringBuffer();
    StringBuffer etag = new StringBuffer();
    String tagName = null;
    String lvalue = null;
    String rvalue = null;
    Hashtable attrs = null;
    st = new Stack();
    doc.startDocument();
    int line=1, col=0;
    boolean eol = false;
    while((c = r.read()) != -1) {

      // We need to map \r, \r\n, and \n to \n
      // See XML spec section 2.11
      if(c == '\n' && eol) {
        eol = false;
        continue;
      } else if(eol) {
        eol = false;
      } else if(c == '\n') {
        line++;
        col=0;
      } else if(c == '\r') {
        eol = true;
        c = '\n';
        line++;
        col=0;
      } else {
        col++;
      }

      if(mode == DONE) {
        doc.endDocument();
        return;

      // We are between tags collecting text.
      } else if(mode == TEXT) {
        if(c == '<') {
          st.push(new Integer(mode));
          mode = START_TAG;
          if(sb.length() > 0) {
            doc.text(sb.toString());
            sb.setLength(0);
          }
        } else if(c == '&') {
          st.push(new Integer(mode));
          mode = ENTITY;
          etag.setLength(0);
        } else
          sb.append((char)c);

      // we are processing a closing tag: e.g. </foo>
      } else if(mode == CLOSE_TAG) {
        if(c == '>') {
          mode = popMode(st);
          tagName = sb.toString();
          sb.setLength(0);
          depth--;
          if(depth==0)
            mode = DONE;
          doc.endElement(tagName);
        } else {
          sb.append((char)c);
        }

      // we are processing CDATA
      } else if(mode == CDATA) {
        if(c == '>'
        && sb.toString().endsWith("]]")) {
          sb.setLength(sb.length()-2);
          doc.text(sb.toString());
          sb.setLength(0);
          mode = popMode(st);
        } else
          sb.append((char)c);

      // we are processing a comment.  We are inside
      // the <!-- .... --> looking for the -->.
      } else if(mode == COMMENT) {
        if(c == '>'
        && sb.toString().endsWith("--")) {
          sb.setLength(0);
          mode = popMode(st);
        } else
          sb.append((char)c);

      // We are outside the root tag element
      } else if(mode == PRE) {
        if(c == '<') {
          mode = TEXT;
          st.push(new Integer(mode));
          mode = START_TAG;
        }

      // We are inside one of these <? ... ?>
      // or one of these <!DOCTYPE ... >
      } else if(mode == DOCTYPE) {
        if(c == '>') {
          mode = popMode(st);
          if(mode == TEXT) mode = PRE;
        }

      // we have just seen a < and
      // are wondering what we are looking at
      // <foo>, </foo>, <!-- ... -->, etc.
      } else if(mode == START_TAG) {
        mode = popMode(st);
        if(c == '/') {
          st.push(new Integer(mode));
          mode = CLOSE_TAG;
        } else if (c == '?') {
          mode = DOCTYPE;
        } else {
          st.push(new Integer(mode));
          mode = OPEN_TAG;
          tagName = null;
          attrs = new Hashtable();
          sb.append((char)c);
        }

      // we are processing an entity, e.g. &lt;, &raquo;, etc.
      } else if(mode == ENTITY) {
        if(c == ';') {
          mode = popMode(st);
          String cent = etag.toString();
          etag.setLength(0);
          if(cent.equals("lt"))
            sb.append('<');
          else if(cent.equals("gt"))
            sb.append('>');
          else if(cent.equals("amp"))
            sb.append('&');
          else if(cent.equals("quot"))
            sb.append('"');
          else if(cent.equals("apos"))
            sb.append('\'');
          // Could parse hex entities if we wanted to
          //else if(cent.startsWith("#x"))
            //sb.append((char)Integer.parseInt(cent.substring(2),16));
          else if(cent.startsWith("#"))
            sb.append((char)Integer.parseInt(cent.substring(1)));
          // Insert custom entity definitions here
          else
            exc("Unknown entity: &"+cent+";",line,col);
        } else {
          etag.append((char)c);
        }

      // we have just seen something like this:
      // <foo a="b"/
      // and are looking for the final >.
      } else if(mode == SINGLE_TAG) {
        if(tagName == null)
          tagName = sb.toString();
        if(c != '>')
          exc("Expected > for tag: <"+tagName+"/>",line,col);
        doc.startElement(tagName,attrs);
        doc.endElement(tagName);
        if(depth==0) {
          doc.endDocument();
          return;
        }
        sb.setLength(0);
        attrs = new Hashtable();
        tagName = null;
        mode = popMode(st);

      // we are processing something
      // like this <foo ... >.  It could
      // still be a <!-- ... --> or something.
      } else if(mode == OPEN_TAG) {
        if(c == '>') {
          if(tagName == null)
            tagName = sb.toString();
          sb.setLength(0);
          depth++;
          doc.startElement(tagName,attrs);
          tagName = null;
          attrs = new Hashtable();
          mode = popMode(st);
        } else if(c == '/') {
          mode = SINGLE_TAG;
        } else if(c == '-' && sb.toString().equals("!-")) {
          mode = COMMENT;
        } else if(c == '[' && sb.toString().equals("![CDATA")) {
          mode = CDATA;
          sb.setLength(0);
        } else if(c == 'E' && sb.toString().equals("!DOCTYP")) {
          sb.setLength(0);
          mode = DOCTYPE;
        } else if(Character.isWhitespace((char)c)) {
          tagName = sb.toString();
          sb.setLength(0);
          mode = IN_TAG;
        } else {
          sb.append((char)c);
        }

      // We are processing the quoted right-hand side
      // of an element's attribute.
      } else if(mode == QUOTE) {
        if(c == quotec) {
          rvalue = sb.toString();
          sb.setLength(0);
          attrs.put(lvalue,rvalue);
          mode = IN_TAG;
        // See the XML spec, section 3.3.3
        // on normalization processing.
        } else if(" \r\n\u0009".indexOf(c)>=0) {
          sb.append(' ');
        } else if(c == '&') {
          st.push(new Integer(mode));
          mode = ENTITY;
          etag.setLength(0);
        } else {
          sb.append((char)c);
        }

      } else if(mode == ATTRIBUTE_RVALUE) {
        if(c == '"' || c == '\'') {
          quotec = c;
          mode = QUOTE;
        } else if(Character.isWhitespace((char)c)) {
          ;
        } else {
          exc("Error in attribute processing",line,col);
        }

      } else if(mode == ATTRIBUTE_LVALUE) {
        if(Character.isWhitespace((char)c)) {
          lvalue = sb.toString();
          sb.setLength(0);
          mode = ATTRIBUTE_EQUAL;
        } else if(c == '=') {
          lvalue = sb.toString();
          sb.setLength(0);
          mode = ATTRIBUTE_RVALUE;
        } else {
          sb.append((char)c);
        }

      } else if(mode == ATTRIBUTE_EQUAL) {
        if(c == '=') {
          mode = ATTRIBUTE_RVALUE;
        } else if(Character.isWhitespace((char)c)) {
          ;
        } else {
          exc("Error in attribute processing.",line,col);
        }

      } else if(mode == IN_TAG) {
        if(c == '>') {
          mode = popMode(st);
          doc.startElement(tagName,attrs);
          depth++;
          tagName = null;
          attrs = new Hashtable();
        } else if(c == '/') {
          mode = SINGLE_TAG;
        } else if(Character.isWhitespace((char)c)) {
          ;
        } else {
          mode = ATTRIBUTE_LVALUE;
          sb.append((char)c);
        }
      }
    }
    if(mode == DONE)
      doc.endDocument();
    else
      exc("missing end tag",line,col);
  }
  private static void exc(String s,int line,int col)
    throws Exception
  {
    throw new Exception(s+" near line "+line+", column "+col);
  }
}
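Why not use SAX?
You could implement a SAX interface with limited functionality, throwing a NotImplemented exception whenever you hit something you don't need.
That would certainly give you classes smaller than jaxp.jar and parser.jar. But you can get smaller still by defining your own classes. In fact, the classes defined here are much smaller than the SAX interfaces themselves.
Our quick and dirty XML parser resembles SAX in some ways. Like a SAX parser, it lets you implement an interface to catch and handle events for attributes and for start and end tags. If you have used SAX before, it will feel familiar.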

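Limiting XML functionality
Many people like XML as a simple, self-describing, text-based data format. They want an easy way to get at the elements, the attributes, and the attribute values. With that in mind, let's consider which features we really need.
Our simple parser has just one class, QDParser, and one interface, DocHandler. QDParser has a single public static method, parse(DocHandler, Reader), which we implement as a finite state machine.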
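The DocHandler interface itself is not reproduced in this article. Judging only from the calls that parse() makes (startDocument, endDocument, startElement, endElement, text), it plausibly looks like the sketch below. The signatures are inferred from the parser listing, not copied from the downloadable source, and the names RecordingHandler and DocHandlerSketch are hypothetical.

```java
import java.util.Enumeration;
import java.util.Hashtable;

// Inferred from the calls QDParser.parse() makes; the real interface
// ships with the downloadable source and may differ in detail.
interface DocHandler {
  void startDocument() throws Exception;
  void endDocument() throws Exception;
  void startElement(String tag, Hashtable attrs) throws Exception;
  void endElement(String tag) throws Exception;
  void text(String str) throws Exception;
}

// Hypothetical handler that records every event as one line of text.
class RecordingHandler implements DocHandler {
  final StringBuffer log = new StringBuffer();

  public void startDocument() { log.append("start document\n"); }
  public void endDocument() { log.append("end document\n"); }

  public void startElement(String tag, Hashtable attrs) {
    log.append("start ").append(tag);
    // Attributes arrive as plain name/value pairs.
    for (Enumeration e = attrs.keys(); e.hasMoreElements();) {
      Object name = e.nextElement();
      log.append(' ').append(name).append('=').append(attrs.get(name));
    }
    log.append('\n');
  }

  public void endElement(String tag) {
    log.append("end ").append(tag).append('\n');
  }

  public void text(String str) {
    log.append("text ").append(str).append('\n');
  }
}

class DocHandlerSketch {
  // Drives the handler by hand, in the order the parser would report
  // events for <greeting lang="en">hi</greeting>.
  static String demo() {
    RecordingHandler h = new RecordingHandler();
    Hashtable<String, String> attrs = new Hashtable<String, String>();
    attrs.put("lang", "en");
    h.startDocument();
    h.startElement("greeting", attrs);
    h.text("hi");
    h.endElement("greeting");
    h.endDocument();
    return h.log.toString();
  }
}
```

A handler like this is handy for testing, because the recorded log can be compared against the events you expect for a given document.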
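Our simple parser treats the DTD <!DOCTYPE> and <?xml version="1.0"?> as nothing more than comments, so they cause no confusion, and their contents are of no use to us.
Because we don't process DOCTYPE, our parser cannot read custom entity definitions. Only the standard ones are available: &amp;, &lt;, &gt;, &apos;, and &quot;. If those are not enough, you can insert code to add your own definitions, or you can preprocess your XML file before handing it to QDParser.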
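One way to add your own entity definitions is to pull the entity lookup into a helper that consults an application-supplied table. The sketch below mirrors the ENTITY branch of the listing and additionally enables the hex form that the listing leaves commented out. The names EntitySketch, decode, and custom are hypothetical and not part of QDParser, and the cast to char limits it to characters in the Basic Multilingual Plane.

```java
import java.util.Hashtable;

class EntitySketch {
  // Returns the replacement text for an entity name (the part between
  // '&' and ';'), or null if the name is unknown or malformed.
  static String decode(String name, Hashtable<String, String> custom) {
    if (name.equals("lt")) return "<";
    if (name.equals("gt")) return ">";
    if (name.equals("amp")) return "&";
    if (name.equals("quot")) return "\"";
    if (name.equals("apos")) return "'";
    try {
      // Hex character reference, e.g. &#x41;
      if (name.startsWith("#x"))
        return String.valueOf((char) Integer.parseInt(name.substring(2), 16));
      // Decimal character reference, e.g. &#65;
      if (name.startsWith("#"))
        return String.valueOf((char) Integer.parseInt(name.substring(1)));
    } catch (NumberFormatException e) {
      return null; // malformed numeric reference
    }
    // Application-defined entities, e.g. "nbsp" mapped to a
    // non-breaking space by the caller.
    return custom == null ? null : custom.get(name);
  }
}
```

Returning null for an unknown name leaves the choice to the caller, which could raise the same "Unknown entity" error the parser raises today.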
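Our simple parser does not support conditional sections either, for example <![INCLUDE[ ... ]]> or <![IGNORE[ ... ]]>. Since we cannot define entities through DOCTYPE, this feature would be of little use to us anyway. Any conditional sections can be resolved before the data is sent to our limited-capacity device.
Because our parser does not process attribute declarations, the XML specification requires us to treat every attribute as type CDATA. That lets us use java.util.Hashtable instead of org.xml.sax.AttributeList to hold an element's attributes. The Hashtable carries only name/value pairs; we don't need getType(), since it would return CDATA in every case.
The lack of attribute declarations has other consequences. The parser does not supply default attribute values, for instance. Nor can we have whitespace reduced automatically by declaring an attribute as NMTOKENS. These things can, however, be taken care of when we prepare or generate the XML file, so the extra code can live outside the program that uses our parser.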
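If an application does need the whitespace handling that a declared NMTOKENS attribute would get, it can apply it itself, either when generating the file or on the values it reads back from the Hashtable. A minimal sketch, assuming the value has already had tabs and line breaks turned into spaces (the QUOTE branch of the listing does this); the names AttrWhitespace and collapse are hypothetical:

```java
class AttrWhitespace {
  // Drops leading and trailing spaces and collapses each run of
  // spaces into a single space.
  static String collapse(String value) {
    StringBuffer out = new StringBuffer();
    boolean pendingSpace = false;
    for (int i = 0; i < value.length(); i++) {
      char ch = value.charAt(i);
      if (ch == ' ') {
        // Remember the space; it is emitted only if more text follows
        // and some text has already been written.
        pendingSpace = out.length() > 0;
      } else {
        if (pendingSpace) out.append(' ');
        pendingSpace = false;
        out.append(ch);
      }
    }
    return out.toString();
  }
}
```

For example, collapse("  a   b  ") yields "a b".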
In fact, all of the missing features can be compensated for when preparing the XML file. That way, much of the work our parser gives up is shifted to the document preparation step.

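Parser features
Having discussed so much that our parser cannot do, what can it do?
1. It recognizes the start and end tags of all elements.
2. It lists all attributes.
3. It recognizes the <![CDATA[ ... ]]> construct.
4. It recognizes the standard entities &amp;, &lt;, &gt;, &quot;, and &apos;, as well as decimal numeric entities (hex support is present in the listing but commented out).
5. It maps \r\n and \r to \n on input, treating each as a single line ending, in accordance with section 2.11 of the XML specification.
The parser does only very limited error checking. It throws an exception when it encounters bad syntax, such as an entity it does not recognize.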
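The line-ending rule from section 2.11 can be shown in isolation. The sketch below works on a String rather than a Reader to stay short; it applies the same idea as the parser's eol flag but is not a copy of that code, and the names NewlineSketch and normalize are hypothetical.

```java
class NewlineSketch {
  // Maps \r\n and a lone \r to \n, leaving existing \n untouched.
  static String normalize(String in) {
    StringBuffer out = new StringBuffer(in.length());
    boolean afterCr = false;
    for (int i = 0; i < in.length(); i++) {
      char ch = in.charAt(i);
      if (ch == '\n' && afterCr) {
        // Second half of \r\n: the line break was already emitted.
        afterCr = false;
        continue;
      }
      afterCr = (ch == '\r');
      out.append(afterCr ? '\n' : ch);
    }
    return out.toString();
  }
}
```

With this rule, "a\r\nb\rc\nd" becomes "a\nb\nc\nd", so a handler sees the same text no matter which platform produced the file.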

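How to use the parser
Using our quick and dirty parser is simple. First implement the DocHandler interface; then you can parse an XML file:
    DocHandler doc = new MyDocHandler();
    QDParser.parse(doc,new FileReader("config.xml"));
The source code includes two examples that fully implement the DocHandler interface. The first, Reporter, simply prints what it reads; you can test it with the sample XML file, config.xml.
The second example, Conf, is a little more complex: it updates data already resident in memory. Conf uses java.lang.reflect to locate the fields and objects named in config.xml. When you run it, it tells you which objects are being updated and how. If asked to update a field that does not exist, it reports an error.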
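The Conf source is not shown here, so the following is only a sketch of the general technique it is described as using: looking a field up by name with java.lang.reflect and reporting an error when it does not exist. The Settings class, its fields, and the update method are all hypothetical, and the sketch handles only public static int and String fields.

```java
import java.lang.reflect.Field;

// Hypothetical settings holder; the real Conf example targets
// whatever classes and fields config.xml names.
class Settings {
  public static int port = 80;
  public static String host = "localhost";
}

class ReflectSketch {
  // Sets a public static int or String field by name.
  // Returns an error message, or null on success.
  static String update(Class<?> target, String field, String value) {
    try {
      Field f = target.getField(field);
      if (f.getType() == int.class)
        f.setInt(null, Integer.parseInt(value));
      else if (f.getType() == String.class)
        f.set(null, value);
      else
        return "unsupported type for field " + field;
      return null;
    } catch (NoSuchFieldException e) {
      return "no such field: " + field;
    } catch (IllegalAccessException e) {
      return "cannot access field: " + field;
    } catch (NumberFormatException e) {
      return "bad number for " + field + ": " + value;
    }
  }
}
```

A DocHandler could call such a helper from startElement, taking the field name and the new value from the element's attributes.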

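Modifying the package
You can modify this class to suit your own needs. You can add your own entity definitions at line 180 of QDParser.java, where the listing carries the comment "Insert custom entity definitions here".
You can also add the features I left out to the parser's finite state machine. That is fairly easy to do, because there is so little existing code.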
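As an illustration of what adding a state involves, the standalone sketch below uses the same integer-mode style as the listing. The listing sends <? ... ?> into DOCTYPE mode, which ends at the first >. This sketch instead gives processing instructions their own pair of states, so that only ?> ends them. It is not a patch to QDParser; the class, method, and mode names are hypothetical.

```java
class PiStateSketch {
  private static final int TEXT = 1, START = 2, PI = 3, PI_END = 4;

  // Copies the input, dropping every <? ... ?> processing instruction.
  static String stripProcessingInstructions(String in) {
    StringBuffer out = new StringBuffer();
    int mode = TEXT;
    for (int i = 0; i < in.length(); i++) {
      char c = in.charAt(i);
      if (mode == TEXT) {
        if (c == '<') mode = START;
        else out.append(c);
      } else if (mode == START) {
        // Just saw '<': decide what kind of markup this is.
        if (c == '?') {
          mode = PI;
        } else if (c == '<') {
          out.append('<');      // keep the earlier '<', stay in START
        } else {
          out.append('<').append(c);
          mode = TEXT;
        }
      } else if (mode == PI) {
        if (c == '?') mode = PI_END;
      } else {
        // PI_END: saw '?', so only '>' finishes the instruction.
        if (c == '>') mode = TEXT;
        else if (c != '?') mode = PI;
      }
    }
    if (mode == START) out.append('<'); // input ended right after '<'
    return out.toString();
  }
}
```

In QDParser the equivalent change would mean a new mode constant, a new else-if branch in the main loop, and a transition into it from START_TAG.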

           Keep it small
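QDParser takes up only about 3 KB once compiled or packed into a jar file. The source is only about 300 lines, comments included. That makes it practical for many small-capacity devices while still staying compatible with the XML standard and providing basic functionality.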

Translated and published by Matrix Open Source Technology with authorization from JavaWorld.
Translator: chris

(Source: http://m.survivalescaperooms.com)
