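This is a classic library that helps you generate valid XHTML from HTML. It also supports tag and attribute filtering: you specify which tags and attributes are allowed in the output, and everything else is filtered out. You can use the library to clean up the bloated HTML that Microsoft Word produces when a document is converted to HTML. You can also use it to clean HTML before posting to a blog site; otherwise blog engines such as WordPress and b2evolution will reject the post.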

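The library contains two classes: HtmlReader and HtmlWriter.
HtmlReader extends the well-known SgmlReader developed by Chris Lovett. While reading HTML, it skips every node that has a prefix. As a result, the hundreds of useless tags such as <o:p>, <o:Document>, and <st1:personname> are filtered out, and the HTML you read contains only core HTML tags.
HtmlWriter extends XmlTextWriter, a standard XmlWriter implementation, so it generates XML. XHTML is essentially HTML in XML form. The familiar tags that are left unclosed in HTML, such as <img>, <br>, and <hr>, must be written as empty elements in XHTML: <img .. />, <br/>, and <hr/>. Because XHTML is well-formed XML, you can read an XHTML document with any XML parser, which also makes it possible to search the document with XPath.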
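To illustrate the XPath point, here is a minimal sketch. It assumes the cleaned output is a well-formed XHTML fragment with no default namespace (HtmlReader drops prefixed nodes so that none is needed). The XhtmlQuery class and ImageSources method are hypothetical names used only for this illustration; XmlDocument and SelectNodes are standard System.Xml APIs.

```csharp
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper, not part of the library described in this article.
public static class XhtmlQuery
{
    // Returns the src attribute of every <img> element found in a
    // well-formed XHTML fragment that declares no default namespace.
    public static List<string> ImageSources(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well formed

        var sources = new List<string>();
        foreach (XmlNode attribute in doc.SelectNodes("//img/@src"))
        {
            sources.Add(attribute.Value);
        }
        return sources;
    }
}
```

Passing a fragment such as `<div><img src="logo.png" /></div>` should yield a list containing `logo.png`. The same call fails on raw HTML with an unclosed `<img>`, which is why the conversion to XHTML has to happen first.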
HtmlReader is simple. Here is the complete class:
```csharp
using System.IO;
using System.Xml;

/// <summary>
/// This class skips all nodes that have some
/// kind of prefix. This trick does the job
/// of cleaning up MS Word/Outlook HTML markup.
/// </summary>
public class HtmlReader : Sgml.SgmlReader
{
    public HtmlReader(TextReader reader) : base()
    {
        base.InputStream = reader;
        base.DocType = "HTML";
    }

    public HtmlReader(string content) : base()
    {
        base.InputStream = new StringReader(content);
        base.DocType = "HTML";
    }

    public override bool Read()
    {
        bool status = base.Read();
        if (status)
        {
            if (base.NodeType == XmlNodeType.Element)
            {
                // Got a node with a prefix. This must be one
                // of those "<o:p>" or something else.
                // Skip this node entirely. We want prefix-less
                // nodes so that the resulting XML
                // requires no namespace.
                if (base.Name.IndexOf(':') > 0)
                    base.Skip();
            }
        }
        return status;
    }
}
```
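HtmlWriter is a bit trickier. It relies on the following techniques:
Override WriteString to bypass the regular XML encoding and encode the text manually for HTML.
Override WriteStartElement so that disallowed tags are not written to the output.
Override WriteAttributes so that unwanted attributes are not written.
Let's go through the class part by part.
You can configure HtmlWriter by changing the following fields: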
```csharp
public class HtmlWriter : XmlTextWriter
{
    /// <summary>
    /// If set to true, it will filter the output
    /// by using tag and attribute filtering,
    /// space reduction, etc.
    /// </summary>
    public bool FilterOutput = false;

    /// <summary>
    /// If true, it will reduce consecutive spaces to one instance
    /// </summary>
    public bool ReduceConsecutiveSpace = true;

    /// <summary>
    /// Set the tag names in lower case which are allowed to go to output
    /// </summary>
    public string[] AllowedTags = new string[] {
        "p", "b", "i", "u", "em", "big", "small", "div", "img", "span",
        "blockquote", "code", "pre", "br", "hr", "ul", "ol", "li",
        "del", "ins", "strong", "a", "font", "dd", "dt" };

    /// <summary>
    /// If any tag is found which is not allowed, it is replaced by this tag.
    /// Specify a tag which has the least impact on output
    /// </summary>
    public string ReplacementTag = "dd";

    /// <summary>
    /// New lines (\r\n) are replaced with a space,
    /// which saves space and makes the
    /// output compact
    /// </summary>
    public bool RemoveNewlines = true;

    /// <summary>
    /// Specify which attributes are allowed.
    /// Any other attribute will be discarded
    /// </summary>
    public string[] AllowedAttributes = new string[] {
        "class", "href", "target", "border", "src",
        "align", "width", "height", "color", "size" };
}
```
The WriteString override encodes the text manually and writes it raw:

```csharp
// The reason why we are overriding this method is that we do not
// want the output to be encoded for text inside attributes and
// inside node elements. For example, every "&nbsp;" gets converted
// to "&amp;nbsp;" in the output. But this does not apply to HTML.
// In HTML, we need to have "&nbsp;" as it is.
public override void WriteString(string text)
{
    // Change all non-breaking spaces to normal spaces
    text = text.Replace("\u00A0", " ");

    // When you are reading an RSS feed and writing HTML,
    // these lines help remove the CDATA markers
    text = text.Replace("<![CDATA[", "");
    text = text.Replace("]]>", "");

    // Do some encoding of our own because
    // we are going to use WriteRaw, which won't
    // do any of the necessary encoding
    text = text.Replace("<", "&lt;");
    text = text.Replace(">", "&gt;");
    text = text.Replace("'", "&apos;");
    text = text.Replace("\"", "&quot;");

    if (this.FilterOutput)
    {
        text = text.Trim();

        // We want to replace consecutive spaces with
        // one space in order to save horizontal width
        if (this.ReduceConsecutiveSpace)
            text = text.Replace("  ", " ");
        if (this.RemoveNewlines)
            text = text.Replace(Environment.NewLine, " ");
    }

    base.WriteRaw(text);
}
```
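A single string replacement of two spaces with one only halves each run of spaces, so longer runs are not fully collapsed. The sketch below shows one possible alternative using a regular expression. TextCompactor and Compact are hypothetical names, and the order of operations (newlines first, then spaces, then trimming) is an assumption of this sketch rather than the library's behavior.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the library described in this article.
public static class TextCompactor
{
    // Replaces line breaks with spaces, collapses every run of
    // spaces into a single space, and trims both ends.
    public static string Compact(string text)
    {
        // Handle line breaks first so the spaces they leave
        // behind are collapsed by the next step as well.
        string result = text.Replace("\r\n", " ").Replace("\n", " ");

        // Collapse any run of two or more spaces into one space.
        result = Regex.Replace(result, " {2,}", " ");

        return result.Trim();
    }
}
```

For example, `TextCompactor.Compact("a    b\r\n  c ")` returns `"a b c"`.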
The WriteStartElement override applies the tag filter:

```csharp
public override void WriteStartElement(string prefix, string localName, string ns)
{
    if (this.FilterOutput)
    {
        bool canWrite = false;
        string tagLocalName = localName.ToLower();
        foreach (string name in this.AllowedTags)
        {
            if (name == tagLocalName)
            {
                canWrite = true;
                break;
            }
        }

        // Replace a disallowed tag with the configured low-impact tag
        if (!canWrite)
            localName = this.ReplacementTag;
    }
    base.WriteStartElement(prefix, localName, ns);
}
```
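The tag filtering technique can be tried in isolation, without SgmlReader. The following is a minimal sketch, assuming well-formed input read through a standard XmlReader. AllowListWriter and TagFilterDemo are hypothetical names, and the HashSet lookup is a substitute for the array scan, not the library's own code. It relies on XmlWriter.WriteNode calling the virtual WriteStartElement for each element it copies.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical writer that renames every element not on the allow list.
public class AllowListWriter : XmlTextWriter
{
    private readonly HashSet<string> allowedTags;
    private readonly string replacementTag;

    public AllowListWriter(TextWriter output, IEnumerable<string> allowed, string replacement)
        : base(output)
    {
        // Case-insensitive lookup replaces the ToLower() plus loop approach.
        allowedTags = new HashSet<string>(allowed, StringComparer.OrdinalIgnoreCase);
        replacementTag = replacement;
    }

    public override void WriteStartElement(string prefix, string localName, string ns)
    {
        if (!allowedTags.Contains(localName))
            localName = replacementTag;
        base.WriteStartElement(prefix, localName, ns);
    }
}

public static class TagFilterDemo
{
    // Copies a well-formed fragment through the filtering writer.
    public static string Filter(string xhtml)
    {
        var output = new StringWriter();
        using (XmlReader reader = XmlReader.Create(new StringReader(xhtml)))
        using (var writer = new AllowListWriter(output, new[] { "p", "b" }, "dd"))
        {
            reader.MoveToContent();
            writer.WriteNode(reader, true);
        }
        return output.ToString();
    }
}
```

With only `p` and `b` allowed, an input such as `<div><p>hi</p><script>bad</script></div>` should come out with the `div` and `script` elements renamed to `dd` while their text content is kept. Note that renaming a tag does not remove its content, so this is a cleanup aid rather than a security filter.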
Attribute filtering happens inside the WriteAttributes override. The fragment below is the per-attribute logic, where reader is the XmlReader positioned on the current attribute:

```csharp
bool canWrite = false;
string attributeLocalName = reader.LocalName.ToLower();
foreach (string name in this.AllowedAttributes)
{
    if (name == attributeLocalName)
    {
        canWrite = true;
        break;
    }
}

// If allowed, write the attribute
if (canWrite)
    this.WriteStartAttribute(reader.Prefix, attributeLocalName, reader.NamespaceURI);

// The value must be consumed even when the attribute is dropped,
// so the reader always advances past it
while (reader.ReadAttributeValue())
{
    if (reader.NodeType == XmlNodeType.EntityReference)
    {
        if (canWrite)
            this.WriteEntityRef(reader.Name);
        continue;
    }
    if (canWrite)
        this.WriteString(reader.Value);
}

if (canWrite)
    this.WriteEndAttribute();
```
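The sample application is a utility that you can use right away to clean HTML files. You can also use these classes in tools, such as blogging clients, that need to post HTML to a web service.
Original article: http://www.codeproject.com/Articles/10792/Convert-HTML-to-XHTML-and-Clean-Unnecessary-Tags-a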