Writing a Blog Export Tool in Python

2019-11-14 17:33:37

Contributed by: a site user

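Released under a CC license; please credit the source when reposting.

Foreword

I set up a personal blog on GitHub with Octopress, which uses Markdown syntax for writing posts. Before that I had written quite a few technical posts on my CSDN blog. As the saying goes, your own child is a treasure no matter how ugly, so I got the idea of exporting my CSDN articles to the personal blog. At first I wanted to find a tool that would export the CSDN blog as XML or text, and then convert that XML or text into Markdown posts. Unfortunately, after searching the existing blog export tools, I found that most of them charge a fee before they will export all posts as XML. So I had no choice but to reinvent the wheel: write a tool that exports all posts as Markdown posts (which are also plain text files).

I will describe in detail how this tool was written, in the hope that people who have never learned programming can pick up enough simple Python syntax to modify this script and export other kinds of blogs to text. This is also my first time learning and using Python, so trust me, you can certainly export your own blog to whatever text format you want.

The source code for this article is here: ExportCSDNBlog.py

Since most non-programmers use Windows, the rest of this article describes how to write the tool on Windows.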

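Downloading the tools

Install the Python development environment on Windows (on Linux/Mac, just install the corresponding packages with pip; programmers can sort that out themselves):

Python 2.7.3
Please install this version; newer versions of Python are incompatible with some of the libraries.
Download page
After downloading, double-click the executable to install it. By default it installs to C:/Python2.7.

six
Download page. After downloading, extract it into the Python installation directory, for example C:/Python2.7/six-1.8.0.

BeautifulSoup 4.3.2
Download page. After downloading, extract it into the Python installation directory, for example C:/Python2.7/BeautifulSoup.

html5lib
Download page. After downloading, extract it into the Python installation directory, for example C:/Python2.7/html5lib-0.999.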

Installing the tools

On Windows, open a command prompt, change into each of the following directories in turn, and run setup.py install:

    C:/Python2.7/six-1.8.0>setup.py install
    C:/Python2.7/html5lib-0.999>setup.py install
    C:/Python2.7/BeautifulSoup>setup.py install
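After the three installs, one quick way to confirm that Python can see the packages is to try importing them. The sketch below is not part of the original script: the helper name missing_modules is made up for illustration, and bs4 is the import name used by BeautifulSoup 4.

```python
def missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


# An empty list means all three packages were found.
print(missing_modules(["six", "html5lib", "bs4"]))
```

If the printed list is not empty, rerun setup.py install in the directory of each package it names.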

Reference documentation

Python 2.X documentation
BeautifulSoup documentation
Regular expression documentation
Online regular expression tester

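Python syntax used

This tool uses only some basic Python syntax. If you have no Python background, it is well worth skimming the following posts.

string: string operations; see "python: string operation functions"
list: list operations; see "Python list operations"
dictionary: dictionary operations; see "dict in Python explained"
datetime: dates and times; see "Handling time with python datetime"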
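For readers who want a taste before opening those posts, here is a small sketch of the four building blocks in the form the tool uses them. The sample title, URL, and timestamp are invented values, not data from the original script.

```python
import datetime

# string: remove a marker from a title and trim the surrounding whitespace
title = " [pinned] My first post ".replace("[pinned]", "").strip()

# list: collect items with append and count them with len
titles = []
titles.append(title)
titles.append("Another post")

# dictionary: look values up by key
article = {"title": title, "url": "http://example.com/1"}

# datetime: parse a timestamp string, then format it a different way
posted = datetime.datetime.strptime("2013-08-28 23:15", "%Y-%m-%d %H:%M")
file_date = posted.strftime("%Y-%m-%d")

print("{0} | {1} titles | {2} | {3}".format(
    title, len(titles), article["url"], file_date))
```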

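Writing the blog export tool

Analysis

First, let's analyze the requirement for such a tool:

Export all CSDN blog articles as Markdown text.

This overall requirement can be split into two steps:

* Get the CSDN blog articles
* Convert the articles to Markdown text

For the first step: how do we get the blog articles?

Open any CSDN blog and you will see page navigation at the bottom that reads "XXX條數據 共XXX頁 1 2 3 … 尾頁" (XXX entries, XXX pages in total, 1 2 3 … last page). This is our starting point. Each list page shows the titles and links of the articles that belong to it, so if we visit the list pages one by one, we can pull the article titles and links out of each of them. That gives us the titles and links of all articles. With the article links we can fetch each article's HTML and then generate the corresponding Markdown text by parsing that HTML.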

Implementation

As the analysis above shows, we first need to obtain all the page links from the home page, and then go through each page link to get the article links.

The code that gets the page links:

    import urllib2
    from bs4 import BeautifulSoup

    # `header` (the HTTP request headers) and `log` are defined elsewhere in the script.
    def getPageUrlList(url):
        # Get the URLs of all list pages
        request = urllib2.Request(url, None, header)
        response = urllib2.urlopen(request)
        data = response.read()
        #print data
        soup = BeautifulSoup(data)
        lastArticleHref = None
        pageListDocs = soup.find_all(id="papelist")
        for pageList in pageListDocs:
            hrefDocs = pageList.find_all("a")
            if len(hrefDocs) > 0:
                # The last <a> in the pager is the "last page" link
                lastArticleHrefDoc = hrefDocs[len(hrefDocs) - 1]
                lastArticleHref = lastArticleHrefDoc["href"].encode('UTF-8')
        if lastArticleHref is None:
            return []

        #print " > last page href:" + lastArticleHref
        lastPageIndex = lastArticleHref.rfind("/")
        lastPageNum = int(lastArticleHref[lastPageIndex+1:])
        urlInfo = "http://blog.csdn.net" + lastArticleHref[0:lastPageIndex]
        pageUrlList = []
        for x in xrange(1, lastPageNum + 1):
            pageUrl = urlInfo + "/" + str(x)
            pageUrlList.append(pageUrl)
            log(" > page " + str(x) + ": " + pageUrl)
        log("total pages: " + str(len(pageUrlList)) + "\n")
        return pageUrlList

The parameter is url = "http://blog.csdn.net/" + username, that is, the address of your blog's home page. The urllib2 library opens this url by issuing a web request, and the HTML page returned in the response is saved in data. You can uncomment the print data line to see exactly what was returned.
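The function passes a variable named header to urllib2.Request, but the article never shows how it is defined. The sketch below is a guess at its shape, a dictionary carrying a browser-like User-Agent, and the value used here is invented. It is written for Python 3, where urllib2.Request lives in urllib.request, and it only builds the request without sending it.

```python
from urllib.request import Request  # Python 3 home of urllib2.Request

# Hypothetical header dictionary; the original script's value is not shown.
header = {"User-Agent": "Mozilla/5.0 (compatible; blog-export-sketch)"}


def build_request(username):
    """Build (but do not send) the request for a blog's home page."""
    url = "http://blog.csdn.net/" + username
    return Request(url, None, header)


request = build_request("kesalin")
print(request.full_url)
# Request stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

Passing the result to urlopen would perform the actual download, as the article's code does.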

With the HTML page content in hand, the next step is to parse it with BeautifulSoup, which greatly reduces the amount of work. I will describe its use in detail here and be brief when similar parsing appears later. soup.find_all(id="papelist") finds every tag in the HTML page with id="papelist" and returns a list containing those tags. On a CSDN blog page there is only one such place:

    <div id="papelist" class="pagelist">
      <span> 236條數據  共12頁</span>
      <strong>1</strong>
      <a href="/kesalin/article/list/2">2</a>
      <a href="/kesalin/article/list/3">3</a>
      <a href="/kesalin/article/list/4">4</a>
      <a href="/kesalin/article/list/5">5</a>
      <a href="/kesalin/article/list/6">...</a>
      <a href="/kesalin/article/list/2">下一頁</a>
      <a href="/kesalin/article/list/12">尾頁</a>
    </div>

Good, now we have the tag object for papelist. Through it we can find the <a> tag of the last page, read that tag's href attribute to get the last page number, 12, and then build the URLs of all list pages ourselves, save them in pageUrlList, and return it. A page URL looks like this:

> page 1: http://blog.csdn.net/kesalin/article/list/1
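The string handling at the end of getPageUrlList can be tried on its own, without any network access. The sketch below repeats the same steps (find the last "/", read the page number after it, rebuild one URL per page); the function name build_page_urls and its site parameter are made up for illustration.

```python
def build_page_urls(last_page_href, site="http://blog.csdn.net"):
    """Turn the href of the "last page" link into the URLs of all list pages."""
    slash = last_page_href.rfind("/")
    last_page = int(last_page_href[slash + 1:])  # "12" -> 12
    base = site + last_page_href[:slash]         # everything before the number
    return [base + "/" + str(n) for n in range(1, last_page + 1)]


urls = build_page_urls("/kesalin/article/list/12")
print(len(urls))
print(urls[0])
print(urls[-1])
```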

The code that gets the article links from each page:

    import time

    def getArticleList(url):
        # Get the url/title of every article
        pageUrlList = getPageUrlList(url)

        articleListDocs = []
        strPage = " > parsing page {0}"
        pageNum = 0
        global gRetryCount
        for pageUrl in pageUrlList:
            retryCount = 0
            pageNum = pageNum + 1
            pageNumStr = strPage.format(pageNum)
            print pageNumStr
            while retryCount <= gRetryCount:
                try:
                    retryCount = retryCount + 1
                    time.sleep(1.0) # the server stops responding if requests come too fast
                    request = urllib2.Request(pageUrl, None, header)
                    response = urllib2.urlopen(request)
                    data = response.read().decode('UTF-8')

                    #print data
                    soup = BeautifulSoup(data)

                    topArticleDocs = soup.find_all(id="article_toplist")
                    articleDocs = soup.find_all(id="article_list")
                    articleListDocs = articleListDocs + topArticleDocs + articleDocs
                    break
                except Exception as e:
                    print "getArticleList exception:%s, url:%s, retry count:%d" % (e, pageUrl, retryCount)

        artices = []
        topTile = "[置頂]"  # the "pinned" marker in front of pinned titles
        for articleListDoc in articleListDocs:
            linkDocs = articleListDoc.find_all("span", "link_title")
            for linkDoc in linkDocs:
                #print linkDoc.prettify().encode('UTF-8')
                link = linkDoc.a
                url = link["href"].encode('UTF-8')
                title = link.get_text().encode('UTF-8')
                title = title.replace(topTile, '').strip()
                oneHref = "http://blog.csdn.net" + url
                #log("   > title:" + title + ", url:" + oneHref)
                artices.append([oneHref, title])
        log("total articles: " + str(len(artices)) + "\n")
        return artices

The first step gave us all the page links in pageUrlList. Next we use these pages to get the article links and titles on each page. The key code is these three lines:

    topArticleDocs = soup.find_all(id="article_toplist")
    articleDocs = soup.find_all(id="article_list")
    articleListDocs = articleListDocs + topArticleDocs + articleDocs

This looks in the page's HTML for the tag objects of the pinned articles (article_toplist) and the ordinary articles (article_list), and saves those tags in articleListDocs.
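In getArticleList these lines sit inside a loop that sleeps before each request and tries again when a request fails. That pattern can be pulled out into a small helper, sketched below. It is a simplified variant and not the author's code: the name fetch_with_retry and its parameters are invented, and the download function is passed in so that the helper can be tried without a network connection.

```python
import time


def fetch_with_retry(fetch, url, max_retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds or max_retries attempts have failed.

    Returns whatever fetch returns, or None if every attempt raised.
    """
    for attempt in range(1, max_retries + 1):
        sleep(delay)  # pause before every request so the server keeps responding
        try:
            return fetch(url)
        except Exception as error:
            print("fetch failed: %s, url: %s, attempt: %d" % (error, url, attempt))
    return None


# Stand-in for a real download: fails twice, then succeeds.
attempts = []

def flaky_fetch(url):
    attempts.append(url)
    if len(attempts) < 3:
        raise IOError("simulated timeout")
    return "<html></html>"


page = fetch_with_retry(flaky_fetch, "http://example.com/", delay=0)
print(page)
```

In the real tool the fetch argument would be a function that opens the URL and returns the decoded page.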

An example of article_toplist (article_list has a similar format):

    <div id="article_toplist" class="list">
        <div class="list_item article_item">
            <div class="article_title">
                <span class="ico ico_type_Original"></span>
                <h1>
                    <span class="link_title">
                    <a href="/kesalin/article/details/10474007">
                    <font color="red">[置頂]</font>
                    招聘：有興趣做一個與Android對等的操作系統么？
                    </a>
                    </span>
                </h1>
            </div>
            ... ...
        </div>
        ... ...
    </div>

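Then we go through all the tag objects saved in articleListDocs, take the span tags with class link_title from them into linkDocs, and extract each link's url and title, removing the "[置頂]" ("pinned") marker from the titles of pinned articles. Finally the url and title are saved in the artices list, which is returned. An example entry of the artices list: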

title:招聘：有興趣做一個與Android對等的操作系統么？
url:http://blog.csdn.net/kesalin/article/details/10474007
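To see the link extraction work without installing BeautifulSoup, the same idea can be sketched with the standard library's HTMLParser. This is an illustration for Python 3, not the author's approach: the class name LinkTitleParser is made up, and the sample HTML uses an ASCII "[pinned]" marker and an invented title in place of the real page content.

```python
from html.parser import HTMLParser


class LinkTitleParser(HTMLParser):
    """Collect [url, title] pairs from <a> tags inside <span class="link_title">."""

    def __init__(self, marker="[pinned]"):
        HTMLParser.__init__(self)
        self.marker = marker
        self.articles = []
        self._in_span = False
        self._href = None   # href of the <a> currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "link_title":
            self._in_span = True
        elif tag == "a" and self._in_span:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).replace(self.marker, "").strip()
            self.articles.append(["http://blog.csdn.net" + self._href, title])
            self._href = None
        elif tag == "span":
            self._in_span = False


sample = """
<div id="article_toplist" class="list">
  <span class="ico ico_type_Original"></span>
  <h1><span class="link_title">
    <a href="/kesalin/article/details/10474007">
      <font color="red">[pinned]</font>
      A sample title
    </a>
  </span></h1>
</div>
"""

parser = LinkTitleParser()
parser.feed(sample)
print(parser.articles)
```

BeautifulSoup does the same job in the two find_all calls of the original code, which is why the article prefers it.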

Fetching an article's HTML from its link and converting it to Markdown text

    import datetime
    import time
    import traceback

    def download(url, output):
        # Download one article and save it in Markdown format
        log(" >> download: " + url)
        data = None
        title = ""
        categories = ""
        content = ""
        postDate = datetime.datetime.now()

        global gRetryCount
        count = 0
        while True:
            if count >= gRetryCount:
                break
            count = count + 1
            try:
                time.sleep(2.0) # the server stops responding if requests come too fast
                request = urllib2.Request(url, None, header)
                response = urllib2.urlopen(request)
                data = response.read().decode('UTF-8')
                break
            except Exception as e:
                exstr = traceback.format_exc()
                log(" >> failed to download " + url + ", retry: " + str(count) + ", error:" + exstr)
        if data is None:
            log(" >> failed to download " + url)
            return
        #print data
        soup = BeautifulSoup(data)
        topTile = "[置頂]"  # the "pinned" marker in front of pinned titles
        titleDocs = soup.find_all("div", "article_title")
        for titleDoc in titleDocs:
            titleStr = titleDoc.a.get_text().encode('UTF-8')
            title = titleStr.replace(topTile, '').strip()
            #log(" >> title: " + title)
        manageDocs = soup.find_all("div", "article_manage")
        for managerDoc in manageDocs:
            categoryDoc = managerDoc.find_all("span", "link_categories")
            if len(categoryDoc) > 0:
                categories = categoryDoc[0].a.get_text().encode('UTF-8').strip()

            postDateDoc = managerDoc.find_all("span", "link_postdate")
            if len(postDateDoc) > 0:
                postDateStr = postDateDoc[0].string.encode('UTF-8').strip()
                postDate = datetime.datetime.strptime(postDateStr, '%Y-%m-%d %H:%M')
        contentDocs = soup.find_all(id="article_content")
        for contentDoc in contentDocs:
            htmlContent = contentDoc.prettify().encode('UTF-8')
            content = htmlContent2String(htmlContent)
        exportToMarkdown(output, postDate, categories, title, content)

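As in the earlier analysis, here we visit the individual article page to get its HTML and extract the article's title, category, publication time, and content from it. These are then passed to the function exportToMarkdown, which generates the corresponding Markdown text file. It is worth mentioning that the HTML content contains some special tags and escape sequences that need special handling when the article content is parsed; this is done in the function htmlContent2String. For now only the text content is exported; images, URL links, and tables are not handled yet, and I will try to improve these conversions later.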
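One detail of the parsing is fragile: strptime raises ValueError when the date text on the page does not match '%Y-%m-%d %H:%M' exactly. A defensive variant is sketched below. The helper name parse_post_date and the fallback behaviour are assumptions added for illustration; the original script calls strptime directly.

```python
import datetime


def parse_post_date(text, default=None):
    """Parse a 'YYYY-MM-DD HH:MM' timestamp, returning `default` if it does not match."""
    try:
        return datetime.datetime.strptime(text.strip(), "%Y-%m-%d %H:%M")
    except ValueError:
        return default


print(parse_post_date(" 2013-08-28 23:15 "))
print(parse_post_date("yesterday", default=datetime.datetime(2000, 1, 1)))
```

Passing datetime.datetime.now() as the default would mirror the initial value that download gives postDate.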

    import re

    def htmlContent2String(contentStr):
        patternImg = re.compile(r'(<img.+?src=")(.+?)(".+ />)')
        patternHref = re.compile(r'(<a.+?href=")(.+?)(".+?>)(.+?)(</a>)')
        patternRemoveHtml = re.compile(r'</?[^>]+>')
        # <img ... src="URL" ... /> becomes ![image_mark](URL)
        resultContent = patternImg.sub(r'![image_mark](\2)', contentStr)
        # <a ... href="URL" ...>text</a> becomes [text](URL)
        resultContent = patternHref.sub(r'[\4](\2)', resultContent)
        # Strip every remaining HTML tag
        resultContent = re.sub(patternRemoveHtml, r'', resultContent)
        # Convert HTML escape sequences back to plain characters
        resultContent = decodeHtmlSpecialCharacter(resultContent)
        return resultContent

For now it does little more than strip all HTML tags and convert escaped characters in the function decodeHtmlSpecialCharacter.
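The conversion can be tried on a small HTML fragment. The sketch below is written for Python 3 and differs from the original in two labeled ways: the regular expressions are tightened variants rather than the author's patterns, and the standard library's html.unescape stands in for decodeHtmlSpecialCharacter, whose body the article does not show. The name html_to_text and the sample fragment are invented.

```python
import html
import re

# Tightened variants of the article's patterns (assumptions, not the originals).
PATTERN_IMG = re.compile(r'<img[^>]+?src="([^"]+)"[^>]*>')
PATTERN_HREF = re.compile(r'<a[^>]+?href="([^"]+)"[^>]*>(.+?)</a>', re.S)
PATTERN_TAG = re.compile(r'</?[^>]+>')


def html_to_text(content):
    """Convert images and links to Markdown, drop other tags, decode entities."""
    result = PATTERN_IMG.sub(r'![image_mark](\1)', content)
    result = PATTERN_HREF.sub(r'[\2](\1)', result)
    result = PATTERN_TAG.sub('', result)
    return html.unescape(result)


fragment = ('<p>See <a href="http://example.com">my &amp; your site</a> '
            '<img src="a.png" alt="x" /></p>')
print(html_to_text(fragment))
```

Regular expressions are a pragmatic shortcut here; nested or unusual markup would need a real HTML parser.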

Generating the Markdown text file

    def exportToMarkdown(exportDir, postdate, categories, title, content):
        titleDate = postdate.strftime('%Y-%m-%d')
        contentDate = postdate.strftime('%Y-%m-%d %H:%M:%S %z')
        filename = titleDate + '-' + title
        # repalceInvalidCharInFilename is defined elsewhere in the script
        filename = repalceInvalidCharInFilename(filename)
        filepath = exportDir + '/' + filename + '.markdown'
        log(" >> save as " + filename)
        newFile = open(unicode(filepath, "utf8"), 'w')
        # Octopress front matter
        newFile.write('---' + '\n')
        newFile.write('layout: post' + '\n')
        newFile.write('title: \"' + title + '\"\n')
        newFile.write('date: ' + contentDate + '\n')
        newFile.write('comments: true' + '\n')
        newFile.write('categories: [' + categories + ']' + '\n')
        newFile.write('tags: [' + categories + ']' + '\n')
        newFile.write('description: \"' + title + '\"\n')
        newFile.write('keyWords: ' + categories + '\n')
        newFile.write('---' + '\n\n')
        # Article body
        newFile.write(content)
        newFile.write('\n')
        newFile.close()
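Two parts of this function can be tried in isolation: cleaning the file name and building the header block. The sketch below is an assumption-laden illustration. The article does not show repalceInvalidCharInFilename, so safe_filename is a guess at what such a helper might do (replace the characters Windows forbids in file names), and front_matter writes only a subset of the fields. Note that for a datetime without time zone information, %z formats to an empty string, so the sketch leaves it out.

```python
import datetime
import re


def safe_filename(name):
    """Replace characters that Windows does not allow in file names (hypothetical helper)."""
    return re.sub(r'[\\/:*?"<>|]', "-", name).strip()


def front_matter(post_date, categories, title):
    """Build a header block like the one exportToMarkdown writes (subset of fields)."""
    lines = [
        "---",
        "layout: post",
        'title: "' + title + '"',
        "date: " + post_date.strftime("%Y-%m-%d %H:%M:%S"),
        "comments: true",
        "categories: [" + categories + "]",
        "---",
        "",
    ]
    return "\n".join(lines) + "\n"


posted = datetime.datetime(2013, 8, 28, 23, 15)
print(safe_filename(posted.strftime("%Y-%m-%d") + "-What is C/C++?"))
print(front_matter(posted, "Python", "Hello"))
```

Building the header as one string before writing makes it easy to inspect the output without touching the file system.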