

A Summary of Using the Java Crawler Framework WebMagic

2019-11-06 09:17:18
Source: reposted
Contributed by: a site user
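Recently I worked on a company news website with PC and mobile (H5) front ends. Its data is crawled from two sites, HSZX and huanqiu, using WebMagic, a crawler framework written in Java. Crawling runs in two modes: a batch crawl that fetches all existing historical data, and an incremental crawl that runs on a timer every 10 minutes. Because there are two source sites with many channels, the data volume is large and updates are frequent. I hit quite a few pitfalls during development, so I am taking the time today to write them up.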
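The incremental mode can be pictured as a watermark over a newest-first list: remember the newest article ID stored for a channel, and on the next run stop at that ID. The following is a minimal sketch of that idea using only the JDK; `IncrementalSelector`, `newSince` and `nextWatermark` are hypothetical names chosen for illustration, not part of WebMagic or of the project code.

```java
import java.util.ArrayList;
import java.util.List;

/** Watermark-based selection over a newest-first list of article IDs (illustrative only). */
final class IncrementalSelector {

    /**
     * Returns the IDs that appear before lastSeenId in a newest-first list.
     * If lastSeenId is null or absent from the list, every ID is treated as new.
     */
    static List<String> newSince(List<String> newestFirstIds, String lastSeenId) {
        List<String> fresh = new ArrayList<>();
        for (String id : newestFirstIds) {
            if (id.equals(lastSeenId)) {
                break; // everything from here on was stored by an earlier run
            }
            fresh.add(id);
        }
        return fresh;
    }

    /** The next watermark is the newest ID on the page, or the old one if the page is empty. */
    static String nextWatermark(List<String> newestFirstIds, String lastSeenId) {
        return newestFirstIds.isEmpty() ? lastSeenId : newestFirstIds.get(0);
    }
}
```

A batch crawl corresponds to calling `newSince` with a `null` watermark, so every listed article is fetched.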

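Tools:

1. WebMagic is a simple, flexible crawler framework. Built on WebMagic, an efficient and easily maintained crawler can be developed quickly.

   Official site: http://webmagic.io/

   Documentation: http://webmagic.io/docs/zh/

2. jsoup is an HTML parsing tool for Java, with good parsing performance.

   Documentation: http://www.open-open.com/jsoup/

3. Jdiy is an ultra-lightweight rapid-development framework for Java. It works in both Java EE and Java SE environments, offers a convenient API for database CRUD operations, and supports the major mainstream databases.

   Official site: http://www.jdiy.org/jdiy.jd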

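I. Technologies used

WebMagic as the crawler framework, HttpClient for fetching pages, Jsoup for parsing pages and locating the content to extract, an ExecutorService thread pool for the scheduled incremental crawl, and Jdiy as the persistence framework.

II. Historical crawl code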

package com.spider.huanqiu.history;

import java.util.ArrayList;
import java.util.List;
import org.apache.commons.lang3.StringUtils;
import org.jdiy.core.Rs;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.

III. Incremental crawl code

package com.spider.huanqiu.task;

import java.util.ArrayList;
import java.util.List;
import org.apache.commons.lang3.StringUtils;
import org.jdiy.core.Rs;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import com.spider.huasheng.history.Pindao;
import com.spider.utils.Config;
import com.spider.utils.ConfigBase;
import com.spider.utils.DateUtil;
import com.spider.utils.HttpClientUtil;
import com.spider.utils.service.CommService;

public class HQNewsTaskDao extends ConfigBase implements PageProcessor {

    public static final String index_list = "(.*).huanqiu.com/(.*)pindao=(.*)";
    public static String pic_dir = fun.getProValue(PINDAO_PIC_FILE_PATH);
    // ID of the newest article stored by the previous run (the incremental marker)
    public static String new_id = "";

    // Part 1: site configuration for the crawl, including encoding, crawl interval, retry count, etc.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000).setTimeOut(6000)
            .addHeader("Accept-Encoding", "/")
            .setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.59 Safari/537.36");

    @Override
    public Site getSite() {
        return site;
    }

    @Override
    public void process(Page page) {
        try {
            // List page
            if (page.getUrl().regex(index_list).match()) {
                List<String> Urllist = new ArrayList<String>();
                String url = page.getUrl().toString();
                String pageUrl = url.substring(0, url.lastIndexOf("?"));
                String pindaoId = url.substring(url.lastIndexOf("=") + 1);
                Rs isFlag = CommService.checkPd(pindaoId, pageUrl, Config.SITE_HQ);
                if (!isFlag.isNull()) {
                    new_id = isFlag.getString("news_id");
                }
                Urllist = saveNewsListData(pageUrl, pindaoId);
                page.addTargetRequests(Urllist);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private List<String> saveNewsListData(String pageUrl, String pindaoId) {
        List<String> urlList = new ArrayList<String>();
        Document docList = null;
        String pageListStr = HttpClientUtil.getPage(pageUrl);
        if (StringUtils.isNotEmpty(pageListStr)) {
            try {
                docList = Jsoup.parse(pageListStr);
                Elements fallsFlow = docList.getElementsByClass("fallsFlow");
                if (!fallsFlow.isEmpty()) {
                    String newsIdFirst = "";
                    Boolean isIng = true;
                    Elements liTag = fallsFlow.get(0).getElementsByTag("li");
                    if (!liTag.isEmpty()) {
                        for (int i = 0; i < liTag.size() && isIng; i++) {
                            String title = "", contentUrl = "", newsId = "", pic = "", absContent = "", pushTime = "";
                            Element obj = liTag.get(i);
                            try {
                                contentUrl = obj.getElementsByTag("h3").select("a").attr("href");
                                if (StringUtils.isNotEmpty(contentUrl)) {
                                    title = obj.getElementsByTag("h3").select("a").attr("title"); // title
                                    Rs isTitle = CommService.checkNewsName(title); // skip titles that are already stored
                                    if (!isTitle.isNull()) {
                                        continue;
                                    }
                                    System.err.println("---------Current article (incremental): " + title + "------------");
                                    newsId = contentUrl.substring(contentUrl.lastIndexOf("/") + 1, contentUrl.lastIndexOf(".html"));
                                    if (!newsId.equals(new_id)) {
                                        if (!pageUrl.contains(".htm") && i == 0) {
                                            newsIdFirst = newsId;
                                        }
                                        // Image
                                        if (!obj.getElementsByTag("img").attr("src").isEmpty()) {
                                            pic = obj.getElementsByTag("img").first().attr("src");
                                            if (StringUtils.isNotEmpty(pic)) {
                                                pic = fun.downloadPic(pic, pic_dir + "list/" + newsId + "/");
                                            }
                                        }
                                        if (!obj.getElementsByTag("h5").isEmpty()) {
                                            // Summary
                                            absContent = obj.getElementsByTag("h5").first().text();
                                            if (StringUtils.isNotEmpty(absContent) && absContent.indexOf("[") > 0) {
                                                absContent = absContent.substring(0, absContent.indexOf("["));
                                            }
                                        }
                                        if (!obj.getElementsByTag("h6").isEmpty()) {
                                            pushTime = obj.getElementsByTag("h6").text();
                                        }
                                        String hrmlStr = HttpClientUtil.getPage(contentUrl);
                                        if (StringUtils.isNotEmpty(hrmlStr)) {
                                            Document docPage = Jsoup.parse(hrmlStr);
                                            Elements pageContent = docPage.getElementsByClass("conText");
                                            if (!pageContent.isEmpty()) {
                                                String comefrom = pageContent.get(0).getElementsByClass("fromSummary").text(); // source
                                                if (StringUtils.isNotEmpty(comefrom) && comefrom.contains("環球")) {
                                                    String author = pageContent.get(0).getElementsByClass("author").text(); // author
                                                    Element contentDom = pageContent.get(0).getElementById("text");
                                                    if (!contentDom.getElementsByTag("a").isEmpty()) {
                                                        contentDom.getElementsByTag("a").removeAttr("href"); // remove outbound links
                                                    }
                                                    if (!contentDom.getElementsByClass("reTopics").isEmpty()) {
                                                        contentDom.getElementsByClass("reTopics").remove(); // recommendation slot
                                                    }
                                                    if (!contentDom.getElementsByClass("spTopic").isEmpty()) {
                                                        contentDom.getElementsByClass("spTopic").remove();
                                                    }
                                                    if (!contentDom.getElementsByClass("editorSign").isEmpty()) {
                                                        contentDom.getElementsByClass("editorSign").remove(); // remove editor signature
                                                    }
                                                    String content = contentDom.toString();
                                                    if (!StringUtils.isEmpty(content)) {
                                                        content = content.replaceAll("\r\n|\r|\n|\t|\b|~|\f", ""); // strip line breaks and control characters
                                                        content = replaceForNews(content, pic_dir + "article/" + newsId + "/"); // replace images in the content
                                                        while (true) {
                                                            if (content.indexOf("<!--") == -1 && content.indexOf("<script") == -1) {
                                                                break;
                                                            } else {
                                                                if (content.indexOf("<!--") > 0 && content.lastIndexOf("-->") > 0) {
                                                                    String moveContent = content.substring(content.indexOf("<!--"), content.indexOf("-->") + 3); // remove comments
                                                                    content = content.replace(moveContent, "");
                                                                }
                                                                if (content.indexOf("<script") > 0 && content.lastIndexOf("</script>") > 0) {
                                                                    String moveContent = content.substring(content.indexOf("<script"), content.indexOf("</script>") + 9); // remove JS
                                                                    content = content.replace(moveContent, "");
                                                                }
                                                            }
                                                        }
                                                    }
                                                    if (StringUtils.isNotEmpty(content) && StringUtils.isNotEmpty(title)) {
                                                        Rs news = new Rs("News");
                                                        news.set("title", title);
                                                        news.set("shortTitle", title);
                                                        news.set("beizhu", absContent);
                                                        news.set("savetime", pushTime);
                                                        if (StringUtils.isNotEmpty(pic)) {
                                                            news.set("path", pic);
                                                            news.set("mini_image", pic);
                                                        }
                                                        news.set("pindaoId", pindaoId);
                                                        news.set("status", 1); // not displayed
                                                        news.set("canComment", 1); // whether comments are allowed
                                                        news.set("syn", 1); // asynchronous flag
                                                        news.set("type", 1);
                                                        news.set("comefrom", comefrom);
                                                        news.set("author", author);
                                                        news.set("content", content);
                                                        news.set("content2", content);
                                                        CommService.save(news);
                                                    }
                                                }
                                            }
                                        }
                                    } else {
                                        // Reached the article stored by the previous run: stop here
                                        isIng = false;
                                        break;
                                    }
                                }
                            } catch (Exception e) {
                                e.printStackTrace();
                            }
                        }
                    }
                    if (!pageUrl.contains(".htm")) {
                        // Incremental marker
                        Rs flag = CommService.checkPd(pindaoId, pageUrl, Config.SITE_HQ);
                        // First run: create the marker
                        if (flag.isNull()) {
                            Rs task = new Rs("TaskInfo");
                            task.set("pindao_id", pindaoId);
                            task.set("news_id", newsIdFirst);
                            task.set("page_url", pageUrl);
                            task.set("site", Config.SITE_HQ);
                            task.set("create_time", DateUtil.fullDate());
                            CommService.save(task);
                        } else if (StringUtils.isNotEmpty(newsIdFirst)) {
                            flag.set("news_id", newsIdFirst);
                            flag.set("update_time", DateUtil.fullDate());
                            CommService.save(flag);
                        }
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return urlList;
    }

    public static void main(String[] args) {
        List<String> strList = new ArrayList<String>();
        strList.add("http://www.xxx/exclusive/?pindao=" + Pindao.getKey("國際"));
        // Rolling news
        strList.add("http://www.xxx/article/?pindao=" + Pindao.getKey("國際"));

        for (String str : strList) {
            Spider.create(new HQNewsTaskDao()).addUrl(str).thread(1).run();
        }
    }

    // Entry point for all channels
    public static void runNewsList(List<String> strList) {
        for (String str : strList) {
            Spider.create(new HQNewsTaskDao()).addUrl(str).thread(1).run();
        }
    }
}

IV. Scheduled crawling configuration

1. Register the listener in web.xml

<!-- Added: listener for the incremental crawl -->
<listener>
    <listener-class>com.spider.utils.AutoRun</listener-class>
</listener>

2. Scheduling code

package com.spider.utils;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import com.spider.huanqiu.timer.HQJob1;
import com.spider.huanqiu.timer.HQJob2;
import com.spider.huanqiu.timer.HQJob3;
import com.spider.huanqiu.timer.HQJob4;
import com.spider.huasheng.timer.HSJob1;
import com.spider.huasheng.timer.HSJob2;

/**
 * Description: listener that starts the incremental crawl jobs
 * Created: 2016-11-4
 * @author Jibaole
 */
public class AutoRun implements ServletContextListener {

    public void contextInitialized(ServletContextEvent event) {
        ScheduledExecutorService scheduExec = Executors.newScheduledThreadPool(6);
        /*
         * The jobs start running periodically from here.
         * scheduleAtFixedRate(command, initialDelay, period, unit):
         *    command: the task to run; initialDelay: delay before the first run;
         *    period: interval between runs; unit: time unit of both values (milliseconds here)
         */
        scheduExec.scheduleAtFixedRate(new HSJob1(), 1*1000*60, 1000*60*10, TimeUnit.MILLISECONDS);  // delay 1 minute, then run every 10 minutes
        scheduExec.scheduleAtFixedRate(new HSJob2(), 3*1000*60, 1000*60*10, TimeUnit.MILLISECONDS);  // delay 3 minutes, then run every 10 minutes

        scheduExec.scheduleAtFixedRate(new HQJob1(), 5*1000*60, 1000*60*10, TimeUnit.MILLISECONDS);  // delay 5 minutes, then run every 10 minutes
        scheduExec.scheduleAtFixedRate(new HQJob2(), 7*1000*60, 1000*60*10, TimeUnit.MILLISECONDS);  // delay 7 minutes, then run every 10 minutes
        scheduExec.scheduleAtFixedRate(new HQJob3(), 9*1000*60, 1000*60*10, TimeUnit.MILLISECONDS);  // delay 9 minutes, then run every 10 minutes
        scheduExec.scheduleAtFixedRate(new HQJob4(), 11*1000*60, 1000*60*10, TimeUnit.MILLISECONDS); // delay 11 minutes, then run every 10 minutes
    }

    public void contextDestroyed(ServletContextEvent event) {
        System.out.println("=======timer destroyed==========");
        //timer.cancel();
    }
}

3. The actual job logic (one example)

package com.spider.huasheng.timer;

import java.util.ArrayList;
import java.util.List;
import com.spider.huasheng.task.HSTaskDao;
import com.spider.huasheng.task.HSTaskDao1;
import com.spider.huasheng.task.HSTaskDao2;

/**
 * Description: scheduled job for the International, Society, Domestic, Commentary and other channels
 * Created: 2016-11-9
 * @author Jibaole
 */
public class HSJob1 implements Runnable {

    @Override
    public void run() {
        System.out.println("======>>>start: xxx-job 1====");
        try {
            runNews();
            runNews1();
            runNews2();
        } catch (Throwable t) {
            // Swallow everything so that one failure does not cancel later runs
            System.out.println("Error: " + t);
        }
        System.out.println("======xxx-job 1>>>end!!!====");
    }

    /**
     * Crawl the news channel lists
     */
    public void runNews() {
        List<String> strList = new ArrayList<String>();
        // ----- 16. International -----
        // International outlook
        strList.add("http://xxx/class/2199.html?pindao=國際");

        // ----- 17. Society -----
        // Society
        strList.add("http://xxx/class/2200.html?pindao=社會");

        // ----- 18. Domestic -----
        // Domestic news
        strList.add("http://xxx/class/1922.html?pindao=國內");
        HSTaskDao.runNewsList(strList);
    }

    /**
     * Crawl the news channel lists
     */
    public void runNews1() {
        List<String> strList = new ArrayList<String>();
        // ----- 19. Commentary -----
        // Huasheng viewpoint
        strList.add("http://xxx/class/709.html?pindao=評論");
        // Finance watch
        strList.add("http://xxx/class/2557.html?pindao=評論");
        // ----- 20. Military -----
        // Military
        strList.add("http://xxx/class/2201.html?pindao=軍事");
        HSTaskDao1.runNewsList(strList);
    }

    /**
     * Crawl the news channel lists
     */
    public void runNews2() {
        List<String> strList = new ArrayList<String>();
        // ----- 24. Finance -----
        // Financial news
        strList.add("http://xxx/class/2353.html?pindao=財經");
        // Economic watch
        strList.add("http://xxx/class/2348.html?pindao=財經");
        // ----- 30. Humanities -----
        // Today in history
        strList.add("http://xxx/class/1313.html?pindao=人文");
        // Official history
        strList.add("http://xxx/class/1362.html?pindao=人文");
        HSTaskDao2.runNewsList(strList);
    }
}

V. Utility classes used

1. The HttpClientUtil utility class

package com.spider.utils;

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Map;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.conn.ssl.DefaultHostnameVerifier;
import org.apache.http.conn.util.PublicSuffixMatcher;
import org.apache.http.conn.util.PublicSuffixMatcherLoader;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.StringEntity;
import org.apache.http.entity.mime.MultipartEntityBuilder;
import org.apache.http.entity.mime.content.FileBody;
import org.apache.http.entity.mime.content.StringBody;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class HttpClientUtil {
    private final static String charset = "UTF-8";
    private RequestConfig requestConfig = RequestConfig.custom().setSocketTimeout(15000)
            .setConnectTimeout(15000)
            .setConnectionRequestTimeout(15000)
            .build();

    private static HttpClientUtil instance = null;
    private HttpClientUtil() {}
    public static HttpClientUtil getInstance() {
        if (instance == null) {
            instance = new HttpClientUtil();
        }
        return instance;
    }

    /**
     * Send a POST request
     * @param httpUrl address
     */
    public String sendHttpPost(String httpUrl) {
        HttpPost httpPost = new HttpPost(httpUrl); // create the HttpPost
        return sendHttpPost(httpPost);
    }

    /**
     * Send a POST request
     * @param httpUrl address
     * @param params parameters (format: key1=value1&key2=value2)
     */
    public String sendHttpPost(String httpUrl, String params) {
        HttpPost httpPost = new HttpPost(httpUrl); // create the HttpPost
        try {
            // Set the parameters
            StringEntity stringEntity = new StringEntity(params, "UTF-8");
            stringEntity.setContentType("application/x-www-form-urlencoded");
            httpPost.setEntity(stringEntity);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return sendHttpPost(httpPost);
    }

    /**
     * Send a POST request
     * @param httpUrl address
     * @param maps parameters
     */
    public String sendHttpPost(String httpUrl, Map<String, String> maps) {
        HttpPost httpPost = new HttpPost(httpUrl); // create the HttpPost
        httpPost.setHeader("Content-Type", "application/x-www-form-urlencoded;charset=" + charset);
        httpPost.setHeader("User-Agent", "Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.");
        // Build the parameter list
        List<NameValuePair> nameValuePairs = new ArrayList<NameValuePair>();
        for (String key : maps.keySet()) {
            nameValuePairs.add(new BasicNameValuePair(key, maps.get(key)));
        }
        try {
            httpPost.setEntity(new UrlEncodedFormEntity(nameValuePairs, "UTF-8"));
        } catch (Exception e) {
            e.printStackTrace();
        }
        return sendHttpPost(httpPost);
    }

    /**
     * Send a POST request (with files)
     * @param httpUrl address
     * @param maps parameters
     * @param fileLists attachments
     */
    public String sendHttpPost(String httpUrl, Map<String, String> maps, List<File> fileLists) {
        HttpPost httpPost = new HttpPost(httpUrl); // create the HttpPost
        MultipartEntityBuilder meBuilder = MultipartEntityBuilder.create();
        for (String key : maps.keySet()) {
            meBuilder.addPart(key, new StringBody(maps.get(key), ContentType.TEXT_PLAIN));
        }
        for (File file : fileLists) {
            FileBody fileBody = new FileBody(file);
            meBuilder.addPart("files", fileBody);
        }
        HttpEntity reqEntity = meBuilder.build();
        httpPost.setEntity(reqEntity);
        return sendHttpPost(httpPost);
    }

    /**
     * Send a POST request
     * @param httpPost
     * @return response body, or null on failure
     */
    private String sendHttpPost(HttpPost httpPost) {
        CloseableHttpClient httpClient = null;
        CloseableHttpResponse response = null;
        HttpEntity entity = null;
        String responseContent = null;
        try {
            // Create a default HttpClient instance
            httpClient = HttpClients.createDefault();
            httpPost.setConfig(requestConfig);
            // Execute the request
            response = httpClient.execute(httpPost);
            entity = response.getEntity();
            responseContent = EntityUtils.toString(entity, "UTF-8");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                // Close the connection and release resources
                if (response != null) {
                    response.close();
                }
                if (httpClient != null) {
                    httpClient.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return responseContent;
    }

    /**
     * Send a GET request
     * @param httpUrl
     */
    public String sendHttpGet(String httpUrl) {
        HttpGet httpGet = new HttpGet(httpUrl); // create the GET request
        return sendHttpGet(httpGet);
    }

    /**
     * Send a GET request over HTTPS
     * @param httpUrl
     */
    public String sendHttpsGet(String httpUrl) {
        HttpGet httpGet = new HttpGet(httpUrl); // create the GET request
        return sendHttpsGet(httpGet);
    }

    /**
     * Send a GET request
     * @param httpGet
     * @return response body, or null on failure
     */
    private String sendHttpGet(HttpGet httpGet) {
        CloseableHttpClient httpClient = null;
        CloseableHttpResponse response = null;
        HttpEntity entity = null;
        String responseContent = null;
        try {
            // Create a default HttpClient instance
            httpClient = HttpClients.createDefault();
            httpGet.setConfig(requestConfig);
            // Execute the request
            response = httpClient.execute(httpGet);
            entity = response.getEntity();
            responseContent = EntityUtils.toString(entity, "UTF-8");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                // Close the connection and release resources
                if (response != null) {
                    response.close();
                }
                if (httpClient != null) {
                    httpClient.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return responseContent;
    }

    /**
     * Send a GET request over HTTPS
     * @param httpGet
     * @return response body, or null on failure
     */
    private String sendHttpsGet(HttpGet httpGet) {
        CloseableHttpClient httpClient = null;
        CloseableHttpResponse response = null;
        HttpEntity entity = null;
        String responseContent = null;
        try {
            // Create an HttpClient instance with a hostname verifier
            PublicSuffixMatcher publicSuffixMatcher = PublicSuffixMatcherLoader.load(new URL(httpGet.getURI().toString()));
            DefaultHostnameVerifier hostnameVerifier = new DefaultHostnameVerifier(publicSuffixMatcher);
            httpClient = HttpClients.custom().setSSLHostnameVerifier(hostnameVerifier).build();
            httpGet.setConfig(requestConfig);
            // Execute the request
            response = httpClient.execute(httpGet);
            entity = response.getEntity();
            responseContent = EntityUtils.toString(entity, "UTF-8");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                // Close the connection and release resources
                if (response != null) {
                    response.close();
                }
                if (httpClient != null) {
                    httpClient.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return responseContent;
    }

    /**
     * Fetch a page with HttpClient
     * @param url
     * @return page content, or an empty string on failure
     */
    public static String getPage(String url) {
        String result = "";
        HttpClient httpClient = new HttpClient();
        GetMethod getMethod = new GetMethod(url + "?date=" + new Date().getTime()); // add a timestamp to defeat page caching
        try {
            // Timeouts must be set before the request is executed to take effect
            httpClient.setTimeout(5000);
            httpClient.setConnectionTimeout(5000);
            int statusCode = httpClient.executeMethod(getMethod);
            if (statusCode != HttpStatus.SC_OK) {
                System.err.println("Method failed: " + getMethod.getStatusLine());
            }

            // Read the content
            //byte[] responseBody = getMethod.getResponseBody();
            BufferedReader reader = new BufferedReader(new InputStreamReader(getMethod.getResponseBodyAsStream()));
            StringBuffer stringBuffer = new StringBuffer();
            String str = "";
            while ((str = reader.readLine()) != null) {
                stringBuffer.append(str);
            }
            // Process the content
            result = stringBuffer.toString();
        } catch (Exception e) {
            System.err.println("Page could not be accessed");
        }
        getMethod.releaseConnection();
        return result;
    }
}

2. Image download method

 

/**
 * Download an image to local storage
 * @param picUrl image URL
 * @param localPath local directory in which to save the image
 * @return public URL of the saved image, or null on failure
 */
public String downloadPic(String picUrl, String localPath) {
    String filePath = null;
    String url = null;
    try {
        URL httpurl = new URL(picUrl);
        String fileName = getFileNameFromUrl(picUrl);
        filePath = localPath + fileName;
        File f = new File(filePath);
        FileUtils.copyURLToFile(httpurl, f);
        Function fun = new Function();
        url = filePath.replace("/www/web/imgs", fun.getProValue("IMG_PATH"));
    } catch (Exception e) {
        logger.info(e);
        return null;
    }
    return url;
}

3. Method that replaces images in news content

/**
 * Replace image addresses in the content with local addresses
 * @param content HTML content
 * @param pic_dir local directory for the images
 * @return HTML content
 */
public static String replaceForNews(String content, String pic_dir) {
    String str = content;
    String cont = content;
    while (true) {
        int i = str.indexOf("src=\"");
        if (i != -1) {
            str = str.substring(i + 5, str.length());
            int j = str.indexOf("\"");
            String pic_url = str.substring(0, j);
            // Download the image locally and get its new address
            String pic_path = fun.downloadPicForNews(pic_url, pic_dir);
            if (StringUtils.isNotEmpty(pic_url) && StringUtils.isNotEmpty(pic_path)) {
                cont = cont.replace(pic_url, pic_path);
                str = str.substring(j, str.length());
            }
        } else {
            break;
        }
    }
    return cont;
}

/**
 * Download an image to local storage
 * @param picUrl image URL
 * @param localPath local directory in which to save the image
 * @return public URL of the saved image, an empty string if the image is unavailable, or null on error
 */
public String downloadPicForNews(String picUrl, String localPath) {
    String filePath = "";
    String url = "";
    try {
        URL httpurl = new URL(picUrl);
        HttpURLConnection urlcon = (HttpURLConnection) httpurl.openConnection();
        urlcon.setReadTimeout(3000);
        urlcon.setConnectTimeout(3000);
        int state = urlcon.getResponseCode(); // HTTP status of the image
        if (state == 200) {
            String fileName = getFileNameFromUrl(picUrl);
            filePath = localPath + fileName;
            File f = new File(filePath);
            FileUtils.copyURLToFile(httpurl, f);
            Function fun = new Function();
            url = filePath.replace("/www/web/imgs", fun.getProValue("IMG_PATH"));
        }
    } catch (Exception e) {
        logger.info(e);
        return null;
    }
    return url;
}

4. Generating the file name from a timestamp

/**
 * Build a file name from the URL
 * @param url
 * @return file name
 */
public static String getFileNameFromUrl(String url) {
    // Take the extension
    String sux = url.substring(url.lastIndexOf("."));
    if (sux.length() > 4) {
        sux = ".jpg";
    }
    int i = (int) (Math.random() * 1000);
    // File name: current timestamp plus a random number
    String name = String.valueOf(System.currentTimeMillis()) + i + sux;
    return name;
}

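VI. Pitfalls encountered

1. The incremental crawl frequently hit the following two exceptions

   Fetch timeouts: pages were originally fetched with Jsoup. Fetching was switched to HttpClient, with Jsoup used only for parsing.

   Page gzip exception (this one was especially nasty: it caused serious data loss in both the historical and the incremental crawl, and production kept misbehaving).

   Solution: add Site.addHeader("Accept-Encoding", "/")

   This comes from a small bug in the WebMagic framework source: if the header is not set, the page's Accept-Encoding defaults to gzip.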
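For background, a response sent with Content-Encoding: gzip has to be decompressed before it can be parsed, and a body that is treated as plain text without that step turns into garbage. The sketch below shows that step with the JDK only. It is an illustration of the general mechanism, not the author's fix (which is the header change above) and not WebMagic code; `BodyDecoder` and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Decodes a response body according to its Content-Encoding header (illustrative only). */
final class BodyDecoder {

    /** Returns the body as UTF-8 text, decompressing it first when the encoding is gzip. */
    static String decode(byte[] body, String contentEncoding) {
        try (InputStream in = "gzip".equalsIgnoreCase(contentEncoding)
                ? new GZIPInputStream(new ByteArrayInputStream(body))
                : new ByteArrayInputStream(body)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for trying the decoder out: gzip-compresses a string. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

Overriding Accept-Encoding, as in the solution above, avoids the problem from the other side by asking the server not to compress the response in the first place.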

      

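2. Scheduled crawling

   The single-threaded, serial Timer was replaced by a ScheduledExecutorService, which runs the tasks in parallel on multiple threads.

   The original code was as follows: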

package com.spider.utils;

import java.util.Timer;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import com.spider.huanqiu.timer.HQJob1;
import com.spider.huanqiu.timer.HQJob2;
import com.spider.huanqiu.timer.HQJob3;
import com.spider.huanqiu.timer.HQJob4;
import com.spider.huasheng.timer.HSJob1;
import com.spider.huasheng.timer.HSJob2;

/**
 * Description: listener that starts the incremental crawl jobs
 * Created: 2016-11-4
 * @author Jibaole
 */
public class AutoRun implements ServletContextListener {
    // HS jobs
    private Timer hsTimer1 = null;
    private Timer hsTimer2 = null;
    // HQZX jobs
    private Timer hqTimer1 = null;
    private Timer hqTimer2 = null;
    private Timer hqTimer3 = null;
    private Timer hqTimer4 = null;

    public void contextInitialized(ServletContextEvent event) {
        hsTimer1 = new Timer(true);
        hsTimer2 = new Timer(true);

        hqTimer1 = new Timer(true);
        hqTimer2 = new Timer(true);
        hqTimer3 = new Timer(true);
        hqTimer4 = new Timer(true);
        /*
         * The jobs start running periodically from here.
         * scheduleAtFixedRate(param1, param2, param3):
         *    param1: the task to run; param2: delay before the first run, in milliseconds;
         *    param3: interval between runs, in milliseconds
         */
        hsTimer1.scheduleAtFixedRate(new HSJob1(), 1*1000*60, 1000*60*10);  // delay 1 minute, then run every 10 minutes
        hsTimer2.scheduleAtFixedRate(new HSJob2(), 3*1000*60, 1000*60*10);  // delay 3 minutes, then run every 10 minutes

        hqTimer1.scheduleAtFixedRate(new HQJob1(), 5*1000*60, 1000*60*10);  // delay 5 minutes, then run every 10 minutes
        hqTimer2.scheduleAtFixedRate(new HQJob2(), 7*1000*60, 1000*60*10);  // delay 7 minutes, then run every 10 minutes
        hqTimer3.scheduleAtFixedRate(new HQJob3(), 9*1000*60, 1000*60*10);  // delay 9 minutes, then run every 10 minutes
        hqTimer4.scheduleAtFixedRate(new HQJob4(), 11*1000*60, 1000*60*10); // delay 11 minutes, then run every 10 minutes
    }

    public void contextDestroyed(ServletContextEvent event) {
        System.out.println("=======timer destroyed==========");
        //timer.cancel();
    }
}
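Neither listing stops its timers or thread pool in contextDestroyed, so the worker threads can outlive a redeploy. A minimal sketch of a scheduler wrapper with an explicit shutdown is shown below, using only java.util.concurrent. `CrawlScheduler` and its methods are hypothetical names, and the millisecond intervals in `demo` are shortened stand-ins for the 10-minute period used in the article.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Thin wrapper around a scheduled thread pool with an explicit shutdown (illustrative only). */
final class CrawlScheduler {
    private final ScheduledExecutorService pool;

    CrawlScheduler(int threads) {
        this.pool = Executors.newScheduledThreadPool(threads);
    }

    /** Registers a job that first runs after initialDelay and then once every period. */
    void register(Runnable job, long initialDelay, long period, TimeUnit unit) {
        pool.scheduleAtFixedRate(job, initialDelay, period, unit);
    }

    /** Cancels all scheduling and waits for running jobs; returns true if the pool terminated in time. */
    boolean shutdown(long timeout, TimeUnit unit) {
        pool.shutdownNow();
        try {
            return pool.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Runs two staggered jobs until each has fired `runs` times, then shuts the pool down. */
    static boolean demo(int runs, long timeoutMs) {
        CountDownLatch first = new CountDownLatch(runs);
        CountDownLatch second = new CountDownLatch(runs);
        CrawlScheduler scheduler = new CrawlScheduler(2);
        scheduler.register(first::countDown, 0, 10, TimeUnit.MILLISECONDS);
        scheduler.register(second::countDown, 5, 10, TimeUnit.MILLISECONDS);
        try {
            return first.await(timeoutMs, TimeUnit.MILLISECONDS)
                    && second.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            scheduler.shutdown(1, TimeUnit.SECONDS);
        }
    }
}
```

In a servlet listener, the scheduler would be kept in a field, filled in contextInitialized, and shut down from contextDestroyed.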

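3. When several tasks are scheduled on multiple threads, an exception thrown in one thread terminates that task

   Solution: wrap the body of each job's run() method in try { } catch { }.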
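The reason is documented behaviour of ScheduledExecutorService.scheduleAtFixedRate: if any execution of the task throws, subsequent executions are suppressed. Instead of repeating the try/catch in every job, the jobs can be wrapped once when they are scheduled. The sketch below assumes nothing beyond the JDK; `SafeJob`, `wrap` and `survivesFailures` are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Wraps jobs so that an exception in one run does not cancel the following runs (illustrative only). */
final class SafeJob {

    /** Returns a Runnable that logs and swallows anything thrown by body. */
    static Runnable wrap(String name, Runnable body) {
        return () -> {
            try {
                body.run();
            } catch (Throwable t) {
                // Log and carry on; rethrowing here would stop all future executions
                System.err.println("Job " + name + " failed: " + t);
            }
        };
    }

    /** Schedules a job that fails on every run and reports whether it still ran `runs` times. */
    static boolean survivesFailures(int runs, long timeoutMs) {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch latch = new CountDownLatch(runs);
        Runnable failing = () -> {
            latch.countDown();
            throw new IllegalStateException("simulated crawl failure");
        };
        pool.scheduleAtFixedRate(wrap("demo", failing), 0, 10, TimeUnit.MILLISECONDS);
        try {
            return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Scheduling the unwrapped `failing` task instead would run it exactly once, which is the symptom described in this item.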

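4. When HttpClient fetches a page on a schedule, the page is served from cache and the latest content is missed

   Solution: in the utility class, append a timestamp to the request URL: url + "?date=" + new Date().getTime()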
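Appending "?date=" works when the URL has no query string, which holds in the crawl code because the list URL is cut at "?" before it is fetched. A slightly more general helper picks the separator based on the URL. This is a sketch under that assumption; `CacheBuster` and `withTimestamp` are hypothetical names.

```java
/** Appends a timestamp parameter to a URL to defeat intermediate caches (illustrative only). */
final class CacheBuster {

    /** Adds date=<timestampMillis>, using '?' or '&' as needed and keeping any #fragment last. */
    static String withTimestamp(String url, long timestampMillis) {
        String fragment = "";
        int hash = url.indexOf('#');
        if (hash >= 0) {
            fragment = url.substring(hash);
            url = url.substring(0, hash);
        }
        char separator = url.indexOf('?') >= 0 ? '&' : '?';
        return url + separator + "date=" + timestampMillis + fragment;
    }

    /** Convenience overload that uses the current time. */
    static String withTimestamp(String url) {
        return withTimestamp(url, System.currentTimeMillis());
    }
}
```

For example, `CacheBuster.withTimestamp("http://example.com/list?pindao=1", 42L)` yields `http://example.com/list?pindao=1&date=42`.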

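VII. Other adjustments

1. Page crawling rules

   Originally the list was crawled first and the content afterwards. This was changed so that the article details are fetched while the list is being crawled.

2. How data is saved

   Originally the title and other summary fields were crawled and saved to the database first, the content was filled in by a later update, and articles from unwanted sources were then deleted as the business required (a copyright concern). Now the source is checked up front, the complete record is assembled, and the data is then saved in batch.

3. Handling unwanted content inside a page

   Strip comments and JS code, and remove useless tag blocks.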
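The hand-written loop in the crawl code only removes a comment or script block whose start index is greater than 0, and it relies on every opening marker having a closing one. As an alternative sketch, the same cleanup can be expressed with two regular expressions from java.util.regex. `HtmlCleaner` is a hypothetical name, and regular expressions are only a rough tool for HTML, so this assumes reasonably well-formed input such as the article bodies described here.

```java
import java.util.regex.Pattern;

/** Removes HTML comments and script blocks from a fragment (illustrative only). */
final class HtmlCleaner {
    // DOTALL lets '.' span line breaks; the reluctant '.*?' stops at the first closing marker
    private static final Pattern SCRIPT =
            Pattern.compile("<script\\b.*?</script\\s*>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    /** Strips script blocks first (they may contain comment markers), then comments. */
    static String strip(String html) {
        String withoutScripts = SCRIPT.matcher(html).replaceAll("");
        return COMMENT.matcher(withoutScripts).replaceAll("");
    }
}
```

Unclosed markers are simply left in place, so the method always terminates.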