Note: the code below is based on httpclient 4.5.2.
Fetching a web page with a GET request using Java's HttpClient is fairly easy:

    public static String get(String url) {
        String result = "";
        // try-with-resources closes the reader, the response and the client, even on failure
        try (CloseableHttpClient httpclient = HttpClients.createDefault();
             CloseableHttpResponse response = httpclient.execute(new HttpGet(url));
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(response.getEntity().getContent()))) {
            StringBuilder sb = new StringBuilder();
            String nl = System.getProperty("line.separator");
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append(nl);
            }
            result = sb.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return result;
    }

The method above is also usable when GET requests are issued from multiple threads. However, it only works because every call to get creates its own HttpClient instance, which is used once and then thrown away. That is clearly not the best implementation.
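To make the multi-threaded case concrete, here is a minimal sketch of calling a get-style method from a fixed-size thread pool. The names `ConcurrentFetch` and `fetchAll` are hypothetical and not part of HttpClient; the `fetcher` parameter stands in for the get method above, so the sketch itself only needs the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper: runs one fetch task per URL on a fixed-size thread pool. */
final class ConcurrentFetch {

    private ConcurrentFetch() {
    }

    /**
     * Applies fetcher to every URL concurrently and returns the results
     * in the same order as the input list.
     */
    static List<String> fetchAll(List<String> urls,
                                 Function<String, String> fetcher,
                                 int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit every URL first so the requests can overlap.
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                futures.add(pool.submit(() -> fetcher.apply(url)));
            }
            // Then collect the results in input order.
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                try {
                    results.add(future.get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("fetch failed", e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Assuming the get method lives in a class called `HttpUtil`, a call could look like `ConcurrentFetch.fetchAll(urls, HttpUtil::get, 8)`. With the per-call client shown above, each of those tasks builds and closes its own HttpClient.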
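HttpClient provides its own solution for multi-threaded requests; see the "Pooling connection manager" section of the official documentation. Multi-threaded requests in HttpClient are built on its built-in connection pool. The key class is PoolingHttpClientConnectionManager, which manages the HttpClient connection pool. It offers two key methods, setMaxTotal and setDefaultMaxPerRoute: setMaxTotal sets the maximum number of connections in the pool, and setDefaultMaxPerRoute sets the default maximum number of connections per route. There is also setMaxPerRoute, which sets the maximum number of connections for one specific site, like this: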

    HttpHost host = new HttpHost("localhost", 80);
    cm.setMaxPerRoute(new HttpRoute(host), 50);

Following the documentation, we can adjust our GET implementation slightly:

    package com.zhyea.robin;

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class HttpUtil {

        // One pooled client shared by all threads
        private static final CloseableHttpClient httpClient;

        static {
            PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
            cm.setMaxTotal(200);            // cap for the whole pool
            cm.setDefaultMaxPerRoute(50);   // default cap per route
            httpClient = HttpClients.custom().setConnectionManager(cm).build();
        }

        public static String get(String url) {
            String result = "";
            // Closing the response releases its connection; the shared client stays open
            try (CloseableHttpResponse response = httpClient.execute(new HttpGet(url));
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(response.getEntity().getContent()))) {
                StringBuilder sb = new StringBuilder();
                String nl = System.getProperty("line.separator");
                String line;
                while ((line = in.readLine()) != null) {
                    sb.append(line).append(nl);
                }
                result = sb.toString();
            } catch (IOException e) {
                e.printStackTrace();
            }
            return result;
        }

        public static void main(String[] args) {
            System.out.println(get("https://www.baidu.com/"));
        }
    }

That is about it. Personally, though, I prefer HttpClient's fluent API. The GET request we just implemented can be written as simply as this:

    package com.zhyea.robin;

    import org.apache.http.client.fluent.Request;

    import java.io.IOException;

    public class HttpUtil {

        public static String get(String url) {
            String result = "";
            try {
                result = Request.Get(url)
                        .connectTimeout(1000)   // milliseconds
                        .socketTimeout(1000)    // milliseconds
                        .execute().returnContent().asString();
            } catch (IOException e) {
                e.printStackTrace();
            }
            return result;
        }

        public static void main(String[] args) {
            System.out.println(get("https://www.baidu.com/"));
        }
    }

All we need to do is replace the earlier httpclient dependency with the fluent-hc dependency:

    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>fluent-hc</artifactId>
        <version>4.5.2</version>
    </dependency>

What is more, the fluent implementation uses PoolingHttpClientConnectionManager out of the box. It sets maxTotal and defaultMaxPerRoute to 200 and 100 respectively:

    CONNMGR = new PoolingHttpClientConnectionManager(sfr);
    CONNMGR.setDefaultMaxPerRoute(100);
    CONNMGR.setMaxTotal(200);

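To illustrate how a pool-wide limit and a per-route limit interact, here is a toy model in plain Java. It is only a sketch of the idea, not HttpClient's implementation: `RouteLimits`, `tryLease` and `release` are hypothetical names, routes are reduced to plain strings, and the model refuses a lease immediately, whereas the real connection manager makes the request wait for a free connection.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a pool with a global limit and a per-route limit (not HttpClient code). */
final class RouteLimits {

    private final int maxTotal;
    private final int maxPerRoute;
    private final Map<String, Integer> leasedPerRoute = new HashMap<>();
    private int leasedTotal;

    RouteLimits(int maxTotal, int maxPerRoute) {
        this.maxTotal = maxTotal;
        this.maxPerRoute = maxPerRoute;
    }

    /** Returns true if a connection for the route could be leased right now. */
    synchronized boolean tryLease(String route) {
        int leased = leasedPerRoute.getOrDefault(route, 0);
        // Both limits must hold: the pool-wide cap and the cap for this route.
        if (leasedTotal >= maxTotal || leased >= maxPerRoute) {
            return false;
        }
        leasedPerRoute.put(route, leased + 1);
        leasedTotal++;
        return true;
    }

    /** Gives back one connection previously leased for the route. */
    synchronized void release(String route) {
        int leased = leasedPerRoute.getOrDefault(route, 0);
        if (leased == 0) {
            throw new IllegalStateException("nothing leased for " + route);
        }
        leasedPerRoute.put(route, leased - 1);
        leasedTotal--;
    }
}
```

In this model, limits of 200 and 100 mean that two busy hosts holding 100 connections each use up the whole pool, so a request to a third host gets nothing until one of them releases a connection.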
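The one annoyance is that Executor provides no methods for adjusting these two values. Still, the defaults are entirely adequate. If they really are not, you can consider reimplementing Executor and then using Executor directly to execute the GET request: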

    Executor.newInstance().execute(Request.Get(url))
            .returnContent().asString();

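One way to use different pool limits without touching Executor itself is the `Executor.newInstance(HttpClient)` overload in fluent-hc, which takes a client you build yourself. The sketch below is a configuration fragment under assumptions: the class name `FluentHttpUtil` and the limits 400 and 50 are made up for illustration, and it needs fluent-hc (which pulls in httpclient) on the classpath.

```java
import org.apache.http.client.fluent.Executor;
import org.apache.http.client.fluent.Request;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

import java.io.IOException;

public class FluentHttpUtil {

    // Executor backed by our own pooled client instead of the fluent default
    private static final Executor EXECUTOR;

    static {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(400);            // assumed value, for illustration only
        cm.setDefaultMaxPerRoute(50);   // assumed value, for illustration only
        CloseableHttpClient client = HttpClients.custom()
                .setConnectionManager(cm)
                .build();
        EXECUTOR = Executor.newInstance(client);
    }

    public static String get(String url) throws IOException {
        // Same fluent call chain as before, routed through the custom Executor
        return EXECUTOR.execute(Request.Get(url)).returnContent().asString();
    }
}
```

The trade-off is that the pool is now yours to configure and shut down, rather than the shared one that fluent-hc sets up internally.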
That's it!