A Summary of Ways to Fetch Web Page Content in Java

2019-11-14 21:02:34

Source: reposted

Contributed by: a site user

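To write a scraper in Java, I picked up three ways to fetch the content of a single web page from online sources. They rely mainly on Java I/O streams, which I was not familiar with, so I am writing this summary.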

    import java.io.BufferedReader;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Get_Html {

        public static void main(String[] args) throws Exception {
            long start = System.currentTimeMillis();
            String str_url = "http://www.hiphop8.com/city/guangdong/guangzhou.php";
            // Matches 7-digit mobile number segments (13x, 15x, 18x, 147) between tags.
            Pattern p = Pattern.compile(">(13\\d{5}|15\\d{5}|18\\d{5}|147\\d{4})<");
            //String html = get_Html_2(str_url);
            //String html = get_Html_1(str_url);
            String html = get_Html_3(str_url);
            Matcher m = p.matcher(html);
            int num = 0;
            while (m.find()) {
                System.out.println("Number segment found: " + m.group(1) + "  No. " + (++num));
            }
            System.out.println(num);
            long end = System.currentTimeMillis();
            System.out.println("Time taken: " + (end - start) + " ms");
        }

        // Method 2: read line by line straight from url.openStream().
        public static String get_Html_2(String str_url) throws IOException {
            URL url = new URL(str_url);
            String content = "";
            StringBuffer page = new StringBuffer();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream(), "utf-8"))) {
                while ((content = in.readLine()) != null) {
                    page.append(content);
                }
            } catch (IOException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
            return page.toString();
        }

        // Method 1: open an HttpURLConnection, then read line by line.
        public static String get_Html_1(String str_url) throws IOException {
            URL url = new URL(str_url);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            StringBuilder contentBuf = new StringBuilder();
            try (BufferedReader bufReader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "utf-8"))) {
                String line = "";
                while ((line = bufReader.readLine()) != null) {
                    contentBuf.append(line);
                }
            }
            return contentBuf.toString();
        }

        /**
         * Method 3: fetches the HTML source of the page at the given URL.
         * @param str_url the page URL
         * @return String
         * @throws Exception
         */
        public static String get_Html_3(String str_url) throws Exception {
            URL url = new URL(str_url);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(5 * 1000);                      // connection timeout: 5 seconds
            java.io.InputStream inStream = conn.getInputStream();  // raw bytes of the HTML response
            byte[] data = readInputStream(inStream);               // drain the stream into a byte array
            // Decode explicitly as UTF-8, matching methods 1 and 2, rather than the platform default.
            String htmlSource = new String(data, "utf-8");
            return htmlSource;
        }

        /**
         * Reads an input stream to the end and returns its contents as a byte array.
         * @param inStream
         * @return byte[]
         * @throws Exception
         */
        public static byte[] readInputStream(java.io.InputStream inStream) throws Exception {
            ByteArrayOutputStream outStream = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            int len = 0;
            while ((len = inStream.read(buffer)) != -1) {
                outStream.write(buffer, 0, len);
            }
            inStream.close();
            return outStream.toByteArray();
        }
    }

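[Results of six test runs for each method] Perhaps because the page being fetched is small, the three methods perform about the same. Even so, method 2 is probably the best choice, since it is the simplest.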

Elapsed time per run, in milliseconds:

//get_Html_1  967 2658 1132 1199  988 1236
//get_Html_2 2323 2244 1202 1166 1081 1011
//get_Html_3  978 1219 1527 1133 1192 1774
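To make the comparison a little more concrete, the short sketch below averages the six timings listed above. The class name TimingSummary and the helper mean are hypothetical names chosen for illustration. Six runs against a live site are a very small sample, so the averages are only a rough guide.

```java
import java.util.Locale;

class TimingSummary {

    // Arithmetic mean of the measured times, in milliseconds.
    static double mean(long[] samples) {
        long sum = 0;
        for (long sample : samples) {
            sum += sample;
        }
        return (double) sum / samples.length;
    }

    public static void main(String[] args) {
        // Timings copied from the test results above.
        long[] method1 = {967, 2658, 1132, 1199, 988, 1236};
        long[] method2 = {2323, 2244, 1202, 1166, 1081, 1011};
        long[] method3 = {978, 1219, 1527, 1133, 1192, 1774};

        System.out.printf(Locale.ROOT, "get_Html_1 mean = %.1f ms%n", mean(method1));
        System.out.printf(Locale.ROOT, "get_Html_2 mean = %.1f ms%n", mean(method2));
        System.out.printf(Locale.ROOT, "get_Html_3 mean = %.1f ms%n", mean(method3));
    }
}
```

The means come out at roughly 1363 ms, 1505 ms and 1304 ms. Each is pulled up by one or two slow runs, which fits the impression that the three methods do not differ much.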

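1. About url.openStream() and conn.getInputStream()

Both return an InputStream object. url.openStream() is shorthand for calling openConnection() to obtain a URLConnection and then calling getInputStream() on it, so method 2 and method 1 do the same thing; method 2 is simply more convenient.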
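A minimal sketch of that equivalence follows. To stay runnable without network access it reads a temporary local file through a file: URL instead of a web page; the class OpenStreamDemo and its helper methods are hypothetical names used only for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class OpenStreamDemo {

    // Drain a stream into a UTF-8 string and close it afterwards.
    static String readAll(InputStream in) throws IOException {
        try (InputStream stream = in) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            int len;
            while ((len = stream.read(buffer)) != -1) {
                out.write(buffer, 0, len);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    // The shorthand form, as in method 2.
    static String viaOpenStream(URL url) throws IOException {
        return readAll(url.openStream());
    }

    // The explicit two-step form, as in method 1.
    static String viaOpenConnection(URL url) throws IOException {
        return readAll(url.openConnection().getInputStream());
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a temporary local file stands in for a web page.
        Path page = Files.createTempFile("page", ".html");
        Files.write(page, "<html>hello</html>".getBytes(StandardCharsets.UTF_8));
        URL url = page.toUri().toURL();

        // Both routes should yield the same content.
        System.out.println(viaOpenStream(url).equals(viaOpenConnection(url)));
        Files.delete(page);
    }
}
```

The explicit form is still worth having when the connection needs configuring first, for example the setConnectTimeout call in method 3.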

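2. About the BufferedReader class

[What the class does]: It buffers a character stream in memory (the buffer is a small region of memory) so that the text can be read efficiently.

[Constructors]:

BufferedReader(Reader in): creates a buffering character-input stream that uses a default-sized input buffer.

BufferedReader(Reader in, int sz): creates a buffering character-input stream that uses an input buffer of the specified size.

[Commonly used method]: readLine() reads the text one line at a time. It returns each line without its line terminator, and returns null once the end of the stream has been reached.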
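The following sketch shows the readLine() loop in isolation, using a StringReader in place of a network stream. The class ReadLineDemo and the method readLines are hypothetical names for illustration. It also shows a side effect worth knowing: because readLine() drops the line terminators, appending the lines directly (as the three methods above do) joins the page into one long line unless a separator is added back.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class ReadLineDemo {

    // Collect every line from the reader; readLine() returns null at end of stream.
    static List<String> readLines(Reader source) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Mixed \n and \r\n terminators are both recognized and stripped.
        List<String> lines = readLines(new StringReader("<html>\n<body>\r\n</html>"));
        System.out.println(lines.size());             // 3
        System.out.println(String.join("", lines));   // <html><body></html>
        System.out.println(String.join("\n", lines)); // line breaks restored
    }
}
```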

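3. About the InputStreamReader class

InputStreamReader is a bridge from byte streams to character streams: it reads bytes and decodes them into characters using the specified charset. It is a subclass of Reader.

For better efficiency, an InputStreamReader is usually wrapped in a BufferedReader, so a commonly seen idiom is:

BufferedReader buf = new BufferedReader(new InputStreamReader(System.in));

Here InputStreamReader converts the byte stream into a character stream, so the statement above turns a byte input stream into a character input stream and places it behind a buffer.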
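The charset passed to InputStreamReader matters as soon as the page contains non-ASCII text. The sketch below decodes the same UTF-8 bytes twice, once with the right charset and once with the wrong one. The class DecodeDemo and the method decode are hypothetical names, and an in-memory ByteArrayInputStream stands in for the network stream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {

    // InputStreamReader decodes the bytes; BufferedReader adds buffering on top.
    static String decode(byte[] bytes, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), charset))) {
            StringBuilder text = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) { // read() returns one char, or -1 at end of stream
                text.append((char) c);
            }
            return text.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        String original = "\u4e2d\u6587"; // two Chinese characters, written as Unicode escapes
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(original));      // true
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).equals(original)); // false
    }
}
```

Decoding with the wrong charset does not throw; it silently produces garbled text, which is why naming the charset explicitly is the safer habit when scraping.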

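4. About the ByteArrayOutputStream class

It is a subclass of OutputStream. Its constructors are ByteArrayOutputStream() and ByteArrayOutputStream(int size). Data written to it accumulates in an in-memory byte array that grows as needed, and toByteArray() or toString() then returns the data in the form you want. (The constructor ByteArrayInputStream(byte[] buf) belongs to a different class that does the reverse: it exposes an existing byte array buf as an input stream.) The readInputStream method in method 3 could be changed to return a String by replacing the final outStream.toByteArray() with outStream.toString(), which shortens the code further. Note that the no-argument toString() decodes with the platform default charset, whereas toString("utf-8") states the encoding explicitly.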
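As a small illustration of the class, the sketch below writes several chunks into a ByteArrayOutputStream and retrieves the result both ways. The class ByteArrayOutputStreamDemo and the method collect are hypothetical names; the chunks stand in for the buffers that readInputStream fills from the network.

```java
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

class ByteArrayOutputStreamDemo {

    // Append each chunk to the in-memory stream and return everything written so far.
    static ByteArrayOutputStream collect(byte[]... chunks) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] chunk : chunks) {
            out.write(chunk, 0, chunk.length);
        }
        return out;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        ByteArrayOutputStream out = collect(
                "<html>".getBytes(StandardCharsets.UTF_8),
                "</html>".getBytes(StandardCharsets.UTF_8));

        byte[] raw = out.toByteArray();        // a copy of the accumulated bytes
        String text = out.toString("UTF-8");   // the same bytes decoded with a named charset

        System.out.println(raw.length); // 13
        System.out.println(text);       // <html></html>
    }
}
```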

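5. About the InputStream class

InputStream and OutputStream are the base classes for 8-bit byte input and output streams. They are used mainly for binary data, which they process byte by byte. Files on disk and data in transit are stored and sent as bytes, images included. The other byte-stream classes extend these two, for example the ByteArrayOutputStream class discussed above.

Because InputStream.read() reads only a single byte from the stream per call, it is very inefficient. InputStream.read(byte[] b) and InputStream.read(byte[] b, int off, int len) can read many bytes per call and are more efficient, which is why method 3 creates a byte array so that more bytes are read at a time. Each of these read methods returns -1 once the end of the stream has been reached.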
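The difference in the number of calls can be seen with a small sketch. The class ReadLoopDemo and both counting helpers are hypothetical names, and a ByteArrayInputStream is used so that the result is deterministic; a network stream may hand back fewer bytes per call than the buffer can hold.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadLoopDemo {

    // Count the read() calls needed to drain the stream one byte at a time.
    static int countSingleByteReads(InputStream in) throws IOException {
        int calls = 0;
        while (in.read() != -1) { // returns the byte as 0..255, or -1 at end of stream
            calls++;
        }
        return calls;
    }

    // Count the read(byte[]) calls that returned data when draining with a buffer.
    static int countBufferedReads(InputStream in, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        int calls = 0;
        while (in.read(buffer) != -1) { // returns the number of bytes read, or -1 at end of stream
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[2500]; // stand-in for a downloaded page

        System.out.println(countSingleByteReads(new ByteArrayInputStream(data)));     // 2500
        System.out.println(countBufferedReads(new ByteArrayInputStream(data), 1024)); // 3
    }
}
```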

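The base classes for character input and output streams are Reader and Writer; their subclasses include the InputStreamReader and BufferedReader classes discussed above. Keep in mind that a Java char occupies 16 bits (2 bytes) in memory, while the number of bytes a character takes once encoded depends on the charset: a Chinese character takes 2 bytes in GBK but usually 3 bytes in UTF-8.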
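The relationship between characters and bytes can be checked directly. In this sketch the class CharBytesDemo and the method encodedLength are hypothetical names; UTF-16BE is used for the 2-byte case because it is one of the charsets every Java platform is required to support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharBytesDemo {

    // Number of bytes the text occupies once encoded with the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String zh = "\u4e2d"; // one Chinese character, written as a Unicode escape

        System.out.println(zh.length());                                    // 1 char
        System.out.println(encodedLength(zh, StandardCharsets.UTF_16BE));   // 2 bytes
        System.out.println(encodedLength(zh, StandardCharsets.UTF_8));      // 3 bytes
        System.out.println(encodedLength("a", StandardCharsets.UTF_8));     // 1 byte
    }
}
```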

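These notes all turned out to be about Java I/O streams. Perhaps the title should change, but on reflection I will leave it, since I/O is such an important part of a scraper. Java provides a rich class library for I/O streams, so I will keep learning and accumulating as I go.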
