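A while ago, while learning Lucene, I ran into encoding errors when reading txt documents. I looked at several solutions. Most of them inspect the file as hexadecimal (you can view it with Ctrl+H in UltraEdit) and read the first four flag bytes at the start of the file, i.e. the byte order mark, to decide the encoding. But there were always some text files this could not identify (in my case, some UTF-8 encoded files, most likely ones saved without a byte order mark). Later I found JCharDet. JCharDet is a Java implementation of the charset detection algorithm from Mozilla (the Firefox people). Anyway, the official site explains it better than I can, so go read it there.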
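For reference, the flag-byte approach described above can be sketched roughly like this. It is only an illustration, not code from any of the solutions I tried: the class name BomSniffer and its methods are made up for this post, and it simply returns null when there is no byte order mark, which is exactly the case where this approach gives up.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not part of JCharDet): guesses an encoding from a byte order mark.
class BomSniffer {

    /** True if the first len bytes of head begin with the given BOM byte values. */
    private static boolean startsWith(byte[] head, int len, int... bom) {
        if (len < bom.length) return false;
        for (int i = 0; i < bom.length; i++) {
            if ((head[i] & 0xFF) != bom[i]) return false;
        }
        return true;
    }

    /** Returns the charset name implied by a leading BOM, or null if there is none. */
    static String detect(byte[] head, int len) {
        // 4-byte marks go first: the UTF-32LE mark begins with the UTF-16LE mark.
        if (startsWith(head, len, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, len, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, len, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, len, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, len, 0xFF, 0xFE)) return "UTF-16LE";
        return null; // no BOM: plain ASCII, BOM-less UTF-8, GBK, ... cannot be told apart here
    }

    /** Reads up to four bytes from the stream and checks them for a BOM. */
    static String detect(InputStream in) throws IOException {
        byte[] head = new byte[4];
        int len = 0;
        int n;
        while (len < head.length && (n = in.read(head, len, head.length - len)) != -1) {
            len += n;
        }
        return detect(head, len);
    }
}
```

A file that starts with EF BB BF comes back as "UTF-8", while a UTF-8 file saved without the mark comes back as null, which is why a statistical detector is still needed.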
Here's the code using JCharDet:
package com.zhyea.util;

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsICharsetDetectionObserver;

/**
 * Detects a file's character set with the help of JCharDet.
 * Note: the result is kept in static fields, so this class is not thread-safe.
 *
 * @author robin
 */
public class FileCharsetDetector {

    /** Name of the detected character set. */
    private static String encoding;

    /** Whether a character set has been detected. */
    private static boolean found;

    private static nsICharsetDetectionObserver observer;

    /**
     * Language hints passed to the detector.
     *
     * @author robin
     */
    enum Language {
        Japanese(1), Chinese(2), SimplifiedChinese(3), TraditionalChinese(4), Korean(5), DontKnow(6);

        private int hint;

        Language(int hint) {
            this.hint = hint;
        }

        public int getHint() {
            return this.hint;
        }
    }

    /**
     * Checks the encoding of a file.
     *
     * @param file the File instance to check
     * @return the file's encoding, or null if none was found
     * @throws FileNotFoundException
     * @throws IOException
     */
    public static String checkEncoding(File file) throws FileNotFoundException, IOException {
        return checkEncoding(file, getNsdetector());
    }

    /**
     * Checks the encoding of a file, using a language hint.
     *
     * @param file the File instance to check
     * @param lang the language hint
     * @return the file's encoding
     * @throws FileNotFoundException
     * @throws IOException
     */
    public static String checkEncoding(File file, Language lang) throws FileNotFoundException, IOException {
        return checkEncoding(file, new nsDetector(lang.getHint()));
    }

    /**
     * Checks the encoding of the file at the given path.
     *
     * @param path the file path
     * @return the file's encoding, e.g. UTF-8, GBK or GB2312, or null if none was found
     * @throws FileNotFoundException
     * @throws IOException
     */
    public static String checkEncoding(String path) throws FileNotFoundException, IOException {
        return checkEncoding(new File(path));
    }

    /**
     * Checks the encoding of the file at the given path, using a language hint.
     *
     * @param path the file path
     * @param lang the language hint
     * @return the file's encoding
     * @throws FileNotFoundException
     * @throws IOException
     */
    public static String checkEncoding(String path, Language lang) throws FileNotFoundException, IOException {
        return checkEncoding(new File(path), lang);
    }

    /**
     * Runs the given detector over the file.
     *
     * @param file the file to check
     * @param detector the detector to use
     * @return the file's encoding, or null if none was found
     * @throws FileNotFoundException
     * @throws IOException
     */
    private static String checkEncoding(File file, nsDetector detector) throws FileNotFoundException, IOException {
        // Clear the result of any previous call; otherwise a stale value would be returned
        found = false;
        encoding = null;
        detector.Init(getCharsetDetectionObserver());
        if (isAscii(file, detector)) {
            encoding = "ASCII";
            found = true;
        }
        if (!found) {
            // The detector was not certain, so fall back to its most probable guess
            String prob[] = detector.getProbableCharsets();
            if (prob.length > 0) {
                encoding = prob[0];
            } else {
                return null;
            }
        }
        return encoding;
    }

    /**
     * Checks whether the file is pure ASCII, feeding any non-ASCII data to the detector.
     *
     * @param file the file to check
     * @param detector the detector to use
     * @return true if every byte in the file is ASCII
     * @throws IOException
     */
    private static boolean isAscii(File file, nsDetector detector) throws IOException {
        BufferedInputStream input = null;
        try {
            input = new BufferedInputStream(new FileInputStream(file));
            byte[] buffer = new byte[1024];
            int hasRead;
            boolean done = false;
            boolean isAscii = true;
            while ((hasRead = input.read(buffer)) != -1) {
                // Still ASCII only while every chunk read so far is pure ASCII
                if (isAscii) isAscii = detector.isAscii(buffer, hasRead);
                // Keep feeding non-ASCII data until the detector reports it is done
                if (!isAscii && !done) done = detector.DoIt(buffer, hasRead, false);
            }
            return isAscii;
        } finally {
            detector.DataEnd();
            if (null != input) input.close();
        }
    }

    /**
     * Creates an nsDetector. A fresh instance is created for every call,
     * so that state left over from a previous file cannot affect the next one.
     *
     * @return a new detector
     */
    private static nsDetector getNsdetector() {
        return new nsDetector();
    }

    /**
     * Lazily creates the shared nsICharsetDetectionObserver.
     *
     * @return the observer
     */
    private static nsICharsetDetectionObserver getCharsetDetectionObserver() {
        if (null == observer) {
            observer = new nsICharsetDetectionObserver() {
                public void Notify(String charset) {
                    found = true;
                    encoding = charset;
                }
            };
        }
        return observer;
    }
}

There is still one problem with this: for files saved as "Unicode" (presumably UTF-16, which is what Windows editors call Unicode), it returns windows-1252, and when I then use windows-1252 as the encoding I get errors.
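Since the detector's answer is only a guess, it may be worth validating the returned name before handing it to a reader. Below is a minimal sketch of that idea. The class CharsetResolver and its resolve method are hypothetical names of my own, not part of JCharDet, and the fallback charset is whatever the caller considers a sensible default. It does not fix the windows-1252 misdetection itself; it only guards against null or unusable names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of JCharDet): turns a detected name into a usable Charset.
class CharsetResolver {

    /**
     * Returns the Charset for the detected name, or the fallback when the name
     * is null, blank, malformed, or not supported by this JVM.
     */
    static Charset resolve(String detected, Charset fallback) {
        if (detected == null || detected.trim().isEmpty()) {
            return fallback;
        }
        try {
            // isSupported returns false for well-formed names the JVM does not know
            return Charset.isSupported(detected) ? Charset.forName(detected) : fallback;
        } catch (IllegalArgumentException e) {
            // Thrown (as IllegalCharsetNameException) when the name contains illegal characters
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Assumed usage: the first argument stands in for a checkEncoding(...) result
        String detected = args.length > 0 ? args[0] : null;
        System.out.println(resolve(detected, StandardCharsets.UTF_8));
    }
}
```

The resulting Charset can then be passed to something like new InputStreamReader(stream, charset) instead of the raw string.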
By the way, here is another place to download the jar, since the official site acts up from time to time and cannot be reached.
Download: http://download.csdn.net/detail/tianxiexingyun/8286849
That's all.