Java: Check for a BOM When Reading UTF-8 Files

53873039oycg

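The title already tells you what this post is about, so here is the conclusion first: when reading a UTF-8 file, remember that the file may start with a BOM. That is the whole conclusion and you can skip the rest; I am only writing down how I tracked the problem down and fixed it.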

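I had long known that a UTF-8 file may contain a BOM, but I had never actually run into one until today, when I noticed during testing that a string read from a file had the wrong length.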

With the length off, I began to suspect that I had finally met the legendary BOM, so I set out to verify it. Here is the code:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class UTFBOM文件处理 {
	public static void main(String[] args) throws Exception {
		List<String> resultList = readFileByLine("f:/saveFile/temp/name2.txt");
		// Print the first line together with its length.
		String tmpStr = resultList.get(0);
		System.out.println(tmpStr + "----len=" + tmpStr.length());
		// Print the first character together with its value in hex.
		String tmpStr2 = tmpStr.substring(0, 1);
		System.out.println(tmpStr2 + "----hex=" + strtoHex(tmpStr2));
	}

	// Concatenates the hex value of every char in s, prefixed with "0x".
	public static String strtoHex(String s) {
		StringBuilder str = new StringBuilder("0x"); // 0x denotes hexadecimal
		for (int i = 0; i < s.length(); i++) {
			str.append(Integer.toHexString(s.charAt(i)));
		}
		return str.toString();
	}

	// Reads the file as UTF-8, stopping at end of file or at the first blank line.
	public static List<String> readFileByLine(String filename) throws Exception {
		List<String> nameList = new ArrayList<String>();
		File file = new File(filename);
		// try-with-resources closes the reader even if reading fails.
		try (BufferedReader reader = new BufferedReader(new InputStreamReader(
				new FileInputStream(file), "UTF-8"))) {
			String tmp = reader.readLine();
			while (tmp != null && tmp.trim().length() > 0) {
				nameList.add(tmp);
				tmp = reader.readLine();
			}
		}
		return nameList;
	}
}

The output is:

掬水月在手----len=6
----hex=0xfeff

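And there it is, the all-too-familiar feff. The line has five visible characters, so the sixth one counted by length() is the invisible U+FEFF at the start.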
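The same effect can be reproduced without any file on disk. Below is a minimal sketch, not the code from my test: the class name BomLengthDemo and the method decodeWithBom are made up for illustration, and the sample text "abcde" stands in for the original line. It prepends the three UTF-8 BOM bytes to the encoded text and decodes the result with Java's standard UTF-8 charset, which keeps the BOM as a leading U+FEFF character.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original test code.
class BomLengthDemo {
    // Builds "EF BB BF" + UTF-8 bytes of text, then decodes the whole array as UTF-8.
    static String decodeWithBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] bytes = new byte[body.length + 3];
        bytes[0] = (byte) 0xEF;
        bytes[1] = (byte) 0xBB;
        bytes[2] = (byte) 0xBF;
        System.arraycopy(body, 0, bytes, 3, body.length);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = decodeWithBom("abcde");
        // Five visible characters plus the decoded BOM.
        System.out.println("len=" + s.length());
        System.out.println("first=0x" + Integer.toHexString(s.charAt(0)));
    }
}
```

Running it should report a length of 6 and a first character of 0xfeff, matching what I saw with the real file.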

Opening the file in a hex viewer shows the BOM at the very beginning.

For an explanation of what a BOM is, see:

http://mindprod.com/jgloss/bom.html

For the byte sequence ef bb bf, that page says:

UTF-8 endian, strictly speaking does not apply, though it uses big-endian most-significant-bytes first representation.
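If no hex viewer is at hand, the first three bytes can also be checked from Java. This is only a sketch under my own naming: BomCheck, isUtf8Bom and hasUtf8Bom are hypothetical names, and the file path is expected as the first command-line argument.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper class for illustration only.
class BomCheck {
    // True if the first n valid bytes of head start with EF BB BF.
    static boolean isUtf8Bom(byte[] head, int n) {
        return n >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
    }

    // Reads up to three bytes from the start of the file and tests them.
    static boolean hasUtf8Bom(String filename) throws IOException {
        try (InputStream in = new FileInputStream(filename)) {
            byte[] head = new byte[3];
            int n = 0;
            while (n < 3) {
                int r = in.read(head, n, 3 - n);
                if (r < 0) {
                    break; // file is shorter than three bytes
                }
                n += r;
            }
            return isUtf8Bom(head, n);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(hasUtf8Bom(args[0]) ? "BOM found" : "no BOM");
    }
}
```

The bytes are masked with 0xFF because Java's byte type is signed, so 0xEF would otherwise compare as a negative number.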

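Solution: wrap the input stream in the UnicodeReader class from the page below, which detects the BOM and skips it: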

http://koti.mbnet.fi/akini/java/unicodereader/

	// Same as readFileByLine, but UnicodeReader (from the page above) consumes the BOM.
	// Additionally needs: import java.io.InputStream;
	public static List<String> readFileByLineWithOutBom(String filename)
			throws Exception {
		List<String> nameList = new ArrayList<String>();
		File file = new File(filename);
		InputStream in = new FileInputStream(file);
		// try-with-resources closes the reader and the underlying stream.
		try (BufferedReader reader = new BufferedReader(new UnicodeReader(in,
				"utf-8"))) {
			String tmp = reader.readLine();
			while (tmp != null && tmp.trim().length() > 0) {
				nameList.add(tmp);
				tmp = reader.readLine();
			}
		}
		return nameList;
	}
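
If pulling in an extra class is not an option, a simpler alternative is to drop a leading U+FEFF from the first line after decoding. The sketch below is not the UnicodeReader approach and only handles a UTF-8 BOM that has already been decoded to U+FEFF; the names BomStripper, stripBom and readLines are hypothetical. Unlike my method above, it reads every line and does not stop at the first blank one.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration only.
class BomStripper {
    // Removes a single leading U+FEFF, if present; returns null for null input.
    static String stripBom(String line) {
        if (line != null && !line.isEmpty() && line.charAt(0) == '\uFEFF') {
            return line.substring(1);
        }
        return line;
    }

    // Reads all lines, stripping the BOM from the first line only.
    static List<String> readLines(Reader source) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line = stripBom(reader.readLine());
            while (line != null) {
                lines.add(line);
                line = reader.readLine();
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for an InputStreamReader over a real file.
        List<String> lines = readLines(new StringReader("\uFEFFabcde\nsecond"));
        System.out.println(lines.get(0) + "----len=" + lines.get(0).length());
    }
}
```

For a real file, the Reader passed in would be an InputStreamReader over a FileInputStream with the "UTF-8" charset, as in readFileByLine.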

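So how did the file end up with a BOM?

Saving as UTF-8 with the system's default "Save As" writes a BOM. I had always used Notepad++ for "Save As" before, but today I slipped up and forgot.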

The end.