Processing UTF-8 characters in a text file in Java
I have a text file that contains the following sample UTF-8 text:
ኣእምሮኣዊ/ADJ ጥዕና/N ።/PUN
ቅድሚ/PRE ብዙሕ/ADJ ዓመታት/N “/PUN ኣእምሮኣዊ/ADJ ስንክልና/N ብጋኔን/N ወይ/CON እከይ/ADJ መናፍስቲ/N ኢዩ/V_AUX ዝመጽእ/V_REL “/PUN ዝብል/V_REL ግጉይ/ADJ ኣመለኻኽታ/N ነይሩ/V_GER ።/PUN
ከም/CON ውጺኢቱ/N ድማ/CON ኣእምሮኣዊ/ADJ ስንክልና/N ዘጋጠሞም/ADJ ኣባላት/N ናይ/PRE ሓደ/NUM ሕብረተ-ሰብ/N ብኣሰቃቕን/ADJ ኢሰብኣውን/ADJ ኣገባብ/N ይተሓዙ/V_IMF ነይሮም/V_AUX ።/PUN
I am using the LingPipe implementation of an HMM part-of-speech tagger for the Brown Corpus.
The BrownPosCorpus class reads the zipped POS corpus as follows:
public class BrownPosCorpus implements PosCorpus {

    private final File mBrownZipFile;

    public BrownPosCorpus(File brownZipFile) {
        mBrownZipFile = brownZipFile;
    }

    public Parser<ObjectHandler<Tagging<String>>> parser() {
        return new BrownPosParser();
    }

    public Iterator<InputSource> sourceIterator() throws IOException {
        return new BrownSourceIterator(mBrownZipFile);
    }

    static class BrownSourceIterator extends Iterators.Buffered<InputSource> {
        private ZipInputStream mZipIn = null;

        public BrownSourceIterator(File brownZipFile) throws IOException {
            FileInputStream fileIn = new FileInputStream(brownZipFile);
            mZipIn = new ZipInputStream(fileIn);
        }

        public InputSource bufferNext() {
            ZipEntry entry = null;
            try {
                while ((entry = mZipIn.getNextEntry()) != null) {
                    if (entry.isDirectory()) continue;
                    String name = entry.getName();
                    if (name.equals("brown/CONTENTS")
                        || name.equals("brown/README")) continue;
                    return new InputSource(mZipIn);
                }
            } catch (IOException e) {
                // ignore and close and return null
            }
            Streams.closeQuietly(mZipIn);
            return null;
        }
    }
}
The BrownPosParser class (BrownPosParser.java) parses the zipped Brown POS corpus as follows:
public class BrownPosParser
    extends StringParser<ObjectHandler<Tagging<String>>> {

    @Override
    public void parseString(char[] cs, int start, int end) {
        String in = new String(cs,start,end-start);
        String[] sentences = in.split("\n");
        for (int i = 0; i < sentences.length; ++i)
            if (!Strings.allWhitespace(sentences[i]))
                processSentence(sentences[i]);
    }

    public String normalizeTag(String rawTag) {
        String tag = rawTag;
        String startTag = tag;
        // remove plus, default to first
        int splitIndex = tag.indexOf('+');
        if (splitIndex >= 0)
            tag = tag.substring(0,splitIndex);

        int lastHyphen = tag.lastIndexOf('-');
        if (lastHyphen >= 0) {
            String first = tag.substring(0,lastHyphen);
            String suffix = tag.substring(lastHyphen+1);
            if (suffix.equalsIgnoreCase("HL")
                || suffix.equalsIgnoreCase("TL")
                || suffix.equalsIgnoreCase("NC")) {
                tag = first;
            }
        }

        int firstHyphen = tag.indexOf('-');
        if (firstHyphen > 0) {
            String prefix = tag.substring(0,firstHyphen);
            String rest = tag.substring(firstHyphen+1);
            if (prefix.equalsIgnoreCase("FW")
                || prefix.equalsIgnoreCase("NC")
                || prefix.equalsIgnoreCase("NP"))
                tag = rest;
        }

        // neg last, and only if not whole thing
        int negIndex = tag.indexOf('*');
        if (negIndex > 0) {
            if (negIndex == tag.length()-1)
                tag = tag.substring(0,negIndex);
            else
                tag = tag.substring(0,negIndex)
                    + tag.substring(negIndex+1);
        }
        // multiple runs to normalize
        return tag.equals(startTag) ? tag : normalizeTag(tag);
    }

    private void processSentence(String sentence) {
        String[] tagTokenPairs = sentence.split(" ");
        List<String> tokenList = new ArrayList<String>(tagTokenPairs.length);
        List<String> tagList = new ArrayList<String>(tagTokenPairs.length);

        for (String pair : tagTokenPairs) {
            int j = pair.lastIndexOf('/');
            String token = pair.substring(0,j);
            String tag = normalizeTag(pair.substring(j+1));
            tokenList.add(token);
            tagList.add(tag);
        }
        Tagging<String> tagging
            = new Tagging<String>(tokenList,tagList);
        getHandler().handle(tagging);
    }
}
Parsing the UTF-8 corpus fails with the following error, which is raised in BrownPosParser.java:
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
[java] at java.lang.String.substring(String.java:1967)
[java] at BrownPosParser.processSentence(BrownPosParser.java:72)
The full stack trace is as follows:
C:\Lingpipe-Ver-4.1.2\Experiments\NER\posTags>ant eval-brown
Buildfile: C:\Lingpipe-Ver-4.1.2\Experiments\NER\posTags\build.xml
compile:
[javac] Compiling 11 source files to C:\Lingpipe-Ver-4.1.2\Experiments\NER\posTags\build\classes
eval-brown:
[java] COMMAND PARAMETERS:
[java] Sent eval rate=5
[java] Toks before eval=1000000
[java] Max n-best eval=32
[java] Max n-gram=8
[java] Num chars=128
[java] Lambda factor=8.0
[java] Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
[java] at java.lang.String.substring(String.java:1967)
[java] at BrownPosParser.processSentence(BrownPosParser.java:72)
[java] at BrownPosParser.parseString(BrownPosParser.java:20)
[java] at com.aliasi.corpus.StringParser.parse(StringParser.java:71)
[java] at EvaluatePos.parseCorpus(EvaluatePos.java:123)
[java] at EvaluatePos.run(EvaluatePos.java:75)
[java] at EvaluatePos.main(EvaluatePos.java:183)
[java] Java Result: 1
Which part of the code should I modify so that the UTF-8 POS corpus is parsed correctly?
Any help is much appreciated.
I'm not sure this will solve your problem, but to set the charset, change this line:
mZipIn = new ZipInputStream(fileIn);
to:
mZipIn = new ZipInputStream(new BufferedInputStream(fileIn), Charset.forName("UTF-8"));
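One caveat worth checking: in java.util.zip, the Charset argument of ZipInputStream is used to decode entry names and comments, not the bytes inside each entry. To decode the entry content as UTF-8, the stream usually has to be wrapped in a reader with an explicit charset (with a SAX InputSource, calling setEncoding("UTF-8") is the analogous step, though whether the LingPipe parser honors it is an assumption to verify). A minimal sketch of the reader approach follows; Utf8ZipReader and readLines are hypothetical names used only for illustration and are not part of LingPipe.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration; not part of LingPipe.
class Utf8ZipReader {

    // Reads every non-directory entry of a zip stream and returns its
    // content as text lines decoded explicitly as UTF-8.
    static List<String> readLines(InputStream rawZip) throws IOException {
        List<String> lines = new ArrayList<>();
        // The charset passed here only affects entry names and comments.
        try (ZipInputStream zipIn =
                 new ZipInputStream(rawZip, StandardCharsets.UTF_8)) {
            ZipEntry entry;
            while ((entry = zipIn.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;
                // The charset passed here decodes the entry's content.
                BufferedReader reader = new BufferedReader(
                    new InputStreamReader(zipIn, StandardCharsets.UTF_8));
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
                // The reader is deliberately not closed: closing it would
                // also close zipIn before the remaining entries are read.
            }
        }
        return lines;
    }
}
```

Without an explicit charset, the content is decoded with whatever default the consuming code picks, which on Windows is often not UTF-8.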
Find and remove consecutive spaces and any space at the beginning or end of a line, and check that every token in the corpus has a /.
That worked.
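The reason cleaning the corpus helps: sentence.split(" ") produces an empty string for a leading space or for each extra space in a run, lastIndexOf('/') returns -1 for an empty string or for a token without a slash, and substring(0, -1) then throws the StringIndexOutOfBoundsException shown in the trace. As an alternative to editing the corpus, the split itself can be made tolerant. The following is only a sketch of that idea; TokenTagSplitter and split are hypothetical names, and silently skipping malformed chunks is an assumption about the desired behavior (logging or failing with a clearer message may suit a real corpus better).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of LingPipe.
class TokenTagSplitter {

    // Splits one corpus line into {token, tag} pairs.
    // Chunks with no '/', an empty token, or an empty tag are skipped.
    static List<String[]> split(String sentence) {
        List<String[]> pairs = new ArrayList<>();
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) return pairs;
        // "\\s+" treats any run of spaces or tabs as a single separator.
        for (String chunk : trimmed.split("\\s+")) {
            int j = chunk.lastIndexOf('/');
            if (j <= 0 || j == chunk.length() - 1) continue; // malformed chunk
            pairs.add(new String[] {
                chunk.substring(0, j),   // token
                chunk.substring(j + 1)   // raw tag
            });
        }
        return pairs;
    }
}
```

In processSentence, the loop over tagTokenPairs could then iterate over these pairs and pass the raw tag to normalizeTag as before.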