Stop word removal went wrong
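For an information retrieval (IR) task, I want to extract some text snippets, and I would like to remove the stop words before analyzing them. To do that, I created a txt file of stop words and then tried to remove those useless words with the following code: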
private static void stopWordRemowal() throws FileNotFoundException, IOException {
    Set<String> stopWords = new LinkedHashSet<String>();
    BufferedReader br = new BufferedReader(new FileReader("StopWord.txt"));
    for (String line; (line = br.readLine()) != null;)
        stopWords.add(line.trim());
    BufferedReader br2 = new BufferedReader(new FileReader("text"));
    FileOutputStream theNewWords = new FileOutputStream(temp);
    for (String readReady; (readReady = br2.readLine()) != null;)
    {
        StringTokenizer tokenizer = new StringTokenizer(readReady);
        String temp = tokenizer.nextToken();
        if (!stopWords.equals(temp))
        {
            theNewWords.write(temp.getBytes());
            theNewWords.write(System.getProperty("line.separator").getBytes());
        }
    }
}
But it does not actually work. Consider the following sample text:
Text summarization is the process of extracting salient information from the source text and to present that
information to the user in the form of summary
The output is:
Text
summarization
is
the
process
of
extracting
salient
information
from
the
source
text
and
to
present
that
information
to
the
user
in
the
form
of
summary
It has almost no effect, and I don't know why.
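You should use the Set's contains method rather than equals. A Set is never equal to a String, so !stopWords.equals(temp) is always true and every word gets written out. Use this:
if(!stopWords.contains(temp)) // does the set contain my string temp?
instead of
if(!stopWords.equals(temp)) // is the set equal to a string? never true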
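To make the difference concrete, here is a minimal sketch of the filtering step using contains. The class name StopWordFilter, the method removeStopWords, and the hard-coded stop word list are hypothetical and only for illustration; the original reads both the stop words and the text from files. The sketch also assumes the stop words are stored in lower case and lower-cases each token before the lookup, which the original code does not do, and it loops over every token on the line with hasMoreTokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.StringTokenizer;

class StopWordFilter {

    // Returns the tokens of `text` that are not in `stopWords`, in their original order.
    // Assumption (not in the original): stop words are lower case, so each token is
    // lower-cased before the lookup to make the match case-insensitive.
    static List<String> removeStopWords(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {          // visit every token, not just the first
            String token = tokenizer.nextToken();
            // contains() asks "is this word in the set?"; equals() would compare
            // the whole set against a single String and always return false.
            if (!stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical stop word list standing in for StopWord.txt.
        Set<String> stopWords = new HashSet<>(
                Arrays.asList("is", "the", "of", "from", "and", "to", "that", "in"));
        String text = "Text summarization is the process of extracting salient information";
        // prints [Text, summarization, process, extracting, salient, information]
        System.out.println(removeStopWords(text, stopWords));
    }
}
```

A HashSet is used here because only membership lookups are needed; the LinkedHashSet from the question would work the same way if insertion order matters elsewhere.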