Remove stop words from file - going over it multiple times causes content duplication and does not remove the words
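I am trying to go over a bunch of files, read each of them, and remove all stop words from a specified list. The result is a disaster: the content of the whole file is copied over and over.
What I tried:
- Saving the file as a String and trying to search it with a regex
- Saving the file as a String, going over it line by line and comparing the tokens to the stop words stored in a LinkedHashSet (I can also store them in a file)
- Twisting the logic below in multiple ways, getting more and more ridiculous output
- Looking into the text/line with the .contains() method, but no luck
My general logic is as follows: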
    for every word in the stopwords set:
        while (file has more lines):
            save current line into String
            while (current line has more tokens):
                assign current token into String
                compare token with current stopword:
                    if (token equals stopword):
                        write in the output file "" + " "
                    else: write in the output file the token as is
I tried what's in this question and many other SO questions, but could not achieve what I need.
Real code below:
    private static void removeStopWords(File fileIn) throws IOException {
        File stopWordsTXT = new File("stopwords.txt");
        System.out.println("[Removing StopWords...] FILE: " + fileIn.getName() + "\n");
        // create file reader and go over it to save the stopwords into the Set data structure
        BufferedReader readerSW = new BufferedReader(new FileReader(stopWordsTXT));
        Set<String> stopWords = new LinkedHashSet<String>();
        for (String line; (line = readerSW.readLine()) != null; readerSW.readLine()) {
            // trim() eliminates leading and trailing spaces
            stopWords.add(line.trim());
        }
        File outp = new File(fileIn.getPath().substring(0, fileIn.getPath().lastIndexOf('.')) + "_NoStopWords.txt");
        FileWriter fOut = new FileWriter(outp);
        Scanner readerTxt = new Scanner(new FileInputStream(fileIn), "UTF-8");
        while(readerTxt.hasNextLine()) {
            String line = readerTxt.nextLine();
            System.out.println(line);
            Scanner lineReader = new Scanner(line);
            for (String curSW : stopWords) {
                while(lineReader.hasNext()) {
                    String token = lineReader.next();
                    if(token.equals(curSW)) {
                        System.out.println("---> Removing SW: " + curSW);
                        fOut.write("" + " ");
                    } else {
                        fOut.write(token + " ");
                    }
                }
            }
            fOut.write("\n");
        }
        fOut.close();
    }
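What happens most often is that it looks for the first word from the stopWords set and that's it. The output contains all the other words, even when I manage to remove the first one. And the first one shows up again in the next appended output at the end.
Part of my stop word list: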
about
above
after
again
against
all
am
and
any
are
as
at
By tokens I mean words, i.e. getting every word from the line and comparing it to the current stop word.
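After a while of debugging I believe I have found your solution. This problem is tricky because you have to use several different scanners, file readers and so on. Here is what I did:
I changed the way you add to your stop word set, because it was not adding the words correctly. I used a buffered reader to read each line, then a scanner to read each word, and then added it to the set.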
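One likely reason the set ended up incomplete: in the question's `for` loop the update clause calls `readerSW.readLine()` a second time and throws the result away, so every other line of the stop word file is skipped. The sketch below reproduces that next to a loop that reads each line once. The class and method names (`StopWordLoading`, `loadSkipping`, `loadAll`) are made up for illustration, and a `StringReader` stands in for the real file so the example is self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;

final class StopWordLoading {

    // Mirrors the loop from the question: the update clause reads (and drops) a line.
    static Set<String> loadSkipping(BufferedReader reader) throws IOException {
        Set<String> words = new LinkedHashSet<>();
        for (String line; (line = reader.readLine()) != null; reader.readLine()) {
            words.add(line.trim());
        }
        return words;
    }

    // Reads every line exactly once and ignores blank lines.
    static Set<String> loadAll(BufferedReader reader) throws IOException {
        Set<String> words = new LinkedHashSet<>();
        for (String line; (line = reader.readLine()) != null; ) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        String sample = "about\nabove\nafter\nagain\n";
        // prints [about, after]
        System.out.println(loadSkipping(new BufferedReader(new StringReader(sample))));
        // prints [about, above, after, again]
        System.out.println(loadAll(new BufferedReader(new StringReader(sample))));
    }
}
```

This version assumes one stop word per line, as in the list from the question. If a line can hold several words, splitting it with a Scanner as described above covers that case.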
Then, where you compare them, I got rid of one of your loops, because you can simply use the .contains() method to check whether a word is a stop word.
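To see why dropping that loop matters: a Scanner can only be read once, so in the question's code the inner `while` consumes the whole line for the first stop word and has nothing left for the others. A single pass with a set lookup avoids this. Below is a minimal sketch, where `LineFilter` and `filterLine` are hypothetical names and not part of the original code.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Scanner;
import java.util.Set;

final class LineFilter {

    // Returns the line with every token that appears in stopWords dropped.
    // Matching is exact and case-sensitive, like Set.contains on Strings.
    static String filterLine(String line, Set<String> stopWords) {
        StringBuilder kept = new StringBuilder();
        try (Scanner tokens = new Scanner(line)) {
            while (tokens.hasNext()) {
                String token = tokens.next();
                if (!stopWords.contains(token)) {
                    if (kept.length() > 0) {
                        kept.append(' ');
                    }
                    kept.append(token);
                }
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        Set<String> stopWords = new LinkedHashSet<>(Arrays.asList("about", "all", "and", "at"));
        // prints: cats dogs sat home
        System.out.println(filterLine("all cats and dogs sat at home", stopWords));
    }
}
```

Because whole tokens are compared, "sat" survives even though it contains "at". Tokens with attached punctuation such as "and," would not match either, so you may want to normalize tokens first if your input has punctuation.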
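I left the part that writes to the file to remove the stop words for you, as I'm sure you can figure that out now that everything else is working.
- My sample stop words txt file:
Stop words
Words
- My sample input file is exactly the same, so it should catch all three words.
Code: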
    // Requires java.io.BufferedReader, java.io.FileReader, java.util.LinkedHashSet,
    // java.util.Scanner and java.util.Set; the enclosing method must declare IOException.

    // Read the stop words file and save every word into the Set
    BufferedReader readerSW = new BufferedReader(new FileReader("stopWords.txt"));
    Set<String> stopWords = new LinkedHashSet<String>();
    String stopWordsLine = readerSW.readLine();
    while (stopWordsLine != null) {
        // The Scanner splits the line on whitespace, so blank lines add nothing
        Scanner words = new Scanner(stopWordsLine);
        while (words.hasNext()) {
            stopWords.add(words.next()); // Add the stop word to the set
        }
        words.close();
        stopWordsLine = readerSW.readLine();
    }
    readerSW.close();

    // Read the input file and report every token that is a stop word
    BufferedReader input = new BufferedReader(new FileReader("Words.txt"));
    String line = input.readLine();
    while (line != null) {
        Scanner lineReader = new Scanner(line);
        while (lineReader.hasNext()) {
            String token = lineReader.next();
            if (stopWords.contains(token)) {
                System.out.println("removing " + token);
            }
        }
        lineReader.close();
        line = input.readLine();
    }
    input.close();
Output:
removing Stop
removing words
removing Words
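For the writing step that was left open, one possible shape is sketched below. It is only a sketch under a few assumptions: `StopWordWriter` and its `removeStopWords` signature are hypothetical, tokens are split on whitespace, and kept tokens are joined with single spaces. It works on a reader and a writer so it can be tried without touching the file system.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class StopWordWriter {

    // Copies text from in to out line by line, dropping tokens found in stopWords.
    static void removeStopWords(BufferedReader in, Set<String> stopWords, Writer out) throws IOException {
        for (String line; (line = in.readLine()) != null; ) {
            StringBuilder kept = new StringBuilder();
            for (String token : line.trim().split("\\s+")) {
                // split() yields one empty token for a blank line, so skip it
                if (!token.isEmpty() && !stopWords.contains(token)) {
                    if (kept.length() > 0) {
                        kept.append(' ');
                    }
                    kept.append(token);
                }
            }
            out.write(kept.toString());
            out.write('\n');
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        Set<String> stopWords = new HashSet<>(Arrays.asList("Stop", "words", "Words"));
        StringWriter result = new StringWriter();
        removeStopWords(new BufferedReader(new StringReader("Stop words here\nWords matter\n")), stopWords, result);
        // prints "here" and "matter" on separate lines
        System.out.print(result);
    }
}
```

For real files you could pass `Files.newBufferedReader(path, StandardCharsets.UTF_8)` and `Files.newBufferedWriter(outPath, StandardCharsets.UTF_8)` inside a try-with-resources block, so both are closed even if an exception is thrown.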
Let me know if I can elaborate on my code or on why I did something!