How can I pull in words with apostrophes correctly? Words like "wouldn't" and "couldn't" are being placed into an ArrayList as "wouldn" and "couldn"

Question

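I'm an IT Networking/programming student trying to complete an assignment, and I've hit a snag. Our task is to read a text file, put the words into an ArrayList, and perform string manipulation on the contents. I am able to pull the words into an ArrayList, sort the contents in ascending order, remove any words with fewer than four characters, remove duplicate entries, and remove numbers. However, I'm finding that words with apostrophes are being cut off: words like "wouldn't" and "couldn't" are being placed into my ArrayList as "wouldn" and "couldn".

I have tried different delimiters for my Scanner object, but I can't seem to find one that keeps the apostrophe in the word and doesn't cut the word off after the apostrophe.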

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Scanner;

public class textFile {

    public static void main(String[] args) throws FileNotFoundException {

        // Scanner object reads in the required text file to the "words" ArrayList.
        Scanner sc = new Scanner(new File("textfile.txt"), "UTF-8");
        ArrayList<String> words = new ArrayList<String>();
        while (sc.hasNext()) {
            sc.useDelimiter("[^A-Za-z]");
            words.add(sc.next().toLowerCase());

        }
        // Closes the Scanner object used just above.
        sc.close();

        // Sorts the "words" ArrayList in ascending order.
        Collections.sort(words);

        // Creates the "wordsNoDuplicates" ArrayList. Removes duplicate strings.
        LinkedHashSet<String> wordsNoDup = new LinkedHashSet<String>(words);

        // Removes all words containing less than four characters.
        wordsNoDup.removeIf(u -> u.length() < 4);

        // Prints the total number of words in the "wordsNoDup" ArrayList
        System.out.println("Total Number of Words: " + wordsNoDup.size() + "\n");

        // Calculate and print the average word length.
        // double avgWordLength = 21186 / wordsNoDup.size();

        System.out.println("Average Word Length: " + 7.0 + "\n");

        // Print out the "words" ArrayList. Intended for debugging.
        System.out.print(wordsNoDup);

        System.out.println();

    }
}

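Again, words like "couldn't", "shouldn't", and "wouldn't" are being pulled in as "couldn", "shouldn", and "wouldn". It seems that the apostrophe and anything after it are being dropped. I'll openly admit that I don't have extensive knowledge of Java or programming, but any help would be appreciated!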

Answer 1

In your code you are using this:

sc.useDelimiter("[^A-Za-z]");

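With this pattern, every character that is not a letter acts as a delimiter, so ' acts as a delimiter too. "wouldn't" is therefore split into "wouldn" and "t", and the one-letter "t" is later dropped by your filter for words shorter than four characters. I suggest changing the line above to this: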

sc.useDelimiter("[^A-Za-z']");

With this change, ' is no longer treated as a delimiter, so it should be kept inside the word.
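To make the difference concrete, here is a minimal sketch that runs both delimiters over the same sample sentence. The class name `ApostropheDelimiterDemo`, the helper `tokenize`, and the sample text are made up for illustration, and the scanner reads from a String rather than your file. The trailing `+` is my addition, not part of the suggestion above: it lets a run of punctuation and spaces count as one delimiter, so no empty tokens are produced.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ApostropheDelimiterDemo {

    // Hypothetical helper: splits text into lowercase tokens using the given delimiter regex.
    static List<String> tokenize(String text, String delimiterRegex) {
        List<String> words = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            // Set the delimiter once, before the loop, rather than on every iteration.
            sc.useDelimiter(delimiterRegex);
            while (sc.hasNext()) {
                words.add(sc.next().toLowerCase());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String sample = "I couldn't go, and she wouldn't stay.";

        // Apostrophe is a delimiter here, so "couldn't" is split into "couldn" and "t".
        System.out.println(tokenize(sample, "[^A-Za-z]+"));

        // Apostrophe is excluded from the delimiter set, so "couldn't" stays whole.
        System.out.println(tokenize(sample, "[^A-Za-z']+"));
    }
}
```

One caveat: this only covers the plain ASCII apostrophe ('). If the text file uses typographic apostrophes (’), they would still act as delimiters unless that character is added to the class as well.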

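That said, I think it would be better to read your text and use a proper regular expression to match and filter the words, so that ' is allowed only in the exceptional case where it appears inside a word, and never outside one.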
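One possible way to do the matching approach is with `java.util.regex.Pattern` and `Matcher`, sketched below. The pattern `[A-Za-z]+(?:'[A-Za-z]+)*` is my own assumption about what counts as a word (letters, optionally joined by single apostrophes), and the names `WordMatcherDemo` and `extractWords` are hypothetical. Unlike the delimiter approach, it does not keep quote marks that merely surround a word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordMatcherDemo {

    // Assumed definition of a word: one or more letters, optionally followed by
    // groups of an apostrophe plus more letters (e.g. "wouldn't").
    // An apostrophe at the start or end of a word is never part of the match.
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+(?:'[A-Za-z]+)*");

    // Hypothetical helper: returns every match of WORD in the text, lowercased.
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group().toLowerCase());
        }
        return words;
    }

    public static void main(String[] args) {
        // The quotes around 'hello' are dropped; the apostrophe in "wouldn't" is kept.
        System.out.println(extractWords("She said 'hello' but wouldn't stay."));
    }
}
```

To use this with a file, you could read it line by line (for example with `Scanner.nextLine()`) and pass each line to `extractWords`, then sort, de-duplicate, and filter the result as you already do.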

java

regex

delimiter

text-files

java.util.scanner