Count frequency of a string individually from query

Question

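I want to search for a query in a file named a.java. If my query is String name, I want to get the frequency of each word of the query individually from the text file. First I have to count the frequency of String, then the frequency of name, and then add the two frequencies together. How can I implement this program in Java?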

public class Tf2 {
Integer k;
int totalword = 0;
int totalfile, containwordfile = 0;
Map<String, Integer> documentToCount = new HashMap<>();
File file = new File("H:/java");
File[] files = file.listFiles();
public void Count(String word) {
   File[] files = file.listFiles();
    Integer count = 0;
    for (File f : files) {
        BufferedReader br = null;
        try {
            br = new BufferedReader(new FileReader(f));
            count = documentToCount.get(word);

            documentToCount.clear();

            String line;
            while ((line = br.readLine()) != null) {
                String term[] = line.trim().replaceAll("[^a-zA-Z0-9 ]", " ").toLowerCase().split(" ");


                for (String terms : term) {
                    totalword++;
                    if (count == null) {
                        count = 0;
                    }
                    if (documentToCount.containsKey(word)) {

                        count = documentToCount.get(word);
                        documentToCount.put(terms, count + 1);
                    } else {
                        documentToCount.put(terms, 1);

                    }

                }

            }
          k = documentToCount.get(word);

            if (documentToCount.get(word) != null) {
                containwordfile++;
       
               System.out.println("" + k);

            }

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

public static void main(String[] args) throws IOException {
    Tf2 ob = new Tf2();
    String query = "String name";
    ob.Count(query);
}
}

I tried it with a HashMap, but it does not count the frequency of each query word individually.

Answer 1

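You are really overcomplicating things. If all you need to do is count occurrences, you don't need a hash map or anything like that. All you need to do is iterate over all the text in the document and count how many times the search string is found.

Basically, your workflow is:

1. Instantiate a counter at 0
2. Read the text
3. Iterate over the text, looking for the search string
4. When the search string is found, increment the counter
5. When you have finished iterating over the text, print the value of the counter

If your text is very long, you can do this line by line or read it in batches some other way.

Here is a simple example. Suppose I have a file and I am looking for the word "dog".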

// 1. instantiate counter to 0
int count = 0;

// 2. read text
Path path = ...; // path to my input file
String text = Files.readString(path, StandardCharsets.US_ASCII);

// 3-4. find instances of the string in the text
String searchString = "dog";

int lastIndex = 0;
while (lastIndex != -1) {
  lastIndex = text.indexOf(searchString, lastIndex); // returns -1 if the searchString is not found
  if (lastIndex != -1) {
    count++; // increment counter
    lastIndex += searchString.length(); // increment index by length of search term
  }
}

// 5. print result of counter
System.out.println("Found " + count + " instances of " + searchString);

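In your specific example, you would read the contents of the a.java class, then find the number of instances of 'String', then the number of instances of 'name'. You can add them together at your leisure. So you need to repeat steps 3 and 4 for each word you are searching for, then sum all the counts at the end.

Of course, the easiest way to do this is to wrap steps 3 and 4 in a method that returns the count.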

int countOccurrences(String searchString, String text) {
  int count = 0;
  int lastIndex = 0;
  while (lastIndex != -1) {
    lastIndex = text.indexOf(searchString, lastIndex);
    if (lastIndex != -1) {
      count++;
      lastIndex += searchString.length();
    }
  }
  return count;
}

// Call:
int nameCount = countOccurrences("name", text);
int stringCount = countOccurrences("String", text);

System.out.println("Counted " + nameCount + " instances of 'name' and " + stringCount + " instances of 'String', for a total of " + (nameCount + stringCount));

(Whether you call toLowerCase() on text depends on whether you need case-sensitive matching.)

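Of course, if you only want 'name' and not 'lastName', you will need to start thinking about word boundaries (the regex word-boundary matcher \b is useful here). To parse printed text, you would also need to handle words that are hyphenated across line endings. But it sounds like your use case is just counting instances of individual words that happen to be given to you in a space-delimited string.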
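As a rough sketch of the word-boundary idea, assuming the search term starts and ends with a letter or digit, something like the following could replace the indexOf loop. The class and method names (WholeWordCount, countWholeWord) and the sample text are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordCount {

    // Counts occurrences of `word` that sit between word boundaries,
    // so "name" is counted in "String name;" but not inside "username".
    static int countWholeWord(String word, String text) {
        // Pattern.quote treats the search term literally, even if it contains regex metacharacters
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical sample text; in practice this would be the file contents
        String text = "String name; String username; name = username;";
        System.out.println(countWholeWord("name", text));   // prints 2
        System.out.println(countWholeWord("String", text)); // prints 2
    }
}
```

A plain indexOf search for "name" on the same text would report 4, because it also matches inside "username".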

If you actually only want instances of String name as a single phrase, just use the first workflow.

Other useful Q&As:

How do I create a Java string from the contents of a file?
Occurrences of substring in a string

Answer 2

Here is an example that uses Collections.frequency to get the count of a string in a file:

public void Count(String word) {
    File f = new File("/your/path/text.txt");
    List<String> list = new ArrayList<String>();
    if (f.exists() && f.isFile()) {
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new FileReader(f))) {
            String line;
            while ((line = br.readLine()) != null) {
                // split each line on single spaces and collect the tokens
                String[] arr = line.split(" ");
                for (String str : arr) {
                    list.add(str);
                }
            }
            System.out.println("Frequency = " + Collections.frequency(list, word));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Here is another example that uses the Java Streams API and also works for a multi-file search inside a directory:

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.stream.Stream;

public class Test {

    public static void main(String[] args) {
        File file = new File("C:/path/to/your/files/");
        String targetWord = "stringtofind";
        long numOccurances = 0;

        if (file.isFile() && file.getName().endsWith(".txt")) {
            // single file: split each line on whitespace and count exact matches
            numOccurances = getLineStreamFromFile(file)
                    .flatMap(str -> Arrays.stream(str.split("\\s")))
                    .filter(str -> str.equals(targetWord))
                    .count();

        } else if (file.isDirectory()) {
            // directory: do the same across every .txt file in it
            numOccurances = Arrays.stream(file.listFiles(pathname -> pathname.toString().endsWith(".txt")))
                    .flatMap(Test::getLineStreamFromFile)
                    .flatMap(str -> Arrays.stream(str.split("\\s")))
                    .filter(str -> str.equals(targetWord))
                    .count();
        }

        System.out.println(numOccurances);
    }

    public static Stream<String> getLineStreamFromFile(File file) {
        try {
            return Files.lines(file.toPath());
        } catch (IOException e) {
            e.printStackTrace();
        }
        // fall back to an empty stream if the file could not be opened
        return Stream.empty();
    }
}

In addition, you can break the input string into individual words and loop over them to get the number of occurrences of each word.
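A minimal sketch of that last idea, assuming the tokens of the file have already been collected into a list as in the first example. The names QueryFrequency and countEach, and the sample tokens, are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class QueryFrequency {

    // Returns, for each word of the query, how often it occurs among the tokens.
    static Map<String, Integer> countEach(String query, List<String> tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>(); // keeps the order of the query words
        for (String q : query.trim().split("\\s+")) {
            counts.put(q, Collections.frequency(tokens, q));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stand-in for the tokens collected from the file in the example above
        List<String> tokens = Arrays.asList("String name = other name ;".split("\\s+"));

        Map<String, Integer> counts = countEach("String name", tokens);
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        System.out.println(counts + " total=" + total); // prints {String=1, name=2} total=3
    }
}
```

Collections.frequency scans the whole list once per query word, which is fine for short queries; for long queries a single pass that fills a map is likely the better fit.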

Answer 3

You can use a map with the words as keys and the counts as values:

  public static void main(String[] args) {
    String corpus =
        "Wikipedia is a free online encyclopedia, created and edited by volunteers around the world";
    String query = "edited Wikipedia volunteers";

    Map<String, Integer> word2count = new HashMap<>();
    for (String word : corpus.split(" ")) {
      if (!word2count.containsKey(word))
        word2count.put(word, 0);
      word2count.put(word, word2count.get(word) + 1);
    }

    for (String q : query.split(" "))
      System.out.println(q + ": " + word2count.get(q));
  }

Answer 4

If I have a file that contains the line "Wikipedia is a free online encyclopedia, created and edited by volunteers around the world" and I want to search for the query "edited Wikipedia volunteers", then my program should first count the frequency of edited in the text file, then the frequency of Wikipedia, then the frequency of volunteers, and finally sum up all the frequencies. Can I solve it by using a HashMap?

You can do it as follows:

import java.util.HashMap;
import java.util.Map;

public class Main {
    public static void main(String[] args) {
        // The given string
        String str = "Wikipedia is a free online encyclopedia, created and edited by volunteers around the world.";

        // The query string
        String query = "edited Wikipedia volunteers";

        // Split the given string and the query string on space
        String[] strArr = str.split("\\s+");
        String[] queryArr = query.split("\\s+");

        // Map to hold the frequency of each word of query in the string
        Map<String, Integer> map = new HashMap<>();

        for (String q : queryArr) {
            for (String s : strArr) {
                if (q.equals(s)) {
                    map.put(q, map.getOrDefault(q, 0) + 1);
                }
            }
        }

        // Display the map
        System.out.println(map);

        // Get the sum of all frequencies
        int sumFrequencies = map.values().stream().mapToInt(Integer::intValue).sum();

        System.out.println("Sum of frequencies: " + sumFrequencies);
    }
}

Output:

{edited=1, Wikipedia=1, volunteers=1}
Sum of frequencies: 3

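Check the documentation of Map#getOrDefault to learn more about it.

Update

In the original answer, I used the Java Stream API to get the sum of the values. Given below is an alternative way of doing it: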

// Get the sum of all frequencies
int sumFrequencies = 0;
for (int value : map.values()) {
    sumFrequencies += value;
}

Your other question was:

If I have multiple files in a folder, how can I know how many times this query occurs in each file?

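You can create a Map<String, Map<String, Integer>> in which the key is the file name and the value (i.e. the Map<String, Integer>) is the frequency map of that file. I have already shown the algorithm for creating this frequency map above. All you have to do is loop through the list of files and populate this map (Map<String, Map<String, Integer>>).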
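One way this could look is sketched below. It assumes every regular file in the folder is UTF-8 text that fits in memory, and it reuses the folder H:/java from the question. The names PerFileFrequency, frequencies and frequenciesByFile are invented for this sketch.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PerFileFrequency {

    // Frequency map for one file: query word -> number of exact token matches.
    static Map<String, Integer> frequencies(List<String> lines, String query) {
        Map<String, Integer> map = new HashMap<>();
        for (String q : query.trim().split("\\s+")) {
            for (String line : lines) {
                for (String s : line.split("\\s+")) {
                    if (q.equals(s)) {
                        map.merge(q, 1, Integer::sum); // add 1 to the count of q
                    }
                }
            }
        }
        return map;
    }

    // Outer map: file name -> frequency map of that file.
    static Map<String, Map<String, Integer>> frequenciesByFile(File dir, String query) throws IOException {
        Map<String, Map<String, Integer>> result = new HashMap<>();
        File[] files = dir.listFiles();
        if (files == null) {
            return result; // not a directory, or it could not be read
        }
        for (File f : files) {
            if (!f.isFile()) {
                continue; // skip sub-directories
            }
            // Assumption: the file is UTF-8 text (the default of Files.readAllLines)
            result.put(f.getName(), frequencies(Files.readAllLines(f.toPath()), query));
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // "H:/java" is the folder from the question; replace it with your own
        System.out.println(frequenciesByFile(new File("H:/java"), "edited Wikipedia volunteers"));
    }
}
```

Summing the values of an inner map gives the total for that file, exactly as in the sum shown earlier.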
