Stop Words in C++

The following C++ program takes two text files, stop_words.txt and story.txt, and removes every stop word that occurs in story.txt. For example,

Monkey is a common name that may refer to groups or species of mammals, in part, the simians of infraorder L. The term is applied descriptively to groups of primates, such as families of new world monkeys and old world monkeys. Many monkey species are tree-dwelling (arboreal), although there are species that live primarily on the ground, such as baboons. Most species are also active during the day (diurnal). Monkeys are generally considered to be intelligent, especially the old world monkeys of Catarrhini.

The text above is story.txt. The stop_words.txt file is given below:

is
are 
be

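When I run my code, it does not remove all of the stop words; some of them are left in the text. The code also creates a file named stop_words_count.txt, which should show how many times each stop word occurs, like this: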

is 2
are 4
be 1

But my output file shows the following:

is 1
are 4
be 1

I would really appreciate some help with this code! I have posted it below for reference.


#include <iostream>
#include <string>
#include <fstream>
using namespace std;

const int MAX_NUM_STOPWORDS = 100;

struct Stop_word
{
  string word;  // stop word
  int count;    // removal count
};


int stops[100];
string ReadLineFromStory(string story_filename )
{
  string x = "";
  string b;
  ifstream fin;
  fin.open(story_filename);
  while(getline(fin, b))
  {
    x += b;

  }
  return x;
}

void ReadStopWordFromFile(string stop_word_filename, Stop_word words[], int &num_words)
{
  ifstream fin;
  fin.open(stop_word_filename);
  string a;
  int i = 0;
  if (fin.fail())
  {
    cout << "Failed to open "<< stop_word_filename << endl;
    exit(1);
  }
  words[num_words].count = 0;
  while (fin >> words[num_words].word)
  {
    
    ++num_words;
  }


  fin.close();
}

void WriteStopWordCountToFile(string wordcount_filename, Stop_word words[], int num_words)
{
  ofstream fout;
  fout.open(wordcount_filename);
  for (int i = 0; i < 1; i++)
  {
    fout << words[i].word << " "<< stops[i] + 1 << endl;
  }
  for (int i = 1; i < num_words; i++)
  {
    fout << words[i].word << " "<< stops[i] << endl;
  }

  fout.close();
}

int RemoveWordFromLine(string &line, string word)
{
  int length = line.length();
    int counter = 0;
    int wl = word.length();
    for(int i=0; i < length; i++)
    {
        int x = 0;
        if(line[i] == word[0] && (i==0 || (i != 0 && line[i-1]==' ')))
        {
            for(int j = 1 ; j < wl; j++)
                if (line[i+j] != word[j])
                {
                    x = 1;
                    break;
                }
            if(x == 0 && (i + wl == length || (i + wl != length && line[i+wl] == ' ')))
            {
                for(int k = i + wl; k < length; k++)
                    line[k -wl] =line[k];
                length -= wl;

                counter++;
            }
        }

  }
  line[length] = 0;
  char newl[1000] = {0};
  for(int i = 0; i < length; i++)
    newl[i] = line[i];
  line.assign(newl);
  return counter;
}


int RemoveAllStopwordsFromLine(string &line, Stop_word words[], int num_words)
{
  int counter[100];
  int final = 0;
    for(int i = 1; i <= num_words; i++)
  {
    counter[i] = RemoveWordFromLine(line, words[i].word);
    final += counter[i];
    stops[i] = counter[i];

  }
    return final;

}


int main()
{

  Stop_word stopwords[MAX_NUM_STOPWORDS];     // an array of struct Stop_word
  int num_words = 0, total = 0;
  // read in two filenames from user input
  string a, b, c;
  cin >> a >> b;

  // read stop words from stopword file and
  // store them in an array of struct Stop_word
  ReadStopWordFromFile(a, stopwords, num_words);

  // open text file
  c = ReadLineFromStory(b);


  // open cleaned text file
  ofstream fout;
  fout.open("story_cleaned.txt");


  // read in each line from text file, remove stop words,
  // and write to output cleaned text file

  total = RemoveAllStopwordsFromLine(c, stopwords, num_words) + 1 ;

  fout << c;

  // close text file and cleaned text file

  fout.close();

  // write removal count of stop words to files

  WriteStopWordCountToFile("stop_words_count.txt", stopwords, num_words);

  // output to screen total number of words removed
  cout << "Number of stop words removed = " << total << endl;

  return 0;
}

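There is one major bug in your code.

It is in the function RemoveAllStopwordsFromLine.

You are using the wrong array indices. In C++, the first element of an array has index 0, so the loop must start at 0 and compare against the size with "less than" (<), not "less than or equal" (<=).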

for (int i = 1; i <= num_words; i++)

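Because this loop starts at 1, the first stop word, "is", is never checked or counted, and the last iteration reads words[num_words], one element past the last stop word that was read.

Please change it to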

for (int i = 0; i < num_words; i++)

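But you also need to remove the patch in the function WriteStopWordCountToFile. You made a special case for element 0 that adds 1 to its count. That is wrong: the original loop never sets stops[0], so the file always reports "is 1".

Please delete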

    for (int i = 0; i < 1; i++)
    {
        fout << words[i].word << " " << stops[i] + 1 << endl;
    }

and start the next for loop at 0. Also drop the "+ 1" when calculating the total.
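Putting these corrections together, the two functions could look roughly like the sketch below. It is not a drop-in replacement for the original code: it assumes the counts are stored in the count member of Stop_word instead of the global stops array, it re-implements RemoveWordFromLine with std::string::find and erase, and FormatStopWordCounts is a hypothetical helper that returns the report as a string instead of writing to a file.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

struct Stop_word {
    std::string word;  // stop word
    int count;         // removal count
};

// Removes every occurrence of `word` that is delimited by spaces or by the
// ends of the line, and returns how many occurrences were removed.
int RemoveWordFromLine(std::string& line, const std::string& word) {
    if (word.empty()) return 0;  // nothing to search for
    int removed = 0;
    std::size_t pos = 0;
    while ((pos = line.find(word, pos)) != std::string::npos) {
        const std::size_t end = pos + word.size();
        const bool startsWord = (pos == 0) || line[pos - 1] == ' ';
        const bool endsWord = (end == line.size()) || line[end] == ' ';
        if (startsWord && endsWord) {
            line.erase(pos, word.size());  // surrounding spaces are kept
            ++removed;
        } else {
            ++pos;  // part of a longer word, keep searching
        }
    }
    return removed;
}

// Zero-based loop with a strict "less than" comparison (the fix described above).
int RemoveAllStopwordsFromLine(std::string& line, Stop_word words[], int num_words) {
    int total = 0;
    for (int i = 0; i < num_words; i++) {
        const int removed = RemoveWordFromLine(line, words[i].word);
        words[i].count += removed;
        total += removed;
    }
    return total;  // no "+ 1" correction needed
}

// Hypothetical helper: one loop over all elements, no special case for element 0.
std::string FormatStopWordCounts(const Stop_word words[], int num_words) {
    std::ostringstream out;
    for (int i = 0; i < num_words; i++) {
        out << words[i].word << ' ' << words[i].count << '\n';
    }
    return out.str();
}
```

For the line "it is what they are and is to be" and the stop words is, are, be, this removes four words in total and reports the counts 2, 1 and 1.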

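Because you are using C-style arrays, magic numbers, and overly complicated code, I will show you a modern C++ solution.

C++ offers many useful algorithms. Some of them are designed for exactly this kind of requirement, so please use them. Try to move away from C and migrate to C++.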
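As a small illustration of that point, here is a minimal sketch of removing a word with standard algorithms instead of hand-written index loops. The helper names SplitWords and EraseWord are made up for this example, and the sketch assumes that words are separated by whitespace and that punctuation handling does not matter.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Splits `text` into whitespace-separated tokens.
std::vector<std::string> SplitWords(const std::string& text) {
    std::istringstream in(text);
    return std::vector<std::string>(std::istream_iterator<std::string>(in),
                                    std::istream_iterator<std::string>());
}

// Removes every token equal to `stopWord` (erase-remove idiom) and
// returns how many tokens were removed.
std::size_t EraseWord(std::vector<std::string>& tokens, const std::string& stopWord) {
    const auto newEnd = std::remove(tokens.begin(), tokens.end(), stopWord);
    const std::size_t removed =
        static_cast<std::size_t>(std::distance(newEnd, tokens.end()));
    tokens.erase(newEnd, tokens.end());
    return removed;
}
```

Calling EraseWord(tokens, "is") on the tokens of "it is what it is" removes two tokens and leaves "it", "what", "it".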

#include <string>
#include <iostream>
#include <fstream>
#include <vector>
#include <iterator>
#include <algorithm>
#include <regex>
#include <sstream>


// The filenames. Whatever you want
const std::string storyFileName{ "r:\\story.txt" };
const std::string stopWordFileName{ "r:\\stop_words.txt" };
const std::string stopWordsCountFilename{ "r:\\stop_words_count.txt" };
const std::string storyCleanedFileName{ "r:\\story_cleaned.txt" };



// Because of the simplicity of the task, put everything in main
int main() {

    // Open all 4 needed files
    std::ifstream storyFile(storyFileName);
    std::ifstream stopWordFile(stopWordFileName);
    std::ofstream stopWordsCountFile(stopWordsCountFilename);
    std::ofstream storyCleanedFile(storyCleanedFileName);

    // Check, if the files could be opened
    if (storyFile && stopWordFile && stopWordsCountFile && storyCleanedFile) {

        // 1. Read the complete sourcefile with the story into a std::string
        std::string story( std::istreambuf_iterator<char>(storyFile), {} );

        // 2. Read all "stop words" into a std::vector of std::strings (uses class template argument deduction, C++17)
        std::vector stopWords(std::istream_iterator<std::string>(stopWordFile), {});

        // 3. Count the occurrences of the "stop words" and write them into the destination file
        //    (the pattern is the bare word, so matches inside longer words are counted too)
        std::for_each(stopWords.begin(), stopWords.end(), [&story,&stopWordsCountFile](std::string& sw) {
            std::regex re{sw};                          // One of the "stop words"
            stopWordsCountFile << sw << " --> " <<      // Write count to output
                std::distance(std::sregex_token_iterator(story.begin(), story.end(), re, 1), {}) << "\n";});

        // 4. Replace "stop words" in story and write new story into file
        std::ostringstream wordsToReplace;      // Build a list of all stop words, each followed by optional whitespace
        std::copy(stopWords.begin(), stopWords.end(), std::ostream_iterator<std::string>(wordsToReplace, "\\s?|"));

        storyCleanedFile << std::regex_replace(story,std::regex(wordsToReplace.str()), "");
    }
    else {
        // In case that any of the files could not be opened.
        std::cerr << "\n*** Error: Could not open one of the files\n";
    }
    return 0;
}

Please take the time to study and understand this code. It is a very simple solution.
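One caveat: a pattern built from the bare stop word also matches inside longer words, for example "is" in "this". If only whole words should be counted and removed, the patterns can be wrapped in \b word boundaries. The following is a minimal sketch of that variation, not part of the listing above. The function names CountWholeWord and RemoveStopWords are made up for this example, and it assumes the stop words contain no regex metacharacters.

```cpp
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Counts occurrences of `word` in `text` that are bounded by \b on both sides.
// Assumption: `word` contains no regex metacharacters.
int CountWholeWord(const std::string& text, const std::string& word) {
    const std::regex re("\\b" + word + "\\b");
    return static_cast<int>(
        std::distance(std::sregex_iterator(text.begin(), text.end(), re),
                      std::sregex_iterator()));
}

// Removes every whole-word stop word, plus one following whitespace
// character if there is one, and returns the cleaned text.
std::string RemoveStopWords(const std::string& text,
                            const std::vector<std::string>& stopWords) {
    if (stopWords.empty()) return text;

    // Join the words as "is|are|be" without a trailing '|'.
    std::string alternatives;
    for (const std::string& w : stopWords) {
        if (!alternatives.empty()) alternatives += '|';
        alternatives += w;
    }

    const std::regex re("\\b(?:" + alternatives + ")\\b\\s?");
    return std::regex_replace(text, re, "");
}
```

With the stop words "is" and "be", the text "this is a bee to be" becomes "this a bee to ": the "is" inside "this" and the "be" inside "bee" are left alone.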