Count the number of unique words and the occurrences of each word in a txt file

Question

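I am currently building an application that does some text processing on a text file, using a dictionary to create a word index. The program should read a text file and, for each word, check whether that word has already been seen in the file and what its ID is as a unique word. It should then print the index number and the total number of occurrences for every word it encounters, and keep going until the whole file has been checked, producing something like this: http://pastebin.com/CjtcYchF

Here is a sample of the text file I am using as input: http://pastebin.com/ZRVbhWhV. A quick Ctrl-F shows that "not" occurs 2 times and "that" occurs 4 times. What I need to do is index each word and refer to it like this: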

sample input : "that I have not that place sunrise beach like not good dirty beach trash beach" 

    dictionary :            output.txt / output.dat:
    index word                     
      1    I                4:2 1:1 2:1 3:2 5:1 6:1 7:3 8:1 9:1 10:1 11:1
      2   have                   
      3   not                    
      4   that                   
      5   place                  
      6   sunrise
      7   beach
      8   like
      9   good
      10  dirty
      11  trash

I tried to implement some code to create the dictionary. This is what I have so far:

    private void bagofword_Click(object sender, EventArgs e)
    {
        //creating dictionary in background
        //Dictionary<string, int> dict = new Dictionary<string, int>();
        string rawinputbow = File.ReadAllText(textBox31.Text);
        //string[] inputbow = rawinputbow.Split(' ');

        var inputbow = rawinputbow.Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries)
                                  .ToList();
        var dict = new OrderedDictionary();
        var output = new List<int>();

        foreach (var element in inputbow.Select((word, index) => new { word, index }))
        {
            if (dict.Contains(element.word))
            {
                var count = (int)dict[element.word];
                dict[element.word] = ++count;
                output.Add(GetIndex(dict, element.word));
                //textBoxfile.Text = output.ToString();
                //textBoxfile.Text = inputbow.ToString();
                string result = string.Join(",", output);
                textBoxfile.Text = result.ToString();
            }
            else
            {
                dict[element.word] = 1;
                output.Add(GetIndex(dict, element.word));
                //textBoxfile.Text = dict.ToString();
                string result = string.Join(",", output);
                textBoxfile.Text = result.ToString();
            }
        }
    }

    public int GetIndex(OrderedDictionary dictionary, string key)
    {
        for (int index = 0; index < dictionary.Count; index++)
        {
            if (dictionary[index] == dictionary[key])
                return index; // We found the item
                //textBoxfile.Text = index.ToString();
        }

        return -1;
    }

Does anyone know how to complete this code? Any help is much appreciated!

Answer 1

Use this code:

    string input = "that I have not that place sunrise beach like not good dirty beach trash beach";
    // Split on any whitespace; the cast avoids an ambiguous overload on newer .NET versions.
    var wordList = input.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    // Group identical words, count each group, and sort by ascending count.
    var output = wordList
        .GroupBy(x => x)
        .Select(x => new Word { Text = x.Key, Repeat = x.Count() })
        .OrderBy(x => x.Repeat);
    foreach (var item in output)
    {
        textBoxfile.Text += item.Text + " : " + item.Repeat + Environment.NewLine;
    }

A class to hold the data:

    public class Word
    {
        public string Text { get; set; }
        public int Repeat { get; set; }
    }
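The grouping above gives counts per word but not the `index:count` line the question asks for. The following is only a sketch of one way to get there, under an assumption that is not in the question: indexes are 1-based and assigned in order of first appearance (the question's own table numbers the words differently and does not say how). `WordIndexSketch` and `BuildIndexLine` are hypothetical names used for illustration.

```csharp
using System;
using System.Linq;

public static class WordIndexSketch
{
    // Returns one "index:count" entry per distinct word, separated by spaces.
    // Assumption: indexes are 1-based and follow the order of first appearance.
    public static string BuildIndexLine(string text)
    {
        // Split on any whitespace and drop empty tokens.
        var words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        // LINQ to Objects yields groups in the order their keys first appear,
        // so the position of a group can serve as the word's index.
        var entries = words
            .GroupBy(w => w)
            .Select((g, i) => $"{i + 1}:{g.Count()}");

        return string.Join(" ", entries);
    }

    public static void Main()
    {
        const string input =
            "that I have not that place sunrise beach like not good dirty beach trash beach";
        Console.WriteLine(BuildIndexLine(input));
    }
}
```

For the sample input this numbers "that" as 1 rather than 4, so the entries come out in index order. If the indexes have to match an existing dictionary, a lookup from word to index would need to be built first and used in place of the group position.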

Answer 2

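Splitting only on spaces is not enough. Your text contains tokens like temple, photos. or cafes/restaraunts, where punctuation stays attached to the word. A better approach is to use a regular expression such as \w+. In addition, words should be compared case-insensitively.

My approach would be: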

var words = Regex.Matches(File.ReadAllText(filename), @"\w+").Cast<Match>()
            .Select((m, pos) => new { Word = m.Value, Pos = pos })
            .GroupBy(s => s.Word, StringComparer.CurrentCultureIgnoreCase)
            .Select(g => new { Word = g.Key, PosInText = g.Select(z => z.Pos).ToList() })
            .ToList();


foreach(var item in words)
{
    Console.WriteLine("{0,-15} POS:{1}", item.Word, string.Join(",", item.PosInText));
}


for (int i = 0; i < words.Count; i++)
{
    Console.Write("{0}:{1} ", i, words[i].PosInText.Count);
}
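To make the point about tokenization concrete, here is a small sketch that compares splitting on spaces with matching `\w+`, and then counts words case-insensitively. The sample sentence is made up for illustration and is not taken from the asker's file; `TokenizeSketch` and its method names are likewise hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenizeSketch
{
    // Naive tokenization: split on spaces only, so punctuation stays attached.
    public static List<string> SplitOnSpaces(string text) =>
        text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).ToList();

    // Regex tokenization: every run of word characters (\w+) becomes one token.
    public static List<string> MatchWords(string text) =>
        Regex.Matches(text, @"\w+").Cast<Match>().Select(m => m.Value).ToList();

    // Counts occurrences per word, treating "Beach" and "beach" as the same key.
    public static Dictionary<string, int> CountIgnoringCase(IEnumerable<string> tokens)
    {
        var counts = new Dictionary<string, int>(StringComparer.CurrentCultureIgnoreCase);
        foreach (var token in tokens)
        {
            counts.TryGetValue(token, out var current); // current is 0 for a new word
            counts[token] = current + 1;
        }
        return counts;
    }

    public static void Main()
    {
        // Hypothetical sample text, chosen to contain punctuation and mixed case.
        const string sample = "Nice beach. The Beach has cafes/restaurants, nice photos.";

        Console.WriteLine(string.Join(" | ", SplitOnSpaces(sample)));
        Console.WriteLine(string.Join(" | ", MatchWords(sample)));

        foreach (var pair in CountIgnoringCase(MatchWords(sample)))
        {
            Console.WriteLine("{0,-15} {1}", pair.Key, pair.Value);
        }
    }
}
```

With the space-based split, "beach." and "Beach" end up as two unrelated tokens; with `\w+` and a case-insensitive comparer they are counted as the same word.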

Answer 3

### Sample code for you to tweak for your needs:
touch test.txt
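echo "ravi chandran marappan 30" > test.txt
echo "ramesh kumar marappan 24" >> test.txt
echo "ram lakshman marappan 22" >> test.txt
# Put one word per line, de-duplicate, then generate and run one "echo WORD -COUNT" command per word.
# Note: grep -wc counts the lines that contain the word as a whole word.
sed -e 's/ /\n/g' test.txt | sort | uniq | awk '{print "echo " $1 " -`grep -wc " $1 " test.txt`"}' | sh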

Results:                          
22 -1                                                                                                                                                         
24 -1                                                                                                                                                         
30 -1                                                                                                                                                         
chandran -1                                                                                                                                                   
kumar -1                                                                                                                                                      
lakshman -1                                                                                                                                                   
marappan -3                                                                                                                         
ram -1                                                                                                                            
ramesh -1                                                                                                                       
ravi -1

c#

text-processing

visual-studio