Converting a list of words into a frequency JSON

Question

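I wrote some code that takes a list of items and outputs a JSON object with the unique items as keys and their frequencies as values.

The code below works fine when I test it: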


const tokenFrequency = tokens => {
  const setTokens = [...new Set(tokens)];
  return setTokens.reduce((obj, tok) => {
    const frequency = tokens.reduce((count, word) => word === tok ? count + 1 : count, 0);

    const containsDigit = /\d+/;
    if (!containsDigit.test(tok)) {
      obj[tok.toLocaleLowerCase()] = frequency;
    }
    return obj;
  }, new Object());
};

For example,

const x=["hello","hi","hi","whatsup","hey"]
console.log(tokenFrequency(x))

produces the output

{ hello: 1, hi: 2, whatsup: 1, hey: 1 }

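But when I run it on a word list from a large corpus, it seems to produce wrong results.

For example, if I pass in a list of more than 14,000 words, the result is wrong.

Example: https://github.com/Nahdus/word2vecDataParsing/blob/master/corpous/listOfWords.txt When the function is run on the list at the link above, the reported frequency of the word "is" is 4, but the actual frequency is 907.

Why does this happen with large data, and how can I fix it?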

Answer 1

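You need to normalize your tokens first by applying toLowerCase() to them, or else use some way of keeping apart words that are the same but differ only in case.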
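
A minimal sketch of the first option, assuming the digit filter from the question should be kept. The name tokenFrequencyNormalized is made up for this illustration and is not part of the question's code. Each token is lowercased before it is counted, and the counting happens in a single pass, so case variants add up instead of replacing each other:

```javascript
// Hypothetical helper (the name is an assumption for this sketch).
// Lowercases every token BEFORE counting, so "Is" and "is" share one counter.
const tokenFrequencyNormalized = (tokens) => {
  const containsDigit = /\d/;
  // A Map avoids clashes with inherited object properties such as "constructor".
  const counts = new Map();
  for (const tok of tokens) {
    if (containsDigit.test(tok)) continue; // same filter as the question: skip tokens with digits
    const key = tok.toLowerCase();
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  // Convert to a plain object so it can be serialized with JSON.stringify.
  return Object.fromEntries(counts);
};

console.log(tokenFrequencyNormalized(["is", "Is", "is", "IS", "hi", "a1"]));
// { is: 4, hi: 1 }
```

Counting in one pass also avoids rescanning the whole list once per unique word, which matters for a list of more than 14,000 entries.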

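Reason:

Your small dataset contains no occurrences of Is (capital "I"). The large dataset does contain Is, evidently with a frequency of 4, and that count overwrites the frequency already stored for lowercase is, because both words are written to the same lowercased key.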
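
The overwrite can be reproduced on a tiny input. The sketch below repeats the question's function under the made-up name tokenFrequencyOriginal (the rename is only for this illustration) and feeds it three lowercase tokens and one capitalized one:

```javascript
// The question's function, copied with the same logic (renamed for this sketch).
const tokenFrequencyOriginal = (tokens) => {
  const setTokens = [...new Set(tokens)]; // case-sensitive: "is" and "Is" are both kept
  return setTokens.reduce((obj, tok) => {
    // Counted case-sensitively per unique token.
    const frequency = tokens.reduce((count, word) => (word === tok ? count + 1 : count), 0);
    const containsDigit = /\d+/;
    if (!containsDigit.test(tok)) {
      // The key is lowercased only here, so "Is" writes over the entry for "is".
      obj[tok.toLocaleLowerCase()] = frequency;
    }
    return obj;
  }, {});
};

console.log(tokenFrequencyOriginal(["is", "is", "is", "Is"]));
// { is: 1 }
```

The count for "is" (3) is stored first and then replaced by the count for "Is" (1), which is the same effect that turns 907 into 4 on the large list.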

json

list

data-processing

node.js