Can't plot Zipf's law in R
I loaded a large list of terms and their frequencies from a text file and converted it to a table:
myTbl = read.table("word_count.txt") # read text file
colnames(myTbl)<-c("term", "frequency")
head(myTbl, n = 10)
> head(myTbl, n = 10)
term frequency
1 de 35945
2 i 34850
3 \xe3n 19936
4 s 15348
5 cu 13722
6 la 13505
7 se 13364
8 pe 13361
9 nu 12693
10 o 11995
I think I should add a column containing each word's rank and then plot rank against frequency, but how do I do that?
Rather than rolling your own calculation, it is easier to use the tm package. Convert myTbl to a term-document matrix (tdm):
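library(tm)
# TermDocumentMatrix() expects a corpus, so coerce the counts as a one-document matrix instead
m <- matrix(myTbl$frequency, ncol = 1, dimnames = list(Terms = as.character(myTbl$term), Docs = "all"))
tdm <- as.TermDocumentMatrix(m, weighting = weightTf) # cleanup steps omitted for simplicity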
Then you can draw not only Zipf plots but also Heaps plots.
Zipf_plot(tdm)
Heaps_plot(tdm) # how vocabulary grows as the text grows; only informative when the matrix covers several documents
Alternatively, you can use the qdap package and its rank-frequency plots. Here is a quote from the vignette:
Rank Frequency Plots are a way of visualizing word rank versus
frequencies as related to Zipf's law which states that the rank of a
word is inversely related to its frequency. The rank_freq_mplot and
rank_freq_plot provide the means to plot the ranks and frequencies of
words (with rank_freq_mplot plotting by grouping variable(s)).
Rank_freq_mplot utilizes the ggplot2 package, whereas, rank_freq_plot
employs base graphics.
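For completeness, the rank column the question asks about can also be built by hand in base R. What follows is only a minimal sketch, not part of the original answer. It assumes myTbl has the two columns term and frequency shown in the question, and it rebuilds the first ten rows inline so that it runs on its own ("term3" is a made-up placeholder for the term that appears garbled in the question's output).

```r
# Stand-in for read.table("word_count.txt"): the first ten rows from the question.
# "term3" is a hypothetical placeholder for the garbled third term.
myTbl <- data.frame(
  term = c("de", "i", "term3", "s", "cu", "la", "se", "pe", "nu", "o"),
  frequency = c(35945, 34850, 19936, 15348, 13722, 13505, 13364, 13361, 12693, 11995),
  stringsAsFactors = FALSE
)

# Sort by decreasing frequency, then number the rows:
# rank 1 is the most frequent term.
myTbl <- myTbl[order(-myTbl$frequency), ]
myTbl$rank <- seq_len(nrow(myTbl))

# Zipf's law predicts frequency roughly proportional to 1 / rank^s,
# which shows up as a straight line with slope -s on log-log axes.
plot(myTbl$rank, myTbl$frequency, log = "xy",
     xlab = "Rank", ylab = "Frequency", main = "Rank vs. frequency")

# Fit a line in log10 space and overlay it on the log-log plot.
fit <- lm(log10(frequency) ~ log10(rank), data = myTbl)
abline(fit, col = "red")

# The second coefficient is the fitted slope, an estimate of -s.
coef(fit)
```

With the full word list in place of the ten inline rows, the same code applies unchanged. Terms with tied frequencies simply receive consecutive ranks in sort order here; if ties matter, rank(-myTbl$frequency, ties.method = "min") is one alternative.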