How to sort a list of doubles according to their names in R
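I am trying to write a function that calculates the R1 lexical richness measure. The formula is as follows:
R1 = 1 - ( F(h) - h*h/(2N) )
where N is the number of tokens, h is the Hirsch point (h-point), and F(h) is the cumulative relative frequency up to that point. My actual data has the same format as the following: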
txt <- list(
a = c("The truck driver whose runaway vehicle rolled into the path of an express train and caused one of Taiwan’s worst ever rail disasters has made a tearful public apology.", "The United States is committed to advancing prosperity, security, and freedom for both Israelis and Palestinians in tangible ways in the immediate term, which is important in its own right, but also as a means to advance towards a negotiated two-state solution.","The 49-year-old is part of a team who inspects the east coast rail line for landslides and other risks.", "We believe that this UN agency for so-called refugees should not exist in its current format.","His statement comes amid an ongoing investigation into the crash, with authorities saying the train driver likely had as little as 10 seconds to react to the obstruction.", " The US president accused Palestinians of lacking “appreciation or respect.", "To create my data I had to chunk each text in an increasing manner.", "Therefore, the input is a list of chunked texts within another list.","We plan to restart US economic, development, and humanitarian assistance for the Palestinian people,” the secretary of state, Antony Blinken, said in a statement.", "The cuts were decried as catastrophic for Palestinians’ ability to provide basic healthcare, schooling, and sanitation, including by prominent Israeli establishment figures.","After Donald Trump’s row with the Palestinian leadership, President Joe Biden has sought to restart Washington’s flailing efforts to push for a two-state resolution for the Israel-Palestinian crisis, and restoring the aid is part of that.")
)
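library(quanteda)
library(quanteda.textstats)  # textstat_frequency() lives here in quanteda >= 3.0
# Tokenize first; dfm() on a raw character vector is deprecated in quanteda >= 3.0
DFMs <- lapply(txt, function(x) dfm(tokens(x)))
txt_freq <- function(x) textstat_frequency(x, groups = docnames(x), ties_method = "first")
Fs <- lapply(DFMs, txt_freq)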
# h-point: the rank at which the (interpolated) frequency equals the rank
get_h_point <- function(DATA) {
  fn_interp <- approxfun(DATA$rank, DATA$frequency)
  fn_root <- function(x) fn_interp(x) - x
  uniroot(fn_root, range(DATA$rank))$root
}
# Split each frequency table into one data frame per document (chunk)
s_p <- function(x) { split(x, x$group) }
tstat_by <- lapply(Fs, s_p)
h_values <- lapply(tstat_by, vapply, get_h_point, double(1))
str(tstat_by)
str(h_values)
F <- list()
R <- list()
temp <- list()
for (Ls in names(tstat_by)) {
  for (item in names(h_values[[Ls]])) {
    # Rows whose rank does not exceed the h-point
    temp[[Ls]][[item]] <- subset(tstat_by[[Ls]][[item]], rank <= h_values[[Ls]][[item]])
    # F(h): cumulative relative frequency up to the h-point
    F[[Ls]][[item]] <- sum(temp[[Ls]][[item]]$frequency) / sum(tstat_by[[Ls]][[item]]$frequency)
    # R1 = 1 - (F(h) - h^2 / (2N)), where N is the token count of the chunk
    R[[Ls]][[item]] <- 1 - (F[[Ls]][[item]] -
      h_values[[Ls]][[item]]^2 /
        (2 * sum(tstat_by[[Ls]][[item]]$frequency)))
  }
}
The values I need end up stored in the list, but in the wrong order. This is what the for loop produces:
names(R[["a"]])
[1] "text1" "text10" "text11" "text2" "text3" "text4" "text5" "text6" "text7"
[10] "text8" "text9"
But I need it in the following natural order:
names(R[["a"]])
[1] "text1" "text2" "text3" "text4" "text5" "text6" "text7" "text8" "text9"
[10] "text10" "text11"
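So the question is how to sort the values according to their names, with the numeric part of each name in ascending order.
Strip the "text" prefix from the element names, convert the remainder to integers, and order by those values.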
> R$a <- R$a[order(as.integer(gsub("text", "", names(R$a))))]
> R$a
$text1
[1] 0.8666667
$text2
[1] 0.8510638
$text3
[1] 0.9
$text4
[1] 0.9411765
$text5
[1] 0.8333333
$text6
[1] 0.9166667
$text7
[1] 0.8666667
$text8
[1] 0.8571429
$text9
[1] 0.7741935
$text10
[1] 0.8888889
$text11
[1] 0.8717949
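The same idea can be applied to every sub-list at once rather than to `R$a` alone. The following is a minimal sketch, not part of the original answer: `sort_by_name_number` and `R_demo` are hypothetical names, the values are placeholders rather than real R1 scores, and the helper assumes each name contains exactly one run of digits.

```r
# Hypothetical helper: reorder a named list (or vector) by the integer in each name.
# Assumes every name contains exactly one run of digits, e.g. "text10".
sort_by_name_number <- function(x) {
  # Drop every non-digit character, then compare what is left as integers.
  idx <- as.integer(gsub("[^0-9]", "", names(x)))
  x[order(idx)]
}

# Toy stand-in for the nested result list R (placeholder values).
R_demo <- list(
  a = list(text1 = 0.1, text10 = 0.2, text11 = 0.3, text2 = 0.4, text9 = 0.5),
  b = list(text2 = 0.6, text10 = 0.7, text1 = 0.8)
)

# Apply the helper to every sub-list; lapply() keeps the outer names.
R_sorted <- lapply(R_demo, sort_by_name_number)
names(R_sorted$a)
```

Names are character strings, so a plain alphabetical sort places "text10" before "text2"; converting the numeric part to an integer before calling `order()` is what restores the natural order. If the names mix text and numbers in less regular ways, `mixedorder()` from the gtools package may be worth a look.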