R - fasttext: how to load command-line output into a data frame

I am working on a project in R that calls fasttext from the command line, but I am not sure how to load the output that fasttext gives me as a data frame.

> data.train<-data.frame(index=c(rep("__label__1",3),rep("__label__2",3)),country=c("ENGLAND","BRITAIN","UNITED KINDOM","USA","AMERICA","UNITED STATES"))

> data.train
       index       country
1 __label__1       ENGLAND
2 __label__1       BRITAIN
3 __label__1 UNITED KINDOM
4 __label__2           USA
5 __label__2       AMERICA
6 __label__2 UNITED STATES

> data.test<-c("EGLND","MURICA")

> data.test
[1] "EGLND"  "MURICA"

> write.table(data.train,"data.train.txt",sep="\t",quote=FALSE,row.names=FALSE,col.names=FALSE)
> 

> write.table(data.test,"data.test.txt",sep="\t",quote=FALSE,row.names=FALSE,col.names=FALSE)
> 
> system("fasttext supervised -input data.train.txt -output model_data")
Read 0M words
Number of words:  8
Number of labels: 2
Progress: 0.0%  words/sec/thread: 103000  lr: 0.100000  loss: 0.672343  eta: -596523h-14m Progress: 100.0%  words/sec/thread: 103000  lr: 0.000000  loss: 0.672343  eta: 0h0m 
Saving model file.

> system("fasttext predict-prob model_data.bin data.test.txt 2")

__label__1 0.5 __label__2 0.498047
__label__1 0.5 __label__2 0.498047

> res<-system("fasttext predict-prob model_data.bin data.test.txt 2", intern=TRUE)

> res
[1] "__label__1 0.5 __label__2 0.498047" "__label__1 0.5 __label__2 0.498047"

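The original system call only printed the fasttext output to the console, which was the problem. Following a comment, intern=TRUE lets me save it to the variable res, but now that variable is just a character vector with one string per line. What I actually need is a data frame with the probability of each label, like this: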

> want
  __label__1  __label__2
1 0.5       0.498047
2 0.5       0.498047

This question answers a similar problem, but for Python; I need to do this in R.

Assuming you have already obtained the character vector res using system(..., intern = TRUE), you could try the following.

res3 <- c("__label__1 0.500768 __label__2 0.499252", 
          "__label__2 0.500768 __label__1 0.499252",
          "__label__3 1")

library(data.table)
x <- fread(text = res3, fill = TRUE)
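# rename the columns in "variable"/"value" pairs (var_1, val_1, var_2, val_2, ...)
# and add a row indicator; seq_len() keeps the numbering correct for any number of pairs
setnames(x, paste0(rep(c("var_", "val_"), length.out = ncol(x)), 
                   rep(seq_len(ncol(x) / 2), each = 2)))[, row := .I][]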
# melt the data into a long form and cast it into a wide form
out <- melt(x, measure = patterns("var_", "val_"), na.rm = TRUE)[
  , dcast(.SD, row ~ value1, value.var = "value2")]
out
#    row __label__1 __label__2 __label__3
# 1:   1   0.500768   0.499252         NA
# 2:   2   0.499252   0.500768         NA
# 3:   3         NA         NA          1

If you want to replace the NA values in the output with 0, you can add fill = 0 to dcast.
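As a rough illustration of the fill = 0 variant, here is a minimal sketch. It is not the exact pipeline above: it assumes the output lines always alternate label and probability, and it builds the long table with strsplit instead of fread and melt. The names tokens, long, and out0 are made up for this example.

```r
library(data.table)

# Sample predict-prob lines, standing in for the result of system(..., intern = TRUE)
res3 <- c("__label__1 0.500768 __label__2 0.499252",
          "__label__2 0.500768 __label__1 0.499252",
          "__label__3 1")

# Assumption: tokens alternate label, probability, label, probability, ...
tokens <- strsplit(res3, " ", fixed = TRUE)

# One long table: a row per (input line, label) pair
long <- rbindlist(lapply(seq_along(tokens), function(i) {
  tk <- tokens[[i]]
  data.table(row   = i,
             label = tk[seq(1, length(tk), by = 2)],
             prob  = as.numeric(tk[seq(2, length(tk), by = 2)]))
}))

# Cast to wide form; fill = 0 writes 0 where a label is absent from a line
out0 <- dcast(long, row ~ label, value.var = "prob", fill = 0)
out0
```

With fill = 0, a label that fasttext did not return for a given line shows up as 0 rather than NA, which may or may not be what you want: a 0 reads as "probability zero", whereas NA reads as "not among the top k labels returned".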