R - fasttext how to load output into a dataframe from command line
I am working on a project in R that calls fasttext from the command line, but I am not sure how to load the output fasttext gives me into a data frame.
> data.train<-data.frame(index=c(rep("__label__1",3),rep("__label__2",3)),country=c("ENGLAND","BRITAIN","UNITED KINDOM","USA","AMERICA","UNITED STATES"))
> data.train
index country
1 __label__1 ENGLAND
2 __label__1 BRITAIN
3 __label__1 UNITED KINDOM
4 __label__2 USA
5 __label__2 AMERICA
6 __label__2 UNITED STATES
> data.test<-c("EGLND","MURICA")
> data.test
[1] "EGLND" "MURICA"
> write.table(data.train,"data.train.txt",sep="\t",quote=FALSE,row.names=FALSE,col.names=FALSE)
>
> write.table(data.test,"data.test.txt",sep="\t",quote=FALSE,row.names=FALSE,col.names=FALSE)
>
> system("fasttext supervised -input data.train.txt -output model_data")
Read 0M words
Number of words: 8
Number of labels: 2
Progress: 0.0% words/sec/thread: 103000 lr: 0.100000 loss: 0.672343 eta: -596523h-14m Progress: 100.0% words/sec/thread: 103000 lr: 0.000000 loss: 0.672343 eta: 0h0m
Saving model file.
> system("fasttext predict-prob model_data.bin data.test.txt 2")
__label__1 0.5 __label__2 0.498047
__label__1 0.5 __label__2 0.498047
> res<-system("fasttext predict-prob model_data.bin data.test.txt 2", intern=TRUE)
> res
[1] "__label__1 0.5 __label__2 0.498047" "__label__1 0.5 __label__2 0.498047"
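The original system call only printed the fasttext output to the console, which was the problem. Following a suggestion in the comments, intern=TRUE lets me save the output to the variable res, but that variable is just a character vector of strings. What I actually need is a data frame with the probability for each label, like this: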
> want
__label__1 __label__2
1 0.5 0.498047
2 0.5 0.498047
This question answers a similar problem, but in Python; I need to do this in R.
Assuming you have already obtained the character vector res using system(..., intern = TRUE), you can try the following.
res3 <- c("__label__1 0.500768 __label__2 0.499252",
"__label__2 0.500768 __label__1 0.499252",
"__label__3 1")
library(data.table)
x <- fread(text = res3, fill = TRUE)
# rename the columns in "variable"/"value" pairs and add a row indicator
setnames(x, paste0(rep(c("var_", "val_"), length.out = ncol(x)),
                   rep(seq_len(ncol(x) / 2), each = 2)))[, row := .I][]
# melt the data into a long form and cast it into a wide form
out <- melt(x, measure = patterns("var_", "val_"), na.rm = TRUE)[
, dcast(.SD, row ~ value1, value.var = "value2")]
out
# row __label__1 __label__2 __label__3
# 1: 1 0.500768 0.499252 NA
# 2: 2 0.499252 0.500768 NA
# 3: 3 NA NA 1
If you want to replace NA with 0 in the output, you can add fill = 0 to dcast.
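If you would rather not depend on an extra package, the same reshaping can be sketched in base R. This is only a minimal sketch: the helper name parse_fasttext_probs and its fill argument are hypothetical (they are not part of fasttext or data.table), and it assumes every line consists of whitespace-separated label/probability pairs, as in the predict-prob output shown in the question.

```r
# Hypothetical helper (not part of fasttext or data.table):
# turn lines of "label prob label prob ..." into a wide data frame,
# one row per input line and one column per label.
parse_fasttext_probs <- function(lines, fill = NA_real_) {
  # split every line on whitespace into alternating label/probability tokens
  tokens <- strsplit(trimws(lines), "[[:space:]]+")

  # split one token vector into its labels (odd positions)
  # and probabilities (even positions); an unpaired trailing token is ignored
  split_pairs <- function(tk) {
    idx <- seq_len(length(tk) %/% 2)
    list(labels = tk[idx * 2 - 1], probs = as.numeric(tk[idx * 2]))
  }
  pairs <- lapply(tokens, split_pairs)

  # every label seen, in order of first appearance
  labels <- unique(unlist(lapply(pairs, `[[`, "labels")))

  # start from a matrix of the fill value and write each row's probabilities
  out <- matrix(fill, nrow = length(lines), ncol = length(labels),
                dimnames = list(NULL, labels))
  for (i in seq_along(pairs)) {
    out[i, pairs[[i]]$labels] <- pairs[[i]]$probs
  }
  as.data.frame(out)
}

# Sample input copied from the question; in practice this would be the
# vector returned by system(..., intern = TRUE).
res <- c("__label__1 0.5 __label__2 0.498047",
         "__label__1 0.5 __label__2 0.498047")
want <- parse_fasttext_probs(res)
```

Labels that are missing from a line are left as NA by default; calling parse_fasttext_probs(res, fill = 0) puts 0 there instead, which mirrors the fill = 0 option of dcast.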