Detecting and retrieving text from a column based on language model in R

Question

I am using googleLanguageR to automatically detect the language of the text in a text column of a data frame. For a single sentence, I do the following:

library(googleLanguageR)
gl_auth("credential.json")

gl_translate_detect(df[[45, 'text']])

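Here text is a column in the data frame df, and 45 is the row number whose language I want to detect. "credential.json" is the private key file for the Google API.

This returns the detected language for that row. However, I want to apply it to the whole text column, which contains a mix of English and German text, and separate the two languages.

I tried the following: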

gl_translate_detect(df[['text']])

But this gives me:

Error in nchar(string) : invalid multibyte string, element 13

My idea is to supply a corpus in order to detect the underlying language across the data frame.

Answer 1

It is probably not vectorized. We can use rowwise:

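library(dplyr)
df %>%
   rowwise() %>%
   # keep only the language code (assumes the result has a `language` column);
   # rows that raise an error get NA instead of aborting the whole pipeline
   mutate(out = tryCatch(gl_translate_detect(text)$language, 
     error = function(e) NA_character_)) %>%
   ungroup()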
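
The `invalid multibyte string, element 13` message is raised by `nchar()` when an element is not valid in its declared encoding, so a row that falls back to `NA` above may simply hold badly encoded text. Below is a minimal base-R sketch for finding and repairing such elements before calling the API. The sample data frame is made up, and treating the bad bytes as Latin-1 is an assumption that would need checking against the real data.

```r
# Hypothetical sample data; the second string carries a raw Latin-1 byte
# (0xFC, "ü"), which is not valid UTF-8 on its own.
df <- data.frame(
  text = c("This is English.", "Das ist ein M\xfcller-Text."),
  stringsAsFactors = FALSE
)

# validUTF8() flags the elements that are not valid UTF-8.
bad <- !validUTF8(df$text)
which(bad)

# Re-encode only the offending elements, ASSUMING they are Latin-1.
df$text[bad] <- iconv(df$text[bad], from = "latin1", to = "UTF-8")

# nchar() can now process the whole column.
nchar(df$text)
```

If the true source encoding is unknown, `iconv(x, "UTF-8", "UTF-8", sub = "")` drops the invalid bytes instead of converting them, at the cost of losing those characters.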

Or use lapply to loop over each element of the 'text' column and apply the function:

lapply(df$text, gl_translate_detect)
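
`lapply` returns a list with one result per element, so the language codes still have to be collected before the English and German rows can be separated. Here is a rough sketch of that step. `fake_detect()` is a hypothetical stand-in so the example runs without credentials, and it assumes the real `gl_translate_detect()` result exposes a `language` column.

```r
# Hypothetical stand-in for gl_translate_detect(); it only mimics the
# ASSUMED `language` column of the real result, using a crude keyword check.
fake_detect <- function(x) {
  german <- grepl("\\b(das|ist|ein|und)\\b", tolower(x), perl = TRUE)
  data.frame(language = ifelse(german, "de", "en"), stringsAsFactors = FALSE)
}

# Made-up sample data with mixed English and German text.
df <- data.frame(
  text = c("This is a test.", "Das ist ein Test.", "Another sentence."),
  stringsAsFactors = FALSE
)

# One call per element; keep only the language code, NA on failure.
res <- lapply(df$text, function(x) {
  tryCatch(fake_detect(x)$language[1], error = function(e) NA_character_)
})
df$lang <- unlist(res)

# Separate the rows by detected language.
by_lang <- split(df, df$lang)
by_lang$en
by_lang$de
```

With the real API, `fake_detect` would be swapped for `gl_translate_detect`. Note that `split()` drops rows whose `lang` is `NA`, so failed detections have to be inspected separately.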
