Tokenizing Japanese text in R: Only first line of the specified column is tokenized

Question

I am trying to tokenize a set of tweets with the Japanese tokenizer RMeCab, specifically the function RMeCabDF (for data frames).

The documentation describes the following usage:

RMeCabDF

Description

RMeCabDF takes data frames as the first argument, and analyzes the columns specified by the second argument. Blank data should be replaced with NA. If 1 is designated as the third argument, it returns each morpheme in its basic form.

Usage

RMeCabDF(dataf, coln, mypref, dic = "", mecabrc = "", etc = "")

Arguments

dataf data.frame

coln Column number or name which include Japanese sentences

mypref Default being 0, the same morphemic forms that appear on the text are returned. If 1 is designated, the basic forms of them are instead.

dic to specify user dictionary, e.x. ishida.dic

mecabrc not implemented (to specify mecab resource file)

etc other options to mecab

Following this, I used the code below to tokenize column number 89 of the data frame trump_ja:

trump_ja_tokens <- RMeCabDF(trump_ja, coln = 89)

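This results in a List of 1, but as you can see, the data frame has 989 rows.

Where did my other rows go?

Do I have to tokenize row by row? If so, is there any way to automate this so that I don't have to type 1000 lines of code (or generate 1000 lines of code with Excel)?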

Answer 1

You can use the RMeCab tokenizer with tidytext, as this user did. You can set it up like this:

library(dplyr)     # provides the %>% pipe
library(tidytext)  # provides unnest_tokens()

df %>%
    unnest_tokens(word, text, token = RMeCab::RMeCabC)

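Here df is your data frame, word is the new column you want to create, and text is the existing column containing the text you want to tokenize. The token argument of unnest_tokens() can take a function for cases like this.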
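If you would rather stay with RMeCab alone, a row-by-row loop can be automated with lapply() instead of being typed out 1000 times. The following is only a minimal sketch: tokenize_rows is a hypothetical helper (not part of RMeCab), and the tweets data frame is made up for illustration. It extracts the column with [[ ]], which returns a plain vector for both data.frame and tibble. One possible cause of the "List of 1" result, which is a guess and not confirmed in this thread, is that trump_ja is a tibble, where trump_ja[, 89] is a one-column tibble of length 1 rather than a vector of 989 strings.

```r
# Hypothetical helper (not part of RMeCab): tokenize every row of one column.
# `tokenizer` is any function that takes a single string and returns its tokens.
tokenize_rows <- function(dataf, coln, tokenizer) {
  # [[ ]] extracts the column as a plain vector for both data.frame and tibble
  texts <- as.character(dataf[[coln]])
  lapply(texts, function(txt) {
    if (is.na(txt)) return(NA_character_)  # keep missing rows as NA
    unlist(tokenizer(txt))                 # flatten the result to a character vector
  })
}

# Usage with RMeCab; runs only if the package (and MeCab itself) is installed.
if (requireNamespace("RMeCab", quietly = TRUE)) {
  # Made-up example data, standing in for the real tweets
  tweets <- data.frame(
    text = c("すもももももももものうち", NA, "今日は良い天気です"),
    stringsAsFactors = FALSE
  )
  tokens <- tokenize_rows(tweets, "text", RMeCab::RMeCabC)
  length(tokens)  # one list element per row of `tweets`
}
```

For the data in the question, the call would presumably be tokenize_rows(trump_ja, 89, RMeCab::RMeCabC), which returns one list element per row.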
