Using strsplit results in terms with quotation marks in R
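I have a large set of data imported from Excel. I want to build a term frequency table for the data set, but when I use strsplit, the result includes quotation marks and other punctuation, which gives wrong counts.
There is a small mistake in the way I am using strsplit, and I need help because I cannot figure it out myself.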
library(readxl)
df = read_excel("C:/Users/B M Consulting/Documents/Book2.xlsx", col_types=c("text","numeric"), range=cell_cols("A:B"))
vect <- c(df[1])
vectsplit <- strsplit(tolower(vect), "\\s+")
vectlev <- unique(unlist(vectsplit))
vecttermf <- sapply(vectsplit, function(x) table(factor(x, levels=vectlev)))
The vector vect looks like this:
[1] "3 inch c clamp" "baby vice" "baby vice bench" "baby vise"
[5] "bench" "bench vice" "bench vice clamp" "bench vise"
[9] "bench voice" "bench wise" "bench wise heavy" "bench wise table"
[13] "box for tools" "c clamp" "c clamp set" "c clamps"
[17] "carpenter tools" "carpenter tools low price" "cast iron pipe" "clamp"
[21] "clamp set" "clamps woodworking" "g clamp" "g clamp set 3 inch"
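I need to get each word out separately. When I use strsplit, the result includes all the punctuation.
Below is a small part of the vectsplit I get. It includes quotation marks, backslashes, and commas that I do not want.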
[1] "c(\"3" "inch" "c" "clamp\"," "\"baby" "vice\"," "\"baby" "vice"
[9] "bench\"," "\"baby" "vise\"," "\"bench\"," "\"bench" "vice\"," "\"bench" "vice"
[17] "clamp\"," "\"bench" "vise\"," "\"bench" "voice\"," "\"bench" "wise\"," "\"bench"
[25] "wise" "heavy\"," "\"bench" "wise" "table\"," "\"box" "for" "tools\","
[33] "\"c" "clamp\"," "\"c" "clamp" "set\"," "\"c" "clamps\"," "\"carpenter"
[41] "tools\"," "\"carpenter" "tools" "low" "price\"," "\"cast" "iron" "pipe\","
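If you check the class of vect, you will see that it is not a character vector but a list. tolower() coerces that list with as.character(), which deparses the whole column into a single string starting with c(, and that is where the quotes, backslashes, and commas come from.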
vect<-c(df[1])
class(vect)
> "list"
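To make the cause visible, here is a minimal sketch that reproduces the problem without the Excel file. The data frame and its column name `term` are made up for illustration; only the `c(df[1])` and `tolower()` calls come from the question.

```r
# Hypothetical stand-in for the imported sheet (the column name `term` is an assumption)
df <- data.frame(term = c("3 inch c clamp", "baby vice"), stringsAsFactors = FALSE)

vect_list <- c(df[1])            # a list holding one character vector
flattened <- tolower(vect_list)  # tolower() coerces the list with as.character()

class(vect_list)   # "list"
length(flattened)  # 1: the whole column has been deparsed into one string
print(flattened)   # the string starts with c( and contains escaped quotes and commas
```

Splitting that single string on whitespace is what yields pieces such as "c(\"3" and "clamp\",".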
If you define vect as follows, the problem goes away:
vect<-df[[1]]
class(vect)
> "character"
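If you define vect this way and then call strsplit, it should work fine. Keep in mind that the two kinds of subsetting return objects of different classes: df[1] gives a one-column data frame (which c() turns into a list), while df[[1]] extracts the column itself as a character vector.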
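As a rough end-to-end sketch, the corrected pipeline could look like the following. The small data frame stands in for the Excel import and its column name `term` is an assumption; the `wordfreq` line is an extra that was not in the question, added only to show an overall count.

```r
# Hypothetical stand-in for the imported sheet (the column name `term` is an assumption)
df <- data.frame(term = c("bench vice", "bench vice clamp", "c clamp"),
                 stringsAsFactors = FALSE)

vect <- df[[1]]                                # character vector, not a list
vectsplit <- strsplit(tolower(vect), "\\s+")   # one vector of words per input string
vectlev <- unique(unlist(vectsplit))           # every distinct word

# Per-string counts: one row per word, one column per input string
vecttermf <- sapply(vectsplit, function(x) table(factor(x, levels = vectlev)))

# Overall word frequency across all strings (not part of the original code)
wordfreq <- table(unlist(vectsplit))
print(wordfreq)
```

With `[[1]]` the words come out clean, so `vectlev` contains plain terms such as "bench" and "clamp" with no stray quotes or commas.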