R programming strsplit(): Undesired result
I want to split some text, and I am following Example 1:
Example 1:
> x <- "Split the words in a sentence."
> strsplit(x, " ")
[[1]]
[1] "Split" "the" "words" "in"
[5] "a" "sentence."
So I am trying to split NewString the same way:
> NewString
[1] "s14 v13 s13 s13 v12 s12 v11 s11 v10 s10 s10 v09 s09 v08 s08 v07 s07 v06 s06 v05 s05 v04 s04 v03 s03 v02 s02 s01 v00 "
> strsplit(NewString,' ')
[[1]]
[1] "s14 v13 s13 s13 v12 s12 v11 s11 v10 s10 s10 v09 s09 v08 s08 v07 s07 v06 s06 v05 s05 v04 s04 v03 s03 v02 s02 s01 v00 "
The function does not split the text. The strange thing is that if I copy the printed output of NewString and paste it into strsplit(), it works:
> strsplit("s14 v13 s13 s13 v12 s12 v11 s11 v10 s10 s10 v09 s09 v08 s08 v07 s07 v06 s06 v05 s05 v04 s04 v03 s03 v02 s02 s01 v00 ", ' ')
[[1]]
[1] "s14" "v13" "s13" "s13" "v12" "s12" "v11" "s11" "v10" "s10" "s10" "v09" "s09"
[14] "v08" "s08" "v07" "s07" "v06" "s06" "v05" "s05" "v04" "s04" "v03" "s03" "v02"
[27] "s02" "s01" "v00"
What could the problem be?
(NewString was produced using the rvest package.)
Edit:
charToRaw gives the following output:
> charToRaw(lol)
[1] 73 31 34 c2 a0 76 31 33 c2 a0 73 31 33 c2 a0 73 31 33 c2 a0 76 31 32 c2 a0
[26] 73 31 32 c2 a0 76 31 31 c2 a0 73 31 31 c2 a0 76 31 30 c2 a0 73 31 30 c2 a0
[51] 73 31 30 c2 a0 76 30 39 c2 a0 73 30 39 c2 a0 76 30 38 c2 a0 73 30 38 c2 a0
[76] 76 30 37 c2 a0 73 30 37 c2 a0 76 30 36 c2 a0 73 30 36 c2 a0 76 30 35 c2 a0
[101] 73 30 35 c2 a0 76 30 34 c2 a0 73 30 34 c2 a0 76 30 33 c2 a0 73 30 33 c2 a0
[126] 76 30 32 c2 a0 73 30 32 c2 a0 73 30 31 c2 a0 76 30 30 c2 a0
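The separators in NewString are not ordinary spaces: the byte pair c2 a0 in the charToRaw output is the UTF-8 encoding of U+00A0, the no-break space, so splitting on ' ' finds nothing to split on. This can be handled using the stringi package and stri_split.
First, let's create a string separated by the same character (194 and 160 are the decimal values of the bytes C2 and A0):
> s <- rawToChar(as.raw(c(65, 66, 48, 194, 160, 65, 67, 49, 194, 160, 65, 68, 50)))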
> s
[1] "AB0 AC1 AD2"
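The printed string looks as if it contained plain spaces, so it can help to look at the code points instead. The following is only a small sketch in base R: it rebuilds the same test string and marks it as UTF-8 (an assumption, based on the c2 a0 byte pairs), then uses utf8ToInt to list the code points. The variable names code_points, nbsp_positions and has_plain_space are made up for illustration.

```r
# Rebuild the test string from its raw bytes (values in decimal).
s <- rawToChar(as.raw(c(65, 66, 48, 194, 160, 65, 67, 49, 194, 160, 65, 68, 50)))
Encoding(s) <- "UTF-8"  # assumption: the bytes are UTF-8, as the c2 a0 pairs suggest

# One integer per character: 160 is U+00A0 (no-break space), 32 would be a plain space.
code_points <- utf8ToInt(s)
print(code_points)

# Where the no-break spaces sit, and whether any ordinary space is present at all.
nbsp_positions <- which(code_points == 160L)
has_plain_space <- any(code_points == 32L)
print(nbsp_positions)
print(has_plain_space)
```

If the separator shows up as 160 rather than 32, a split on ' ' cannot match it.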
Plain str_split does not work:
> str_split(s, "\\s+")
[[1]]
[1] "AB0 AC1 AD2"
But with stringi installed:
> stri_split(s, regex = "\\s+")
[[1]]
[1] "AB0" "AC1" "AD2"
I suspect stringi has a broader notion of whitespace (\s).
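If installing an extra package is not an option, base R can probably do the job too, provided the separator is named explicitly. The sketch below is not the approach from the answer above, just an alternative under the assumption that the only odd character is U+00A0. new_string is a hypothetical stand-in for NewString, built with the same trailing separator.

```r
nbsp <- "\u00a0"  # U+00A0 (no-break space), encoded in UTF-8 as the bytes c2 a0

# Hypothetical stand-in for NewString: each token is followed by a no-break space.
new_string <- paste0(c("s14", "v13", "s13"), nbsp, collapse = "")

# Option 1: split directly on the no-break space, matched literally.
parts_fixed <- strsplit(new_string, nbsp, fixed = TRUE)[[1]]

# Option 2: replace no-break spaces with ordinary spaces, then split as in Example 1.
normalised <- gsub(nbsp, " ", new_string, fixed = TRUE)
parts_normalised <- strsplit(normalised, " ")[[1]]

print(parts_fixed)
print(parts_normalised)
```

Option 2 is handy when the cleaned string is needed for later steps as well, since everything downstream then sees ordinary spaces.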