Split two words connected by a dot
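I have a large data frame of news articles. I noticed that in some articles two words are joined by a dot, as in this example: The government.said it was important to quit.
I am going to do some topic modeling, so I need to separate every word.
This is the code I am using to separate those words: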
#String example
test <- c("i need.to separate the words connected by dots. however, I need.to keep having the dots separating sentences")
#Code to separate the words
test <- do.call(paste, as.list(strsplit(test, "\\.")[[1]]))
#This is what I get
> test
[1] "i need to separate the words connected by dots however, I need to keep having the dots separating sentences"
As you can see, this removes every dot (period) from the text. How can I get the following result instead:
"i need to separate the words connected by dots. however, I need to keep having the dots separating sentences"
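Final note
My data frame has 17,000 articles; all of the text is lowercase. I have only given a small example of the problem I run into when trying to separate two words joined by a dot. Also, is there any way to use strsplit on a list?
You can use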
test <- c("i need.to separate the words connected by dots. however, I need.to keep having the dots separating sentences. Look at http://google.com for s.0.m.e more details.")
# Replace each dot that is in between word characters
gsub("\\b\\.\\b", " ", test, perl=TRUE)
# Replace each dot that is in between letters
gsub("(?<=\\p{L})\\.(?=\\p{L})", " ", test, perl=TRUE)
# Replace each dot that is in between word characters, but not in URLs
gsub("(?:ht|f)tps?://\\S*(*SKIP)(*F)|\\b\\.\\b", " ", test, perl=TRUE)
Output:
[1] "i need to separate the words connected by dots. however, I need to keep having the dots separating sentences. Look at http://google com for s 0 m e more details."
[1] "i need to separate the words connected by dots. however, I need to keep having the dots separating sentences. Look at http://google com for s.0.m e more details."
[1] "i need to separate the words connected by dots. however, I need to keep having the dots separating sentences. Look at http://google.com for s 0 m e more details."
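Details
The patterns below are written as plain regular expressions; inside an R string literal every backslash must be doubled, e.g. "\\b\\.\\b".
\b\.\b
matches a dot enclosed by word boundaries, i.e. a dot that has a word character (a letter, digit or underscore) immediately before it and immediately after it.
(?<=\p{L})\.(?=\p{L})
matches a dot that is both preceded and followed by a letter ((?<=\p{L}) is a positive lookbehind and (?=\p{L}) is a positive lookahead).
(?:ht|f)tps?://\S*(*SKIP)(*F)|\b\.\b
matches http/ftp or https/ftps, then ://, then zero or more non-whitespace characters, and then discards that match: (*SKIP)(*F) makes the match fail and resumes the search from the position where the SKIP PCRE verb was reached, so dots inside URLs are left alone. Any other dot between word characters is matched by the \b\.\b alternative.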
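The question also asks how to apply this to the whole data frame and whether strsplit works on more than one string. The following is a minimal sketch, not a definitive solution: the data frame `articles` and its column `text` are hypothetical stand-ins for the real data. It relies on the fact that both gsub and strsplit are vectorized over a character vector, so no explicit loop is needed.

```r
# Hypothetical stand-in for the real 17,000-article data frame;
# the names `articles` and `text` are assumptions for illustration.
articles <- data.frame(
  text = c("the government.said it was important to quit.",
           "i need.to keep the dots. see http://google.com for details."),
  stringsAsFactors = FALSE
)

# URL-aware pattern from the answer above (backslashes doubled for R).
pattern <- "(?:ht|f)tps?://\\S*(*SKIP)(*F)|\\b\\.\\b"

# gsub() processes every element of the column in one call.
articles$clean <- gsub(pattern, " ", articles$text, perl = TRUE)

# strsplit() on a character vector returns a list with one
# character vector of whitespace-separated tokens per article.
tokens <- strsplit(articles$clean, "\\s+")
```

If the texts are stored in a list rather than a character vector, converting with unlist() first, or calling gsub through lapply(), should give the same result.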