Regex pattern to count lines in poems with randomly \n or \n\n as line breaks

Question

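I need to count the lines in 221 poems, and I tried doing so by counting the line breaks (\n).

However, some lines end with a double line break (\n\n) that starts a new verse. I want each of these to count as a single line break. The number and position of the double line breaks vary randomly from poem to poem.

Minimal working example: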

library("quanteda")

poem1 <- "This is a line\nThis is a line\n\nAnother line\n\nAnd another one\nThis is the last one"
poem2 <- "Some poetry\n\nMore poetic stuff\nAnother very poetic line\n\nThis is the last line of the poem"

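# corpus() takes one character vector of texts; a second positional argument would be read as docnames
poems <- quanteda::corpus(c(poem1, poem2))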

The resulting line count should be 5 for poem1 and 4 for poem2.

I tried stringi::stri_count_fixed(texts(poems), pattern = "\n"), but that pattern is too simple to handle the random double line breaks.

Answer 1

You can use stringr::str_count with the \R+ pattern to count the runs of consecutive line breaks in a string:

> poem1 <- "This is a line\nThis is a line\n\nAnother line\n\nAnd another one\nThis is the last one"
> poem2 <- "Some poetry\n\nMore poetic stuff\nAnother very poetic line\n\nThis is the last line of the poem"
> library(stringr)
> str_count(poem1, "\\R+")
[1] 4
> str_count(poem2, "\\R+")
[1] 3

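So the line count is str_count(x, "\\R+") + 1, provided the text does not start or end with a line break.

The \R pattern matches any line break sequence, such as CRLF, LF, or CR. \R+ matches one or more such sequences in a row, so a double line break (\n\n) between verses is counted only once. In an R string literal the backslash must be doubled, hence "\\R+".

See the R code demo: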
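To illustrate the difference between \R+ and a plain "\n" count, here is a small sketch. The sample string crlf_poem is hypothetical (it is not from the question) and uses Windows-style CRLF line endings; stringr is assumed to be installed.

```r
library(stringr)

# Hypothetical sample text with CRLF line endings and one blank line
crlf_poem <- "First line\r\nSecond line\r\n\r\nThird line"

# Each run of consecutive line breaks counts as a single match
str_count(crlf_poem, "\\R+")
# => [1] 2

# Counting literal "\n" characters counts the blank line as an extra break
str_count(crlf_poem, fixed("\n"))
# => [1] 3
```

With \R+ the text yields 2 breaks and therefore 3 lines, which is the intended count.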

poem1 <- "This is a line\nThis is a line\n\nAnother line\n\nAnd another one\nThis is the last one"
poem2 <- "Some poetry\n\nMore poetic stuff\nAnother very poetic line\n\nThis is the last line of the poem"
library(stringr)
# Number of line break runs in each poem
str_count(poem1, "\\R+")
# => [1] 4
str_count(poem2, "\\R+")
# => [1] 3
## Line counts:
str_count(poem1, "\\R+") + 1
# => [1] 5
str_count(poem2, "\\R+") + 1
# => [1] 4
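Since the "+ 1" rule overcounts when a poem ends (or starts) with a line break, a slightly more defensive version might look like the sketch below. The helper name count_lines is hypothetical and not part of stringr; it assumes that leading and trailing whitespace should be ignored and that an empty text has 0 lines.

```r
library(stringr)

# Hypothetical helper: count the lines in each element of a character vector
count_lines <- function(x) {
  # Drop leading/trailing whitespace, including stray line breaks
  x <- str_trim(x)
  # Lines = runs of line breaks + 1, except for empty texts
  ifelse(x == "", 0L, str_count(x, "\\R+") + 1L)
}

poem1 <- "This is a line\nThis is a line\n\nAnother line\n\nAnd another one\nThis is the last one"
poem2 <- "Some poetry\n\nMore poetic stuff\nAnother very poetic line\n\nThis is the last line of the poem"

count_lines(c(poem1, poem2))
# => [1] 5 4
```

Because str_count is vectorized, the same call should work on a character vector holding all 221 poems. Note that a "blank" line containing only spaces would still split one run of line breaks into two, so this sketch does not cover that case.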

regex

nlp

r

quanteda

data-science