Remove all sentences where the digit-to-character ratio is greater than the average in the text

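Is it possible to find and remove all sentences that contain a high proportion of digits? I wrote the following function to compute the ratio for a given string: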

a <- "1aaaaaa2bbbbbbb3"

Num_Char_Ration <- function(string){
length(unlist(regmatches(string,gregexpr("[[:digit:]]",string))))/nchar(as.character(string))
}
Num_Char_Ration(a)
#0.1875

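The task now is to compute this ratio per sentence (that is, for each character sequence ending in "."), and then remove the sentences with a higher ratio from the text. For example: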

input:
a <- " aa111111. bbbbbb22. cccccc3." 
output:
#"bbbbbb22. cccccc3."

You first need to split the long string into individual words (for example with strsplit()).

Data:

words <- c("aa111111.","bbbbbb22.","cccccc3.")
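The vector above does not have to be typed by hand. As a minimal sketch, assuming the sentences in the question's input are separated by whitespace, it could be derived with strsplit() like this (the name `input` is only illustrative):

```r
# Input string from the question
input <- " aa111111. bbbbbb22. cccccc3."

# Trim the leading space, then split on runs of whitespace
words <- strsplit(trimws(input), "\\s+")[[1]]
words
```

This only works when the sentences themselves contain no spaces, as in the example; real text would need to be split on the period instead.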

Code:

library(magrittr)
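fun1 <- function(x) {
    # Count digits by deleting every non-digit character
    num <- gsub("\\D", "", x) %>% nchar
    # Count letters by deleting every non-letter character
    char <- gsub("[^A-Za-z]", "", x, perl = TRUE) %>% nchar

    # Keep the word only if it has no more digits than letters
    if (num <= char) x else NULL
}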

sapply(words,fun1) %>% unlist %>% unname

Result:

#[1] "bbbbbb22." "cccccc3." 

I would use the stringr package to count digits and characters:

# Original data
input <- " aa111111. bbbbbb22. cccccc3." 
# Split by "." (escaped, because "." is a regex metacharacter)
inputSplit <- strsplit(input, "\\.")[[1]]

# Count digits and all alphanumeric characters in each split string
counts <- sapply(inputSplit, stringr::str_count, c("[[:digit:]]", "[[:alnum:]]"))

# Get ratios and collapse text back
paste(inputSplit[counts[1, ] / counts[2, ] < 0.5], collapse = ".")
# [1] " bbbbbb22. cccccc3"

counts looks like this:

# To get ratio between digits and string
# Divide first row by second row
      aa111111  bbbbbb22  cccccc3
[1,]         6         2        1
[2,]         8         8        7

# Simplified num to char ratio function
Num_Char_Ration <- function(x) {
  lengths(regmatches(x, gregexpr("[0-9]", x))) / nchar(x)
}

clear_nmbstring <- function(x) {
  x <- strsplit(x, ".", fixed = TRUE)[[1]]
  cleanx <- trimws(x)
  x <- x[Num_Char_Ration(cleanx) < 0.5]
  paste(x, collapse = ".")
}

# Example:
string <- c(" aa111111. bbbbbb22. cccccc3.")
clear_nmbstring(string)
# [1] " bbbbbb22. cccccc3"

Here is how I would do it in base R, adapted from Andre's code.

my_string <- " aa111111. bbbbbb22. cccccc3." 

# Split paragraph into sentences on the whitespace that follows a '.'
my_string <- unlist(strsplit(my_string, '(?<=\\.)\\s+', perl = TRUE))
# Remove sentences with more digits than letters
my_string <- subset(my_string, nchar(gsub("\\D", "", my_string)) <= nchar(gsub("[^A-Za-z]", "", my_string, perl = TRUE)))

my_string
##[1] "bbbbbb22." "cccccc3." 

If you want to combine these sentences back into a paragraph, you can use

paste(my_string,collapse=" ")
##[1] "bbbbbb22. cccccc3."

Here is a simple base R solution:

x <- strsplit(input, "\\.")[[1]]
# Keep a sentence only if fewer than half of its characters are digits
x <- x[nchar(x) < 2 * nchar(gsub("\\d", "", x))]
paste(x,collapse=".")
# [1] " bbbbbb22. cccccc3"
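
The answers above use a fixed cutoff (0.5, or "no more digits than letters"), whereas the title asks for the average ratio in the text. A minimal base R sketch of that variant follows, assuming "average" means the mean of the per-sentence ratios and that sentences end with "."; the function name `remove_digit_heavy` is hypothetical:

```r
# Hypothetical helper: drop sentences whose digit ratio exceeds the mean ratio
remove_digit_heavy <- function(text) {
  # Split on literal "." and trim surrounding whitespace
  sentences <- trimws(strsplit(text, ".", fixed = TRUE)[[1]])
  sentences <- sentences[nzchar(sentences)]
  if (length(sentences) == 0) return("")

  # Digits per character for each sentence (the period is not counted)
  ratio <- nchar(gsub("[^0-9]", "", sentences)) / nchar(sentences)

  # Assumption: the threshold is the mean of the per-sentence ratios
  kept <- sentences[ratio <= mean(ratio)]

  # Re-attach the trailing period and join with single spaces
  paste0(kept, ".", collapse = " ")
}

remove_digit_heavy(" aa111111. bbbbbb22. cccccc3.")
# [1] "bbbbbb22. cccccc3."
```

For the example input the ratios are 0.75, 0.25 and about 0.14, so the mean is about 0.38 and only the first sentence is removed.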