Find characters before and after dollar amount in vector of text data in R

Question

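I have a vector of text data (news data). I am trying to scan the text for any dollar amounts, along with the text surrounding each amount. I managed this for the first element of my vector, but I am struggling to repeat the process for all the data using loops and lists. I use str_extract_currencies from strex, which does a good job of detecting the numbers. It could probably be done with regular expressions, but I don't know how.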

library(dplyr)    # %>% and filter()
library(strex)    # str_extract_currencies()
library(stringr)  # str_extract()

textdata <- data.frame(document = c(1,2),
                       txt = c("Outplay today announced its $7.3M series A fundraise from Sequoa Capital India. ..., which is poised to be a $5.59B market by 2023, is a huge opportunity for Outplay.", "India's leading digital care ecosystem for chronic condition management – has raised USD 5.7 million in funding led by US-based venture capital firm, W Health Ventures. The funding also saw participation from e-pharmacy Unicorn PharmEasy (a Threpsi Solutions Pvt Ltd brand), Merisis VP and existing investors Orios VP, Leo Capital, and others. With around 463 million people with diabetes and $1.13  billion with hypertension across the world"))

numbers <- str_extract_currencies(textdata$txt[1]) %>% 
  filter(curr_sym == '$')

for (i in 1:nrow(numbers)){
  print( stringr::str_extract(textdata$txt[1], paste0(".{0,20}", numbers$amount[i], ".{0,20}")))
}

finaldata <- data.frame(document = c(1,1,2),
                        money_related = c("oday announced its $7.3M series A fundraise",
                                          " is poised to be a $5.59B market by 2023, is",
                                          "with diabetes and $1.13  billion with hyper"))

A document may contain zero or more amounts. I would like to store the result in a data.frame like this:

> finaldata
  document                                money_related
1        1  oday announced its $7.3M series A fundraise
2        1  is poised to be a $5.59B market by 2023, is
3        2  with diabetes and $1.13  billion with hyper

Many thanks.

Answer 1

Just wrap your function in lapply:

library(dplyr)
library(strex)
library(stringr)

textdata <- data.frame(document = c(1,2),
                    txt = c("Outplay today announced its $7.3M series A fundraise from Sequoa Capital India. ..., which is poised to be a $5.59B market by 2023, is a huge opportunity for Outplay.", "India's leading digital care ecosystem for chronic condition management – has raised USD 5.7 million in funding led by US-based venture capital firm, W Health Ventures. The funding also saw participation from e-pharmacy Unicorn PharmEasy (a Threpsi Solutions Pvt Ltd brand), Merisis VP and existing investors Orios VP, Leo Capital, and others. With around 463 million people with diabetes and $1.13  billion with hypertension across the world"))


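# nrow(textdata) is 2, so lapply() runs once with x = 2 and textdata[[2]] is the
# txt column; str_extract_currencies() then handles every document in that one call.
numbers <- as.data.frame(lapply(nrow(textdata), function(x){
  return(filter(str_extract_currencies(textdata[[x]]),curr_sym == '$'))
}))
# Replace each full text with up to 20 characters either side of its amount.
numbers$string <- stringr::str_extract(numbers$string, paste0(".{0,20}", numbers$amount, ".{0,20}"))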

> numbers
  string_num                                       string curr_sym amount
1          1  oday announced its $7.3M series A fundraise        $   7.30
2          1  is poised to be a $5.59B market by 2023, is        $   5.59
3          2  with diabetes and $1.13  billion with hyper        $   1.13
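The pattern above is built by pasting the numeric amount into a regular expression, so the "." in an amount acts as a wildcard and the search returns the first place those digits occur, not necessarily the position of the dollar sign. As a rough alternative, here is a base R sketch that locates each "$<number>" directly and cuts a window around it with substring(). The helper name dollar_windows and the 20-character width are assumptions for illustration, not part of the answer above.

```r
# Hypothetical helper: for ONE string, return a window of `width` characters
# on each side of every "$<number>" it contains.
dollar_windows <- function(x, width = 20) {
  m <- gregexpr("\\$[0-9][0-9.,]*", x)[[1]]
  if (m[1] == -1) return(character(0))          # no dollar amount found
  starts <- as.integer(m)
  ends <- starts + attr(m, "match.length") - 1
  # Clamp the window to the bounds of the string.
  substring(x, pmax(starts - width, 1), pmin(ends + width, nchar(x)))
}

# Apply it to every document and stack the results into one data frame.
txt <- c("Fees were $5 and then $7.30 more.", "No amounts in this one.")
windows <- lapply(txt, dollar_windows)
result <- data.frame(
  document = rep(seq_along(txt), lengths(windows)),
  money_related = unlist(windows)
)
```

Because each window is cut by position, two amounts that sit close together each get their own row, even though their windows overlap.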

Answer 2

Here is a tidyverse solution without the {strex} package. You will probably need to run it against your real data and add a few other possible cases:

library(tidyverse)

textdata %>% 
  rowwise(document) %>% 
  summarise(txt = str_extract_all(txt, ".{1,20}(\\$|USD)[0-9.]+\\s?[A-z]?.{1,20}")) %>% 
  unnest_longer(txt)

#> `summarise()` has grouped output by 'document'. You can override using the `.groups` argument.
#> # A tibble: 3 x 2
#> # Groups:   document [2]
#>   document txt                                             
#>      <dbl> <chr>                                           
#> 1        1 "today announced its $7.3M series A fundraise " 
#> 2        1 "h is poised to be a $5.59B market by 2023, is "
#> 3        2 "e with diabetes and $1.13  billion with hypert"

Created on 2022-01-21 by the reprex package (v2.0.1)
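One of the "other possible cases" shows up in the sample data itself: "USD 5.7 million" is not returned above, because the pattern expects a digit immediately after "USD". The base R sketch below allows an optional space after the currency marker and an optional magnitude suffix. The function name extract_money_context, the list of suffixes and the default width are assumptions for illustration; adjust them to your real data.

```r
# Hypothetical helper: one row per money mention, with surrounding context.
extract_money_context <- function(document, txt, width = 20) {
  # "$" or "USD", an optional space, a number, then an optional
  # magnitude such as "M", "B", "K", "million" or "billion".
  money <- "(\\$|USD)\\s?[0-9][0-9.,]*\\s*([MBK]\\b|million|billion)?"
  pattern <- paste0(".{0,", width, "}", money, ".{0,", width, "}")
  hits <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))
  data.frame(
    document = rep(document, lengths(hits)),   # documents with no hits drop out
    money_related = unlist(hits),
    stringsAsFactors = FALSE
  )
}

example <- extract_money_context(
  c(1, 2),
  c("Outplay today announced its $7.3M series A fundraise from Sequoa Capital India.",
    "The company has raised USD 5.7 million in funding led by a US-based firm.")
)
```

As with the pattern in the answer, matches cannot overlap, so an amount that falls inside the trailing context of the previous match is not reported separately.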
