Entity extraction based on a customized list in R

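I have a list of texts and a list of entities.

The list of texts is usually a character vector.

The list of entities is a bit more complicated. Some entities can be listed exhaustively, such as a list of the world's major cities. Other entities cannot be listed exhaustively, but can be captured by a regular expression pattern.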
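Both kinds of entity can be expressed as regular expressions, so that one matching routine can serve all of them. A minimal sketch in base R: `dictionary_to_regex` is a hypothetical helper name, and the IP pattern is a loose illustration rather than a validator.

```r
# Hypothetical helper: turn an exhaustive list of terms into one alternation pattern.
# The \\b anchors assume each term starts and ends with a word character.
dictionary_to_regex <- function(terms) {
  # Escape regex metacharacters so each term is matched literally
  escaped <- gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", terms, perl = TRUE)
  # Longest terms first, so a longer name is tried before a shorter one it contains
  escaped <- escaped[order(nchar(escaped), decreasing = TRUE)]
  paste0("\\b(?:", paste(escaped, collapse = "|"), ")\\b")
}

city_pattern <- dictionary_to_regex(c("Copenhagen", "Paris", "New York"))

# Entities that cannot be listed exhaustively are written directly as patterns
ip_pattern <- "\\b\\d{1,3}(?:\\.\\d{1,3}){3}\\b"

text <- "Lorem ipsum 12-01-2021 eat, Copenhagen 133.001.00.00"
city_hits <- regmatches(text, gregexpr(city_pattern, text, perl = TRUE))[[1]]
ip_hits   <- regmatches(text, gregexpr(ip_pattern, text, perl = TRUE))[[1]]

city_hits  # "Copenhagen"
ip_hits    # "133.001.00.00"
```

Escaping matters for terms such as abbreviations containing a dot, which would otherwise match any character at that position.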


list_of_text <- c('Lorem ipsum 12-01-2021 eat, Copenhagen 133.001.00.00 ...', 'Lorem ipsum 12-01-2021, Copenhagen www.whosebug.com swimming', ...)

entity_city <- c('Copenhagen', 'Paris', 'New York', ...)

entity_IP_address <- c('regex code for IP address')

entity_URL <- c('regex code for URL')

entity_verb <- c('verbs')

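Given list_of_text and the list of entities, I want to find the matching entities for each text.

For example, for c('Lorem ipsum 12-01-2021 eat drink sleep, Copenhagen 133.001.00.00 ...'), entity_verb should yield c('eat', 'drink', 'sleep'), entity_IP_address should yield c('133.001.00.00'), and so on.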


res <- extract_entity(text = c('Lorem ipsum 12-01-2021 eat drink sleep, Copenhagen 133.001.00.00 ...'),
                      entities = list(verb = entity_verb, IP = entity_IP_address, city = entity_city))

res[['verb']]
c('eat', 'drink', 'sleep')

res[['IP']]
c('133.001.00.00')

res[['city']]
c('Copenhagen')

Is there an R package I can leverage?

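Take a look at the maps and qdapDictionaries packages. For the world cities, I subset to cities with a population over 1 million; otherwise you get the error 'regular expression is too large'.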

library(maps)
library(qdapDictionaries)

list_of_text  <- c('Lorem ipsum 12-01-2021 eat, Copenhagen 192.41.196.888','192.41.199.888','Lorem ipsum 12-01-2021, Copenhagen www.whosebug.com swimming')
# Loose dotted-quad pattern (does not check that each part is <= 255).
# Backslashes must be doubled inside R string literals.
ipRegex   <- "\\b\\d{1,3}(?:\\.\\d{1,3}){3}\\b"

# gregexpr finds every match in each text; unlist drops texts without a match
unlist(regmatches(x = list_of_text, m = gregexpr(ipRegex, list_of_text, perl = TRUE)))

# Join the verb dictionary into a single alternation: verb1|verb2|...
verbRegex <- paste(unlist(action.verbs), collapse = "|")

# All verb matches across the texts, flattened into one character vector
unlist(regmatches(x = list_of_text, m = gregexpr(verbRegex, list_of_text, perl = TRUE)))

# Alternation of all cities with a population above 1 million
citiesRegex <- paste(world.cities[world.cities$pop > 1000000, 'name'], collapse = "|")

# All city matches across the texts, flattened into one character vector
unlist(regmatches(x = list_of_text, m = gregexpr(citiesRegex, list_of_text, perl = TRUE)))
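
The three extractions above follow the same pattern, so they could be wrapped into one helper resembling the `extract_entity` interface from the question. A minimal sketch in base R, assuming each entity type is supplied as a named regular expression. The function body, its argument names, and the short stand-in patterns are illustrative assumptions, not part of any package.

```r
# Hypothetical helper: for every text, return a named list of matches per entity type.
extract_entity <- function(text, entities) {
  stopifnot(is.character(text), !is.null(names(entities)))
  lapply(text, function(one_text) {
    lapply(entities, function(pattern) {
      # gregexpr locates all matches; regmatches pulls out the matched substrings
      regmatches(one_text, gregexpr(pattern, one_text, perl = TRUE))[[1]]
    })
  })
}

# Short stand-in patterns; in practice these would be built from the dictionaries
entities <- list(
  verb = "\\b(?:eat|drink|sleep)\\b",
  IP   = "\\b\\d{1,3}(?:\\.\\d{1,3}){3}\\b",
  city = "\\b(?:Copenhagen|Paris|New York)\\b"
)

res <- extract_entity(
  text = "Lorem ipsum 12-01-2021 eat drink sleep, Copenhagen 133.001.00.00",
  entities = entities
)

res[[1]][["verb"]]  # "eat" "drink" "sleep"
res[[1]][["IP"]]    # "133.001.00.00"
res[[1]][["city"]]  # "Copenhagen"
```

Because `text` may hold several strings, this sketch returns one inner list per text, hence the `res[[1]]` indexing. An entity type with no match yields `character(0)` rather than being dropped.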