Entities extraction based on customized list in R

Question

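I have a list of texts and a list of entities.

The list of texts is usually a character vector.

The list of entities is a bit more complicated. Some entities can be listed exhaustively, such as a list of the world's major cities. Others cannot be listed exhaustively but can be captured by a regular-expression pattern.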


list_of_text <- c('Lorem ipsum 12-01-2021 eat, Copenhagen 133.001.00.00 ...', 'Lorem ipsum 12-01-2021, Copenhagen www.whosebug.com swimming', ...)

entity_city <- c('Copenhagen', 'Paris', 'New York', ...)

entity_IP_address <- c('regex code for IP address')

entity_URL <- c('regex code for URL')

entity_verb <- c('verbs')

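Given list_of_text and the lists of entities, I want to find the matching entities for each text.

For example, for c('Lorem ipsum 12-01-2021 eat drink sleep, Copenhagen 133.001.00.00 ...'), entity_verb should give c('eat', 'drink', 'sleep'), entity_IP should give c('133.001.00.00'), and so on.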


res <- extract_entity(text = c('Lorem ipsum 12-01-2021 eat drink sleep, Copenhagen 133.001.00.00 ...'),
                      entities = c(entity_verb, entity_IP_address, entity_city))

res[['verb']]
c('eat', 'drink', 'sleep')

res[['IP']]
c('133.001.00.00')

res[['city']]
c('Copenhagen')

Is there an R package I can use for this?

Answer 1

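Take a look at the maps and qdapDictionaries packages. For world cities, I subset to cities with a population over 1 million; otherwise the pattern fails with the error 'regular expression is too large'.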
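If the full city list is needed, one possible way around the size limit is to skip the single large alternation and test each name literally with fixed = TRUE. The following is only a sketch under that assumption: the `cities` and `texts` vectors and the `find_fixed` helper are hypothetical names made up for illustration, not part of either package.

```r
# Hypothetical dictionary; in practice this could be a longer vector of city names
cities <- c("Copenhagen", "Paris", "New York")
texts  <- c("Lorem ipsum, Copenhagen and Paris", "no city here")

# For one text, keep the dictionary entries that occur in it literally.
# fixed = TRUE turns off regex interpretation, so no large pattern is compiled.
find_fixed <- function(text, dictionary) {
  hits <- vapply(dictionary,
                 function(term) grepl(term, text, fixed = TRUE),
                 logical(1))
  unname(dictionary[hits])
}

# One character vector of matched cities per text
city_hits <- lapply(texts, find_fixed, dictionary = cities)
```

Literal matching also finds names inside longer words (for example 'Paris' inside 'Parisian'), so it trades precision for the ability to use an arbitrarily long list.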

library(maps)
library(qdapDictionaries)

list_of_text <- c('Lorem ipsum 12-01-2021 eat, Copenhagen 192.41.196.888',
                  '192.41.199.888',
                  'Lorem ipsum 12-01-2021, Copenhagen www.whosebug.com swimming')

# Four dot-separated groups of digits; octet ranges are not validated.
# Backslashes must be doubled inside an R string literal.
ipRegex <- "\\d+\\.\\d+\\.\\d+\\.\\d+"

# gregexpr() finds every match in every string; unlist() drops strings without a match
unlist(regmatches(x = list_of_text, m = gregexpr(ipRegex, list_of_text, perl = TRUE)))

# Join all verbs into one alternation: "verb1|verb2|..."
# Note: without word boundaries this also matches verbs inside longer words
verbRegex <- paste(unlist(action.verbs), collapse = "|")

unlist(regmatches(x = list_of_text, m = gregexpr(verbRegex, list_of_text, perl = TRUE)))

# Keep only cities with more than 1 million inhabitants so the pattern stays small enough
bigCities <- world.cities[world.cities$pop > 1000000, 'name']
citiesRegex <- paste(unlist(bigCities), collapse = "|")

unlist(regmatches(x = list_of_text, m = gregexpr(citiesRegex, list_of_text, perl = TRUE)))
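The three steps above share the same shape, so they could be folded into something like the extract_entity() interface sketched in the question. This is a minimal sketch in base R, not an existing package function: `extract_entity` takes its name from the question, while `dict_to_regex`, the `patterns` vector, and the short word lists are hypothetical and chosen only for illustration. It assumes the dictionary terms contain no regex metacharacters.

```r
# Hypothetical helper: turn an exhaustive word list into one alternation,
# anchored on word boundaries so that terms do not match inside longer words.
# Assumes the terms contain no regex metacharacters.
dict_to_regex <- function(terms) {
  paste0("\\b(?:", paste(terms, collapse = "|"), ")\\b")
}

# Hypothetical helper: `patterns` is a named character vector of regexes.
# Returns a named list with all matches found for each entity type.
extract_entity <- function(text, patterns) {
  lapply(patterns, function(p) {
    unlist(regmatches(text, gregexpr(p, text, perl = TRUE)))
  })
}

patterns <- c(
  verb = dict_to_regex(c("eat", "drink", "sleep")),
  IP   = "\\d+\\.\\d+\\.\\d+\\.\\d+",
  city = dict_to_regex(c("Copenhagen", "Paris", "New York"))
)

res <- extract_entity(
  "Lorem ipsum 12-01-2021 eat drink sleep, Copenhagen 133.001.00.00 ...",
  patterns
)

res[["verb"]]
res[["IP"]]
res[["city"]]
```

Because `text` may be a vector, matches from all texts are pooled per entity type; wrapping the call in lapply() over the texts would keep them separate per text instead.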
