Entity extraction based on a customized list in R
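I have a list of texts and a list of entities.
The list of texts is usually a character vector of strings.
The list of entities is a bit more complicated.
Some entities can be listed exhaustively, such as a list of major world cities.
Other entities cannot be listed exhaustively, but can be captured with a regular expression pattern.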
list_of_text <- c('Lorem ipsum 12-01-2021 eat, Copenhagen 133.001.00.00 ...', 'Lorem ipsum 12-01-2021, Copenhagen www.whosebug.com swimming', ...)
entity_city <- c('Copenhagen', 'Paris', 'New York', ...)
entity_IP_address <- c('regex code for IP address')
entity_URL <- c('regex code for URL')
entity_verb <- c('verbs')
Given list_of_text and the list of entities, I want to find the matching entities for each text.
For example, for c('Lorem ipsum 12-01-2021 eat drink sleep, Copenhagen 133.001.00.00 ...'), entity_verb corresponds to c(eat, drink, sleep), entity_IP corresponds to c(133.001.00.00), and so on.
res <- extract_entity(text = c('Lorem ipsum 12-01-2021 eat drink sleep, Copenhagen 133.001.00.00 ...'),
                      entities = c(entity_verb, entity_IP_address, entity_city))
res[['verb']]
c('eat', 'drink', 'sleep')
res[['IP']]
c('133.001.00.00')
res[['city']]
c('Copenhagen')
Is there an R package I can use for this?
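Take a look at the maps and qdapDictionaries packages. For world cities, I subset to cities with a population over 1 million; otherwise the regex fails with the error 'regular expression is too large'.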
library(maps)
library(qdapDictionaries)
list_of_text <- c('Lorem ipsum 12-01-2021 eat, Copenhagen 192.41.196.888','192.41.199.888','Lorem ipsum 12-01-2021, Copenhagen www.whosebug.com swimming')
# IP-like strings: four dot-separated runs of digits (octet ranges are not validated).
# Backslashes must be doubled inside an R string literal; gregexpr returns every match per text.
ipRegex <- "\\d+\\.\\d+\\.\\d+\\.\\d+"
unlist(regmatches(x = list_of_text, m = gregexpr(ipRegex, list_of_text, perl = TRUE)))
# Verbs: join the action.verbs dictionary into one alternation, verb1|verb2|...
verbRegex <- paste(unlist(action.verbs), collapse = "|")
unlist(regmatches(x = list_of_text, m = gregexpr(verbRegex, list_of_text, perl = TRUE)))
# Cities: same idea, restricted to cities with more than 1 million inhabitants
citiesRegex <- paste(unlist(world.cities[world.cities$pop > 1000000, 'name']), collapse = "|")
unlist(regmatches(x = list_of_text, m = gregexpr(citiesRegex, list_of_text, perl = TRUE)))
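
The steps above could be wrapped into one helper shaped roughly like the extract_entity call in the question. This is only a sketch in base R. The function name, its arguments (patterns, dictionaries, chunk_size) and the sample IP pattern are assumptions made for illustration and do not come from any package. Dictionary entries are quoted with \Q...\E so they match literally and are wrapped in word boundaries. They are searched in chunks, which may help keep each compiled pattern under the size limit mentioned above; the chunk size of 500 is arbitrary.

```r
# Hypothetical helper: the name and arguments are illustrative, not from a package.
extract_entity <- function(text, patterns = list(), dictionaries = list(),
                           chunk_size = 500) {
  # All matches of `pattern` in each element of `text` (one vector per text).
  match_all <- function(pattern) {
    regmatches(text, gregexpr(pattern, text, perl = TRUE))
  }

  # Match a dictionary of literal entries, a chunk at a time.
  match_dictionary <- function(entries) {
    chunks <- split(entries, ceiling(seq_along(entries) / chunk_size))
    hits <- lapply(chunks, function(chunk) {
      # \Q...\E makes PCRE treat each entry as a literal string.
      alternation <- paste0("\\Q", chunk, "\\E", collapse = "|")
      match_all(paste0("\\b(?:", alternation, ")\\b"))
    })
    # Merge the per-chunk results text by text, dropping repeated matches.
    do.call(Map, c(f = function(...) unique(c(...)), unname(hits)))
  }

  c(lapply(patterns, match_all), lapply(dictionaries, match_dictionary))
}

texts <- c("Lorem ipsum 12-01-2021 eat drink sleep, Copenhagen 133.001.00.00",
           "Lorem ipsum 12-01-2021, Paris swimming")

res <- extract_entity(
  texts,
  # Assumed IP-like pattern; it does not validate octet ranges.
  patterns = list(IP = "\\b\\d{1,3}(?:\\.\\d{1,3}){3}\\b"),
  dictionaries = list(verb = c("eat", "drink", "sleep"),
                      city = c("Copenhagen", "Paris", "New York"))
)

res[["verb"]][[1]]  # verbs found in the first text
res[["city"]][[2]]  # cities found in the second text
```

Each element of res is a list with one character vector per input text, so res[["IP"]][[1]] holds the IP-like strings found in the first text. In real use the dictionaries could be action.verbs and the world.cities names instead of the short vectors shown here.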