Extracting text from *.txt files in R
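I've used Expressions for Mac to confirm that my regex works, but I can't find the command to extract the information from my text files. I have 2,500 text files, and I need to extract the date of each document to populate a dataset. FYI, "date" is the first variable to be extracted; there will be others. The files vary in format and contain multiple dates. I'm only interested in the first date in each document. Some documents start a new line with the date, others start it with the word "Date" or "Dated".
Sample of each text document: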
Bangor
dorset
LL56 43r
date: 10 july 2009
take notice: the blah blah blah text goes here and there's lots of it.
action:
Regular expression that works:
"\d{1,2}\s+(?:january|february|march|april|may|june|july|august|september|october|november|december)\s+\d{4}"
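The text documents are visible in the RStudio environment as single-element character vectors. I'd like to extract the text "as is", with something like...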
> strapply(NoFN, ("\d{1,2}\.?:january|february|march|april|may|june|july|august|september|october|november|december\.\d{4}")[[1]]
> [1] 10 july 2009
Obviously this doesn't actually work!
Many thanks!
Ian
Your regex isn't valid in R as written, because the \ characters need to be escaped inside an R string literal.
The regex should be:
"\\d{1,2}\\s+(?:january|february|march|april|may|june|july|august|september|october|november|december)\\s+\\d{4}"
If you use the stringr package and your text is loaded into txt, you can do this:
library(stringr)
txt = "Bangor dorset LL56 43r\n date: 10 july 2009 \n take notice: the blah blah blah text goes here and there's lots of it. action:"
str_match(string = txt, pattern = "\\d{1,2}\\s+(?:january|february|march|april|may|june|july|august|september|october|november|december)\\s+\\d{4}")
[,1]
[1,] "10 july 2009"
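For anyone puzzled by the escaping, here is a small base R sketch of the same idea. It assumes R 4.0.0 or later for the raw-string form, adds a second date to the sample text purely for illustration, and uses perl = TRUE so that the (?:...) group is handled by the PCRE engine. regexpr() only reports the first match, which is what the question asks for.

```r
# In an R string literal every regex backslash has to be doubled.
pattern <- "\\d{1,2}\\s+(?:january|february|march|april|may|june|july|august|september|october|november|december)\\s+\\d{4}"

# R >= 4.0.0 also accepts raw strings, where the backslashes stay single.
raw_pattern <- r"(\d{1,2}\s+(?:january|february|march|april|may|june|july|august|september|october|november|december)\s+\d{4})"

# Both spellings produce exactly the same regex.
same_regex <- identical(pattern, raw_pattern)

# writeLines() shows the regex as the engine sees it (single backslashes).
writeLines(pattern)

# Sample text with a second date added for illustration.
txt <- "Bangor dorset LL56 43r\n date: 10 july 2009 \n take notice: served 12 august 2009"

# regexpr() finds only the first match; regmatches() pulls it out.
first <- regmatches(txt, regexpr(pattern, txt, perl = TRUE))
print(first)
```

If some documents capitalise the month ("10 July 2009"), passing ignore.case = TRUE to regexpr() should cover that without changing the pattern.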
I believe this does it. It uses the built-in variable month.name and, unlike the question, it uses () to group the months.
txt <- "\n date: 10 july 2009 \n take notice: the blah blah blah text goes here and there's lots of it. action:"
pattern <- paste(tolower(month.name), collapse = "|")
pattern <- paste0("(", pattern, ")")
pattern <- paste("[[:digit:]]{1,2}[[:space:]]*", pattern, "[[:digit:]]{4}")
m <- regexpr(pattern, txt)
regmatches(txt, m)
#[1] "10 july 2009"
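Since the question mentions 2,500 files, here is a rough sketch of how either approach might be applied to a whole folder. The helper name first_date, the demo folder, and the three throwaway files are all made up for illustration; in practice you would point list.files() at your own directory. Files with no date come back as NA so the result still lines up one row per file.

```r
# Build the month alternation from the built-in month.name vector.
months <- paste(tolower(month.name), collapse = "|")
pattern <- paste0("\\d{1,2}\\s+(?:", months, ")\\s+\\d{4}")

# Hypothetical helper: first date found in one file, or NA if there is none.
first_date <- function(path) {
  text <- paste(readLines(path, warn = FALSE), collapse = "\n")
  m <- regexpr(pattern, text, ignore.case = TRUE, perl = TRUE)
  if (m[1] == -1) NA_character_ else regmatches(text, m)
}

# Throwaway demo files in a temporary folder (stand-ins for the real documents).
demo_dir <- file.path(tempdir(), "date_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("Bangor", "dorset", "LL56 43r", "date: 10 july 2009",
             "served 12 august 2009"), file.path(demo_dir, "a.txt"))
writeLines(c("Dated 3 March 2011", "take notice: ..."), file.path(demo_dir, "b.txt"))
writeLines("no date here", file.path(demo_dir, "c.txt"))

# One row per file: file name plus the first date as it appears in the text.
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
dates <- vapply(files, first_date, character(1), USE.NAMES = FALSE)
result <- data.frame(file = basename(files), date = dates, stringsAsFactors = FALSE)
print(result)
```

The dates are kept as text here, exactly as they appear in each document; converting them to Date values would be a separate step.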
Thanks all, that works a treat!
library(stringr)
txt = "Bangor dorset LL56 43r\n date: 10 july 2009 \n take notice: the blah blah blah text goes here and there's lots of it. action:"
str_match(string = txt, pattern = "\\d{1,2}\\s+(?:january|february|march|april|may|june|july|august|september|october|november|december)\\s+\\d{4}")
     [,1]
[1,] "10 july 2009"