从语料库 R 中的文本中提取日期
Extract date from texts in corpus R
我有一个语料库对象,我想从中提取数据,以便将它们添加为 docvar。
对象看起来像这样
v1 <- c("(SE22-y -7 A go q ,, Document of The World Bank FOR OFFICIAL USE ONLY il I ( >I8.( )]i 1 t'f-l±E C 4'( | Report No. 9529-LSO l il .rt N ,- / . t ,!I . 1. 'i 1( T v f) (: AR.) STAFF APPRAISAL REPORT KINGDOM OF LESOTHO EDUCATION SECTOR DEVELOPMENT PROJECT JUNE 19, 1991 Population and Human Resources Division Southern Africa Department This document has a restricted distribution and may be used by reipients only in the performance of their official duties. Its contents may not otherwise be disclosed without World Bank authorization.",
"Document of The World Bank Report No. 13611-PAK STAFF APPRAISAL REPORT PAKISTAN POPULATION WELFARE PROGRAM PROJECT FREBRUARY 10, 1995 Population and Human Resources Division Country Department I South Asia Region",
"I Toward an Environmental Strategy for Asia A Summary of a World Bank Discussion Paper Carter Brandon Ramesh Ramankutty The World Bank Washliington, D.C. (C 1993 The International Bank for Reconstruction and Development / THiE WORLD BANK 1818 H Street, N.W. Washington, D.C. 20433 All rights reserved Manufactured in the United States of America First printing November 1993",
"Report No. PID9188 Project Name East Timor-TP-Emergency School (@) Readiness Project Region East Asia and Pacific Region Sector Other Education Project ID TPPE70268 Borrower(s) EAST TIMOR Implementing Agency Address UNTAET (UN TRANSITIONAL ADMINISTRATION FOR EAST TIMOR) Contact Person: Cecilio Adorna, UNTAET, Dili, East Timor Fax: 61-8 89 422198 Environment Category C Date PID Prepared June 16, 2000 Projected Appraisal Date May 27, 2000 Projected Board Date June 20, 2000",
"Page 1 CONFORMED COPY CREDIT NUMBER 2447-CHA (Reform, Institutional Support and Preinvestment Project) between PEOPLE'S REPUBLIC OF CHINA and INTERNATIONAL DEVELOPMENT ASSOCIATION Dated December 30, 1992")
c1 <- corpus(v1)
我要做的第一件事是提取第一个出现的日期,大多数情况下它出现为 "Month Year"(1990 年 12 月)或 "Month Day, Year"(1991 年 6 月 19 日)或打字错误 FREBRUARY 10 , 1995 在这种情况下可以丢弃月份。
我的代码是
&
Extract Dates in any format from Text in R:
lapply(c1$documents$texts, function(x) anydate(str_extract_all(c1$documents$texts, "[[:alnum:]]+[ /]*\d{2}[ /]*\d{4}")))
并得到错误:
Error in anytime_cpp(x = x, tz = tz, asUTC = asUTC, asDate = TRUE, useR = useR, : Unsupported Type
但是,我不知道如何提供日期格式。此外,我真的不知道如何编写正确的正则表达式。
https://www.regular-expressions.info/dates.html & https://www.regular-expressions.info/rlanguage.html
关于这个主题的其他问题是:
Need to extract date from a text file of strings in R
http://r.789695.n4.nabble.com/Regexp-extract-first-occurrence-of-date-in-string-td997254.html
str_extract_all(texts(c1)
, "(\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Nov(?:ember)?|Oct(?:ober)?|Dec(?:ember)?) (?:19[7-9]\d|2\d{3})(?=\D|$))|(\b(?:JAN(?:UARY)?|FEB(?:RUARY)?|MAR(?:CH)?|APR(?:IL)?|MAY|JUN(?:E)?|JUL(?:Y)?|AUG(?:UST)?|SEP(?:TEMBER)?|NOV(?:EMBER)?|OCT(?:OBER)?|DEC(?:EMBER)?) (?:19[7-9]\d|2\d{3})(?=\D|$))|((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4})|(\b(JAN(UARY)?|FEB(RUARY)?|MAR(CH)?|APR(IL)?|MAY|JUN(E)?|JUL(Y)?|AUG(UST)?|SEP(TEMBER)?|OCT(OBER)?|NOV(EMBER)?|DEC(EMBER)?)\s+\d{1,2},\s+\d{4})"
, simplify = TRUE)[,1]
这给出了第一次出现格式 JUNE 19, 1991 或 December 1990
我有一个语料库对象,我想从中提取数据,以便将它们添加为 docvar。
对象看起来像这样
v1 <- c("(SE22-y -7 A go q ,, Document of The World Bank FOR OFFICIAL USE ONLY il I ( >I8.( )]i 1 t'f-l±E C 4'( | Report No. 9529-LSO l il .rt N ,- / . t ,!I . 1. 'i 1( T v f) (: AR.) STAFF APPRAISAL REPORT KINGDOM OF LESOTHO EDUCATION SECTOR DEVELOPMENT PROJECT JUNE 19, 1991 Population and Human Resources Division Southern Africa Department This document has a restricted distribution and may be used by reipients only in the performance of their official duties. Its contents may not otherwise be disclosed without World Bank authorization.",
"Document of The World Bank Report No. 13611-PAK STAFF APPRAISAL REPORT PAKISTAN POPULATION WELFARE PROGRAM PROJECT FREBRUARY 10, 1995 Population and Human Resources Division Country Department I South Asia Region",
"I Toward an Environmental Strategy for Asia A Summary of a World Bank Discussion Paper Carter Brandon Ramesh Ramankutty The World Bank Washliington, D.C. (C 1993 The International Bank for Reconstruction and Development / THiE WORLD BANK 1818 H Street, N.W. Washington, D.C. 20433 All rights reserved Manufactured in the United States of America First printing November 1993",
"Report No. PID9188 Project Name East Timor-TP-Emergency School (@) Readiness Project Region East Asia and Pacific Region Sector Other Education Project ID TPPE70268 Borrower(s) EAST TIMOR Implementing Agency Address UNTAET (UN TRANSITIONAL ADMINISTRATION FOR EAST TIMOR) Contact Person: Cecilio Adorna, UNTAET, Dili, East Timor Fax: 61-8 89 422198 Environment Category C Date PID Prepared June 16, 2000 Projected Appraisal Date May 27, 2000 Projected Board Date June 20, 2000",
"Page 1 CONFORMED COPY CREDIT NUMBER 2447-CHA (Reform, Institutional Support and Preinvestment Project) between PEOPLE'S REPUBLIC OF CHINA and INTERNATIONAL DEVELOPMENT ASSOCIATION Dated December 30, 1992")
c1 <- corpus(v1)
我要做的第一件事是提取第一个出现的日期,大多数情况下它出现为 "Month Year"(1990 年 12 月)或 "Month Day, Year"(1991 年 6 月 19 日)或打字错误 FREBRUARY 10 , 1995 在这种情况下可以丢弃月份。
我的代码是
lapply(c1$documents$texts, function(x) anydate(str_extract_all(c1$documents$texts, "[[:alnum:]]+[ /]*\d{2}[ /]*\d{4}")))
并得到错误:
Error in anytime_cpp(x = x, tz = tz, asUTC = asUTC, asDate = TRUE, useR = useR, : Unsupported Type
但是,我不知道如何提供日期格式。此外,我真的不知道如何编写正确的正则表达式。
https://www.regular-expressions.info/dates.html & https://www.regular-expressions.info/rlanguage.html
关于这个主题的其他问题是:
Need to extract date from a text file of strings in R
http://r.789695.n4.nabble.com/Regexp-extract-first-occurrence-of-date-in-string-td997254.html
str_extract_all(texts(c1)
, "(\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Nov(?:ember)?|Oct(?:ober)?|Dec(?:ember)?) (?:19[7-9]\d|2\d{3})(?=\D|$))|(\b(?:JAN(?:UARY)?|FEB(?:RUARY)?|MAR(?:CH)?|APR(?:IL)?|MAY|JUN(?:E)?|JUL(?:Y)?|AUG(?:UST)?|SEP(?:TEMBER)?|NOV(?:EMBER)?|OCT(?:OBER)?|DEC(?:EMBER)?) (?:19[7-9]\d|2\d{3})(?=\D|$))|((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4})|(\b(JAN(UARY)?|FEB(RUARY)?|MAR(CH)?|APR(IL)?|MAY|JUN(E)?|JUL(Y)?|AUG(UST)?|SEP(TEMBER)?|OCT(OBER)?|NOV(EMBER)?|DEC(EMBER)?)\s+\d{1,2},\s+\d{4})"
, simplify = TRUE)[,1]
这给出了第一次出现格式 JUNE 19, 1991 或 December 1990