How to reshape a character column into two columns (Date and Text) in R?

I have the following character string:

cal = "\n \n21/01/2021\n\n        \nGoverning Council of the ECB: monetary policy meeting in Frankfurt\n        \n \n21/01/2021\n\n        \nPress conference following the Governing Council meeting of the ECB in Frankfurt\n        \n \n03/02/2021\n\n        \nGoverning Council of the ECB: non-monetary policy meeting in Frankfurt\n        \n \n17/02/2021\n\n        \nGoverning Council of the ECB: non-monetary policy meeting in Frankfurt\n        \n \n11/03/2021\n\n        \nGoverning Council of the ECB: monetary policy meeting in Frankfurt\n        \n \n11/03/2021\n\n        \nPress conference following the Governing Council meeting of the ECB in Frankfurt\n        \n \n24/03/2021\n\n        \nGoverning Council of the ECB: non-monetary policy meeting in Frankfurt\n        \n \n25/03/2021\n\n        \nGeneral Council meeting of the ECB in Frankfurt\n        \n \n22/04/2021\n\n        \nGoverning Council of the ECB: monetary policy meeting in Frankfurt\n        \n \n22/04/2021\n\n        \nPress conference following the Governing Council meeting of the ECB in Frankfurt\n        \n \n12/05/2021\n\n        \nGoverning Council of the ECB: non-monetary policy meeting in Frankfurt\n        \n \n10/06/2021\n\n        \nGoverning Council of the ECB: monetary policy meeting in the Netherlands\n        \n \n10/06/2021\n\n        \nPress conference following the Governing Council meeting of the ECB in the Netherlands\n        \n \n23/06/2021\n\n        \nGoverning Council of the ECB: non-monetary policy meeting in Frankfurt\n        \n \n24/06/2021\n\n        \nGeneral Council meeting of the ECB in Frankfurt\n        \n \n22/07/2021\n\n        \nGoverning Council of the ECB: monetary policy meeting in Frankfurt\n        \n \n22/07/2021\n\n        \nPress conference following the Governing Council meeting of the ECB in Frankfurt\n        \n \n09/09/2021\n\n        \nGoverning Council of the ECB: monetary policy meeting in Frankfurt\n        \n \n09/09/2021\n\n        \nPress 
conference following the Governing Council meeting of the ECB in Frankfurt\n        \n \n22/09/2021\n\n        \nGoverning Council of the ECB: non-monetary policy meeting in Frankfurt\n        \n \n23/09/2021\n\n        \nGeneral Council meeting of the ECB in Frankfurt\n        \n \n06/10/2021\n\n        \nGoverning Council of the ECB: non-monetary policy meeting in Frankfurt\n        \n \n28/10/2021\n\n        \nGoverning Council of the ECB: monetary policy meeting in Frankfurt\n        \n \n28/10/2021\n\n        \nPress conference following the Governing Council meeting of the ECB in Frankfurt\n        \n \n10/11/2021\n\n        \nGoverning Council of the ECB: non-monetary policy meeting in Frankfurt\n        \n \n01/12/2021\n\n        \nGoverning Council of the ECB: non-monetary policy meeting in Frankfurt\n        \n \n02/12/2021\n\n        \nGeneral Council meeting of the ECB in Frankfurt\n        \n \n16/12/2021\n\n        \nGoverning Council of the ECB: monetary policy meeting in Frankfurt\n        \n \n16/12/2021\n\n        \nPress conference following the Governing Council meeting of the ECB in Frankfurt\n        \n"
cal_flat = gsub("\n", " ", cal)  # flattened copy with newlines replaced by spaces


As you can see, the string contains both dates and text. What I would like to do is turn it into two columns: 'Date' and 'Event'.

This would be the result (only the first few rows are shown for simplicity):

Date                    Event

21/01/2021        Governing Council of the ECB: monetary policy meeting in Frankfurt
21/01/2021        Press conference following the Governing Council meeting of the ECB...
03/02/2021        Governing Council of the ECB: non-monetary policy meeting in Frankfurt
17/02/2021        Governing Council of the ECB: non-monetary policy meeting in Frankfurt
11/03/2021        Governing Council of the ECB: monetary policy meeting in Frankfurt        
...

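I tried many functions for reshaping a corpus into sentences, as well as functions for extracting dates, but I could not get it to work. For example: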

library(anytime)
library(stringr)  # str_extract_all(); also re-exports the %>% pipe
anydate(str_extract_all(cal, "[[:alnum:]]+[ /]*\\d{2}[ /]*\\d{4}")[[1]]) %>% as.data.frame()

# it gives me back a lot of NAs, and I don't know why

[1] NA           NA           "2021-03-02" NA           "2021-11-03" "2021-11-03" NA          
 [8] NA           NA           NA           "2021-12-05" "2021-10-06" "2021-10-06" NA          
[15] NA           NA           NA           "2021-09-09" "2021-09-09" NA           NA          
[22] "2021-06-10" NA           NA           "2021-10-11" "2021-01-12" "2021-02-12" NA          
[29] NA          

Can anyone help me?

Thanks!

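Using read.table we can split at \n. With strip.white=TRUE, elements that contain only whitespace are omitted. The result now follows the pattern date - event - date ..., so we can fill it row-wise into a two-column matrix.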

r <- setNames(data.frame(matrix(
  read.table(text=cal, sep="\n", row.names=NULL, strip.white=TRUE)[,1], 
  ncol=2, byrow=TRUE)), c("date", "event"))
r$date <- as.Date(r$date, "%d/%m/%Y")  ## format to date

Result

r
#          date                                                                                  event
# 1  2021-01-21                     Governing Council of the ECB: monetary policy meeting in Frankfurt
# 2  2021-01-21       Press conference following the Governing Council meeting of the ECB in Frankfurt
# 3  2021-02-03                 Governing Council of the ECB: non-monetary policy meeting in Frankfurt
# 4  2021-02-17                 Governing Council of the ECB: non-monetary policy meeting in Frankfurt
# 5  2021-03-11                     Governing Council of the ECB: monetary policy meeting in Frankfurt
# 6  2021-03-11       Press conference following the Governing Council meeting of the ECB in Frankfurt
# 7  2021-03-24                 Governing Council of the ECB: non-monetary policy meeting in Frankfurt
# 8  2021-03-25                                        General Council meeting of the ECB in Frankfurt
# 9  2021-04-22                     Governing Council of the ECB: monetary policy meeting in Frankfurt
# 10 2021-04-22       Press conference following the Governing Council meeting of the ECB in Frankfurt
# 11 2021-05-12                 Governing Council of the ECB: non-monetary policy meeting in Frankfurt
# 12 2021-06-10               Governing Council of the ECB: monetary policy meeting in the Netherlands
# 13 2021-06-10 Press conference following the Governing Council meeting of the ECB in the Netherlands
# 14 2021-06-23                 Governing Council of the ECB: non-monetary policy meeting in Frankfurt
# 15 2021-06-24                                        General Council meeting of the ECB in Frankfurt
# 16 2021-07-22                     Governing Council of the ECB: monetary policy meeting in Frankfurt
# 17 2021-07-22       Press conference following the Governing Council meeting of the ECB in Frankfurt
# 18 2021-09-09                     Governing Council of the ECB: monetary policy meeting in Frankfurt
# 19 2021-09-09       Press conference following the Governing Council meeting of the ECB in Frankfurt
# 20 2021-09-22                 Governing Council of the ECB: non-monetary policy meeting in Frankfurt
# 21 2021-09-23                                        General Council meeting of the ECB in Frankfurt
# 22 2021-10-06                 Governing Council of the ECB: non-monetary policy meeting in Frankfurt
# 23 2021-10-28                     Governing Council of the ECB: monetary policy meeting in Frankfurt
# 24 2021-10-28       Press conference following the Governing Council meeting of the ECB in Frankfurt
# 25 2021-11-10                 Governing Council of the ECB: non-monetary policy meeting in Frankfurt
# 26 2021-12-01                 Governing Council of the ECB: non-monetary policy meeting in Frankfurt
# 27 2021-12-02                                        General Council meeting of the ECB in Frankfurt
# 28 2021-12-16                     Governing Council of the ECB: monetary policy meeting in Frankfurt
# 29 2021-12-16       Press conference following the Governing Council meeting of the ECB in Frankfurt
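
The row-wise fill relies on the lines alternating strictly between date and event; if one line is missing, every later row shifts silently. As a minimal sketch (not part of the answer above), the check below verifies that assumption in base R before building the data frame. `sample_cal` is a hypothetical, shortened stand-in for `cal`, and the names `pieces`, `is_date`, and `res` are illustrative only.

```r
# Hypothetical shortened sample in the same layout as `cal`
sample_cal <- "\n \n21/01/2021\n\n        \nMonetary policy meeting\n        \n \n03/02/2021\n\n        \nNon-monetary policy meeting\n        \n"

# Split on newlines, trim each piece, and drop the empty ones
pieces <- trimws(strsplit(sample_cal, "\n", fixed = TRUE)[[1]])
pieces <- pieces[nzchar(pieces)]

# Flag the pieces that look like dd/mm/yyyy
is_date <- grepl("^[0-9]{2}/[0-9]{2}/[0-9]{4}$", pieces)

# Stop early unless the pieces alternate date, event, date, event, ...
stopifnot(length(pieces) %% 2 == 0,
          all(is_date[c(TRUE, FALSE)]),
          !any(is_date[c(FALSE, TRUE)]))

# Odd positions are dates, even positions are events
res <- data.frame(
  Date  = as.Date(pieces[c(TRUE, FALSE)], format = "%d/%m/%Y"),
  Event = pieces[c(FALSE, TRUE)],
  stringsAsFactors = FALSE
)
res
```

If the check fails, it is probably worth inspecting `pieces[!is_date]` by eye before reshaping.
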
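
Alternatively, split the string into one piece per date/event pair with str_split, then separate each piece into two columns:

library(dplyr)
library(tidyr)    # separate()
library(stringr)

# one element per date/event pair
x = unlist(str_split(cal, "\\n\\s{2,}\\n\\s\\n"))
y = data.frame(x, stringsAsFactors = FALSE)
# split each element into Date and Event, then trim leftover whitespace
y %>% separate(x, c("Date","Event"), "\\n\\n\\s{2,}\\n") %>%
  mutate(across(everything(), trimws))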

You can use str_match_all to extract the data that matches a specific pattern.

library(stringr)

tmp <- data.frame(str_match_all(trimws(gsub('\\s+', ' ', cal)), 
                  '(\\d+/\\d+/\\d+)\\s([A-Za-z:\\s-]+)')[[1]][, -1])
tmp$X2 <- trimws(tmp$X2)
tmp

#           X1                                                                                     X2
#1  21/01/2021                     Governing Council of the ECB: monetary policy meeting in Frankfurt
#2  21/01/2021       Press conference following the Governing Council meeting of the ECB in Frankfurt
#3  03/02/2021                 Governing Council of the ECB: non-monetary policy meeting in Frankfurt
#4  17/02/2021                 Governing Council of the ECB: non-monetary policy meeting in Frankfurt
#5  11/03/2021                     Governing Council of the ECB: monetary policy meeting in Frankfurt
#6  11/03/2021       Press conference following the Governing Council meeting of the ECB in Frankfurt
#7  24/03/2021                 Governing Council of the ECB: non-monetary policy meeting in Frankfurt
#...
#...
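
The X1 column above is still character. The sketch below shows one way to convert such day-first strings with base R's as.Date and an explicit format. It also reads the same strings month-first, which produces the same kind of NA and swapped values as the output in the question. That the question's parser read the dates month-first is an assumption that fits the output, not something verified here. The vector `dates` is a small illustrative sample.

```r
# A few dates in the same dd/mm/yyyy layout as the calendar
dates <- c("21/01/2021", "03/02/2021", "11/03/2021")

# Read month-first: 21 is not a valid month, and the other two
# values parse without error but with day and month swapped
month_first <- as.Date(dates, format = "%m/%d/%Y")

# Read day-first, which matches the layout of the data
day_first <- as.Date(dates, format = "%d/%m/%Y")

data.frame(raw = dates, month_first, day_first)
```

Passing the format explicitly avoids any guessing about day and month order.
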