R Code to extract values for fields out of JSON logfile

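I have a file containing 50,000 records from a log collection. For each record I need to extract the values that follow "State": and "Code":. I tried regular expressions but could not get them to work. As a fallback, I tried this command to see whether I could get just one of the values, but it simply timed out.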

#this never completes
sub(".*?Code(.*?);.*", "\\1", logfile)

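I have no experience with this kind of work, so any help is greatly appreciated! Below is the format of the logfile (assumed to be JSON). My goal is to return the following values (if the State and Code labels can't be included, that's fine):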

(State: Red, Code: null) (State: Blue, Code: no receipt)

Below is the exact syntax of the logfile, with 2 records:

 "
    2020-05-12 00:07:00.9681200, z123-asddfas,"
    ========== mode for SKU ==========
    ========== Records found ==========
    No records found
    ========== DRecords found ==========
    No drecords found
    "
    2020-05-12 00:08:46.5076411,qwer98-asdha,"
    ========== mode for SKU ==========
    ========== records found ==========
    {
        "State":  "Red",
        "Code":  null
    }
    ========== DRecords found ==========
    No drecords found
    "
    2020-05-12 00:10:02.6607640,qweaso-34324-asda,"
    ========== mode for SKU ==========
    ========== records found ==========
    {
        "State":  "Blue",
        "Code":  "no receipt"
    }

Read in your text

logIn <- readr::read_lines('"
    2020-05-12 00:07:00.9681200, z123-asddfas,"
========== mode for SKU ==========
  ========== Records found ==========
  No records found
========== DRecords found ==========
  No drecords found
"
    2020-05-12 00:08:46.5076411,qwer98-asdha,"
========== mode for SKU ==========
  ========== records found ==========
  {
    "State":  "Red",
    "Code":  null
  }
========== DRecords found ==========
  No drecords found
"
    2020-05-12 00:10:02.6607640,qweaso-34324-asda,"
========== mode for SKU ==========
  ========== records found ==========
  {
    "State":  "Blue",
    "Code":  "no receipt"
  }')
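For the real 50,000-record file, the same kind of character vector can come straight from disk instead of a pasted string. A minimal sketch, assuming the log is a plain-text file: `log_path` is a hypothetical name, and a temporary file stands in for the real log so the snippet runs on its own. Base R's `readLines` is used here; `readr::read_lines(log_path)` should give an equivalent vector.

```r
# Hypothetical path: a temporary file stands in for the real logfile.
log_path <- tempfile(fileext = ".log")
writeLines(c(
  '2020-05-12 00:08:46.5076411,qwer98-asdha,"',
  '========== records found ==========',
  '{',
  '    "State":  "Red",',
  '    "Code":  null',
  '}'
), log_path)

# One element per line, the same shape as the pasted text above.
logIn <- readLines(log_path)
length(logIn)
```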

Get it into a form that is easy to wrangle, filter it, then clean it up

library(tidyverse)
tibble(lines = logIn) %>% 
  # Keep only the lines with 'state' or 'code'
  filter(str_detect(lines, "(?ix) ( state | code )")) %>% 
  # Clean out all the whitespace and punct, except the ':'
  mutate(lines = str_replace_all(lines, '["\\s,]', '')) %>% 
  # Use separate to divide into two new columns
  separate(lines, c("ATTR", "VALUE"), sep = ":")

What do we get?

# A tibble: 4 x 2
  ATTR  VALUE    
  <chr> <chr>    
1 State Red      
2 Code  null     
3 State Blue     
4 Code  noreceipt
##################### As requested
tibble(lines = logIn) %>% 
  # Keep only the lines with 'state' or 'code'
  filter(str_detect(lines, "(?ix) ( state | code )")) %>% 
  # This ID will come in useful
  rowid_to_column("ID") %>% 
  # Clean out all the whitespace and punct, except the ':'
  mutate(lines = str_replace_all(lines, '["\\s,]', ''),
         # Give each State and Code the same ID.
         ID = floor((ID + 1) / 2)) %>% 
  # Use separate to divide into two new columns
  separate(lines, c("ATTR", "VALUE"), sep = ":") %>% 
  # spread takes it from long form to wide form
  spread(key = ATTR, value = VALUE) %>% 
  select(ID, State, Code)

# A tibble: 2 x 3
     ID State Code     
  <dbl> <chr> <chr>    
1     1 Red   null     
2     2 Blue  noreceipt
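
Note that stripping all whitespace turns "no receipt" into "noreceipt", and null stays as the string "null". If that matters, here is a rough base R sketch that captures the value after each field name instead. `extract_field` is a hypothetical helper, not part of any package, and it assumes every record block has exactly one "State" line and one "Code" line, each on its own line as in the sample.

```r
# Sample lines in the same shape as the log above (abbreviated).
log_lines <- c(
  '2020-05-12 00:08:46.5076411,qwer98-asdha,"',
  '========== records found ==========',
  '{',
  '    "State":  "Red",',
  '    "Code":  null',
  '}',
  '2020-05-12 00:10:02.6607640,qweaso-34324-asda,"',
  '========== records found ==========',
  '{',
  '    "State":  "Blue",',
  '    "Code":  "no receipt"',
  '}'
)

# Hypothetical helper: return the value that follows `"<field>":` on each
# line where that field appears.
extract_field <- function(lines, field) {
  pattern <- paste0('^\\s*"', field, '"\\s*:\\s*(.*?)\\s*,?\\s*$')
  hits <- grepl(pattern, lines, perl = TRUE)
  value <- sub(pattern, "\\1", lines[hits], perl = TRUE)
  value[value == "null"] <- NA_character_  # bare JSON null becomes NA
  gsub('^"|"$', "", value)                 # drop the surrounding double quotes
}

# Assumes State and Code always come in pairs, so the vectors line up.
result <- data.frame(
  State = extract_field(log_lines, "State"),
  Code  = extract_field(log_lines, "Code"),
  stringsAsFactors = FALSE
)
result
```

With the sample lines, the first record's Code comes back as NA and the second keeps its space as "no receipt". The pattern is anchored to a single line, so it runs once per line and does not need to scan across the whole file.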