R code to extract field values from a JSON logfile
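I have a file containing 50,000 records from a log collection. For each record I need to extract the values that follow "State": and "Code":. I tried a regular expression but could not get it to work. Instead, I tried this command to see whether I could get just one of the values, but it simply times out.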
# this never completes
sub(".*?Code(.*?);.*", "\\1", logfile)
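I have no experience with this kind of work, so any help is much appreciated! Below is the format of the log file (presumably JSON). My goal is to return the following values (it is fine if the State and Code labels cannot be included):
(State: Red, Code: null) (State: Blue, Code: no receipt)
Below is the exact syntax of the logfile, with 2 records: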
"
2020-05-12 00:07:00.9681200, z123-asddfas,"
========== mode for SKU ==========
========== Records found ==========
No records found
========== DRecords found ==========
No drecords found
"
2020-05-12 00:08:46.5076411,qwer98-asdha,"
========== mode for SKU ==========
========== records found ==========
{
"State": "Red",
"Code": null
}
========== DRecords found ==========
No drecords found
"
2020-05-12 00:10:02.6607640,qweaso-34324-asda,"
========== mode for SKU ==========
========== records found ==========
{
"State": "Blue",
"Code": "no receipt"
}
Read in your text:
library(readr)  # read_lines() comes from readr
logIn <- read_lines('"
2020-05-12 00:07:00.9681200, z123-asddfas,"
========== mode for SKU ==========
========== Records found ==========
No records found
========== DRecords found ==========
No drecords found
"
2020-05-12 00:08:46.5076411,qwer98-asdha,"
========== mode for SKU ==========
========== records found ==========
{
"State": "Red",
"Code": null
}
========== DRecords found ==========
No drecords found
"
2020-05-12 00:10:02.6607640,qweaso-34324-asda,"
========== mode for SKU ==========
========== records found ==========
{
"State": "Blue",
"Code": "no receipt"
}')
Turn it into a form that is easy to wrangle, filter it, and clean it up:
library(tidyverse)

tibble(lines = logIn) %>%
  # Keep only the lines with 'state' or 'code' (case-insensitive)
  filter(str_detect(lines, "(?ix) ( state | code )")) %>%
  # Clean out all the whitespace and punct, except the ':'
  mutate(lines = str_replace_all(lines, '["\\s,]', '')) %>%
  # Use separate to divide into two new columns
  separate(lines, c("ATTR", "VALUE"), sep = ":")
What do we get?
# A tibble: 4 x 2
  ATTR  VALUE
  <chr> <chr>
1 State Red
2 Code  null
3 State Blue
4 Code  noreceipt
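The cleaning step above strips every space, which is why `no receipt` comes out as `noreceipt`. If the inner spaces matter, one option is to capture the key and the value directly instead of deleting characters. The following is only a sketch under that assumption: `sample_lines`, `parts`, and `kv` are illustrative names, and the three lines are copied by hand from the log sample rather than read from a file.

```r
library(stringr)

# Three lines in the same shape as the JSON fragments in the log sample
sample_lines <- c('"State": "Blue",', '"Code": "no receipt"', '"Code": null')

# Group 1 is the key; group 2 is the value without its optional quotes
# and without the optional trailing comma
parts <- str_match(sample_lines, '^\\s*"(State|Code)"\\s*:\\s*"?(.*?)"?,?\\s*$')

# str_match() returns a matrix: column 1 is the full match, then the groups
kv <- data.frame(ATTR = parts[, 2], VALUE = parts[, 3], stringsAsFactors = FALSE)
kv
```

Lines that do not match the pattern would come back as NA in both columns, so they can be dropped afterwards with a simple filter.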
##################### As requested
tibble(lines = logIn) %>%
  # Keep only the lines with 'state' or 'code' (case-insensitive)
  filter(str_detect(lines, "(?ix) ( state | code )")) %>%
  # This ID will come in useful
  rowid_to_column("ID") %>%
  # Clean out all the whitespace and punct, except the ':'
  mutate(lines = str_replace_all(lines, '["\\s,]', ''),
         # Give each State and Code the same ID.
         ID = floor((ID + 1) / 2)) %>%
  # Use separate to divide into two new columns
  separate(lines, c("ATTR", "VALUE"), sep = ":") %>%
  # spread takes it from long form to wide form
  spread(key = ATTR, value = VALUE) %>%
  select(ID, State, Code)
# A tibble: 2 x 3
     ID State Code
  <dbl> <chr> <chr>
1     1 Red   null
2     2 Blue  noreceipt
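The `floor((ID + 1) / 2)` trick relies on every State line being followed by exactly one Code line. If that cannot be guaranteed across 50,000 records, a possible alternative is to number the log entries themselves, treating each line that starts with a timestamp as the start of a new entry. The following base R sketch assumes that layout; `log_lines` is a hand-made miniature of the sample, and `get_value` and `first_non_na` are hypothetical helper names.

```r
# Hypothetical miniature of the log, same layout as the sample in the question
log_lines <- c(
  '2020-05-12 00:07:00.9681200, z123-asddfas,"',
  'No records found',
  '2020-05-12 00:08:46.5076411,qwer98-asdha,"',
  '"State": "Red",',
  '"Code": null',
  '2020-05-12 00:10:02.6607640,qweaso-34324-asda,"',
  '"State": "Blue",',
  '"Code": "no receipt"'
)

# Assumption: a new entry starts at each line that begins with a date
is_header <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2} ", log_lines)
entry <- cumsum(is_header)

# Pull the value for one key out of each line; NA where the key is absent
get_value <- function(key, x) {
  pattern <- paste0('^\\s*"', key, '"\\s*:\\s*"?(.*?)"?,?\\s*$')
  hit <- grepl(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[hit] <- sub(pattern, "\\1", x[hit], perl = TRUE)
  out
}

# First non-missing value in a group, or NA if the group has none
first_non_na <- function(v) if (all(is.na(v))) NA_character_ else v[!is.na(v)][1]

state <- get_value("State", log_lines)
code  <- get_value("Code", log_lines)

# One row per log entry; entries without a record keep NA
result <- data.frame(
  ID    = unique(entry),
  State = vapply(split(state, entry), first_non_na, character(1), USE.NAMES = FALSE),
  Code  = vapply(split(code, entry), first_non_na, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
result
```

Unlike the output above, the ID here counts log entries rather than State/Code pairs, so the first entry ("No records found") appears as a row of NA values instead of being dropped.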