Add extra column to a list of .txt files based on keywords
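I have a list of 100 text files containing temperature values, one file per station in the United Kingdom. However, short of doing it by hand, I have no way to tell the files apart inside a loop.
I would like to detect the station by keyword and then add the selected name as an extra column, for example: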
EUROPEAN CLIMATE ASSESSMENT & DATASET (ECA&D), file created on 25-06-2021
THESE DATA CAN BE USED FREELY PROVIDED THAT THE FOLLOWING SOURCE IS ACKNOWLEDGED:
Klein Tank, A.M.G. and Coauthors, 2002. Daily dataset of 20th-century surface
air temperature and precipitation series for the European Climate Assessment.
Int. J. of Climatol., 22, 1441-1453.
Data and metadata available at http://www.ecad.eu
FILE FORMAT (MISSING VALUE CODE IS -9999):
01-06 SOUID: Source identifier
08-15 DATE : Date YYYYMMDD
17-21 TX : maximum temperature in 0.1 °C
23-27 Q_TX : Quality code for TX (0='valid'; 1='suspect'; 9='missing')
This is the blended series of station UNITED KINGDOM, ARMAGH (STAID: 271).
Blended and updated with sources: 100918 146805
See file sources.txt and stations.txt for more info.
SOUID, DATE, TX, Q_TX
146805,18440101, 19, 0
146805,18440102, -2, 0
146805,18440103, 67, 0
146805,18440104, 111, 0
146805,18440105, 117, 0
146805,18440106, 89, 0
146805,18440107, 61, 0
146805,18440108, 69, 0
#Expected:
SOUID, DATE, TX, Q_TX Station
146805,18440101, 19, 0 ARMAGH
146805,18440102, -2, 0 ARMAGH
146805,18440103, 67, 0 ARMAGH
146805,18440104, 111, 0 ARMAGH
146805,18440105, 117, 0 ARMAGH
146805,18440106, 89, 0 ARMAGH
146805,18440107, 61, 0 ARMAGH
146805,18440108, 69, 0 ARMAGH
I can select the list of files using:
files <- list.files(pattern = ".txt", full.names=TRUE)
all.txt <- lapply(files, data.table::fread)
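However, fread drops the header, so I cannot tell which station each file belongs to.
If I had a list of station names available, I could match the files against it. How can I create a new column based on the station named in the text?
Update:
I managed to read the text file with read_table and then extract the station name after KINGDOM, but for station names that contain spaces only the first word is selected. Since the station name sits at the end of the line, after UNITED KINGDOM, selecting the remaining words after KINGDOM would be enough.
Here is what I have tried so far: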
stringr::str_extract(xp1$xp, '(?<=KINGDOM\\s)\\w+')
For example, if I have Cex et England, I only get Cex.
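The behaviour described above can be reproduced with a minimal sketch in base R. The second header line and its STAID value are made up for illustration, and the assumption is that every header line ends with " (STAID: ...)" as in the sample file.

```r
# Header lines shaped like the one in the sample file. The second station
# name and its STAID are hypothetical, added to show the multi-word case.
lines <- c(
  "This is the blended series of station UNITED KINGDOM, ARMAGH (STAID: 271).",
  "This is the blended series of station UNITED KINGDOM, CEX ET ENGLAND (STAID: 999)."
)

# \\w+ matches letters, digits and underscores only, so it stops at the first space.
first_word <- sub(".*KINGDOM,\\s*(\\w+).*", "\\1", lines, perl = TRUE)

# Capturing everything up to " (STAID" keeps all the words of the name.
full_name <- sub(".*KINGDOM,\\s*(.*?)\\s*\\(STAID.*", "\\1", lines, perl = TRUE)

print(first_word)
print(full_name)
```

Here first_word comes out as "ARMAGH" and "CEX", while full_name comes out as "ARMAGH" and "CEX ET ENGLAND".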
You can use a regular expression to extract the words that appear after .*UNITED KINGDOM, in the file and use them as the station name.
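library(data.table)
# Read each file's data rows, then add the station name parsed from its header.
# trimws() drops the space left between the name and "(STAID".
all.txt <- lapply(files, function(x) transform(fread(x),
                  Station = trimws(sub('.*UNITED KINGDOM, (.*?)\\(.*', '\\1',
                  paste0(readLines(x), collapse = '\n')))))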
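A possible variation, shown only as a sketch, reads just the top of each file to find the station line and then stacks all files into one table. The helper names station_from_header and read_station_file, the 30-line header limit, and the small stand-in file are assumptions for illustration. The skip = "SOUID," argument relies on the column-name row being the first line that contains that exact string.

```r
library(data.table)

# Hypothetical helper: find the station name in the first n_header lines.
station_from_header <- function(path, n_header = 30L) {
  header <- readLines(path, n = n_header, warn = FALSE)
  hit <- grep("UNITED KINGDOM,", header, fixed = TRUE, value = TRUE)
  if (length(hit) == 0L) return(NA_character_)
  # Keep everything between "UNITED KINGDOM," and "(STAID", spaces included.
  sub(".*UNITED KINGDOM,\\s*(.*?)\\s*\\(STAID.*", "\\1", hit[1], perl = TRUE)
}

# Hypothetical helper: read the data rows and tag them with the station.
read_station_file <- function(path) {
  dt <- fread(path, skip = "SOUID,")  # start at the column-name row
  dt[, Station := station_from_header(path)]
  dt[]
}

# Small stand-in file so the sketch runs on its own.
demo <- tempfile(fileext = ".txt")
writeLines(c(
  "01-06 SOUID: Source identifier",
  "This is the blended series of station UNITED KINGDOM, ARMAGH (STAID: 271).",
  "SOUID,    DATE,   TX, Q_TX",
  "146805,18440101,   19,    0",
  "146805,18440102,   -2,    0"
), demo)

# With real data, pass the full vector of file paths instead of demo.
combined <- rbindlist(lapply(demo, read_station_file))
print(combined)
```

Stacking with rbindlist gives a single table in which the Station column distinguishes the rows of each file, which may be easier to work with than a list of 100 separate tables.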