Replace an entire field value in a file using awk or other tools

Question

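I exported a PostgreSQL table with many fields, including booleans (which PostgreSQL exports as the characters t and f), and I need to import it into another database (MonetDB) that does not understand t/f as boolean values.

(EDIT: removed the spaces to reflect what the real file looks like and to avoid angry comments; spaces were shown previously.)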

id|val_str|bool_1|bool2|bool_3|bool4|
1|help|t|t|f|t|
2|test|f|t|f|f|
...

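Since I can't simply replace every occurrence of t/f, I need to include the field separator in my pattern. I tried using awk to replace t fields with TRUE and f fields with FALSE: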

awk -F'|' '{gsub(/\|t\|/, "|TRUE|"); gsub(/\|f\|/, "|FALSE|"); print;}'

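This works only partially: when consecutive fields hold the same value (|t|t|), only the first occurrence is replaced (|TRUE|t|), because the second occurrence is actually t| and not |t|.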

id|val_str|bool_1|bool2|bool_3|bool4|
1|help|TRUE|t|FALSE|TRUE|
2|test|FALSE|TRUE|FALSE|f|
...

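The table has ~450 columns, so I can't really specify the list of columns to be replaced, nor work in Postgres to 'transform' the boolean columns (I could, but ...).

I could run gsub() twice, but I was looking for a more elegant way to match the entire content of a field, for all fields.

gsub(/^t$/, ...) doesn't help either, since most of the time we are in the middle of a line.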

Answer 1

If perl is okay, you can use lookarounds:

$ cat ip.txt 
id |  val_str  | bool_1 | bool2  | bool_3 | bool4  | 
1  |    help   |   t    |   t    |   f    |   t    |
2  |    test   |   f    |   t    |   f    |   f    | 

$ perl -pe 's/\|\K\h*t\h*(?=\|)/  TRUE  /g; s/\|\K\h*f\h*(?=\|)/  FALSE /g' ip.txt 
id |  val_str  | bool_1 | bool2  | bool_3 | bool4  | 
1  |    help   |  TRUE  |  TRUE  |  FALSE |  TRUE  |
2  |    test   |  FALSE |  TRUE  |  FALSE |  FALSE |

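\|\K matches | but keeps it out of the replaced text, acting like a positive lookbehind for |
\h* optional horizontal whitespace; remove it if the input has no spaces
(?=\|) positive lookahead for |

A sed loop works too. Tested on GNU sed 4.2.2; the syntax may differ for other implementations.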

$ sed ':a s/| *t *|/|  TRUE  |/;ta; :b s/| *f *|/|  FALSE |/;tb' ip.txt 
id |  val_str  | bool_1 | bool2  | bool_3 | bool4  | 
1  |    help   |  TRUE  |  TRUE  |  FALSE |  TRUE  |
2  |    test   |  FALSE |  TRUE  |  FALSE |  FALSE |

:a label
s/| *t *|/|  TRUE  |/ substitute command
ta branch to label a as long as the substitute command succeeded
likewise for :b
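
To see why the loop is needed, here is a small sketch that compares a single-pass g substitution with the looped form on one line of the space-free sample. Splitting the script into separate -e expressions is my own choice, made on the assumption that it keeps the labels portable across sed implementations; the variable names are illustrative only.

```shell
#!/bin/sh
line='1|help|t|t|f|t|'

# Single pass with the g flag: matches cannot overlap, so the second of two
# adjacent |t| fields is skipped (its leading | was already consumed).
single_pass=$(printf '%s\n' "$line" | sed 's/|t|/|TRUE|/g')

# Looped form: retry the substitution until it no longer succeeds.
looped=$(printf '%s\n' "$line" | sed -e ':a' -e 's/|t|/|TRUE|/' -e 'ta')

printf '%s\n' "$single_pass"   # 1|help|TRUE|t|f|TRUE|
printf '%s\n' "$looped"        # 1|help|TRUE|TRUE|f|TRUE|
```

The loop terminates because the replacement text TRUE contains no lowercase t, so each pass strictly reduces the number of remaining |t| matches.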

If the input has no spaces:

perl -pe 's/\|\Kt(?=\|)/TRUE/g; s/\|\Kf(?=\|)/FALSE/g' ip.txt 
sed ':a s/|t|/|TRUE|/;ta; :b s/|f|/|FALSE|/;tb' ip.txt 
awk 'BEGIN{FS=OFS="|"} {for(i=1;i<=NF;i++){if($i=="t"){$i="TRUE"} if($i=="f"){$i="FALSE"}} print}' ip.txt

Answer 2

Use sed, which is standard.

sed 's/| *t */| TRUE /g;s/| *f */| FALSE /g'

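This tells sed to replace every substring made up of a pipe character, an unknown number of spaces (possibly zero), a t, and then another unknown number of spaces with | TRUE ; and likewise for f.

If the line lengths get messed up, pipe the output through column -t.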

Answer 3

Assuming (based on your comments) that your input file doesn't actually look like the sample you posted, and instead looks like this:

$ cat file
id|val_str|bool_1|bool2|bool_3|bool4|
1|help|t|t|f|t|
2|test|f|t|f|f|

then all you need is:

$ awk '{while(gsub(/\|t\|/,"|TRUE|")); while(gsub(/\|f\|/,"|FALSE|"));}1' file
id|val_str|bool_1|bool2|bool_3|bool4|
1|help|TRUE|TRUE|FALSE|TRUE|
2|test|FALSE|TRUE|FALSE|FALSE|

The general solution for N replacement strings is:

$ awk 'BEGIN{m["f"]="FALSE"; m["t"]="TRUE"} {for (k in m) while(gsub("\\|"k"\\|","|"m[k]"|"));} 1' file
id|val_str|bool_1|bool2|bool_3|bool4|
1|help|TRUE|TRUE|FALSE|TRUE|
2|test|FALSE|TRUE|FALSE|FALSE|
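
As an alternative sketch of the same mapping idea, the lookup can be done per field instead of per regex match, which sidesteps the overlapping-delimiter issue entirely. This is not the answer's method; map_fields is a hypothetical helper name, and the sketch assumes the space-free, pipe-delimited input shown above.

```shell
#!/bin/sh
# map_fields: hypothetical helper; replaces any field whose whole value
# is a key in the map m, and leaves every other field untouched.
map_fields() {
  awk 'BEGIN { FS = OFS = "|"; m["t"] = "TRUE"; m["f"] = "FALSE" }
       { for (i = 1; i <= NF; i++) if ($i in m) $i = m[$i]; print }' "$@"
}

result=$(printf '%s\n' 'id|val_str|bool_1|bool2|' '1|help|t|f|' '2|test|f|t|' | map_fields)
printf '%s\n' "$result"
```

Because the comparison is against the whole field, a value such as test is never touched, but a text column whose value is exactly t or f would still be converted, just as with the regex versions.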

Answer 4

Table has ~450 columns so I can't really specify the list of columns to be replaced, nor work in postgres to 'transform' boolean columns (I could but ...).

You can let Postgres do the work for you. A basic query to generate the SELECT list:

SELECT string_agg(CASE WHEN atttypid = 'bool'::regtype
                       THEN quote_ident(attname) || '::text'
                       ELSE quote_ident(attname) END, ', ' ORDER BY attnum)
FROM   pg_attribute
WHERE  attrelid = 'mytable'::regclass  -- provide table name here
AND    attnum > 0
AND    NOT attisdropped;

Produces a string of the form:

col1, "CoL 2", bool1::text, "Bool 2"::text

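All identifiers are escaped properly. Columns are in default order. Copy and execute it. Use COPY to export to a file (or \copy in psql). Performance is about the same as exporting the plain table. If you don't need upper case, omit upper().

Why is a simple cast to text enough?

About regclass and escaping identifiers properly: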

Table name as a PostgreSQL function parameter

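If you need the complete statement with TRUE / FALSE / NULL in upper case, standard SQL cast notation (without the :: colons), still the original column names, and possibly a schema-qualified table name: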

SELECT 'SELECT '
     || string_agg(CASE WHEN atttypid = 'bool'::regtype
                        THEN format('upper(cast(%1$I AS text)) AS %1$I', attname)
                        ELSE quote_ident(attname) END, ', ' ORDER BY attnum)
     || ' FROM myschema.mytable;'           -- provide table name twice now
FROM   pg_attribute
WHERE  attrelid = 'myschema.mytable'::regclass
AND    attnum > 0
AND    NOT attisdropped;

Produces a complete statement of the form:

SELECT col1, "CoL 2", upper(cast(bool1 AS text)) AS bool1, upper(cast("Bool 2" AS text)) AS "Bool 2" FROM myschema.mytable;


postgresql

bash

awk

monetdb