How to pattern match multiple words and replace them in Oracle?
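I have a table with a word column and a sentence column. I am trying to replace words in the sentence whenever they exist in the word column.
I tried the code below, but it only works for a single word. I need to replace multiple words if they exist in the word column.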
Create table temp(id NUMBER,
word VARCHAR2(1000),
Sentence VARCHAR2(2000));
insert into temp values (1,'automation testing','automation testing is popular kind of testing');
insert into temp values (2,'testing','manual testing');
insert into temp values (3,'manual testing','this is an old method of testing');
BEGIN
for t1 in (select id, word from temp)
LOOP
for t2 in (select rownum from temp where sentence is not null)
LOOP
update temp
set sentence = REPLACE(sentence, t1.word,t1.id)
where rownum = rownum;
END LOOP;
END LOOP;
END;
But I need to replace multiple terms if they exist in the word column.
Expected outcome:
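+----+--------------------+----------------------------+
| ID | WORD               | SENTENCE                   |
+----+--------------------+----------------------------+
|  1 | automation testing | 1 is popular kind of 2     |
|  2 | testing            | 3                          |
|  3 | manual testing     | this is an old method of 2 |
+----+--------------------+----------------------------+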
Updated code:
MERGE INTO temp dst
USING (
WITH ordered_words ( rn, id, word, regex_safe_word ) AS (
SELECT ROW_NUMBER() OVER ( ORDER BY LENGTH( word ) ASC, word DESC ),
id,
word,
REGEXP_REPLACE( word, '([][)(}{|^$\.*+?])', '\\\1' ) -- prefix each regex metacharacter with a backslash
FROM temp
),
sentences_with_ids ( rid, sentence, rn ) AS (
SELECT ROWID,
sentence,
( SELECT COUNT(*) + 1 FROM ordered_words )
FROM temp
UNION ALL
SELECT s.rid,
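-- \1 and \2 put back the delimiters captured around the word;
-- the replacement runs twice because adjacent matches share a delimiter
REGEXP_REPLACE(
REGEXP_REPLACE(
s.sentence,
'(^|\W)' || w.regex_safe_word || '($|\W)',
'\1${' || w.id || '}\2'
),
'(^|\W)' || w.regex_safe_word || '($|\W)',
'\1${' || w.id || '}\2'
),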
s.rn - 1
FROM sentences_with_ids s
INNER JOIN ordered_words w
ON ( s.rn - 1 = w.rn )
),
sentences_with_words ( rid, sentence, rn ) AS (
SELECT rid,
sentence,
( SELECT COUNT(*) + 1 FROM ordered_words )
FROM sentences_with_ids
WHERE rn = 1
UNION ALL
SELECT s.rid,
REPLACE(
s.sentence,
'${' || w.id || '}',
'http://localhost/' || w.id || '/<u>' || w.word || '</u>'
),
s.rn - 1
FROM sentences_with_words s
INNER JOIN ordered_words w
ON ( s.rn - 1 = w.rn )
)
SELECT rid, sentence
FROM sentences_with_words
WHERE rn = 1
) src
ON ( dst.ROWID = src.RID )
WHEN MATCHED THEN
UPDATE
SET sentence = src.sentence;
Can we improve the performance of the update query above?
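Use REGEXP_REPLACE to do the replacement. Process the words in descending order of length, so that occurrences of "automation testing" are replaced before occurrences of "testing".
Sample code: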
with function word_replace ( p_sentence VARCHAR2 ) RETURN VARCHAR2 IS
  -- sized to hold any SENTENCE value (the column is VARCHAR2(2000))
  l_working VARCHAR2(4000) := p_sentence;
BEGIN
  -- longest words first, so "automation testing" is replaced before "testing"
  FOR r IN ( SELECT word, id FROM temp ORDER BY length(word) desc, id ) LOOP
    l_working := regexp_replace(l_working, r.word, r.id);
  END LOOP;
  return l_working;
END;
SELECT sentence, word_replace(sentence)
FROM temp;
+-----------------------------------------------+----------------------------+
| SENTENCE                                      | WORD_REPLACE(SENTENCE)     |
+-----------------------------------------------+----------------------------+
| automation testing is popular kind of testing | 1 is popular kind of 2     |
| manual testing                                | 3                          |
| this is an old method of testing              | this is an old method of 2 |
+-----------------------------------------------+----------------------------+
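To see why the ordering matters, here is a small sketch that applies the same two replacements to the first sample sentence in both orders. It uses plain REPLACE on literals rather than the TEMP table, and the column aliases are only illustrative.

```sql
-- Same two replacements, applied in opposite orders.
select
  -- shortest word first: "testing" is consumed before
  -- "automation testing" can be found
  replace(
    replace('automation testing is popular kind of testing', 'testing', '2'),
    'automation testing', '1'
  ) as shortest_first,   -- automation 2 is popular kind of 2
  -- longest word first: the two-word term is replaced as a whole
  replace(
    replace('automation testing is popular kind of testing', 'automation testing', '1'),
    'testing', '2'
  ) as longest_first     -- 1 is popular kind of 2
from dual;
```

Only the longest-first order gives the expected outcome from the question.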
I imagine this is too complicated for you, but it turns out that all of this can be done in a single SQL statement.
merge into temp o
using (
select s_rid, sentence, is_last from (
select s.rowid s_rid, w.id word_id, w.word,
cast(replace(s.sentence, w.word, w.id) as varchar2(4000)) sentence,
length(w.word) word_length
from temp w join temp s
on instr(s.sentence, w.word) > 0
)
model
partition by (s_rid)
dimension by (
row_number() over(partition by s_rid order by word_length desc, word) rn
)
measures(word_id, word, sentence, 0 is_last)
rules (
sentence[rn > 1] = replace(sentence[cv()-1], word[cv()], word_id[cv()]),
is_last[any] = presentv(is_last[cv()+1], 0, 1)
)
) n
on (o.rowid = n.s_rid and n.is_last = 1)
when matched then update set o.sentence = n.sentence;
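Note that if you run the MERGE twice, the second run makes no changes. This shows that no unnecessary updates are done. The logic:
- First, I join words to sentences wherever the word is found in the sentence. This eliminates any sentence that needs no replacement. I keep the ROWID of the sentence for later use.
- The MODEL clause partitions by sentence and orders by word length descending.
- The "rule" in the MODEL clause does each replacement in turn and identifies the last row for each original sentence.
- Using MERGE, I can join the result to the table by ROWID and update the SENTENCE column.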
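The first step above can be checked on its own. The following is a minimal sketch of just the join, using the sample TEMP table from the question; the column aliases are illustrative and not part of the MERGE.

```sql
-- Which words occur in which sentences, longest word first per sentence.
-- Sentences that contain no word at all produce no row here,
-- so they are never touched by the update.
select s.id   as sentence_id,
       w.id   as word_id,
       w.word,
       s.sentence
from temp w
join temp s
  on instr(s.sentence, w.word) > 0
order by s.id, length(w.word) desc, w.word;
```

With the three sample rows this should return five rows: sentence 1 pairs with words 1 and 2, sentence 2 with words 3 and 2, and sentence 3 with word 2. Each of those rows becomes one replacement step in the MODEL rules.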
Regards,
Stew Ashton
Here is the same logic as the MERGE with the MODEL clause, but using PL/SQL.
declare
cursor cur_temp is
select s.rowid s_rid, w.id word_id, w.word, s.sentence,
length(w.word) word_length
from temp w join temp s
on instr(s.sentence, w.word) > 0
order by s_rid, word_length desc;
l_rid rowid;
l_sentence varchar2(4000);
procedure upd is
begin
update temp set sentence = l_sentence where rowid = l_rid;
end upd;
begin
for rec in cur_temp loop
if rec.s_rid > l_rid then
upd;
l_sentence := null;
end if;
l_rid := rec.s_rid;
l_sentence := replace(nvl(l_sentence, rec.sentence), rec.word, rec.word_id);
end loop;
upd;
end;
/
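Regards,
Stew
Since everyone may be allergic to the MODEL clause, here is an alternative that uses recursive subquery factoring, which is standard SQL. This is just the SELECT part; it can be placed in the USING clause of the MERGE statement.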
with joined_data as (
select s.rowid s_rid, w.id word_id, w.word, s.sentence,
row_number() over(partition by s.rowid order by length(w.word) desc) rn
from temp w join temp s
on instr(s.sentence, w.word) > 0
)
, recursed_data(s_rid, sentence, rn) as (
select s_rid, replace(sentence, word, word_id), rn
from joined_data
where rn = 1
union all
select n.s_rid, replace(o.sentence, n.word, n.word_id), n.rn
from recursed_data o
join joined_data n on o.s_rid = n.s_rid and n.rn = o.rn + 1
)
select s_rid,
max(sentence) keep (dense_rank last order by rn) sentence
from recursed_data
group by s_rid;
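For completeness, this is roughly what the statement might look like once that SELECT sits in the USING clause. It is an untested sketch: the outer aliases dst and src are my own choice, and the ON and UPDATE clauses are copied from the MERGE with MODEL shown earlier.

```sql
merge into temp dst
using (
  with joined_data as (
    -- one row per (sentence, word found in it), longest word first
    select s.rowid s_rid, w.id word_id, w.word, s.sentence,
           row_number() over(partition by s.rowid order by length(w.word) desc) rn
    from temp w join temp s
      on instr(s.sentence, w.word) > 0
  )
  , recursed_data(s_rid, sentence, rn) as (
    -- anchor: first replacement for each sentence
    select s_rid, replace(sentence, word, word_id), rn
    from joined_data
    where rn = 1
    union all
    -- recursion: apply the next word to the previous result
    select n.s_rid, replace(o.sentence, n.word, n.word_id), n.rn
    from recursed_data o
    join joined_data n on o.s_rid = n.s_rid and n.rn = o.rn + 1
  )
  -- keep only the fully replaced sentence per row
  select s_rid,
         max(sentence) keep (dense_rank last order by rn) sentence
  from recursed_data
  group by s_rid
) src
on (dst.rowid = src.s_rid)
when matched then update set dst.sentence = src.sentence;
```

Because the GROUP BY already keeps only the last step per sentence, no is_last flag is needed in the ON clause.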
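Regards,
Stew
2019-09-13 13:51 UTC - I now understand what you are after. You want to replace words, not arbitrary strings, and the new strings will contain the words themselves. So you are doing a first series of replacements, then a second series.
To speed things up, I would still JOIN the words to the sentences that contain them. Here is the SELECT part of the solution (the part that goes into the USING clause), so you can check what is happening and see some results faster. Some explanation follows the code.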
with words(id, word, word_length, search1, replace1, search2, replace2) as (
select id, word, length(word),
'(^|\W)' || REGEXP_REPLACE(word, '([][)(}{|^$\.*+?])', '\\\1') || '($|\W)',
'\1{'|| id ||'}\2',
'{'|| id ||'}',
'http://localhost/' || id || '/<u>' || word || '</u>'
FROM temp
)
, joined_data as (
select w.search1, w.replace1, w.search2, w.replace2,
s.rowid s_rid, s.sentence,
row_number() over(partition by s.rowid order by word_length desc) rn
from words w
join temp s
on instr(s.sentence, w.word) > 0
and regexp_like(s.sentence, w.search1)
)
, unpivoted_data as (
select S_RID, SENTENCE, PHASE, SEARCH_STRING, REPLACE_STRING,
row_number() over(partition by s_rid order by phase, rn) rn,
case when row_number() over(partition by s_rid order by phase, rn)
= count(*) over(partition by s_rid)
then 1
else 0
end is_last
from joined_data
unpivot(
(search_string, replace_string)
for phase in ( (search1, replace1) as 1, (search2, replace2) as 2 ))
)
, replaced_data(S_RID, RN, is_last, SENTENCE) as (
select S_RID, RN, is_last,
regexp_replace(SENTENCE, search_string, replace_string)
from unpivoted_data
where rn = 1
union all
select n.S_RID, n.RN, n.is_last,
case when n.phase = 1
then regexp_replace(o.SENTENCE, n.search_string, n.replace_string)
else replace(o.SENTENCE, n.search_string, n.replace_string)
end
from unpivoted_data n
join replaced_data o
on o.s_rid = n.s_rid and n.rn = o.rn + 1
)
select s_rid, sentence from replaced_data
where is_last = 1
order by s_rid;
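The WORDS subquery gets the word length and the "search" and "replace" strings for both series of replacements.
JOINED_DATA joins the words to the sentences. I do the INSTR first and the REGEXP only when needed, because the regular expression costs more CPU.
UNPIVOTED_DATA splits the rows into two phases: the first phase replaces "automation testing" with "{1}", and the second phase replaces "{1}" with "http://localhost/1/<u>automation testing</u>". The rows are assigned the correct order, and the "last" row for each sentence is identified.
REPLACED_DATA does either REGEXP_REPLACE or REPLACE depending on the phase. For the second phase, REPLACE is enough.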
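The two phases can be tried on literals before running the full query. This is a small sketch that assumes the phase 1 replacement string keeps the surrounding delimiters through the backreferences \1 and \2; the sample strings and column aliases are made up for illustration.

```sql
select
  -- escaping: every regex metacharacter gets a leading backslash
  regexp_replace('a.b (c)', '([][)(}{|^$\.*+?])', '\\\1')
    as regex_safe_word,   -- a\.b \(c\)
  -- phase 1: whole-word match, delimiters restored by \1 and \2
  regexp_replace('manual testing', '(^|\W)testing($|\W)', '\1{2}\2')
    as phase1,            -- manual {2}
  -- "testing" inside a longer word is left alone
  regexp_replace('contesting', '(^|\W)testing($|\W)', '\1{2}\2')
    as no_match,          -- contesting
  -- phase 2: the placeholder is a fixed string, so REPLACE is enough
  replace('manual {2}', '{2}', 'http://localhost/2/<u>testing</u>')
    as phase2             -- manual http://localhost/2/<u>testing</u>
from dual;
```

The placeholder step is what prevents the word inside an already inserted link from being matched again by a later, shorter word.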
Regards,
Stew