How to pattern match multiple words and replace it in Oracle?

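I have a table with word and sentence columns. I am trying to replace words in each sentence wherever those words exist in the word column.

I tried the code below, but it only works for a single word. I need to replace multiple words when they exist in the word column.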

Create table temp(id NUMBER,
word VARCHAR2(1000),
Sentence VARCHAR2(2000));

insert into temp values (1,'automation testing','automation testing is popular kind of testing');
insert into temp values (2,'testing','manual testing');
insert into temp values (3,'manual testing','this is an old method of testing');

BEGIN
for t1 in (select id, word from temp)
LOOP
    for t2 in (select rownum from temp where sentence is not null)
    LOOP
        update temp 
        set sentence = REPLACE(sentence, t1.word,t1.id)
        where rownum = rownum;
    END LOOP;
END LOOP;
END;

But I need to replace multiple terms when they exist in the word column.

Expected outcome:

id word                   sentence
1  automation testing     1 is popular kind of 2
2  testing                3
3  manual testing         this is an old method of 2

Updated code:

MERGE INTO temp dst
USING (
  WITH ordered_words ( rn, id, word, regex_safe_word ) AS (
    SELECT ROW_NUMBER() OVER ( ORDER BY LENGTH( word ) ASC, word DESC ),
           id,
           word,
           REGEXP_REPLACE( word, '([][)(}{|^$\.*+?])', '\\\1' )
    FROM   temp
  ),
  sentences_with_ids ( rid, sentence, rn ) AS (
    SELECT ROWID,
           sentence,
           ( SELECT COUNT(*) + 1 FROM ordered_words )
    FROM   temp
  UNION ALL
    SELECT s.rid,
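           -- Applied twice because two adjacent matches share one delimiter.
           -- \1 and \2 put the delimiters around the word back.
           REGEXP_REPLACE(
             REGEXP_REPLACE(
               s.sentence,
               '(^|\W)' || w.regex_safe_word || '($|\W)',
               '\1${'|| w.id ||'}\2'
              ),
             '(^|\W)' || w.regex_safe_word || '($|\W)',
             '\1${' || w.id || '}\2'
           ),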
           s.rn - 1
    FROM   sentences_with_ids s
           INNER JOIN ordered_words w
           ON ( s.rn - 1 = w.rn ) 
  ),
  sentences_with_words ( rid, sentence, rn ) AS (
    SELECT rid,
           sentence,
           ( SELECT COUNT(*) + 1 FROM ordered_words )
    FROM   sentences_with_ids
    WHERE  rn = 1
  UNION ALL
    SELECT s.rid,
           REPLACE(
             s.sentence,
             '${' || w.id || '}',
             'http://localhost/' || w.id || '/<u>' || w.word || '</u>'
           ),
           s.rn - 1
    FROM   sentences_with_words s
           INNER JOIN ordered_words w
           ON ( s.rn - 1 = w.rn ) 
  )
  SELECT rid, sentence
  FROM   sentences_with_words
  WHERE  rn = 1
) src
ON ( dst.ROWID = src.RID )
WHEN MATCHED THEN
  UPDATE
  SET    sentence = src.sentence;

Can we improve the performance of the update query above?

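Use REGEXP_REPLACE for the replacement. Process the words in descending order of length, so that occurrences of "automation testing" are replaced before occurrences of "testing".

Sample code: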

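-- Requires Oracle 12c or later (PL/SQL function declared in the WITH clause).
with function word_replace ( p_sentence VARCHAR2 ) RETURN VARCHAR2 IS
  -- Sized for the SENTENCE column (VARCHAR2(2000)); 800 could overflow.
  l_working VARCHAR2(4000) := p_sentence;
BEGIN
  -- Longest words first, so a phrase is replaced before any shorter word inside it.
  FOR r IN ( SELECT word, id FROM temp ORDER BY length(word) desc, id ) LOOP
    -- r.word is used as a regular expression here.
    l_working := regexp_replace(l_working, r.word, r.id);
  END LOOP;
  return l_working;
END;
SELECT sentence, word_replace(sentence) 
FROM   temp;
/
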
+-----------------------------------------------+----------------------------+
|                   SENTENCE                    |   WORD_REPLACE(SENTENCE)   |
+-----------------------------------------------+----------------------------+
| automation testing is popular kind of testing | 1 is popular kind of 2     |
| manual testing                                | 3                          |
| this is an old method of testing              | this is an old method of 2 |
+-----------------------------------------------+----------------------------+
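
Why the ordering matters can be seen without the table at all. The two queries below are only an illustrative sketch: they hard-code the first sample sentence and use plain REPLACE against DUAL instead of the function above.

```sql
-- Longest word first: the phrase is consumed before its substring.
SELECT REPLACE(
         REPLACE('automation testing is popular kind of testing',
                 'automation testing', '1'),
         'testing', '2') AS longest_first
FROM   dual;
-- returns: 1 is popular kind of 2

-- Shortest word first: "testing" is replaced inside the phrase,
-- so "automation testing" no longer matches anything.
SELECT REPLACE(
         REPLACE('automation testing is popular kind of testing',
                 'testing', '2'),
         'automation testing', '1') AS shortest_first
FROM   dual;
-- returns: automation 2 is popular kind of 2
```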

This may be more complicated than you want, but it turns out everything can be done in a single SQL statement.

merge into temp o
using (
  select s_rid, sentence, is_last from (
    select s.rowid s_rid, w.id word_id, w.word,
      cast(replace(s.sentence, w.word, w.id) as varchar2(4000)) sentence,
      length(w.word) word_length
    from temp w join temp s
    on instr(s.sentence, w.word) > 0
  )
  model
    partition by (s_rid)
    dimension by (
      row_number() over(partition by s_rid order by word_length desc, word) rn
    )
    measures(word_id, word, sentence, 0 is_last)
  rules (
    sentence[rn > 1] = replace(sentence[cv()-1], word[cv()], word_id[cv()]),
    is_last[any] = presentv(is_last[cv()+1], 0, 1)
  )
) n
on (o.rowid = n.s_rid and n.is_last = 1)
when matched then update set o.sentence = n.sentence;

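Note that if you run the MERGE twice, the second run makes no changes. This shows that no unnecessary updates are performed. The logic:

  1. First, I join the words to the sentences in which they are found. This eliminates any sentence that needs no replacement. I keep the ROWID of the sentence for later use.
  2. The MODEL clause partitions by sentence and orders by word length, descending.
  3. The "rules" in the MODEL clause perform each replacement in turn and identify the last row for each original sentence.
  4. With MERGE, I can use the ROWID to join the result to the table and then update the SENTENCE column.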
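
Step 1 can be inspected on its own. The query below is a sketch, not part of the answer's MERGE: it runs the same self-join and shows the sentence ID instead of the ROWID so the output is readable. The expected rows assume the three sample rows exactly as inserted in the question, before any update.

```sql
-- Pair every sentence with each word it contains, longest word first.
select s.id sentence_id, w.id word_id, w.word,
  length(w.word) word_length
from temp w join temp s
on instr(s.sentence, w.word) > 0
order by s.id, word_length desc, w.word;

-- Expected with the sample data:
-- SENTENCE_ID  WORD_ID  WORD                WORD_LENGTH
--           1        1  automation testing           18
--           1        2  testing                       7
--           2        3  manual testing               14
--           2        2  testing                       7
--           3        2  testing                       7
```

A sentence that contains none of the words produces no row here, which is why the MERGE never touches it.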

Regards, Stew Ashton

Here is the same logic as the MERGE with the MODEL clause, but using PL/SQL.

declare
  cursor cur_temp is
    select s.rowid s_rid, w.id word_id, w.word, s.sentence,
      length(w.word) word_length
    from temp w join temp s
    on instr(s.sentence, w.word) > 0
    order by s_rid, word_length desc;
  l_rid rowid;
  l_sentence varchar2(4000);
  procedure upd is
  begin
    update temp set sentence = l_sentence where rowid = l_rid;
  end upd;
begin
  for rec in cur_temp loop
    if rec.s_rid > l_rid then
      upd;
      l_sentence := null;
    end if;
    l_rid := rec.s_rid;
    l_sentence := replace(nvl(l_sentence, rec.sentence), rec.word, rec.word_id);
  end loop;
  upd;
end;
/

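Regards, Stew

Since everyone is probably allergic to the MODEL clause, here is an alternative using a recursive subquery, which is standard SQL. This is only the SELECT part; it can be placed in the USING clause of a MERGE statement.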

with joined_data as (
  select s.rowid s_rid, w.id word_id, w.word, s.sentence,
  row_number() over(partition by s.rowid order by length(w.word) desc) rn
  from temp w join temp s
  on instr(s.sentence, w.word) > 0
)
, recursed_data(s_rid, sentence, rn) as (
  select s_rid, replace(sentence, word, word_id), rn
  from joined_data
  where rn = 1
  union all
  select n.s_rid, replace(o.sentence, n.word, n.word_id), n.rn
  from recursed_data o 
  join joined_data n on o.s_rid = n.s_rid and n.rn = o.rn + 1
)
select s_rid,
max(sentence) keep (dense_rank last order by rn) sentence
from recursed_data
group by s_rid;
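
For completeness, one way the SELECT might be wrapped in a MERGE is sketched below. It is untested here and simply reuses the aliases o and n and the ON/UPDATE shape of the MODEL-based MERGE earlier in this thread; the query itself is unchanged.

```sql
merge into temp o
using (
  with joined_data as (
    select s.rowid s_rid, w.id word_id, w.word, s.sentence,
    row_number() over(partition by s.rowid order by length(w.word) desc) rn
    from temp w join temp s
    on instr(s.sentence, w.word) > 0
  )
  , recursed_data(s_rid, sentence, rn) as (
    select s_rid, replace(sentence, word, word_id), rn
    from joined_data
    where rn = 1
    union all
    select n.s_rid, replace(o.sentence, n.word, n.word_id), n.rn
    from recursed_data o
    join joined_data n on o.s_rid = n.s_rid and n.rn = o.rn + 1
  )
  -- Keep only the fully replaced sentence: the one with the highest rn.
  select s_rid,
  max(sentence) keep (dense_rank last order by rn) sentence
  from recursed_data
  group by s_rid
) n
on (o.rowid = n.s_rid)
when matched then update set o.sentence = n.sentence;
```

Because the subquery returns exactly one row per ROWID, no is_last flag is needed in the ON clause.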

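Regards, Stew

2019-09-13 13:51 UTC - I understand now what you are after. You want to replace words, not strings, and the new strings will contain the words. So you are doing a first series of replacements, then a second series.

To speed things up, I would still JOIN the words to the sentences that contain them. Here is the SELECT part of the solution (the part that goes in the USING clause), so you can check what is happening and see some results sooner. Some explanation follows the code.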

with words(id, word, word_length, search1, replace1, search2, replace2) as (
  select id, word, length(word),
  '(^|\W)' || REGEXP_REPLACE(word, '([][)(}{|^$\.*+?])', '\\\1') || '($|\W)',
  '\1{'|| id ||'}\2',
  '{'|| id ||'}',
  'http://localhost/' || id || '/<u>' || word || '</u>'
  FROM temp
)
, joined_data as (
  select w.search1, w.replace1, w.search2, w.replace2,
    s.rowid s_rid, s.sentence,
    row_number() over(partition by s.rowid order by word_length desc) rn
  from words w
  join temp s
  on instr(s.sentence, w.word) > 0
  and regexp_like(s.sentence, w.search1)
)
, unpivoted_data as (
  select S_RID, SENTENCE, PHASE, SEARCH_STRING, REPLACE_STRING,
    row_number() over(partition by s_rid order by phase, rn) rn,
    case when row_number() over(partition by s_rid order by phase, rn)
      = count(*) over(partition by s_rid)
      then 1
      else 0
    end is_last
  from joined_data
  unpivot(
    (search_string, replace_string) 
    for phase in ( (search1, replace1) as 1, (search2, replace2) as 2 ))
)
, replaced_data(S_RID, RN, is_last, SENTENCE) as (
  select S_RID, RN, is_last,
    regexp_replace(SENTENCE, search_string, replace_string)
  from unpivoted_data
  where rn = 1
  union all
  select n.S_RID, n.RN, n.is_last,
    case when n.phase = 1
      then regexp_replace(o.SENTENCE, n.search_string, n.replace_string)
      else replace(o.SENTENCE, n.search_string, n.replace_string)
    end
  from unpivoted_data n
  join replaced_data o
    on o.s_rid = n.s_rid and n.rn = o.rn + 1  
)
select s_rid, sentence from replaced_data
where is_last = 1
order by s_rid;

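The WORDS subquery gets the word length plus the "search" and "replace" strings for the two series of replacements.

JOINED_DATA joins the words to the sentences. I do the INSTR first and the REGEXP only when needed, because it costs more CPU.

UNPIVOTED_DATA splits each row into two phases: the first phase replaces "automation testing" with "{1}", and the second phase replaces "{1}" with "http://localhost/1/<u>automation testing</u>". The rows are put in the correct order, and the "last" row for each sentence is identified.

REPLACED_DATA does either a REGEXP_REPLACE or a REPLACE, depending on the phase. For the second phase, REPLACE is enough.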
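
The two regular-expression pieces of the WORDS subquery can be tried in isolation. This is a small sketch against DUAL with made-up input strings ('c++ (beta)' and 'pretesting' are not from the question). It assumes the usual Oracle conventions that `\\` in the replacement string is a literal backslash and `\1`, `\2` are backreferences.

```sql
-- Escaping: every regex metacharacter in the word should come back
-- prefixed with a backslash, so the word can be used as a literal pattern.
SELECT REGEXP_REPLACE('c++ (beta)', '([][)(}{|^$\.*+?])', '\\\1') AS regex_safe_word
FROM   dual;

-- Whole-word match: "testing" is replaced only when it is delimited by
-- the start or end of the string or by a non-word character.
-- \1 and \2 put the delimiters back around the placeholder.
SELECT REGEXP_REPLACE('manual testing of pretesting',
                      '(^|\W)testing($|\W)',
                      '\1{2}\2') AS phase1
FROM   dual;
-- returns: manual {2} of pretesting
```

The second query is also why INSTR alone is not enough in JOINED_DATA: INSTR would find "testing" inside "pretesting", while the REGEXP_LIKE check rejects it.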

Regards, Stew