Python regex with a variable not working when it contains specific characters

Question

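I am working with a data frame of medicines, and I want to extract the dosage from the full-sentence product description. Each active substance (DCI) has a dosage, and the substances are held in a list. The dosage for each DCI usually comes right after its name in the description.

I am using: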

teste=[]
for x in listofdci:
   teste2 = [f"{x}{y}" for x,y in re.findall(rf"(?:{x})\s*(\d+(?:[.,]\d+)*)\s*(g|mg|)",description)]
   teste.extend(teste2)

This works well except when the variable contains () or +, for example:

listofdci = [' Acid. L(+)-lacticum D4']
description = ' Acid. L(+)-lacticum D4 250 mg'
#error: nothing to repeat

#

listofdci = ['Zinkoxid', '(+/–)-α-Bisabolol', 'Lebertran (Typ A)', 'Retinol (Vitamin A)', 'Colecalciferol (Vitamin D3)']
description = 'Zinkoxid 13 g, (+/–)-α-Bisabolol 0,026 g (eingesetzt als Dragosantol-Zubereitung), Lebertran (Typ A) 5,2 g, Retinol (Vitamin A) 24,5 mg (entspr. 41 600 I.E. Retinolpalmitat [enth. Butylhydroxyanisol, Butylhydroxytoluol]), Colecalciferol (Vitamin D3) 10,4 mg (entspr. 10 400 I.E. mittelkettige Triglyceride [enth. all-rac-α-Tocopherol])'
#error: nothing to repeat
#Here it collects the first dosage -> ['13g'] and then raises the error

#

listofdci = [' Efeublätter-Trockenextrakt']
description = ' Efeublätter-Trockenextrakt (5-7,5:1) 65 mg - Auszugsmittel: Ethanol 30% (m/m)'
#[]
#here it outputs an empty list

Ideally I would like:

listofdci = [' Acid. L(+)-lacticum D4']
description = ' Acid. L(+)-lacticum D4 250 mg'
#['250mg']

#

listofdci = ['Zinkoxid', '(+/–)-α-Bisabolol', 'Lebertran (Typ A)', 'Retinol (Vitamin A)', 'Colecalciferol (Vitamin D3)']
description = 'Zinkoxid 13 g, (+/–)-α-Bisabolol 0,026 g (eingesetzt als Dragosantol-Zubereitung), Lebertran (Typ A) 5,2 g, Retinol (Vitamin A) 24,5 mg (entspr. 41 600 I.E. Retinolpalmitat [enth. Butylhydroxyanisol, Butylhydroxytoluol]), Colecalciferol (Vitamin D3) 10,4 mg (entspr. 10 400 I.E. mittelkettige Triglyceride [enth. all-rac-α-Tocopherol])'
#['13g','0,026','5,2g','24,5','10,4']

#

listofdci = [' Efeublätter-Trockenextrakt']
description = ' Efeublätter-Trockenextrakt (5-7,5:1) 65 mg - Auszugsmittel: Ethanol 30% (m/m)'
#[65mg]

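Apart from possibly removing every () or + from the dataset, I don't know how to avoid this particular problem. Also, because these characters can appear anywhere in the string, I don't think I can use a character set to identify them: '[]'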

Answer 1

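The keyword has to be escaped with re.escape so that characters such as (, ) and + are matched literally instead of being parsed as regex syntax. If there can also be an optional substring inside parentheses between the keyword and the number, you can use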

teste=[]
for x in listofdci:
    test2 = [f"{x}{y}" for x,y in re.findall(rf"{re.escape(x)}(?:\s*\([^()]*\))?\s*(\d+(?:[.,]\d+)*)\s*(m?g\b|)", description)]
    if test2:
        teste.extend(test2)

See the Python demo.

Details:
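For reference, here is a self-contained sketch of the same loop, wrapped in a helper called extract_dosages. The function name and the wrapper are assumptions made for illustration; the pattern itself is the one shown above, run against two of the sample descriptions from the question:

```python
import re

def extract_dosages(names, description):
    # Hypothetical helper: collects "<number><unit>" for every name that is found.
    results = []
    for name in names:
        pattern = (
            re.escape(name)              # keyword, with (, ), + etc. made literal
            + r"(?:\s*\([^()]*\))?"      # optional "(...)" right after the keyword
            + r"\s*(\d+(?:[.,]\d+)*)"    # the number, e.g. 250 or 0,026
            + r"\s*(m?g\b|)"             # unit: g, mg, or nothing
        )
        results.extend(f"{num}{unit}" for num, unit in re.findall(pattern, description))
    return results

print(extract_dosages([' Acid. L(+)-lacticum D4'],
                      ' Acid. L(+)-lacticum D4 250 mg'))
# ['250mg']
print(extract_dosages([' Efeublätter-Trockenextrakt'],
                      ' Efeublätter-Trockenextrakt (5-7,5:1) 65 mg - Auszugsmittel: Ethanol 30% (m/m)'))
# ['65mg']
```

The unit is kept whenever it is g or mg; for any other unit the second group falls back to the empty string, so only the number is returned.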

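{re.escape(x)} - the escaped keyword, so every special character in it is matched literally
(?:\s*\([^()]*\))? - an optional sequence of zero or more whitespace characters, a (, zero or more characters other than ( and ), and then a )
\s* - zero or more whitespace characters
(\d+(?:[.,]\d+)*) - Group 1: one or more digits, followed by zero or more sequences of a . or , and one or more digits
\s* - zero or more whitespace characters
(m?g\b|) - Group 2: g or mg as a whole word, or an empty string.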
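The pieces above can also be checked one at a time. The following is only a small sketch: the string "5 g, 7 mg, 9 ml" is made up for illustration, and the shortened patterns are not the full expression from the answer. It shows that the unescaped keyword fails to compile with the error from the question, that the escaped keyword matches literally, and that the unit group falls back to an empty string for anything other than g or mg.

```python
import re

name = " Acid. L(+)-lacticum D4"

# Unescaped: "(+)" is read as a group that begins with a quantifier,
# so the pattern cannot even be compiled.
try:
    re.compile(rf"(?:{name})\s*(\d+)")
    raw_error = None
except re.error as exc:
    raw_error = str(exc)
print(raw_error)  # the message starts with "nothing to repeat"

# Escaped: every special character in the keyword is matched literally.
escaped = re.escape(name)
match = re.search(rf"{escaped}\s*(\d+)", " Acid. L(+)-lacticum D4 250 mg")
print(match.group(1))  # 250

# Unit group: g or mg as a whole word, otherwise the empty string.
units = re.findall(r"(\d+)\s*(m?g\b|)", "5 g, 7 mg, 9 ml")
print(units)  # [('5', 'g'), ('7', 'mg'), ('9', '')]
```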
