Regex - distinguishing between very similar substrings using the substring itself

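I know the title is confusing, but this is a hard problem to state simply. Hopefully there is a solution, and my trouble is just a result of being new to the world of regular expressions.

I am trying to parse some text from a chemistry book and convert it to JSON, but I am having trouble splitting the text by its main identifier. All of this is done in a Python 3.10 environment.

Consider the following string: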

0047 Heptasilver nitrate octaoxide
[12258-22-9] Ag NO
7 11
(Ag O ) .AgNO
3 4 2 3
Alone, or Sulfides, or Nonmetals
The crystalline product produced by electrolytic oxidation
of silver nitrate (and possibly as formulated) detonates
feebly at 110°C. Mixtures with phosphorus and sulfur
explode on impact, hydrogen sulfide ignites on contact,
and antimony trisulfide ignites when ground with the salt.
Mellor, 1941, Vol. 3, 483–485
See other SILVER COMPOUNDS
See related METAL NITRATES
0048 Aluminium
[7429-90-5] Al
Al
HCS 1980, 135 (powder)
Finely divided aluminium powder or dust forms highly
explosive dispersions in air [1], and all aspects of pre-
vention of aluminium dust explosions are covered in 2
US National Fire Codes [2]. The effects on the ignition
properties of impurities introduced by recycled metal used
to prepare dust were studied [3]. Pyrophoricity is elimi-
nated by surface coating aluminium powder with poly-
styrene [4]. Explosion hazards involved in arc and flame
spraying of the powder were analyzed and discussed [5],
and the effect of surface oxide layers on flammability
was studied [6]. The causes of a severe explosion in
1983 in a plant producing fine aluminium powder were
analyzed, and improvements in safety practices discussed
[7]. A number of fires and explosions involving aluminiumdust arising from grinding, polishing, and buffing opera-
tions were discussed, and precautions detailed [8] [12]
[13]. Atomized and flake aluminium powders attain
See other METALS
See other REDUCANTS
0049 Aluminium-cobalt alloy (Raney cobalt alloy)
[37271-59-3] 50:50; [12043-56-0] Al Co; Al—Co
5
[73730-53-7] Al Co
2
Al Co
The finely powdered Raney cobalt alloy is a significant
dust explosion hazard.
See DUST EXPLOSION INCIDENTS (reference 22)
0050 Aluminium–copper–zinc alloy
(Devarda’s alloy)
[8049-11-4] Al—Cu—Zn
Al Cu Zn
Silver nitrate: Ammonia, etc.
See DEVARDA’S ALLOY
See other ALLOYS0051 Aluminium amalgam (Aluminium–
mercury alloy)
[12003-69-9] (1:1) Al—Hg
Al Hg
The amalgamated aluminium wool remaining from prepa-
ration of triphenylaluminium will rapidly oxidize and
become hot upon exposure to air. Careful disposal is nec-
essary [1]. Amalgamated aluminium foil may be pyro-
phoric and should be kept moist and used immediately [2].
1. Neely, T. A. et al., Org. Synth., 1965, 45, 109
2. Calder, A. et al., Org. Synth., 1975, 52, 78
See other ALLOYS

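This string contains information about 5 different compounds. Each one is identified by a 4-digit number at the start, followed by the name, and then, on another line, the CAS unique identifier in square brackets.

My approach to dividing it into a separate substring for each object is to recognise the 4-digit number that is always followed by the other identifiers, and to split the text at that point.

I am currently using this regular expression, which does find the 4-digit identifiers: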

\n(\d{4})\s(?:[\s\S]*?)(?:\[\d*?-\d*?-\d*?\]|\[ *?\] [a-zA-Z]*?)

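However, it also matches other 4-digit numbers that are not identifiers, such as dates in the body text, for example the year "1983" in the text of the Aluminium (0048) entry.

I have tried using a negative lookahead with the same expression I use to isolate the 4-digit identifiers, but none of the approaches I tried worked. Now I am not sure whether this is possible at all, or whether I am overcomplicating it.

An alternative would be to split on the CAS number (in square brackets), but that would be worse, because some entries have multiple CAS numbers or even an empty one.

Any suggestions would be greatly appreciated!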

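A few notes about your pattern:

  • You can omit [a-zA-Z]*? at the end of the pattern: it is the last part and it is non-greedy, so it will not match any characters
  • Parts like \d*?-\d*?-\d*? and \[ *?\] do not have to be non-greedy, because the characters being repeated cannot cross the character that follows them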
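Both notes can be checked quickly in Python. This is only a small sketch; the two sample strings are made-up fragments in the style of the CAS lines from the question.

```python
import re

# A lazy quantifier at the very end of a pattern takes as little as possible,
# and nothing after it forces it to grow, so it always matches zero characters.
with_tail = re.compile(r"\[ *?\] [a-zA-Z]*?")
without_tail = re.compile(r"\[ *?\] ")

empty_cas = "[ ] AlCo"  # made-up line with an empty CAS field
tail_match = with_tail.match(empty_cas).group()
no_tail_match = without_tail.match(empty_cas).group()
print(repr(tail_match), repr(no_tail_match))  # the two matches are identical

# Inside the brackets, lazy and greedy digit runs end up with the same match,
# because \d can never consume the "-" or "]" that has to come next.
lazy = re.compile(r"\[\d*?-\d*?-\d*?\]")
greedy = re.compile(r"\[\d+-\d+-\d+\]")

cas_line = "[7429-90-5] Al"
same_match = lazy.match(cas_line).group() == greedy.match(cas_line).group()
print(same_match)
```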

If the match should always start with a newline:

\n(\d{4}).*(?:\n\(.*)*\n\[(?: *|\d+-\d+-\d+)]

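Explanation

  • \n Match a newline
  • (\d{4}) Capture 4 digits in group 1
  • .* Match the rest of the line
  • (?:\n\(.*)* Optionally repeat matching a newline and ( followed by the rest of the line
  • \n Match a newline
  • \[(?: *|\d+-\d+-\d+)] Match [...] containing either only spaces or digits separated by hyphens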

See a regex demo
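To actually split the text into one chunk per compound, one option is to use the start position of each match as a cut point. The following is a rough sketch rather than a finished parser: split_entries is a hypothetical helper name, and the sample is a shortened, simplified excerpt of the text in the question. The leading newline matters, because this pattern requires every identifier to be preceded by one.

```python
import re

# The newline-anchored pattern from above.
ENTRY_START = re.compile(r"\n(\d{4}).*(?:\n\(.*)*\n\[(?: *|\d+-\d+-\d+)]")

def split_entries(text):
    """Map each 4-digit identifier to the text of its entry (hypothetical helper)."""
    matches = list(ENTRY_START.finditer(text))
    entries = {}
    # Each entry runs from the start of its own match to the start of the next one.
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        entries[current.group(1)] = text[current.start():end].strip()
    return entries

# Shortened excerpt; the triple-quoted string begins with a newline.
sample = """
0048 Aluminium
[7429-90-5] Al
The causes of a severe explosion in
1983 in a plant producing fine aluminium powder were
analyzed [7].
See other METALS
0050 Aluminium–copper–zinc alloy
(Devarda’s alloy)
[8049-11-4]
See other ALLOYS
0051 Aluminium amalgam
[12003-69-9] (1:1)
See other ALLOYS
"""

entries = split_entries(sample)
for identifier, body in entries.items():
    print(identifier, "->", body.splitlines()[0])
```

The line starting with 1983 is not treated as a new entry, because the line after it does not start with a square bracket. The resulting dictionary can then be passed to json.dumps if JSON output is the goal.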

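If the square brackets are directly on the next line (with the multiline flag enabled, so that ^ matches at the start of every line):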

^(\d{4}).*\n\[(?: *|\d+-\d+-\d+)]

See another regex demo
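In Python, the ^ version only behaves this way when re.MULTILINE is passed; without that flag, ^ matches only at the very start of the string. A minimal sketch, again using a shortened and simplified excerpt of the question's text as sample data:

```python
import re

PATTERN = r"^(\d{4}).*\n\[(?: *|\d+-\d+-\d+)]"

sample = (
    "0048 Aluminium\n"
    "[7429-90-5] Al\n"
    "1983 in a plant producing fine aluminium powder were\n"
    "See other METALS\n"
    "0051 Aluminium amalgam\n"
    "[12003-69-9] (1:1)\n"
)

# With re.MULTILINE, ^ can match at the start of every line.
ids = re.findall(PATTERN, sample, flags=re.MULTILINE)

# Without the flag, only an identifier at the very start of the string is found.
ids_no_flag = re.findall(PATTERN, sample)

print(ids)
print(ids_no_flag)
```

Note that this shorter pattern expects the bracketed CAS line immediately after the identifier line, so an entry whose name continues onto a second line would be skipped by it.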