从字符串列表中删除所有转义序列

Remove all escape sequences from list of strings

我正在玩弄 pokebase 一个 python pokeAPI 的包装器,一些 api 响应包含 \n \x0c 等。最后我不需要它们,但我不想遍历每个字母以删除它们,并且 .replace 似乎也不可持续(也因为我认为这会导致问题)。

这是字符串示例列表:https://pastebin.com/SbhR50br

["The female's horn\ndevelops slowly.\nPrefers physical\x0cattacks such as\nclawing and\nbiting.", 'When resting deep\nin its burrow, its\nthorns always\x0cretract.\nThis is proof that\nit is relaxed.', 'When feeding its\nyoung, it first\nchews and tender\xad\x0cizes the food,\nthen spits it out\nfor the offspring.', 'It has a calm and\ncaring nature.\nBecause its horn\x0cgrows slowly, it\nprefers not to\nfight.', 'It has a docile\nnature. If it is\nthreatened with\x0cattack, it raises\nthe barbs that are\nall over its body.', 'When NIDORINA are with their friends or\nfamily, they keep their barbs tucked\naway to prevent hurting each other.\x0cThis POKéMON appears to become\nnervous if separated from the others.', 'When it is with its friends or\nfamily, its barbs are tucked away to\nprevent injury. It appears to become\nnervous if separated from the others.', 'The female has a gentle temperament.\nIt emits ultrasonic cries that have the\npower to befuddle foes.', 'The female’s horns develop slowly.\nPrefers physical attacks such as clawing\nand biting.', 'When it senses danger, it raises\nall the barbs on its body. These\nbarbs grow slower than NIDORINO’s.', 'When feeding its young, it first\nchews the food into a paste, then\nspits it out for the offspring.', 'It has a calm and caring nature.\nBecause its horn grows slowly, it\nprefers not to fight.', 'When it senses danger, it raises\nall the barbs on its body. These\nbarbs grow slower than Nidorino’s.', 'The female has a gentle temperament.\nIt emits ultrasonic cries that have the power\nto befuddle foes.', 'When feeding its young, it first chews the food into\na paste, then spits it out for the offspring.', 'When Nidorina are with their friends or family, they keep their\nbarbs tucked away to prevent hurting each other.\nThis Pokémon appears to become nervous if separated from\nthe others.', 'When Nidorina are with their friends or family, they keep\ntheir barbs tucked away to prevent hurting each other.\nThis Pokémon appears to become nervous if separated\nfrom the others.']
flavor = random.choice([listofstringshere])
#remove \ stuff from flavor here!
print(flavor)

我想我可以用 regex 做点什么,但这只是猜测。

由于您的原始文本数据具有 'special unicode characters'(实际上无法打印),您很可能面临后续问题。

例如,

\xad 是来自 unicode utf-8 table conversion. and they are not needed in your case I belive.

软连字符

These are characters that mark places where a word could be split when fitting lines to a page. The idea is that the soft hyphen is invisible if the word doesn't need to be split, but printed the same as a U+2010 normal hyphen if it does.

Since you don't care about rendering this text in a book with nicely flowing text, you're never going to hyphenate anything, so you just want to remove these characters.

\x0c 是换页或 page break

\n 是新行,在您的情况下,我也认为与使文本更漂亮有关,您也不关心它。

所以完整的解决方案是,使用 re.sub (substitute/replace):

  1. 删除\xad\xad\x0c
  2. \x0c\n
  3. 上放置' '个空格

import re

egstrings = ["The female's horn\ndevelops slowly.\nPrefers physical\x0cattacks such as\nclawing and\nbiting.", 
           'When resting deep\nin its burrow, its\nthorns always\x0cretract.\nThis is proof that\nit is relaxed.',
            "When feeding its\nyoung, it first\nchews and tender\xad\x0cizes the food,\nthen spits it out\nfor the offspring."]

for flavor in egstrings:
    flavor = re.sub('\xad(\x0c)*',  '', flavor) # replaces \xad or \xad\x0c by nothing
    print(re.sub('[\n-\x0c]', ' ', flavor)) # replaces \n and \x0c by space

The female's horn develops slowly. Prefers physical attacks such as clawing and biting.

When resting deep in its burrow, its thorns always retract. This is proof that it is relaxed.

When feeding its young, it first chews and tenderizes the food, then spits it out for the offspring.

从您的示例字符串看来,您似乎并不想真正删除不可打印的字符,而是将它们替换为空格,在这种情况下,您可以将 re.sub 与匹配字符的模式一起使用一组不可打印的字符:

import re
flavor = re.sub(r'[\x00-\x1f]+', ' ', flavor)