How can I prevent some words from being run together when replacing unwanted characters?


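The infant Hans Patrick received his mammarial balm in the usual way, and not through the instrumentality of a patent bottle. One of his caprices, when yet a child, was to scream with all the force of his little lungs, when he was severely chastised by his parents. This singular habit was but a foreshadowing of that genius which has rendered him so eminent in his maturity.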


The infant Hans Patrick received his mammarial balm in the usual way and not through the instrumentality of a patent bottle One of his caprices when yet a child was to scream with all the force of his little lungs when he was severely chastised by his parents This singular habit was but a foreshadowing of that genius which has rendered him so eminent in his maturity

This way, I can split the text into individual words at the spaces, with no punctuation attached to the ends of the words.


Regex onlyAlphanumericSpaceApostropheAndHyphen = new Regex("[^a-zA-Z0-9 '-]");
. . .
doc1StrArray = File.ReadAllLines(sDoc1Path, Encoding.UTF8);
. . .
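foreach (string line in doc1StrArray)
{
    trimmedLine = line;
    // Turn em dashes into spaces so the words on either side stay separate
    trimmedLine = trimmedLine.Replace("—", " ");
    // Remove everything except letters, digits, spaces, apostrophes and hyphens
    trimmedLine = onlyAlphanumericSpaceApostropheAndHyphen.Replace(trimmedLine, "");
    string[] subWords = trimmedLine.Split();
}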

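...but it doesn't work in all cases. I don't understand why it usually works but sometimes removes the space characters, running two words together, so that after stepping through the regex Replace line above, the line ends up as: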

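The infant Hans Patrick received his mammarial balm in theusual way and not through the instrumentality of a patentbottle One of his caprices when yet a child was to screamwith all the force of his little lungs when he was severelychastised by his parents This singular habit was but aforeshadowing of that genius which has rendered him soeminent in his maturity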

So some words are run together into one word (with no space between them), for example "theusual".



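The gaps between those words appear not to be space characters. Here is how the text looks in a fixed-width font when it is wrapped at the first problem spot (the usual):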

The infant Hans Patrick received his mammarial balm in the 
usual way, and not through the instrumentality of a patent 
bottle. One of his caprices, when yet a child, was to scream 
with all the force of his little lungs, when he was severely 
chastised by his parents. This singular habit was but a 
foreshadowing of that genius which has rendered him so 
eminent in his maturity.

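This shows that every problem occurs at a line break, so those characters look like line-break characters rather than ordinary spaces. You can fix this by changing the space in the regex to \s, which preserves all forms of whitespace (note that the \ must be escaped in a regular C# string literal):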

Regex onlyAlphanumericSpaceApostropheAndHyphen = new Regex("[^a-zA-Z0-9\\s'-]");
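
To check the diagnosis and the fix together, here is a minimal sketch. The class WordSplitter and the methods SplitWords and OddWhitespace are hypothetical names made up for this example, and U+2028 (LINE SEPARATOR) is only an assumed stand-in for whatever character the file really contains. It uses a verbatim string (@"...") as an alternative to doubling the backslash.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordSplitter
{
    // Verbatim string, so \s reaches the regex engine without extra escaping.
    private static readonly Regex Unwanted = new Regex(@"[^a-zA-Z0-9\s'-]");

    // Hypothetical helper: strips unwanted characters but keeps every kind
    // of whitespace, then splits on whitespace.
    public static string[] SplitWords(string line)
    {
        string cleaned = line.Replace("—", " ");
        cleaned = Unwanted.Replace(cleaned, "");
        // A null separator array makes Split break on any whitespace character.
        return cleaned.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }

    // Hypothetical diagnostic: lists the code points of whitespace characters
    // in the line that are not the ordinary space (U+0020).
    public static string[] OddWhitespace(string line)
    {
        return line.Where(c => char.IsWhiteSpace(c) && c != ' ')
                   .Select(c => "U+" + ((int)c).ToString("X4"))
                   .Distinct()
                   .ToArray();
    }

    public static void Main()
    {
        // Assumption: \u2028 stands in for the unknown character in the real file.
        string sample = "in the\u2028usual way, and not through";
        Console.WriteLine(string.Join(",", OddWhitespace(sample))); // U+2028
        Console.WriteLine(string.Join("|", SplitWords(sample)));    // in|the|usual|way|and|not|through
    }
}
```

Running OddWhitespace on a real line that misbehaves should show exactly which character sits between the joined words, which is worth confirming before relying on \s.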