How can I prevent some words from being run together when replacing unwanted characters?

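I want to remove all characters such as commas, periods, quotation marks, and so on, so that a line like this:

The infant Hans Patrick received his mammarial balm in the usual way, and not through the instrumentality of a patent bottle. One of his caprices, when yet a child, was to scream with all the force of his little lungs, when he was severely chastised by his parents. This singular habit was but a foreshadowing of that genius which has rendered him so eminent in his maturity.

...would be converted to: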

The infant Hans Patrick received his mammarial balm in the usual way and not through the instrumentality of a patent bottle One of his caprices when yet a child was to scream with all the force of his little lungs when he was severely chastised by his parents This singular habit was but a foreshadowing of that genius which has rendered him so eminent in his maturity

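That way, I can split the line into individual words on the spaces, with no punctuation attached to the ends of the words.

I am trying to do this with the following code: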

Regex onlyAlphanumericSpaceApostropheAndHyphen = new Regex("[^a-zA-Z0-9 '-]");
. . .
doc1StrArray = File.ReadAllLines(sDoc1Path, Encoding.UTF8);
. . .
foreach (string line in doc1StrArray) 
{
    trimmedLine = line;
    trimmedLine = trimmedLine.Replace("—", " ");
    trimmedLine = onlyAlphanumericSpaceApostropheAndHyphen.Replace(trimmedLine, "");
    string[] subWords = trimmedLine.Split();

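...but it does not work in all cases. I don't understand why it usually works but sometimes removes space characters, running two words together, so that after stepping through the regex replacement line above, the line ends up as:

The infant Hans Patrick received his mammarial balm in theusual way and not through the instrumentality of a patentbottle One of his caprices when yet a child was to screamwith all the force of his little lungs when he was severelychastised by his parents This singular habit was but aforeshadowing of that genius which has rendered him soeminent in his maturity

So some words are run together into one word (with no space between them):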

theusual
patentbottle
screamwith
severelychastised
aforeshadowing
soeminent

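Why is this happening, and how can I prevent it?

The "spaces" between those words appear not to be space characters. Here is how the given text looks in a fixed-width font when it is wrapped at the first problem spot (the usual):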

The infant Hans Patrick received his mammarial balm in the 
usual way, and not through the instrumentality of a patent 
bottle. One of his caprices, when yet a child, was to scream 
with all the force of his little lungs, when he was severely 
chastised by his parents. This singular habit was but a 
foreshadowing of that genius which has rendered him so 
eminent in his maturity.

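This shows that every problem occurs at a line break, so the characters in question look like newlines rather than spaces. You can fix this by changing the space in the regex to \s, which preserves all forms of whitespace (note that the \ must be escaped in a regular C# string literal, or the pattern written as a verbatim string with @):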

Regex onlyAlphanumericSpaceApostropheAndHyphen = new Regex("[^a-zA-Z0-9\\s'-]");
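
For completeness, here is a minimal sketch of how the pieces might fit together. The class and method names (WordSplitter, SplitWords, OddWhitespace) are my own and purely illustrative; they are not from the question. SplitWords applies the \s version of the pattern and then splits on any whitespace, and OddWhitespace is a small diagnostic that reports which whitespace characters other than a plain space a line contains, which is one way to check what actually sits between "the" and "usual".

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordSplitter
{
    // Same character class as above, but \s keeps every whitespace character
    // (tab, newline, non-breaking space, ...) rather than only U+0020.
    // The verbatim string (@"...") avoids having to double the backslash.
    private static readonly Regex Unwanted = new Regex(@"[^a-zA-Z0-9\s'-]");

    // Hypothetical helper: strips punctuation and returns the individual words.
    public static string[] SplitWords(string line)
    {
        // Treat an em dash (U+2014) as a word separator, as in the question.
        string cleaned = line.Replace("\u2014", " ");
        cleaned = Unwanted.Replace(cleaned, "");

        // A null separator array makes Split break on any whitespace character;
        // RemoveEmptyEntries drops the empty strings left by repeated whitespace.
        return cleaned.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }

    // Hypothetical diagnostic: code points of whitespace that is not a plain space.
    public static string[] OddWhitespace(string line)
    {
        return line.Where(c => char.IsWhiteSpace(c) && c != ' ')
                   .Select(c => "U+" + ((int)c).ToString("X4"))
                   .ToArray();
    }
}
```

With this, WordSplitter.SplitWords("in the\u00A0usual way,") yields in, the, usual, way, whereas the original pattern would have deleted the non-breaking space and produced theusual. Calling OddWhitespace on a problem line from the file should tell you which character is really there.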