How can I prevent some words from being run together when replacing unwanted characters?
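I want to remove all characters such as commas, periods, and quotation marks, so that a passage like this:
The infant Hans Patrick received his mammarial balm in the usual way, and not through the instrumentality of a patent bottle. One of his caprices, when yet a child, was to scream with all the force of his little lungs, when he was severely chastised by his parents. This singular habit was but a foreshadowing of that genius which has rendered him so eminent in his maturity.
...would be transformed into: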
The infant Hans Patrick received his mammarial balm in the usual way and not through the instrumentality of a patent bottle One of his caprices when yet a child was to scream with all the force of his little lungs when he was severely chastised by his parents This singular habit was but a foreshadowing of that genius which has rendered him so eminent in his maturity
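That way, I can split out the individual words at the spaces, without punctuation attached to the ends of words.
I'm trying to accomplish this with the following code: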
Regex onlyAlphanumericSpaceApostropheAndHyphen = new Regex("[^a-zA-Z0-9 '-]");
. . .
doc1StrArray = File.ReadAllLines(sDoc1Path, Encoding.UTF8);
. . .
foreach (string line in doc1StrArray)
{
    trimmedLine = line;
    trimmedLine = trimmedLine.Replace("—", " ");
    trimmedLine = onlyAlphanumericSpaceApostropheAndHyphen.Replace(trimmedLine, "");
    string[] subWords = trimmedLine.Split();
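...but it does not work in all cases, and I don't understand why it usually works but sometimes removes space characters, running two words together. After stepping through the regex Replace call above, the line ends up as:
The infant Hans Patrick received his mammarial balm in theusual way and not through the instrumentality of a patentbottle One of his caprices when yet a child was to screamwith all the force of his little lungs when he was severelychastised by his parents This singular habit was but aforeshadowing of that genius which has rendered him soeminent in his maturity
So some words get combined into a single word (with no space between them):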
theusual
patentbottle
screamwith
severelychastised
aforeshadowing
soeminent
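Why is this happening, and how can I prevent it?
The whitespace between those words does not appear to be a space character. Look at the given text in a fixed-width font, wrapped at the first problem (the usual):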
The infant Hans Patrick received his mammarial balm in the
usual way, and not through the instrumentality of a patent
bottle. One of his caprices, when yet a child, was to scream
with all the force of his little lungs, when he was severely
chastised by his parents. This singular habit was but a
foreshadowing of that genius which has rendered him so
eminent in his maturity.
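Laid out like this, every problem falls at a line break, so the characters there appear to be line-break characters rather than spaces. You can fix this by changing the space in your regex to \s, which preserves all forms of whitespace (note that the \ must be escaped in a regular C# string literal):
Regex onlyAlphanumericSpaceApostropheAndHyphen = new Regex("[^a-zA-Z0-9\\s'-]");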
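To see the effect in isolation, here is a minimal sketch of the same cleanup. The class name WordSplitSketch, the method SplitWords, and the sample string are made up for illustration, and the vertical tab (\v) is only a stand-in for whatever non-space whitespace character the file actually contains. A verbatim string (@"...") is used so the backslash does not need doubling.

```csharp
using System;
using System.Text.RegularExpressions;

static class WordSplitSketch
{
    // Same character class as in the question, with the literal space replaced by \s
    // so that every whitespace character survives, not only U+0020.
    static readonly Regex UnwantedChars = new Regex(@"[^a-zA-Z0-9\s'-]");

    public static string[] SplitWords(string text)
    {
        // Turn em dashes (U+2014) into separators, as the original code does.
        string cleaned = text.Replace("\u2014", " ");

        // Strip punctuation; whitespace of any kind is kept.
        cleaned = UnwantedChars.Replace(cleaned, "");

        // A null separator array means "split on any whitespace";
        // RemoveEmptyEntries drops the empty strings produced by runs of whitespace.
        return cleaned.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }

    static void Main()
    {
        // Assumed input: \v stands in for the unknown separator at the wrapped line ends.
        string sample = "balm in the\vusual way, and not through a patent\vbottle.";
        Console.WriteLine(string.Join("|", SplitWords(sample)));
        // prints: balm|in|the|usual|way|and|not|through|a|patent|bottle
    }
}
```

With the original pattern (a literal space instead of \s), the \v characters would be removed along with the punctuation, giving "theusual" and "patentbottle" as in the question.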