How can I prevent apostrophes from being stripped out only from the midst of strings?

Question

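I need to retain words that contain only alphanumeric characters, hyphens, and apostrophes. At the moment I have everything working except the apostrophes. The apostrophes in words such as hadn't, didn't, and ain't are being stripped out by the following code: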

Regex onlyAlphanumericAndDash = new Regex("[^a-zA-Z0-9 -]");
. . .
foreach (string line in doc1StrArray) // doc1StrArray populated in FindAndStorePhrasesFoundInBothDocs()
{
    trimmedLine = line;
    // first replace the "long dash" with a space (otherwise the dashed words run together:
    // "consecrated—we" becomes "consecratedwe"
    trimmedLine = trimmedLine.Replace("—", " ");
    trimmedLine = onlyAlphanumericAndDash.Replace(trimmedLine, "");
    string[] subLines = trimmedLine.Split();
    foreach (string whirred in subLines)
    {
        if (String.IsNullOrEmpty(whirred)) continue;
        _whirred = whirred.Trim();
        iWordsInDoc1++;
        slAllDoc1Words.Add(_whirred);
        if (IgnoreWord(_whirred)) continue;
        InsertIntoWordStatsTable(_whirred, 1, 0);
    }
}

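I need to keep apostrophes, but only when they are inside a word. In other words, an apostrophe at the end of a word should be stripped out, and likewise one at the beginning (where it is acting as a single quote); but an apostrophe within a word, that is, one indicating a contraction such as "hadn't", should be retained.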

What do I need to add to the Regex, or how do I need to modify it, to accomplish this?

Answer 1

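I'm a little confused by the variable name subLines (which suggests lines of text) for the result of Split(): a parameterless Split() splits on whitespace. So does subLines contain words or lines? I assume that, despite the name, it contains words, in which case you can change the regular expression to: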

[^a-zA-Z0-9 '-]

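This keeps all apostrophes. Note that I put the apostrophe before the - rather than after it, so there is no risk of defining a range from (space) to (apostrophe), the way A-Z defines a range. Keep this in mind in case you have already tried it the other way round. When you use - in a character class and want it to be a literal character rather than the range operator, place it first in the class (directly after the ^, if there is one) or last.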
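
To make the range pitfall concrete, here is a minimal sketch comparing the two placements. The class name HyphenPlacementDemo, the Clean helper, and the sample string are made up for illustration and do not come from the question's code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HyphenPlacementDemo
{
    // Hyphen is last in the class, so it is a literal hyphen.
    // The apostrophe is listed just before it.
    public static readonly Regex ApostropheBeforeHyphen =
        new Regex("[^a-zA-Z0-9 '-]");

    // Hyphen sits between the space and the apostrophe, so it forms a range
    // from U+0020 (space) to U+0027 (apostrophe). That range also covers
    // ! " # $ % & and the class no longer contains a literal hyphen.
    public static readonly Regex ApostropheAfterHyphen =
        new Regex("[^a-zA-Z0-9 -']");

    // Hypothetical helper: remove every character the class does not allow.
    public static string Clean(Regex rx, string input)
    {
        return rx.Replace(input, "");
    }

    public static void Main()
    {
        string sample = "well-known #1 isn't!";
        Console.WriteLine(Clean(ApostropheBeforeHyphen, sample)); // well-known 1 isn't
        Console.WriteLine(Clean(ApostropheAfterHyphen, sample));  // wellknown #1 isn't!
    }
}
```

With the hyphen in the middle, the hyphen in "well-known" is stripped while "#" and "!" survive, which is the opposite of what the question wants.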

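You can then use whirred.Trim('\'') to remove apostrophes from both ends of each word. Calling whirred.Trim() to remove whitespace serves no purpose, because the string has already been split on whitespace, so there won't be any whitespace left in it. Trim() trims, and Split() splits on, any character that Char.IsWhiteSpace(c) treats as whitespace.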
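
Putting the two suggestions together, a rough sketch of the loop body might look like the following. ExtractWords and the class name are hypothetical stand-ins for the question's loop, and the database and word-list calls are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ApostropheTrimDemo
{
    // The character class from this answer: keep letters, digits, spaces,
    // apostrophes, and (literal) hyphens; strip everything else.
    private static readonly Regex NotWordChars = new Regex("[^a-zA-Z0-9 '-]");

    // Hypothetical helper, not part of the question's code.
    public static List<string> ExtractWords(string line)
    {
        var words = new List<string>();
        // As in the question: turn the long dash into a space first.
        string cleaned = NotWordChars.Replace(line.Replace("—", " "), "");
        foreach (string raw in cleaned.Split())
        {
            // Remove leading and trailing apostrophes only; inner ones stay.
            string word = raw.Trim('\'');
            if (word.Length == 0) continue; // empty piece or bare quote marks
            words.Add(word);
        }
        return words;
    }

    public static void Main()
    {
        string line = "'didn't' Shannons' consecrated—we, ain't";
        // Prints: didn't|Shannons|consecrated|we|ain't
        Console.WriteLine(string.Join("|", ExtractWords(line)));
    }
}
```

The length check after trimming replaces the String.IsNullOrEmpty test on the raw piece, so a piece consisting solely of quote marks is skipped as well.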

Answer 2

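Update: after re-reading the question, it became apparent that all the splitting and trimming isn't needed at all.
It can be done with a single regular expression that matches exactly what is wanted.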

(?:(?![^a-zA-Z0-9'-]+|(?<![a-zA-Z0-9])'|'(?![a-zA-Z0-9])).)+

See https://regex101.com/r/fKtQ8v/1

C# code example:

Regex RxWords = new Regex(@"(?:(?![^a-zA-Z0-9'-]+|(?<![a-zA-Z0-9])'|'(?![a-zA-Z0-9])).)+");
string[] doc1StrArray = { "didn't Shannons' consecrated—we, l'k'" };
int iWordsInDoc1 = 0;
string _whirred;

foreach ( string lin in doc1StrArray )
{
    Match M = RxWords.Match( lin );
    while ( M.Success )
    {
        iWordsInDoc1++;
        _whirred = M.Value;
        M = M.NextMatch();

        Console.WriteLine( "{0}", _whirred );
        //  slAllDoc1Words.Add(_whirred);
        //  if (IgnoreWord(_whirred)) continue;
        //  InsertIntoWordStatsTable(_whirred, 1, 0);
    }
}

Output:

didn't
Shannons
consecrated
we
l'k

Answer 3

The following strips out the apostrophes as required:

System.Text.RegularExpressions.Regex.Replace("'this isn't a' test'", @"'(?=(\s+|$))|(?<=(\s+|^))'", "")

The output is:

this isn't a test

c#

regex

apostrophe