Regex pattern isn't matching certain show titles
I'm using a C# regex to match strings and return the data parsed from them, but it is returning unreliable results.
The pattern I'm using is:
Regex r=new Regex(
@"(.*?)S?(\d{1,2})E?(\d{1,2})(.*)|(.*?)S?(\d{1,2})E?(\d{1,2})",
RegexOptions.IgnoreCase
);
Here are a few test cases that fail:
Ellen 2015.05.22 Joseph Gordon Levitt [REPOST]
The Soup 2015.05.22 [mp4]
Big Brother UK Live From The House (May 22, 2015)
Should return
- Show name (e.g. Ellen)
- Date (e.g. 2015.05.22)
- Extra info (e.g. Joseph Gordon Levitt [REPOST])
Alaskan Bush People S02 Wild Times Special
Should return
- Show name (e.g. Alaskan Bush People)
- Season (e.g. 02)
- Extra info (e.g. Wild Times Special)
500 Questions S01E03
Should return
- Show name (e.g. 500 Questions)
- Season (e.g. 01)
- Episode (e.g. 03)
Examples that work and return the correct data:
Boyster S01E13 – E14
Mysteries at the Museum S08E08
Mysteries at the National Parks S01E07 – E08
The Last Days Of… S01E06
Born Naughty? S01E02
Have I Got News For You S49E07
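It looks like the pattern ignores the S and E when they are not found, and then fills those slots with the first groups of matching digits.
Clearly this pattern needs more work to handle the different strings above. Any help with this would be much appreciated.
Try this: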
(?<name>.*?)(?:S(?<season>\d{1,2}))?(?:E(?<episode>\d{1,2}))?(?<date>\d{4}\.\d{2}\.\d{2})(?<extra>.*)?
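Divide and conquer
You're trying to parse too much with one simple expression, and that won't work well. The best approach in this case is to split the problem into smaller problems and solve each one separately. We can then combine everything into a single pattern afterwards.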
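To see concretely why the single expression misfires, here is a minimal sketch that runs the pattern from the question against one of the failing titles. The variable names are mine, and the values in the comments are what I expect from tracing the pattern by hand: because S and E are both optional, the two digit groups simply split the year between them.

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern from the question, unchanged.
var original = new Regex(
    @"(.*?)S?(\d{1,2})E?(\d{1,2})(.*)|(.*?)S?(\d{1,2})E?(\d{1,2})",
    RegexOptions.IgnoreCase);

var m = original.Match("Ellen 2015.05.22 Joseph Gordon Levitt [REPOST]");

// Groups 2 and 3 are the intended season and episode slots.
Console.WriteLine("name:    '{0}'", m.Groups[1].Value); // 'Ellen '
Console.WriteLine("season:  '{0}'", m.Groups[2].Value); // '20'
Console.WriteLine("episode: '{0}'", m.Groups[3].Value); // '15'
```

Nothing in the expression ties the digits to an actual S or E marker, which is why each kind of data gets its own pattern.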
Let's write some patterns for the data you want to extract.
Season/episode:
S\d+(?:E\d+(?:\s*\p{Pd}\s*E\d+)?)?
I used \p{Pd} instead of - to accommodate any type of dash.
Date:
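As a quick illustration of what \p{Pd} buys you, the sketch below tests just the episode-range part of the pattern against a few separators. The anchored pattern and the sample strings are made up for this example; the dash characters are written as escapes so the source file encoding does not matter.

```csharp
using System;
using System.Text.RegularExpressions;

// Episode range only, anchored so the whole sample string has to match.
var range = new Regex(@"^E\d+\s*\p{Pd}\s*E\d+$");

Console.WriteLine(range.IsMatch("E13 - E14"));      // True: hyphen-minus (U+002D)
Console.WriteLine(range.IsMatch("E13 \u2013 E14")); // True: en dash (U+2013)
Console.WriteLine(range.IsMatch("E13 \u2014 E14")); // True: em dash (U+2014)
Console.WriteLine(range.IsMatch("E13 ~ E14"));      // False: tilde is not a dash
```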
\d{4}\.\d{1,2}\.\d{1,2}
Or...
(?i:January|February|March|April|May|June|July|August|September|October|November|December)
\s*\d{1,2},\s*\d{4}
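The two date patterns can be tried on their own before they go into the big expression. This is only a sketch: joining them with a plain alternation is my choice for the demonstration, and the inputs are two of the titles from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Numeric date first, then the month-name form.
var date = new Regex(
    @"\d{4}\.\d{1,2}\.\d{1,2}" +
    @"|(?i:January|February|March|April|May|June|July|August|September|October|November|December)\s*\d{1,2},\s*\d{4}");

var numeric = date.Match("The Soup 2015.05.22 [mp4]").Value;
var named = date.Match("Big Brother UK Live From The House (May 22, 2015)").Value;

Console.WriteLine(numeric); // 2015.05.22
Console.WriteLine(named);   // May 22, 2015
```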
A simple pattern for the extra info:
.*?
(Yes, that's pretty generic.)
We can also detect the show format like this:
\[.*?\]
You can add other parts as needed.
Now we can put everything together into one pattern, using named groups to extract the data:
^\s*
(?<name>.*?)
(?<info> \s+ (?:
(?<episode>S\d+(?:E\d+(?:\s*\p{Pd}\s*E\d+)?)?)
|
(?<date>\d{4}\.\d{1,2}\.\d{1,2})
|
\(?(?<date>(?i:January|February|March|April|May|June|July|August|September|October|November|December)\s*\d{1,2},\s*\d{4})\)?
|
\[(?<format>.*?)\]
|
(?<extra>(?(info)|(?!)).*?)
))*
\s*$
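Just ignore the info group (it is used by the conditional inside extra, so that extra won't consume text that should be part of the show name). You can get multiple extra captures, so just join them, putting a space between each part.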
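Joining the extra captures could look like the sketch below. JoinExtras is a hypothetical helper name of mine, and the combined pattern from above is repeated only so that the snippet runs on its own.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Same combined pattern as above, repeated to keep the sketch self-contained.
var re = new Regex(@"
    ^\s*
    (?<name>.*?)
    (?<info> \s+ (?:
        (?<episode>S\d+(?:E\d+(?:\s*\p{Pd}\s*E\d+)?)?)
        |
        (?<date>\d{4}\.\d{1,2}\.\d{1,2})
        |
        \(?(?<date>(?i:January|February|March|April|May|June|July|August|September|October|November|December)\s*\d{1,2},\s*\d{4})\)?
        |
        \[(?<format>.*?)\]
        |
        (?<extra>(?(info)|(?!)).*?)
    ))*
    \s*$
", RegexOptions.IgnorePatternWhitespace);

var special = re.Match("Alaskan Bush People S02 Wild Times Special");
var guest = re.Match("Ellen 2015.05.22 Joseph Gordon Levitt [REPOST]");

Console.WriteLine(JoinExtras(special)); // Wild Times Special
Console.WriteLine(JoinExtras(guest));   // Joseph Gordon Levitt

// Hypothetical helper: joins every capture of the "extra" group with one space.
// Returns an empty string when the group captured nothing.
static string JoinExtras(Match match) =>
    string.Join(" ", match.Groups["extra"].Captures.Cast<Capture>().Select(c => c.Value));
```

Cast<Capture>() is used so the same code also compiles on older frameworks where CaptureCollection is not generic.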
Sample code:
using System;
using System.Text.RegularExpressions;

var inputData = new[]
{
    "Boyster S01E13 – E14",
    "Mysteries at the Museum S08E08",
    "Mysteries at the National Parks S01E07 – E08",
    "The Last Days Of… S01E06",
    "Born Naughty? S01E02",
    "Have I Got News For You S49E07",
    "Ellen 2015.05.22 Joseph Gordon Levitt [REPOST]",
    "The Soup 2015.05.22 [mp4]",
    "Big Brother UK Live From The House (May 22, 2015)",
    "Alaskan Bush People S02 Wild Times Special",
    "500 Questions S01E03"
};

// IgnorePatternWhitespace lets the pattern span several lines.
var re = new Regex(@"
    ^\s*
    (?<name>.*?)
    (?<info> \s+ (?:
        (?<episode>S\d+(?:E\d+(?:\s*\p{Pd}\s*E\d+)?)?)
        |
        (?<date>\d{4}\.\d{1,2}\.\d{1,2})
        |
        \(?(?<date>(?i:January|February|March|April|May|June|July|August|September|October|November|December)\s*\d{1,2},\s*\d{4})\)?
        |
        \[(?<format>.*?)\]
        |
        (?<extra>(?(info)|(?!)).*?)
    ))*
    \s*$
", RegexOptions.IgnorePatternWhitespace);

foreach (var input in inputData)
{
    Console.WriteLine();
    Console.WriteLine("--- {0} ---", input);

    var match = re.Match(input);
    if (!match.Success)
    {
        Console.WriteLine("FAIL");
        continue;
    }

    foreach (var groupName in re.GetGroupNames())
    {
        // Skip the whole-match group and the helper group.
        if (groupName == "0" || groupName == "info")
            continue;

        var group = match.Groups[groupName];
        if (!group.Success)
            continue;

        // A group can capture more than once (e.g. extra), so print every capture.
        foreach (Capture capture in group.Captures)
            Console.WriteLine("{0}: '{1}'", groupName, capture.Value);
    }
}
This outputs...
--- Boyster S01E13 - E14 ---
name: 'Boyster'
episode: 'S01E13 - E14'
--- Mysteries at the Museum S08E08 ---
name: 'Mysteries at the Museum'
episode: 'S08E08'
--- Mysteries at the National Parks S01E07 - E08 ---
name: 'Mysteries at the National Parks'
episode: 'S01E07 - E08'
--- The Last Days Of… S01E06 ---
name: 'The Last Days Of…'
episode: 'S01E06'
--- Born Naughty? S01E02 ---
name: 'Born Naughty?'
episode: 'S01E02'
--- Have I Got News For You S49E07 ---
name: 'Have I Got News For You'
episode: 'S49E07'
--- Ellen 2015.05.22 Joseph Gordon Levitt [REPOST] ---
name: 'Ellen'
date: '2015.05.22'
format: 'REPOST'
extra: 'Joseph'
extra: 'Gordon'
extra: 'Levitt'
--- The Soup 2015.05.22 [mp4] ---
name: 'The Soup'
date: '2015.05.22'
format: 'mp4'
--- Big Brother UK Live From The House (May 22, 2015) ---
name: 'Big Brother UK Live From The House'
date: 'May 22, 2015'
--- Alaskan Bush People S02 Wild Times Special ---
name: 'Alaskan Bush People'
episode: 'S02'
extra: 'Wild'
extra: 'Times'
extra: 'Special'
--- 500 Questions S01E03 ---
name: '500 Questions'
episode: 'S01E03'