ValueError: too many values to unpack (expected 2) when I try to extract only 2 substrings with a regex pattern

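This is the code. The part that fails is where the substrings are extracted, after the input has been checked against the regex pattern.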

def name_and_img_identificator(input_text, text):
    input_text = re.sub(r"([^n\u0300-\u036f]|n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+", r"", normalize("NFD", input_text), 0, re.I)
    input_text = normalize( 'NFC', input_text) # -> NFC
    input_text_to_check = input_text.lower()  # Convert everything to lowercase

    
    #regex_patron_01 = r"\s*\¿?(?:dime los|dime las|dime unos|dime unas|dime|di|cuales son los|cuales son las|cuales son|cuales|que animes|que|top)\s*((?:\w+\s*)+)\s*(?:de series anime|de anime series|de animes|de anime|animes|anime)\s*(?:similares al|similares a|similar al|similar a|parecidos al|parecidos a|parecido al|parecido a)\s*(?:la serie de anime|series de anime|la serie anime|la serie|anime|)\s*(llamada|conocida como|cuyo nombre es|la cual se llama|)\s*((?:\w+\s*)+)\s*\??"

    #Regex in english
    regex_patron_01 = r "\ s * \ ¿? (?: tell me the | tell me some| tell me | say | which are the | which are the | which are | which | which animes | which | top) \ s * ((?: \ w + \ s *) +) \ s * (?: anime series | anime series | anime | anime | anime | anime) \ s * (?: similar to | similar to | similar to | similar to | similar to | similar to | similar to | similar to) \ s * (?: the anime series | anime series | the anime series | the series | anime |) \ s * (called | known like | whose name is | which is called |) \ s * ((?: \ w + \ s *) +) \ s * \ ?? "

    m = re.search(regex_patron_01, input_text_to_check, re.IGNORECASE)  # Check whether the input matches the regex, i.e. whether to enter the block below

    if m:
        num, anime_name = m.groups()[2]

        num = num.strip()
        anime_name = anime_name.strip()
        print(num)
        print(anime_name)

    return text

input_text_str = input("ingrese: ")
text = ""

print(name_and_img_identificator(input_text_str, text))

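It gives me the error below. The truth is that I don't know how to build this regex pattern so that it extracts only these 2 values (substrings) from the input.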

Traceback (most recent call last):
  File "serie_recommendarion_for_chatbot.py", line 154, in <module>
    print(serie_and_img_identificator(input_text_str, text))
  File "anime_recommendarion_for_chatbot.py", line 142, in name_and_img_identificator
    num, anime_name = m.groups()
ValueError: too many values to unpack (expected 2)

If I enter an input like this: 'Dame el top 8 de animes parecidos a Gundam' (in English: 'Give me the top 8 anime like Gundam')

I need it to extract:

num = '8'
anime_name = 'Gundam'

How should I fix my regex in this case?

You could try taking just the first 2 values; perhaps you are simply missing a colon (a slice instead of an index).

num, anime_name = m.groups()[:2]

That may well be the case, since you are getting the "too many values to unpack" error.
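
To make the difference concrete, here is a small sketch using a toy pattern and input (both made up for illustration, not taken from the question) with three capture groups. Indexing with [2] yields a single string, which Python then tries to unpack character by character. Slicing with [:2] yields a tuple of the first two groups, which unpacks cleanly but only helps if those are the groups you actually want.

```python
import re

# Toy pattern with three capture groups (hypothetical, for illustration only).
m = re.search(r"top (\d+) anime like (called |)(\w+)", "top 8 anime like gundam")
groups = m.groups()
print(groups)

# Indexing returns a single string; unpacking it iterates over its characters.
try:
    num, anime_name = groups[2]
    unpack_error = None
except ValueError as exc:
    unpack_error = str(exc)
print(unpack_error)

# Slicing returns a tuple of the first two groups, so unpacking works,
# but here the second group is the optional word, not the anime name.
num, second = groups[:2]
print(repr(num), repr(second))

# Picking the wanted groups explicitly avoids both problems.
num, anime_name = groups[0], groups[2]
print(num, anime_name)
```

So with a pattern that has more than two groups, the slice removes the exception but may still hand back the wrong pair.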


Use two different patterns, one for the number and one for the name. To keep it simple I have included only a few of the keywords.

For the number: Test cases

(?<=(which are the|which|top)\s)[0-9]+(?=\s(anime series|anime))

For the name: Test cases

(?<=(like|called|which is called)\s)[A-Za-z]+

What remains is to write the same patterns with the Spanish keywords.
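
One caveat if these patterns are used from Python: the standard re module only accepts fixed-width lookbehind, so a lookbehind whose alternatives have different lengths is rejected when the pattern is compiled. Below is a minimal sketch of the same two-pattern idea with the keyword prefix moved into a non-capturing group and the value into a capture group. The names NUM_RE, NAME_RE and extract are made up for this example, and the keyword lists are the short English ones shown above.

```python
import re

# The lookbehind version does not compile with Python's re module,
# because its alternatives have different lengths.
try:
    re.compile(r"(?<=(which are the|which|top)\s)[0-9]+")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False
print(lookbehind_ok)

# Same idea without lookbehind: match the keyword, capture only the value.
NUM_RE = re.compile(
    r"(?:which are the|which|top)\s([0-9]+)(?=\s(?:anime series|anime))",
    re.IGNORECASE,
)
NAME_RE = re.compile(r"(?:like|called|which is called)\s([A-Za-z]+)", re.IGNORECASE)


def extract(text):
    """Return (num, anime_name); either item is None if its pattern fails."""
    num = NUM_RE.search(text)
    name = NAME_RE.search(text)
    return (num.group(1) if num else None, name.group(1) if name else None)


print(extract("Give me the top 8 anime like Gundam"))
```

Searching twice keeps each pattern short, and a missing number does not prevent the name from being found.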

Try this in a regex playground: Link

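Not much has changed: the first group is still the number of anime, and the second group is the name of the anime itself. I only simplified the regex a little (removing some bits that are not needed for the demonstration). Most of it is unchanged from your version, which is actually a pretty solid regex.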

Regex: \b(\d+).*(?:called|that are like|known like|whose name is|which is called)\s*((?:\w+\s*)+)\s*\??


Tested with the input from your question, which I roughly translated into English :-)

import re
from unicodedata import normalize


def name_and_img_identificator(input_text, text):
    input_text = re.sub(r"([^n\u0300-\u036f]|n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+", r"",
                        normalize("NFD", input_text), 0, re.I)
    input_text = normalize('NFC', input_text)  # -> NFC
    input_text_to_check = input_text.lower()  # Convert everything to lowercase


    # Regex in english

    # original
    #   note: you have extra spaces here, which regex might not like.
    #   you can get rid of spaces and then it should hopefully be fine.
    # regex_patron_01 = r "\ s * \ ¿? (?: tell me the | tell me some| tell me | say | which are the | which are the | which are | which | which animes | which | top) \ s * ((?: \ w + \ s *) +) \ s * (?: anime series | anime series | anime | anime | anime | anime) \ s * (?: similar to | similar to | similar to | similar to | similar to | similar to | similar to | similar to) \ s * (?: the anime series | anime series | the anime series | the series | anime |) \ s * (called | known like | whose name is | which is called |) \ s * ((?: \ w + \ s *) +) \ s * \ ?? "

    # simplified
    regex_patron_01 = r'\b(\d+).*(?:called|that are like|known like|whose name is|which is called)\s*((?:\w+\s*)+)\s*\??'

    m = re.search(regex_patron_01, input_text_to_check,
                  re.IGNORECASE)  # Check whether the input matches the regex, i.e. whether to enter the block below

    if m:
        num, anime_name = m.groups()[:2]

        num = num.strip()
        anime_name = anime_name.strip()
        print(num)
        print(anime_name)

    return text


#input_text_str = input("ingrese: ")
input_text_str = 'Tell me the top 8 animes that are like Gundam?'
text = ""

print(name_and_img_identificator(input_text_str, text))

Errors in the regex pattern

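  1. You forgot to add ?: to make that group non-capturing. Change:
regex_patron_01 = r"...(llamada|conocida como|cuyo nombre es|la cual se llama|)..."

To:

regex_patron_01 = r"...(?:llamada|conocida como|cuyo nombre es|la cual se llama|)..."
  2. To avoid capturing extra spaces or words, the capture for num should be non-greedy, so that it does not swallow words such as "de" and instead leaves them for the following part of the pattern to match. Change:
regex_patron_01 = r"...((?:\w+\s*)+)..."

To:

regex_patron_01 = r"...((?:\w+?\s*?)+)..."
  3. m.groups() already returns a tuple of the matched strings, so indexing it gives you a single string. Unpacking that one string into two names is the root cause of the error. Change:
num, anime_name = m.groups()[2]

To:

num, anime_name = m.groups()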

With the changes above it works:

8
gundam
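
Put together, the three fixes can be sketched as follows. The pattern is the Spanish one from the question with only the ?: and the non-greedy quantifiers added. The function name extract_num_and_name is made up for this example, and the accent-stripping step from the question is left out for brevity.

```python
import re

# Spanish pattern from the question, split over several lines for readability.
# Only two capture groups remain: the number and the anime name.
REGEX_PATRON_01 = (
    r"\s*\¿?(?:dime los|dime las|dime unos|dime unas|dime|di|cuales son los"
    r"|cuales son las|cuales son|cuales|que animes|que|top)\s*"
    r"((?:\w+?\s*?)+)\s*"  # fix 2: non-greedy, so "de" is not captured
    r"(?:de series anime|de anime series|de animes|de anime|animes|anime)\s*"
    r"(?:similares al|similares a|similar al|similar a"
    r"|parecidos al|parecidos a|parecido al|parecido a)\s*"
    r"(?:la serie de anime|series de anime|la serie anime|la serie|anime|)\s*"
    r"(?:llamada|conocida como|cuyo nombre es|la cual se llama|)\s*"  # fix 1: ?:
    r"((?:\w+?\s*?)+)\s*\??"
)


def extract_num_and_name(input_text):
    """Return (num, anime_name), or None when the input does not match."""
    m = re.search(REGEX_PATRON_01, input_text.lower(), re.IGNORECASE)
    if not m:
        return None
    num, anime_name = m.groups()  # fix 3: unpack the whole tuple, no index
    return num.strip(), anime_name.strip()


print(extract_num_and_name("Dame el top 8 de animes parecidos a Gundam"))
```

Returning None for a non-matching input lets the caller decide what to do instead of silently printing nothing.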

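Improvements

Your regex is too complex and contains many hardcoded words, which differ from language to language. My suggestion is to settle on a standard format for the strings it accepts: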

Any text here (num) any text here (anime_name)

This is already the format of your input:

Dame el top 8 de animes parecidos a Gundam

So you can drop that long regex and replace it with this one; the output will be the same:

regex_patron_01 = r"^.*?(\d+).*\s(.+)$"

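Note that this requires (anime_name) to be a single word. To support names with several words, we have to pick a special character that marks the start of the anime name, for example a colon (:):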

Dame el top 8 de animes parecidos a: Gundam X

The regex then becomes:

regex_patron_01 = r"^.*?(\d+).*:\s(.+)$"

Output

8
gundam x
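
As a quick check of both simplified patterns, here is a short sketch. The names SINGLE_WORD, WITH_COLON and parse are made up for this example, and the input is lowercased as in the question's code.

```python
import re

# Number anywhere, then the last whitespace-separated word as the name.
SINGLE_WORD = re.compile(r"^.*?(\d+).*\s(.+)$")
# Number anywhere, then everything after the last ": " as the name.
WITH_COLON = re.compile(r"^.*?(\d+).*:\s(.+)$")


def parse(pattern, text):
    """Return (num, anime_name) as a tuple, or None if there is no match."""
    m = pattern.search(text.lower())
    return m.groups() if m else None


print(parse(SINGLE_WORD, "Dame el top 8 de animes parecidos a Gundam"))
print(parse(WITH_COLON, "Dame el top 8 de animes parecidos a: Gundam X"))

# Without the colon marker, only the last word of a multi-word name is kept.
print(parse(SINGLE_WORD, "Dame el top 8 de animes parecidos a Gundam X"))
```

The last call shows the single-word limitation: the name comes back as just the final word, which is why a marker character is needed for multi-word names.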