Extracting a flat dictionary from structured text using a dictionary comprehension

Question

I'm trying to build a dictionary from a piece of structured text, but I can't work out the right syntax.

text = 'english (fluently), spanish (poorly)'

# desired output: 
{'english': 'fluently', 'spanish': 'poorly'}

# one of my many attempts: 
dict((language,proficiency.strip('\(\)')) for language,proficiency in lp.split(' ') for lp in text.split(', '))

# but resulting error: 
NameError: name 'lp' is not defined

I guess the lp in lp.split(' ') isn't defined, but I can't figure out how to change the syntax to get the desired result.

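The real-world scenario is more complicated. I have a dataframe, and my goal is to eventually use a function to tidy the data above into one column per language and one column for each corresponding proficiency. Something like the following (although it can probably be done more efficiently):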

# pandas dataframe
pd.DataFrame({'language': ['english, spanish (poorly)', 'turkish']})
        
# desired output: 
pd.DataFrame({'Language: English': [True, False], 'Language proficiency: English': ['average', pd.NA], 'Language: Spanish': [True, False], 'Language proficiency: Spanish': ['poorly', pd.NA], 'Language: Turkish': [False, True], 'Language proficiency: Turkish': [pd.NA, 'average']})
    
# my attempt
def tidy(content):
    if pd.isna(content):
        pass
    else:
        dict((language,proficiency.strip('\(\)')) for language,proficiency in lp.split(' ') for lp in text.split(', '))

def tidy_language(language, content):
    if pd.isna(content):
        return pd.NA
    else:
        if language in content.keys():
            return True
        else:
            return False

def tidy_proficiency(language, content):
    if pd.isna(content):
        return pd.NA
    else:
        if language in content.keys():
            return content.language
        else:
            return pd.NA

languages = ['english', 'spanish', 'turkish']
df['language'] = df['language'].map(lambda x: tidy(x))
for language in languages:
    df['Language: {}'.format(language.capitalize())] = df['language'].map(lambda x: tidy_language(language, content)
    df['Language proficiency: {}'.format(language.capitalize())] =  df['language'].map(lambda x: tidy_proficiency(language, content)

Answer 1

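You need to swap the two for clauses in the comprehension (the for clauses must appear in the same order as they would if you wrote the loops as nested imperative code).
The backslashes in .strip('\(\)') are not needed.
for language,proficiency in lp.split(' ') would try to unpack each item of lp.split(' ') into the tuple (language,proficiency), so wrap lp.split(' ') in a one-element list to get what you want: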

dict((l,p.strip('()')) for lp in text.split(', ') for l,p in [lp.split(' ')])

{'english': 'fluently', 'spanish': 'poorly'}

The above can also be written as a dict comprehension:

{l: p.strip('()') for lp in text.split(', ') for l,p in [lp.split(' ')]}

which reads a little better.
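The ordering rule is easier to see next to the nested loops that the comprehension stands for. A minimal sketch using the sample text from the question (the names result and as_comprehension are only for illustration):

```python
text = 'english (fluently), spanish (poorly)'

# Nested loops: the outer loop is written first, exactly as in the comprehension.
result = {}
for lp in text.split(', '):          # 'english (fluently)', then 'spanish (poorly)'
    for l, p in [lp.split(' ')]:     # one-element list, so the pair is unpacked once
        result[l] = p.strip('()')

# The same logic as a dict comprehension, with the for clauses in the same order.
as_comprehension = {l: p.strip('()') for lp in text.split(', ') for l, p in [lp.split(' ')]}

print(result)
print(result == as_comprehension)
```

Reading the for clauses left to right gives the loops from outermost to innermost, which is why lp has to be introduced before it is used.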

An alternative using re:

import re
dict(re.findall(r'(\w+) \((\w+)\),?', text))

{'english': 'fluently', 'spanish': 'poorly'}
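With two capture groups, re.findall returns a list of 2-tuples, which dict() accepts directly. One thing to keep in mind, sketched below with a second sample string taken from the dataframe part of the question: an entry without a parenthesised proficiency does not match the pattern and is silently left out.

```python
import re

pattern = r'(\w+) \((\w+)\),?'

# Each match contributes one (language, proficiency) tuple.
pairs = re.findall(pattern, 'english (fluently), spanish (poorly)')
print(pairs)
print(dict(pairs))

# 'english' has no proficiency here, so only 'spanish' matches the pattern.
partial = dict(re.findall(pattern, 'english, spanish (poorly)'))
print(partial)
```

If entries without a proficiency must be kept, the pattern would need an optional second group, or the split-based approach may be the simpler choice.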

Answer 2

Here is a quick solution. Pass the text to the function.

def text_to_dict(text):
    # Drop commas and parentheses, keeping only the words and whitespace.
    cleaned = "".join(ch for ch in text if ch not in ",()")

    # Split on whitespace: languages land at even indices, proficiencies at odd ones.
    words = cleaned.split()

    # Pair each language with the word that follows it.
    return dict(zip(words[0::2], words[1::2]))

if __name__=="__main__":
    text="english (fluently), spanish (poorly), bangla (fluently)"
    print(text_to_dict(text))

Answer 3

While fferri provides some perfect solutions to my original question, my final solution in the context of the dataframe more closely resembles SuperNoob's suggestion.

My final solution:

# Create a parser function to form a dictionary of language: proficiency pairs from the values in the 'Speaks' column.
def parse_dictionary(content):
    if pd.isna(content):
        pass
    else:
        d = {}
        lps = content.split(', ')
        for lp in lps:
            if '(' not in lp:
                # No proficiency given: normalise the language name the same way as below.
                l = lp.strip().capitalize()
                p = pd.NA
            else:
                l, p = lp.split('(')
                l = l.strip().capitalize()
                p = p.strip('()')
            d[l] = p
        return d
    
# Create a parser function to return the languages from the dictionary in the 'Speaks' column.
def parse_language(language, d):      
    if pd.isna(d):
        pass
    else:
        if language in d.keys():
            return True
        else:
            return False
        
# Create a parser function to return the language proficiencies from the dictionary in the 'Speaks' column.
def parse_proficiency(language, d):   
    if pd.isna(d):
        pass
    else:
        if language in d.keys():
            return d[language]
        else:
            return pd.NA

# Parse the values in the 'Speaks' column to create a dictionary of language: proficiency pairs.
df['Speaks'] = df['Speaks'].map(lambda x: parse_dictionary(x))  

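# Parse the values in the 'Speaks' column to create separate 'Language' columns with True-False values.
# Note: `languages` is not defined in this snippet. Since parse_dictionary() capitalizes its keys,
# it is assumed to hold capitalized names such as ['English', 'Spanish', 'Turkish'].
for language in languages:
    df['Language: {}'.format(language)] = df['Speaks'].apply(lambda d: parse_language(language, d))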

# Parse the values in the 'Speaks' column to create separate 'Language proficiency' columns with proficiency values.
for language in languages:
    df['Language proficiency: {}'.format(language)] = df['Speaks'].apply(lambda d: parse_proficiency(language, d))
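The parsing step can be checked on its own, without pandas. The following is a minimal sketch rather than the code used above: parse_pairs is a hypothetical name, and None stands in for pd.NA purely to keep the example dependency-free. It uses str.partition, which always returns three parts, so entries with and without a proficiency go through the same path.

```python
def parse_pairs(content):
    """Map each capitalized language to its proficiency, or None if none is given."""
    pairs = {}
    for entry in content.split(','):
        # partition() returns (before, separator, after); 'after' is '' when '(' is absent.
        language, _, proficiency = entry.partition('(')
        proficiency = proficiency.strip().rstrip(')').strip()
        # None is an assumption standing in for pd.NA.
        pairs[language.strip().capitalize()] = proficiency or None
    return pairs

print(parse_pairs('english, spanish (poorly)'))
print(parse_pairs('turkish'))
```

Splitting on ',' and stripping each entry, rather than splitting on ', ', also tolerates inconsistent spacing after the commas.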
