pandas: text analysis: Transfer raw data to dataframe

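I need to read lines from a text file and extract, from each line, the name of the person quoted and the quoted text.

The lines look similar to this: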

"Am I ever!", Homer Simpson responded.

Remarks:

Hint: Use the object returned by the 'open' method as the file handler. Each line you read is expected to end with a newline. Remove the newline as follows: line_cln = line.strip()
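A minimal sketch of that hint, assuming a UTF-8 text file; the helper name `read_clean_lines` is made up for illustration and is not part of the assignment:

```python
def read_clean_lines(filename):
    """Yield each line of the file with surrounding whitespace removed."""
    # The object returned by open() is the file handler; iterating over it
    # gives one line at a time, each still ending with its newline.
    with open(filename, encoding="utf8") as handle:
        for line in handle:
            line_cln = line.strip()  # drops the trailing newline
            yield line_cln
```

Empty lines come back as empty strings, so the caller can skip them with a plain `if not line_cln` check.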

There are three options for each line (assume exactly one of them applies):

  • The first set of patterns, in which the person name appears before the quoted text.
  • The second set of patterns, in which the quoted text appears before the person name.
  • Empty lines.

Complete the transfer_raw_text_to_dataframe function to return a dataframe with the extracted person name and text as explained above. The information is expected to be extracted from the lines of the given 'filename' file.

The returned dataframe should include two columns:

  • person_name - containing the extracted person name for each line.
  • extracted_text - containing the extracted quoted text for each line.

The returned values:

  • dataframe - The dataframe with the extracted information as described above.
  • Important Note: if a line does not contain any quotation pattern, no information should be saved in the corresponding row in the dataframe.

What I have so far: [edited]

def transfer_raw_text_to_dataframe(filename):

    data = open(filename)
    
    quote_pattern ='"(.*)"'
    name_pattern = "\w+\s\w+"
    
    df = open(filename, encoding='utf8')
    lines = df.readlines()
    df.close()
    dataframe = pd.DataFrame(columns=('person_name', 'extracted_text'))
    i = 0  

    for line in lines:
        quote = re.search(quote_pattern,line)
        extracted_quotation = quote.group(1)

        name = re.search(name_pattern,line)
        extracted_person_name = name.group(0)
        
        df2 = {'person_name': extracted_person_name, 'extracted_text': extracted_quotation}
        dataframe = dataframe.append(df2, ignore_index = True)

        dataframe.loc[i] = [person_name, extracted_text]
        i =i+1
            
    return dataframe

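The dataframe that gets created has the right shape. The problem is that the person name in every row is 'Oh man' and the quote is "Oh man, that guy is hard to love." (in all of them). This is strange, because that text isn't even in the txt file...

Can anyone help me solve this?

Edit: I need to extract from a simple txt file that contains only these lines: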

"Am I ever!", Homer Simpson responded.
"Hmmm. So... is it okay if I go to the women's conference with Chloe?", Lisa Simpson answered.
"Really? Uh, sure.", Bart Simpson answered.
"Sounds great.", Bart Simpson replied.
Homer Simpson responded: "Danica Patrick in my thoughts!"
C. Montgomery Burns: "Trust me, he'll say it, or I'll bust him down to Thursday night vespers."
"Gimme that torch." Lisa Simpson said.
"No! No, I've got a lot more mothering left in me!", Marge Simpson said.
"Oh, Homie, I don't care if you're a billionaire. I love you just because you're..." Marge Simpson said.
"Damn you, e-Bay!" Homer Simpson answered.
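A likely cause of the repeated 'Oh man' row, assuming the code runs in a notebook or session where `person_name` and `extracted_text` were assigned earlier: the `dataframe.loc[i] = [person_name, extracted_text]` line uses names that are never defined inside the function, so Python falls back to the module-level variables. The sketch below reproduces that effect with plain lists; the global values and the `collect` helper are hypothetical stand-ins.

```python
# Hypothetical leftovers from an earlier notebook cell.
person_name = "Oh man"
extracted_text = "Oh man, that guy is hard to love."


def collect(lines):
    rows = []
    for line in lines:
        # The value that was actually parsed from the current line.
        extracted_person_name = line.split(":")[0]
        # Bug: `person_name` and `extracted_text` are not local names, so the
        # globals above are used and every row ends up identical.
        rows.append((person_name, extracted_text))
    return rows


print(collect(["Homer Simpson: hi", "Lisa Simpson: hello"]))
```

Writing the locally extracted values (`extracted_person_name`, `extracted_quotation`) into the row instead should remove the stale text.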

To loop over the files in a folder:

import glob

# All files matching the mask, i.e. ending with .txt
# (raw string, so the backslashes are not treated as escape sequences)
mylist = glob.glob(r"C:\MyFolder\*.txt")
print("file_list:\n", mylist)

for filepath in mylist:
    # do something with each filepath
    print(filepath)

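Collect all the dfs you get from the files, like this (for example, reading csv files by mask):

import glob
import pandas as pd

def dfs_collect():
    # all files matching the mask (raw string keeps the backslashes literal)
    mylist = glob.glob(r"C:\MyFolder\*.txt")
    print("file_list:\n", mylist)

    # read every file and stack the results into one dataframe
    dfa = pd.concat((pd.read_csv(file, sep=';', encoding='windows-1250', index_col=False) for file in mylist), ignore_index=True)
    return dfa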

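But to get at the content of your files, a sample of that content is needed... Without an example of your txt file (with dummy info, but keeping its real structure), I doubt anyone will try to imagine what it should look like.

Maybe something like this: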

    import pandas as pd
    import re

    # raw string: otherwise "\12" is read as an octal escape, not as part of the path
    with open(r"C:\12.txt", "r") as f:
        data = f.read()
    # print(data)

    ########### find all text in quotes
    m = re.findall(r'"(.+)"', data)
    print("RESULT: \n", m)
    df = pd.DataFrame({'rep': m})
    print(df)
    ########### remove the quoted text
    m = re.sub(r'"(.+)"', '', data)
    ########### get first name & last name from the remaining text in each line
    regex = re.compile(r"([A-Z][a-z]+ [A-Z][a-z]+)")
    mm = regex.findall(m)
    df1 = pd.DataFrame({'author': mm})
    print(df1)
    ########### join the 2 dataframes
    fin = pd.concat([df, df1], axis=1)
    print(fin)

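All the prints are only there for checking (take them out for cleaner code). Only "C. Montgomery Burns" loses his first letter, the initial "C."...

I think the following does what you need. Please verify that the output is accurate. I'll explain any lines that are unclear.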
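The lost initial comes from the name pattern, which only accepts a capital followed by lowercase letters. One possible line-by-line variant that keeps initials such as "C." is sketched below. It assumes that a name is a run of capitalised words or single-letter initials, and that lines without a quotation pattern are simply skipped, which is one reading of the "Important Note" in the question. The `parse_line` helper and the pattern names are made up for illustration.

```python
import re

import pandas as pd

# One name token: an initial such as "C." or a capitalised word.
NAME = r"[A-Z](?:\.|[\w'-]*)(?:\s+[A-Z](?:\.|[\w'-]*))*"

# Option 1: name first, e.g.  Homer Simpson responded: "..."
NAME_FIRST = re.compile(rf'^(?P<name>{NAME})[^"]*"(?P<text>.*)"')
# Option 2: quote first, e.g.  "...", Homer Simpson responded.
QUOTE_FIRST = re.compile(rf'^"(?P<text>.*)"\W*(?P<name>{NAME})')


def parse_line(line):
    """Return (person_name, extracted_text), or None if no pattern matches."""
    line_cln = line.strip()
    for pattern in (NAME_FIRST, QUOTE_FIRST):
        match = pattern.search(line_cln)
        if match:
            return match.group("name"), match.group("text")
    return None


def transfer_raw_text_to_dataframe(filename):
    rows = []
    with open(filename, encoding="utf8") as handle:
        for line in handle:
            parsed = parse_line(line)
            if parsed is not None:  # empty or non-matching lines are skipped
                rows.append(parsed)
    # Build the frame once from the collected rows.
    return pd.DataFrame(rows, columns=["person_name", "extracted_text"])
```

Rows are collected in a list and turned into a dataframe at the end because `DataFrame.append` was removed in pandas 2.0, and building the frame once is also cheaper than growing it row by row.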

import pandas as pd
import numpy as np
import nltk
from nltk.tree import ParentedTree
import typing as t # This is optional

# Using `read_csv` to read in the text because I find it easier
data = pd.read_csv("dialog.txt", header = None, sep = "~", quoting=3)
dialouges = data.squeeze() # Getting a series from the above DF with one column

def tag_sentence(tokenized: t.List[str]) -> t.List[t.Tuple[str, str]]:
    tagged = nltk.pos_tag(tokenized)
    tagged = [(token, tag) if tag not in {"``", "''"} else (token, "Q") for token, tag in tagged]
    keep = {"Q", "NNP"}
    renamed = [(token, "TEXT") if tag not in keep else (token, tag) for token, tag in tagged]
    return renamed

def get_parse_tree(tagged_sent):
    grammar = """
    NAME: {<NNP>+}
    WORDS: {<TEXT>+}
    DIALOUGE: {<Q><WORDS|NAME>+<Q>}
    """
    cp = nltk.RegexpParser(grammar)
    parse_tree = cp.parse(tagged_sent)
    return parse_tree

def extract_info(parse_tree):
    ptree = ParentedTree.convert(parse_tree)
    trees = list(ptree.subtrees())
    root = ptree.root()

    # Defaults, so a line without a quote or a name does not raise UnboundLocalError
    dialouge = person = None

    for subtree in trees[1:]:
        if subtree.parent() == root:
            if subtree.label() == "DIALOUGE":
                dialouge = ' '.join(word for word, _ in subtree.leaves()[1:-1]) # Skipping quotation marks
            if subtree.label() == "NAME":
                person = ' '.join(word for word, _ in subtree.leaves())

    return dialouge, person

def process_sentence(sentence):
    return extract_info(get_parse_tree(tag_sentence(nltk.word_tokenize(sentence))))

processed = [process_sentence(line) for line in dialouges]
result = pd.DataFrame(processed, columns=["extracted_text", "person_name"])

The resulting DataFrame looks like this: