Extracting a String from a List inside a Dictionary in Python
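I'm new to Python (one week in) and would really appreciate your help. I'm trying to extract a string (a date) from 6,000+ news articles. I'm practicing on some made-up text that follows the same pattern as the news articles I want to process: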
Lorem Ipsum Dolor
Monday, 5/21/2017
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus. Praesent sed rhoncus eo. Duis id commodo orci.
Quisque at dignissim lacus.
and:
Lorem Ipsum Dolor
Monday, 7/21/2017
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus. Praesent sed rhoncus eo. Duis id commodo orci.
Quisque at dignissim lacus.
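I know that the date is in the same position in every .txt file: between the newline (\n) that follows each article's title and the next newline (\n).
So far, I have managed to build a dictionary with the following code: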
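import os

base_dir = 'C:/Users/Lorem/text'
output = {}
file_list = []
# Collect the full path of every file whose name contains 'txt'
for (dirpath, dirnames, filenames) in os.walk(base_dir):
    for f in filenames:
        if 'txt' in str(f):
            e = os.path.join(str(dirpath), str(f))
            file_list.append(e)
# Read each file into a list of lines, keyed by its path
for f in file_list:
    print(f)
    with open(f, 'r') as txtfile:
        output[f] = []
        for line in txtfile:
            # Only lines ending in a newline are kept, so a final
            # line without a trailing '\n' is dropped
            if '\n' in line:
                output[f].append(line)
tabs = []
for tab in output:
    tabs.append(tab)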
The output looks fine:
output
{'C:/Users/Lorem/text\lorem.txt': ['Lorem Ipsum Dolor\n','Monday, 5/21/2017\n','\n','Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc fringilla arcu congue metus aliquam mollis.\n','Mauris nec maximus purus. Maecenas sit amet pretium tellus. Praesent sed rhoncus eo. Duis id commodo orci.\n'],'C:/Users/Lorem/text\lorem2.txt': ['Lorem Ipsum Dolor\n','Monday, 7/21/2017\n','\n','Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc fringilla arcu congue metus aliquam mollis.\n','Mauris nec maximus purus. Maecenas sit amet pretium tellus. Praesent sed rhoncus eo. Duis id commodo orci.\n']}
At this point I tried to use a regular expression to extract the dates from the lists in the dictionary:
result = []
for out in output.values():
    if re.search('Dolor\n,(.*)\n', out):
        result.append(out)
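However, the regular expression does not work on a list. How can I parse these dates out of my lists? Ideally, I'd like a dictionary or some other data structure holding the text and the date, so that I can move it into R if I find that more comfortable to work in.
Thanks!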
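On the regex attempt itself: re.search expects a string, so passing it a list raises a TypeError, and the pattern also requires a comma right after the newline, which the sample text never has. Here is a minimal sketch of one way around both problems, assuming Python 3 and that the title is always the first line and the date the second. The shortened file names and one-line article bodies are made up for illustration.

```python
import re

# Sample data in the same shape as the `output` dictionary in the question
# (file names and article bodies shortened for illustration).
output = {
    'lorem.txt': ['Lorem Ipsum Dolor\n', 'Monday, 5/21/2017\n', '\n',
                  'Lorem ipsum dolor sit amet.\n'],
    'lorem2.txt': ['Lorem Ipsum Dolor\n', 'Monday, 7/21/2017\n', '\n',
                   'Lorem ipsum dolor sit amet.\n'],
}

# Assumption: line 1 is the title and line 2 is the date.
# '.' does not match a newline, so each '.*' stops at the end of its line.
pattern = re.compile(r'\A.*\n(.*)\n')

result = {}
for path, lines in output.items():
    text = ''.join(lines)           # re.search needs a string, not a list
    match = pattern.search(text)
    if match:
        result[path] = match.group(1)

print(result)
```

Joining the list back into one string is only needed if you want to stay with a regex. Since the date is always the second element, plain indexing is usually simpler.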
You can use a dictionary comprehension to parse it:
output = {'C:/Users/Lorem/text\lorem.txt': ['Lorem Ipsum Dolor\n','Monday, 5/21/2017\n','\n','Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc fringilla arcu congue metus aliquam mollis.\n','Mauris nec maximus purus. Maecenas sit amet pretium tellus. Praesent sed rhoncus eo. Duis id commodo orci.\n'],'C:/Users/Lorem/text\lorem2.txt': ['Lorem Ipsum Dolor\n','Monday, 7/21/2017\n','\n','Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc fringilla arcu congue metus aliquam mollis.\n','Mauris nec maximus purus. Maecenas sit amet pretium tellus. Praesent sed rhoncus eo. Duis id commodo orci.\n']}
dates = {a:b[1:3] for a, b in output.items()}
Output:
{'C:/Users/Lorem/text\lorem2.txt': ['Monday, 7/21/2017\n', '\n'], 'C:/Users/Lorem/text\lorem.txt': ['Monday, 5/21/2017\n', '\n']}
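The slice b[1:3] keeps the trailing newline and the blank line along with the date. If you want a clean date next to the article text, something along these lines may work. It is only a sketch, assuming Python 3 and that every date line looks like "Weekday, M/D/YYYY". The column names path, date and text, the shortened file names and the one-line bodies are my own choices for illustration.

```python
import csv
import io
from datetime import datetime

# Sample data in the same shape as `output` (shortened for illustration).
output = {
    'lorem.txt': ['Lorem Ipsum Dolor\n', 'Monday, 5/21/2017\n', '\n',
                  'Lorem ipsum dolor sit amet.\n'],
    'lorem2.txt': ['Lorem Ipsum Dolor\n', 'Monday, 7/21/2017\n', '\n',
                   'Lorem ipsum dolor sit amet.\n'],
}

rows = []
for path, lines in sorted(output.items()):
    date_text = lines[1].strip()               # e.g. 'Monday, 5/21/2017'
    # Assumption: the weekday and the date are separated by ', '.
    # Only the numeric part is parsed, so the weekday name is ignored.
    day_part = date_text.split(', ', 1)[1]
    date = datetime.strptime(day_part, '%m/%d/%Y').date()
    body = ''.join(lines[2:]).strip()          # everything after the date line
    rows.append({'path': path, 'date': date.isoformat(), 'text': body})

# Written to an in-memory buffer here; for real use, open a file instead,
# e.g. open('articles.csv', 'w', newline='') (hypothetical file name).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['path', 'date', 'text'])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

A CSV file written this way should be readable in R with read.csv, and the ISO-formatted dates (YYYY-MM-DD) can be converted there with as.Date.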