Using a regex to grab folder titles from a text file (Python)
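I'm trying to use a regex to read a text file and create folders in a specific directory based on what the regex finds. The text file I'm reading is some HTML source from the page I want to grab the folder titles from (which is why the regex looks odd).
This is the file I'm reading from. (Very long.)
Here is my code: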
import os
import re
with open('folders.txt','r', encoding='utf-8') as f:
    lines = f.readlines()
    match = re.search(r'>[\w\.-]+</a></td>', lines)
    match = match.rstrip("</a></td>")
    match = match.lstrip(">")
    newpath = r'C:\Desktop\scriptFolders\%s' %match
    if not os.path.exists(newpath): os.makedirs(newpath)
When I put this code into the shell, I get the following error:
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "C:\Python34\lib\re.py", line 170, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or buffer
How far off track am I?
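There are a number of errors and possible improvements in your code. The immediate TypeError comes from passing the list returned by f.readlines() to re.search, which expects a single string. The rest are not easy to explain in prose, so here is a working version of the code, with comments highlighting the changes and the reasons behind them.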
import os
import re
# Precompile the regex so it only happens once. This saves a bit of time,
# especially if your file is large.
# I've also modified the regex to include a capture group [1] for the part
# between the > and the <, allowing us to grab the string there later. There
# are other ways to do it (e.g. with lookbehind and lookahead), but this is the
# simplest.
regex = re.compile(r'>([\w\.-]+)</a></td>')
with open('folders.txt', 'r', encoding='utf-8') as f:
    # Loop through the lines in f.
    # Alternatively, you can also do
    #     lines = f.readlines()
    #     for line in lines:
    #         ...
    # but it's less memory-efficient because it puts the whole file in memory.
    for line in f:
        match = regex.search(line)
        # re.search returns a match object [2], or None if the string doesn't
        # match the regex.
        if not match:  # Throw away non-matching lines.
            continue
        # Get the value of capture group #1.
        match = match.group(1)
        newpath = r'C:\Desktop\scriptFolders\%s' % match
        if not os.path.exists(newpath):
            os.makedirs(newpath)
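For comparison, here is a minimal sketch of a variant that collects every title in one pass with findall, and that also shows why the rstrip/lstrip approach in the question is fragile even when it is given a string. It assumes the whole file fits in memory and that one line may hold several titles. The sample_html text is invented for illustration (the real folders.txt is not reproduced here), and extract_titles and make_folders are hypothetical helper names, not part of the code above.

```python
import os
import re
import tempfile

# Same pattern as above: capture the text between '>' and '</a></td>'.
FOLDER_RE = re.compile(r'>([\w.-]+)</a></td>')


def extract_titles(html):
    """Return every captured folder title found in the HTML text."""
    # With exactly one capture group, findall returns a list of the
    # captured strings rather than match objects.
    return FOLDER_RE.findall(html)


def make_folders(base_dir, titles):
    """Create one folder per title under base_dir and return the paths."""
    paths = []
    for title in titles:
        path = os.path.join(base_dir, title)
        # exist_ok=True (Python 3.2+) replaces the os.path.exists check.
        os.makedirs(path, exist_ok=True)
        paths.append(path)
    return paths


# Hypothetical sample markup standing in for the contents of folders.txt.
sample_html = (
    '<tr><td><a href="/a">project-one</a></td></tr>\n'
    '<tr><td><a href="/b">notes_2014</a></td></tr>\n'
    '<tr><td>no link here</td></tr>\n'
)

titles = extract_titles(sample_html)

# str.rstrip and str.lstrip treat their argument as a set of characters,
# not as a literal suffix or prefix, so they can also eat into the title.
stripped = '>data</a></td>'.rstrip('</a></td>').lstrip('>')

if __name__ == '__main__':
    # A temporary directory keeps the demo from touching real folders.
    with tempfile.TemporaryDirectory() as base:
        make_folders(base, titles)
        print(titles)                     # ['project-one', 'notes_2014']
        print(sorted(os.listdir(base)))   # ['notes_2014', 'project-one']
    print(repr(stripped))                 # '' (every letter of 'data' was stripped)
```

In real use, the string passed to extract_titles would be f.read() on the opened file, and base_dir would be the target directory from the question.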
References: