Using a regex to grab folder titles from a text file (Python)

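I'm trying to use a regex to read a text file and create folders in a specific directory based on what the regex finds. The text file I'm reading is some HTML source code from the page I want to take the folder titles from. (That's why the regex search looks odd.)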

This is the file I'm reading from. (It's very long.)

Here is my code:

import os
import re
with open('folders.txt','r', encoding='utf-8') as f:
  lines = f.readlines()

  match = re.search(r'>[\w\.-]+</a></td>', lines)
  match = match.rstrip("</a></td>")
  match = match.lstrip(">")
  newpath = r'C:\Desktop\scriptFolders\%s' %match
  if not os.path.exists(newpath): os.makedirs(newpath)

When I put this code into the shell, I get the following error:

Traceback (most recent call last):
File "<stdin>", line 4, in <module>
File "C:\Python34\lib\re.py", line 170, in search
  return _compile(pattern, flags).search(string)
TypeError: expected string or buffer

How far off track am I?

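There are a number of errors and possible improvements in your code. They aren't easy to explain in prose, so here is a working version of the code, with comments highlighting the changes and the reasoning behind them.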

import os
import re

# Precompile the regex so it only happens once. This saves a bit of time,
# especially if your file is large.
# I've also modified the regex to include a capture group [1] for the part
# between the > and the <, allowing us to grab the string there later. There
# are other ways to do it (e.g. with lookbehind and lookahead), but this is the
# simplest.
regex = re.compile(r'>([\w\.-]+)</a></td>')

with open('folders.txt', 'r', encoding='utf-8') as f:
    # Loop through the lines in f.
    # Alternatively, you can also do
    #     lines = f.readlines()
    #     for line in lines:
    #         ...
    # but it's less memory-efficient because it puts the whole file in memory.
    for line in f:
        match = regex.search(line)
        # re.search returns a match object [2], or None if the string doesn't
        # match the regex.
        if not match:  # Throw away non-matching lines.
            continue
        # Get the value of capture group #1.
        match = match.group(1)
        newpath = r'C:\Desktop\scriptFolders\%s' % match
        if not os.path.exists(newpath):
            os.makedirs(newpath)
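
To make the original failure concrete, here is a small sketch that reproduces the two main pitfalls and then runs the corrected loop. The sample HTML lines and the `folder_names` helper are made up for illustration (I'm only guessing at what folders.txt looks like), and it writes into a temporary directory rather than C:\Desktop\scriptFolders so it is safe to run anywhere. It also uses `os.makedirs(..., exist_ok=True)`, available since Python 3.2, as one possible alternative to the explicit `os.path.exists` check.

```python
import os
import re
import tempfile

# Hypothetical sample standing in for the contents of folders.txt.
SAMPLE_LINES = [
    '<tr><td><a href="alpha/">alpha</a></td></tr>\n',
    '<tr><td>no link here</td></tr>\n',
    '<tr><td><a href="beta-2.0/">beta-2.0</a></td></tr>\n',
]

regex = re.compile(r'>([\w.-]+)</a></td>')

# Pitfall 1: readlines() returns a list, but re.search expects a string,
# so passing the list raises the TypeError from the traceback.
try:
    re.search(regex.pattern, SAMPLE_LINES)
    raised_type_error = False
except TypeError:
    raised_type_error = True

# Pitfall 2: str.rstrip/lstrip take a *set of characters*, not a suffix or
# prefix. Every character of this string is in the set, so nothing is left.
# (In the original code, match was also a match object rather than a str,
# so it has no rstrip method at all.)
stripped = '>data</a></td>'.rstrip('</a></td>')


def folder_names(lines):
    """Return the text captured by group 1 for each matching line."""
    names = []
    for line in lines:
        m = regex.search(line)
        if m:
            names.append(m.group(1))
    return names


names = folder_names(SAMPLE_LINES)

# Create one folder per name inside a throwaway directory.
with tempfile.TemporaryDirectory() as base:
    for name in names:
        # exist_ok=True avoids an error if the folder already exists.
        os.makedirs(os.path.join(base, name), exist_ok=True)
    created = sorted(os.listdir(base))

print(raised_type_error, repr(stripped), names, created)
```

With real data you would pass the open file object to `folder_names` in place of `SAMPLE_LINES` and replace the temporary directory with your target path.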

References:

  1. Capture groups
  2. Match objects