Extract specific file extensions from multiple 7-zip files

Question

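I have a RAR file and a ZIP file. Each of them contains a folder. Inside the folder there are several 7-zip (.7z) files. Inside every 7z there are multiple files with the same extension but different names.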

RAR or ZIP file
  |___folder
        |_____Multiple 7z
                  |_____Multiple files with same extension and different name

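I want to extract only the files I need out of thousands of files. I need the files whose names include a certain substring. For example, a name that includes '[!]' or '(U)' or '(J)' is the criterion that determines which files to extract.

I can extract the folder without problems, so I have this structure: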

folder
   |_____Multiple 7z
                |_____Multiple files with same extension and different name

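I'm in a Windows environment, but I have Cygwin installed. How can I easily extract the files I need? Maybe with a single command line.

Update

There are some refinements to the question:

The inner 7z files and the files inside them can have spaces in their names.
There are 7z files that contain only one file, and that file does not meet the given condition. Since it is the only candidate, it must be extracted too.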

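Solution

Thanks to everyone. The bash solution was the one that helped me. I was unable to test the Python3 solutions because I had problems trying to install the libraries using pip. I don't use Python, so I would have to study and overcome the errors I ran into with those solutions. For now, I've found a suitable answer. Thanks to everyone.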

Answer 1

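This is the final version after some tries. The previous ones were not useful, so I'm removing them instead of appending. Read till the end, since not everything may be needed for the final solution.

To the topic. I would use Python. If this is a one-time task, it may be overkill, but in any other case you can log all steps for future investigation, use regular expressions, and orchestrate commands by providing their input and taking and processing their output, every time. All of those cases are quite easy in Python, if you have it.

Now I'll write what to do to get the environment configured. Not all of it is mandatory, but I tried installing and went through some steps, and maybe the description of the process can be beneficial in itself.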

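I have MinGW, the 32-bit version. It is not mandatory for extracting 7zip, however. When it is installed, go to C:\MinGW\bin and run mingw-get.exe:

In Basic Setup I have msys-base installed (right click, mark for installation, then from the Installation menu choose Apply Changes). That way I have bash, sed, grep, and many more.
In All Packages there is mingw32-libarchive with dll as its class. Since the Python libarchive package is just a wrapper, you need this dll to actually have a binary to wrap.

The examples are for Python 3. I've used the 32-bit version. You can fetch it from their home page. I installed it in the default directory, which is an odd location, so my advice is to install it in the root directory of your disk, like mingw.

Other things: conemu is much better than the default console.

Installing packages in Python: pip is used for that. From your console go to the Python home, where there is a Scripts subdirectory. For me it is c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\Scripts. You can search with, for instance, pip search archive, and install with pip install libarchive-c: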

> pip.exe install libarchive-c
Collecting libarchive-c
  Downloading libarchive_c-2.7-py2.py3-none-any.whl
Installing collected packages: libarchive-c
Successfully installed libarchive-c-2.7

After cd .. and starting python, you can use / import the new library:

>>> import libarchive
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\__init__.py", line 1, in <module>
    from .entry import ArchiveEntry
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\entry.py", line 6, in <module>
    from . import ffi
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\ffi.py", line 27, in <module>
    libarchive = ctypes.cdll.LoadLibrary(libarchive_path)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 426, in LoadLibrary
   return self._dlltype(name)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
TypeError: LoadLibrary() argument 1 must be str, not None

So it failed. I tried to fix that, but failed again:

>>> import libarchive
read format "cab" is not supported
read format "7zip" is not supported
read format "rar" is not supported
read format "lha" is not supported
read filter "uu" is not supported
read filter "lzop" is not supported
read filter "grzip" is not supported
read filter "bzip2" is not supported
read filter "rpm" is not supported
read filter "xz" is not supported
read filter "none" is not supported
read filter "compress" is not supported
read filter "all" is not supported
read filter "lzma" is not supported
read filter "lzip" is not supported
read filter "lrzip" is not supported
read filter "gzip" is not supported
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\__init__.py", line 1, in <module>
    from .entry import ArchiveEntry
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\entry.py", line 6, in <module>
    from . import ffi
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\ffi.py", line 167, in <module>
    c_int, check_int)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\ffi.py", line 92, in ffi
    f = getattr(libarchive, 'archive_'+name)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 361, in __getattr__
    func = self.__getitem__(name)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 366, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: function 'archive_read_open_filename_w' not found

I tried to provide the information directly with the set command, but that failed... So I moved to pylzma, since it does not need mingw. The pip install failed:

> pip.exe install pylzma
Collecting pylzma
  Downloading pylzma-0.4.9.tar.gz (115kB)
    100% |--------------------------------| 122kB 1.3MB/s
Installing collected packages: pylzma
  Running setup.py install for pylzma ... error
    Complete output from command c:\users\texxas\appdata\local\programs\python\python36-32\python.exe -u -c "import setuptools, tokenize;__file__='C:\Users\texxas\AppData\Local\Temp\pip-build-99t_zgmz\pylzma\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\texxas\AppData\Local\Temp\pip-ffe3nbwk-record\install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    creating build
    creating build\lib.win32-3.6
    copying py7zlib.py -> build\lib.win32-3.6
    running build_ext
    adding support for multithreaded compression
    building 'pylzma' extension
    error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

Failed again. But this one is easy: I installed Visual Studio Build Tools 2015, and then it worked. I have sevenzip installed, so I created a sample archive. So finally I can start python and do:

from py7zlib import Archive7z
f = open(r"C:\Users\texxas\Desktop\try.7z", 'rb')
a = Archive7z(f)
a.filenames

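and get an empty list. Looking closer gives a better understanding: empty files are not considered by pylzma, just to make you aware of that. So after putting one character into my sample files, the last line gives: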

>>> a.filenames
['try/a/test.txt', 'try/a/test1.txt', 'try/a/test2.txt', 'try/a/test3.txt', 'try/a/test4.txt', 'try/a/test5.txt', 'try/a/test6.txt', 'try/a/test7.txt', 'try/b/test.txt', 'try/b/test1.txt', 'try/b/test2.txt', 'try/b/test3.txt', 'try/b/test4.txt', 'try/b/test5.txt', 'try/b/test6.txt', 'try/b/test7.txt', 'try/c/test.txt', 'try/c/test1.txt', 'try/c/test11.txt', 'try/c/test2.txt', 'try/c/test3.txt', 'try/c/test4.txt', 'try/c/test5.txt', 'try/c/test6.txt', 'try/c/test7.txt']

So... the rest is a piece of cake. Actually, this is part of the original post:

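import os
import py7zlib

for folder, subfolders, files in os.walk('.'):
    for name in files:
        if not name.endswith('.7z'):
            continue
        # sooo 7z archive - extract needed.
        archive_path = os.path.join(folder, name)
        try:
            with open(archive_path, 'rb') as f:
                arch = py7zlib.Archive7z(f)
                for member in arch.getmembers():
                    # Keep only the members with the wanted extension.
                    if member.filename.endswith('.py'):
                        target = os.path.join('dest', member.filename)
                        os.makedirs(os.path.dirname(target), exist_ok=True)
                        with open(target, 'wb') as out:
                            out.write(member.read())
        except py7zlib.FormatError as e:
            print('file ' + archive_path)
            print(str(e))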
As a side note: Anaconda is a great tool, but the full install takes 500+ MB, so that is way too much.

Also let me share the wmctrl.py tool from my github:

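from subprocess import getoutput  # assumed import; needed on Python 3

# Excerpt: `active` and `stored` are assumed to be defined elsewhere in wmctrl.py.
cmd = 'wmctrl -ir ' + str(active.window) + \
      ' -e 0,' + str(stored.left) + ',' + str(stored.top) + ',' + str(stored.width) + ',' + str(stored.height)
print(cmd)
res = getoutput(cmd)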

This way you can orchestrate different commands, here wmctrl. The result can be handled in a way that allows data processing.
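As a rough illustration of that orchestration idea, the sketch below runs a command, captures its output, and post-processes it in Python. The helper name run_lines is hypothetical, and the child command is just the Python interpreter printing two lines so that the example runs anywhere; in practice it would be wmctrl or 7z.

```python
import subprocess
import sys


def run_lines(cmd):
    """Run cmd (a list of arguments) and return its stdout as non-empty lines."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # decode the output as text
        check=True,               # raise if the command fails
    )
    return [line for line in result.stdout.splitlines() if line.strip()]


# Stand-in command (an assumption for the demo): the current interpreter.
demo_cmd = [sys.executable, "-c", "print('first'); print('second')"]
lines = run_lines(demo_cmd)

# Any Python processing can follow, for example upper-casing each line.
print([line.upper() for line in lines])
```

Passing the command as a list rather than a single string avoids quoting problems when file names contain spaces.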

Answer 2

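You stated in the question's bounty footer that it is OK to use linux. Also, I don't use windows. Sorry about that. I am using Python3, and you have to be in a linux environment (I will try to test on windows as soon as possible).

Archive structure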

datadir.rar
          |
          datadir/
                 |
                 zip1.7z
                 zip2.7z
                 zip3.7z
                 zip4.7z
                 zip5.7z

Extracted structure

extracted/
├── zip1
│   ├── (E) [!].txt
│   ├── (J) [!].txt
│   └── (U) [!].txt
├── zip2
│   ├── (E) [!].txt
│   ├── (J) [!].txt
│   └── (U) [!].txt
├── zip3
│   ├── (J) [!].txt
│   └── (U) [!].txt
└── zip5
    ├── (J).txt
    └── (U).txt

Here is how I did it.

import libarchive.public
import os, os.path
from os.path import basename
import errno
import rarfile

#========== FILE UTILS =================

#Make directories
def mkdir_p(path):
    try:
        os.makedirs(path)
    except OSError as exc: # Python >2.5
        if exc.errno == errno.EEXIST and os.path.isdir(path):
            pass
        else: raise

#Open "path" for writing, creating any parent directories as needed.
def safe_open_w(path):
    mkdir_p(os.path.dirname(path))
    return open(path, 'wb')

#========== RAR TOOLS ==================

# List
def rar_list(rar_archive):
    with rarfile.RarFile(rar_archive) as rf:
        return rf.namelist()

# extract
def rar_extract(rar_archive, filename, path):
    with rarfile.RarFile(rar_archive) as rf:
        rf.extract(filename,path)

# extract-all
def rar_extract_all(rar_archive, path):
    with rarfile.RarFile(rar_archive) as rf:
        rf.extractall(path)

#========= 7ZIP TOOLS ==================

# List
def zip7_list(zip7file):
    filelist = []
    with open(zip7file, 'rb') as f:
        for entry in libarchive.public.memory_pour(f.read()):
            filelist.append(entry.pathname.decode("utf-8"))
    return filelist

# extract
def zip7_extract(zip7file, filename, path):
    with open(zip7file, 'rb') as f:
        for entry in libarchive.public.memory_pour(f.read()):
            if entry.pathname.decode("utf-8") == filename:
                with safe_open_w(os.path.join(path, filename)) as q:
                    for block in entry.get_blocks():
                        q.write(block)
                break

# extract-all
def zip7_extract_all(zip7file, path):
    with open(zip7file, 'rb') as f:
        for entry in libarchive.public.memory_pour(f.read()):
            if os.path.isdir(entry.pathname.decode("utf-8")):
                continue
            with safe_open_w(os.path.join(path, entry.pathname.decode("utf-8"))) as q:
                for block in entry.get_blocks():
                    q.write(block)

#============ FILE FILTER =================

def exclamation_filter(filename):
    return ("[!]" in filename)

def optional_code_filter(filename):
    return not ("[" in filename)

def has_exclamation_files(filelist):
    for singlefile in filelist:
        if(exclamation_filter(singlefile)):
            return True
    return False

#============ MAIN PROGRAM ================

print("-------------------------")
print("Program Started")
print("-------------------------")

BIG_RAR = 'datadir.rar'
TEMP_DIR = 'temp'
EXTRACT_DIR = 'extracted'
newzip7filelist = []

#Extract big rar and get new file list
for zipfilepath in rar_list(BIG_RAR):
    rar_extract(BIG_RAR, zipfilepath, TEMP_DIR)
    newzip7filelist.append(os.path.join(TEMP_DIR, zipfilepath))

print("7z Files Extracted")
print("-------------------------")

for newzip7file in newzip7filelist:
    innerFiles = zip7_list(newzip7file)
    for singleFile in innerFiles:
        fileSelected = False
        if(has_exclamation_files(innerFiles)):
            if exclamation_filter(singleFile): fileSelected = True
        else:
            if optional_code_filter(singleFile): fileSelected = True
        if(fileSelected):
            print(singleFile)
            outputFile = os.path.join(EXTRACT_DIR, os.path.splitext(basename(newzip7file))[0])
            zip7_extract(newzip7file, singleFile, outputFile)

print("-------------------------")
print("Extraction Complete")
print("-------------------------")

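Above the main program, I have prepared all the functions that are needed. I didn't use all of them, but I kept them in case they are needed.

I used several Python libraries with python3, but you only need to install libarchive and rarfile using pip; the others are built-in libraries.

Here is a copy of my source tree

Console output

This is the console output when you run this python file: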

-------------------------
Program Started
-------------------------
7z Files Extracted
-------------------------
(J) [!].txt
(U) [!].txt
(E) [!].txt
(J) [!].txt
(U) [!].txt
(E) [!].txt
(J) [!].txt
(U) [!].txt
(J).txt
(U).txt
-------------------------
Extraction Complete
-------------------------

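Problem

The only problem I've faced so far is that some temporary files are generated in the program root. It doesn't affect the program in any way, but I'll try to fix that.

Edit

You have to run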

sudo apt-get install libarchive-dev

to install the actual libarchive program. The Python library is just a wrapper around it. Take a look at the official documentation.

Answer 3

What about using this command line:

7z e c:\myDir\*.7z -oc:\outDir "*(U)*.ext" "*(J)*.ext" "*[!]*.ext" -y

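Where:

myDir is your unpacked folder
outDir is your output directory
ext is your file extension

The -y option forces overwriting in case the same filename appears in different archives.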
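If the same call needs to be scripted, a minimal Python sketch could assemble the argument list first. The function names build_7z_extract_command and marker_patterns are hypothetical; the sketch uses only the e (extract) command and the -o and -y switches described above, and it merely prints the command rather than running it.

```python
def marker_patterns(ext, markers=("(U)", "(J)", "[!]")):
    """Build one 7z wildcard pattern per marker, e.g. '*(U)*.ext'."""
    return ["*{}*.{}".format(marker, ext) for marker in markers]


def build_7z_extract_command(archives, out_dir, patterns):
    """Return the argument list for '7z e' with an output dir and -y."""
    return ["7z", "e", archives, "-o" + out_dir] + list(patterns) + ["-y"]


cmd = build_7z_extract_command(
    r"c:\myDir\*.7z", r"c:\outDir", marker_patterns("ext")
)
print(cmd)
# To actually run it (assumes 7z is on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Keeping each pattern as its own list element means no shell quoting is needed for the parentheses and brackets.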

Answer 4

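This solution is based on bash, grep and awk; it works on Cygwin and on Ubuntu.

Since you have the requirement to search for (X) [!].ext files first, and only if there are no such files to look for (X).ext files, I don't think it is possible to write a single expression that handles this logic.

The solution should have some if/else conditional logic to test the list of files inside the archive and decide which files to extract.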
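To make that conditional logic concrete, here is a small sketch of the decision written as a pure Python function. It is an illustration only, not part of the bash solution; the name choose_files is hypothetical, and the three tiers follow the rules stated in the question and in this answer.

```python
import re


def choose_files(names):
    """Pick which archive members to extract, in priority order."""
    # 1. Prefer names carrying the literal "[!]" marker.
    chosen = [n for n in names if "[!]" in n]
    if chosen:
        return chosen
    # 2. Otherwise take names like "(J).txt": a single capital letter in
    #    parentheses immediately before the extension dot.
    chosen = [n for n in names if re.search(r"\([A-Z]\)\.", n)]
    if chosen:
        return chosen
    # 3. Otherwise, if the archive holds exactly one file, take it.
    return list(names) if len(names) == 1 else []


print(choose_files(["(E).txt", "(J) [!].txt", "(J).txt", "(U) [!].txt", "(U).txt"]))
print(choose_files(["(J) [b1].txt", "(J) [b2].txt", "(J) [o1].txt", "(J).txt"]))
print(choose_files(["test.txt"]))
```

The substring test in the first tier is deliberately not a regular expression, because "[!]" would otherwise be read as a character class.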

Here is the initial structure inside the zip/rar archive that I tested my script on (I made a script to prepare this structure):

folder
├── 7z_1.7z
│   ├── (E).txt
│   ├── (J) [!].txt
│   ├── (J).txt
│   ├── (U) [!].txt
│   └── (U).txt
├── 7z_2.7z
│   ├── (J) [b1].txt
│   ├── (J) [b2].txt
│   ├── (J) [o1].txt
│   └── (J).txt
├── 7z_3.7z
│   ├── (E) [!].txt
│   ├── (J).txt
│   └── (U).txt
└── 7z 4.7z
    └── test.txt

The output is this:

output
├── 7z_1.7z           # This is a folder, not an archive
│   ├── (J) [!].txt   # Here we extracted only files with [!]
│   └── (U) [!].txt
├── 7z_2.7z
│   └── (J).txt       # Here there are no [!] files, so we extracted (J)
├── 7z_3.7z
│   └── (E) [!].txt   # We had here both [!] and (J), extracted only file with [!]
└── 7z 4.7z
    └── test.txt      # We had only one file here, extracted it

Here is the script that does the extraction:

#!/bin/bash

# Remove the output (if it's left from previous runs).
rm -r output
mkdir -p output

# Unzip the zip archive.
unzip data.zip -d output
# For rar use
#  unrar x data.rar output
# OR
#  7z x -ooutput data.rar

for archive in output/folder/*.7z
do
  # See 
  # Get the list of file names, remove the extra output of "7z l"
  list=$(7z l "$archive" | awk '
      /----/ {p = ++p % 2; next}
      $NF == "Name" {pos = index($0,"Name")}
      p {print substr($0,pos)}
  ')
  # Get the list of files with [!] (-F matches the literal string,
  # not a bracket expression).
  extract_list=$(echo "$list" | grep -F "[!]")
  if [[ -z $extract_list ]]; then
    # If we don't have files with [!], then look for ([A-Z]) pattern
    # to get files with single letter in brackets.
    extract_list=$(echo "$list" | grep "([A-Z])\.")
  fi
  if [[ -z $extract_list ]]; then
    # If we only have one file - extract it.
    if [[ $(grep -c . <<< "$list") -eq 1 ]]; then
      extract_list=$list
    fi
  fi
  if [[ ! -z $extract_list ]]; then
    # If we have files to extract, then do the extraction.
    # Output path is output/7zip_archive_name/
    out_path=output/$(basename "$archive")
    mkdir -p "$out_path"
    echo "$extract_list" | xargs -I {} 7z x -o"$out_path" "$archive" {}
  fi
done

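The basic idea here is to go over the 7zip archives and get the list of files for each of them using the 7z l command (list files).

The output of the command is quite verbose, so we use awk to clean it up and get the list of file names.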
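For readers less familiar with awk, the same cleanup can be sketched in Python. The function name parse_7z_listing is hypothetical, and the sample listing below is hand-made to mimic the table layout only; real 7z l output contains additional header lines and columns.

```python
def parse_7z_listing(text):
    """Return the file names from the table printed by '7z l'."""
    names = []
    inside_table = False  # toggled by the dashed separator lines
    name_col = None       # column where the "Name" header starts
    for line in text.splitlines():
        if "----" in line:
            inside_table = not inside_table
            continue
        fields = line.split()
        if fields and fields[-1] == "Name":
            name_col = line.index("Name")
        elif inside_table and name_col is not None:
            # Slice from the Name column so spaces in names are preserved.
            names.append(line[name_col:])
    return names


# Hand-made sample that only imitates the layout (an assumption).
header = "   Date      Time    Attr         Size   Compressed  Name"
col = header.index("Name")
sample = "\n".join([
    header,
    "-" * 19 + " " + "-" * 5,
    "2017-01-01 10:00:00 ....A".ljust(col) + "(J) [!].txt",
    "2017-01-01 10:00:00 ....A".ljust(col) + "my file (U).txt",
    "-" * 19 + " " + "-" * 5,
    "2 files".rjust(col + 7),
])
print(parse_7z_listing(sample))
```

Slicing by column instead of splitting on whitespace is what keeps file names with spaces intact, which is the same reason the awk version uses substr.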

After that, we filter this list using grep to get either a list of [!] files or a list of (X) files. Then we just pass this list to 7zip to extract the files we need.


Tags: windows, compression, cygwin, extract, 7zip
