Divide and Conquer Lists in Python (to read sav files using pyreadstat)

Question

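I am trying to read sav files in Python using pyreadstat, but in some rare cases I get a UnicodeDecodeError because a string variable contains special characters.

To handle this, instead of loading the whole set of variables, I thought I would load only the variables that do not have this error.

Below is the pseudocode I have with me. It is not very efficient, since it uses try and except to check every item of the list for the error.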

import pyreadstat

# Read only the metadata to get information about the variables
df, meta = pyreadstat.read_sav('Test.sav', metadataonly=True)
column_names = meta.column_names  # all variable names (avoids shadowing the built-in `list`)
result = []
for var in column_names:
    print(var)
    try:
        # Try to load this single variable
        pyreadstat.read_sav('Test.sav', usecols=[var])
    except UnicodeDecodeError:
        # Skip variables whose strings cannot be decoded
        continue
    # No error, so this variable is safe to load
    result.append(var)
# Finally, load the sav file using only the variables that raised no error
df, meta = pyreadstat.read_sav('Test.sav', usecols=result)

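For a sav file with 1000+ variables, this takes a very long time. I was wondering whether a divide-and-conquer approach could do it faster. Below is the approach I have in mind, but I am not very good at implementing recursive algorithms. Could someone help me with pseudocode? That would be very helpful.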

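1. Take the list and try to read the sav file.
2. If there is no error, the output can be stored in result, and then we read the sav file.
3. If there is an error, split the list into two parts and run these again ....
4. Step 3 needs to run again until we get a list that does not give any error.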

With this second approach, 90% of my sav files would load on the first pass itself, so I think recursion is a good fit.
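The steps above could be sketched roughly as follows. This is only a minimal sketch, not a tested solution for any particular file: `find_readable_columns` and `make_sav_checker` are hypothetical helper names, and the sketch assumes that a failing read raises UnicodeDecodeError, as described in the question. The recursion takes a `can_read` callable so that it can be tried without a sav file at hand.

```python
from typing import Callable, List, Sequence


def find_readable_columns(
    columns: Sequence[str],
    can_read: Callable[[List[str]], bool],
) -> List[str]:
    """Return the columns that can be read, keeping their original order.

    `can_read` receives a list of column names and returns True when
    reading exactly those columns raises no error.
    """
    cols = list(columns)
    if not cols:
        return []
    # Step 1 and 2: try the whole list; if it works, keep all of it.
    if can_read(cols):
        return cols
    # A single column that fails is the problematic one: drop it.
    if len(cols) == 1:
        return []
    # Step 3 and 4: split in two and repeat on each half.
    mid = len(cols) // 2
    left = find_readable_columns(cols[:mid], can_read)
    right = find_readable_columns(cols[mid:], can_read)
    return left + right


def make_sav_checker(path: str) -> Callable[[List[str]], bool]:
    """Hypothetical helper: build a `can_read` function for one sav file."""
    import pyreadstat  # imported here so the recursion itself has no dependency

    def can_read(cols: List[str]) -> bool:
        try:
            pyreadstat.read_sav(path, usecols=cols)
            return True
        except UnicodeDecodeError:
            # Assumption: this is the error raised for undecodable strings.
            return False

    return can_read


# Possible usage, assuming a file named Test.sav exists:
#   _, meta = pyreadstat.read_sav("Test.sav", metadataonly=True)
#   good = find_readable_columns(meta.column_names, make_sav_checker("Test.sav"))
#   df, meta = pyreadstat.read_sav("Test.sav", usecols=good)
```

If the whole list reads cleanly, only one read is needed. Each failing read still parses part of the file, so the gain over the one-by-one loop depends on how few columns are broken.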

You can try to reproduce the issue with the sav file here

Answer 1

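For this specific case I would suggest a different approach: you can pass the "encoding" argument to pyreadstat.read_sav to set the encoding manually. If you don't know which one it is, you can iterate over the list of encodings here: https://gist.github.com/hakre/4188459 to find out which one makes sense. For example: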

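import pyreadstat

# codes should hold all the encodings from the link mentioned before;
# the two names below are only a small sample for illustration
codes = ["ISO-8859-1", "WINDOWS-1252"]
for c in codes:
    try:
        df, meta = pyreadstat.read_sav("Test.sav", encoding=c)
    except Exception:
        # The file could not be read with this encoding; try the next one
        continue
    print(c)
    print(df.head())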

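I did that, and a few of them could potentially make sense, assuming the strings are in a non-Latin alphabet. However, the most promising one is not in the list: encoding="UTF8" (the list contains UTF-8, with a dash, and that one fails). Using UTF8 (no dash) I get this: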

నేను గతంలో వాడిన బ

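According to Google Translate, this means "I used to come b" in Telugu. I am not sure that fully makes sense, but it is a way forward.

The advantage of this approach is that if you find the right encoding, you do not lose any data, and reading the data is fast. The disadvantage is that you may not find the right encoding.

In case you cannot find the right encoding, you will still read the problematic columns very quickly, and you can discard them later in pandas by checking which character columns do not contain Latin characters. This will be much faster than the algorithm you suggested.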
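One way to do that last filtering step is sketched below. It is only an illustration under stated assumptions: `is_latin_text` and `drop_non_latin_columns` are hypothetical helper names, "Latin" is taken to mean that every letter has LATIN in its Unicode name, and the file is assumed to have been read with an encoding that turns the problematic strings into non-Latin letters. If the chosen encoding produces Latin-looking garbage instead, this check will not catch it.

```python
import unicodedata


def is_latin_text(value: str) -> bool:
    """Return True if every letter in `value` is a Latin letter.

    Digits, punctuation, spaces and combining marks are ignored.
    """
    for ch in value:
        if ch.isalpha() and "LATIN" not in unicodedata.name(ch, ""):
            return False
    return True


def drop_non_latin_columns(df):
    """Return a copy of the pandas DataFrame `df` without the columns
    that contain at least one string with non-Latin letters."""
    keep = []
    for col in df.columns:
        series = df[col]
        # Numeric, boolean and datetime columns cannot hold strings.
        if series.dtype.kind in "biufcmM":
            keep.append(col)
            continue
        has_non_latin = any(
            isinstance(v, str) and not is_latin_text(v) for v in series
        )
        if not has_non_latin:
            keep.append(col)
    return df[keep]
```

After reading the file, a call such as `df = drop_non_latin_columns(df)` would then keep only the numeric columns and the string columns made up of Latin text.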

python

arrays

list

spss

divide-and-conquer