pandas invalid escape sequence after update

Question

I am parsing a CSV with a multi-character delimiter in pandas, as follows:

big_df = pd.read_csv(os.path.expanduser('~/path/to/csv/with/special/delimiters.csv'),
                     encoding='utf8',
                     sep='\$\$><\$\$',
                     decimal=',',
                     engine='python')
big_df.iloc[:, -1] = big_df.iloc[:, -1].str.replace('\$\$>$', '')
big_df = big_df.replace(['^<', '>$'], ['', ''], regex=True)

big_df.columns = big_df.columns.to_series().replace(['^<', '>$', '>\$\$'], ['', '', ''], regex=True)

This worked fine until I recently upgraded my pandas installation. Now I see a lot of deprecation warnings:

<input>:3: DeprecationWarning: invalid escape sequence \$
<input>:3: DeprecationWarning: invalid escape sequence \$
<input>:3: DeprecationWarning: invalid escape sequence \$
<input>:3: DeprecationWarning: invalid escape sequence \$
<input>:3: DeprecationWarning: invalid escape sequence \$
<ipython-input-6-1ba5b58b9e9e>:3: DeprecationWarning: invalid escape sequence \$
  sep='\$\$><\$\$',
<ipython-input-6-1ba5b58b9e9e>:7: DeprecationWarning: invalid escape sequence \$
  big_df.iloc[:, -1] = big_df.iloc[:, -1].str.replace('\$\$>$', '')

Since I need the special delimiter containing $ symbols, I am not sure how to handle these warnings properly.

Answer 1

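The problem is that escaping in string literals can interfere with escaping in regular expressions. While '\s' is a valid regex token, to Python it denotes an escape sequence that does not exist. The string literal '\s' is silently converted to '\\s', i.e. r'\s', and this fallback is what has been deprecated since Python 3.6.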
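To see that the warning comes from compiling the string literal itself rather than from pandas, here is a minimal sketch using only the standard library. The helper name escape_warnings is made up for this illustration. Note that Python 3.12 and later report the same problem as a SyntaxWarning instead of a DeprecationWarning, so the sketch accepts either.

```python
import warnings

def escape_warnings(source):
    """Compile `source` and return the categories of any warnings raised.

    Hypothetical helper for illustration; not part of any library.
    """
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is filtered out
        compile(source, "<example>", "exec")
    return [w.category for w in caught]

# The backslashes are doubled here only so that the *compiled* source
# contains single backslashes, exactly as typed in the question.
plain = "sep = '\\$\\$><\\$\\$'"    # compiled text: sep = '\$\$><\$\$'
raw = "sep = r'\\$\\$><\\$\\$'"     # compiled text: sep = r'\$\$><\$\$'

plain_warnings = escape_warnings(plain)  # at least one warning about '\$'
raw_warnings = escape_warnings(raw)      # no warnings for the raw literal

print(plain_warnings)
print(raw_warnings)
```

The ordinary literal triggers the warning once per bad escape at compile time, before any pandas code runs, while the raw literal compiles silently.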

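The key is to always use raw string literals when constructing regular expressions, so that Python does not get confused by the backslashes. Most frameworks used to handle this ambiguity quietly (I assume by ignoring invalid escape sequences), but newer versions of some libraries are evidently trying to force programmers to be explicit and unambiguous (which I fully support).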

In your specific case, your patterns should be changed from '\$\$><\$\$' to r'\$\$><\$\$', and likewise for the other patterns:

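# regex=True is needed on pandas >= 2.0, where str.replace defaults to literal matching
big_df.iloc[:, -1] = big_df.iloc[:, -1].str.replace(r'\$\$>$', '', regex=True)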

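What actually happens is that the backslashes themselves must be escaped for Python in order to get the literal two-character string '\$' into your regex pattern. A raw literal does that for you: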

>>> r'\$\$><\$\$'
'\\$\\$><\\$\\$'
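As a rough check of the same idea outside pandas, the sketch below uses only the re module. The sample row is invented for illustration (it assumes rows start with <, separate fields with $$><$$, and end with $$>, which is what the cleanup code in the question suggests). re.escape is shown as an alternative that builds the escaped pattern from the plain delimiter text.

```python
import re

raw = r'\$\$><\$\$'           # raw literal: backslashes reach the regex engine untouched
escaped = '\\$\\$><\\$\\$'    # ordinary literal: every backslash doubled by hand
same = raw == escaped         # True: both spell the same 10-character pattern

# Hypothetical row in the format the question's cleanup code implies.
line = '<a$$><$$b$$><$$1,5$$>'

fields = re.split(raw, line)                     # ['<a', 'b', '1,5$$>']
fields[0] = re.sub(r'^<', '', fields[0])         # strip the leading '<'
fields[-1] = re.sub(r'\$\$>$', '', fields[-1])   # strip the trailing '$$>'

# Alternative: let re.escape build the pattern from the literal delimiter.
alt_fields = re.split(re.escape('$$><$$'), line)

print(same, len(raw))
print(fields)       # ['a', 'b', '1,5']
print(alt_fields)   # ['<a', 'b', '1,5$$>']
```

Using re.escape avoids writing any backslashes by hand, which may be less error-prone when the delimiter is fixed text rather than a real pattern.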
