Convert em-dash to hyphen in Python

Question

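I am loading CSV files into a pandas DataFrame in Python. In the original file, one of the columns contains the em-dash character. I want to replace it with a hyphen "-".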

Part of the original CSV file:

 NoDemande     NoUsager     Sens    IdVehicule     NoConduteur     HeureDebutTrajet    HeureArriveeSurSite    HeureEffective'
42192001801   42192002715    —        157Véh       42192000153    ...
42192000003   42192002021    +        157Véh       42192000002    ...
42192001833   42192000485    —      324My3FVéh     42192000157    ...

My code:

#coding=latin-1
import pandas as pd
import glob

pd.set_option('expand_frame_repr', False)

path = r'D:\Python27\mypfe\data_test'
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None,header=0,sep=';',parse_dates=['HeureDebutTrajet','HeureArriveeSurSite','HeureEffective'],
                      dayfirst=True)
    df['Sens'].replace(u"\u2014","-",inplace=True,regex=True)
    list_.append(df)

It does not work at all. Every time, the em-dashes are just turned into ?, like this:

42191001122  42191002244    ?            181Véh   42191000114  ...
42191001293  42191001203    ?         319M9pVéh   42191000125  ...
42191000700  42191000272    ?            183Véh   42191000072  ...

Because the file contains French characters, I am using latin-1 instead of utf-8. If I delete the first line and write the call like this:

df = pd.read_csv(file_,index_col=None,header=0,sep=';',encoding='windows-1252',parse_dates=['HeureDebutTrajet','HeureArriveeSurSite','HeureEffective'],
                          dayfirst=True)

the result is:

42191001122  42191002244  â??           181VÃ©h   42191000114   ...
42191001293  42191001203  â??        319M9pVÃ©h   42191000125   ...
42191000700  42191000272  â??           183VÃ©h   42191000072   ...

How can I get every em-dash — replaced with -?

I have added the repr output:

for line in open(file_):
    print repr(line)

The result is:

'"42191002384";"42191000118";"\xe2\x80\x94";"";"42191000182";...
'"42191002464";"42191001671";"+";"";"42191000182";...
'"42191000045";"42191000176";"\xe2\x80\x94";"620M9pV\xc3\xa9h";"42191000003";...
'"42191001305";"42191000823";"\xe2\x80\x94";"310V7pV\xc3\xa9h";"42191000126";...

Answer 1

u'\u2014' (EM DASH) cannot be encoded in latin1/iso-8859-1, so that value cannot appear in a properly encoded latin1 file.

Possibly the file is encoded as windows-1252, in which u'\u2014' can be encoded as '\x97'.
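The encoding claims above can be checked directly. This is a minimal sketch, assuming Python 3; the helper name try_encode is made up for illustration and is not part of any library.

```python
EM_DASH = u"\u2014"

def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

# latin-1 has no em dash, windows-1252 maps it to a single byte,
# and UTF-8 uses a three-byte sequence.
latin1_bytes = try_encode(EM_DASH, "latin-1")
cp1252_bytes = try_encode(EM_DASH, "windows-1252")
utf8_bytes = try_encode(EM_DASH, "utf-8")

print(repr(latin1_bytes), repr(cp1252_bytes), repr(utf8_bytes))
```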

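Another problem is that the CSV file apparently uses whitespace as the column separator, but your code uses semicolons. You can specify whitespace as the separator using delim_whitespace=True: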

df = pd.read_csv(file_, delim_whitespace=True)

You can also specify the file's encoding with the encoding parameter. read_csv() will then convert the incoming data to unicode:

df = pd.read_csv(file_, encoding='windows-1252', delim_whitespace=True)

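In Python 2 (which I think you are using), if you do not specify the encoding, the data stays in its original encoding as byte strings, which is probably why your replacement is not working.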
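To illustrate why an undecoded value never matches the replacement pattern, here is a small sketch. It assumes Python 3 (where bytes and str are separate types) and uses the byte sequence from the question's repr() output as the raw column value.

```python
EM_DASH = u"\u2014"

# Raw, undecoded bytes of the 'Sens' field as shown by repr() in the question.
raw_value = b"\xe2\x80\x94"

# Undecoded bytes are not equal to the unicode em dash, so a replacement
# that looks for u"\u2014" finds nothing to replace.
matches_before_decoding = (raw_value == EM_DASH)

# Once decoded with the right codec, the value is exactly the em dash.
matches_after_decoding = (raw_value.decode("utf-8") == EM_DASH)

print(matches_before_decoding, matches_after_decoding)
```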

Once the file is loaded correctly, you can replace the character as you did before:

df = pd.read_csv(file_, encoding='windows-1252', delim_whitespace=True)
df['Sens'].replace(u'\u2014', '-', inplace=True)

Edit

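After your update showing the repr() output, it appears that your file is UTF-8 encoded, not latin1 and not Windows-1252. Since you are using Python 2, you need to specify the encoding when loading the CSV file: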

df = pd.read_csv(file_, sep=';', encoding='utf8')
df['Sens'].replace(u'\u2014', '-', inplace=True)

Because you specify the encoding, read_csv() converts the incoming data to unicode, so replace() should now work as shown above. It should be that easy.
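The garbled output in the question (â?? and VÃ©h) is what UTF-8 bytes look like when decoded as windows-1252. The sketch below reproduces that with the standard library only, assuming Python 3. The row is taken from the question's repr() output and cut off after five fields for brevity, and the csv module stands in for read_csv() purely for illustration.

```python
import csv
import io

# One row from the repr() output, truncated after the fifth field.
raw = b'"42191000045";"42191000176";"\xe2\x80\x94";"620M9pV\xc3\xa9h";"42191000003"\n'

# Wrong codec: each UTF-8 byte becomes its own windows-1252 character.
wrong = raw.decode("windows-1252")

# Right codec: the three bytes \xe2\x80\x94 become one em dash.
right = raw.decode("utf-8")

# Parse the semicolon-separated row and swap the em dash for a hyphen.
row = next(csv.reader(io.StringIO(right), delimiter=";"))
cleaned = [field.replace(u"\u2014", u"-") for field in row]

print(repr(wrong))
print(cleaned)
```

With the wrong codec, two of the three characters produced from the em dash bytes are outside latin-1, which would explain why a console limited to that character set shows them as question marks.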
