Python Pandas - Email column has values with the delimiter in it

Question

So I have this big .csv at work that looks like this:

Name| Adress| Email| Paid Value    
John| x street | John@dmail.com| 0|
Chris| c street | Chris@dmail.com| 100|
Rebecca| y street| RebeccaFML|@dmail.com|177|
Bozo | z street| BozoSMH|@yahow.com|976|

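As you can see, the .csv is pipe-delimited, and the last two people have a pipe inside their email address, which breaks the format. Only 2 customers have this problem, but they get more and more entries every month, and we have to find them in the csv by hand and fix the email manually. It is a very tedious and time-consuming process because the file is so big.

We use Python to process the data. I did some research and could not find anything that helps. Any ideas?

Edit: What I want is a way to fix these email addresses automatically in code (e.g. RebeccaFML|@dmail.com -> RebeccaFML@dmail.com). It does not need to be pandas; I am open to any kind of idea. The main problem is that I only know how to do a replace after reading the file into Python, but because these records contain pipes, they are not read correctly.

Thanks in advance.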

Answer 1

Create a generator that yields custom rows. This assumes the email column is in the third position (index 2), but you can adjust that.

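import pandas

def rows(path: str, sep: str = '|'):
    with open(path) as f:
        # The header is read only to learn the expected field count (it is not yielded).
        # Data rows end with a trailing pipe, hence the extra None placeholder.
        header = [*f.readline().split(sep), None]
        for row in f:
            row = row.rsplit('\n', 1)[0].split(sep)
            if len(row) > len(header):
                # One field too many: the email was split, so rejoin fields 2 and 3.
                yield [*row[:2], ''.join((row[2], row[3])), *row[4:]]
            else:
                yield row

pandas.DataFrame(rows('data.csv'))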
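
The generator above leaves the column names out and keeps the surrounding whitespace. A possible variation is sketched below; `fixed_rows` and `email_idx` are hypothetical names, and the sketch assumes that only the email column can contain a stray separator and that every data row ends with a trailing pipe. It uses only the standard library so it can be tried without the real file.

```python
import io

def fixed_rows(lines, sep="|", email_idx=2):
    """Yield the header, then cleaned rows with split emails rejoined."""
    lines = iter(lines)
    header = [h.strip() for h in next(lines).rstrip("\n").split(sep)]
    yield header
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if fields and fields[-1].strip() == "":
            fields = fields[:-1]  # drop the empty field after the trailing pipe
        extra = len(fields) - len(header)  # number of stray pipes in this row
        if extra > 0:
            # Assumption: all surplus fields belong to the email column.
            merged = "".join(fields[email_idx:email_idx + extra + 1])
            fields = [*fields[:email_idx], merged, *fields[email_idx + extra + 1:]]
        yield [f.strip() for f in fields]

sample = io.StringIO(
    "Name| Adress| Email| Paid Value\n"
    "John| x street | John@dmail.com| 0|\n"
    "Rebecca| y street| RebeccaFML|@dmail.com|177|\n"
)
header, *data = fixed_rows(sample)
print(header)
print(data)
```

With real data, the same generator could be fed an open file object, and the result passed on as pandas.DataFrame(data, columns=header).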

Answer 2

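You can pass a regular expression as the separator in read_csv. \|\s*(?!\@) splits on a pipe (optionally followed by whitespace) but skips any pipe that is followed by an at sign. The call below uses a slightly stricter form, (?<!\@)\s*\|\s*(?!\@), which also swallows whitespace before the pipe and skips a pipe directly preceded by an at sign. You can then remove the remaining pipes with replace: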

import pandas as pd
import io

data = '''Name| Adress| Email| Paid Value    
John| x street | John@dmail.com| 0|
Chris| c street | Chris@dmail.com| 100|
Rebecca| y street| RebeccaFML|@dmail.com|177|
Bozo | z street| BozoSMH|@yahow.com|976|'''

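df = pd.read_csv(
    io.StringIO(data),
    sep=r'(?<!\@)\s*\|\s*(?!\@)',  # pipe separator, unless directly adjacent to an at sign
    engine='python',               # regex separators need the Python engine
    index_col=False,
    usecols=range(4),              # ignore the empty field after the trailing pipe
).replace(r'\|', '', regex=True)   # strip the pipes left inside the emails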

Output:

	Name	Adress	Email	Paid Value
0	John	x street	John@dmail.com	0
1	Chris	c street	Chris@dmail.com	100
2	Rebecca	y street	RebeccaFML@dmail.com	177
3	Bozo	z street	BozoSMH@yahow.com	976
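
To see what the separator pattern does on its own, it can help to apply it to a single row with re.split. This is only an illustrative sketch using the standard library; the variable names are made up for the example.

```python
import re

# Same separator pattern as in the read_csv call above.
SEP = re.compile(r"(?<!\@)\s*\|\s*(?!\@)")

line = "Rebecca| y street| RebeccaFML|@dmail.com|177|"

# The pipe in front of "@dmail.com" is not treated as a separator,
# so the email stays in one field (with the stray pipe still inside).
fields = SEP.split(line)
print(fields)

# Second step, mirroring .replace(): drop any pipe left inside a field.
cleaned = [f.replace("|", "") for f in fields]
print(cleaned)
```

The trailing pipe of the row produces an empty last field, which is presumably why the read_csv call restricts itself to the first four columns.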

Answer 3

One approach is to use a regular expression to remove the troublesome pipes (|) from the text file:

import re

data = '''Name| Adress| Email| Paid Value    
John| x street | Jo|hn@dmail.com| 0|
Chris| c street | |Chris@dmail.com| 100|
Rebecca| y street| RebeccaFML|@dmail.com|177|
Bozo | z street| BozoSMH|@yahow.com|976|'''


pattern = re.compile(r"""(\|[^|]+?) # the previous pipe, i.e. a pipe followed by one or more not pipe characters
                          \| # the troublesome pipe
                          ([^|]*?@.+?\|) # the  rest of the email until the next pipe """, re.VERBOSE)


res = pattern.sub(r"\1\2", data)  # keep both captured groups, dropping the pipe between them
print(res)

Output:

Name| Adress| Email| Paid Value    
John| x street | John@dmail.com| 0|
Chris| c street | Chris@dmail.com| 100|
Rebecca| y street| RebeccaFML@dmail.com|177|
Bozo | z street| BozoSMH@yahow.com|976|

Note that extra pipes were added to the values in data for testing purposes. See here for additional explanation and debugging.
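
One thing worth checking before using this on the real file: in the test data every email contains a stray pipe. Tracing the pattern by hand suggests that on a row whose email is already clean, it matches the legitimate separator in front of the email and removes it. A possible guard, sketched below, is to apply the substitution line by line and only while a row has more pipes than expected. `clean_line` and `EXPECTED_PIPES` are hypothetical names, and the expected count of 4 is an assumption based on the sample rows.

```python
import re

# Same pattern as in the answer above, written on one line.
pattern = re.compile(r"(\|[^|]+?)\|([^|]*?@.+?\|)")

# Assumption: a well-formed data row has 3 separators plus 1 trailing pipe.
EXPECTED_PIPES = 4

def clean_line(line: str) -> str:
    """Remove stray pipes from one row, leaving well-formed rows untouched."""
    while line.count("|") > EXPECTED_PIPES:
        fixed = pattern.sub(r"\1\2", line, count=1)
        if fixed == line:  # nothing matched; avoid looping forever
            break
        line = fixed
    return line

raw = (
    "Name| Adress| Email| Paid Value\n"
    "John| x street | John@dmail.com| 0|\n"
    "Rebecca| y street| RebeccaFML|@dmail.com|177|"
)
cleaned = "\n".join(clean_line(row) for row in raw.splitlines())
print(cleaned)
```

Because the header has only three pipes, it passes through unchanged, and so does any row that already has the expected number of fields.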
