Using regex separators with read_csv() in Python?

Question

I have a lot of csv files formatted as follows:

date1::tweet1::location1::language1

date2::tweet2::location2::language2

date3::tweet3::location3::language3

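and so on. Some files contain up to 200,000 tweets. I want to extract the 4 fields, put them into a pandas DataFrame, and count the number of tweets. This is the code I am using now: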

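import pandas as pd

try:
    # A separator longer than one character is treated as a regex,
    # which is why the python engine is requested here.
    data = pd.read_csv(tweets_data_path, sep="::", header=None, engine="python")
    data.columns = ["timestamp", "tweet", "location", "lang"]
    print("Number of tweets: " + str(len(data)))

except Exception as e:
    print("Error:", e)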

I get the following error:

Error: expected 4 fields in line 4581, saw 5

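I tried setting error_bad_lines = False, manually deleting the lines that trigger the error, and setting nrows to a lower number, but I still get the "expected fields" error on seemingly random lines. For example, if I delete the bottom half of the file, I get the same error, only now for line 1787. That makes no sense to me, because that line was processed correctly before. Visually inspecting the csv file does not reveal any unusual pattern at the reported lines either.

The date field and the tweets contain colons, URLs and so on, so perhaps a regex would make sense?

Can someone help me figure out what I am doing wrong? Many thanks!

Data sample, as requested below: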

Fri Apr 22 21:41:03 +0000 2016::RT @TalOfer: Barack Obama: Brexit would put UK back of the queue for trade talks [short url] #EuRef #StrongerIn::United Kingdom::en

Fri Apr 22 21:41:07 +0000 2016::RT @JamieRoss7: It must be awful to strongly believe in Brexit and be watching your campaigns make an absolute horse's arse of it.::The United Kingdom::en

Fri Apr 22 21:41:07 +0000 2016::Whether or not it rains on June 23rd will  have more influence on the vote than Obama's lunch with the Queen and LiGA with George. #brexit.::Dublin, Ireland::en

Fri Apr 22 21:41:08 +0000 2016::FINANCIAL TIMES FRONT PAGE: 'Obama warns Brexit vote would send UK to 'back of trade queue' #skypapers [short url]::Mardan, Pakistan::en

Answer 1

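Have you tried read_table? I ran into this kind of error with read_csv before and solved it by switching to read_table. See this post, which may give you some ideas for resolving the error. You could also try sep=r":{2}" as the separator.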
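To make the separator suggestion concrete, here is a minimal sketch using two rows copied from the question's sample. It assumes the data is read from an in-memory string rather than from `tweets_data_path`, and the helper name `read_tweets` is made up for illustration. Since pandas treats any separator longer than one character as a regular expression, `"::"` and `r":{2}"` should behave the same way, so the regex spelling on its own may not change the result.

```python
import io

import pandas as pd

# Two rows copied from the sample data in the question.
SAMPLE = (
    "Fri Apr 22 21:41:03 +0000 2016::RT @TalOfer: Barack Obama: Brexit would put UK "
    "back of the queue for trade talks [short url] #EuRef #StrongerIn::United Kingdom::en\n"
    "Fri Apr 22 21:41:07 +0000 2016::Whether or not it rains on June 23rd will  have more "
    "influence on the vote than Obama's lunch with the Queen and LiGA with George. "
    "#brexit.::Dublin, Ireland::en\n"
)
COLUMNS = ["timestamp", "tweet", "location", "lang"]


def read_tweets(sep):
    """Hypothetical helper: parse SAMPLE with the given separator."""
    return pd.read_csv(
        io.StringIO(SAMPLE),
        sep=sep,
        header=None,
        names=COLUMNS,
        engine="python",  # regex separators are only handled by the python engine
    )


plain = read_tweets("::")      # two-character separator, interpreted as a regex
regex = read_tweets(r":{2}")   # explicit regex for exactly two colons

print(plain.equals(regex))
print(regex["location"].tolist())
```

Single colons inside the timestamp and the tweet text are left alone by both spellings; only a literal `::` inside a tweet would produce an extra field.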

Answer 2

Start with this:

pd.read_csv(tweets_data_path, sep="::", header=None, usecols=[0, 1, 2, 3], engine="python")

That should give you 4 columns; from there you can work out how many rows were dropped and whether the data makes sense.
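One possible way to count dropped rows is sketched below. It assumes pandas 1.3 or later, where `on_bad_lines="skip"` takes over the role of the `error_bad_lines=False` flag mentioned in the question, and it reads from an in-memory string. The second row is a made-up example of a tweet that happens to contain `::`; it is not taken from the real data.

```python
import io

import pandas as pd

RAW = (
    # Row copied from the question's sample.
    "Fri Apr 22 21:41:03 +0000 2016::RT @TalOfer: Barack Obama: Brexit would put UK "
    "back of the queue for trade talks [short url] #EuRef #StrongerIn::United Kingdom::en\n"
    # Hypothetical malformed row: the tweet text itself contains "::".
    "Fri Apr 22 21:41:09 +0000 2016::hypothetical tweet with :: inside::Nowhere::en\n"
    # Row copied from the question's sample.
    "Fri Apr 22 21:41:08 +0000 2016::FINANCIAL TIMES FRONT PAGE: 'Obama warns Brexit vote "
    "would send UK to 'back of trade queue' #skypapers [short url]::Mardan, Pakistan::en\n"
)

# Count the non-empty lines before parsing.
total_lines = sum(1 for line in RAW.splitlines() if line.strip())

data = pd.read_csv(
    io.StringIO(RAW),
    sep="::",
    header=None,
    names=["timestamp", "tweet", "location", "lang"],
    engine="python",
    on_bad_lines="skip",  # assumption: pandas >= 1.3
)

dropped = total_lines - len(data)
print("lines:", total_lines, "parsed:", len(data), "dropped:", dropped)
```

If the dropped count is small, inspecting those lines by hand is usually quicker than tuning the separator.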

Use this pattern:

data["lang"].unique()

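This helps because your data has a problem and you don't know where it is. You need to take a step back and use the Python csv reader. This should get you started.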

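import csv
import pandas as pd

tweetList = []
# csv.reader needs an open file (or another iterable of lines), not a path string.
with open(tweets_data_path, newline="") as f:
    # Note: csv.reader still splits on commas, so row[0] ends at the first comma in a line.
    for row in csv.reader(f):
        try:
            tweetList.append(row[0].split("::"))
        except Exception as e:
            print("Error:", e)

print(tweetList)

tweetsDf = pd.DataFrame(tweetList)

print(tweetsDf)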
                                0  \
0  Fri Apr 22 21:41:03 +0000 2016   
1  Fri Apr 22 21:41:07 +0000 2016   
2  Fri Apr 22 21:41:07 +0000 2016   
3  Fri Apr 22 21:41:08 +0000 2016   

                                                   1                   2     3  
0  RT @TalOfer: Barack Obama: Brexit would put UK...      United Kingdom    en  
1  RT @JamieRoss7: It must be awful to strongly b...  The United Kingdom    en  
2  Whether or not it rains on June 23rd will  hav...              Dublin  None  
3  FINANCIAL TIMES FRONT PAGE: 'Obama warns Brexi...              Mardan  None
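In the output above, "Dublin, Ireland" is cut off at the comma and the language column is None, which is consistent with csv.reader splitting each line on commas before the "::" split happens. A possible alternative, sketched here under the assumption that every record sits on a single line, is to split each raw line on "::" directly and set aside any line that does not yield exactly 4 fields. The names `parse_tweets` and `EXPECTED_FIELDS` are hypothetical, and the third sample row is a made-up malformed line.

```python
import io

import pandas as pd

EXPECTED_FIELDS = 4
COLUMNS = ["timestamp", "tweet", "location", "lang"]


def parse_tweets(lines):
    """Split each line on '::' and separate well-formed rows from malformed ones."""
    good, bad = [], []
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("::")
        if len(fields) == EXPECTED_FIELDS:
            good.append(fields)
        else:
            # Keep the line number and field count so the line can be inspected later.
            bad.append((line_no, len(fields), line))
    return pd.DataFrame(good, columns=COLUMNS), bad


sample = io.StringIO(
    # First two rows copied from the question's sample.
    "Fri Apr 22 21:41:07 +0000 2016::Whether or not it rains on June 23rd will  have more "
    "influence on the vote than Obama's lunch with the Queen and LiGA with George. "
    "#brexit.::Dublin, Ireland::en\n"
    "Fri Apr 22 21:41:08 +0000 2016::FINANCIAL TIMES FRONT PAGE: 'Obama warns Brexit vote "
    "would send UK to 'back of trade queue' #skypapers [short url]::Mardan, Pakistan::en\n"
    # Hypothetical malformed row with "::" inside the tweet text.
    "Fri Apr 22 21:41:09 +0000 2016::hypothetical tweet with :: inside::Nowhere::en\n"
)

tweets, malformed = parse_tweets(sample)
print("Number of tweets:", len(tweets))
for line_no, n_fields, _ in malformed:
    print("line", line_no, "has", n_fields, "fields")
```

With a real file, the same function could be called as `parse_tweets(open(tweets_data_path))`; the list of malformed lines then shows exactly which rows trigger the "expected 4 fields" error.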
