Web scraping and decoding a string into a pandas DataFrame

Question

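I am web scraping and would like to end up with a pandas DataFrame holding the scraped content. I can get the content as a UTF-8 string and want to read it into a DataFrame, but I am not sure how to do that, and I would like to avoid writing it out to a CSV file and reading it back in. How can I do this?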

For example:

string='term_ID,description,frequency,plot_X,plot_Y,plot_size,uniqueness,dispensability,representative,eliminated\r\nGO:0006468,"protein phosphorylation",4.137%, 4.696, 0.927,5.725,0.430,0.000,6468,0\r\nGO:0050821,"protein stabilization, positive",0.045%,-4.700, 0.494,3.763,0.413,0.000,50821,0\r\n'

I split the string with

fcsv_content=[x.split(',') for x in string.split("\r\n")]

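but this does not work because some fields contain commas themselves. What can I do? Can I change the decoding to get around this? For background, I am using robobrowser to decode the web page.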

Answer 1

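You can use Python's csv module to parse your CSV data. It handles things like commas inside quoted strings and knows not to split on them. Below is a small example using your input string. As the output shows, the field "protein stabilization, positive" is not split into separate columns, because it is a quoted string.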

import csv

string = 'term_ID,description,frequency,plot_X,plot_Y,plot_size,uniqueness,dispensability,representative,eliminated\r\nGO:0006468,"protein phosphorylation",4.137%, 4.696, 0.927,5.725,0.430,0.000,6468,0\r\nGO:0050821,"protein stabilization, positive",0.045%,-4.700, 0.494,3.763,0.413,0.000,50821,0\r\n'
csv_reader = csv.reader(string.splitlines())
for record in csv_reader:
    print(f'number of fields: {len(record)}, Record: {record}')

Output

number of fields: 10, Record: ['term_ID', 'description', 'frequency', 'plot_X', 'plot_Y', 'plot_size', 'uniqueness', 'dispensability', 'representative', 'eliminated']
number of fields: 10, Record: ['GO:0006468', 'protein phosphorylation', '4.137%', ' 4.696', ' 0.927', '5.725', '0.430', '0.000', '6468', '0']
number of fields: 10, Record: ['GO:0050821', 'protein stabilization, positive', '0.045%', '-4.700', ' 0.494', '3.763', '0.413', '0.000', '50821', '0']
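
Since the question asks for a DataFrame rather than a list of records, here is a minimal sketch of two ways to get there without touching the disk. It assumes pandas is installed. The names `rows`, `df_from_rows`, and `df` are only illustrative, and passing `skipinitialspace=True` is my own addition to deal with the spaces that follow some commas in the sample data.

```python
import csv
import io

import pandas as pd

string = 'term_ID,description,frequency,plot_X,plot_Y,plot_size,uniqueness,dispensability,representative,eliminated\r\nGO:0006468,"protein phosphorylation",4.137%, 4.696, 0.927,5.725,0.430,0.000,6468,0\r\nGO:0050821,"protein stabilization, positive",0.045%,-4.700, 0.494,3.763,0.413,0.000,50821,0\r\n'

# Option 1: reuse the csv.reader records; the first record is the header.
# Every value stays a string with this approach.
rows = list(csv.reader(string.splitlines()))
df_from_rows = pd.DataFrame(rows[1:], columns=rows[0])

# Option 2: wrap the string in an in-memory text buffer and let pandas
# parse it. skipinitialspace=True drops the blanks right after a delimiter
# (an assumption about how the sample data should be cleaned).
df = pd.read_csv(io.StringIO(string), skipinitialspace=True)

print(df.shape)
print(df["description"].tolist())
```

With the second option pandas infers the column types itself, so columns such as plot_X should come out numeric, while frequency stays text because of the trailing % sign.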

