如何使用 python 在 Dataflow 中将字典写入 Bigquery
How to write dictionaries to Bigquery in Dataflow using python
我正在尝试从 GCP 存储中读取 csv,将其转换为字典,然后写入 Bigquery table,如下所示:
p | ReadFromText("gs://bucket/file.csv")
| (beam.ParDo(BuildAdsRecordFn()))
| WriteToBigQuery('ads_table',dataset='dds',project='doubleclick-2',schema=ads_schema)
其中:'doubleclick-2'和'dds'是现有的项目和数据集,ads_schema定义如下:
ads_schema='Advertiser_ID:INTEGER,Campaign_ID:INTEGER,Ad_ID:INTEGER,Ad_Name:STRING,Click_through_URL:STRING,Ad_Type:STRING'
BuildAdsRecordFn() 定义如下:
class AdsRecord:
dict = {}
def __init__(self, line):
record = line.split(",")
self.dict['Advertiser_ID'] = record[0]
self.dict['Campaign_ID'] = record[1]
self.dict['Ad_ID'] = record[2]
self.dict['Ad_Name'] = record[3]
self.dict['Click_through_URL'] = record[4]
self.dict['Ad_Type'] = record[5]
class BuildAdsRecordFn(beam.DoFn):
def __init__(self):
super(BuildAdsRecordFn, self).__init__()
def process(self, element):
text_line = element.strip()
ads_record = AdsRecord(text_line).dict
return ads_record
但是,当我运行管道时,我得到了以下错误:
"dataflow_job_18146703755411620105-B" failed., (6c011965a92e74fa): BigQuery job "dataflow_job_18146703755411620105-B" in project "doubleclick-2" finished with error(s): errorResult: JSON table encountered too many errors, giving up. Rows: 1; errors: 1., error: JSON table encountered too many errors, giving up. Rows: 1; errors: 1., error: JSON parsing error in row starting at position 0: Value encountered without start of object
这里是我使用的样本测试数据:
100001,1000011,10000111,ut,https://bloomberg.com/aliquam/lacus/morbi.xml,Brand-neutral
100001,1000011,10000112,eu,http://weebly.com/sed/vel/enim/sit.jsp,Dynamic Click
我是 Dataflow 和 python 的新手,所以无法弄清楚上面的代码可能有什么问题。非常感谢任何帮助!
我刚刚实现了您的代码,但效果不佳,但我收到了一条不同的消息错误(类似于 "you can't return a dict
as the result of a ParDo
")。
这段代码对我来说正常工作,请注意,不仅我没有使用 class 属性 dict
,而且现在返回了一个列表:
ads_schema='Advertiser_ID:INTEGER,Campaign_ID:INTEGER,Ad_ID:INTEGER,Ad_Name:STRING,Click_through_URL:STRING,Ad_Type:STRING'
class BuildAdsRecordFn(beam.DoFn):
def __init__(self):
super(BuildAdsRecordFn, self).__init__()
def process(self, element):
text_line = element.strip()
ads_record = self.process_row(element)
return ads_record
def process_row(self, row):
dict_ = {}
record = row.split(",")
dict_['Advertiser_ID'] = int(record[0]) if record[0] else None
dict_['Campaign_ID'] = int(record[1]) if record[1] else None
dict_['Ad_ID'] = int(record[2]) if record[2] else None
dict_['Ad_Name'] = record[3]
dict_['Click_through_URL'] = record[4]
dict_['Ad_Type'] = record[5]
return [dict_]
with beam.Pipeline() as p:
(p | ReadFromText("gs://bucket/file.csv")
| beam.Filter(lambda x: x[0] != 'A')
| (beam.ParDo(BuildAdsRecordFn()))
| WriteToBigQuery('ads_table', dataset='dds',
project='doubleclick-2', schema=ads_schema))
#| WriteToText('test.csv'))
这是我模拟的数据:
Advertiser_ID,Campaign_ID,Ad_ID,Ad_Name,Click_through_URL,Ad_Type
1,1,1,name of ad,www.url.com,sales
1,1,2,name of ad2,www.url2.com,sales with sales
我还过滤掉了我在我的文件中创建的 header 行(在 Filter
操作中),如果你没有 header 那么这不是必需的
我正在尝试从 GCP 存储中读取 csv,将其转换为字典,然后写入 Bigquery table,如下所示:
p | ReadFromText("gs://bucket/file.csv")
| (beam.ParDo(BuildAdsRecordFn()))
| WriteToBigQuery('ads_table',dataset='dds',project='doubleclick-2',schema=ads_schema)
其中:'doubleclick-2'和'dds'是现有的项目和数据集,ads_schema定义如下:
ads_schema='Advertiser_ID:INTEGER,Campaign_ID:INTEGER,Ad_ID:INTEGER,Ad_Name:STRING,Click_through_URL:STRING,Ad_Type:STRING'
BuildAdsRecordFn() 定义如下:
class AdsRecord:
dict = {}
def __init__(self, line):
record = line.split(",")
self.dict['Advertiser_ID'] = record[0]
self.dict['Campaign_ID'] = record[1]
self.dict['Ad_ID'] = record[2]
self.dict['Ad_Name'] = record[3]
self.dict['Click_through_URL'] = record[4]
self.dict['Ad_Type'] = record[5]
class BuildAdsRecordFn(beam.DoFn):
def __init__(self):
super(BuildAdsRecordFn, self).__init__()
def process(self, element):
text_line = element.strip()
ads_record = AdsRecord(text_line).dict
return ads_record
但是,当我运行管道时,我得到了以下错误:
"dataflow_job_18146703755411620105-B" failed., (6c011965a92e74fa): BigQuery job "dataflow_job_18146703755411620105-B" in project "doubleclick-2" finished with error(s): errorResult: JSON table encountered too many errors, giving up. Rows: 1; errors: 1., error: JSON table encountered too many errors, giving up. Rows: 1; errors: 1., error: JSON parsing error in row starting at position 0: Value encountered without start of object
这里是我使用的样本测试数据:
100001,1000011,10000111,ut,https://bloomberg.com/aliquam/lacus/morbi.xml,Brand-neutral
100001,1000011,10000112,eu,http://weebly.com/sed/vel/enim/sit.jsp,Dynamic Click
我是 Dataflow 和 python 的新手,所以无法弄清楚上面的代码可能有什么问题。非常感谢任何帮助!
我刚刚实现了您的代码,但效果不佳,但我收到了一条不同的消息错误(类似于 "you can't return a dict
as the result of a ParDo
")。
这段代码对我来说正常工作,请注意,不仅我没有使用 class 属性 dict
,而且现在返回了一个列表:
ads_schema='Advertiser_ID:INTEGER,Campaign_ID:INTEGER,Ad_ID:INTEGER,Ad_Name:STRING,Click_through_URL:STRING,Ad_Type:STRING'
class BuildAdsRecordFn(beam.DoFn):
def __init__(self):
super(BuildAdsRecordFn, self).__init__()
def process(self, element):
text_line = element.strip()
ads_record = self.process_row(element)
return ads_record
def process_row(self, row):
dict_ = {}
record = row.split(",")
dict_['Advertiser_ID'] = int(record[0]) if record[0] else None
dict_['Campaign_ID'] = int(record[1]) if record[1] else None
dict_['Ad_ID'] = int(record[2]) if record[2] else None
dict_['Ad_Name'] = record[3]
dict_['Click_through_URL'] = record[4]
dict_['Ad_Type'] = record[5]
return [dict_]
with beam.Pipeline() as p:
(p | ReadFromText("gs://bucket/file.csv")
| beam.Filter(lambda x: x[0] != 'A')
| (beam.ParDo(BuildAdsRecordFn()))
| WriteToBigQuery('ads_table', dataset='dds',
project='doubleclick-2', schema=ads_schema))
#| WriteToText('test.csv'))
这是我模拟的数据:
Advertiser_ID,Campaign_ID,Ad_ID,Ad_Name,Click_through_URL,Ad_Type
1,1,1,name of ad,www.url.com,sales
1,1,2,name of ad2,www.url2.com,sales with sales
我还过滤掉了我在我的文件中创建的 header 行(在 Filter
操作中),如果你没有 header 那么这不是必需的