Unexpected behavior from 'table.extract()' in Google datalab jupyter BigQuery notebook
I am working through the datalab jupyter notebook tutorial ~/datalab/tutorials/BigQuery/'Importing and Exporting Data.ipynb'.
I can't make sense of the following behavior:
table.extract(destination = sample_bucket_object)
The resulting csv from this extract contains:
Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000
1999,Chevy,Venture Extended Edition,,4900
1999,Chevy,Venture Extended Edition,Very Large,5000
1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof, loaded",4799
This looks incomplete. It extracts only the 4 rows that were inserted into the table from cars.csv when the table was first populated:
sample_table.load('gs://cloud-datalab-samples/cars.csv', mode='append',
source_format = 'csv', csv_options=bq.CSVOptions(skip_leading_rows = 1))
It ignores the extra 2 rows from cars2.csv that were added by this command:
cars2 = storage.Item('cloud-datalab-samples', 'cars2.csv').read_from()
df2 = pd.read_csv(StringIO(cars2))
df2.fillna(value='', inplace=True)
sample_table.insert_data(df2)
Those rows did make it into the table:
%%sql
SELECT * FROM sample.cars
gives:
Year Make Model Description Price
1997 Ford E350 ac, abs, moon 3000
1999 Chevy Venture Extended Edition 4900
1999 Chevy Venture Extended Edition Very Large 5000
1996 Jeep Grand Cherokee MUST SELL! air, moon roof, loaded 4799
2015 Tesla Model S 64900
2010 Honda Civic 15000
As a test, I swapped cars.csv and cars2.csv in the notebook and re-ran all the commands. table.extract() then exported only the cars2.csv rows:
Year,Make,Model,Description,Price
2010,Honda,Civic,,15000
2015,Tesla,Model S,,64900
What am I missing here?
My guess is that sample_table.insert_data(df2) inserts the data into the table using the streaming API. Data inserted this way can take some time to become available to copy and export operations, even though it is available for queries immediately.
From BigQuery's documentation on streaming:
Data can take up to 90 minutes to become available for copy and export operations. To see whether data is available for copy and export, check the tables.get response for a section named streamingBuffer. If that section is absent, your data should be available for copy or export.
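The check the documentation describes can be sketched as follows. This is a minimal sketch using the modern google-cloud-bigquery client library (an assumption on my part; the original notebook used the older datalab wrapper, but both ultimately issue the same tables.get call). The project, dataset, and table names are placeholders.

```python
def is_export_safe(streaming_buffer):
    """Per the docs: if the tables.get response has no streamingBuffer
    section, the data should be available for copy or export."""
    return streaming_buffer is None

def report_export_readiness(project, dataset_id, table_id):
    # Requires the google-cloud-bigquery package and GCP credentials.
    from google.cloud import bigquery
    client = bigquery.Client(project=project)
    # get_table() issues a tables.get request under the hood.
    table = client.get_table(f"{project}.{dataset_id}.{table_id}")
    if is_export_safe(table.streaming_buffer):
        print("No streamingBuffer: rows should appear in an extract.")
    else:
        buf = table.streaming_buffer
        print(f"~{buf.estimated_rows} rows still in the streaming buffer; "
              "an extract may miss them for up to 90 minutes.")
```

If you need all rows in the export immediately, writing the DataFrame via a load job instead of streaming inserts avoids the streaming buffer entirely, since load jobs commit data directly to managed storage.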