COPY Cassandra table from CSV file
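I am setting up a demo landscape for Cassandra, Apache Spark and Flume on my Mac (Mac OS X Yosemite with Oracle jdk1.7.0_55). The landscape is meant as a proof of concept for a new analytics platform, so I need some test data in my Cassandra database. I am using Cassandra 2.0.8.
I created some demo data in Excel and exported it as a CSV file. The structure looks like this: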
ProcessUUID;ProcessID;ProcessNumber;ProcessName;ProcessStartTime;ProcessStartTimeUUID;ProcessEndTime;ProcessEndTimeUUID;ProcessStatus;Orderer;VorgangsNummer;VehicleID;FIN;Reference;ReferenceType
0F0D1498-D149-4FCC-87C9-F12783FDF769;AbmeldungKl‰rfall;1;Abmeldung Kl‰rfall;2011-02-03 04:05+0000;;2011-02-17 04:05+0000;;Finished;SIXT;4278;A-XA 1;WAU2345CX67890876;KLA-BR4278;internal
Then I created a keyspace and a column family in cqlsh using:
CREATE KEYSPACE dadcargate
WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : '1' };
use dadcargate;
CREATE COLUMNFAMILY Process (
ProcessUUID uuid, ProcessID varchar, ProcessNumber bigint, ProcessName varchar,
ProcessStartTime timestamp, ProcessStartTimeUUID timeuuid, ProcessEndTime timestamp,
ProcessEndTimeUUID timeuuid, ProcessStatus varchar, Orderer varchar,
VorgangsNummer varchar, VehicleID varchar, FIN varchar, Reference varchar,
ReferenceType varchar,
PRIMARY KEY (ProcessUUID))
WITH COMMENT='A process is like a bracket around multiple process steps';
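The column family name and all columns in it were created in lower case. I will have to look into that some day, but it is not that relevant at the moment.
Now I take my CSV file, which has about 1600 entries, and want to import it into my table named process, like this: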
cqlsh:dadcargate> COPY process (processuuid, processid, processnumber, processname,
processstarttime, processendtime, processstatus, orderer, vorgangsnummer, vehicleid,
fin, reference, referencetype)
FROM 'Process_BulkData.csv' WITH DELIMITER = ';' AND HEADER = TRUE;
It gives the following error:
Record #0 (line 1) has the wrong number of fields (15 instead of 13).
0 rows imported in 0.050 seconds.
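This is basically correct, as I do not have the timeUUID fields in my CSV export.
If I try the COPY command without explicit column names like this (given the fact that I actually am missing two fields):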
cqlsh:dadcargate> COPY process from 'Process_BulkData.csv'
WITH DELIMITER = ';' AND HEADER = TRUE;
I end up with another error:
Bad Request: Input length = 1
Aborting import at record #0 (line 1). Previously-inserted values still present.
0 rows imported in 0.009 seconds.
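Hm. Kind of strange, but okay. Maybe the COPY command does not like the fact that two fields are missing. I still think this is strange, as the missing fields are of course there (from a structural point of view), just empty.
I had another shot: I deleted the missing columns in Excel, exported the file again as CSV, and tried to import without a header line in my CSV but with explicit column names, like this: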
cqlsh:dadcargate> COPY process (processuuid, processid, processnumber, processname,
processstarttime, processendtime, processstatus, orderer, vorgangsnummer, vehicleid,
fin, reference, referencetype)
FROM 'Process_BulkData-2.csv' WITH DELIMITER = ';' AND HEADER = TRUE;
I get this error:
Bad Request: Input length = 1
Aborting import at record #0 (line 1). Previously-inserted values still present.
0 rows imported in 0.034 seconds.
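The text `Input length = 1` is the message Java's `MalformedInputException` produces when a decoder meets a byte sequence that is invalid for the expected charset. Nothing in this thread confirms that this is the cause here, but since the sample row contains a garbled umlaut, it may be worth ruling out that the Excel export is not valid UTF-8. A minimal sketch in Python; the helper name `find_non_utf8_lines` is hypothetical, and the file name in the comment is the one used above:

```python
def find_non_utf8_lines(raw: bytes):
    """Return (line_number, byte_offset) for every line that is not valid UTF-8."""
    problems = []
    for line_no, line in enumerate(raw.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first offending byte in the line
            problems.append((line_no, exc.start))
    return problems


# Usage against the real export (assumption: run from the file's directory):
# with open("Process_BulkData.csv", "rb") as fh:
#     print(find_non_utf8_lines(fh.read()))

# Small demo: the same text encoded as Latin-1 and as UTF-8.
latin1_bytes = "id;Kl\u00e4rfall\n".encode("latin-1")
utf8_bytes = "id;Kl\u00e4rfall\n".encode("utf-8")
print(find_non_utf8_lines(latin1_bytes))
print(find_non_utf8_lines(utf8_bytes))
```

If the check reports offending lines, re-exporting or converting the file to UTF-8 before running COPY would be the thing to try.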
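Can anybody tell me what I am doing wrong here? According to the documentation of the COPY command, the way I set up my commands should work for at least two of them. Or so I would think.
But no, I am obviously missing something important here.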
cqlsh's COPY command can be touchy. However, the COPY documentation contains this line:
The number of columns in the CSV input is the same as the number of columns in the Cassandra table metadata.
Keeping that in mind, I did manage to get your data to import with COPY FROM by also naming the empty fields (processstarttimeuuid and processendtimeuuid, respectively):
aploetz@cqlsh:stackoverflow> COPY process (processuuid, processid, processnumber,
processname, processstarttime, processstarttimeuuid, processendtime,
processendtimeuuid, processstatus, orderer, vorgangsnummer, vehicleid, fin, reference,
referencetype) FROM 'Process_BulkData.csv' WITH DELIMITER = ';' AND HEADER = TRUE;
1 rows imported in 0.018 seconds.
aploetz@cqlsh:stackoverflow> SELECT * FROM process ;
processuuid | fin | orderer | processendtime | processendtimeuuid | processid | processname | processnumber | processstarttime | processstarttimeuuid | processstatus | reference | referencetype | vehicleid | vorgangsnummer
--------------------------------------+-------------------+---------+---------------------------+--------------------+-------------------+--------------------+---------------+---------------------------+----------------------+---------------+------------+---------------+-----------+----------------
0f0d1498-d149-4fcc-87c9-f12783fdf769 | WAU2345CX67890876 | SIXT | 2011-02-16 22:05:00+-0600 | null | AbmeldungKl‰rfall | Abmeldung Kl‰rfall | 1 | 2011-02-02 22:05:00+-0600 | null | Finished | KLA-BR4278 | internal | A-XA 1 | 4278
(1 rows)
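The rule quoted above (the CSV must carry as many fields per row as the column list given to COPY) can be checked before running the import. The following is a rough sketch in Python, not part of cqlsh; the helper name `find_bad_rows` is hypothetical, and the sample uses the header from the question plus a shortened data row:

```python
import csv
import io


def find_bad_rows(csv_text, expected_fields, delimiter=";"):
    """Return (line_number, field_count) for rows whose field count differs."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    return [
        (line_no, len(row))
        for line_no, row in enumerate(reader, start=1)
        if len(row) != expected_fields
    ]


# Header from the question; the data row is shortened but keeps the two
# empty timeuuid fields, so it still has 15 fields.
SAMPLE = (
    "ProcessUUID;ProcessID;ProcessNumber;ProcessName;ProcessStartTime;"
    "ProcessStartTimeUUID;ProcessEndTime;ProcessEndTimeUUID;ProcessStatus;"
    "Orderer;VorgangsNummer;VehicleID;FIN;Reference;ReferenceType\n"
    "0F0D1498-D149-4FCC-87C9-F12783FDF769;P1;1;Process 1;2011-02-03 04:05+0000;;"
    "2011-02-17 04:05+0000;;Finished;SIXT;4278;A-XA 1;WAU2345CX67890876;"
    "KLA-BR4278;internal\n"
)

# 13 names were passed to COPY in the failing attempt, 15 in the working one.
print(find_bad_rows(SAMPLE, expected_fields=13))
print(find_bad_rows(SAMPLE, expected_fields=15))
```

With 13 expected fields every row is reported, which mirrors the "15 instead of 13" error; with 15 the list is empty.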
Loading a CSV file into a Cassandra table
step1) Install cassandra-loader using this URL:
sudo wget https://github.com/brianmhess/cassandra-loader/releases/download/v0.0.23/cassandra-loader
step2) Make it executable:
sudo chmod +x cassandra-loader
a) The CSV file name is "pt_bms_tkt_success_record_details_new_2016_12_082017-01-0312-30-01.csv"
b) The keyspace name is "bms_test"
c) The table name is "pt_bms_tkt_success_record_details_new"
d) The columns are "trx_id......trx_day"
step3) The CSV file and cassandra-loader are both located in "cassandra3.7/bin/"
step4) Run the loader from that directory:
[stp@ril-srv-sp3 bin]$ ./cassandra-loader -f pt_bms_tkt_success_record_details_new_2016_12_082017-01-0312-30-01.csv -host 192.168.1.29 -schema "bms_test.pt_bms_tkt_success_record_details_new(trx_id,max_seq,trx_type,trx_record_type,trx_date,trx_show_date,cinema_str_id,session_id,ttype_code,item_id,item_var_sequence,trx_booking_id,venue_name,screen_by_tnum,price_group_code,area_cat_str_code,area_by_tnum,venue_capacity,amount_currentprice,venue_class,trx_booking_status_committed,booking_status,amount_paymentstatus,event_application,venue_cinema_companyname,venue_cinema_name,venue_cinema_type,venue_cinema_application,region_str_code,venue_city_name,sub_region_str_code,sub_region_str_name,event_code,event_type,event_name,event_language,event_genre,event_censor_rating,event_release_date,event_producer_code,event_item_name,event_itemvariable_name,event_quantity,amount_amount,amount_bookingfee,amount_deliveryfee,amount_additionalcharges,amount_final,amount_tax,offer_isapplied,offer_type,offer_name,offer_amount,payment_lastmode,payment_lastamount,payment_reference1,payment_reference2,payment_bank,customer_loginid,customer_loginstring,offer_referral,customer_mailid,customer_mobile,trans_str_sales_status_at_venue,trans_mny_trans_value_at_venue,payment_ismypayment,click_recordsource,campaign,source,keyword,medium,venue_multiplex,venue_state,mobile_type,transaction_range,life_cyclestate_from,transactions_after_offer,is_premium_transaction,city_type,holiday_season,week_type,event_popularity,transactionrange_after_discount,showminusbooking,input_source_name,channel,time_stamp,life_cyclestate_to,record_status,week_name,number_of_active_customers,event_genre1,event_genre2,event_genre3,event_genre4,event_language1,event_language2,event_language3,event_language4,event_release_date_range,showminusbooking_range,reserve1,reserve2,reserve3,reserve4,reserve5,payment_mode,payment_type,date_of_first_transaction,transaction_time_in_hours,showtime_in_hours,trx_day)";
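Typing a column list of that length by hand is error-prone, so the command line could instead be assembled from a list of column names. This is only a sketch: `build_loader_command` is a hypothetical helper, the file name `data.csv` and the three-column list are placeholders, and only the `-f`, `-host` and `-schema` options shown above are used.

```python
import shlex


def build_loader_command(csv_file, host, keyspace, table, columns):
    """Assemble a cassandra-loader command line from its parts."""
    # -schema expects keyspace.table(col1,col2,...) as a single argument
    schema = f"{keyspace}.{table}({','.join(columns)})"
    args = ["./cassandra-loader", "-f", csv_file, "-host", host, "-schema", schema]
    # shlex.quote protects the parentheses in the schema from the shell
    return " ".join(shlex.quote(arg) for arg in args)


command = build_loader_command(
    csv_file="data.csv",  # placeholder file name
    host="192.168.1.29",
    keyspace="bms_test",
    table="pt_bms_tkt_success_record_details_new",
    columns=["trx_id", "max_seq", "trx_type"],  # shortened for illustration
)
print(command)
```

The column list could just as well be read from the first line of the CSV file, provided the file has a header row whose names match the table's columns.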