Write all data from one CSV file to another, but include newly parsed geocoding data as additional fields
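I'm trying to write a Python script that takes any CSV file, runs it through a geocoder, and then writes the resulting geocoding attributes (plus all of the data from the original file) to a new CSV file.
My code so far is below. Everything works as expected except combining the geocoding attributes with the data from the original CSV file. Currently, all of the original field values for a given row show up as a single value in the output CSV (the geocoding attributes are written correctly). The problem is at the end of the script. For brevity, I've left out the code for the different classes.
I should also note that I'm using hasattr because, although I don't know what all the fields in the original in_file are, I do know that these fields will appear somewhere in the input CSV and that they are the ones required for geocoding.
Initially I tried changing "new_file.writerow([])" to "new_file.writerow()". At that point the row input (r) was written to the CSV file correctly, but the geocoding attributes could no longer be written because they were treated as extra arguments.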
import csv
import re
import time
from collections import namedtuple

import requests

# Address and GeocodedAddress are the classes omitted for brevity.

def locate(file=None):
    """Locate by geocoding func."""
    start_time = time.time()
    count = 0
    if file is not None:
        with open(file) as in_file:
            f_csv = csv.reader(in_file)
            # regex headers and lowercase to standardize for hasattr func.
            headers = [re.sub(r'["\s+]', '_', h).lower() for h in next(f_csv)]
            # Used namedtuple for headers
            Row = namedtuple('Row', headers)
            # for row in file
            for r in f_csv:
                count += 1
                # set row values to named tuple values
                row = Row(*r)
                # Try hasattr to find field names address, city, state, zipcode
                if hasattr(row, 'address'):
                    address = row.address
                elif hasattr(row, 'address1'):
                    address = row.address1
                if hasattr(row, 'city'):
                    city = row.city
                if hasattr(row, 'state'):
                    state = row.state
                elif hasattr(row, 'st'):
                    state = row.st
                if hasattr(row, 'zipcode'):
                    zipCode = row.zipcode
                elif hasattr(row, 'zip'):
                    # read the 'zip' field here; row.zipcode does not exist in this branch
                    zipCode = row.zip
                # Create new address object
                addressObject = Address(address, city, state, zipCode)
                # Get response from api
                data = requests.get(str(addressObject)).json()
                try:
                    data['geocodeStatusCode'] = int(data['geocodeStatusCode'])
                except (KeyError, TypeError, ValueError):
                    data['geocodeStatusCode'] = None
                if data['geocodeStatusCode'] == 'SomeNumber':
                    # geocoded address ideally uses parent class attributes
                    geocodedAddressObject = GeocodedAddress(
                        addressObject.address, addressObject.city,
                        addressObject.state, addressObject.zipCode,
                        data['addressGeo']['latitude'],
                        data['addressGeo']['longitude'],
                        data['addressGeo']['score'])
                else:
                    geocodedAddressObject = GeocodedAddress(
                        addressObject.address, addressObject.city,
                        addressObject.state, addressObject.zipCode)
                # Problem Area
                geocoded_file = file.replace('.csv', '_geocoded2') + '.csv'
                with open(geocoded_file, 'a', newline='') as geocoded:
                    # Problem area -- the r (row) value is written into a single cell even though
                    # its values are comma separated. The geocoding attributes are written correctly.
                    new_file = csv.writer(geocoded)
                    new_file.writerow([r, geocodedAddressObject.latitude,
                                       geocodedAddressObject.longitude,
                                       geocodedAddressObject.geocodeScore])
    print('The time to geocode {} records: {}'.format(count, (time.time() - start_time)))
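The single-cell behaviour described in the "Problem area" comment can be reproduced without the geocoder. The sketch below is only an illustration: the helper name write_rows and the sample values are made up for the demo. It writes the same row once as a nested list (as in the script) and once unpacked into individual fields. csv.writer converts any non-string field with str(), so a list passed as one element becomes one cell.

```python
import csv
import io


def write_rows(row, extras):
    """Return the CSV text produced by the nested and the flattened call.

    Hypothetical helper for illustration only; not part of the original script.
    """
    nested = io.StringIO()
    # `row` is a single element here, so the writer emits str(row) as one field.
    csv.writer(nested).writerow([row, *extras])

    flat = io.StringIO()
    # Unpacking `row` makes every original value its own field.
    csv.writer(flat).writerow([*row, *extras])
    return nested.getvalue(), flat.getvalue()


row = ["100001", "Playstation Theater", "New York", "NY", "10036"]
nested, flat = write_rows(row, (45.1234, -110.4567, 100))
print(nested)  # the whole original row ends up inside one quoted cell
print(flat)    # one cell per value
```

Reading both strings back with csv.reader should give 4 fields for the nested form and 8 for the flattened one, which matches what I'm seeing in the output file.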
CSV Input Data Example:
"UID", "Occupant", "Address", "City", "State", "ZipCode"
"100001", "Playstation Theater", "New York", "NY", "10036"
"100002", "Ed Sullivan Theater", "New York", "NY", "10019"
CSV Output Example (the additional fields are parsed during geocoding):
"UID", "Occupant", "Address", "City", "State", "ZipCode", "GeoCodingLatitude", "GeoCodingLongitude", "GeoCodingScore"
"100001", "Playstation Theater", "New York", "NY", "10036", "45.1234", "-110.4567", "100"
"100002", "Ed Sullivan Theater", "New York", "NY", "10019", "44.1234", "-111.4567", "100"
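I came up with a solution, although it may not be the most elegant. I used namedtuple._asdict() to convert the namedtuple to a dictionary, then looped over the row's values and added them to a new list. At that point I appended the geocoding variables and wrote the whole list as the row. Here is the code I changed. Let me know if you can think of a better solution.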
if data['geocodeStatusCode'] == 'SomeNumber':
    # geocoded address ideally should use parent class address values and not have to be restated
    geocodedAddressObject = GeocodedAddress(addressObject.address, addressObject.city, addressObject.state, addressObject.zipCode,
                                            data['addressGeo']['latitude'], data['addressGeo']['longitude'], data['addressGeo']['score'])
else:
    geocodedAddressObject = GeocodedAddress(addressObject.address, addressObject.city, addressObject.state, addressObject.zipCode)
# This is where I made the change - set new list
list_values = []
# Use _asdict for the named tuple
row_content = row._asdict()
# Loop through and strip white space
for key, value in row_content.items():
    # print(key, value.strip())
    list_values.append(value.strip())
# Extend list rather than append due to multiple values
list_values.extend((geocodedAddressObject.latitude, geocodedAddressObject.longitude, geocodedAddressObject.geocodeScore))
# Finally write the new list to the csv file - which includes both the row and the geocoded objects
# - and is agnostic as to what data it's passed as long as it's utf-8 compliant
new_file.writerow(list_values)
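A possibly simpler variant, offered as a sketch rather than a tested drop-in: since each r from csv.reader is already a list of strings, the trip through _asdict() may not be needed, and the stripped values can be concatenated with the geocoding values directly. The function name append_geocode_columns and the geocode callable are hypothetical stand-ins for the Address/GeocodedAddress/requests logic, which is not shown here. The sketch also opens the output once and writes a header row, which is a design choice of this example and not something the original code does.

```python
import csv
import io


def append_geocode_columns(in_file, out_file, geocode):
    """Copy every row from in_file to out_file with three extra columns.

    `geocode` is a hypothetical callable: it receives the row as a dict keyed
    by header name and returns a (latitude, longitude, score) tuple.
    """
    # skipinitialspace handles inputs written as `"a", "b"` (space after comma).
    reader = csv.reader(in_file, skipinitialspace=True)
    writer = csv.writer(out_file)

    headers = [h.strip() for h in next(reader)]
    writer.writerow(headers + ["GeoCodingLatitude", "GeoCodingLongitude", "GeoCodingScore"])

    for r in reader:
        values = [v.strip() for v in r]
        lat, lon, score = geocode(dict(zip(headers, values)))
        # List concatenation keeps every original value as its own field.
        writer.writerow(values + [lat, lon, score])


# Usage with in-memory files and a fake geocoder that returns fixed values.
source = io.StringIO('"UID", "City"\n"100001", "New York"\n')
target = io.StringIO()
append_geocode_columns(source, target, lambda row: (45.1234, -110.4567, 100))
print(target.getvalue())
```

With real files, the same function would be called with two handles opened with newline='' as the csv module documentation recommends.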