How to strip extra spaces when writing from dataframe to csv

Question

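I read multiple sheets (6) from one xlsx file and create a separate DataFrame for each. I want to write each one out to a pipe-delimited csv.

ind_dim.to_csv(r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')

The current output looks like this: 1|value1 |value2 |word1 word2 word3 etc.

I want to strip the trailing whitespace.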

Answer 1

The following R commands (using dplyr and stringr) trim both leading and trailing whitespace fairly easily:

 if (!require(dplyr)) {
   install.packages("dplyr")
 }
 library(dplyr)

 if (!require(stringr)) {
   install.packages("stringr")
 }
 library(stringr)

 setwd("~/wherever/you/need/to/get/data")

 outputWithSpaces <- read.csv("CSVSpace.csv", header = FALSE)
 print(head(outputWithSpaces), quote=TRUE)

 #str_trim(string, side = c("both", "left", "right"))

 outputWithoutSpaces <- outputWithSpaces %>% mutate_all(str_trim)
 print(head(outputWithoutSpaces), quote=TRUE)

Starting data:

                                  V1                           V2                          V3            V4
 1    "Something is interesting.   " "This is also Interesting. "                      "Not " "Intereting "
 2  "  Something with leading space"                  "  Leading" "  Spaces with many words."      " More."
 3 "  Leading and training Space.  "                   "  More  "  "  Leading and trailing. "  "  Spaces. "

Result:

                               V1                          V2                        V3           V4
 1    "Something is interesting." "This is also Interesting."                     "Not" "Intereting"
 2 "Something with leading space"                   "Leading" "Spaces with many words."      "More."
 3  "Leading and training Space."                      "More"   "Leading and trailing."    "Spaces."

Answer 2

Suggestion

Chain .apply(lambda x: x.str.rstrip()) onto your output statement (before the .to_csv() call) to strip trailing whitespace from every field of the DataFrame. It would look like this:

Change:

ind_dim.to_csv(r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')

To:

ind_dim.apply(lambda x: x.str.rstrip()).to_csv(r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')
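To see the effect without touching any files, here is a minimal sketch on made-up data. The column names and values are assumptions chosen to resemble the output in the question, and to_csv() is called without a path so that it returns the CSV text as a string.

```python
import pandas as pd

# Hypothetical sample data with trailing spaces, similar to the question's output
df = pd.DataFrame({
    "id": ["1", "2"],
    "col_a": ["value1 ", "value3  "],
    "col_b": ["value2 ", "word1 word2 word3 "],
})

# Strip trailing whitespace column by column, then render pipe-delimited text.
# With no path argument, to_csv() returns the CSV content as a string.
csv_text = df.apply(lambda x: x.str.rstrip()).to_csv(index=False, header=True, sep="|")

for line in csv_text.splitlines():
    print(line)
```

Every column here holds strings, which is why .str works on each of them; a numeric column would need to be converted first.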

It slots easily into the output statement through '.' chaining. To handle multiple data types, we can enforce the 'object' dtype on import by including the argument dtype='str':

ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')

Or on an existing DataFrame via:

df = pd.DataFrame(df, dtype='str')
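The reason the string conversion matters is that the .str accessor is only available on string-like columns. The sketch below uses a made-up two-column frame to show the failure on a numeric column, and swaps in df.astype(str) as one possible way to do the conversion; that call is my substitution, not the conversion used in this answer.

```python
import pandas as pd

# Hypothetical mixed-type frame: one numeric column, one text column
df = pd.DataFrame({"num": [11111, 33333], "text": ["valueB1 ", "valueB2 "]})

# Calling .str on the integer column raises AttributeError
try:
    df.apply(lambda x: x.str.rstrip())
    failed = False
except AttributeError:
    failed = True

# Casting every column to str first makes .str usable everywhere
as_text = df.astype(str).apply(lambda x: x.str.rstrip())

print(failed)
print(as_text["num"].tolist())
print(as_text["text"].tolist())
```

Note that the numbers come back as text after this step, which is fine for writing a csv but worth keeping in mind if the frame is used for calculations afterwards.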

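Proof

I made a mock-up in which the .xlsx document has 5 sheets, each with three columns: the first column is all numbers except for one empty cell in row 2; the second column has strings with both a leading and a trailing space, an empty cell in row 3, and a number in row 4; the third column has strings that all start with a space, and an empty cell in row 4. An integer index and integer column headers are included. The text in each sheet is: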

    0   1   2
0   11111    valueB1     valueC1
1        valueB2     valueC2
2   33333        valueC3
3   44444   44444   
4   55555    valueB5     valueC5

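This code reads our .xlsx file, testing_xlsx_nums.xlsx, into a dictionary of DataFrames, ind_dim.

Next, it loops over each sheet with a for loop, using the sheet name variable as the key that references the individual sheet's DataFrame. It applies .str.rstrip() to the whole sheet/DataFrame by passing the lambda function lambda x: x.str.rstrip() to the .apply() method called on the sheet/DataFrame.

Finally, it uses .to_csv() to write the sheet/DataFrame out as a .csv file, as shown in the OP's post.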

import pandas as pd

# read every sheet of the xlsx into a dict of DataFrames, all values as strings
ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')

# loop through the sheets, apply rstrip(), write each as a '|'-delimited csv
for sheet in ind_dim:
    ind_dim[sheet].apply(lambda x: x.str.rstrip()).to_csv(sheet + '_ind_dim_out.csv', sep='|')

Returns:

|0|1|2
0|11111| valueB1| valueC1
1|| valueB2| valueC2
2|33333|| valueC3
3|44444|44444|
4|55555| valueB5| valueC5

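(Note that the strings in our second column no longer have a trailing space.)

We can also reference each sheet by looping over the dictionary's items; the syntax looks like for k, v in dict.items(), where k and v are the key and the value: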

# read every sheet of the xlsx into a dict of DataFrames, all values as strings
ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')

# loop through (sheet name, DataFrame) pairs, apply rstrip(), write each as a '|'-delimited csv
for k, v in ind_dim.items():
    v.apply(lambda x: x.str.rstrip()).to_csv(k + '_ind_dim_out.csv', sep='|')
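The loop can be tried without an Excel file. In this sketch a hand-built dictionary stands in for the result of pd.read_excel(..., sheet_name=None); the sheet names, column names, values and the temporary output directory are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the {sheet name: DataFrame} dict returned with sheet_name=None
sheets = {
    "Sheet1": pd.DataFrame({"a": ["x ", "y  "], "b": [" p ", "q"]}),
    "Sheet2": pd.DataFrame({"a": ["z "], "b": ["r "]}),
}

# Write one pipe-delimited csv per sheet into a temporary directory
out_dir = Path(tempfile.mkdtemp())
written = []
for name, frame in sheets.items():
    out_path = out_dir / f"{name}_ind_dim_out.csv"
    frame.apply(lambda x: x.str.rstrip()).to_csv(out_path, sep="|", index=False)
    written.append(out_path)

# Show what ended up in the first file
print(written[0].read_text())
```

Only trailing whitespace is removed, so the leading space in " p " survives in the first file.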

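Notes:

We still need to pass the right values to the header= and names= parameters for selecting/ignoring the index and columns as needed. For simplicity I just passed =None in these examples.

The other methods, for stripping leading spaces and for stripping both leading and trailing spaces, are .str.lstrip() and .str.strip() respectively. They can also be applied to the whole DataFrame by passing a lambda such as lambda x: x.str.strip() to the .apply() method called on the DataFrame.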
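A small side-by-side comparison of the three methods, on made-up values with whitespace on different sides (the column name and data are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample values with whitespace on different sides
df = pd.DataFrame({"text": ["  both  ", "left  ", "  right"]})

left = df.apply(lambda x: x.str.lstrip())    # leading whitespace removed
right = df.apply(lambda x: x.str.rstrip())   # trailing whitespace removed
both = df.apply(lambda x: x.str.strip())     # both sides removed

print(left["text"].tolist())   # ['both  ', 'left  ', 'right']
print(right["text"].tolist())  # ['  both', 'left', '  right']
print(both["text"].tolist())   # ['both', 'left', 'right']
```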

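Only one column: if we only want to strip a single column, we can call the .str methods directly on the column itself. For example, to strip leading and trailing spaces from a column named column2 in a DataFrame df, we would write: df.column2.str.strip().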
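One detail worth spelling out: .str.strip() returns a new Series rather than changing the column in place, so the result has to be assigned back. A minimal sketch, with a made-up frame whose column names follow the example above:

```python
import pandas as pd

# Hypothetical frame; only column2 should be cleaned
df = pd.DataFrame({"column1": [" keep "], "column2": ["  strip me  "]})

# Assign the stripped Series back to the column to keep the change
df["column2"] = df["column2"].str.strip()

print(df["column1"].tolist())  # [' keep ']
print(df["column2"].tolist())  # ['strip me']
```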

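Data types that are not strings: when importing our data, pandas infers a data type for each column from the values it contains. We can override this by passing dtype='str' to the pd.read_excel() call on import.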

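From the pandas 1.0.1 documentation on pandas.read_excel (04/30/2020):

"dtype: Type name or dict of column -> type, default None

Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} Use object to preserve data as stored in Excel and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion."

We can pass the argument dtype='str' when importing with pd.read_excel() (as shown above). If we want to enforce a single data type on a DataFrame we are already working with, we can pass it to pd.DataFrame() with the argument dtype='str' and assign the result back to itself, like: df = pd.DataFrame(df, dtype='str')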

Hope this helps!
