Remove line breaks, paragraph breaks in csv file using R

Question

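I have a csv file that contains some line breaks or paragraph breaks. I know this because when I open the csv file in Word, I see the pilcrow symbol ¶ at the end of each paragraph, before the next one begins. How can I remove these line breaks from the csv file in R? Any help is much appreciated.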

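PAST MEDICAL HISTORY

1. Persistent atrial fibrillation with atrial flutter, status-post atrial flutter ablation line in October of 2002.
2. Tachy/brady syndrome.
3. Insulin-dependent diabetes. Has been diabetic for approximately 35 years.
4. Hypertension, well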

Answer 1

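A well-formed csv file has a newline character at the end of every line, so that any parser can tell where a row ends (for example, if you write a csv file by hand in Python, you have to include the \n newline at the end of each row). Try opening the csv file directly in R and inspecting its contents with head(your_file); you should see it displayed the way you would expect.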
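As a minimal sketch of that check, the snippet below builds a small csv file, reads it with read.csv(), and inspects it with head(). The file contents, the column names id and history, and the assumption that the unwanted line breaks sit inside a quoted field are all made up for illustration; they are not taken from the question's actual file.

```r
# Hypothetical sample file: the first record has a line break inside a quoted field.
path <- tempfile(fileext = ".csv")
writeLines(c('id,history',
             '1,"Persistent atrial fibrillation.',
             'Tachy/brady syndrome."',
             '2,"Hypertension, well"'), path)

# read.csv() treats the newline at the end of each record as a row separator
# and keeps a newline inside a quoted field as part of that field's value.
dat <- read.csv(path, stringsAsFactors = FALSE)
head(dat)

# Replace any run of carriage returns / line feeds inside the field with one space.
dat$history <- gsub("[\r\n]+", " ", dat$history)
dat$history
```

If head() already shows one record per row, the ¶ marks seen in Word are most likely just the ordinary end-of-row newlines and nothing needs to be removed.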

Answer 2

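Here is a test case, assuming you only want to delete the blank lines. This is the file test.txt (it contains a typo). (Note: your example is obviously not a csv file.)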

some header text

more text
 even omre text

--------------------

txt <- readLines("test.txt")
# Keep only lines with at least one character, i.e. drop the empty lines.
newtext <- txt[nchar(txt) > 0]
newtext
#[1] "some header text" "more text"        " even omre text"
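The nchar(txt) > 0 filter only drops lines that are completely empty. A variant that also drops lines holding nothing but spaces or tabs, and then writes the result to a new file, might look like the sketch below. The temporary file paths and the whitespace-only line are assumptions added for illustration.

```r
# Hypothetical test file: line 2 is empty, line 4 contains only spaces.
path <- tempfile(fileext = ".txt")
writeLines(c("some header text", "", "more text", "   ", " even omre text"), path)

txt <- readLines(path)

# TRUE only for lines that contain at least one non-whitespace character.
has_content <- grepl("[^[:space:]]", txt)
kept <- txt[has_content]

# Write the cleaned lines to a new file instead of overwriting the input.
out_path <- tempfile(fileext = ".txt")
writeLines(kept, out_path)
kept
```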

To strip the line numbering (digits followed by a period at the start of a line), you can post-process with sub():

Processing and result:

txt <- "PAST MEDICAL HISTORY

1. Persistent atrial fibrillation with atrial flutter, status-post atrial flutter ablation line in October of 2002.
2. Tachy/brady syndrome.
3. Insulin-dependent diabetes.  Has been diabetic for approximately 35 years.  
4. Hypertension, well"

# Split the string into lines, then strip a leading run of digits and periods.
newtxt <- readLines(textConnection(txt))
sub("^[[:digit:].]+", "", newtxt)
#------------------------
[1] "PAST MEDICAL HISTORY"                                                                                             
[2] ""                                                                                                                 
[3] " Persistent atrial fibrillation with atrial flutter, status-post atrial flutter ablation line in October of 2002."
[4] " Tachy/brady syndrome."                                                                                           
[5] " Insulin-dependent diabetes.  Has been diabetic for approximately 35 years.  "                                    
[6] " Hypertension, well"

> sub("^[[:digit:].]+", "", newtxt[nchar(newtxt)>0])
[1] "PAST MEDICAL HISTORY"                                                                                             
[2] " Persistent atrial fibrillation with atrial flutter, status-post atrial flutter ablation line in October of 2002."
[3] " Tachy/brady syndrome."                                                                                           
[4] " Insulin-dependent diabetes.  Has been diabetic for approximately 35 years.  "                                    
[5] " Hypertension, well"
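If the goal is to end up with no line breaks at all, the two steps above can be combined and the remaining lines joined into a single string. This is only a sketch: the shortened sample text is made up, and trimws() and paste(collapse = " ") are additions that the answer itself does not use.

```r
# Shortened, hypothetical sample text with a blank line and numbered items.
txt <- "PAST MEDICAL HISTORY\n\n1. Persistent atrial fibrillation.\n2. Tachy/brady syndrome."

# Split into lines; close the connection explicitly to avoid a warning later.
con <- textConnection(txt)
lines <- readLines(con)
close(con)

lines <- lines[nchar(lines) > 0]                # drop empty lines
lines <- sub("^[[:digit:]]+\\.", "", lines)     # strip a leading "1.", "2.", ...
lines <- trimws(lines)                          # remove leading/trailing whitespace

# Join everything with single spaces so no line breaks remain.
one_line <- paste(lines, collapse = " ")
one_line
```

The pattern here is stricter than the one above: it requires one or more digits followed by exactly one period, so a line that merely starts with a period is left untouched.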
