Spark - split csv file using scala

I have a CSV file with the following schema:

(Id, OwnerUserId, CreationDate, ClosedDate, Score, Title, Body)

I tried to split the data using the following approaches:

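// Attempt 1: split on every comma
val splitComma = file.map(x => x.split(","))
// Attempt 2 (alternative to attempt 1): a regex intended to skip commas inside quoted fields and HTML markup
val splitComma = file.map(x => x.split(",(?![^<>]*</>)(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))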

Neither of them works. Here is a sample of my CSV file:

90,58,2008-08-01T14:41:24Z,2012-12-26T03:45:49Z,144,Good branching and merging tutorials for TortoiseSVN?,"<p>Are there any really good tutorials explaining <a href=""http://svnbook.red-bean.com/en/1.8/svn.branchmerge.html"" rel=""nofollow"">branching and merging</a> with Apache Subversion? </p>

<p>All the better if it's specific to TortoiseSVN client.</p>
"
120,83,2008-08-01T15:50:08Z,NA,21,ASP.NET Site Maps,"<p>Has anyone got experience creating <strong>SQL-based ASP.NET</strong> site-map providers?</p>

<p>I've got the default XML file <code>web.sitemap</code> working properly with my Menu and <strong>SiteMapPath</strong> controls, but I'll need a way for the users of my site to create and modify pages dynamically.</p>

<p>I need to tie page viewing permissions into the standard <code>ASP.NET</code> membership system as well.</p>
"
180,2089740,2008-08-01T18:42:19Z,NA,53,Function for creating color wheels,"<p>This is something I've pseudo-solved many times and never quite found a solution. That's stuck with me. The problem is to come up with a way to generate <code>N</code> colors, that are as distinguishable as possible where <code>N</code> is a parameter.</p>
"

What is the best way to handle this?

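You cannot load a CSV with multi-line values (i.e. line breaks inside a cell) with Spark: the underlying Hadoop InputFormat splits the file on newlines and ignores the CSV's enclosing double quotes, so there is nothing Spark can do about it (see the discussion here).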
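
To see the effect without a cluster, here is a minimal sketch in plain Scala (no Spark involved) that mimics line-based splitting on a shortened, made-up version of the sample above. The names `LineSplitDemo`, `sample`, `physicalLines`, and `hasUnbalancedQuotes` are hypothetical and exist only for this illustration. It shows that a physical line containing an odd number of double quotes opens or closes a quoted field that continues on another line.

```scala
object LineSplitDemo {
  // Hypothetical, shortened stand-in for the sample data: two records,
  // the first with line breaks inside its quoted Body field.
  val sample: String =
    "90,58,Title A,\"<p>first line</p>\n\n<p>second line</p>\n\"\n" +
    "120,83,Title B,\"<p>single line</p>\"\n"

  // A line-based reader sees physical lines, not CSV records.
  val physicalLines: Array[String] = sample.split("\n")

  // An odd number of double quotes means a quoted field opens or closes
  // on this line without its partner. Escaped quotes ("") come in pairs,
  // so they do not change the parity.
  def hasUnbalancedQuotes(line: String): Boolean =
    line.count(_ == '"') % 2 == 1
}
```

In this sketch the two records are spread over five physical lines, so any per-line `split` operates on fragments rather than on whole records, regardless of the regex used.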

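Unfortunately, this means you have to find some way of "cleaning" the data (for example, replacing the newlines with a placeholder) before writing it to disk or loading it with Spark.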
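
One possible way to do such a cleaning step is sketched below in plain Scala. It rests on a few assumptions: the file is small enough to process as a single string outside Spark, fields are enclosed in double quotes with embedded quotes doubled (`""`) as in the sample, and a single space is an acceptable placeholder. `CsvCleaner`, `flattenNewlines`, and `placeholder` are hypothetical names, not part of Spark or of the original answer.

```scala
object CsvCleaner {
  /** Replaces line breaks that occur inside double-quoted fields with
    * `placeholder`, so that each CSV record ends up on one physical line.
    * Line breaks outside quotes (the record separators) are kept.
    */
  def flattenNewlines(csv: String, placeholder: String = " "): String = {
    val out = new StringBuilder
    var inQuotes = false
    for (c <- csv) {
      if (c == '"') {
        // Every quote toggles the state; an escaped quote ("") toggles
        // twice and therefore leaves the state unchanged.
        inQuotes = !inQuotes
        out.append(c)
      } else if (inQuotes && (c == '\n' || c == '\r')) {
        // Each line-break character inside a quoted field is replaced,
        // so a "\r\n" pair yields two placeholders.
        out.append(placeholder)
      } else {
        out.append(c)
      }
    }
    out.toString
  }
}
```

After this pass every record occupies a single line, so a line-based load yields whole records. The fields themselves still need a quote-aware split, because commas can also appear inside the quoted Body column.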