Python - ignore non-parsable strings

I have some strings in a text file that I am parsing with pandas. A sample looks like this:

May 6, 2021 12:40:05 AM CEST INFO    [com.purge.PurgeManager run] PURGE: Purge all data beginning (1 threads)
May 6, 2021 9:12:17 AM CEST FINE    [com.noelios.restlet.http.HttpClientCall sendRequest] An error occured during the communication with the remote HTTP server.
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
    at sun.security.ssl.InputRecord.read(InputRecord.java:503)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
    at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
    at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1593)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:352)
    at com.noelios.restlet.ext.net.HttpUrlConnectionCall.getStatusCode(HttpUrlConnectionCall.java:299)
    at com.noelios.restlet.http.HttpClientCall.sendRequest(HttpClientCall.java:173)
    at com.noelios.restlet.ext.net.HttpUrlConnectionCall.sendRequest(HttpUrlConnectionCall.java:183)
    at com.noelios.restlet.http.HttpClientConverter.commit(HttpClientConverter.java:109)
    at com.noelios.restlet.http.HttpClientHelper.handle(HttpClientHelper.java:88)
    at org.restlet.Client.handle(Client.java:120)
    at org.restlet.Uniform.handle(Uniform.java:106)
    at com.boomi.container.core.MessagePollerThread.run(MessagePollerThread.java:273)
    at java.lang.Thread.run(Thread.java:748)

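Because the file has no column headers, no delimiters, and variable-width values, I read it line by line with str.strip() and then create a new comma-separated file with column headers. Before writing the output file, I also convert the date strings to date objects using dateutil.parser.parse: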

import dateutil.parser
import pandas as pd

data = []
with open(inputFile, "r") as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue
        line = line.split(maxsplit=6)
        logdate = " ".join(line[:6])
        logstatus = line[-1].split(maxsplit=1)[0]
        loginfo = line[-1].split(maxsplit=1)[-1]
        data.append({"LogDate": logdate, "LogStatus": logstatus, "LogInfo": loginfo})

df = pd.DataFrame(data)

df["LogDate"] = df["LogDate"].apply(dateutil.parser.parse, ignoretz=True)

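However, for lines that start with some other string instead of a date (e.g. java.net.Socket...), I get an error when trying to parse them, because they cannot be parsed, which is correct. How can I get past this? If the string can be parsed, I want to parse it; otherwise I want to ignore it and do nothing. I have tried the following, but when it reaches the except block it affects the entire output file.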

try:
    df["LogDate"] = df["LogDate"].apply(dateutil.parser.parse, ignoretz=True)
except Exception as e:
    pass

Output file:

LogDate,LogStatus,LogInfo
"May 6, 2021 12:40:05 AM CEST",INFO,[com.purge.PurgeManager run] PURGE: Purge all data beginning (1 threads)
"May 6, 2021 9:12:17 AM CEST",FINE,[com.noelios.restlet.http.HttpClientCall sendRequest] An error occured during the communication with the remote HTTP server.
java.net.SocketTimeoutException: Read timed out,out,out
at java.net.SocketInputStream.socketRead0(Native Method),Method),Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116),java.

What am I missing here?

You can try this, which skips every line that does not start with a month name:

import dateutil.parser
import pandas as pd

months = (
    "January",
    "February",
    "March",
    "April",
    "May",
    "June",
    "July",
    "August",
    "September",
    "October",
    "November",
    "December",
)
data = []
with open(inputFile, "r") as f_in:
    for line in map(str.strip, f_in):
        # Skip blank lines and lines that do not start with a month name,
        # such as the Java stack-trace lines
        if not line or not line.startswith(months):
            continue
        line = line.split(maxsplit=6)
        logdate = " ".join(line[:6])
        logstatus = line[-1].split(maxsplit=1)[0]
        loginfo = line[-1].split(maxsplit=1)[-1]
        data.append({"LogDate": logdate, "LogStatus": logstatus, "LogInfo": loginfo})

df = pd.DataFrame(data)
df["LogDate"] = df["LogDate"].apply(dateutil.parser.parse, ignoretz=True)

print(df)
# Outputs
              LogDate LogStatus                                            LogInfo
0 2021-05-06 00:40:05      INFO  [com.purge.PurgeManager run] PURGE: Purge all ...
1 2021-05-06 09:12:17      FINE  [com.noelios.restlet.http.HttpClientCall sendR...
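
If you would rather do what the question literally asks (parse when possible, otherwise ignore), one option is to attempt the parse per line instead of wrapping the whole column assignment in a single try/except, where one failure aborts the conversion for every row. The following is only a minimal sketch under a few assumptions: parse_or_none and read_records are hypothetical helper names, not part of pandas or dateutil, and every log line is assumed to carry a six-token timestamp followed by a status and a message, as in the sample.

```python
import dateutil.parser


def parse_or_none(text):
    """Return a datetime if dateutil can parse text, otherwise None."""
    try:
        return dateutil.parser.parse(text, ignoretz=True)
    except (ValueError, OverflowError):
        # Recent dateutil versions raise ParserError, a ValueError subclass.
        return None


def read_records(lines):
    """Keep only lines whose first six tokens form a parseable timestamp."""
    records = []
    for line in map(str.strip, lines):
        # Six timestamp tokens, then the status, then the rest of the message.
        parts = line.split(maxsplit=7)
        if len(parts) < 8:
            continue  # blank line or short stack-trace line
        logdate = parse_or_none(" ".join(parts[:6]))
        if logdate is None:
            continue  # long line that does not start with a timestamp
        records.append(
            {"LogDate": logdate, "LogStatus": parts[6], "LogInfo": parts[7]}
        )
    return records


sample = [
    "May 6, 2021 12:40:05 AM CEST INFO    [com.purge.PurgeManager run] PURGE: Purge all data beginning (1 threads)",
    "java.net.SocketTimeoutException: Read timed out",
    "    at java.net.SocketInputStream.socketRead0(Native Method)",
    "",
    "May 6, 2021 9:12:17 AM CEST FINE    [com.noelios.restlet.http.HttpClientCall sendRequest] An error occured",
]
records = read_records(sample)
```

Passing the open file object to read_records and then calling pd.DataFrame(records) should give the same three columns as above, with LogDate already converted. Unlike the startswith(months) check, this does not depend on how the month is spelled, at the cost of one parse attempt for each line that has at least eight tokens.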