Python - ignore non-parsable strings
I have some strings in a text file that I am parsing with pandas. A sample looks like this:
May 6, 2021 12:40:05 AM CEST INFO [com.purge.PurgeManager run] PURGE: Purge all data beginning (1 threads)
May 6, 2021 9:12:17 AM CEST FINE [com.noelios.restlet.http.HttpClientCall sendRequest] An error occured during the communication with the remote HTTP server.
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
at sun.security.ssl.InputRecord.read(InputRecord.java:503)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1593)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:352)
at com.noelios.restlet.ext.net.HttpUrlConnectionCall.getStatusCode(HttpUrlConnectionCall.java:299)
at com.noelios.restlet.http.HttpClientCall.sendRequest(HttpClientCall.java:173)
at com.noelios.restlet.ext.net.HttpUrlConnectionCall.sendRequest(HttpUrlConnectionCall.java:183)
at com.noelios.restlet.http.HttpClientConverter.commit(HttpClientConverter.java:109)
at com.noelios.restlet.http.HttpClientHelper.handle(HttpClientHelper.java:88)
at org.restlet.Client.handle(Client.java:120)
at org.restlet.Uniform.handle(Uniform.java:106)
at com.boomi.container.core.MessagePollerThread.run(MessagePollerThread.java:273)
at java.lang.Thread.run(Thread.java:748)
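Since the file has no column headers, no delimiter, and values of varying width, I read it line by line with str.strip(), then create a new comma-separated file with the columns LogDate, LogStatus, and LogInfo. In addition, before writing the output file, I convert the date strings to date objects with dateutil.parser.parse: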
data = []
with open(inputFile, "r") as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue
        line = line.split(maxsplit=6)
        logdate = " ".join(line[:6])
        logstatus = line[-1].split(maxsplit=1)[0]
        loginfo = line[-1].split(maxsplit=1)[-1]
        data.append({"LogDate": logdate, "LogStatus": logstatus, "LogInfo": loginfo})
df = pd.DataFrame(data)
df["LogDate"] = df["LogDate"].apply(dateutil.parser.parse, ignoretz=True)
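However, for lines that start with some other string instead of a date (e.g. java.net.Socket...), I get an error when trying to parse them, because they cannot be parsed, which is correct. How can I get past this? If the string can be parsed, I want to parse it; otherwise I want to ignore it and do nothing. I have tried the following, but when it reaches the except block, the whole output file is affected.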
try:
    df["LogDate"] = df["LogDate"].apply(dateutil.parser.parse, ignoretz=True)
except Exception as e:
    pass
Output file:
LogDate,LogStatus,LogInfo
"May 6, 2021 12:40:05 AM CEST",INFO,[com.purge.PurgeManager run] PURGE: Purge all data beginning (1 threads)
"May 6, 2021 9:12:17 AM CEST",FINE,[com.noelios.restlet.http.HttpClientCall sendRequest] An error occured during the communication with the remote HTTP server.
java.net.SocketTimeoutException: Read timed out,out,out
at java.net.SocketInputStream.socketRead0(Native Method),Method),Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116),java.
What am I missing here?
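One likely explanation, sketched under assumptions: the try/except wraps the whole column assignment, so the first unparsable value raises, the assignment never runs, and LogDate keeps its original strings. The odd rows come from the split itself: a stack-trace line has fewer than seven tokens, so its last token ends up as both LogStatus and LogInfo. The sketch below mimics this with a plain list instead of a DataFrame column. `split_fields` and `parse_date` are hypothetical helpers, and `datetime.strptime` stands in for dateutil.parser.parse.

```python
from datetime import datetime


def split_fields(line):
    # Same slicing as the loop in the question.
    parts = line.split(maxsplit=6)
    log_date = " ".join(parts[:6])
    log_status = parts[-1].split(maxsplit=1)[0]
    log_info = parts[-1].split(maxsplit=1)[-1]
    return log_date, log_status, log_info


def parse_date(text):
    # Stand-in for dateutil.parser.parse(text, ignoretz=True) (assumption):
    # drop the trailing zone name, then parse the rest strictly.
    without_tz = text.rsplit(maxsplit=1)[0]
    return datetime.strptime(without_tz, "%b %d, %Y %I:%M:%S %p")


sample = [
    "May 6, 2021 12:40:05 AM CEST INFO [com.purge.PurgeManager run] "
    "PURGE: Purge all data beginning (1 threads)",
    "java.net.SocketTimeoutException: Read timed out",
]
rows = [split_fields(line) for line in sample]

# A short line has no seventh item, so "out" is reused for both fields.
print(rows[1])

log_dates = [row[0] for row in rows]
try:
    # The second value raises, so the assignment never takes place.
    log_dates = [parse_date(value) for value in log_dates]
except ValueError:
    pass

# Every entry is still a string, including the one that could be parsed.
print(log_dates)
```

Catching the error once around the whole column therefore skips the conversion for every row; the unparsable lines have to be filtered out or handled one at a time.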
You can try this:
import dateutil.parser
import pandas as pd

# Full month names; add abbreviations such as "Jan" if the log uses them
months = (
    "January",
    "February",
    "March",
    "April",
    "May",
    "June",
    "July",
    "August",
    "September",
    "October",
    "November",
    "December",
)
data = []
with open(inputFile, "r") as f_in:
    for line in map(str.strip, f_in):
        # Skip empty lines and lines that do not start with a month name
        if not line or not line.startswith(months):
            continue
        # The first six tokens form the date; the seventh item is the remainder
        line = line.split(maxsplit=6)
        logdate = " ".join(line[:6])
        logstatus = line[-1].split(maxsplit=1)[0]
        loginfo = line[-1].split(maxsplit=1)[-1]
        data.append({"LogDate": logdate, "LogStatus": logstatus, "LogInfo": loginfo})
df = pd.DataFrame(data)
df["LogDate"] = df["LogDate"].apply(dateutil.parser.parse, ignoretz=True)
print(df)
# Outputs
LogDate LogStatus LogInfo
0 2021-05-06 00:40:05 INFO [com.purge.PurgeManager run] PURGE: Purge all ...
1 2021-05-06 09:12:17 FINE [com.noelios.restlet.http.HttpClientCall sendR...
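If you would rather follow the "parse it if possible, otherwise ignore it" idea literally, the try/except can be moved inside the loop so that it guards one line at a time. This is only a sketch under assumptions: it uses the standard library's `datetime.strptime` instead of dateutil, it assumes every record looks like the two sample lines (abbreviated English month name, 12-hour clock, a zone name, a level, then a message), and `DATE_FORMAT` and `parse_log_lines` are hypothetical names.

```python
from datetime import datetime

# Assumed layout: "May 6, 2021 12:40:05 AM CEST INFO message..."
DATE_FORMAT = "%b %d, %Y %I:%M:%S %p"  # assumption; %b needs an English locale


def parse_log_lines(lines):
    """Return one dict per line whose first five tokens form a valid date."""
    records = []
    for line in map(str.strip, lines):
        # Five date tokens, the zone name, the level, then the message.
        parts = line.split(maxsplit=7)
        if len(parts) < 8:
            continue  # too short to be a complete log record
        try:
            log_date = datetime.strptime(" ".join(parts[:5]), DATE_FORMAT)
        except ValueError:
            continue  # not a date, e.g. a stack-trace line: skip it
        # parts[5] is the zone name; it is dropped, like ignoretz=True does.
        records.append(
            {"LogDate": log_date, "LogStatus": parts[6], "LogInfo": parts[7]}
        )
    return records
```

The returned list of dicts can be passed to pd.DataFrame in the same way as data in the code above, and LogDate then already holds datetime objects. Note that this version also drops records that have no message after the level.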