Creating a pandas DataFrame from unstructured text
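Sorry for the newbie question, but here goes. I'm trying to analyze some Facebook messages. So far I've downloaded an html file, turned it into a tidy list with BeautifulSoup, and now I'm trying to create a DataFrame from it.
I've been looking at this resource: https://datatofish.com/list-to-dataframe/, but without success.
This is the format I have right now: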
list = ['2019-01-07 12:51 PM', 'name1', 'hi how are you', 'im at home', 'wanna come over?', '2019-01-07 01:02 PM', 'name2', 'hell yeah', '', 'ill bring beer', '2019-01-07 01:06 PM', 'name1', 'awesome', 'and so on']
I've tried a few different approaches, but I'm starting to feel a bit out of my depth. I'm still learning.
This is the output I'm hoping for:
index  date        time      name   message
0      2019-01-07  12:51 PM  name1  hi how are you
1      2019-01-07  12:51 PM  name1  im at home
2      2019-01-07  12:51 PM  name1  wanna come over?
3      2019-01-07  12:56 PM  name2  hell yeah
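I tried iterating over the list and filling the columns as I go, depending on whether I hit a date, a name, or a message.
As I said, I'm learning, so it would be great if you could point me in the right direction for research rather than hand me a solution. I'd really appreciate it. Thanks!
Edit: I tried several existing message parsers, but for some reason none of them has been supported since 2018. They also all give me parsing errors.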
It's a bit ugly, but it works. I'd happily upvote a more elegant solution!
from datetime import datetime
import pandas as pd

l = iter(['2019-01-07 12:51 PM', 'name1', 'hi how are you', 'im at home', 'wanna come over?', '2019-01-07 01:02 PM', 'name2', 'hell yeah', '', 'ill bring beer', '2019-01-07 01:06 PM', 'name1', 'awesome', 'and so on'])
# %I is the 12-hour clock; with %H the AM/PM marker (%p) would be ignored
fmt = "%Y-%m-%d %I:%M %p"
rows = []
# get first element in list
x = next(l)
# when the iterator is exhausted, catch the StopIteration and stop
try:
    while True:
        # convert element to datetime (strptime raises ValueError on any mismatch, including '')
        stamp = datetime.strptime(x, fmt)
        # if successful get next element as name
        name = next(l)
        # get next elements as messages while they do not match datetime format
        x = next(l)
        while True:
            try:
                # if datetime conversion is successful break while
                datetime.strptime(x, fmt)
                break
            except ValueError:
                # else collect the message (DataFrame.append was removed in pandas 2.0)
                rows.append({"datetime": stamp, "name": name, "msg": x})
                x = next(l)
except StopIteration:
    pass
df = pd.DataFrame(rows)
df["date"] = df["datetime"].dt.date
df["time"] = df["datetime"].dt.time
print(df)

Output:
             datetime   name               msg        date      time
0 2019-01-07 12:51:00  name1    hi how are you  2019-01-07  12:51:00
1 2019-01-07 12:51:00  name1        im at home  2019-01-07  12:51:00
2 2019-01-07 12:51:00  name1  wanna come over?  2019-01-07  12:51:00
3 2019-01-07 13:02:00  name2         hell yeah  2019-01-07  13:02:00
4 2019-01-07 13:02:00  name2                    2019-01-07  13:02:00
5 2019-01-07 13:02:00  name2    ill bring beer  2019-01-07  13:02:00
6 2019-01-07 13:06:00  name1           awesome  2019-01-07  13:06:00
7 2019-01-07 13:06:00  name1         and so on  2019-01-07  13:06:00
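The same idea (a timestamp starts a block, the next item is the name, everything after that is a message) can probably be written as one flat loop that remembers the current timestamp and name. This is only a sketch: the helper names `parse_stamp` and `parse_messages` and the format string are my own assumptions based on the sample list, not anything Facebook or pandas defines.

```python
from datetime import datetime

import pandas as pd

# Assumed timestamp layout, inferred from the sample list in the question
FMT = "%Y-%m-%d %I:%M %p"


def parse_stamp(text):
    """Return a datetime if text matches FMT, otherwise None (hypothetical helper)."""
    try:
        return datetime.strptime(text, FMT)
    except ValueError:
        return None


def parse_messages(items):
    """Walk the list once, carrying the current timestamp and name forward."""
    rows = []
    stamp = name = None
    expect_name = False
    for item in items:
        parsed = parse_stamp(item)
        if parsed is not None:
            # a timestamp opens a new block; the next item is the sender
            stamp, expect_name = parsed, True
        elif expect_name:
            name, expect_name = item, False
        elif stamp is not None:
            # anything else inside a block is a message (empty strings included)
            rows.append({"date": stamp.date(), "time": stamp.time(),
                         "name": name, "message": item})
    return pd.DataFrame(rows, columns=["date", "time", "name", "message"])


sample = ['2019-01-07 12:51 PM', 'name1', 'hi how are you', 'im at home',
          'wanna come over?', '2019-01-07 01:02 PM', 'name2', 'hell yeah', '',
          'ill bring beer', '2019-01-07 01:06 PM', 'name1', 'awesome', 'and so on']
df = parse_messages(sample)
print(df)
```

Collecting plain dicts and building the DataFrame once at the end also avoids growing a DataFrame row by row, which tends to get slow on long chat histories.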
Using regular expressions and list comprehensions, extract the list contents and convert them into a pandas DataFrame:
import pandas as pd
import re

# [AP]M so that morning timestamps are matched as well
datetime_regex = re.compile(r"\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}\s[AP]M")
name_regex = re.compile(r"name\d+")
cols = ["date",
        "time",
        "name",
        "message"
        ]
l = ['2019-01-07 12:51 PM',
     'name1',
     'hi how are you',
     'im at home',
     'wanna come over?',
     '2019-01-07 01:02 PM',
     'name2',
     'hell yeah',
     '',
     'ill bring beer',
     '2019-01-07 01:06 PM',
     'name1',
     'awesome',
     'and so on'
     ]
tmp = ''.join(l)
datetimes = re.findall(datetime_regex, tmp)
# the date is the first 10 characters, the time starts after the space
dates = [dt[:10] for dt in datetimes]
times = [dt[11:] for dt in datetimes]
names = re.findall(name_regex, tmp)
messages = [line
            for line in l
            if not line.startswith(('2019', 'name1', 'name2'))
            ]
# the slice boundaries are hard-coded for this sample: 3 messages per block, then 3, then 2
data = [[[dates[0], times[0], names[0], msg]
         for msg in messages[:3]],
        [[dates[1], times[1], names[1], msg]
         for msg in messages[3:6]],
        [[dates[2], times[2], names[2], msg]
         for msg in messages[6:]]
        ]
flatten = [item for sublist in data for item in sublist]
df = pd.DataFrame(flatten, columns=cols)
print(df)

Which returns:
         date      time   name           message
0  2019-01-07  12:51 PM  name1    hi how are you
1  2019-01-07  12:51 PM  name1        im at home
2  2019-01-07  12:51 PM  name1  wanna come over?
3  2019-01-07  01:02 PM  name2         hell yeah
4  2019-01-07  01:02 PM  name2
5  2019-01-07  01:02 PM  name2    ill bring beer
6  2019-01-07  01:06 PM  name1           awesome
7  2019-01-07  01:06 PM  name1         and so on
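If the hard-coded slices are a concern, one possible way to let pandas find the block boundaries itself is to mark the timestamp rows with a regex, treat the row right after each timestamp as the name, and forward-fill both down to the message rows. This is a sketch under the assumption that every timestamp looks like the ones in the sample and is always followed by exactly one name; the variable names are made up for illustration.

```python
import pandas as pd

sample = ['2019-01-07 12:51 PM', 'name1', 'hi how are you', 'im at home',
          'wanna come over?', '2019-01-07 01:02 PM', 'name2', 'hell yeah', '',
          'ill bring beer', '2019-01-07 01:06 PM', 'name1', 'awesome', 'and so on']

s = pd.Series(sample, dtype="object")

# Assumed timestamp pattern, inferred from the sample data
is_stamp = s.str.fullmatch(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2} [AP]M")
# Assumption: the item directly after a timestamp is always the sender's name
is_name = is_stamp.shift(1, fill_value=False)

df = pd.DataFrame({
    # keep the value only on marker rows, then carry it down to the messages
    "stamp": s.where(is_stamp).ffill(),
    "name": s.where(is_name).ffill(),
    "message": s,
})

# drop the marker rows so that only messages remain
df = df[~(is_stamp | is_name)].reset_index(drop=True)

# split "2019-01-07 12:51 PM" into a date part and a time part
parts = df["stamp"].str.split(" ", n=1, expand=True)
df["date"] = parts[0]
df["time"] = parts[1]
df = df[["date", "time", "name", "message"]]
print(df)
```

Here date and time stay plain strings, as in the desired output from the question; `pd.to_datetime` could be applied afterwards if real timestamps are needed.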