Python/Pandas - split text into columns by delimiter ; and create a csv file
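I have a long text in which I inserted the delimiter ";" exactly where I want to split the text into separate columns.
So far, whenever I try to split the text into 'ID' and 'ADText', I only get the first row. However, there should be 1439 lines/rows in the two columns.
My text looks like this:
1234; text in written form with multiple sentences going over multiple lines until at some point the next ID is written down 2345; then the new Ad-Text begins until the next ID 3456; and so on
I want to use the ; to split my text into two columns, one with the ID and one with the AD text.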
import pandas as pd

# read the text file into Python
jobads = pd.read_csv("jobads.txt", header=None)
print(jobads)
# create dataframe
df = pd.DataFrame(jobads, index=None, columns=None)
type(df)
print(df)
# name the column so it can be targeted for the split
df = df.rename(columns={0: "Job"})
print(df)
# split it into two columns. Problem: I only get the first row.
print(pd.DataFrame(df.Job.str.split(';', n=1).tolist(),
                   columns=['ID', 'AD']))
Unfortunately, this only works for the first entry and then stops. The output looks like this:
ID AD
0 1234 text in written from with ...
Where am I going wrong? I would appreciate any advice =)
Thank you!
Sample text:
FullName;ISO3;ISO1;molecular_weight
Alanine;Ala;A;89.09
Arginine;Arg;R;174.20
Asparagine;Asn;N;132.12
Aspartic_Acid;Asp;D;133.10
Cysteine;Cys;C;121.16
Create the columns using ";" as the separator:
import pandas as pd
f = "aminoacids"
df = pd.read_csv(f,sep=";")
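To try this without a file on disk, here is a minimal sketch that feeds the same sample text to read_csv through io.StringIO. The names sample and amino_df are only illustrative and are not part of the original answer.

```python
import io

import pandas as pd

# The sample text from above, held in memory instead of in a file
sample = """FullName;ISO3;ISO1;molecular_weight
Alanine;Ala;A;89.09
Arginine;Arg;R;174.20
Asparagine;Asn;N;132.12
Aspartic_Acid;Asp;D;133.10
Cysteine;Cys;C;121.16
"""

# sep=";" tells pandas to split each line on semicolons instead of commas
amino_df = pd.read_csv(io.StringIO(sample), sep=";")

print(amino_df.shape)
print(amino_df.columns.tolist())
```

This only helps when every record sits on its own line, as in the sample. It does not cover free text in which one record spans several lines.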
Edit: Taking the comments into account, I think the text looks more like this:
t = """1234; text in written from with multiple sentences going over multiple lines until at some point the next ID is written dwon 2345; then the new Ad-Text begins until the next ID 3456; and so on1234; text in written from with multiple """
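In that case, a regular expression like this splits your string into IDs and texts, which you can then use to build a pandas DataFrame.
import re
# The capture group (...) makes re.split keep the matched IDs in the result
r = re.compile(r"([0-9]+);")
a = re.split(r, t)
a
Output: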
['',
'1234',
' text in written from with multiple sentences going over multiple lines until at some point the next ID is written dwon ',
'2345',
' then the new Ad-Text begins until the next ID ',
'3456',
' and so on',
'1234',
' text in written from with multiple ']
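As a possible alternative to splitting and then slicing, re.findall can return (ID, text) pairs directly. This is only a sketch on a shortened, made-up sample string (short_text), not the approach used in this answer. Like the split above, it assumes the ad text itself never contains digits immediately followed by ";".

```python
import re

# Shortened, hypothetical sample in the same "ID; text" layout
short_text = "1234; first ad 2345; second ad 3456; third ad"

# Group 1: the ID. Group 2: everything up to the next "<digits>;" or the end of the string.
# re.DOTALL lets "." match newlines, so ads spanning several lines stay in one piece.
pairs = re.findall(r"([0-9]+);(.*?)(?=[0-9]+;|\Z)", short_text, flags=re.DOTALL)

for ad_id, ad_text in pairs:
    print(ad_id, repr(ad_text.strip()))
```

Each element of pairs is a tuple, so there is no leading empty string to discard.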
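Edit 2:
This is a response to the asker's follow-up question in the comments:
How can I convert this string into a pandas DataFrame with 2 columns: ID and text?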
import pandas as pd
# a is the list returned by re.split(r, t) in the previous part of this answer
# Create the list of texts. [::2] takes every other item, starting with the FIRST one; [1:] then drops the leading empty string.
texts = a[::2][1:]
print(texts)
# Create the list of IDs. [1::2] takes every other item, starting with the SECOND one.
ids = a[1::2]
print(ids)
df = pd.DataFrame({"IDs": ids, "Texts": texts})
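The question title also asks for a csv file, which the steps above stop short of. A minimal end-to-end sketch could look like the following. The sample string raw is a made-up stand-in for the contents of jobads.txt, and the names ads_df and csv_text are illustrative assumptions.

```python
import re

import pandas as pd

# Hypothetical stand-in for the contents of jobads.txt
raw = "1234; first ad,\nspanning two lines 2345; second ad 3456; third ad"

# Split on "<digits>;" and keep the digits through the capture group
parts = re.split(r"([0-9]+);", raw)

ad_ids = parts[1::2]                          # odd positions hold the IDs
ad_texts = [p.strip() for p in parts[2::2]]   # even positions after the leading chunk hold the texts

ads_df = pd.DataFrame({"ID": ad_ids, "AD": ad_texts})

# With no path given, to_csv returns the CSV as a string; index=False omits the row index
csv_text = ads_df.to_csv(index=False, sep=";")
print(csv_text)
```

To write a file instead, pass a path as the first argument, for example ads_df.to_csv("jobads.csv", index=False, sep=";"). Fields that contain line breaks are quoted in the output, so multi-line ad texts should survive a round trip through read_csv with the same sep.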