Python: extracting variable-length text from a file
I have a text file that contains data like this:
Tweet_id:"123456789", "text":"What an episode", "truncated":"false",Tweet_id:"12345678910", "text":My number is fascinating", "truncated":false
I only want to extract the text fields:
Tweet_id:"123456789", **"text":"What an episode", "truncated"**:"false",Tweet_id:"12345678910", **"text":My number is fascinating", "truncated":false**
I'm not sure exactly which part you want to extract, but I'd suggest using regular expressions.
>>> import re
>>> string = 'Tweet_id:"123456789","text":"What an episode","truncated":"false,Tweet_id:"12345678910","text":My number is fascinating","truncated":false'
>>> re.findall('\"text\":(.*?),', string)
['"What an episode"', 'My number is fascinating"']
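Since the data lives in a file, the same pattern can be applied to the file's contents. Here is a minimal sketch, assuming the file is small enough to read in one go. The function name `extract_text_fields` and the quote-stripping step are my own additions, not part of the question. Note that the lazy `(.*?),` stops at the first comma, so a tweet that itself contains a comma would be cut short.

```python
import re

# Same pattern as above: lazily capture everything between "text": and the next comma.
TEXT_FIELD = re.compile(r'"text":(.*?),')


def extract_text_fields(path):
    """Return the value of every "text" field found in the file at `path`.

    `path` is a hypothetical file name supplied by the caller.
    """
    with open(path, encoding="utf-8") as handle:
        data = handle.read()
    # The sample data quotes values inconsistently, so strip whitespace
    # and any leading or trailing double quotes from each capture.
    return [value.strip().strip('"') for value in TEXT_FIELD.findall(data)]
```

Calling `extract_text_fields("tweets.txt")` (with whatever your file is actually called) returns a plain list of strings.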
This is a natural application of regular expressions.
import re
text_re = re.compile("""
"text":" # This matches the part right before what you want.
(?P<content>[^"]+) # Matches the content
" # Matches the close-quote after the content.
""", re.VERBOSE)
for match in text_re.finditer('Tweet_id:"123456789","text":"What an episode","truncated":"false,Tweet_id:"12345678910","text":"My number is fascinating","truncated":false"'):
    print(match.group('content'))
This prints:
What an episode
My number is fascinating
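The regular expression may need to become more complex, depending on details such as how consistent the data format is and how double-quote characters inside a tweet's text are represented in the data.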
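As one illustration of that extra complexity, here is a sketch that tolerates double quotes inside the tweet text. It assumes such quotes are backslash-escaped (as in `\"`), which the question does not actually confirm; the names `TEXT_RE` and `tweet_texts` are hypothetical.

```python
import re

# Assumption: quotes inside a tweet are backslash-escaped, e.g. "text":"She said \"hi\"".
TEXT_RE = re.compile(r'''
    "text":"                        # literal prefix before the content
    (?P<content>(?:[^"\\]|\\.)*)    # ordinary characters, or a backslash plus one character
    "                               # the unescaped quote that closes the value
''', re.VERBOSE)


def tweet_texts(data):
    """Return the content of every quoted "text" field in `data`."""
    return [match.group('content') for match in TEXT_RE.finditer(data)]


sample = r'Tweet_id:"1","text":"She said \"hi\", then left","truncated":"false"'
print(tweet_texts(sample))
```

The captured text keeps the backslashes, so you would still need to unescape it afterwards. If the file turns out to be real JSON, a JSON parser is a more robust choice than a regex.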