Big data free text with special characters: searching via Python gives Unicode errors
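I have free text with special characters and a blank line between records, and I am unable to search it for keywords. I have a big text file with 3 columns (each column separated by "|"). It looks like every record ends with a } symbol. There is a blank line between each line or record. My file size is about 100 MB+.
My objective is to search for multiple keywords and the surrounding words before and after each keyword.
With help from Stack Overflow I am using this code, but I get a Unicode error. Please help.
1. I want only positive results. If the search has no match, I don't want to see any data.
2. Is it possible to see the first 4 columns of each find along with the result? These four columns are fixed length and are present in every record.
Sample of my file: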
00010007308000003161|730100039|2007-11-27 09:54:17.000|ACCG| {\rtf1\ansi\deflang1033\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss \fcharset0 Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;\red255\green0\blue0
;}
{\stylesheet{\fs20\cf2\cb1 Normal;}{\cs1\cf2\cb1 Default Paragraph Font;}{\s2\cf0\cb1
;}}
\par\par\par\b
FOLLOW-According to the United States Census Bureau, the township has a total area of 15.1 square miles
(39 km2), of which, 14.6 square miles (38 km2) of it is land and 0.5 square miles (1.3 km2) of it
(3.58%) is water. It is drained by the Lehigh River on its western \clvertalt\cellx4320
\pard\intbl\s0\ql\widctlpar\plain\f1\fs20\lang4105\f1\fs16 3.87 10^6/uL \cell
\pard\s0\ql\widctlpar\plain\f1\fs20\par\par\b ASSESSMENT:\plain\f1\fs20 Perfect
As of the census[1] of 2000, there were 4,243 people, 1,671 households, and 1,256 families residing in
the township. The population cc:\tab Dhar xdfsd, MD\par\par\par\par\pard\s0\ql\par}
00010007308000003141|730100040|2007-11-27 10:05:09.000|ACCG| {\rtf1\ansi\deflang1033\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss \fcharset0 Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;\red255\green0\blue0
;}
{\stylesheet{\fs20\cf0\cb1 Normal;}{\cs1\cf0\cb1 Default Paragraph Font;} {\s2\cf2\cb1
;}{\s3\f1\fs22\cf2\cb1\tqc\tx4320\tqr\tx8640 header;} {\s4\fs20\cf2\cb1\tqc\tx4320\tqr\tx8640
footer;}}
\pgwsxn12240\pghsxn15840\marglsxn864\margrsxn864\margtsxn1440\margbsxn864\headery1440\footery864\sbkpage
\pgncont\pgndec
\plain\plain\f1\fs20\pard\par\pard\s3\tqc\tx4320\tqr\tx8640\qc\widctlpar\f0\fs28 \caps
There were 1,671 households out of which 28.8% had children under the age of 18 living with them, 64.0%
were married couples living together, 6.9% had a female householder with no husband present, census
24.8% were non-families. 19.5% of all households were made up of
30094 - (770) 761-7260 - FAX (678) 413 -1818\par\lang1024\f0\fs20\par\pard\plain\f1\fs20\par\ql\par\par
}
00010007308000003141|730100036|2007-11-19 12:36:28.000|ACCG| {\rtf1\ansi\deflang1033\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss \fcharset0 Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red255\green0\blue0 ;}
{\stylesheet{\fs20\cf0\cb1 Normal;}{\cs1\cf0\cb1 Default Paragraph Font;}}
\paperw12240\paperh15840\margl864\margr864\margt1440\margb864\headery1440\footer y864\deftab720\formshade
\aendnotes\aftnnrlc\pgbrdrhead\pgbrdrfoot
\sectd
\pgwsxn12240\pghsxn15840\marglsxn864\margrsxn864\margtsxn1440\margbsxn864\headery1440\footery864\sbkpage
\pgncont\pgndec
\plain\plain\f1\fs20\lang1033\f1 Home Care Note: CMN received from Home Medical
In the township the population was spread out with 21.4% under the age of 18, 6.5% from 18 to 24, 29.9%
from 25 to 44, 27.7% from 45 to 64, and 14.6% who were 65 years of age or older. The median age was 40
years. For every 100 females there were 101.1 males. For every 100 females age 18 and over, there were
98.5 males
on RA on the 18th of Oct. Cont. O2 at 2L/N/C was ordered. \plain\f1\fs20\par}
00010007308000003141|730100037|2007-11-15 12:05:02.000|ACCG|Clear Document - Certificate
00010007308000003141|730100038|2007-11-28 08:35:18.000|ACCG {\rtf1\ansi\deflang1033\ftnbj\uc1
{\fonttbl{\f0 \census \fcharset0 Times New Roman;}{\f1 \fswiss \fcharset0 Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;\red255\green0\blue0
;}
{\stylesheet{\fs20\cf2\cb1 Normal;}{\cs1\cf2\cb1 Default Paragraph Font;}}
\paperw12240\paperh15840\margl864\margr864\margt1440\margb864\headery1440\footery864\deftab720\formshade
\aendnotes\aftnnrlc\pgbrdrhead\pgbrdrfoot
called and faxed to Mike.\plain\f1\fs20\par}
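In the file above I am searching for 'census' (case-insensitive), and I found matches in 4 places (2 times in the first record and 2 times in two different records).
The desired output is below: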
00010007308000003161|730100039|2007-11-27 09:54:17.000|ACCG|United States Census Bureau, the t
00010007308000003161|730100039|2007-11-27 09:54:17.000|ACCG|of the census[1] of 2000
00010007308000003141|730100040|2007-11-27 10:05:09.000|ACCG|husband present, census 24.8% were
00010007308000003141|730100038|2007-11-28 08:35:18.000|ACCG|fonttbl{\f0 \census \fcharset0 Times
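In the desired output above I chose to show only two words before and after 'census'. It would be great if I had the flexibility to choose more than 2 words, for example 10 words before and 15 words after.
I am also reading from a text file. It would be great if you could give me the commands to read from and write back to a text file. Sorry, I am new to Python, but I love the power of Python.
Thanks a lot for your help.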
s = r"""4200011|4200002|2006-12-28 10:28:42.000|{\rtf1\ansi
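    ^^
Try this: put the r prefix in front of the string so that it is a raw string. Otherwise you will have to escape every backslash twice, as in \plain\f1\fs20\par.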
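To illustrate the point about the r prefix, here is a minimal sketch. The variable names are only for illustration and do not come from the question.

```python
# A raw string keeps every backslash exactly as typed.
raw = r"\plain\f1\fs20\par"

# The same text as a normal string needs each backslash doubled.
escaped = "\\plain\\f1\\fs20\\par"
print(raw == escaped)

# Without the r prefix, "\f" is read as a single form-feed character,
# so the RTF control word is silently corrupted.
form_feed = "\f1"
print(len(r"\f1"), len(form_feed))
```

This only matters for string literals typed into source code. Text read from a file already contains the backslashes as they are and needs no escaping.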
You can use the regex below.
>>> s = r"""4200011|4200002|2006-12-28 10:28:42.000|{\rtf1\ansi
\deflang1033\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss
\fcharset0 Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;}
{\stylesheet{\fs20\cf2\cb1 Normal;}{\cs1\cf2\cb1 Default Paragraph
Font;}}
\paperw12240\paperh15840\margl864\margr864\margt1152\margb720\head
ery1152\footery720\deftab720\formshade\aendnotes\aftnnrlc
Called Brian with mike
\pgbrdrhead
12/27/06 fax 293-4812\plain\f1\fs20\par}
4200011|4200007|2010-11-29 12:49:42.000|{\rtf1\ansi
\deflang1033\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss
\fcharset0 Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red255\green0\blue0 ;}
{\stylesheet{\fs20\cf0\cb1 Normal;}{\cs1\cf0\cb1 Default Paragraph
Font;}}
\paperw12240\paperh15840\margl864\margr864\margt1007\margb576\head
ery1007\footery576\deftab720\formshade\aendnotes\aftnnrlc
\pgbrdrhead them numbers and they pt
minutes\plain\f1\fs20\par}"""
>>> import re
>>> ls = re.findall(r'^(\d+\|\d+)\|(?:(?!\n\n)[\s\S])*?(\S+\s+\S+\s+mike\s+\S+\s+\S+)', s)
>>> print(('|'.join([j for i in ls for j in i])).replace('\n',' '))
4200011|4200002|Brian with mike \pgbrdrhead 12/27/06
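The regex above hard-codes two words on each side and works on a string held in memory. For the file-based case in the question, a line-by-line approach may be easier to adjust. The following is a rough sketch under several assumptions: the header pattern is guessed from the sample records, the Unicode error is assumed to come from decoding the file (so undecodable bytes are replaced instead of raising), and all function names are hypothetical.

```python
import re

# Assumed header shape, taken from the sample: id|id|timestamp|code.
# The pipe after the code is optional because one sample record lacks it.
HEADER = re.compile(
    r"^(\d+\|\d+\|\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}\|[A-Z]+)\|?\s*(.*)$"
)

def iter_records(lines):
    """Yield (header, body) pairs; a new record starts at each header line."""
    header, body = None, []
    for line in lines:
        m = HEADER.match(line)
        if m:
            if header is not None:
                yield header, " ".join(body)
            header, body = m.group(1), [m.group(2)]
        elif header is not None:
            body.append(line.strip())
    if header is not None:
        yield header, " ".join(body)

def keyword_contexts(text, keyword, before=2, after=2):
    """Yield each whitespace-separated word containing keyword, with context."""
    words = text.split()
    needle = keyword.lower()
    for i, word in enumerate(words):
        if needle in word.lower():  # case-insensitive substring match
            yield " ".join(words[max(0, i - before): i + after + 1])

def search(lines, keywords, before=2, after=2):
    """Yield 'header|context' rows; records without a match yield nothing."""
    for header, body in iter_records(lines):
        for kw in keywords:
            for ctx in keyword_contexts(body, kw, before, after):
                yield f"{header}|{ctx}"

def search_file(src, dst, keywords, before=2, after=2, encoding="utf-8"):
    """Stream src one line at a time and write the matching rows to dst."""
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising
    # UnicodeDecodeError; the right encoding for the file is an assumption.
    with open(src, encoding=encoding, errors="replace") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for row in search(fin, keywords, before, after):
            fout.write(row + "\n")
```

A call such as search_file("notes.txt", "hits.txt", ["census", "mike"], before=10, after=15) would then write one row per hit, where both file names are placeholders. Because the file is read one line at a time, only a single record is held in memory at once, which should be fine for a 100 MB file. RTF control words are treated as ordinary words here, so a hit such as \census is reported too, as in the desired output.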