Python - How to extract sentences that contain a citation mark?
import re

text = "Trondheim is a small city with a university and 140000 inhabitants. Its central bus systems has 42 bus lines, serving 590 stations, with 1900 (departures per) day in average. T h a t gives approximately 60000 scheduled bus station passings per day, which is somehow represented in the route data base. The starting point is to automate the function (Garry Weber, 2005) of a route information agent."
print(re.findall(r"([^.]*?\(.+ [0-9]+\)[^.]*\.)", text))
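I am using the code above to extract the sentences that contain a citation. As you can see, the last sentence contains the citation (Garry Weber, 2005).
But I get this result: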
[' Its central bus systems has 42 bus lines, serving 590 stations, with 1900 (departures per) day in average. T h a t gives approximately 60000 scheduled bus station passings per day, which is somehow represented in the route data base. The starting point is to automate the function (Garry Weber, 2005) of a route information agent.']
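The result should be only the sentence that contains the citation, like this:
The starting point is to automate the function (Garry Weber, 2005) of a route information agent.
I guess the problem is caused by the other parenthesized text, (departures per), which you can see in the second sentence. Is there a way to fix my code?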
My attempt:
\b[^.]+\([^()]+\b(\d{2}|\d{4})\s*\)[^.]*\.
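It captures exactly that sentence and is more specific than your pattern: [^()]+ cannot run past a closing parenthesis, so a parenthesized phrase without a year, such as (departures per), no longer matches.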
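One caveat when using this pattern with re.findall: it contains a capturing group, so findall would return only the captured year digits rather than the whole sentence. Below is a minimal sketch, assuming Python 3 and the text from the question, that makes the group non-capturing so the full sentence comes back. The names pattern and matches are only illustrative.

```python
import re

text = (
    "Trondheim is a small city with a university and 140000 inhabitants. "
    "Its central bus systems has 42 bus lines, serving 590 stations, "
    "with 1900 (departures per) day in average. "
    "T h a t gives approximately 60000 scheduled bus station passings per day, "
    "which is somehow represented in the route data base. "
    "The starting point is to automate the function (Garry Weber, 2005) "
    "of a route information agent."
)

# Same pattern as above, except the year group is non-capturing (?:...),
# so re.findall returns whole matches instead of just the digits.
pattern = re.compile(r"\b[^.]+\([^()]+\b(?:\d{2}|\d{4})\s*\)[^.]*\.")

matches = pattern.findall(text)
print(matches)
# ['The starting point is to automate the function (Garry Weber, 2005) of a route information agent.']
```

Alternatively, keep the original group and iterate with re.finditer, reading m.group(0) for each match. Note that both patterns treat every period as a sentence end, so abbreviations or decimal numbers inside a sentence would cut it short.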