Searching titles in the MEDLINE database with Entrez and Biopython
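I'm trying to search for papers whose titles contain specific words. More precisely, I'm looking for the word viral or virus in the titles of papers published between 2010 and 2015. Here is my code: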
import re
from Bio import Entrez
from Bio import Medline
handle = Entrez.esearch(db="pubmed",  # database to search
                        term="2010[Date - Publication]:2015[Date - Publication]"
                        )
record = Entrez.read(handle)
handle.close()
pmid_list = record["IdList"]  # list of records
handle = Entrez.efetch(db="pubmed", id=pmid_list, rettype="medline", retmode="text")
records = Medline.parse(handle)
titles = []  # start with empty list of titles
for record in records:
    ti_list = record['TI']  # titles
    for title in ti_list:
        if title == "virus" and title not in titles:  # searching viral/virus
            titles.append(title)
print('Publications with viral or virus in the title:')
for record in records:
    print(" ", title)
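If I simply print(record['TI']), I get a list of all the titles in my search query. However, I can't get the search for specific words to work. I think my mistake is probably in if title == "virus" (since obviously no paper would use that word alone as its title).
I'm stumped. Is there a better way to search for these words in the titles of the papers I queried?
Thanks.
Edit: updated the code (still no luck)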
import re
from Bio import Entrez
from Bio import Medline
handle = Entrez.esearch(db="pubmed",  # database to search
                        term="2010[Date - Publication]:2015[Date - Publication]"
                        )
record = Entrez.read(handle)
handle.close()
pmid_list = record["IdList"]  # list of records
handle = Entrez.efetch(db="pubmed", id=pmid_list, rettype="medline", retmode="text")
records = Medline.parse(handle)
r = re.compile(r"\bvir(al|us)\b")
titles = set()  # start with an empty set of titles
for record in records:
    ti_list = record['TI']  # titles
    for title in ti_list:
        if r.search(title):
            titles.add(title)
print('Publications with viral or virus in the title:')
for record in records:
    print(" ", title)
New code:
import re
from Bio import Medline
handle = Entrez.efetch(db="pubmed", id=pmid_list, rettype="medline", retmode="text",
                       term="2010[Date - Publication]:2015[Date - Publication]")
records = Medline.parse(handle)
titles = []
for record in records:
    ti_list = record['TI']
    for title in ti_list:
        titles.append(title)
handle.close()
for title in titles:
    print(title)
If you want to match substrings, use in to check whether the title contains any of the words:
words = ("viral","virus")
if any(w in title for w in words) and title not in titles: #
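To make the substring check concrete, here is a minimal sketch that runs without any network access. The sample titles are made up for illustration, the helper name title_mentions is hypothetical, and lowercasing the title first is my own assumption (titles often start with a capital letter), not something the original code does.

```python
def title_mentions(title, words=("viral", "virus")):
    """Return True if any of the words occurs as a substring of the title."""
    lowered = title.lower()  # assumption: the match should ignore case
    return any(w in lowered for w in words)


# Made-up sample titles, for illustration only.
sample_titles = [
    "Viral load dynamics in acute infection",
    "A survey of soil bacteria",
    "Structure of a virus capsid protein",
    "Viral load dynamics in acute infection",  # duplicate on purpose
]

titles = []
for title in sample_titles:
    # Same shape as the condition above: substring match plus duplicate check.
    if title_mentions(title) and title not in titles:
        titles.append(title)

for title in titles:
    print(title)
```

Note that a plain substring test also accepts longer words such as antiviral, which may or may not be what you want.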
But it seems you want to filter the records, to get any record whose title contains viral or virus:
st = {"viral","virus"}
filtered_records = [ record for record in records if any(w in st for w in record['TI'] )]
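If whole-word set membership is what you are after, the title has to be split into words before testing against the set. The following is only a sketch under a few assumptions: each record behaves like a dict whose 'TI' value is a single string, some records may lack 'TI' altogether, and case and punctuation should be ignored. The helper names and the sample records are hypothetical.

```python
import re

WANTED = {"viral", "virus"}  # words to look for


def title_words(title):
    """Split a title into lowercase words, dropping punctuation."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def filter_records(records, wanted=WANTED):
    """Keep records whose 'TI' field shares at least one word with `wanted`."""
    return [rec for rec in records if title_words(rec.get("TI", "")) & wanted]


# Made-up records shaped like parsed MEDLINE entries, for illustration only.
sample_records = [
    {"PMID": "1", "TI": "Virus entry into host cells."},
    {"PMID": "2", "TI": "Antiviral resistance mechanisms"},
    {"PMID": "3", "TI": "Gut microbiome and diet"},
    {"PMID": "4"},  # record without a title
]

filtered_records = filter_records(sample_records)
print([rec["PMID"] for rec in filtered_records])
```

Because the comparison is on whole words, antiviral does not count as a hit here, unlike with a substring test.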
If you want to match substrings using a pattern, you need to actually compile it as a regex; "vir(al|us)" is just a string in your code:
import re
r = re.compile("vir(al|us)")
filtered_records = [record for record in records if any(r.search(w) for w in record['TI'])]
In your own loop, the regex would go where your if is:
import re
r = re.compile(r"vir(al|us)")
if r.search(title) and title not in titles:
    .......
If you don't want words such as viruses to match, use word boundaries in your regex:
r = re.compile(r"\bvir(al|us)\b")
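A quick way to see what the word boundaries change is to run the patterns side by side on a few strings. This is just an illustrative sketch: the sample strings are invented, and the third pattern adds re.IGNORECASE, which is my own suggestion for capitalized titles rather than part of the pattern above.

```python
import re

loose = re.compile(r"vir(al|us)")              # matches anywhere inside a word
strict = re.compile(r"\bvir(al|us)\b")         # whole words only, case-sensitive
strict_ci = re.compile(r"\bvir(al|us)\b", re.IGNORECASE)  # whole words, any case

# Made-up strings, for illustration only.
samples = [
    "a new virus strain",
    "antiviral therapy",
    "emerging viruses",
    "Viral evolution",
]

results = {}
for text in samples:
    results[text] = (
        bool(loose.search(text)),
        bool(strict.search(text)),
        bool(strict_ci.search(text)),
    )
    print(text, results[text])
```

The loose pattern accepts antiviral and viruses, the strict one rejects both, and only the case-insensitive variant picks up a title that starts with a capital V.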
You should also make titles a set, which cannot contain duplicates. A working example using your own code:
r = re.compile(r"\bvir(al|us)\b")
titles = set()  # start with an empty set of titles
for record in records:
    ti_list = record['TI']  # titles
    for title in ti_list:
        if r.search(title):
            titles.add(title)
Which can become a set comprehension:
r = re.compile(r"\bvir(al|us)\b")
titles = {title for record in records for title in record['TI'] if r.search(title)} # titles
Since record['TI'] returns a string and not a list:
r = re.compile(r"\bvir(al|us)\b")
titles = set()
for record in records:
    title = record['TI']  # title is a str not a list
    if r.search(title):
        titles.add(title)
Do the same for the set comprehension or any of the other examples.
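For instance, the set comprehension adapted to a string-valued 'TI' might look like the sketch below. To keep it runnable offline, a small generator named fake_records stands in for the parsed records; that generator, its titles, the use of re.IGNORECASE, and the .get() fallback for records with no title are all assumptions of mine, not part of your code.

```python
import re

# IGNORECASE is an added assumption so that capitalized titles also match.
r = re.compile(r"\bvir(al|us)\b", re.IGNORECASE)


def fake_records():
    """Hypothetical stand-in for a lazy parser: yields dict records one by one."""
    yield {"TI": "Viral shedding in children"}
    yield {"TI": "Bone density after menopause"}
    yield {"TI": "Zika virus outbreak, 2015"}
    yield {"AB": "No title field here"}
    yield {"TI": "Viral shedding in children"}  # duplicate title


records = fake_records()

# 'TI' is treated as one string per record; records without it are skipped.
titles = {rec["TI"] for rec in records if r.search(rec.get("TI", ""))}

print("Publications with viral or virus in the title:")
for title in sorted(titles):  # loop over the collected titles, not the records
    print(" ", title)

# The generator has been consumed by the comprehension, so nothing is left.
leftover = list(records)
```

The final loop prints from titles rather than iterating records a second time: if the records come from something that yields them lazily, as a generator does, a second pass over it produces nothing.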