Fetching publication information from Entrez
Prerequisites
- Python 3.6 or higher is assumed.
- entrezpy is either installed via PyPI or cloned from the git repository (Installation).
- Basic familiarity with object-oriented Python, i.e. inheritance.
- The full implementation can be found in the repository at examples/tutorials/pubmed/pubmed-fetcher.py
Acknowledgment
I’d like to thank Pedram Hosseini (pdr[dot]hosseini[at]gmail[dot]com) for pointing out the requirement for this tutorial.
Overview
This tutorial explains how to write a simple PubMed data record fetcher using entrezpy.conduit.Conduit and by adjusting entrezpy.base.result.EutilsResult and entrezpy.base.analyzer.EutilsAnalyzer.
Outline
- develop an entrezpy.conduit.Conduit pipeline
- implement a PubMed data structure
- inherit entrezpy.base.result.EutilsResult and entrezpy.base.analyzer.EutilsAnalyzer
- implement the required virtual methods
- add methods to derived classes
The Efetch Entrez Utility is NCBI’s utility responsible for fetching data records. Its manual lists all possible databases and which records (Record type) can be fetched in which format. For the first example, we’ll fetch PubMed data in XML, specifically the UID, authors, title, abstract, and citations. We will develop and test the pipeline using the article with PubMed ID (PMID) 26378223 because it has all the required fields. In the end, we will see that not all fields are always present.
In entrezpy, a result (or query) is the sum of all individual requests required to obtain the whole query. If you want to analyze the number of citations for a specific author, the result is the number of citations which you obtained using a query. To obtain the final number, you have to parse several PubMed records. Therefore, entrezpy requires a result class, entrezpy.base.result.EutilsResult, to store the partial results obtained from a query.
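The idea of a result that grows with every parsed response can be sketched without entrezpy. The class name CitationCountResult and the counts below are invented purely for illustration; they are not part of entrezpy and do not come from real records.

```python
class CitationCountResult:
    """Hypothetical result object accumulating partial results."""

    def __init__(self):
        self.citations = 0      # running total over all parsed records
        self.records_seen = 0   # number of records contributing to the total

    def add_record(self, citation_count):
        # Each parsed record contributes one partial result.
        self.citations += citation_count
        self.records_seen += 1


# Invented data: each inner list stands for the counts parsed from one response.
responses = [[12, 3], [40], [7, 0, 5]]

result = CitationCountResult()
for response in responses:
    for count in response:
        result.add_record(count)

print(result.citations, result.records_seen)
```

The final number is only available once all responses have been analyzed, which is why the result object has to outlive a single request.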
A quick note on virtual functions
entrezpy is heavily based on virtual methods [1]. A virtual method is declared in the base class but implemented in the derived class. Every class inheriting the base class has to implement the virtual functions using the same signature and return the same result type as the base class. To implement the method in the inherited class, you need to look up the method in the base class.
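A minimal sketch of this pattern in plain Python is shown below. The class names BaseResult and ListResult are hypothetical and not part of entrezpy, and raising NotImplementedError is just one common way to mark a method as virtual; it is an assumption here, not a statement about how entrezpy does it internally.

```python
class BaseResult:
    """Hypothetical base class declaring a virtual method."""

    def size(self):
        # Declared here; every derived class has to provide the implementation.
        raise NotImplementedError("size() must be implemented by the derived class")


class ListResult(BaseResult):
    """Hypothetical derived class implementing the virtual method."""

    def __init__(self):
        self.items = []

    def size(self):
        # Same signature and same return type (int) as declared in the base class.
        return len(self.items)


result = ListResult()
result.items.append("record")
print(result.size())
```

Calling size() on BaseResult directly raises the error, which makes a forgotten implementation visible as soon as the method is used.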
PubMed data structure
Before we start to write our implementation, we need to understand the structure of the received data. This can be done using the EDirect tools from NCBI. The result is printed to the standard output. To examine it, it can either be stored in a file or, preferably, piped to a pager, e.g. less [2] or more [3]. These are usually installed on most *NIX systems.
$ efetch -db pubmed -id 26378223 -mode XML | less
The entry should start and end as shown in Listing 2.
<?xml version="1.0" ?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2019//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">
<PubmedArticleSet>
<PubmedArticle>
<!-- SKIPPED DATA -->
<Article PubModel="Print">
<!-- SKIPPED DATA -->
<ArticleTitle>Cell Walls and the Convergent Evolution of the Viral Envelope.</ArticleTitle>
<!-- SKIPPED DATA -->
<Abstract>
<AbstractText>Why some viruses are enveloped while others lack an outer lipid bilayer is a major question in viral evolution but one that has received relatively little attention. The viral envelope serves several functions, including protecting the RNA or DNA molecule(s), evading recognition by the immune system, and facilitating virus entry. Despite these commonalities, viral envelopes come in a wide variety of shapes and configurations. The evolution of the viral envelope is made more puzzling by the fact that nonenveloped viruses are able to infect a diverse range of hosts across the tree of life. We reviewed the entry, transmission, and exit pathways of all (101) viral families on the 2013 International Committee on Taxonomy of Viruses (ICTV) list. By doing this, we revealed a strong association between the lack of a viral envelope and the presence of a cell wall in the hosts these viruses infect. We were able to propose a new hypothesis for the existence of enveloped and nonenveloped viruses, in which the latter represent an adaptation to cells surrounded by a cell wall, while the former are an adaptation to animal cells where cell walls are absent. In particular, cell walls inhibit viral entry and exit, as well as viral transport within an organism, all of which are critical waypoints for successful infection and spread. Finally, we discuss how this new model for the origin of the viral envelope impacts our overall understanding of virus evolution. </AbstractText>
<CopyrightInformation>Copyright © 2015, American Society for Microbiology. All Rights Reserved.</CopyrightInformation>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Buchmann</LastName>
<ForeName>Jan P</ForeName>
<Initials>JP</Initials>
<AffiliationInfo>
<Affiliation>Marie Bashir Institute for Infectious Diseases and Biosecurity, Charles Perkins Centre, School of Biological Sciences, and Sydney Medical School, The University of Sydney, Sydney, New South Wales, Australia.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Holmes</LastName>
<ForeName>Edward C</ForeName>
<Initials>EC</Initials>
<AffiliationInfo>
<Affiliation>Marie Bashir Institute for Infectious Diseases and Biosecurity, Charles Perkins Centre, School of Biological Sciences, and Sydney Medical School, The University of Sydney, Sydney, New South Wales, Australia edward.holmes@sydney.edu.au.</Affiliation>
</AffiliationInfo>
</Author>
</AuthorList>
<!-- SKIPPED DATA -->
</Article>
<!-- SKIPPED DATA -->
<ReferenceList>
<Reference>
<Citation>Nature. 2014 Jan 16;505(7483):432-5</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">24336205</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Crit Rev Microbiol. 1988;15(4):339-89</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">3060317</ArticleId>
</ArticleIdList>
</Reference>
<!-- SKIPPED DATA -->
</ReferenceList>
</PubmedData>
</PubmedArticle>
</PubmedArticleSet>
This shows us the XML fields, specifically the tags, present in a typical PubMed record. The root tag for each batch of fetched data records is <PubmedArticleSet> and each individual data record is described in the nested tags <PubmedArticle>. We are interested in the following tags nested within <PubmedArticle>:
- <ArticleTitle>
- <Abstract>
- <AuthorList>
- <ReferenceList>
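To get a feeling for these tags before writing the real parser, the following sketch reads them from a heavily trimmed sample with Python's xml.etree.ElementTree. The sample string is shortened from the listing above (the abstract is cut off on purpose) and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed sample modeled on the listing above; real records carry many more tags.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <Article>
      <ArticleTitle>Cell Walls and the Convergent Evolution of the Viral Envelope.</ArticleTitle>
      <Abstract><AbstractText>Why some viruses are enveloped ...</AbstractText></Abstract>
      <AuthorList>
        <Author><LastName>Buchmann</LastName><ForeName>Jan P</ForeName></Author>
        <Author><LastName>Holmes</LastName><ForeName>Edward C</ForeName></Author>
      </AuthorList>
    </Article>
    <ReferenceList>
      <Reference><Citation>Nature. 2014 Jan 16;505(7483):432-5</Citation></Reference>
    </ReferenceList>
  </PubmedArticle>
</PubmedArticleSet>"""

root = ET.fromstring(SAMPLE_XML)
for article in root.iter('PubmedArticle'):
    # './/' searches at any depth, so intermediate tags do not matter here.
    title = article.findtext('.//ArticleTitle')
    abstract = article.findtext('.//Abstract/AbstractText')
    authors = [a.findtext('LastName') for a in article.iter('Author')]
    citations = [c.text for c in article.iter('Citation')]
    print(title, abstract, authors, citations, sep='\n')
```

Loading the whole document with fromstring() is fine for a single record; for large batches an incremental parser is the more economical choice.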
The first step is to write a program to fetch the requested records. This can be done using the entrezpy.conduit.Conduit class.
Simple Conduit pipeline to fetch PubMed Records
We will write a simple entrezpy pipeline named pubmed-fetcher.py using entrezpy.conduit.Conduit to test and run our implementations. A simple entrezpy.conduit.Conduit pipeline requires two arguments:
- user email
- PMID (here 15430309)
#!/usr/bin/env python3

import os
import sys


"""
If entrezpy is installed using PyPI uncomment the line 'import entrezpy' and
comment the 'sys.path.insert(...)'
"""

# import entrezpy
sys.path.insert(1, os.path.join(sys.path[0], '../../../src'))
# Import required entrezpy modules
import entrezpy.conduit


def main():
    c = entrezpy.conduit.Conduit(sys.argv[1])
    fetch_pubmed = c.new_pipeline()
    fetch_pubmed.add_fetch({'db':'pubmed', 'id':[sys.argv[2]], 'retmode':'xml'})
    c.run(fetch_pubmed)
    return 0

if __name__ == '__main__':
    main()
- Lines 3-4: import standard Python libraries
- Lines 12-15: import the module entrezpy.conduit (adjust as necessary)
- Line 19: create new entrezpy.conduit.Conduit instance with an email address from the first command line argument
- Line 20: create new pipeline fetch_pubmed using entrezpy.conduit.Conduit.new_pipeline()
- Line 21: add fetch request to the fetch_pubmed pipeline with the PMID from the second command line argument using entrezpy.conduit.Conduit.Pipeline.add_fetch()
- Line 22: run pipeline using entrezpy.conduit.Conduit.run()
Let’s test this program to see if all modules are found and conduit works.
$ python pubmed-fetcher.py your@email 15430309
Since we didn’t specify an analyzer yet, we expect the raw XML output to be printed to the standard output. So far, this produces the same output as Listing 1.
If this command fails and/or no output is printed to the standard output, something went wrong. Possible issues include no internet connection, an incorrectly installed entrezpy, wrong import statements, or bad permissions.
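One way to narrow down installation and import problems is to ask Python whether it can locate the module at all. The helper name is_installed is made up for this sketch; importlib.util.find_spec is part of the standard library. The check only reflects the current sys.path, so a cloned repository is found only after the sys.path.insert(...) line has run.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the top-level module without importing it."""
    return importlib.util.find_spec(module_name) is not None


# entrezpy is the module of interest; xml and json are used later in this tutorial.
for name in ('entrezpy', 'xml', 'json'):
    print("{}: {}".format(name, "found" if is_installed(name) else "NOT found"))
```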
If everything went smoothly, we wrote a basic but working pipeline to fetch PubMed data from NCBI’s Entrez database. We can now start to implement our specific entrezpy.base.result.EutilsResult and entrezpy.base.analyzer.EutilsAnalyzer classes. However, before we implement these classes, we need to decide how we want to store a PubMed data record.
How to store PubMed data records
The data records can be stored in different ways, but using a class facilitates collecting and retrieving the requested data. We implement a simple class (analogous to a C/C++ struct [4]) to represent a PubMed record.
class PubmedRecord:
    """Simple data class to store individual Pubmed records. Individual authors will
    be stored as dict('lname':last_name, 'fname': first_name) in authors.
    Citations as string elements in the list references. """

    def __init__(self):
        self.pmid = None
        self.title = None
        self.abstract = None
        self.authors = []
        self.references = []
Further, we will use the dict pubmed_records as an attribute of PubmedResult to store PubmedRecord instances using the PMID as key to avoid duplicates.
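The effect of keying the dict by PMID can be seen in a small standalone sketch. The PubmedRecord below is a reduced stand-in with a constructor that takes arguments, which differs from the class used in this tutorial; the titles are placeholders.

```python
class PubmedRecord:
    """Reduced stand-in for the tutorial's record class, holding only two fields."""

    def __init__(self, pmid, title=None):
        self.pmid = pmid
        self.title = title


pubmed_records = {}
for rec in (PubmedRecord('26378223', 'first copy'),
            PubmedRecord('15430309'),
            PubmedRecord('26378223', 'second copy')):
    # Assigning to an existing key overwrites the earlier entry.
    pubmed_records[rec.pmid] = rec

print(len(pubmed_records))
print(pubmed_records['26378223'].title)
```

Note that the most recently added record wins. If the first occurrence should be kept instead, dict.setdefault() would be an alternative.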
Defining PubmedResult and PubmedAnalyzer
From the documentation or publication, we know that entrezpy.base.analyzer.EutilsAnalyzer parses responses and stores results in entrezpy.base.result.EutilsResult. Therefore, we need to derive and adjust these classes for our PubmedResult and PubmedAnalyzer classes. We will add these classes to our program pubmed-fetcher.py. The documentation tells us what the required parameters for each class are and the virtual methods we need to implement.
Implement PubmedResult
We have to extend the virtual methods declared in entrezpy.base.result.EutilsResult. The documentation informs us about the required parameters and expected return values.
In addition, we declare the method PubmedResult.add_pubmed_record() to handle adding new PubMed data record instances as defined in Listing 4. The PubmedResult methods in this tutorial are trivial, so we can implement the class in one go.
class PubmedResult(entrezpy.base.result.EutilsResult):
    """Derive class entrezpy.base.result.EutilsResult to store Pubmed queries.
    Individual Pubmed records are implemented in :class:`PubmedRecord` and
    stored in :ivar:`pubmed_records`.

    :param response: inspected response from :class:`PubmedAnalyzer`
    :param request: the request for the current response
    :ivar dict pubmed_records: storing PubmedRecord instances"""

    def __init__(self, response, request):
        super().__init__(request.eutil, request.query_id, request.db)
        self.pubmed_records = {}

    def size(self):
        """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.size`
        returning the number of stored data records."""
        return len(self.pubmed_records)

    def isEmpty(self):
        """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.isEmpty`
        to query if any records have been stored at all."""
        if not self.pubmed_records:
            return True
        return False

    def get_link_parameter(self, reqnum=0):
        """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.get_link_parameter`.
        Fetching a pubmed record has no intrinsic elink capabilities and therefore
        should inform users about this."""
        print("{} has no elink capability".format(self))
        return {}

    def dump(self):
        """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.dump`.

        :return: instance attributes
        :rtype: dict
        """
        return {self:{'dump':{'pubmed_records':[x for x in self.pubmed_records],
                              'query_id': self.query_id, 'db':self.db,
                              'eutil':self.function}}}

    def add_pubmed_record(self, pubmed_record):
        """The only non-virtual and therefore PubmedResult-specific method to handle
        adding new data records"""
        self.pubmed_records[pubmed_record.pmid] = pubmed_record
- Line 1: inherit the base class entrezpy.base.result.EutilsResult
- Lines 10-12: initialize PubmedResult instance with the required parameters and attributes. We don’t need any information from the response, e.g. WebEnv.
- Lines 14-17: implement entrezpy.base.result.EutilsResult.size()
- Lines 19-24: implement entrezpy.base.result.EutilsResult.isEmpty()
- Lines 26-31: implement entrezpy.base.result.EutilsResult.get_link_parameter()
- Lines 33-41: implement entrezpy.base.result.EutilsResult.dump()
- Lines 43-46: specific PubmedResult method to store individual PubmedRecord instances
Note
Linking PubMed records for subsequent searches is better handled by creating a pipeline performing esearch queries followed by elink queries and a final efetch query. The fetch result for PubMed records has no WebEnv value and is missing the originating database since efetch is usually the last query within a series of Eutils queries. You can test this using the following EDirect pipeline:
$ efetch -db pubmed -id 20148030 | elink -target nuccore
Therefore, we implement a warning informing the user that linking is not possible. Nevertheless, the method could return any parsed information, e.g. nucleotide UIDs, to be used as parameters for a subsequent fetch. However, some features could not be used, e.g. the Entrez history server.
Implementing PubmedAnalyzer
We have to extend the virtual methods declared in entrezpy.base.analyzer.EutilsAnalyzer. The documentation informs us about the required parameters and expected return values.
class PubmedAnalyzer(entrezpy.base.analyzer.EutilsAnalyzer):
    """Derived class of :class:`entrezpy.base.analyzer.EutilsAnalyzer` to analyze and
    parse PubMed responses and requests."""

    def __init__(self):
        super().__init__()

    def init_result(self, response, request):
        """Implemented virtual method :meth:`entrezpy.base.analyzer.init_result`.
        This method initiates a result instance when analyzing the first response"""
        if self.result is None:
            self.result = PubmedResult(response, request)

    def analyze_error(self, response, request):
        """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_error`. Since
        we expect XML errors, just print the error to STDOUT for
        logging/debugging."""
        print(json.dumps({__name__:{'Response': {'dump' : request.dump(),
                                                 'error' : response.getvalue()}}}))

    def analyze_result(self, response, request):
        """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_result`.
        Parse PubMed XML line by line to extract authors and citations.

        xml.etree.ElementTree.iterparse
        (https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse)
        reads the XML file incrementally. Each <PubmedArticle> is cleared after processing.

        .. note:: Adjust this method to include more/different tags to extract.
                  Remember to adjust :class:`.PubmedRecord` as well."""
        self.init_result(response, request)
        isAuthorList = False
        isAuthor = False
        isRefList = False
        isRef = False
        isArticle = False
        medrec = None
        for event, elem in xml.etree.ElementTree.iterparse(response, events=["start", "end"]):
            if event == 'start':  # opening tag: track where we are in the record
                if elem.tag == 'PubmedArticle':
                    medrec = PubmedRecord()
                if elem.tag == 'AuthorList':
                    isAuthorList = True
                if isAuthorList and elem.tag == 'Author':
                    isAuthor = True
                    medrec.authors.append({'fname': None, 'lname': None})
                if elem.tag == 'ReferenceList':
                    isRefList = True
                if isRefList and elem.tag == 'Reference':
                    isRef = True
                if elem.tag == 'Article':
                    isArticle = True
            else:  # closing tag: the element text is complete and can be stored
                if elem.tag == 'PubmedArticle':
                    self.result.add_pubmed_record(medrec)
                    elem.clear()
                if elem.tag == 'AuthorList':
                    isAuthorList = False
                if isAuthorList and elem.tag == 'Author':
                    isAuthor = False
                if elem.tag == 'ReferenceList':
                    isRefList = False
                if elem.tag == 'Reference':
                    isRef = False
                if elem.tag == 'Article':
                    isArticle = False
                if elem.tag == 'PMID':
                    medrec.pmid = elem.text.strip()
                if isAuthor and elem.tag == 'LastName':
                    medrec.authors[-1]['lname'] = elem.text.strip()
                if isAuthor and elem.tag == 'ForeName':
                    medrec.authors[-1]['fname'] = elem.text.strip()
                if isRef and elem.tag == 'Citation':
                    medrec.references.append(elem.text.strip())
                if isArticle and elem.tag == 'AbstractText':
                    if not medrec.abstract:
                        medrec.abstract = elem.text.strip()
                    else:
                        medrec.abstract += elem.text.strip()
                if isArticle and elem.tag == 'ArticleTitle':
                    medrec.title = elem.text.strip()
- Line 1: inherit the base class entrezpy.base.analyzer.EutilsAnalyzer
- Lines 5-6: initialize PubmedAnalyzer instance
- Lines 8-12: implement entrezpy.base.analyzer.EutilsAnalyzer.init_result()
- Lines 14-19: implement entrezpy.base.analyzer.EutilsAnalyzer.analyze_error()
- Lines 21-81: implement entrezpy.base.analyzer.EutilsAnalyzer.analyze_result()
The XML parser is the critical, and most likely the most complex, piece to implement. However, parsing your Entrez results requires a parser in any case. If you already have a parser, you can use an object composition approach [5].
Further, you can add a method in analyze_result to store the processed data in a database or to implement checkpoints.
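As a rough sketch of the database idea, the snippet below stores PMID and title pairs with the standard library's sqlite3 module. The function name store_records, the table name pubmed and its two columns are assumptions made for this example; such a function could be called from analyze_result once a record is complete.

```python
import sqlite3


def store_records(connection, records):
    """Insert (pmid, title) pairs; rows with an already known PMID are replaced."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS pubmed (pmid TEXT PRIMARY KEY, title TEXT)")
    connection.executemany(
        "INSERT OR REPLACE INTO pubmed (pmid, title) VALUES (?, ?)", records)
    connection.commit()


# In-memory database for demonstration; pass a file path to persist the data.
conn = sqlite3.connect(':memory:')
record = ('26378223', 'Cell Walls and the Convergent Evolution of the Viral Envelope.')
store_records(conn, [record])
store_records(conn, [record])  # storing the same PMID twice keeps a single row
stored = conn.execute("SELECT COUNT(*) FROM pubmed").fetchone()[0]
print(stored)
```

Using the PMID as primary key mirrors the dict in PubmedResult: repeated fetches of the same record do not create duplicates.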
Note
Explaining the XML parser is beyond the scope of this tutorial (and there are likely better approaches anyway).
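One possible alternative, shown only as a sketch, listens to end events and reads each completed <PubmedArticle> with findtext() and iter() instead of tracking state flags. The function name parse_articles, the dict layout and the two-record sample are invented for illustration; this is not the parser used in pubmed-fetcher.py.

```python
import io
import xml.etree.ElementTree as ET


def parse_articles(xml_stream):
    """Yield one dict per <PubmedArticle>; missing fields become None or []."""
    for _event, elem in ET.iterparse(xml_stream, events=('end',)):
        if elem.tag != 'PubmedArticle':
            continue
        record = {
            'pmid': elem.findtext('.//PMID'),
            'title': elem.findtext('.//ArticleTitle'),
            'authors': [(a.findtext('LastName'), a.findtext('ForeName'))
                        for a in elem.iter('Author')],
            'references': [c.text for c in elem.iter('Citation')],
        }
        elem.clear()  # release the subtree once it has been processed
        yield record


# Invented two-record sample; the second record lacks authors and references.
SAMPLE = io.BytesIO(b"""<PubmedArticleSet>
<PubmedArticle>
  <PMID>1</PMID>
  <Article>
    <ArticleTitle>First title</ArticleTitle>
    <AuthorList><Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author></AuthorList>
  </Article>
  <ReferenceList><Reference><Citation>Some Journal. 2000;1(1):1-2</Citation></Reference></ReferenceList>
</PubmedArticle>
<PubmedArticle>
  <PMID>2</PMID>
  <Article><ArticleTitle>Second title</ArticleTitle></Article>
</PubmedArticle>
</PubmedArticleSet>""")

records = list(parse_articles(SAMPLE))
for rec in records:
    print(rec)
```

The trade-off is that a whole article is held in memory until its closing tag is seen, which is usually acceptable for PubMed records.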
Putting everything together
The completed implementation is shown in Listing 7.
#!/usr/bin/env python3

import os
import sys
import json
import xml.etree.ElementTree

# If entrezpy is installed using PyPI uncomment the line 'import entrezpy'
# and comment the 'sys.path.insert(...)'
# import entrezpy
sys.path.insert(1, os.path.join(sys.path[0], '../../../src'))
# Import required entrezpy modules
import entrezpy.conduit
import entrezpy.base.result
import entrezpy.base.analyzer


class PubmedRecord:
    """Simple data class to store individual Pubmed records. Individual authors will
    be stored as dict('lname':last_name, 'fname': first_name) in authors.
    Citations as string elements in the list references. """

    def __init__(self):
        self.pmid = None
        self.title = None
        self.abstract = None
        self.authors = []
        self.references = []


class PubmedResult(entrezpy.base.result.EutilsResult):
    """Derive class entrezpy.base.result.EutilsResult to store Pubmed queries.
    Individual Pubmed records are implemented in :class:`PubmedRecord` and
    stored in :ivar:`pubmed_records`.

    :param response: inspected response from :class:`PubmedAnalyzer`
    :param request: the request for the current response
    :ivar dict pubmed_records: storing PubmedRecord instances"""

    def __init__(self, response, request):
        super().__init__(request.eutil, request.query_id, request.db)
        self.pubmed_records = {}

    def size(self):
        """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.size`
        returning the number of stored data records."""
        return len(self.pubmed_records)

    def isEmpty(self):
        """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.isEmpty`
        to query if any records have been stored at all."""
        if not self.pubmed_records:
            return True
        return False

    def get_link_parameter(self, reqnum=0):
        """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.get_link_parameter`.
        Fetching a pubmed record has no intrinsic elink capabilities and therefore
        should inform users about this."""
        print("{} has no elink capability".format(self))
        return {}

    def dump(self):
        """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.dump`.

        :return: instance attributes
        :rtype: dict
        """
        return {self:{'dump':{'pubmed_records':[x for x in self.pubmed_records],
                              'query_id': self.query_id, 'db':self.db,
                              'eutil':self.function}}}

    def add_pubmed_record(self, pubmed_record):
        """The only non-virtual and therefore PubmedResult-specific method to handle
        adding new data records"""
        self.pubmed_records[pubmed_record.pmid] = pubmed_record


class PubmedAnalyzer(entrezpy.base.analyzer.EutilsAnalyzer):
    """Derived class of :class:`entrezpy.base.analyzer.EutilsAnalyzer` to analyze and
    parse PubMed responses and requests."""

    def __init__(self):
        super().__init__()

    def init_result(self, response, request):
        """Implemented virtual method :meth:`entrezpy.base.analyzer.init_result`.
        This method initiates a result instance when analyzing the first response"""
        if self.result is None:
            self.result = PubmedResult(response, request)

    def analyze_error(self, response, request):
        """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_error`. Since
        we expect XML errors, just print the error to STDOUT for
        logging/debugging."""
        print(json.dumps({__name__:{'Response': {'dump' : request.dump(),
                                                 'error' : response.getvalue()}}}))

    def analyze_result(self, response, request):
        """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_result`.
        Parse PubMed XML line by line to extract authors and citations.
        xml.etree.ElementTree.iterparse
        (https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse)
        reads the XML file incrementally. Each <PubmedArticle> is cleared after processing.
        .. note:: Adjust this method to include more/different tags to extract.
                  Remember to adjust :class:`.PubmedRecord` as well."""
        self.init_result(response, request)
        isAuthorList = False
        isAuthor = False
        isRefList = False
        isRef = False
        isArticle = False
        medrec = None
        for event, elem in xml.etree.ElementTree.iterparse(response, events=["start", "end"]):
            if event == 'start':  # opening tag: track where we are in the record
                if elem.tag == 'PubmedArticle':
                    medrec = PubmedRecord()
                if elem.tag == 'AuthorList':
                    isAuthorList = True
                if isAuthorList and elem.tag == 'Author':
                    isAuthor = True
                    medrec.authors.append({'fname': None, 'lname': None})
                if elem.tag == 'ReferenceList':
                    isRefList = True
                if isRefList and elem.tag == 'Reference':
                    isRef = True
                if elem.tag == 'Article':
                    isArticle = True
            else:  # closing tag: the element text is complete and can be stored
                if elem.tag == 'PubmedArticle':
                    self.result.add_pubmed_record(medrec)
                    elem.clear()
                if elem.tag == 'AuthorList':
                    isAuthorList = False
                if isAuthorList and elem.tag == 'Author':
                    isAuthor = False
                if elem.tag == 'ReferenceList':
                    isRefList = False
                if elem.tag == 'Reference':
                    isRef = False
                if elem.tag == 'Article':
                    isArticle = False
                if elem.tag == 'PMID':
                    medrec.pmid = elem.text.strip()
                if isAuthor and elem.tag == 'LastName':
                    medrec.authors[-1]['lname'] = elem.text.strip()
                if isAuthor and elem.tag == 'ForeName':
                    medrec.authors[-1]['fname'] = elem.text.strip()
                if isRef and elem.tag == 'Citation':
                    medrec.references.append(elem.text.strip())
                if isArticle and elem.tag == 'AbstractText':
                    if not medrec.abstract:
                        medrec.abstract = elem.text.strip()
                    else:
                        medrec.abstract += elem.text.strip()
                if isArticle and elem.tag == 'ArticleTitle':
                    medrec.title = elem.text.strip()


def main():
    c = entrezpy.conduit.Conduit(sys.argv[1])
    fetch_pubmed = c.new_pipeline()
    fetch_pubmed.add_fetch({'db':'pubmed', 'id':sys.argv[2].split(','),
                            'retmode':'xml'}, analyzer=PubmedAnalyzer())

    a = c.run(fetch_pubmed)

    #print(a)
    # Testing PubmedResult
    #print("DUMP: {}".format(a.get_result().dump()))
    #print("SIZE: {}".format(a.get_result().size()))
    #print("LINK: {}".format(a.get_result().get_link_parameter()))

    res = a.get_result()
    print("PMID","Title","Abstract","Authors","RefCount", "References", sep='=')
    for i in res.pubmed_records:
        print("{}={}={}={}={}={}".format(res.pubmed_records[i].pmid, res.pubmed_records[i].title,
              res.pubmed_records[i].abstract,
              ';'.join(str(x['lname']+","+x['fname'].replace(' ', '')) for x in res.pubmed_records[i].authors),
              len(res.pubmed_records[i].references),
              ';'.join(x for x in res.pubmed_records[i].references)))
    return 0

if __name__ == '__main__':
    main()
- Line 163: adjust argument processing to allow several comma-separated PMIDs
- Line 164: add our implemented PubmedAnalyzer as parameter to analyze results as described in entrezpy.conduit.Conduit.Pipeline.add_fetch()
- Line 166: run the pipeline and store the analyzer in a
- Lines 168-172: testing methods
- Line 174: get PubmedResult instance
- Lines 175-181: process fetched data records into columns
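The argument processing mentioned for line 163 could also be moved into a small helper that validates its input. The function parse_pmids is hypothetical and not part of pubmed-fetcher.py; it only assumes that PMIDs are plain digit strings separated by commas.

```python
def parse_pmids(argument):
    """Split a comma-separated string into a list of PMID strings."""
    pmids = [p.strip() for p in argument.split(',') if p.strip()]
    invalid = [p for p in pmids if not p.isdigit()]
    if invalid:
        # Fail early instead of sending a malformed request to NCBI.
        raise ValueError("not a PMID: {}".format(', '.join(invalid)))
    return pmids


print(parse_pmids('6,15430309, 26378223'))
```

The returned list could then be passed as the 'id' value of the parameter dictionary given to add_fetch().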
The implementation can be invoked as shown in Listing 8.
$ python pubmed-fetcher.py you@email 6,15430309,31077305,27880757,26378223| column -s= -t |less
You’ll notice that not all data records have all fields. This is because they are missing in these records or some tags have different names.
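When fields are missing, the corresponding attributes stay None, and the output code has to cope with that. The following sketch shows one way to build a row while rendering missing values as empty strings. The helper names format_author and format_row and the sample values are made up; the '=' separator follows the column layout used above.

```python
def format_author(author):
    """Build 'Last,First' from an author dict, treating None as an empty string."""
    last = author.get('lname') or ''
    first = (author.get('fname') or '').replace(' ', '')
    return "{},{}".format(last, first)


def format_row(pmid, title, abstract, authors, references):
    """Join the six output columns with '=', keeping empty columns in place."""
    fields = [pmid, title, abstract,
              ';'.join(format_author(a) for a in authors),
              str(len(references)),
              ';'.join(references)]
    # Render missing values as empty strings so every row keeps six columns.
    return '='.join('' if f is None else f for f in fields)


# Invented record without title, abstract, first name, and references.
row = format_row('6', None, None, [{'lname': 'Doe', 'fname': None}], [])
print(row)
```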
Running pubmed-fetcher.py with UID 20148030 will fail (Listing 9).
$ python pubmed-fetcher.py you@email 20148030
The reason for this can be found in the requested XML. Running the command in Listing 10 hints at the problem. Adjusting and fixing is a task left for interested readers.
$ efetch -db pubmed -id 20148030 -mode xml | grep -A7 \<AuthorList
Footnotes
[1] https://en.wikipedia.org/wiki/Virtual_function
[2] http://www.greenwoodsoftware.com/less/
[3] https://mirrors.edge.kernel.org/pub/linux/utils/util-linux/
[4] https://en.cppreference.com/w/c/language/struct
[5] https://en.wikipedia.org/wiki/Object_composition