Fetching sequence metadata from Entrez
Prerequisites
- Python 3.6 or higher is assumed.
- entrezpy is either installed via PyPI or cloned from the git repository (Installation).
- Basic familiarity with object-oriented Python, i.e. inheritance.
- Read the tutorial Fetching publication information from Entrez.
- The full implementation can be found in the repository at examples/tutorials/seqmetadata/seqmetadata-fetcher.py
Overview
This tutorial explains how to write a simple sequence docsum fetcher using
entrezpy.conduit.Conduit and by adjusting entrezpy.base.result.EutilsResult
and entrezpy.base.analyzer.EutilsAnalyzer. It is based on an esearch followed
by fetching the data as docsum JSON. This tutorial is very similar to
Fetching publication information from Entrez, the main differences being that
it parses JSON and uses two steps in entrezpy.conduit.Conduit. The main steps
are very similar, and the reader should look there for more details.
Outline
- develop an entrezpy.conduit.Conduit pipeline
- implement a Docsum data structure
- inherit entrezpy.base.result.EutilsResult and entrezpy.base.analyzer.EutilsAnalyzer
- implement the required virtual methods
- add methods to derived classes
The Efetch Entrez Utility is NCBI’s utility responsible for
fetching data records. Its manual lists all possible databases and
which records (Record type) can be fetched in which format. We’ll fetch
Docsum data in JSON using the E-utility esummary after performing an
esearch step using accession numbers as query. Instead of using efetch, we
will use esummary and replace the default analyzer with our own.
In entrezpy, a result (or query) is the sum of all individual requests
required to obtain the whole query. esummary fetches data in batches. In this
example, all batches are collected prior to printing the information to standard
output. The method DocsumAnalyzer.analyze_result() can be adjusted to
store or analyze the results from each batch as soon as they are fetched.
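The difference between collecting all batches and handling each batch on arrival can be sketched without entrezpy. The class BatchCollector, its on_batch callback, and the two sample batches below are hypothetical and only mimic the layout of an esummary JSON response; this is a minimal sketch, not the entrezpy implementation.

```python
class BatchCollector:
    """Hypothetical stand-in for an analyzer that receives one response per batch."""

    def __init__(self, on_batch=None):
        self.records = {}          # everything collected so far, keyed by uid
        self.on_batch = on_batch   # optional callback invoked once per batch

    def analyze_result(self, response):
        """Store the records of one batch and optionally process them at once."""
        result = response['result']
        batch = {uid: result[uid] for uid in result['uids']}
        self.records.update(batch)
        if self.on_batch is not None:
            self.on_batch(batch)


# Two made-up batches in the layout of an esummary JSON response
batches = [
    {'result': {'uids': ['1', '2'], '1': {'caption': 'A'}, '2': {'caption': 'B'}}},
    {'result': {'uids': ['3'], '3': {'caption': 'C'}}},
]

seen = []  # captions in the order in which the batches were processed


def remember_captions(batch):
    seen.extend(record['caption'] for record in batch.values())


collector = BatchCollector(on_batch=remember_captions)
for response in batches:
    collector.analyze_result(response)
print(len(collector.records), seen)
```

Without a callback, the collector behaves like the tutorial's approach: records accumulate and are only inspected once every batch has arrived.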
A quick note on virtual functions
entrezpy is heavily based on virtual methods [1]. A virtual method is
declared in the base class but implemented in the derived class. Every
class inheriting the base class has to implement the virtual methods using
the same signature and return the same result type as the base class. To
implement the method in the inherited class, you need to look up the method in
the base class.
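One common way to express a virtual method in Python is a base-class method that raises NotImplementedError until a derived class overrides it. The classes BaseResult and ListResult below are hypothetical and not part of entrezpy; they only illustrate the pattern under that assumption.

```python
class BaseResult:
    """Hypothetical base class declaring a virtual method."""

    def size(self):
        # Declared here, but derived classes are expected to implement it
        raise NotImplementedError("size() must be implemented by the derived class")


class ListResult(BaseResult):
    """Hypothetical derived class implementing the virtual method."""

    def __init__(self, items):
        self.items = list(items)

    def size(self):
        # Same signature and return type (int) as declared in the base class
        return len(self.items)


result = ListResult(['a', 'b', 'c'])
print(result.size())
```

Calling size() on a plain BaseResult instance raises NotImplementedError, which makes a forgotten override visible as soon as the method is used.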
Docsum data structure
Before we start to write our implementation, we need to understand the
structure of the received data. This can be done using the EDirect tools from
NCBI. The result is printed to standard output. To examine it, it can either
be stored in a file or, preferably, piped to a pager, e.g. less [2] or
more [3]. These are installed on most *NIX systems.
$ esearch -db nuccore -query HOU142311 | esummary -mode json
The entry should start and end as shown in Listing 12.
{
  "header": {
    "type": "esummary",
    "version": "0.3"
  },
  "result": {
    "uids": [
      "1110864597"
    ],
    "1110864597": {
      "uid": "1110864597",
      "caption": "KX883530",
      "title": "Beihai levi-like virus 30 strain HOU142311 hypothetical protein genes, complete cds",
      "extra": "gi|1110864597|gb|KX883530.1|",
      "gi": 1110864597,
      "createdate": "2016/12/10",
      "updatedate": "2016/12/10",
      "flags": "",
      "taxid": 1922417,
      "slen": 4084,
      "biomol": "genomic",
      "moltype": "rna",
      "topology": "linear",
      "sourcedb": "insd",
      "segsetsize": "",
      "projectid": "0",
      "genome": "genomic",
      "subtype": "strain|host|country|collection_date",
      "subname": "HOU142311|horseshoe crab|China|2014",
      "assemblygi": "",
      "assemblyacc": "",
      "tech": "",
      "completeness": "",
      "geneticcode": "1",
      "strand": "",
      "organism": "Beihai levi-like virus 30",
      "strain": "HOU142311",
      "biosample": ""
    }
  }
}
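The layout of such a response can be explored with Python's json module. The sketch below uses an abbreviated copy of the response shown above (only a few fields are kept) and walks the "uids" list to reach each record; the variable names are illustrative.

```python
import json

# Abbreviated copy of the esummary response; most fields are left out
raw = '''
{
  "header": {"type": "esummary", "version": "0.3"},
  "result": {
    "uids": ["1110864597"],
    "1110864597": {
      "uid": "1110864597",
      "caption": "KX883530",
      "slen": 4084,
      "subtype": "strain|host|country|collection_date",
      "subname": "HOU142311|horseshoe crab|China|2014"
    }
  }
}
'''

response = json.loads(raw)
# "uids" lists the keys under which the individual records are stored
for uid in response['result']['uids']:
    record = response['result'][uid]
    print(uid, record['caption'], record['slen'])
```

Note that the uid is a string when used as a key, while some values such as slen arrive as JSON numbers.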
The first step is to write a program to fetch the requested records. This can
be done using the entrezpy.conduit.Conduit class.
Simple Conduit pipeline to fetch Docsum records
We will write a simple entrezpy pipeline named seqmetadata-fetcher.py using
entrezpy.conduit.Conduit to test and run our implementations. A simple
entrezpy.conduit.Conduit pipeline requires two arguments:
- user email
- accession numbers
#!/usr/bin/env python3


import os
import sys
import json
import argparse


# If entrezpy is installed using PyPI, uncomment the line 'import entrezpy'
# and comment out the 'sys.path.insert(...)' line
# import entrezpy
sys.path.insert(1, os.path.join(sys.path[0], '../../../src'))
# Import required entrezpy modules
import entrezpy.conduit
import entrezpy.base.result
import entrezpy.base.analyzer


def main():
    ap = argparse.ArgumentParser(description='Simple Sequence Metadata Fetcher. '
                    'Accessions are parsed from STDIN, one accession per line')
    ap.add_argument('--email',
                    type=str,
                    required=True,
                    help='email required by NCBI')
    ap.add_argument('--apikey',
                    type=str,
                    default=None,
                    help='NCBI apikey (optional)')
    ap.add_argument('-db',
                    type=str,
                    required=True,
                    help='Database to search')
    args = ap.parse_args()

    c = entrezpy.conduit.Conduit(args.email)
    fetch_docsum = c.new_pipeline()
    sid = fetch_docsum.add_search({'db':args.db, 'term':','.join([str(x.strip()) for x in sys.stdin])})
    fetch_docsum.add_summary({'rettype':'docsum', 'retmode':'json'},
                             dependency=sid, analyzer=DocsumAnalyzer())
- Lines 1-17: import standard Python libraries and entrezpy modules
- Lines 21-35: set up the argument parser
- Line 37: create a new entrezpy.conduit.Conduit instance with an email address
- Line 38: create a new pipeline instance (entrezpy.conduit.Conduit.new_pipeline())
- Line 39: add a search request to the pipeline with the database name from the
  user-passed argument and a search string assembled from standard input. Store
  the query id in sid (entrezpy.conduit.Conduit.Pipeline.add_search())
- Line 40: add a summary step with the search query as dependency
  (entrezpy.conduit.Conduit.Pipeline.add_summary())
- Not shown in this listing: run the pipeline using entrezpy.conduit.Conduit.run()
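How the search term is assembled from standard input can be tried in isolation. The helper build_term below is hypothetical; unlike the one-liner in the listing, it also skips blank lines, which is an assumption about desirable behaviour rather than something the tutorial does. io.StringIO stands in for sys.stdin.

```python
import io


def build_term(lines):
    """Join stripped, non-empty lines into one comma-separated search term."""
    return ','.join(line.strip() for line in lines if line.strip())


# io.StringIO stands in for sys.stdin so the sketch runs without piped input
fake_stdin = io.StringIO("NC_016134.3\nHOU142311\n\n")
term = build_term(fake_stdin)
print(term)
```

Skipping blank lines avoids a trailing comma in the term when the input file ends with an empty line.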
We need to implement the DocsumAnalyzer, but first we have to design a Docsum
data structure.
How to store Docsum data records
The data records can be stored in different ways, but using a class facilitates
collecting and retrieving the requested data. We implement a simple class
(analogous to a C/C++ struct [4]) to represent a Docsum record.
Because we fetch data in JSON format, the class performs rather dull parsing.
The nested Subtype class handles the subtype and subname attributes
in a Docsum response.
class Docsum:
    """Simple data class to store individual sequence Docsum records."""

    class Subtype:

        def __init__(self, subtype, subname):
            self.strain = None
            self.host = None
            self.country = None
            self.collection = None
            self.collection_date = None
            # subtype and subname are parallel lists: name at i, value at i
            for i in range(len(subtype)):
                if subtype[i] == 'strain':
                    self.strain = subname[i]
                if subtype[i] == 'host':
                    self.host = subname[i]
                if subtype[i] == 'country':
                    self.country = subname[i]
                if subtype[i] == 'collection_date':
                    self.collection_date = subname[i]

    def __init__(self, json_docsum):
        self.uid = int(json_docsum['uid'])
        self.caption = json_docsum['caption']
        self.title = json_docsum['title']
        self.extra = json_docsum['extra']
        self.gi = int(json_docsum['gi'])
        self.taxid = int(json_docsum['taxid'])
        self.slen = int(json_docsum['slen'])
        self.biomol = json_docsum['biomol']
        self.moltype = json_docsum['moltype']
        self.topology = json_docsum['topology']
        self.sourcedb = json_docsum['sourcedb']
        self.segsetsize = json_docsum['segsetsize']
        self.projectid = int(json_docsum['projectid'])
        self.genome = json_docsum['genome']
        self.subtype = Docsum.Subtype(json_docsum['subtype'].split('|'),
                                      json_docsum['subname'].split('|'))
        self.assemblygi = json_docsum['assemblygi']
        self.assemblyacc = json_docsum['assemblyacc']
        self.tech = json_docsum['tech']
        self.completeness = json_docsum['completeness']
        self.geneticcode = int(json_docsum['geneticcode'])
        self.strand = json_docsum['strand']
        # organism and strain are stored on their own; strand is left untouched
        self.organism = json_docsum['organism']
        self.strain = json_docsum['strain']
        self.accessionversion = json_docsum['accessionversion']
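The pairing of the pipe-separated subtype and subname strings can also be sketched with zip. The helper parse_subtype is hypothetical and returns a plain dictionary instead of a Subtype instance; the input strings are taken from the example record above.

```python
def parse_subtype(subtype, subname):
    """Pair pipe-separated subtype names with their pipe-separated values."""
    return dict(zip(subtype.split('|'), subname.split('|')))


fields = parse_subtype("strain|host|country|collection_date",
                       "HOU142311|horseshoe crab|China|2014")
# Names that are absent from the record simply have no key
print(fields['host'], fields.get('isolate'))
```

A dictionary keeps every name the record provides, whereas the Subtype class keeps only the attributes it knows about; which is preferable depends on how the data is used later.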
Implement DocsumResult
We have to implement the virtual methods declared in
entrezpy.base.result.EutilsResult. The documentation informs us about the
required parameters and expected return values.
In addition, we declare the method DocsumResult.add_docsum() to
handle adding new Docsum data record instances as defined in
Listing 14. The DocsumResult methods in this
tutorial are trivial and we can implement the class in one go.
class DocsumResult(entrezpy.base.result.EutilsResult):
    """Derive class entrezpy.base.result.EutilsResult to store Docsum queries.
    Individual Docsum records are implemented in :class:`Docsum` and
    stored in :ivar:`docsums`.

    :param response: inspected response from :class:`DocsumAnalyzer`
    :param request: the request for the current response
    :ivar dict docsums: storing Docsum instances"""

    def __init__(self, response, request):
        super().__init__(request.eutil, request.query_id, request.db)
        self.docsums = {}

    def size(self):
        """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.size`
        returning the number of stored data records."""
        return len(self.docsums)

    def isEmpty(self):
        """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.isEmpty`
        to query if any records have been stored at all."""
        if not self.docsums:
            return True
        return False

    def get_link_parameter(self, reqnum=0):
        """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.get_link_parameter`.
        Fetching summary records has no intrinsic elink capabilities and therefore
        should inform users about this."""
        print("{} has no elink capability".format(self))
        return {}

    def dump(self):
        """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.dump`.

        :return: instance attributes
        :rtype: dict
        """
        return {self:{'dump':{'docsum_records':[x for x in self.docsums],
                              'query_id': self.query_id, 'db':self.db,
                              'eutil':self.function}}}

    def add_docsum(self, docsum):
        """The only non-virtual and therefore DocsumResult-specific method to handle
        adding new data records"""
        self.docsums[docsum.uid] = docsum
- Line 1: inherit the base class entrezpy.base.result.EutilsResult
- Lines 10-12: initialize the DocsumResult instance with the required
  parameters and attributes. We don’t need any information from the response, e.g. WebEnv.
- Lines 14-17: implement entrezpy.base.result.EutilsResult.size()
- Lines 19-24: implement entrezpy.base.result.EutilsResult.isEmpty()
- Lines 26-31: implement entrezpy.base.result.EutilsResult.get_link_parameter()
- Lines 33-41: implement entrezpy.base.result.EutilsResult.dump()
- Lines 43-46: DocsumResult-specific method to store individual Docsum instances
Note
The fetch result for Docsum records has no WebEnv value and is
missing the originating database since esummary is usually the last
query within a series of Eutils queries. Therefore, we implement a
warning informing the user that linking is not possible.
Implementing DocsumAnalyzer
We have to implement the virtual methods declared in
entrezpy.base.analyzer.EutilsAnalyzer. The documentation informs us about the
required parameters and expected return values.
class DocsumAnalyzer(entrezpy.base.analyzer.EutilsAnalyzer):
    """Derived class of :class:`entrezpy.base.analyzer.EutilsAnalyzer` to analyze and
    parse Docsum responses and requests."""

    def __init__(self):
        super().__init__()

    def init_result(self, response, request):
        """Implement virtual method :meth:`entrezpy.base.analyzer.EutilsAnalyzer.init_result`.
        This method initiates a result instance when analyzing the first response"""
        if self.result is None:
            self.result = DocsumResult(response, request)

    def analyze_error(self, response, request):
        """Implement virtual method :meth:`entrezpy.base.analyzer.EutilsAnalyzer.analyze_error`.
        Since we expect JSON, just print the error to STDOUT as string."""
        print(json.dumps({__name__:{'Response': {'dump' : request.dump(),
                                                 'error' : response}}}))

    def analyze_result(self, response, request):
        """Implement virtual method :meth:`entrezpy.base.analyzer.EutilsAnalyzer.analyze_result`.
        The result is a JSON structure and allows easy parsing"""
        self.init_result(response, request)
        for i in response['result']['uids']:
            self.result.add_docsum(Docsum(response['result'][i]))
- Line 1: inherit the base class entrezpy.base.analyzer.EutilsAnalyzer
- Lines 5-6: initialize the DocsumAnalyzer instance
- Lines 8-12: implement entrezpy.base.analyzer.EutilsAnalyzer.init_result()
- Lines 14-18: implement entrezpy.base.analyzer.EutilsAnalyzer.analyze_error()
- Lines 20-25: implement entrezpy.base.analyzer.EutilsAnalyzer.analyze_result()
Compared to the pubmed analyzer, parsing the JSON
output is very easy. If you already have a parser, you can use an object
composition approach [5]. Further, you can add a method in
analyze_result to store the processed data in a database or
to implement checkpoints.
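The object composition idea can be sketched as follows: the analyzer owns a parser object and delegates to it instead of inheriting from it. UidParser and ComposedAnalyzer are hypothetical classes that do not derive from any entrezpy class; in a real analyzer the same delegation would happen inside analyze_result.

```python
class UidParser:
    """Hypothetical existing parser that knows the response layout."""

    def parse(self, response):
        # Return the individual records in the order given by "uids"
        return [response['result'][uid] for uid in response['result']['uids']]


class ComposedAnalyzer:
    """Hypothetical analyzer that owns a parser instead of inheriting from it."""

    def __init__(self, parser):
        self.parser = parser   # composition: the parser is a member
        self.captions = []

    def analyze_result(self, response):
        # Delegate parsing, then keep only what this analyzer cares about
        for record in self.parser.parse(response):
            self.captions.append(record['caption'])


analyzer = ComposedAnalyzer(UidParser())
analyzer.analyze_result({'result': {'uids': ['7'], '7': {'caption': 'KX883530'}}})
print(analyzer.captions)
```

Because the parser is passed in, it can be replaced or tested on its own without touching the analyzer.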
Putting everything together
The completed implementation is shown in Listing 17.
#!/usr/bin/env python3


import os
import sys
import json
import argparse


# If entrezpy is installed using PyPI, uncomment the line 'import entrezpy'
# and comment out the 'sys.path.insert(...)' line
# import entrezpy
sys.path.insert(1, os.path.join(sys.path[0], '../../../src'))
# Import required entrezpy modules
import entrezpy.conduit
import entrezpy.base.result
import entrezpy.base.analyzer


class Docsum:
    """Simple data class to store individual sequence Docsum records."""

    class Subtype:

        def __init__(self, subtype, subname):
            self.strain = None
            self.host = None
            self.country = None
            self.collection = None
            self.collection_date = None
            # subtype and subname are parallel lists: name at i, value at i
            for i in range(len(subtype)):
                if subtype[i] == 'strain':
                    self.strain = subname[i]
                if subtype[i] == 'host':
                    self.host = subname[i]
                if subtype[i] == 'country':
                    self.country = subname[i]
                if subtype[i] == 'collection_date':
                    self.collection_date = subname[i]

    def __init__(self, json_docsum):
        self.uid = int(json_docsum['uid'])
        self.caption = json_docsum['caption']
        self.title = json_docsum['title']
        self.extra = json_docsum['extra']
        self.gi = int(json_docsum['gi'])
        self.taxid = int(json_docsum['taxid'])
        self.slen = int(json_docsum['slen'])
        self.biomol = json_docsum['biomol']
        self.moltype = json_docsum['moltype']
        self.topology = json_docsum['topology']
        self.sourcedb = json_docsum['sourcedb']
        self.segsetsize = json_docsum['segsetsize']
        self.projectid = int(json_docsum['projectid'])
        self.genome = json_docsum['genome']
        self.subtype = Docsum.Subtype(json_docsum['subtype'].split('|'),
                                      json_docsum['subname'].split('|'))
        self.assemblygi = json_docsum['assemblygi']
        self.assemblyacc = json_docsum['assemblyacc']
        self.tech = json_docsum['tech']
        self.completeness = json_docsum['completeness']
        self.geneticcode = int(json_docsum['geneticcode'])
        self.strand = json_docsum['strand']
        # organism and strain are stored on their own; strand is left untouched
        self.organism = json_docsum['organism']
        self.strain = json_docsum['strain']
        self.accessionversion = json_docsum['accessionversion']


class DocsumResult(entrezpy.base.result.EutilsResult):
    """Derive class entrezpy.base.result.EutilsResult to store Docsum queries.
    Individual Docsum records are implemented in :class:`Docsum` and
    stored in :ivar:`docsums`.

    :param response: inspected response from :class:`DocsumAnalyzer`
    :param request: the request for the current response
    :ivar dict docsums: storing Docsum instances"""

    def __init__(self, response, request):
        super().__init__(request.eutil, request.query_id, request.db)
        self.docsums = {}

    def size(self):
        """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.size`
        returning the number of stored data records."""
        return len(self.docsums)

    def isEmpty(self):
        """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.isEmpty`
        to query if any records have been stored at all."""
        if not self.docsums:
            return True
        return False

    def get_link_parameter(self, reqnum=0):
        """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.get_link_parameter`.
        Fetching summary records has no intrinsic elink capabilities and therefore
        should inform users about this."""
        print("{} has no elink capability".format(self))
        return {}

    def dump(self):
        """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.dump`.

        :return: instance attributes
        :rtype: dict
        """
        return {self:{'dump':{'docsum_records':[x for x in self.docsums],
                              'query_id': self.query_id, 'db':self.db,
                              'eutil':self.function}}}

    def add_docsum(self, docsum):
        """The only non-virtual and therefore DocsumResult-specific method to handle
        adding new data records"""
        self.docsums[docsum.uid] = docsum


class DocsumAnalyzer(entrezpy.base.analyzer.EutilsAnalyzer):
    """Derived class of :class:`entrezpy.base.analyzer.EutilsAnalyzer` to analyze and
    parse Docsum responses and requests."""

    def __init__(self):
        super().__init__()

    def init_result(self, response, request):
        """Implement virtual method :meth:`entrezpy.base.analyzer.EutilsAnalyzer.init_result`.
        This method initiates a result instance when analyzing the first response"""
        if self.result is None:
            self.result = DocsumResult(response, request)

    def analyze_error(self, response, request):
        """Implement virtual method :meth:`entrezpy.base.analyzer.EutilsAnalyzer.analyze_error`.
        Since we expect JSON, just print the error to STDOUT as string."""
        print(json.dumps({__name__:{'Response': {'dump' : request.dump(),
                                                 'error' : response}}}))

    def analyze_result(self, response, request):
        """Implement virtual method :meth:`entrezpy.base.analyzer.EutilsAnalyzer.analyze_result`.
        The result is a JSON structure and allows easy parsing"""
        self.init_result(response, request)
        for i in response['result']['uids']:
            self.result.add_docsum(Docsum(response['result'][i]))


def main():
    ap = argparse.ArgumentParser(description='Simple Sequence Metadata Fetcher. '
                    'Accessions are parsed from STDIN, one accession per line')
    ap.add_argument('--email',
                    type=str,
                    required=True,
                    help='email required by NCBI')
    ap.add_argument('--apikey',
                    type=str,
                    default=None,
                    help='NCBI apikey (optional)')
    ap.add_argument('-db',
                    type=str,
                    required=True,
                    help='Database to search')
    args = ap.parse_args()

    c = entrezpy.conduit.Conduit(args.email)
    fetch_docsum = c.new_pipeline()
    sid = fetch_docsum.add_search({'db':args.db, 'term':','.join([str(x.strip()) for x in sys.stdin])})
    fetch_docsum.add_summary({'rettype':'docsum', 'retmode':'json'},
                             dependency=sid, analyzer=DocsumAnalyzer())

    # Run the pipeline and print selected attributes of every fetched record
    docsums = c.run(fetch_docsum).get_result().docsums
    for i in docsums:
        print(i, docsums[i].uid, docsums[i].caption, docsums[i].strain, docsums[i].subtype.host)
    return 0


if __name__ == '__main__':
    main()
The implementation can be invoked as shown in Listing 18.
$ echo "NC_016134.3" > accs
$ echo "HOU142311" >> accs
$ cat accs | python seqmetadata-fetcher.py --email email -db nuccore
Footnotes
[1] https://en.wikipedia.org/wiki/Virtual_function
[2] http://www.greenwoodsoftware.com/less/
[3] https://mirrors.edge.kernel.org/pub/linux/utils/util-linux/
[4] https://en.cppreference.com/w/c/language/struct
[5] https://en.wikipedia.org/wiki/Object_composition