Entrezpy: NCBI Entrez databases at your fingertips¶
Synopsis¶
$ pip install entrezpy --user
>>> import entrezpy.conduit
>>> c = entrezpy.conduit.Conduit('myemail')
>>> fetch_influenza = c.new_pipeline()
>>> sid = fetch_influenza.add_search({'db' : 'nucleotide', 'term' : 'H3N2 [organism] AND HA', 'rettype':'count', 'sort' : 'Date Released', 'mindate': 2000, 'maxdate':2019, 'datetype' : 'pdat'})
>>> fid = fetch_influenza.add_fetch({'retmax' : 10, 'retmode' : 'text', 'rettype': 'fasta'}, dependency=sid)
>>> c.run(fetch_influenza)
Entrezpy is a dedicated Python library to interact with NCBI Entrez databases [Entrez2016] via the E-Utilities [Sayers2018]. Entrezpy facilitates the implementation of queries to search or download data from the Entrez databases, e.g. searching for specific sequences or publications, or fetching your favorite genome. For more complex queries, entrezpy offers the class entrezpy.conduit.Conduit to run query pipelines or reuse previous queries.
Supported E-Utility functions:
- Entrez pipeline design helper class: Conduit module
- NCBI Entrez utilities and associated parameters: https://dataguide.nlm.nih.gov/eutilities/utilities.html
entrezpy publication: [Buchmann2019]
Licence and Copyright¶
entrezpy is licensed under the GNU Lesser General Public License v3 (LGPLv3) or later.
Concerning the copyright of the material available through E-Utilities, please read their disclaimer and copyright statement at https://www.ncbi.nlm.nih.gov/home/about/policies/.
Contact¶
To report bugs and/or errors, please open an issue at https://gitlab.com/ncbipy/entrezpy or contact me at: jan.buchmann@sydney.edu.au
Of course, feel free to fork the code, improve it, and/or open a pull request.
NCBI API key¶
NCBI offers API keys to allow more requests per second. For more details and the rationale, see [Sayers2018]. entrezpy checks for NCBI API keys as follows:
- The NCBI API key can be passed as a parameter to entrezpy classes.
- Entrezpy checks for the environment variable $NCBI_API_KEY.
- The name of the environment variable, e.g. NCBI_API_KEY, can be passed via the apikey_var parameter to any class derived from entrezpy.base.query.EutilsQuery.
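For example, a minimal sketch of both options using the Esearcher class described later in this manual (the email and API key values are placeholders):

import entrezpy.esearch.esearcher

# Option 1: pass the API key directly (placeholder key shown).
e = entrezpy.esearch.esearcher.Esearcher('mytool', 'myemail', apikey='0123456789abcdef')

# Option 2: point entrezpy to the environment variable storing the key.
e = entrezpy.esearch.esearcher.Esearcher('mytool', 'myemail', apikey_var='NCBI_API_KEY')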
Work in progress¶
- easier logging configuration via file
- simplify Elink results
- Deploy cleaner testing
- Status indication of requests
Manual¶
Installation¶
entrezpy can be installed or included into your own pipeline using two approaches: PyPi or appending to sys.path.
Requirements¶
Python version >= 3.6
- Python Standard Library :
The standard library should be installed with Python. Just in case, these modules from the Python Standard Library are required:
- base64
- io
- json
- logging
- math
- os
- queue
- random
- socket
- sys
- threading
- time
- urllib
- uuid
- xml.etree.ElementTree
Test your Python version¶
Test if you have at least Python 3.6:
$ python
>>> import sys
>>> sys.version_info
sys.version_info(major=3, minor=6, micro=6, releaselevel='final', serial=0)
PyPi¶
Install entrezpy
via PyPi and check:
$ pip install entrezpy --user
Test if we can import entrezpy:
$ python
>>> import entrezpy
Append to sys.path¶
Add entrezpy to your pipeline via sys.path. This requires cloning the source code and adjusting sys.path.
Assuming the following directory structure, where entrezpy was cloned into include:
$ git clone https://gitlab.com/ncbipy/entrezpy.git project_root/include
project_root
|
|-- src
| `-- pipeline.py
`-- include
`-- entrezpy
`-- src
`-- entrezpy
`-- efetch
Import the module efetcher in pipeline.py by adjusting sys.path in project_root/src/pipeline.py:
import os
import sys

sys.path.insert(1, os.path.join(sys.path[0], '../include/entrezpy/src'))
import entrezpy.efetch.efetcher

ef = entrezpy.efetch.efetcher.Efetcher('toolname', 'email')
Test entrezpy¶
Run the examples in the git repository in entrezpy/examples, e.g.:
$ ./path/to/entrezpy/examples/entrezpy-example.elink.py --email you@email
To adjust the examples for testing an installation via PyPi, remove the sys.path line in the examples prior to invoking them, e.g.:
for i in entrezpy/examples/*.py; do \
fname=$(basename $i | sed 's/\.py/\.adjust.py/'); \
sed '/sys.path.insert/d' $i > $fname; \
chmod +x $fname; \
done;
The examples print the results onto the standard output and additional information onto standard error. Currently, we propose running the examples and redirecting standard error to a file. For example, to test efetch, run examples/entrezpy-example.efetch.py as follows:
./examples/entrezpy-example.efetch.py --email you@email 2> efetch.stderr
efetch.stderr can be monitored as follows:
tail -f efetch.stderr
Entrezpy tutorials¶
Esearch¶
Esearch searches the specified Entrez database for data records matching the query. It can return the found UIDs or a WebEnv/query_key pair referencing the UIDs.
Esearch returning UIDs¶
Search the nucleotide database for virus sequences and fetch the first 110,000 UIDs.
- Create an Esearcher instance
- Run the query and store the analyzer
- Print the fetched UIDs
import entrezpy.esearch.esearcher

e = entrezpy.esearch.esearcher.Esearcher('esearcher', 'email')
a = e.inquire({'db':'nucleotide', 'term':'viruses[orgn]', 'retmax': 110000, 'rettype': 'uilist'})
print(a.get_result().uids)
- Line 1: Import the esearcher module
- Line 3: Instantiate an Esearcher instance with the required parameters tool (here 'esearcher') and email
- Line 4: Run the query to search the database nucleotide using the term viruses[orgn], limit the result to the first 110,000 UIDs, and request UIDs. Store the returned default analyzer in a.
- Line 5: Print the fetched UIDs
Esearch returning History server reference to UIDs¶
Same example as above, but WebEnv and query_key are returned in place of UIDs. By default, entrezpy uses the History server (setting the POST parameter usehistory=y), so this parameter does not need to be passed explicitly. A sketch of reusing such a reference in a follow-up fetch follows after the listing.
import entrezpy.esearch.esearcher

e = entrezpy.esearch.esearcher.Esearcher('esearcher', 'email')
a = e.inquire({'db':'nucleotide', 'term':'viruses[orgn]', 'retmax': 110000})
print(a.size())
print(a.reference().webenv, a.reference().querykey)
- Line 1: Import the esearcher module
- Line 3: Instantiate an Esearcher instance with the required parameters tool (here 'esearcher') and email
- Line 4: Run the query to search the database nucleotide using the term viruses[orgn] and limit the result to the first 110,000 UIDs. Store the returned default analyzer in a.
- Line 5: Print the number of fetched UIDs, which should be 0
- Line 6: Print the WebEnv and query_key
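As a sketch of reusing this History server reference in a subsequent fetch (the a.reference() attributes follow the listing above, and Efetcher.inquire() is assumed to mirror Esearcher.inquire(); the exact parameters may need adjusting):

import entrezpy.efetch.efetcher

# Sketch: fetch the first 10 sequences referenced by the Esearch analyzer `a` above.
ef = entrezpy.efetch.efetcher.Efetcher('efetcher', 'email')
fetcher = ef.inquire({'db' : 'nucleotide',
                      'WebEnv' : a.reference().webenv,      # assumed attribute, see listing above
                      'query_key' : a.reference().querykey, # assumed attribute, see listing above
                      'retmax' : 10,
                      'retmode' : 'text',
                      'rettype' : 'fasta'})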
Conduit¶
The Conduit module facilitates creating pipelines that link individual E-Utility requests, e.g. linking the results of an Esearch to the corresponding nucleotide data records.
Conduit pipelines¶
Conduit pipelines store a sequence of E-Utility queries. Let’s create a simple Conduit pipeline to fetch virus nucleotide sequences. This requires (i) searching the nucleotide database, which returns the found UIDs (data records), and (ii) fetching the found UIDs.
- The first step in the pipeline is to search the Entrez nucleotide database for virus sequences (Line 6). We add a search query to the pipeline and store its id for later use. We set the parameter rettype to count to avoid downloading the UIDs. The result tells us how many UIDs were found and references the Entrez History server, which we can use later to fetch the sequences.
- The second step in our pipeline is the actual download of the found sequences. We add a fetch step to our pipeline and use the id of the search step as dependency. Conduit will automatically set the ‘db’, ‘WebEnv’ and ‘query_key’ parameters for the fetch step. In addition, we specify that we want the sequences as FASTA in text format.
- The last step is to run the queries in the pipeline. This is done by passing the pipeline to Conduit’s run method, which performs the requests. If no request errors occurred, Conduit returns the default analyzer for this type of query. Since this uses the default Efetch analyzer, results are just printed to the standard output.
import entrezpy.conduit

w = entrezpy.conduit.Conduit('email')
get_sequences = w.new_pipeline()

sid = get_sequences.add_search({'db' : 'nucleotide', 'term' : 'viruses[Organism]', 'rettype' : 'count'})
get_sequences.add_fetch({'retmode' : 'text', 'rettype' : 'fasta'}, dependency=sid)

analyzer = w.run(get_sequences)
- Line 1: Import the conduit module
- Line 3: Create a Conduit instance with the required email address
- Line 4: Create a new pipeline and store it in get_sequences
- Line 6: Add the search query to the pipeline and store its id in sid
- Line 7: Add the fetch query to the pipeline
- Line 9: Run the pipeline and store the resulting analyzer
Linking within and between Entrez databases¶
Using multiple links in a Conduit pipeline requires running an Esearch after each Elink query to keep track of the proper UIDs. This is a quirk of the E-Utilities (EDirect uses the same trick).
- Search the PubMed Entrez database
- Increase the number of possible UIDs by searching PubMed again, using the first UIDs to find publications linked to the initial search
- Link the PubMed UIDs to nuccore UIDs
- Fetch the found UIDs from nuccore
The following code shows how to use multiple links within a Conduit pipeline.
import entrezpy.conduit

w = entrezpy.conduit.Conduit('email')
find_genomes = w.new_pipeline()

sid = find_genomes.add_search({'db':'pubmed', 'term' : 'capsid AND infection', 'rettype':'count'})

lid1 = find_genomes.add_link({'cmd':'neighbor_history', 'db':'pubmed'}, dependency=sid)
lid1 = find_genomes.add_search({'rettype': 'count', 'cmd':'neighbor_history'}, dependency=lid1)

lid2 = find_genomes.add_link({'db':'nuccore', 'cmd':'neighbor_history'}, dependency=lid1)
lid2 = find_genomes.add_search({'rettype': 'count', 'cmd':'neighbor_history'}, dependency=lid2)

find_genomes.add_fetch({'retmode':'xml', 'rettype':'fasta'}, dependency=lid2)
a = w.run(find_genomes)
- Lines 1-4: Analogous to the steps shown in Conduit pipelines
- Line 6: Add a search query in the Entrez database pubmed to the Conduit pipeline without downloading UIDs and store its id in sid
- Line 8: Add a link query to the Conduit pipeline to link the UIDs found in search sid to pubmed and store the result on the History server. Store the query in lid1.
- Line 9: Update the link results for later use and store them on the History server. Overwrite lid1 with the updated query.
- Line 11: Link the PubMed UIDs to nuccore and store them on the History server. Store the query in lid2.
- Line 12: Update the link results for later use and store them on the History server. Overwrite lid2 with the updated query.
- Line 14: Add the fetch step to the Conduit pipeline with the last link result as dependency. Request the data as FASTA sequences in XML format (Tinyseq XML).
- Line 15: Run the pipeline.
Extending entrezpy¶
entrezpy can be extended by inheriting its base classes. This is typically the case when the final step is to fetch data records and do something with them, e.g. processing them for a database or parsing them for specific information.
Fetching publication information from Entrez¶
Prerequisites
- Python 3.6 or higher is assumed.
- entrezpy is either installed via PyPi or cloned from the git repository (Installation).
- Basic familiarity with object-oriented Python, i.e. inheritance.
- The full implementation can be found in the repository at examples/tutorials/pubmed/pubmed-fetcher.py
Acknowledgment
I’d like to thank Pedram Hosseini (pdr[dot]hosseini[at]gmail[dot]com) for pointing out the requirement for this tutorial.
Overview¶
This tutorial explains how to write a simple PubMed data record fetcher using entrezpy.conduit.Conduit and by adjusting entrezpy.base.result.EutilsResult and entrezpy.base.analyzer.EutilsAnalyzer.
Outline
- develop an entrezpy.conduit.Conduit pipeline
- implement a PubMed data structure
- inherit entrezpy.base.result.EutilsResult and entrezpy.base.analyzer.EutilsAnalyzer
- implement the required virtual methods
- add methods to the derived classes
The Efetch Entrez Utility is NCBI’s utility responsible for fetching data records. Its manual lists all possible databases and which records (Record type) can be fetched in which format. For the first example, we’ll fetch PubMed data in XML, specifically the UID, authors, title, abstract, and citations. We will test and develop the pipeline using the article with PubMed ID (PMID) 26378223 because it has all the required fields. In the end we will see that not all fields are always present.
In entrezpy, a result is the sum of all individual requests required to answer the whole query. If you want to analyze the number of citations for a specific author, the result is the number of citations obtained through the query. To obtain this final number, you have to parse several PubMed records. Therefore, entrezpy requires a result class derived from entrezpy.base.result.EutilsResult to store the partial results obtained from a query.
A quick note on virtual functions
entrezpy
is heavily based on virtual methods [1]. A virtual method is
declared in the base class but implemented in the derived class. Every
class inheriting the base class has to implement the virtual functions using
the same signature and return the same result type as the base class. To
implement the method in the inherited class, you need to look up the method in
the base class.
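As a plain Python illustration of this pattern (not entrezpy code), a virtual method can be declared by raising NotImplementedError in the base class and overriding it in the derived class:

class BaseResult:
  def size(self):
    """Virtual method: every derived class must implement this and return an int."""
    raise NotImplementedError

class MyResult(BaseResult):
  def size(self):
    """Implementation with the same signature and return type as the base class."""
    return 0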
PubMed data structure¶
Before we start to write our implementation, we need to understand the structure of the received data. This can be done using the EDirect tools from NCBI. The result is printed to the standard output. For its examination, it can either be stored in a file or, preferably, piped to a pager, e.g. less [2] or more [3]. These are usually installed on most *NIX systems.
Fetching the PubMed record for PMID 26378223 using EDirect’s efetch¶
$ efetch -db pubmed -id 26378223 -mode XML | less
The entry should start and end as shown in Listing 2.
<?xml version="1.0" ?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2019//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">
<PubmedArticleSet>
<PubmedArticle>
<!-- SKIPPED DATA -->
<Article PubModel="Print">
<!-- SKIPPED DATA -->
<ArticleTitle>Cell Walls and the Convergent Evolution of the Viral Envelope.</ArticleTitle>
<!-- SKIPPED DATA -->
<Abstract>
<AbstractText>Why some viruses are enveloped while others lack an outer lipid bilayer is a major question in viral evolution but one that has received relatively little attention. The viral envelope serves several functions, including protecting the RNA or DNA molecule(s), evading recognition by the immune system, and facilitating virus entry. Despite these commonalities, viral envelopes come in a wide variety of shapes and configurations. The evolution of the viral envelope is made more puzzling by the fact that nonenveloped viruses are able to infect a diverse range of hosts across the tree of life. We reviewed the entry, transmission, and exit pathways of all (101) viral families on the 2013 International Committee on Taxonomy of Viruses (ICTV) list. By doing this, we revealed a strong association between the lack of a viral envelope and the presence of a cell wall in the hosts these viruses infect. We were able to propose a new hypothesis for the existence of enveloped and nonenveloped viruses, in which the latter represent an adaptation to cells surrounded by a cell wall, while the former are an adaptation to animal cells where cell walls are absent. In particular, cell walls inhibit viral entry and exit, as well as viral transport within an organism, all of which are critical waypoints for successful infection and spread. Finally, we discuss how this new model for the origin of the viral envelope impacts our overall understanding of virus evolution. </AbstractText>
<CopyrightInformation>Copyright © 2015, American Society for Microbiology. All Rights Reserved.</CopyrightInformation>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Buchmann</LastName>
<ForeName>Jan P</ForeName>
<Initials>JP</Initials>
<AffiliationInfo>
<Affiliation>Marie Bashir Institute for Infectious Diseases and Biosecurity, Charles Perkins Centre, School of Biological Sciences, and Sydney Medical School, The University of Sydney, Sydney, New South Wales, Australia.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Holmes</LastName>
<ForeName>Edward C</ForeName>
<Initials>EC</Initials>
<AffiliationInfo>
<Affiliation>Marie Bashir Institute for Infectious Diseases and Biosecurity, Charles Perkins Centre, School of Biological Sciences, and Sydney Medical School, The University of Sydney, Sydney, New South Wales, Australia edward.holmes@sydney.edu.au.</Affiliation>
</AffiliationInfo>
</Author>
</AuthorList>
<!-- SKIPPED DATA -->
</Article>
<!-- SKIPPED DATA -->
<ReferenceList>
<Reference>
<Citation>Nature. 2014 Jan 16;505(7483):432-5</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">24336205</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Crit Rev Microbiol. 1988;15(4):339-89</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">3060317</ArticleId>
</ArticleIdList>
</Reference>
<!-- SKIPPED DATA -->
</ReferenceList>
</PubmedData>
</PubmedArticle>
</PubmedArticleSet>
This shows us the XML fields, specifically the tags, present in a typical PubMed record. The root tag for each batch of fetched data records is <PubmedArticleSet> and each individual data record is described within the nested tag <PubmedArticle>. We are interested in the following tags nested within <PubmedArticle> (a short stand-alone parsing sketch follows the list):
<ArticleTitle>
<Abstract>
<AuthorList>
<ReferenceList>
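As a short stand-alone sketch (independent of entrezpy; the file name is hypothetical and refers to the XML saved from the efetch command above), these tags can be read incrementally with xml.etree.ElementTree.iterparse:

import xml.etree.ElementTree

# Hypothetical file: the XML record saved from the EDirect efetch command above.
for event, elem in xml.etree.ElementTree.iterparse('pubmed-26378223.xml', events=['start', 'end']):
  if event == 'end' and elem.tag == 'ArticleTitle':
    print(elem.text)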
The first step is to write a program to fetch the requested records. This can
be done using the entrezpy.conduit.Conduit
class.
Simple Conduit pipeline to fetch PubMed Records¶
We will write a simple entrezpy pipeline named pubmed-fetcher.py using entrezpy.conduit.Conduit to test and run our implementations. A simple entrezpy.conduit.Conduit pipeline requires two arguments:
- user email
- PMID (here 15430309)
An entrezpy.conduit.Conduit pipeline to fetch PubMed data records. The required arguments are positional arguments given at the command line.¶
#!/usr/bin/env python3

import os
import sys

"""
If entrezpy is installed using PyPi uncomment the line 'import entrezpy' and
comment the 'sys.path.insert(...)'
"""
# import entrezpy
sys.path.insert(1, os.path.join(sys.path[0], '../../../src'))

# Import required entrezpy modules
import entrezpy.conduit


def main():
  c = entrezpy.conduit.Conduit(sys.argv[1])
  fetch_pubmed = c.new_pipeline()
  fetch_pubmed.add_fetch({'db':'pubmed', 'id':[sys.argv[2]], 'retmode':'xml'})
  c.run(fetch_pubmed)
  return 0

if __name__ == '__main__':
  main()
- Lines 3-4: import standard Python libraries
- Lines 12-15: import the module entrezpy.conduit (adjust as necessary)
- Line 19: create a new entrezpy.conduit.Conduit instance with an email address from the first command line argument
- Line 20: create a new pipeline fetch_pubmed using entrezpy.conduit.Conduit.new_pipeline()
- Line 21: add a fetch request to the fetch_pubmed pipeline with the PMID from the second command line argument using entrezpy.conduit.Conduit.Pipeline.add_fetch()
- Line 22: run the pipeline using entrezpy.conduit.Conduit.run()
Let’s test this program to see if all modules are found and conduit works.
$ python pubmed-fetcher.py your@email 15430309
Since we didn’t specify an analyzer yet, we expect the raw XML output to be printed to the standard output. So far, this produces the same output as Listing 1.
If this command fails and/or no output is printed to the standard output, something went wrong. Possible issues include no internet connection, a wrongly installed entrezpy, wrong import statements, or bad permissions.
If everything went smoothly, we have written a basic but working pipeline to fetch PubMed data from NCBI’s Entrez database. We can now start to implement our specific entrezpy.base.result.EutilsResult and entrezpy.base.analyzer.EutilsAnalyzer classes. However, before we implement these classes, we need to decide how we want to store a PubMed data record.
How to store PubMed data records¶
The data records can be stored in different ways, but using a class facilitates collecting and retrieving the requested data. We implement a simple class (analogous to a C/C++ struct [4]) to represent a PubMed record.
class PubmedRecord:
  """Simple data class to store individual Pubmed records. Individual authors will
  be stored as dict('lname':last_name, 'fname': first_name) in authors.
  Citations are stored as string elements in the list references."""

  def __init__(self):
    self.pmid = None
    self.title = None
    self.abstract = None
    self.authors = []
    self.references = []
Further, we will use the dict pubmed_records as an attribute of PubmedResult to store PubmedRecord instances, using the PMID as key to avoid duplicates.
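As a minimal illustration of this storage scheme (the values are hypothetical):

pubmed_records = {}
rec = PubmedRecord()
rec.pmid = '26378223'           # hypothetical PMID used as key
pubmed_records[rec.pmid] = rec  # re-adding the same PMID overwrites the entry, avoiding duplicates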
Defining PubmedResult and PubmedAnalyzer¶
From the documentation or publication, we know that entrezpy.base.analyzer.EutilsAnalyzer parses responses and stores results in entrezpy.base.result.EutilsResult. Therefore, we need to derive and adjust these classes for our PubmedResult and PubmedAnalyzer classes. We will add these classes to our program pubmed-fetcher.py. The documentation tells us what the required parameters for each class are and which virtual methods we need to implement.
Implement PubmedResult¶
We have to implement the virtual methods declared in entrezpy.base.result.EutilsResult. The documentation informs us about the required parameters and expected return values.
In addition, we declare the method PubmedResult.add_pubmed_record() to handle adding new PubMed data record instances as defined in Listing 4. The PubmedResult methods in this tutorial are trivial and we can implement the class in one go.
PubmedResult¶
class PubmedResult(entrezpy.base.result.EutilsResult):
  """Derive class entrezpy.base.result.EutilsResult to store Pubmed queries.
  Individual Pubmed records are implemented in :class:`PubmedRecord` and
  stored in :ivar:`pubmed_records`.

  :param response: inspected response from :class:`PubmedAnalyzer`
  :param request: the request for the current response
  :ivar dict pubmed_records: storing PubmedRecord instances"""

  def __init__(self, response, request):
    super().__init__(request.eutil, request.query_id, request.db)
    self.pubmed_records = {}

  def size(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.size`
    returning the number of stored data records."""
    return len(self.pubmed_records)

  def isEmpty(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.isEmpty`
    to query if any records have been stored at all."""
    if not self.pubmed_records:
      return True
    return False

  def get_link_parameter(self, reqnum=0):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.get_link_parameter`.
    Fetching a pubmed record has no intrinsic elink capabilities and therefore
    should inform users about this."""
    print("{} has no elink capability".format(self))
    return {}

  def dump(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.dump`.

    :return: instance attributes
    :rtype: dict
    """
    return {self:{'dump':{'pubmed_records':[x for x in self.pubmed_records],
                          'query_id': self.query_id, 'db':self.db,
                          'eutil':self.function}}}

  def add_pubmed_record(self, pubmed_record):
    """The only non-virtual and therefore PubmedResult-specific method to handle
    adding new data records"""
    self.pubmed_records[pubmed_record.pmid] = pubmed_record
- Line 1: inherit the base class entrezpy.base.result.EutilsResult
- Lines 10-12: initialize the PubmedResult instance with the required parameters and attributes. We don’t need any information from the response, e.g. WebEnv.
- Lines 14-17: implement entrezpy.base.result.EutilsResult.size()
- Lines 19-24: implement entrezpy.base.result.EutilsResult.isEmpty()
- Lines 26-31: implement entrezpy.base.result.EutilsResult.get_link_parameter()
- Lines 33-41: implement entrezpy.base.result.EutilsResult.dump()
- Lines 43-46: specific PubmedResult method to store individual PubmedRecord instances
Note
Linking PubMed records for subsequent searches is better handled by
creating a pipeline performing esearch
queries followed by elink
queries and a final efetch
query. The fetch result for PubMed records
has no WebEnv value and is missing the originating database since efetch
is usually the last query within a series of Eutils
queries. You can test
this using the following EDirect pipeline:
$ efetch -db pubmed -id 20148030 | elink -target nuccore
Therefore, we implement a warning informing the user that linking is not possible. Nevertheless, the method could return any parsed information, e.g. nucleotide UIDs, to be used as parameters for a subsequent fetch. However, some features could not be used, e.g. the Entrez History server. A sketch of the recommended search-link-fetch pipeline follows below.
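A sketch of such a pipeline, reusing the Conduit steps shown earlier (the PMID and parameters are placeholders and may need adjusting):

import entrezpy.conduit

c = entrezpy.conduit.Conduit('email')
link_pubmed = c.new_pipeline()
# Placeholder search seeding the pipeline with one PMID.
sid = link_pubmed.add_search({'db' : 'pubmed', 'term' : '26378223[uid]', 'rettype' : 'count'})
lid = link_pubmed.add_link({'db' : 'nuccore', 'cmd' : 'neighbor_history'}, dependency=sid)
lid = link_pubmed.add_search({'rettype' : 'count', 'cmd' : 'neighbor_history'}, dependency=lid)
link_pubmed.add_fetch({'retmode' : 'text', 'rettype' : 'fasta'}, dependency=lid)
c.run(link_pubmed)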
Implementing PubmedAnalyzer¶
We have to extend the virtual methods declared in
entrezpy.base.analyzer.EutilsAnalyzer
. The documentation informs us about the required parameters
and expected return values.
PubmedAnalyzer¶
class PubmedAnalyzer(entrezpy.base.analyzer.EutilsAnalyzer):
  """Derived class of :class:`entrezpy.base.analyzer.EutilsAnalyzer` to analyze and
  parse PubMed responses and requests."""

  def __init__(self):
    super().__init__()

  def init_result(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.init_result`.
    This method initiates a result instance when analyzing the first response"""
    if self.result is None:
      self.result = PubmedResult(response, request)

  def analyze_error(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_error`. Since
    we expect XML errors, just print the error to STDOUT for
    logging/debugging."""
    print(json.dumps({__name__:{'Response': {'dump' : request.dump(),
                                             'error' : response.getvalue()}}}))

  def analyze_result(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_result`.
    Parse PubMed XML line by line to extract authors and citations.
    xml.etree.ElementTree.iterparse
    (https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse)
    reads the XML file incrementally. Each <PubmedArticle> is cleared after processing.

    ..note:: Adjust this method to include more/different tags to extract.
             Remember to adjust :class:`.PubmedRecord` as well."""
    self.init_result(response, request)
    isAuthorList = False
    isAuthor = False
    isRefList = False
    isRef = False
    isArticle = False
    medrec = None
    for event, elem in xml.etree.ElementTree.iterparse(response, events=["start", "end"]):
      if event == 'start':
        if elem.tag == 'PubmedArticle':
          medrec = PubmedRecord()
        if elem.tag == 'AuthorList':
          isAuthorList = True
        if isAuthorList and elem.tag == 'Author':
          isAuthor = True
          medrec.authors.append({'fname': None, 'lname': None})
        if elem.tag == 'ReferenceList':
          isRefList = True
        if isRefList and elem.tag == 'Reference':
          isRef = True
        if elem.tag == 'Article':
          isArticle = True
      else:
        if elem.tag == 'PubmedArticle':
          self.result.add_pubmed_record(medrec)
          elem.clear()
        if elem.tag == 'AuthorList':
          isAuthorList = False
        if isAuthorList and elem.tag == 'Author':
          isAuthor = False
        if elem.tag == 'ReferenceList':
          isRefList = False
        if elem.tag == 'Reference':
          isRef = False
        if elem.tag == 'Article':
          isArticle = False
        if elem.tag == 'PMID':
          medrec.pmid = elem.text.strip()
        if isAuthor and elem.tag == 'LastName':
          medrec.authors[-1]['lname'] = elem.text.strip()
        if isAuthor and elem.tag == 'ForeName':
          medrec.authors[-1]['fname'] = elem.text.strip()
        if isRef and elem.tag == 'Citation':
          medrec.references.append(elem.text.strip())
        if isArticle and elem.tag == 'AbstractText':
          if not medrec.abstract:
            medrec.abstract = elem.text.strip()
          else:
            medrec.abstract += elem.text.strip()
        if isArticle and elem.tag == 'ArticleTitle':
          medrec.title = elem.text.strip()
- Line 1: Inherit the base class entrezpy.base.analyzer.EutilsAnalyzer
- Lines 5-6: initialize the PubmedAnalyzer instance
- Lines 8-12: implement entrezpy.base.analyzer.EutilsAnalyzer.init_result()
- Lines 14-19: implement entrezpy.base.analyzer.EutilsAnalyzer.analyze_error()
- Lines 21-80: implement entrezpy.base.analyzer.EutilsAnalyzer.analyze_result()
The XML parser is the critical, and most likely the most complex, piece to implement. However, if you want to parse your Entrez results you need to develop a parser anyway. If you already have a parser, you can use an object composition approach [5]. Further, you can add a method in analyze_result to store the processed data in a database or to implement checkpoints.
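For instance, a sketch of a checkpointing subclass (the method name and output file are made up for illustration) could look like this:

class CheckpointingPubmedAnalyzer(PubmedAnalyzer):
  """Hypothetical subclass writing a checkpoint after every analyzed response."""

  def analyze_result(self, response, request):
    super().analyze_result(response, request)
    self.checkpoint()

  def checkpoint(self):
    # Hypothetical helper: dump the PMIDs parsed so far to a file.
    with open('pubmed.checkpoint', 'w') as fh:
      for pmid in self.result.pubmed_records:
        fh.write('{}\n'.format(pmid))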
Note
Explaining the XML parser is beyond the scope of this tutorial (and there are likely better approaches, anyways).
Putting everything together¶
The completed implementation is shown in Listing 7.
#!/usr/bin/env python3

import os
import sys
import json
import xml.etree.ElementTree

# If entrezpy is installed using PyPi uncomment the line 'import entrezpy'
# and comment the 'sys.path.insert(...)'
# import entrezpy
sys.path.insert(1, os.path.join(sys.path[0], '../../../src'))

# Import required entrezpy modules
import entrezpy.conduit
import entrezpy.base.result
import entrezpy.base.analyzer


class PubmedRecord:
  """Simple data class to store individual Pubmed records. Individual authors will
  be stored as dict('lname':last_name, 'fname': first_name) in authors.
  Citations are stored as string elements in the list references."""

  def __init__(self):
    self.pmid = None
    self.title = None
    self.abstract = None
    self.authors = []
    self.references = []


class PubmedResult(entrezpy.base.result.EutilsResult):
  """Derive class entrezpy.base.result.EutilsResult to store Pubmed queries.
  Individual Pubmed records are implemented in :class:`PubmedRecord` and
  stored in :ivar:`pubmed_records`.

  :param response: inspected response from :class:`PubmedAnalyzer`
  :param request: the request for the current response
  :ivar dict pubmed_records: storing PubmedRecord instances"""

  def __init__(self, response, request):
    super().__init__(request.eutil, request.query_id, request.db)
    self.pubmed_records = {}

  def size(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.size`
    returning the number of stored data records."""
    return len(self.pubmed_records)

  def isEmpty(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.isEmpty`
    to query if any records have been stored at all."""
    if not self.pubmed_records:
      return True
    return False

  def get_link_parameter(self, reqnum=0):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.get_link_parameter`.
    Fetching a pubmed record has no intrinsic elink capabilities and therefore
    should inform users about this."""
    print("{} has no elink capability".format(self))
    return {}

  def dump(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.dump`.

    :return: instance attributes
    :rtype: dict
    """
    return {self:{'dump':{'pubmed_records':[x for x in self.pubmed_records],
                          'query_id': self.query_id, 'db':self.db,
                          'eutil':self.function}}}

  def add_pubmed_record(self, pubmed_record):
    """The only non-virtual and therefore PubmedResult-specific method to handle
    adding new data records"""
    self.pubmed_records[pubmed_record.pmid] = pubmed_record


class PubmedAnalyzer(entrezpy.base.analyzer.EutilsAnalyzer):
  """Derived class of :class:`entrezpy.base.analyzer.EutilsAnalyzer` to analyze and
  parse PubMed responses and requests."""

  def __init__(self):
    super().__init__()

  def init_result(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.init_result`.
    This method initiates a result instance when analyzing the first response"""
    if self.result is None:
      self.result = PubmedResult(response, request)

  def analyze_error(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_error`. Since
    we expect XML errors, just print the error to STDOUT for
    logging/debugging."""
    print(json.dumps({__name__:{'Response': {'dump' : request.dump(),
                                             'error' : response.getvalue()}}}))

  def analyze_result(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_result`.
    Parse PubMed XML line by line to extract authors and citations.
    xml.etree.ElementTree.iterparse
    (https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse)
    reads the XML file incrementally. Each <PubmedArticle> is cleared after processing.

    ..note:: Adjust this method to include more/different tags to extract.
             Remember to adjust :class:`.PubmedRecord` as well."""
    self.init_result(response, request)
    isAuthorList = False
    isAuthor = False
    isRefList = False
    isRef = False
    isArticle = False
    medrec = None
    for event, elem in xml.etree.ElementTree.iterparse(response, events=["start", "end"]):
      if event == 'start':
        if elem.tag == 'PubmedArticle':
          medrec = PubmedRecord()
        if elem.tag == 'AuthorList':
          isAuthorList = True
        if isAuthorList and elem.tag == 'Author':
          isAuthor = True
          medrec.authors.append({'fname': None, 'lname': None})
        if elem.tag == 'ReferenceList':
          isRefList = True
        if isRefList and elem.tag == 'Reference':
          isRef = True
        if elem.tag == 'Article':
          isArticle = True
      else:
        if elem.tag == 'PubmedArticle':
          self.result.add_pubmed_record(medrec)
          elem.clear()
        if elem.tag == 'AuthorList':
          isAuthorList = False
        if isAuthorList and elem.tag == 'Author':
          isAuthor = False
        if elem.tag == 'ReferenceList':
          isRefList = False
        if elem.tag == 'Reference':
          isRef = False
        if elem.tag == 'Article':
          isArticle = False
        if elem.tag == 'PMID':
          medrec.pmid = elem.text.strip()
        if isAuthor and elem.tag == 'LastName':
          medrec.authors[-1]['lname'] = elem.text.strip()
        if isAuthor and elem.tag == 'ForeName':
          medrec.authors[-1]['fname'] = elem.text.strip()
        if isRef and elem.tag == 'Citation':
          medrec.references.append(elem.text.strip())
        if isArticle and elem.tag == 'AbstractText':
          if not medrec.abstract:
            medrec.abstract = elem.text.strip()
          else:
            medrec.abstract += elem.text.strip()
        if isArticle and elem.tag == 'ArticleTitle':
          medrec.title = elem.text.strip()


def main():
  c = entrezpy.conduit.Conduit(sys.argv[1])
  fetch_pubmed = c.new_pipeline()
  fetch_pubmed.add_fetch({'db':'pubmed', 'id': sys.argv[2].split(','),
                          'retmode':'xml'}, analyzer=PubmedAnalyzer())
  a = c.run(fetch_pubmed)
  #print(a)
  # Testing PubmedResult
  #print("DUMP: {}".format(a.get_result().dump()))
  #print("SIZE: {}".format(a.get_result().size()))
  #print("LINK: {}".format(a.get_result().get_link_parameter()))
  res = a.get_result()
  print("PMID","Title","Abstract","Authors","RefCount", "References", sep='=')
  for i in res.pubmed_records:
    print("{}={}={}={}={}={}".format(res.pubmed_records[i].pmid, res.pubmed_records[i].title,
                                     res.pubmed_records[i].abstract,
                                     ';'.join(str(x['lname']+","+x['fname'].replace(' ', '')) for x in res.pubmed_records[i].authors),
                                     len(res.pubmed_records[i].references),
                                     ';'.join(x for x in res.pubmed_records[i].references)))
  return 0

if __name__ == '__main__':
  main()
Compared to the initial pipeline (Listing 3), main() changes as follows:
- the add_fetch() step now splits sys.argv[2] on commas to allow several comma-separated PMIDs
- our PubmedAnalyzer instance is passed via the analyzer parameter, as described in entrezpy.conduit.Conduit.Pipeline.add_fetch()
- the analyzer returned by entrezpy.conduit.Conduit.run() is stored in a
- the commented print statements can be used to test the PubmedResult methods
- the PubmedResult instance is obtained via a.get_result() and the fetched data records are printed as ‘=’-separated columns
The implementation can be invoked as shown in Listing 8.
$ python pubmed-fetcher.py you@email 6,15430309,31077305,27880757,26378223| column -s= -t |less
You’ll notice that not all data records have all fields. This is because they are missing in these records or some tags have different names.
Running pubmed-fetcher.py
with UID 20148030 will fail
(Listing 9).
$ python pubmed-fetcher.py you@email 20148030
The reason for this can be found in the requested XML. Running the command in Listing 10 hints at the problem. Adjusting and fixing it is a task left for interested readers.
$ efetch -db pubmed -id 20148030 -mode xml | grep -A7 \<AuthorList
Footnotes
[1] https://en.wikipedia.org/wiki/Virtual_function
[2] http://www.greenwoodsoftware.com/less/
[3] https://mirrors.edge.kernel.org/pub/linux/utils/util-linux/
[4] https://en.cppreference.com/w/c/language/struct
[5] https://en.wikipedia.org/wiki/Object_composition
Fetching sequence metadata from Entrez¶
Prerequisites
- Python 3.6 or higher is assumed.
- entrezpy is either installed via PyPi or cloned from the git repository (Installation).
- Basic familiarity with object-oriented Python, i.e. inheritance.
- Read the tutorial Fetching publication information from Entrez.
- The full implementation can be found in the repository at examples/tutorials/seqmetadata/seqmetadata-fetcher.py
Overview¶
This tutorial explains how to write a simple sequence docsum fetcher using entrezpy.conduit.Conduit and by adjusting entrezpy.base.result.EutilsResult and entrezpy.base.analyzer.EutilsAnalyzer. It is based on an esearch followed by fetching the data as docsum JSON. This tutorial is very similar to Fetching publication information from Entrez, the main differences being the JSON parsing and the use of two steps in entrezpy.conduit.Conduit. The main steps are very similar and the reader should look there for more details.
Outline
- develop an entrezpy.conduit.Conduit pipeline
- implement a Docsum data structure
- inherit entrezpy.base.result.EutilsResult and entrezpy.base.analyzer.EutilsAnalyzer
- implement the required virtual methods
- add methods to the derived classes
The Efetch Entrez Utility is NCBI’s utility responsible for fetching data records. Its manual lists all possible databases and which records (Record type) can be fetched in which format. We’ll fetch Docsum data in JSON using the EUtil esummary after performing an esearch step using accession numbers as query. Instead of using efetch, we will use esummary and replace the default analyzer with our own.
In entrezpy, a result is the sum of all individual requests required to answer the whole query. esummary fetches data in batches. In this example, all batches are collected prior to printing the information to standard output. The method DocsumAnalyzer.analyze_result() can be adjusted to store or analyze the results from each batch as soon as they are fetched, as sketched below.
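A sketch of such an adjustment, printing each batch as soon as it has been analyzed (DocsumAnalyzer is implemented later in this tutorial; the printed fields are chosen for illustration):

class PrintingDocsumAnalyzer(DocsumAnalyzer):
  """Hypothetical subclass printing every Docsum batch as soon as it is fetched."""

  def analyze_result(self, response, request):
    super().analyze_result(response, request)
    # `response` is the parsed esummary JSON for this batch.
    for uid in response['result']['uids']:
      print(uid, response['result'][uid].get('caption'))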
A quick note on virtual functions
entrezpy
is heavily based on virtual methods [1]. A virtual method is
declared in the base class but implemented in the derived class. Every
class inheriting the base class has to implement the virtual functions using
the same signature and return the same result type as the base class. To
implement the method in the inherited class, you need to look up the method in
the base class.
Docsum data structure¶
Before we start to write our implementation, we need to understand the structure of the received data. This can be done using the EDirect tools from NCBI. The result is printed to the standard output. For its examination, it can either be stored in a file or, preferably, piped to a pager, e.g. less [2] or more [3]. These are usually installed on most *NIX systems.
Fetching the Docsum data record for accession HOU142311 using EDirect’s esearch and esummary¶
$ esearch -db nuccore -query HOU142311 | esummary -mode json
The entry should start and end as shown in Listing 12.
Docsum data record for accession HOU142311. Only the first few attribute lines are shown for brevity.¶
{
  "header": {
    "type": "esummary",
    "version": "0.3"
  },
  "result": {
    "uids": [
      "1110864597"
    ],
    "1110864597": {
      "uid": "1110864597",
      "caption": "KX883530",
      "title": "Beihai levi-like virus 30 strain HOU142311 hypothetical protein genes, complete cds",
      "extra": "gi|1110864597|gb|KX883530.1|",
      "gi": 1110864597,
      "createdate": "2016/12/10",
      "updatedate": "2016/12/10",
      "flags": "",
      "taxid": 1922417,
      "slen": 4084,
      "biomol": "genomic",
      "moltype": "rna",
      "topology": "linear",
      "sourcedb": "insd",
      "segsetsize": "",
      "projectid": "0",
      "genome": "genomic",
      "subtype": "strain|host|country|collection_date",
      "subname": "HOU142311|horseshoe crab|China|2014",
      "assemblygi": "",
      "assemblyacc": "",
      "tech": "",
      "completeness": "",
      "geneticcode": "1",
      "strand": "",
      "organism": "Beihai levi-like virus 30",
      "strain": "HOU142311",
      "biosample": "",
    }
  }
}
The first step is to write a program to fetch the requested records. This can
be done using the entrezpy.conduit.Conduit
class.
Simple Conduit pipeline to fetch Docsum Records¶
We will write a simple entrezpy pipeline named seqmetadata-fetcher.py using entrezpy.conduit.Conduit to test and run our implementations. A simple entrezpy.conduit.Conduit pipeline requires two arguments:
- user email
- accession numbers
An entrezpy.conduit.Conduit pipeline to fetch Docsum data records. The required arguments are parsed by ArgumentParser.¶
#!/usr/bin/env python3

import os
import sys
import json
import argparse

# If entrezpy is installed using PyPi uncomment the line 'import entrezpy'
# and comment the 'sys.path.insert(...)'
# import entrezpy
sys.path.insert(1, os.path.join(sys.path[0], '../../../src'))

# Import required entrezpy modules
import entrezpy.conduit
import entrezpy.base.result
import entrezpy.base.analyzer


def main():
  # Set up the argument parser
  ap = argparse.ArgumentParser(description='Simple Sequence Metadata Fetcher. \
                                Accessions are parsed from STDIN, one accession per line')
  ap.add_argument('--email',
                  type=str,
                  required=True,
                  help='email required by NCBI')
  ap.add_argument('--apikey',
                  type=str,
                  default=None,
                  help='NCBI apikey (optional)')
  ap.add_argument('-db',
                  type=str,
                  required=True,
                  help='Database to search')
  args = ap.parse_args()

  c = entrezpy.conduit.Conduit(args.email)
  fetch_docsum = c.new_pipeline()
  sid = fetch_docsum.add_search({'db':args.db, 'term':','.join([str(x.strip()) for x in sys.stdin])})
  fetch_docsum.add_summary({'rettype':'docsum', 'retmode':'json'},
                           dependency=sid, analyzer=DocsumAnalyzer())
- Lines 1-17: import standard Python libraries and entrezpy modules
- Lines 21-35: set up the argument parser
- Line 37: create a new entrezpy.conduit.Conduit instance with the email address
- Line 38: create a new pipeline using entrezpy.conduit.Conduit.new_pipeline()
- Line 39: add a search request to the pipeline with the database name from the user-passed argument and a search string assembled from standard input; store the query id in sid (entrezpy.conduit.Conduit.Pipeline.add_search())
- Lines 40-41: add a summary step with the search query as dependency (entrezpy.conduit.Conduit.Pipeline.add_summary())
- The pipeline is run in the full implementation (Listing 17) using entrezpy.conduit.Conduit.run()
We need to implement the DocsumAnalyzer, but first we have to design a Docsum data structure.
How to store Docsum data records¶
The data records can be stored in different ways, but using a class facilitates collecting and retrieving the requested data. We implement a simple class (analogous to a C/C++ struct [4]) to represent a Docsum record.
Because we fetch data in JSON format, the class performs rather dull parsing. The nested Subtype class handles the subtype and subname attributes in a Docsum response.
Docsum data record¶
class Docsum:
  """Simple data class to store individual sequence Docsum records."""

  class Subtype:
    def __init__(self, subtype, subname):
      self.strain = None
      self.host = None
      self.country = None
      self.collection = None
      self.collection_date = None
      for i in range(len(subtype)):
        if subtype[i] == 'strain':
          self.strain = subname[i]
        if subtype[i] == 'host':
          self.host = subname[i]
        if subtype[i] == 'country':
          self.country = subname[i]
        if subtype[i] == 'collection_date':
          self.collection_date = subname[i]

  def __init__(self, json_docsum):
    self.uid = int(json_docsum['uid'])
    self.caption = json_docsum['caption']
    self.title = json_docsum['title']
    self.extra = json_docsum['extra']
    self.gi = int(json_docsum['gi'])
    self.taxid = int(json_docsum['taxid'])
    self.slen = int(json_docsum['slen'])
    self.biomol = json_docsum['biomol']
    self.moltype = json_docsum['moltype']
    self.topology = json_docsum['topology']
    self.sourcedb = json_docsum['sourcedb']
    self.segsetsize = json_docsum['segsetsize']
    self.projectid = int(json_docsum['projectid'])
    self.genome = json_docsum['genome']
    self.subtype = Docsum.Subtype(json_docsum['subtype'].split('|'),
                                  json_docsum['subname'].split('|'))
    self.assemblygi = json_docsum['assemblygi']
    self.assemblyacc = json_docsum['assemblyacc']
    self.tech = json_docsum['tech']
    self.completeness = json_docsum['completeness']
    self.geneticcode = int(json_docsum['geneticcode'])
    self.strand = json_docsum['strand']
    self.organism = json_docsum['organism']
    self.strain = json_docsum['strain']
    self.accessionversion = json_docsum['accessionversion']
Implement DocsumResult¶
We have to implement the virtual methods declared in entrezpy.base.result.EutilsResult. The documentation informs us about the required parameters and expected return values.
In addition, we declare the method DocsumResult.add_docsum() to handle adding new Docsum data record instances as defined in Listing 14. The DocsumResult methods in this tutorial are trivial and we can implement the class in one go.
DocsumResult¶
class DocsumResult(entrezpy.base.result.EutilsResult):
  """Derive class entrezpy.base.result.EutilsResult to store Docsum queries.
  Individual Docsum records are implemented in :class:`Docsum` and
  stored in :ivar:`docsums`.

  :param response: inspected response from :class:`DocsumAnalyzer`
  :param request: the request for the current response
  :ivar dict docsums: storing Docsum instances"""

  def __init__(self, response, request):
    super().__init__(request.eutil, request.query_id, request.db)
    self.docsums = {}

  def size(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.size`
    returning the number of stored data records."""
    return len(self.docsums)

  def isEmpty(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.isEmpty`
    to query if any records have been stored at all."""
    if not self.docsums:
      return True
    return False

  def get_link_parameter(self, reqnum=0):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.get_link_parameter`.
    Fetching a summary record has no intrinsic elink capabilities and therefore
    should inform users about this."""
    print("{} has no elink capability".format(self))
    return {}

  def dump(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.dump`.

    :return: instance attributes
    :rtype: dict
    """
    return {self:{'dump':{'docsum_records':[x for x in self.docsums],
                          'query_id': self.query_id, 'db':self.db,
                          'eutil':self.function}}}

  def add_docsum(self, docsum):
    """The only non-virtual and therefore DocsumResult-specific method to handle
    adding new data records"""
    self.docsums[docsum.uid] = docsum
- Line 1: inherit the base class entrezpy.base.result.EutilsResult
- Lines 10-12: initialize the DocsumResult instance with the required parameters and attributes. We don’t need any information from the response, e.g. WebEnv.
- Lines 14-17: implement entrezpy.base.result.EutilsResult.size()
- Lines 19-24: implement entrezpy.base.result.EutilsResult.isEmpty()
- Lines 26-31: implement entrezpy.base.result.EutilsResult.get_link_parameter()
- Lines 33-41: implement entrezpy.base.result.EutilsResult.dump()
- Lines 43-46: specific DocsumResult method to store individual Docsum instances
Note
The fetch result for Docsum
records has no WebEnv value and is
missing the originating database since esummary
is usually the last
query within a series of Eutils
queries. Therefore, we implement a
warning, informing the user linking is not possible.
Implementing DocsumAnalyzer¶
We have to extend the virtual methods declared in
entrezpy.base.analyzer.EutilsAnalyzer
. The documentation informs us about the required parameters
and expected return values.
DocsumAnalyzer¶
class DocsumAnalyzer(entrezpy.base.analyzer.EutilsAnalyzer):
  """Derived class of :class:`entrezpy.base.analyzer.EutilsAnalyzer` to analyze and
  parse Docsum responses and requests."""

  def __init__(self):
    super().__init__()

  def init_result(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.init_result`.
    This method initiates a result instance when analyzing the first response"""
    if self.result is None:
      self.result = DocsumResult(response, request)

  def analyze_error(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_error`. Since
    we expect JSON, just print the error to STDOUT as string."""
    print(json.dumps({__name__:{'Response': {'dump' : request.dump(),
                                             'error' : response}}}))

  def analyze_result(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_result`.
    The result is a JSON structure and allows easy parsing"""
    self.init_result(response, request)
    for i in response['result']['uids']:
      self.result.add_docsum(Docsum(response['result'][i]))
- Line 1: Inherit the base class entrezpy.base.analyzer.EutilsAnalyzer
- Lines 5-6: initialize the DocsumAnalyzer instance
- Lines 8-12: implement entrezpy.base.analyzer.EutilsAnalyzer.init_result()
- Lines 14-18: implement entrezpy.base.analyzer.EutilsAnalyzer.analyze_error()
- Lines 20-25: implement entrezpy.base.analyzer.EutilsAnalyzer.analyze_result()
Compared to the PubMed analyzer, parsing the JSON output is very easy. If you already have a parser, you can use an object composition approach [5]. Further, you can add a method in analyze_result to store the processed data in a database or to implement checkpoints.
Putting everything together¶
The completed implementation is shown in Listing 17.
Docsum
fetcher¶1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 | #!/usr/bin/env python3
import os
import sys
import json
import argparse
# If entrezpy is installed using PyPi uncomment th line 'import entrezpy'
# and comment the 'sys.path.insert(...)'
# import entrezpy
sys.path.insert(1, os.path.join(sys.path[0], '../../../src'))
# Import required entrepy modules
import entrezpy.conduit
import entrezpy.base.result
import entrezpy.base.analyzer
class Docsum:
"""Simple data class to store individual sequence Docsum records."""
class Subtype:
def __init__(self, subtype, subname):
self.strain = None
self.host = None
self.country = None
self.collection = None
self.collection_date = None
for i in range(len(subtype)):
if subtype[i] == 'strain':
self.stain = subname[i]
if subtype[i] == 'host':
self.host = subname[i]
if subtype[i] == 'country':
self.country = subname[i]
if subtype[i] == 'collection_date':
self.collection_date = subname[i]
def __init__(self, json_docsum):
self.uid = int(json_docsum['uid'])
self.caption = json_docsum['caption']
self.title = json_docsum['title']
self.extra = json_docsum['extra']
self.gi = int(json_docsum['gi'])
self.taxid = int(json_docsum['taxid'])
self.slen = int(json_docsum['slen'])
self.biomol = json_docsum['biomol']
self.moltype = json_docsum['moltype']
self.tolopolgy = json_docsum['topology']
self.sourcedb = json_docsum['sourcedb']
self.segsetsize = json_docsum['segsetsize']
self.projectid = int(json_docsum['projectid'])
self.genome = json_docsum['genome']
self.subtype = Docsum.Subtype(json_docsum['subtype'].split('|'),
json_docsum['subname'].split('|'))
self.assemblygi = json_docsum['assemblygi']
self.assemblyacc = json_docsum['assemblyacc']
self.tech = json_docsum['tech']
self.completeness = json_docsum['completeness']
self.geneticcode = int(json_docsum['geneticcode'])
self.strand = json_docsum['strand']
self.organism = self.strand = json_docsum['organism']
self.strain = self.strand = json_docsum['strain']
self.accessionversion = json_docsum['accessionversion']
class DocsumResult(entrezpy.base.result.EutilsResult):
"""Derive class entrezpy.base.result.EutilsResult to store Docsum queries.
Individual Docsum records are implemented in :class:`Docsum` and
stored in :ivar:`docsums`.
:param response: inspected response from :class:`PubmedAnalyzer`
:param request: the request for the current response
:ivar dict docsums: storing Docsum instances"""
def __init__(self, response, request):
super().__init__(request.eutil, request.query_id, request.db)
self.docsums = {}
def size(self):
"""Implement virtual method :meth:`entrezpy.base.result.EutilsResult.size`
returning the number of stored data records."""
return len(self.docsums)
def isEmpty(self):
"""Implement virtual method :meth:`entrezpy.base.result.EutilsResult.isEmpty`
to query if any records have been stored at all."""
if not self.docsums:
return True
return False
def get_link_parameter(self, reqnum=0):
"""Implement virtual method :meth:`entrezpy.base.result.EutilsResult.get_link_parameter`.
Fetching summary record has no intrinsic elink capabilities and therefore
should inform users about this."""
print("{} has no elink capability".format(self))
return {}
def dump(self):
"""Implement virtual method :meth:`entrezpy.base.result.EutilsResult.dump`.
:return: instance attributes
:rtype: dict
"""
return {self:{'dump':{'docsum_records':[x for x in self.docsums],
'query_id': self.query_id, 'db':self.db,
'eutil':self.function}}}
def add_docsum(self, docsum):
"""The only non-virtual and therefore DocsumResult-specific method to handle
adding new data records"""
self.docsums[docsum.uid] = docsum
class DocsumAnalyzer(entrezpy.base.analyzer.EutilsAnalyzer):
"""Derived class of :class:`entrezpy.base.analyzer.EutilsAnalyzer` to analyze and
parse Docsum responses and requests."""
def __init__(self):
super().__init__()
def init_result(self, response, request):
"""Implemented virtual method :meth:`entrezpy.base.analyzer.init_result`.
This method initiate a result instance when analyzing the first response"""
if self.result is None:
self.result = DocsumResult(response, request)
def analyze_error(self, response, request):
"""Implement virtual method :meth:`entrezpy.base.analyzer.analyze_error`. Since
we expect JSON, just print the error to STDOUT as string."""
print(json.dumps({__name__:{'Response': {'dump' : request.dump(),
'error' : response}}}))
def analyze_result(self, response, request):
"""Implement virtual method :meth:`entrezpy.base.analyzer.analyze_result`.
The results is a JSON structure and allows easy parsing"""
self.init_result(response, request)
for i in response['result']['uids']:
self.result.add_docsum(Docsum(response['result'][i]))
def main():
ap = argparse.ArgumentParser(description='Simple Sequence Metadata Fetcher. \
Accessions are parsed form STDIN, one accession pre line')
ap.add_argument('--email',
type=str,
required=True,
help='email required by NCBI'),
ap.add_argument('--apikey',
type=str,
default=None,
help='NCBI apikey (optional)')
ap.add_argument('-db',
type=str,
required=True,
help='Database to search ')
args = ap.parse_args()
c = entrezpy.conduit.Conduit(args.email)
fetch_docsum = c.new_pipeline()
sid = fetch_docsum.add_search({'db':args.db, 'term':','.join([str(x.strip()) for x in sys.stdin])})
fetch_docsum.add_summary({'rettype':'docsum', 'retmode':'json'},
dependency=sid, analyzer=DocsumAnalyzer())
docsums = c.run(fetch_docsum).get_result().docsums
for i in docsums:
print(i, docsums[i].uid, docsums[i].caption,docsums[i].strain, docsums[i].subtype.host)
return 0
if __name__ == '__main__':
main()
The implementation can be invoked as shown in Listing 18.
Docsum data for several accessions¶
$ echo "NC_016134.3" > accs
$ echo "HOU142311" >> accs
$ cat accs | python seqmetadata-fetcher.py --email email -db nuccore
Footnotes
[1] https://en.wikipedia.org/wiki/Virtual_function
[2] http://www.greenwoodsoftware.com/less/
[3] https://mirrors.edge.kernel.org/pub/linux/utils/util-linux/
[4] https://en.cppreference.com/w/c/language/struct
[5] https://en.wikipedia.org/wiki/Object_composition
Entrezpy E-Utility functions¶
Logging¶
entrezpy uses the Python logging module for logging. The base
classes only log the levels ‘ERROR’ and ‘DEBUG’. The module
entrezpy.log.logger contains all methods related to logging. A basic
configuration of the logger is given in entrezpy.log.conf.
Applications using entrezpy can set the level of logging as shown in
Listing 19. Logging calls can be made in classes inheriting
entrezpy classes as shown in Listing 20. The function
entrezpy.log.logger.get_class_logger() requires the class as its input.
Add logging to applications using the entrezpy library¶
Import the logging module and set the level.

import entrezpy.log.logger

entrezpy.log.logger.set_level('DEBUG')

def main():
  """
  your application using entrezpy
  """
Add logging to a class inheriting an entrezpy base class¶

import json

import entrezpy.base.query
import entrezpy.log.logger

class Esearcher(entrezpy.base.query.EutilsQuery):

  def __init__(self, tool, email, apikey=None, apikey_var=None, threads=None, qid=None):
    super().__init__('esearch.fcgi', tool, email, apikey=apikey, threads=threads, qid=qid)
    self.logger = entrezpy.log.logger.get_class_logger(Esearcher)
    self.logger.debug(json.dumps({'init':self.dump()}))
Esearch¶
entrezpy.esearch.esearcher.Esearcher
implements the E-Utility
ESearch [0]. Esearcher queries return UIDs for data in the requested
Entrez database or WebEnv/QueryKey references from the Entrez History server.
Usage¶
import entrezpy.esearch.esearcher
import entrezpy.log.logger
entrezpy.log.logger.set_level('DEBUG')
e = entrezpy.esearch.esearcher.Esearcher("entrezpy",
"youremail@email.com",
apikey=None,
apikey_var=None,
threads=None,
qid=None)
analyzer = e.inquire({'db' : 'pubmed',
'term' : 'viruses[orgn]',
'retmax' : '20',
'rettype' : 'uilist'})
print(analyzer.result.count, analyzer.result.uids)
This creates an Esearcher instance with the following parameters:
Esearcher¶
entrezpy.esearch.esearcher.Esearcher

- param str tool: String with no internal spaces uniquely identifying the software producing the request, i.e. your tool/pipeline.
- param str email: A complete and valid e-mail address of the software developer and not that of a third-party end user. entrezpy is a library, not a tool.
- param str apikey: NCBI API key
- param str apikey_var: Environment variable storing an NCBI API key
- param int threads: Number of threads (not processors)
- param str qid: Unique Esearch query id. Will be generated if not given.
Supported E-Utility parameter¶
Parameters are passed as a dictionary to
entrezpy.esearch.esearcher.Esearcher.inquire() and are expected to be the
same as those for the E-Utility [0]. For example:
{'db' : 'nucleotide', 'term' : 'viruses[orgn]', 'reqsize' : 100, 'retmax' : 99, 'idtype' : 'acc'}
Esearcher introduces one additional parameter, reqsize. It sets the size
of a request. Numbers greater than the maximum allowed by NCBI will be set to
the maximum.
| | Parameter | Type |
|---|---|---|
| E-Utility | db | str |
| | WebEnv | str |
| | query_key | int |
| | uilist | bool |
| | retmax | int |
| | retstart | int |
| | usehistory | bool |
| | term | str |
| | sort | str |
| | field | str |
| | reldate | int |
| | datetype | str (YYYY/MM/DD, YYYY/MM, YYYY) |
| | mindate | str (YYYY/MM/DD, YYYY/MM, YYYY) |
| | maxdate | str (YYYY/MM/DD, YYYY/MM, YYYY) |
| | idtype | bool |
| | retmode | `json`, enforced by Esearcher |
| Esearcher | reqsize | int |
Result¶
Instance of entrezpy.esearch.esearch_result.EsearchResult.
If retmax = 0 or retmode = count, no UIDs are returned. If usehistory is
True (default), the WebEnv and query_key for the request are returned.
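The following is a minimal sketch of inspecting such a result, assuming the Esearcher usage shown above; size(), isEmpty(), uids and get_link_parameter() are the attributes and methods documented in the reference below.

import entrezpy.esearch.esearcher

e = entrezpy.esearch.esearcher.Esearcher('entrezpy', 'youremail@email.com')
analyzer = e.inquire({'db' : 'pubmed', 'term' : 'viruses[orgn]', 'retmax' : 10})
result = analyzer.get_result()
if not result.isEmpty():
  print(result.size())                # number of analyzed UIDs
  print(result.uids)                  # the UIDs themselves
  print(result.get_link_parameter())  # WebEnv/query_key parameters for follow-up queries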
Approach¶
- Parameters are checked and the request size is configured
- The initial search is requested
- If more search requests are required, the parameters are adjusted and the remaining requests are run
- If no errors were encountered, the analyzer with the result for all requests is returned
Efetch¶
entrezpy.efetch.efetcher.Efetcher implements the E-Utility
EFetch [0]. Efetcher queries return data records from an Entrez database,
either for given UIDs or for a query referenced on the Entrez History server.
Usage¶
import entrezpy.efetch.efetcher
e = entrezpy.efetch.efetcher.Efetcher(tool,
email,
apikey=None,
apikey_var=None,
threads=None,
qid=None)
analyzer = e.inquire({'db' : 'pubmed',
'id' : [17284678, 9997],
'retmode' : 'text',
'rettype' : 'abstract'})
print(analyzer.count, analyzer.retmax, analyzer.retstart, analyzer.uids)
Efetcher¶
entrezpy.efetch.efetcher.Efetcher

- param str tool: string with no internal spaces uniquely identifying the software producing the request, i.e. your tool/pipeline.
- param str email: a complete and valid e-mail address of the software developer and not that of a third-party end user. entrezpy is a library, not a tool.
- param str apikey: NCBI API key
- param str apikey_var: environment variable storing an NCBI API key
- param int threads: number of threads
- param str qid: Unique Efetch query id. Will be generated if not given.
Supported E-Utility parameter¶
Parameters are passed as a dictionary to
entrezpy.efetch.efetcher.Efetcher.inquire() and are expected to be the
same as those for the E-Utility [0]. For example:
{'db' : 'nuccore', 'term' : 'Pythons [Organism]'}
Efetcher introduces the additional parameter reqsize. It sets the size
of a request. Numbers greater than the maximum allowed by NCBI will be set to
the maximum.
| | Parameter | Type |
|---|---|---|
| E-Utility | db | str |
| | WebEnv | str |
| | query_key | int |
| | uilist | bool |
| | retmax | int |
| | retstart | int |
| | usehistory | bool |
| | term | str |
| | sort | str |
| | field | str |
| | reldate | int |
| | datetype | str (YYYY/MM/DD, YYYY/MM, YYYY) |
| | mindate | str (YYYY/MM/DD, YYYY/MM, YYYY) |
| | maxdate | str (YYYY/MM/DD, YYYY/MM, YYYY) |
| | idtype | str |
| | retmode | `json`, enforced by Efetcher |
| Efetcher | reqsize | int |
Result¶
Instance of entrezpy.efetch.efetch_result.EfetchResult.
If retmax = 0 or retmode = count, no UIDs are returned. If usehistory is
True (default), the WebEnv and query_key for the request are returned.
Approach¶
- Parameters are checked and the request size is configured
- The initial search is requested
- If more search requests are required, the parameters are adjusted and the remaining requests are run
- If no errors were encountered, the analyzer with the result for all requests is returned
Elink¶
entrezpy.elink.elinker.Elinker implements the E-Utility
ELink [1]. Elink queries can link results within the same or between
different Entrez databases. Elinker
queries return UIDs for data in the requested Entrez database or a
WebEnv/QueryKey reference from the Entrez History server.
If an Elink query is part of a Conduit pipeline, a search query has to be run using the Elink query as dependency to obtain the proper UIDs (see the pipeline tutorial).
Usage¶
import entrezpy.elink.elinker
e = entrezpy.elink.elinker.Elinker(tool,
email,
apikey=None,
apikey_var=None,
threads=None,
qid=None)
analyzer = e.inquire({'dbfrom' : 'protein',
                      'db' : 'gene',
                      'id' : [15718680, 157427902]})
# print(analyzer.count, analyzer.retmax, analyzer.retstart, analyzer.uids)
Elinker¶
entrezpy.elink.elinker.Elinker

- param str tool: string with no internal spaces uniquely identifying the software producing the request, i.e. your tool/pipeline.
- param str email: a complete and valid e-mail address of the software developer and not that of a third-party end user. entrezpy is a library, not a tool.
- param str apikey: NCBI API key
- param str apikey_var: environment variable storing an NCBI API key
- param int threads: number of threads
- param str qid: Unique Elink query id. Will be generated if not given.
Supported E-Utility parameter¶
Parameters are passed as a dictionary to
entrezpy.elink.elinker.Elinker.inquire() and are expected to be the
same as those for the E-Utility [0]. For example:
{'db' : 'nucleotide', 'dbfrom' : 'protein', 'cmd' : 'neighbor'}
Elinker introduces one additional parameter, link. It forces Elinker to
create 1-to-many UID links.
Note
retmode = ref for the Elink command prlinks is not supported since
it returns only links outside the Entrez databases.
| | Parameter | Type |
|---|---|---|
| E-Utility | db | str |
| | dbfrom | str |
| | id | list |
| | cmd | str |
| | linkname | str |
| | term | str |
| | holding | str |
| | datetype | str |
| | reldate | int |
| | mindate | str (YYYY/MM/DD, YYYY/MM, YYYY) |
| | maxdate | str (YYYY/MM/DD, YYYY/MM, YYYY) |
| | retmode | str |
| Elinker | link | bool |
Elink linknames¶
Elink linknames allow to specify a subset of the linked database. This can
greatly increase the specificity of your link. By default, entrezpy's
Elinker uses a linkname for the commands neighbor, neighbor_history, and
neighbor_score. If no linkname is given, the names of dbfrom and db
are joined to dbfrom_db.
For all possible linknames, refer to [2].
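As an illustrative sketch, a linkname can be passed alongside the other Elink parameters; the value 'protein_gene' simply follows the dbfrom_db naming scheme described above.

import entrezpy.elink.elinker

e = entrezpy.elink.elinker.Elinker('entrezpy', 'youremail@email.com')
# Restrict the link to one linkname; without it, 'protein_gene' would be
# generated from dbfrom and db as described above.
analyzer = e.inquire({'dbfrom' : 'protein',
                      'db' : 'gene',
                      'cmd' : 'neighbor',
                      'id' : [15718680, 157427902],
                      'linkname' : 'protein_gene'})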
Result¶
Instance of entrezpy.elink.elink_result.ElinkResult.
All results are stored as link sets (entrezpy.elink.linkset.bare.LinkSet),
which are either linked (entrezpy.elink.linkset.linked.LinkedLinkset),
storing 1-to-many UID links, or relaxed
(entrezpy.elink.linkset.relaxed.RelaxedLinkset), storing many-to-many
UID links.
Approach¶
- Parameters are checked and the request size is configured
- The link is requested
- If no errors were encountered, the analyzer with the link result is returned
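A short sketch of accessing the link result, assuming the Elinker usage above; size(), dump() and the linksets attribute are documented in the ElinkResult reference below.

result = analyzer.get_result()       # entrezpy.elink.elink_result.ElinkResult
print(result.size(), 'link set(s)')  # number of stored LinkSet instances
print(result.dump())                 # all instance attributes, including the analyzed link sets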
Esummary¶
entrezpy.esummary.esummarizer.Esummarizer
implements the E-Utility
ESummary [0]. Esummarizer fetches document summaries for UIDs in the
requested database. Summaries can contain abstracts, experimental details, etc.
Usage¶
import entrezpy.esummary.esummarizer
e = entrezpy.esummary.esummarizer.Esummarizer(tool,
email,
apikey=None,
apikey_var=None,
threads=None,
qid=None)
analyzer = e.inquire({'db' : 'pubmed', 'id' : [11850928, 11482001]})
print(analyzer.get_result().summaries)
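As a short sketch following the usage above, the returned summaries can be iterated directly; the exact content of each summary follows NCBI's ESummary JSON for the queried database.

summaries = analyzer.get_result().summaries
for uid in summaries:
  # each entry holds the ESummary record for one UID
  print(uid, summaries[uid])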
Esummarizer¶
entrezpy.esummary.esummarizer.Esummarizer

- param str tool: string with no internal spaces uniquely identifying the software producing the request, i.e. your tool/pipeline.
- param str email: a complete and valid e-mail address of the software developer and not that of a third-party end user. entrezpy is a library, not a tool.
- param str apikey: NCBI API key
- param str apikey_var: environment variable storing an NCBI API key
- param int threads: number of threads
- param str qid: Unique Esummary query id. Will be generated if not given.
Supported E-Utility parameter¶
Parameters are passed as a dictionary to
entrezpy.esummary.esummarizer.Esummarizer.inquire() and are expected to be the
same as those for the E-Utility [0]. For example:
{'db' : 'pubmed', 'id' : [11237011, 12466850]}
| | Parameter | Type |
|---|---|---|
| E-Utility | db | str |
| | id | list |
| | WebEnv | string |
| | retstart | int |
| | retmax | int |
| | retmode | JSON, enforced by entrezpy |
Not supported E-Utility parameter¶
| | Parameter | Type |
|---|---|---|
| E-Utility | retmode | JSON, enforced by entrezpy |
| | version | XML specific parameter |
Result¶
Instance of entrezpy.esummary.esummary_result.EsummaryResult.
If retmax = 0 or retmode = count, no UIDs are returned. If usehistory is
True (default), the WebEnv and query_key for the request are returned.
Approach¶
- Parameters are checked and the request size is configured
- UIDs are posted to NCBI
- If no errors were encountered, the analyzer with the result storing the WebEnv and query_key for the UIDs is returned
Epost¶
entrezpy.epost.eposter.Eposter implements the E-Utility
EPost [0]. Eposter queries post UIDs onto the Entrez History server
and return the corresponding WebEnv and query_key. If an existing WebEnv is
passed as parameter, the posted UIDs will be added to this WebEnv by
increasing its query_key.
Usage¶
import entrezpy.epost.eposter
e = entrezpy.epost.eposter.Eposter(tool,
email,
apikey=None,
apikey_var=None,
threads=None,
qid=None)
analyzer = e.inquire({'db' : 'pubmed', 'id' : [12466850]})
print(analyzer.get_result().get_link_parameter())
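A sketch of extending an existing WebEnv with additional UIDs, assuming the link parameters returned above contain a 'WebEnv' key as in the E-Utilities parameter set.

# Post additional UIDs into the WebEnv created above; the new post is
# referenced by its own query_key within the same WebEnv.
follow_up = analyzer.get_result().get_link_parameter()
e2 = entrezpy.epost.eposter.Eposter(tool, email)
analyzer2 = e2.inquire({'db' : 'pubmed',
                        'id' : [11237011],
                        'WebEnv' : follow_up['WebEnv']})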
Eposter¶
entrezpy.epost.eposter.Eposter

- param str tool: string with no internal spaces uniquely identifying the software producing the request, i.e. your tool/pipeline.
- param str email: a complete and valid e-mail address of the software developer and not that of a third-party end user. entrezpy is a library, not a tool.
- param str apikey: NCBI API key
- param str apikey_var: environment variable storing an NCBI API key
- param int threads: number of threads
- param str qid: Unique Epost query id. Will be generated if not given.
Supported E-Utility parameter¶
Parameters are passed as a dictionary to
entrezpy.epost.eposter.Eposter.inquire() and are expected to be the
same as those for the E-Utility [0]. For example:
{'db' : 'pubmed', 'id' : [11237011, 12466850]}
| | Parameter | Type |
|---|---|---|
| E-Utility | db | str |
| | id | list |
| | WebEnv | string |
Result¶
Instance of entrezpy.epost.epost_result.EpostResult.
If retmax = 0 or retmode = count, no UIDs are returned. If usehistory is
True (default), the WebEnv and query_key for the request are returned.
Approach¶
- Parameters are checked and the request size is configured.
- UIDs are posted to NCBI.
- If no errors were encountered, the analyzer with the result storing the WebEnv and query_key for the UIDs is returned.
Entrezpy In-depth¶
Entrezpy architecture¶
Queries and requests¶
Entrezpy queries are built from at least one request. A search for all virus sequences in the Entrez database ‘nucleotide’ is one query and has one initial request, the search itself. However, this search will return more UIDs than can be fetched in one go, and to obtain all UIDs several requests are required.
Basic functions¶
Each function is a collection of inherited classes interacting with each other.
Each class implements a specific task of a query. The basic classes required for
an entrezpy query are found in src/entrezpy/base of the repository.
Each query starts by passing E-Utils parameters as a dictionary into the
inquire() method of the query, which is derived from
entrezpy.base.query.EutilsQuery.inquire().
The first step in inquire() is to instantiate a parameter object derived
from entrezpy.base.parameter.EutilsParameter.
The parameters are checked for errors and, if none are found, an instance of
entrezpy.base.parameter.EutilsParameter is returned. The attributes of
entrezpy.base.parameter.EutilsParameter configure the query, and the
required number of entrezpy.base.request.EutilsRequest instances are added
to the queue.
Each request is sent to the corresponding E-Utility and its response is
received. All responses from within a query are analyzed by the same instance of an
entrezpy.base.analyzer.EutilsAnalyzer. The analyzer stores results in
an instance of entrezpy.base.result.EutilsResult.
Error handling¶
The primary approach of entrezpy is to abort if an error has been
encountered, since it is not known what the developer had in mind when
deploying entrezpy.
entrezpy aborts if:
- errors are found in the parameters
- an HTTP error 400 is received
entrezpy continues, but warns, if:
- a result is empty
- a request could not be obtained after 10 retries
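A minimal check around a query is sketched below, using the documented return value of inquire() (the analyzer, or None if request errors were encountered) and the hasFailedRequests()/failed_requests members of the query classes.

import entrezpy.esearch.esearcher

e = entrezpy.esearch.esearcher.Esearcher('entrezpy', 'youremail@email.com')
analyzer = e.inquire({'db' : 'pubmed', 'term' : 'viruses[orgn]', 'retmax' : 10})
if analyzer is None or not analyzer.isSuccess():
  raise RuntimeError('Esearch query failed')
if e.hasFailedRequests():
  print('Failed requests:', e.failed_requests)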
Logging¶
WIP
E-Utilities by entrezpy¶
Entrezpy assembles POST parameters [1], [2], creates the corresponding
requests to interact with the E-Utilities, and reads the received responses.
Entrezpy implements E-Utility functions as queries consisting of at least one
request:
Query
+...............+
| |
0 1 2 3 4 5 6 7 8
| | | | |
+-----+ +-----+ +
R0 R1 R2
\ | /
+----+----+
|
v
entrezpy.base.analyzer.EutilsAnalyzer()
The example depicts the relation between a query and its requests in Entrezpy.
The example query consists of 9 data records. Using a request size of 4 data
records, Entrezpy resolves this query using two requests (R0, R1) of the
given size and a last request (R2) with an adjusted size.
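For the figure above, the expected number of requests follows directly from the query size and the request size; a small worked example:

import math

query_size = 9     # data records in the query (0..8 in the figure)
request_size = 4   # data records per request
expected_requests = math.ceil(query_size / request_size)                  # 3 requests: R0, R1, R2
last_request_size = query_size - (expected_requests - 1) * request_size   # R2 fetches 1 record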
Each query passes all requests and responses through the same instance of its
corresponding entrezpy.base.analyzer.EutilsAnalyzer. The analyzer
can be passed as an argument to each entrezpy query. Each request is analyzed as
soon as it is received. The analyzer base class
entrezpy.base.analyzer.EutilsAnalyzer can be inherited and adjusted
for specific formats or tasks.
Entrezpy offers default analyzers, but most likely you want, or have to,
implement a specific Efetch analyzer. You can use
entrezpy.efetch.efetch_analyzer.EfetchAnalyzer as a template; a minimal sketch follows.
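The following is a minimal sketch of such an analyzer, assuming text responses (e.g. retmode='text', rettype='fasta') that arrive as io.StringIO from convert_response(); it only overrides the three virtual methods of the base class and writes the raw records to a file instead of building a full EutilsResult. An instance can be passed as the analyzer argument of Efetcher.inquire().

import entrezpy.base.analyzer

class TextToFileAnalyzer(entrezpy.base.analyzer.EutilsAnalyzer):
  """Hypothetical fetch analyzer appending text responses to a file."""

  def __init__(self, fname):
    super().__init__()
    self.fname = fname

  def init_result(self, response, request):
    # A full implementation would store an entrezpy.base.result.EutilsResult
    # subclass here; for this sketch the output file stands in for the result.
    if self.result is None:
      self.result = self.fname

  def analyze_error(self, response, request):
    print('Error in request', request.get_request_id(), response)

  def analyze_result(self, response, request):
    self.init_result(response, request)
    with open(self.fname, 'a') as fh:
      fh.write(response.getvalue())  # text responses are assumed to arrive as io.StringIO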
E-Utilities History server¶
The E-Utilities offer to store queries on the NCBI servers and return a WebEnv and query_key referencing such queries. This can skip unnecessary data downloads or be used to modify queries on the NCBI servers.
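As a minimal sketch, assuming the Esearcher and Efetcher classes described above and the documented get_link_parameter() method, a search result referenced on the History server can be reused by a subsequent fetch without downloading the UIDs:

import entrezpy.esearch.esearcher
import entrezpy.efetch.efetcher

email = 'youremail@email.com'
# Search with usehistory (default); the result references WebEnv/query_key.
search = entrezpy.esearch.esearcher.Esearcher('entrezpy', email)
analyzer = search.inquire({'db' : 'nucleotide', 'term' : 'viruses[orgn]', 'retmax' : 5})
follow_up = analyzer.get_result().get_link_parameter()  # WebEnv/query_key parameters
# Fetch the referenced records; the default EfetchAnalyzer is assumed to
# simply print the fetched data.
follow_up.update({'retmode' : 'text', 'rettype' : 'fasta'})
fetch = entrezpy.efetch.efetcher.Efetcher('entrezpy', email)
fetch.inquire(follow_up)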
Error responses¶
Reference¶
Logging module¶
Functions¶
- entrezpy.log.logger.CONFIG = {'level': 'INFO', 'propagate': True, 'quiet': True}¶ Store logger settings
- entrezpy.log.logger.get_root()¶ Returns the module root
- entrezpy.log.logger.resolve_class_namespace(cls)¶ Resolves namespace for logger
- entrezpy.log.logger.get_class_logger(cls)¶ Prepares logger for given class
- entrezpy.log.logger.set_level(level)¶ Sets logging level for applications using entrezpy.
Configuration¶
- entrezpy.log.conf.default_config = {'disable_existing_loggers': False, 'formatters': {'default': {'format': '%(asctime)s %(threadName)s [%(levelname)s] %(name)s: %(message)s'}}, 'handlers': {'console': {'class': 'logging.StreamHandler', 'formatter': 'default', 'stream': 'ext://sys.stderr'}}, 'loggers': {'': {'handlers': ['console']}}, 'version': 1}¶ Dictionary to store logger configuration
Base modules¶
Query¶
- class
entrezpy.base.query.
EutilsQuery
(eutil, tool, email, apikey=None, apikey_var=None, threads=None, qid=None)¶EutilsQuery implements the base class for all entrezpy queries to E-Utils. It handles the information required by every query, e.g. base query url, email address, allowed requests per second, apikey, etc. It declares the virtual method
inquire()
which needs to be implemented by every request since they differ among queries. An NCBI API key will be set as follows:
- passed as argument during initialization
- check the environment variable passed as argument
- check the environment variable NCBI_API_KEY
Upon initialization, the following parameters are set:
- set unique query id
- check for / set NCBI apikey
- initialize
entrezpy.requester.requester.Requester
with allowed requests per second
- assemble Eutil url for the desired EUtils function
- initialize Multithreading queue and register query at
entrezpy.base.monitor.QueryMonitor
for loggingMultithreading is handled using the nested classes
entrezpy.base.query.EutilsQuery.RequestPool
andentrezpy.base.query.EutilsQuery.ThreadedRequester
.Inits EutilsQuery instance with eutil, toolname, email, apikey, apikey_envar, threads and qid.
Parameters:
- eutil (str) – name of eutil function on EUtils server
- tool (str) – tool name
- email (str) – user email
- apikey (str) – NCBI apikey
- apikey_var (str) – environment variable storing NCBI apikey
- threads (int) – set threads for multithreading
- qid (str) – unique query id
Variables:
- id – unique query id
- base_url – unique query id
- requests_per_sec (int) – default limit of requests/sec (set by NCBI)
- max_requests_per_sec (int) – max. requests/sec with apikey (set by NCBI)
- url (str) – full URL for Eutil function
- contact (str) – user email (required by NCBI)
- tool (str) – tool name (required by NCBI)
- apikey (str) – NCBI apikey
- num_threads (int) – number of threads to use
- failed_requests (list) – store failed requests for analysis if desired
- request_pool –
entrezpy.base.query.EutilsQuery.RequestPool
instance- request_counter (int) – requests counter for a EutilsQuery instance
base_url
= 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils'¶Base url for all Eutil request
inquire
(parameter, analyzer)¶Virtual function starting query. Each query requires its own implementation.
Parameters:
- parameter (dict) – E-Utilities parameters
- analzyer (
entrezpy.base.analyzer.EutilsAnalzyer
) – query response analyzerReturns: analyzer
Return type:
entrezpy.base.analyzer.EutilsAnalzyer
check_requests
()¶Virtual function testing and handling failed requests. These requests fail due to HTTP/URL issues and stored
entrezpy.base.query.EutilsQuery.failed_requests
check_ncbi_apikey
(apikey=None, env_var=None)¶Checks and sets NCBI apikey.
Parameters:
- apikey (str) – NCBI apikey
- env_var (str) – environment variable storing NCBI apikey
prepare_request
(request)¶Prepares request for sending to E-Utilities with required query attributes.
Parameters: request ( entrezpy.base.request.EutilsRequest
) – entrezpy request instanceReturns: request instance with EUtils parameters Return type: entrezpy.base.request.EutilsRequest
add_request
(request, analyzer)¶Adds one request and corresponding analyzer to the request pool.
Parameters:
- request (
entrezpy.base.request.EutilsRequest
) – entrezpy request instance- analzyer – entrezpy analyzer instance
monitor_start
(query_parameters)¶Starts query monitoring
Parameters: query_parameters ( entrezpy.base.parameter.EutilsParameter
) – query parameters
monitor_stop
()¶Stops query monitoring
monitor_update
(updated_query_parameters)¶Updates query monitoring parameters if follow up requests are required.
Parameters: updated_query_parameters ( entrezpy.base.parameter.EutilsParameter
) – updated query parameters
hasFailedRequests
()¶Reports if at least one request failed.
dump
()¶Dump all attributes
isGoodQuery
()¶Tests for request errors
rtype: bool
Parameter¶
- class
entrezpy.base.parameter.
EutilsParameter
(parameter=None)¶EutilsParameter set and check parameters for each query. EutilsParameter is populated from a dictionary with valid E-Utilities parameters for the corresponding query. It declares virtual functions where necessary.
Simple helper functions are presented to test the common parameters db, WebEnv, query_key and usehistory.
Note
usehistory
is the parameter used for Entrez history queries and is set to True (use it) by default. It can be set to False to omit history server use.
haveExpectedRequests()
tests if the number of requests has been calculated. The virtual methods
check()
anddump()
need their own implementation since they can vary between queries.
Warning
check()
is expected to run after all parameters have been set.
Parameters: parameter (dict) – Eutils query parameters
Variables:
- db (str) – Entrez database name
- webenv (str) – WebEnv
- querykey (int) – querykey
- expected_request (int) – number of expected request for the query
- doseq (bool) – use id= parameter for each uid in POST
haveDb
()¶Check for required db parameter
Return type: bool
haveWebenv
()¶Check for required WebEnv parameter
Return type: bool
haveQuerykey
()¶Check for required QueryKey parameter
Return type: bool
useHistory
()¶Check if history server should be used.
Return type: bool
haveExpectedRequets
()¶Checks for expected requests. Hints an error if no requests are expected.
Return type: bool
check
()¶Virtual function to run a check before starting the query. This is a crucial step and should abort upon failing.
Raises: NotImplementedError – if not implemented
dump
()¶Dump instance attributes
Return type: dict Raises: NotImplementedError – if not implemented
Request¶
- class
entrezpy.base.request.
EutilsRequest
(eutil, db)¶EutilsRequest is the base class for requests from
entrezpy.base.query.EutilsQuery
.EutilsRequests instantiate in
entrezpy.base.query.EutilsQuery.inquire()
before being added to the request pool byentrezpy.base.query.EutilsQuery.add_request()
. Each EutilsRequest triggers an answer at the NCBI Entrez servers if no connection errors occur.
EutilsRequest
stores the required information for POST requests. Its status can be queried from outside byentrezpy.base.request.EutilsRequest.get_observation()
. EutilsRequest instances store information not present in the server response and is required byentrezpy.base.analyzer.EutilsAnalyzer
to parse responses and errors correctly. Several instance attributes are not required for a POST request but help debugging. Each request is automatically assigned an id to identify and trace requests using the query id and request id.
Parameters:
- eutil (str) – eutil function for this request, e.g. efetch.fcgi
- db (str) – database for request
Initializes a new request with initial attributes as part of a query in
entrezpy.base.query.EutilsQuery
.
Variables:
- tool (str) – tool name to which this request belongs
- url (str) – full Eutil url
- contact (str) – user email
- apikey (str) – NCBI apikey
- query_id (str) –
entrezpy.base.query.EutilsQuery.query_id
which initiated this request- status (int) – request status : 0->success, 1->Fail,2->Queued
- size (int) – size of request, e.g. number of UIDs
- start_time (float) – start time of request in seconds since epoch
- duration – duration for this request in seconds
- doseq – set doseq parameter in
entrezpy.request.Request.request()
Note
status
is work in progress.
get_post_parameter
()¶Virtual function returning the POST parameters for the request from required attributes.
Return type: dict Raises: NotImplemetedError –
prepare_base_qry
(extend=None)¶Returns instance attributes required for every POST request.
Parameters: extend (dict) – parameters extending basic parameters Returns: base parameters for POST request Return type: dict
set_status_success
()¶Set status if request succeeded
set_status_fail
()¶Set status if request failed
report_status
(processed_requests=None, expected_requests=None)¶Reports request status when triggered
get_request_id
()¶
Returns: full request id Return type: str
set_request_error
(error)¶Sets request error and HTTP/URL error message
Parameters: error (str) – HTTP/URL error
start_stopwatch
()¶Starts time to measure request duration.
calc_duration
()¶Calculates request duration
dump_internals
(extend=None)¶Dumps internal attributes for request.
Parameters: extend (dict) – extend dump with additional information
Analyzer¶
- class
entrezpy.base.analyzer.
EutilsAnalyzer
¶EutilsAnalyzer is the base class for an entrezpy analyzer. It prepares the response based on the requested format and checks for E-Utilities errors. The function parse() is invoked after every request by the corresponding query class, e.g. Esearcher. This allows analyzing data as it arrives without waiting until larger queries have been fetched. This approach allows implementing analyzers which can store already downloaded data to establish checkpoints or trigger other actions based on the received data.
Two virtual classes are the core and need their own implementation to support specific queries:
Note
Responses from NCBI are not very well documented and functions will be extended as new errors are encountered.
Inits EutilsAnalyzer with unknown type of result yet. The result needs to be set upon receiving the first response by
init_result()
.
Variables:
- hasErrorResponse (bool) – flag indicating error in response
- result – result instance
known_fmts
= {'json', 'text', 'xml'}¶Store formats known to EutilsAnalzyer
init_result
(response, request)¶Virtual function to initialize result instance. This allows to set attributes from the first response and request.
Parameters: response (dict or io.StringIO) – converted response from convert_response()
Raises: NotImplementedError – if implementation is missing
analyze_error
(response, request)¶Virtual function to handle error responses
Parameters: response (dict or io.StringIO) – converted response from convert_response()
Raises: NotImplementedError – if implementation is missing
analyze_result
(response, request)¶Virtual function to handle responses, i.e. parsing them and prepare them for
entrezpy.base.result.EutilsResult
Parameters: response (dict or io.StringIO) – converted response from convert_response()
Raises: NotImplementedError – if implementation is missing
parse
(raw_response, request)¶Check for errors and calls parser for the raw response.
Parameters:
- raw_response (
urllib.request.Request
) – response fromentrezpy.requester.requester.Requester
- request (
entrezpy.base.request.EutilsRequest
) – query requestRaises: NotImplementedError – if request format is not in
EutilsAnalyzer.known_fmts
convert_response
(raw_response_decoded, request)¶Converts raw_response into the expected format, deduced from request and set via the retmode parameter.
Parameters:
- raw_response (
urllib.request.Request
) – responseentrezpy.requester.requester.Requester
- request (
entrezpy.base.request.EutilsRequest
) – query requestReturns: response in parseable format
Return type: dict or
io.stringIO
- ..note::
- Using threads without locks randomly ‘loses’ the response, i.e. the raw response is emptied between requests. With locks, it works, but threading is not much faster than non-threading. It seems JSON is more prone to this than XML.
isErrorResponse
(response, request)¶Checking for error messages in response from Entrez Servers and set flag
hasErrorResponse
.
Parameters:
- response (dict or
io.stringIO
) – parseable response fromconvert_response()
- request (
entrezpy.base.request.EutilsRequest
) – query requestReturns: error status
Return type: bool
check_error_xml
(response)¶Checks for errors in XML responses
Parameters: response ( io.stringIO
) – XML responseReturns: if XML response has error message Return type: bool
check_error_json
(response)¶Checks for errors in JSON responses. Not unified among Eutil functions.
Parameters: response (dict) – response Returns: status if JSON response has error message Return type: bool
isSuccess
()¶Test if response has errors
Return type: bool
get_result
()¶Return result
Returns: result instance Return type: entrezpy.base.result.EutilsResult
follow_up
()¶Return follow-up parameters if available
Returns: Follow-up parameters Return type: dict
isEmpty
()¶Test for empty result
Return type: bool
Result¶
- class
entrezpy.base.result.
EutilsResult
(function, qid, db, webenv=None, querykey=None)¶EutilsResult is the base class for an entrezpy result. It sets the required result attributes common for all result and declares virtual functions to interact with other entrezpy classes. Empty results are successful results since no query error has been received.
entrezpy.base.result.EutilsResult.size()
is important to
- determine if and how many follow-up requests are required
- if it’s an empty result
Parameters:
- function (string) – EUtil function of the result
- qid (string) – query id
- db (string) – Entrez database name for result
- webenv (string) – WebEnv of response
- querykey (int) – querykey of response
size
()¶Returns result size in the corresponding ResultSize unit
Return type: int Raises: NotImplementedError – if implementation is missing
dump
()¶Dumps all instance attributes
Return type: dict Raises: NotImplementedError – if implementation is missing
get_link_parameter
(reqnum=0)¶Assembles parameters for automated follow-ups. Use the query key from the first request by default.
Parameters: reqnum (int) – request number for which query_key should be returned Returns: EUtils parameters Return type: dict Raises: NotImplementedError – if implementation is missing
isEmpty
()¶Indicates empty result.
Return type: bool Raises: NotImplementedError – if implementation is missing
Monitor¶
Elink modules¶
Elinker¶
- class
entrezpy.elink.elinker.
Elinker
(tool, email, apikey=None, apikey_var=None, threads=None, qid=None)¶Bases:
entrezpy.base.query.EutilsQuery
Elinker implements elink queries to E-Utilities [0]. Elinker implements the inquire() method to link data sets on NCBI Entrez servers. All parameters described in [0] are accepted. Elink queries consist of one request linking UIDs or an earlier request on the history server within the same or different Entrez database. [0]: https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ELink
Parameters:
- tool (str) – tool name
- email (str) – user email
- apikey (str) – NCBI apikey
- apikey_var (str) – environment variable storing NCBI apikey
- threads (int) – set threads for multithreading
- qid (str) – unique query id
inquire
(parameter, analyzer=<entrezpy.elink.elink_analyzer.ElinkAnalyzer object>)¶Implements virtual function inquire()
- Prepares parameter instance
entrezpy.elink.elink_parameter.ElinkerParameter
- Starts threading monitor
monitor_start()
- Adds ElinkRequests to queue
add_request()
- Runs and analyzes all requests
- Checks for errors
check_requests()
Parameters:
- parameter (dict) – ELink parameter
- analyzer (analyzer) – analyzer for Elink Results, default is
entrezpy.elink.elink_analyzer.ElinkAnalyzer
Returns: analyzer or None if request errors have been encountered
Return type:
entrezpy.base.analyzer.EntrezpyAnalyzer
instance or None
ElinkParameter¶
- class
entrezpy.elink.elink_parameter.
ElinkParameter
(parameter)¶Bases:
entrezpy.base.parameter.EutilsParameter
ElinkParameter checks query specific parameters and configures a
entrezpy.elink.elink_query.ElinkQuery
instance. A link gets its size fromentrezpy.elink.elink_parameter.ElinkParameter.uids
(from the id Eutils parameter) or earlier result stored on the Entrez history server.entrezpy.elink.elink_parameter.ElinkParameter.retmode
is JSON where possible andentrezpy.elink.elink_parameter.ElinkParameter.cmd
is neighbor. ELink has no set maximum for UIDs which can be linked in one query, fixingentrezpy.elink.elink_parameter.ElinkParameter.query_size
,entrezpy.elink.elink_parameter.ElinkParameter.request_size
, andentrezpy.elink.elink_parameter.ElinkParameter.expected_requests
to 1.
Parameters: parameter (dict) – Eutils Elink parameter
nodb_cmds
= {'acheck', 'lcheck', 'llinks', 'llinkslib', 'ncheck', 'prlinks'}¶Elink commands not requiring the db parameter
retmodes
= {'llinkslib': 'xml'}¶The llinkslib elink command is the only command only returning XML
def_retmode
= 'json'¶Use JSON whenever possible
check
()¶Implements
entrezpy.base.parameter.check()
and aborts if required parameters are missing.
haveDb
()¶Check for required db parameter
Return type: bool
haveExpectedRequets
()¶Checks for expected requests. Hints an error if no requests are expected.
Return type: bool
haveQuerykey
()¶Check for required QueryKey parameter
Return type: bool
haveWebenv
()¶Check for required WebEnv parameter
Return type: bool
useHistory
()¶Check if history server should be used.
Return type: bool
set_retmode
(retmode)¶Checks for valid and supported Elink retmodes
Parameters: retmode (str) – requested retmode Returns: default or cmd-adjusted retmode Return type: str
dump
()¶
Returns: Instance attributes Return type: dict
ElinkAnalyzer¶
- class
entrezpy.elink.elink_analyzer.
ElinkAnalyzer
¶Bases:
entrezpy.base.analyzer.EutilsAnalyzer
ElinkAnalyzer implements parsing and superficial analysis of responses from ELink queries. ElinkAnalyzer implements the virtual methods
analyze_result()
andanalyze_error()
. The variety in possible Elink response formats results in several specialized parser. Default is to obtain results in JSON.ElinkAnalyzer instances create
linked.Linkset
orrelaxed.Linkset
instances, depending on the request Elink result.entrezpy.elink.linkset.bare.Linkset.new_unit()
is called to set the type of LinkSet unit based ont he used Elink command.Warning
Expect for ‘llinkslib’, all responses are expected in JSON. ElinkAnalyzer will abort if a response from another command is not in JSON.
Variables: result – entrezpy.elink.elink_result.ElinkResult
init_result
(response, request)¶
analyze_error
(response, request)¶Implements virtual function
entrezpy.base.analyzer.analyze_error()
.
get_linkset_unit
(elink_cmd)¶
analyze_result
(response, request)¶Implements virtual method
entrezpy.base.analyzer.analyze_result()
and checks used elink command to run according result parser.
analyze_linklist
(linksets, lset_unit)¶Parses ELink responses listing link information for UIDs.
Parameters:
- linksets (dict) – ‘linkset’ part in an ELink JSON response from NCBI.
- lset_unit – Elink LinkSet unit instance
analyze_links
(linksets, lset_unit)¶Parses ELink responses with links as UIDs or History server references.
Parameters:
- linksets (dict) – ‘linkset’ part in an ELink JSON response from NCBI.
- lset_unit – Elink LinkSet unit instance
parse_llinkslib
(response, lset_unit, lset=None)¶Exclusive XML parser for ‘llinkslib’ responses. Its approach is ugly but parses the XML. The cmd parameter is always ‘llinkslib’ but retains the calling signature.
Parameters:
- response (io.StringIO) – XML response from Entrez
- lset_unit – Elink LinkSet unit instance
check_error_json
(response)¶Checks for errors in JSON responses. Not unified among Eutil functions.
Parameters: response (dict) – response Returns: status if JSON response has error message Return type: bool
check_error_xml
(response)¶Checks for errors in XML responses
Parameters: response ( io.stringIO
) – XML responseReturns: if XML response has error message Return type: bool
convert_response
(raw_response_decoded, request)¶Converts raw_response into the expected format, deduced from request and set via the retmode parameter.
Parameters:
- raw_response (
urllib.request.Request
) – responseentrezpy.requester.requester.Requester
- request (
entrezpy.base.request.EutilsRequest
) – query requestReturns: response in parseable format
Return type: dict or
io.stringIO
- ..note::
- Using threads without locks randomly ‘loses’ the response, i.e. the raw response is emptied between requests. With locks, it works, but threading is not much faster than non-threading. It seems JSON is more prone to this than XML.
follow_up
()¶Return follow-up parameters if available
Returns: Follow-up parameters Return type: dict
get_result
()¶Return result
Returns: result instance Return type: entrezpy.base.result.EutilsResult
isEmpty
()¶Test for empty result
Return type: bool
isErrorResponse
(response, request)¶Checking for error messages in response from Entrez Servers and set flag
hasErrorResponse
.
Parameters:
- response (dict or
io.stringIO
) – parseable response fromconvert_response()
- request (
entrezpy.base.request.EutilsRequest
) – query requestReturns: error status
Return type: bool
isSuccess
()¶Test if response has errors
Return type: bool
known_fmts
= {'json', 'text', 'xml'}¶
parse
(raw_response, request)¶Check for errors and calls parser for the raw response.
Parameters:
- raw_response (
urllib.request.Request
) – response fromentrezpy.requester.requester.Requester
- request (
entrezpy.base.request.EutilsRequest
) – query requestRaises: NotImplementedError – if request format is not in
EutilsAnalyzer.known_fmts
ElinkRequest¶
- class
entrezpy.elink.elink_request.
ElinkRequest
(eutil, parameter)¶Bases:
entrezpy.base.request.EutilsRequest
The ElinkRequest class implements a single request as part of a Elinker query. It stores and prepares the parameters for a single request. See
entrezpy.elink.elink_parameter.ElinkParameter
for parameter description.
Parameters:
- parameter – request parameter
- type –
entrezpy.elink.elink_parameter.ElinkParameter
linkname_cmds
= {'neighbor', 'neighbor_history', 'neighbor_score'}¶
get_post_parameter
()¶Implements
entrezpy.base.request.EutilsRequest.get_post_parameter()
.
- If WebEnv and query_key are given the history server will be used.
- If UIDs are given create an id parameter for each UID, i.e. id=123&id=456 (see
entrezpy.elink.elink.elink_parameter.ElinkParameter.doseq
)- Setting
entrezpy.elink.elink.elink_parameter.ElinkParameter.doseq
to False concatenates UIDs with commas, i.e. id=123,456
- linkname: For neighbor, neighbor_history, or neighbor_score commands without a given linkname, one is generated. See the documentation for more details.
set_linkname
(qry)¶
dump
()¶Dumps instance attributes
Returns: instance attributes Return type: dict
calc_duration
()¶Calculates request duration
dump_internals
(extend=None)¶Dumps internal attributes for request.
Parameters: extend (dict) – extend dump with additional information
get_request_id
()¶
Returns: full request id Return type: str
prepare_base_qry
(extend=None)¶Returns instance attributes required for every POST request.
Parameters: extend (dict) – parameters extending basic parameters Returns: base parameters for POST request Return type: dict
report_status
(processed_requests=None, expected_requests=None)¶Reports request status when triggered
set_request_error
(error)¶Sets request error and HTTP/URL error message
Parameters: error (str) – HTTP/URL error
set_status_fail
()¶Set status if request failed
set_status_success
()¶Set status if request succeeded
start_stopwatch
()¶Starts time to measure request duration.
ElinkResult¶
- class
entrezpy.elink.elink_result.
ElinkResult
(qid, cmd)¶Bases:
entrezpy.base.result.EutilsResult
The ElinkResult class implements the uniform handling of different Elink LinkSets instances. It creates follow-up parameters if possible. ElinkResult instances store all results from one Elinker query as an aggregation of
entrezpy.elink.linkset.bare.LinkSet
instances. The size unit for ElinkResult isentrezpy.elink.linkset.bare.LinkSet
.
Parameters:
- qid (str) – query id
- cmd (str) – used Elink command
Variables:
- linksets (list) – list to store analyzed linksets
- cmd (str) – invoked ELink command
size
()¶Implements
entrezpy.base.result.EutilsResult.size()
. :rtype: int
isEmpty
()¶Test for empty result
Returns: True if empty, False otherwise Return type: bool
add_linkset
(linkset)¶Store linkset in
self.linkset
Parameters: linkset (LinkSet instance) – populated LinkSet
dump
()¶
Returns: all ELinkResult instance attributes Return type: dict
get_link_parameter
(reqnum=0)¶Assemble follow-up parameters depending if the History server has been used.
Returns: parameters for follow-up query Rtype: dict
collapse_history_linksets
()¶Assemble follow-up WebEnv and query_key parameters in linksets. Skip those who cannot and test for unexpected result
Returns: parameters for follow-up query using History server Rtype: dict
collapse_uid_linksets
()¶Assemble follow-up UID and database parameters in linksets. Skip those who cannot and test for unexpected result
Returns: parameters for follow-up query using UIDs Rtype: dict
check_unexpected_dbnum
(dbs)¶Deal with more databases than expected when linking. Expecting one database per request for linking. Abort if more are present since this is unexpected. It shouldn’t happen, but make sure to catch such a case, report it and abort.
Parameters: dbs (dbs) – unique database names encountered in all LinkSets
canLink
(lset)¶Test if a linkset can be used to generate automated follow-up queries
Parameters: lset (LinkSet instance) – LinkSet Returns: True if empty, False otherwise Return type: bool
EPost modules¶
Query¶
- class
entrezpy.epost.eposter.
Eposter
(tool, email, apikey=None, apikey_var=None, threads=None, qid=None)¶Eposter implements Epost queries to E-Utilities [0]. EPost posts UIDs to the history server. Without a passed WebEnv, a new WebEnv and corresponding QueryKey are returned. With a given WebEnv the posted UIDs will be added to this WebEnv and the corresponding QueryKey is returned. All parameters described in [0] are accepted. [0]: https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EPost
Inits Eposter instance with given attributes.
Parameters:
- tool (str) – tool name
- email (str) – user email
- apikey (str) – NCBI apikey
- apikey_var (str) – environment variable storing NCBI apikey
- threads (int) – set threads for multithreading
- qid (str) – unique query id
inquire
(parameter, analyzer=<entrezpy.epost.epost_analyzer.EpostAnalyzer object>)¶
- Implements
entrezpy.base.query.inquire()
and posts UIDs to Entrez.- Epost is only one request.
Parameters:
- parameter (dict) – Epost parameter
- analyzer (analyzer) – analyzer for Epost results, default is
entrezpy.epost.epost_analyzer.EpostAnalyzer
Returns: analyzer or None if request errors have been encountered
Return type:
entrezpy.base.analyzer.EntrezpyAnalyzer
instance or None
Parameter¶
- class
entrezpy.epost.epost_parameter.
EpostParameter
(parameter)¶EpostParameter checks query specific parameters and configures a
entrezpy.epost.epost_query.EpostQuery
instance. Force XML since Epost responds only XML. Epost requests don’t have follow-ups.
Parameters: parameter (dict) – Eutils Epost parameters
Variables:
- uids (list) – UIDs to post
- retmode (str) – fix retmode to XML
- query_size (int) – size of query, here number of UIDs
- request_size (int) – size of request, here number of UIDs
- expected_requests (int) – number of expected requests, here 1
check
()¶Implements
entrezpy.base.parameter.EutilsParameter.check()
by checking for missing database parameter and UIDs.
dump
()¶Dump instance variables
Return type: dict
haveDb
()¶Check for required db parameter
Return type: bool
haveExpectedRequets
()¶Checks for expected requests. Hints an error if no requests are expected.
Return type: bool
haveQuerykey
()¶Check for required QueryKey parameter
Return type: bool
haveWebenv
()¶Check for required WebEnv parameter
Return type: bool
useHistory
()¶Check if history server should be used.
Return type: bool
Request¶
- class
entrezpy.epost.epost_request.
EpostRequest
(eutil, parameter)¶EpostRequest implements a single request as part of an Epost query. It stores and prepares the parameters for a single request. See
entrezpy.epost.epost_parameter.EpostParameter
for parameter description.
Parameters:
- parameter – request parameter
- type –
entrezpy.epost.epost_parameter.EpostParameter
get_post_parameter
()¶Implements
entrezpy.base.request.EutilsRequest.get_post_parameter()
dump
()¶Dump instance attributes
Return type: dict
calc_duration
()¶Calculates request duration
dump_internals
(extend=None)¶Dumps internal attributes for request.
Parameters: extend (dict) – extend dump with additional information
get_request_id
()¶
Returns: full request id Return type: str
prepare_base_qry
(extend=None)¶Returns instance attributes required for every POST request.
Parameters: extend (dict) – parameters extending basic parameters Returns: base parameters for POST request Return type: dict
report_status
(processed_requests=None, expected_requests=None)¶Reports request status when triggered
set_request_error
(error)¶Sets request error and HTTP/URL error message
Parameters: error (str) – HTTP/URL error
set_status_fail
()¶Set status if request failed
set_status_success
()¶Set status if request succeeded
start_stopwatch
()¶Starts time to measure request duration.
Analyzer¶
- class
entrezpy.epost.epost_analyzer.
EpostAnalyzer
¶EpostAnalyzer implements the analysis of EPost responses from E-Utils. Epost puts UIDs onto the History server and returns the corresponding WebEnv and QueryKey. Epost only responds in XML; therefore a dictionary imitating a JSON input is assembled and passed as result to
entrezpy.epost.epost_result.EpostResult
init_result
(response, request)¶Implements
entrezpy.base.analyzer.EutilsAnalyzer.init_result()
and initsentrezpy.epost.epost_result.EpostResult
.
analyze_result
(response, request)¶Implements
entrezpy.base.analyzer.EutilsAnalyzer.analyze_result()
. The response is one WebEnv and QueryKey and the result can be initiated after parsing them.
Parameters:
- response – EUtils response
- request – entrezpy request
analyze_error
(response, request)¶Implements
entrezpy.base.analyzer.EutilsAnalyzer.analyze_error()
.
Parameters:
- response – EUtils response
- request – entrezpy request
Result¶
- class
entrezpy.epost.epost_result.
EpostResult
(response, request)¶EpostResult stores WebEnv and QueryKey from posting UIDs to the History server. Since no limit is imposed on the number of UIDs which can be posted in one query, the size of the result is the size of the request and only one WebEnv and QueryKey are returned.
Parameters:
- request – entrezpy Epost request instance
- response (dict) – response
Request type: Variables: uids (list) – posted UIDs
dump
()¶Dumps all instance attributes
Return type: dict Raises: NotImplementedError – if implementation is missing
get_link_parameter
(reqnum=0)¶Assembles parameters for automated follow-ups. Use the query key from the first request by default.
Parameters: reqnum (int) – request number for which query_key should be returned Returns: EUtils parameters Return type: dict Raises: NotImplementedError – if implementation is missing
size
()¶Returns result size in the corresponding ResultSize unit
Return type: int Raises: NotImplementedError – if implementation is missing
isEmpty
()¶Indicates empty result.
Return type: bool Raises: NotImplementedError – if implementation is missing
Esearch modules¶
Esearcher¶
- class
entrezpy.esearch.esearcher.
Esearcher
(tool, email, apikey=None, apikey_var=None, threads=None, qid=None)¶Bases:
entrezpy.base.query.EutilsQuery
Esearcher implements ESearch queries to NCBI's E-Utilities. Esearch queries return UIDs or WebEnv/QueryKey references to Entrez' History server. Esearcher implements
entrezpy.base.query.EutilsQuery.inquire()
which analyzes the first result and automatically configures subsequent requests to get all queried UIDs if required.
inquire
(parameter, analyzer=<entrezpy.esearch.esearch_analyzer.EsearchAnalyzer object>)¶Implements
entrezpy.base.query.EutilsQuery.inquire()
and configures follow-up requests if required.
Parameters:
- parameter (dict) – ESearch parameter
- analyzer (analyzer) – analyzer for ESearch results, default is
entrezpy.esearch.esearch_analyzer.EsearchAnalyzer
Returns: analyzer instance or None if request errors have been encountered
Return type:
initial_search
(parameter, analyzer)¶Does first request and triggers follow-up if required or possible.
Parameters:
- parameter (
entrezpy.esearch.esearch_parameter.EsearchParamater
) – Esearch parameter instances- analyzer (
entrezpy.esearch.esearch_analyzer.EsearchAnalyzer
) – Esearch analyzer instanceReturns: follow-up parameter or None
Return type:
entrezpy.esearch.esearch_parameter.EsearchParamater
or None
isGoodQuery
()¶Tests for request errors
rtype: bool
entrezpy.esearch.esearcher.
configure_follow_up
(parameter, analyzer)¶Adjusting EsearchParameter to follow-up results based on the initial Esearch result. Fetch remaining UIDs using the history server.
Parameters:
- analyzer (
entrezpy.search.esearch_analyzer.EsearchAnalyzer
) – Esearch analyzer instance- parameter – Initial Esearch parameter
entrezpy.esearch.esearcher.
reachedLimit
(parameter, analyzer)¶Checks if the set limit has been reached
Return type: bool
EsearchParameter¶
entrezpy.esearch.esearch_parameter.
MAX_REQUEST_SIZE
= 100000¶Maximum number of UIDs for one request
- class
entrezpy.esearch.esearch_parameter.
EsearchParameter
(parameter)¶Bases:
entrezpy.base.parameter.EutilsParameter
EsearchParameter checks query specific parameters and configures an Esearch query. If more than one request is required the instance is reconfigured by
entrezpy.esearch.esearcher.Esearcher.configure_follow_up()
.Note
EsearchParameter works best when using the NCBI Entrez history server. If usehistory is not used, linking requests cannot be guaranteed.
goodDateparam
()¶
Return type: bool
useMinMaxDate
()¶
Return type: bool
set_uilist
(rettype)¶
Return type: bool
adjust_retmax
(retmax)¶Adjusts retmax parameter. Order of check is crucial.
Parameters: retmax (int) – retmax value Returns: adjusted retmax Return type: int
adjust_reqsize
(request_size)¶Adjusts request size for low retmax
Returns: adjusted request size Return type: int
calculate_expected_requests
(qsize=None, reqsize=None)¶Calculate and set the expected number of requests. Uses internal parameters if none are provided.
Parameters:
- or None qsize (int) – query size, i.e. expected number of data sets
- reqsize (int) – number of data sets to fetch in one request
check
()¶Implements
entrezpy.base.parameter.EutilsParameter.check
to check for the minimum required parameters. Aborts if any check fails.
dump
()¶Dump instance attributes
Return type: dict Raises: NotImplementedError – if not implemented
haveDb
()¶Check for required db parameter
Return type: bool
haveExpectedRequets
()¶Checks for expected requests. Hints an error if no requests are expected.
Return type: bool
haveQuerykey
()¶Check for required QueryKey parameter
Return type: bool
haveWebenv
()¶Check for required WebEnv parameter
Return type: bool
useHistory
()¶Check if history server should be used.
Return type: bool
EsearchAnalyzer¶
- class
entrezpy.esearch.esearch_analyzer.
EsearchAnalyzer
¶Bases:
entrezpy.base.analyzer.EutilsAnalyzer
EsearchAnalyzer implements the analysis of ESearch responses from E-Utils. JSON formatted data is enforced in responses. The result are stored as a
entrezpy.esearch.esearch_result.EsearchResult
instance.
Variables: result – entrezpy.esearch.esearch_result.EsearchResult
init_result
(response, request)¶Inits
entrezpy.esearch.esearch_result.EsearchResult
.
Returns: if result is initiated Return type: bool
analyze_result
(response, request)¶Implements
entrezpy.base.analyzer.EsearchAnalyzer.analyze_result()
.
Parameters:
- response (dict) – Esearch response
- request (
entrezpy.esearch.esearch_request.EsearchRequest
) – Esearch request
analyze_error
(response, request)¶Implements
entrezpy.base.analyzer.EutilsAnalyzer.analyze_error()
.
param dict response: Esearch response param request: Esearch request type request: entrezpy.esearch.esearch_request.EsearchRequest
size
()¶Returns number of analyzed UIDs in
result
Return type: int
query_size
()¶Returns number of expected UIDs in
result
Return type: int
reference
()¶Returns History Server references from
result
Returns: History Server referencess Return type: entrezpy.base.referencer.EutilReferencer.Reference
adjust_followup
(parameter)¶Adjusts result attributes from follow-up.
Parameters:
- parameter – Esearch parameter
- type –
entrezpy.esearch.esearch_parameter.EsearchParameter
check_error_json
(response)¶Checks for errors in JSON responses. Not unified among Eutil functions.
Parameters: response (dict) – response Returns: status if JSON response has error message Return type: bool
check_error_xml
(response)¶Checks for errors in XML responses
Parameters: response ( io.stringIO
) – XML responseReturns: if XML response has error message Return type: bool
convert_response
(raw_response_decoded, request)¶Converts raw_response into the expected format, deduced from request and set via the retmode parameter.
Parameters:
- raw_response (
urllib.request.Request
) – responseentrezpy.requester.requester.Requester
- request (
entrezpy.base.request.EutilsRequest
) – query requestReturns: response in parseable format
Return type: dict or
io.stringIO
- ..note::
- Using threads without locks randomly ‘loses’ the response, i.e. the raw response is emptied between requests. With locks, it works, but threading is not much faster than non-threading. It seems JSON is more prone to this than XML.
follow_up
()¶Return follow-up parameters if available
Returns: Follow-up parameters Return type: dict
get_result
()¶Return result
Returns: result instance Return type: entrezpy.base.result.EutilsResult
isEmpty
()¶Test for empty result
Return type: bool
isErrorResponse
(response, request)¶Checking for error messages in response from Entrez Servers and set flag
hasErrorResponse
.
Parameters:
- response (dict or
io.stringIO
) – parseable response fromconvert_response()
- request (
entrezpy.base.request.EutilsRequest
) – query requestReturns: error status
Return type: bool
isSuccess
()¶Test if response has errors
Return type: bool
known_fmts
= {'json', 'text', 'xml'}¶
parse
(raw_response, request)¶Check for errors and calls parser for the raw response.
Parameters:
- raw_response (
urllib.request.Request
) – response fromentrezpy.requester.requester.Requester
- request (
entrezpy.base.request.EutilsRequest
) – query requestRaises: NotImplementedError – if request format is not in
EutilsAnalyzer.known_fmts
EsearchRequest¶
- class
entrezpy.esearch.esearch_request.
EsearchRequest
(eutil, parameter, start, size)¶Bases:
entrezpy.base.request.EutilsRequest
The EsearchRequest class implements a single request as part of an Esearch query. It stores and prepares the parameters for a single request. See entrezpy.esearch.esearch_parameter.EsearchParameter for a parameter description. Request sizes are configured by setting a start, i.e. the index of the first UID to fetch, and a size, i.e. how many UIDs to fetch. These are set by entrezpy.esearch.esearch_query.Esearcher.inquire()
.
Parameters:
- parameter – request parameter
- type –
entrezpy.esearch.esearch_parameter.EsearchParameter
- start (int) – number of first UID to fetch
- size (int) – request size
get_post_parameter
()¶Virtual function returning the POST parameters for the request from required attributes.
Return type: dict Raises: NotImplementedError –
dump
()¶
Return type: dict
calc_duration
()¶Calculates request duration
dump_internals
(extend=None)¶Dumps internal attributes for request.
Parameters: extend (dict) – extend dump with additional information
get_request_id
()¶
Returns: full request id Return type: str
prepare_base_qry
(extend=None)¶Returns instance attributes required for every POST request.
Parameters: extend (dict) – parameters extending basic parameters Returns: base parameters for POST request Return type: dict
report_status
(processed_requests=None, expected_requests=None)¶Reports request status when triggered
set_request_error
(error)¶Sets request error and HTTP/URL error message
Parameters: error (str) – HTTP/URL error
set_status_fail
()¶Set status if request failed
set_status_success
()¶Set status if request succeeded
start_stopwatch
()¶Starts the timer to measure request duration.
EsearchResult¶
- class
entrezpy.esearch.esearch_result.
EsearchResult
(response, request)¶Bases:
entrezpy.base.result.EutilsResult
EsearchResult stores fetched UIDs and/or WebEnv/QueryKey references and creates follow-up parameters. UIDs are stored as strings, even though they are numeric, since responses can also contain accessions when the idtype option is used.
Parameters:
- response (dict) – Esearch response
- request (entrezpy.esearch.esearch_request.EsearchRequest) – Esearch request instance for this query
Variables: uids (list) – analyzed UIDs from response
dump
()¶
Return type: dict
get_link_parameter
(reqnum=0)¶Assembles follow-up parameters for linking. The first request returns all required information, so its querykey is used in such a case.
Return type: dict
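The returned dictionary carries the History Server reference so a follow-up query, e.g. Elink or Efetch, can reuse the search result instead of resending UIDs. The keys below are an illustration based on the standard E-Utilities parameters WebEnv and query_key, not a verbatim dump of the method's output.

# Illustrative shape only; the actual keys returned by get_link_parameter() may differ.
link_parameter = {
  'db': 'nucleotide',    # database of the search result
  'WebEnv': 'MCID_...',  # History Server reference (placeholder)
  'query_key': 1,        # request number within this WebEnv
}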
isEmpty
()¶Empty search result has no webenv/querykey and/or no fetched UIDs.
size
()¶Returns number of analyzed UIDs.
Return type: int
query_size
()¶Returns number of all UIDs for search (count).
Return type: int
add_response
(response)¶Adds responses from individual requests.
Parameters: response (dict) – Esearch response
Efetch modules¶
Efetcher¶
- class
entrezpy.efetch.efetcher.
Efetcher
(tool, email, apikey=None, apikey_var=None, threads=None, qid=None)¶Bases:
entrezpy.base.query.EutilsQuery
Efetcher implements Efetch E-Utilities queries [0]. It implements
entrezpy.base.query.EutilsQuery.inquire()
to fetch data from NCBI Entrez servers.
[0]: https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch
[1]: https://www.ncbi.nlm.nih.gov/books/NBK25497/table/chapter2.T._entrez_unique_identifiers_ui/?report=objectonly
Variables: result – entrezpy.base.result.EutilsResult
inquire
(parameter, analyzer=<entrezpy.efetch.efetch_analyzer.EfetchAnalyzer object>)¶Implements
entrezpy.base.query.EutilsQuery.inquire()
and configures the fetch.
Note
Efetch prefers to know the number of UIDs to fetch, i.e. the number of passed UIDs or retmax. If this information is missing, the maximum number of UIDs for the specific retmode and rettype is fetched.
Parameters:
- parameter (dict) – EFetch parameter
- analyzer (entrezpy.base.analyzer.EutilsAnalyzer) – analyzer for Efetch results
Returns: analyzer instance or None if request errors have been encountered
Return type:
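A minimal sketch of using Efetcher directly, outside a Conduit pipeline. The tool name, e-mail and UID are placeholders, and the query parameters follow the standard Efetch E-Utility.

import entrezpy.efetch.efetcher

# Sketch only: tool name, e-mail and UID are placeholders.
ef = entrezpy.efetch.efetcher.Efetcher('mytool', 'myemail@example.com')
analyzer = ef.inquire({'db': 'nucleotide',
                       'id': [15718680],   # hypothetical UID
                       'retmode': 'text',
                       'rettype': 'fasta',
                       'retmax': 1})       # Efetch prefers to know how many UIDs to fetch
if analyzer is None:
  print('request errors encountered')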
EfetchParameter¶
entrezpy.efetch.efetch_parameter.
DEF_RETMODE
= 'xml'¶Default retmode for fetch requests
- class
entrezpy.efetch.efetch_parameter.
EfetchParameter
(param)¶Bases:
entrezpy.base.parameter.EutilsParameter
EfetchParameter implements checks and configures an Efetch query. A fetch query knows its size due to the id parameter or an earlier result stored on the Entrez History Server using WebEnv and query_key. The default retmode (fetch format) is set to XML because all E-Utilities can return XML but, unfortunately, not all can return JSON.
req_limits
= {'json': 500, 'text': 10000, 'xml': 10000}¶Maximum number of UIDs to fetch per request, by retmode
valid_retmodes
= {'gene': {'text', 'xml'}, 'nuccore': {'text', 'xml'}, 'pmc': {'xml'}, 'poset': {'text', 'xml'}, 'protein': {'text', 'xml'}, 'pubmed': {'text', 'xml'}, 'sequences': {'text', 'xml'}}¶Valid retmodes enforced by NCBI for fetch requests, by Entrez database
adjust_retmax
(retmax)¶Adjusts the retmax parameter. The order of checks is crucial.
Parameters: retmax (int) – retmax value Returns: adjusted retmax or None if all UIDs are fetched Return type: int or None
check_retmode
(retmode)¶Checks for a valid retmode and retmode/database combination
Parameters: retmode (str) – retmode parameter Returns: retmode Return type: str
adjust_reqsize
(reqsize)¶Adjusts request size for query
Parameters: reqsize (str or None) – Request size parameter Returns: adjusted request size Return type: int
calculate_expected_requests
(qsize=None, reqsize=None)¶Calculates and sets the expected number of requests. Uses internal parameters if none are provided.
Parameters:
- qsize (int or None) – query size, i.e. expected number of data sets
- reqsize (int) – number of data sets to fetch in one request
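To make the request limits above concrete, the number of expected requests can be pictured as a ceiling division of the query size by the request size; the exact calculation inside entrezpy is not reproduced here, so treat this as an illustration only.

import math

# Illustration only; req_limits values are taken from the listing above.
req_limits = {'json': 500, 'text': 10000, 'xml': 10000}
qsize = 25000                      # hypothetical number of UIDs to fetch
reqsize = req_limits['xml']        # 10000 UIDs per XML request
print(math.ceil(qsize / reqsize))  # 3 expected requests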
haveDb
()¶Check for required db parameter
Return type: bool
haveExpectedRequets
()¶Checks for expected requests. Hints at an error if no requests are expected.
Return type: bool
haveQuerykey
()¶Check for required QueryKey parameter
Return type: bool
haveWebenv
()¶Check for required WebEnv parameter
Return type: bool
useHistory
()¶Check if history server should be used.
Return type: bool
check
()¶Implements
entrezpy.base.parameter.EutilsParameter.check
to check for the minimum required parameters. Aborts if any check fails.
dump
()¶Dump instance attributes
Return type: dict Raises: NotImplementedError – if not implemented
EfetchAnalyzer¶
- class
entrezpy.efetch.efetch_analyzer.
EfetchAnalyzer
¶Bases:
entrezpy.base.analyzer.EutilsAnalyzer
EfetchAnalyzer implements a basic analysis of Efetch E-Utils responses. Stores results in an entrezpy.efetch.efetch_result.EfetchResult instance.
Note
This is a very superficial analyzer for documentation and educational purposes. In almost all cases a more specific analyzer has to be implemented by inheriting from entrezpy.base.analyzer.EutilsAnalyzer and implementing the virtual functions entrezpy.base.analyzer.EutilsAnalyzer.analyze_result() and entrezpy.base.analyzer.EutilsAnalyzer.analyze_error(); see the sketch below.
Variables: result – entrezpy.efetch.efetch_result.EfetchResult
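As noted above, real analyses usually require a custom analyzer, as sketched below. The sketch inherits from entrezpy.base.analyzer.EutilsAnalyzer and fills in the two virtual functions; the class name and the list used for storage are hypothetical, and depending on the use case init_result() may need to be implemented as well.

import entrezpy.base.analyzer

class FastaCollector(entrezpy.base.analyzer.EutilsAnalyzer):
  # Hypothetical analyzer sketch collecting raw text responses in a list.

  def __init__(self):
    super().__init__()
    self.sequences = []   # hypothetical storage, not part of entrezpy

  def analyze_result(self, response, request):
    # Assuming a text retmode (e.g. rettype 'fasta'), the converted
    # response is an io.StringIO instance (see convert_response() below).
    self.sequences.append(response.read())

  def analyze_error(self, response, request):
    print('error in request', request.get_request_id())

Such an analyzer can then be passed via the analyzer parameter of Efetcher.inquire() or Conduit.Pipeline.add_fetch().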
init_result
(response, request)¶Should be implemented if used properly
analyze_result
(response, request)¶Virtual function to handle responses, i.e. parsing them and preparing them for
entrezpy.base.result.EutilsResult
Parameters: response (dict or io.StringIO) – converted response from convert_response()
Raises: NotImplementedError – if implementation is missing
analyze_error
(response, request)¶Virtual function to handle error responses
Parameters: response (dict or io.StringIO) – converted response from convert_response()
Raises: NotImplementedError – if implementation is missing
norm_response
(response, rettype=None)¶Normalizes response for printing
Parameters: response (dict or io.StringIO) – efetch response Returns: str or dict
isEmpty
()¶Test for empty result
Return type: bool
check_error_json
(response)¶Checks for errors in JSON responses. Not unified among Eutil functions.
Parameters: response (dict) – response Returns: status if JSON response has error message Return type: bool
check_error_xml
(response)¶Checks for errors in XML responses
Parameters: response (io.StringIO) – XML response
Returns: if XML response has error message Return type: bool
convert_response
(raw_response_decoded, request)¶Converts raw_response into the expected format, deduced from request and set via the retmode parameter.
Parameters:
- raw_response (urllib.request.Request) – response from entrezpy.requester.requester.Requester
- request (entrezpy.base.request.EutilsRequest) – query request
Returns: response in parseable format
Return type: dict or io.StringIO
Note
Using threads without locks randomly loses the response, i.e. the raw response is emptied between requests. With locks it works, but threading is not much faster than not threading. JSON seems more prone to this than XML.
follow_up
()¶Return follow-up parameters if available
Returns: Follow-up parameters Return type: dict
get_result
()¶Return result
Returns: result instance Return type: entrezpy.base.result.EutilsResult
isErrorResponse
(response, request)¶Checks for error messages in responses from Entrez servers and sets the flag
hasErrorResponse
.
Parameters:
- response (dict or io.StringIO) – parseable response from convert_response()
- request (entrezpy.base.request.EutilsRequest) – query request
Returns: error status
Return type: bool
isSuccess
()¶Tests if the response is free of errors
Return type: bool
known_fmts
= {'json', 'text', 'xml'}¶
parse
(raw_response, request)¶Checks for errors and calls the parser for the raw response.
Parameters:
- raw_response (urllib.request.Request) – response from entrezpy.requester.requester.Requester
- request (entrezpy.base.request.EutilsRequest) – query request
Raises: NotImplementedError – if request format is not in
EutilsAnalyzer.known_fmts
EfetchRequest¶
- class
entrezpy.efetch.efetch_request.
EfetchRequest
(eutil, parameter, start, size)¶Bases:
entrezpy.base.request.EutilsRequest
The EfetchRequest class implements a single request as part of an Efetch query. It stores and prepares the parameters for a single request.
entrezpy.efetch.efetcher.Efetcher.inquire()
calculates start and size for a single request.
Parameters:
- parameter – request parameter
- type –
entrezpy.efetch.efetch_parameter.EfetchParameter
- start (int) – number of first UID to fetch
- size (int) – request size
get_post_parameter
()¶Virtual function returning the POST parameters for the request from required attributes.
Return type: dict Raises: NotImplementedError –
dump
()¶Dumps instance attributes
calc_duration
()¶Calculates request duration
dump_internals
(extend=None)¶Dumps internal attributes for request.
Parameters: extend (dict) – extend dump with additional information
get_request_id
()¶
Returns: full request id Return type: str
prepare_base_qry
(extend=None)¶Returns instance attributes required for every POST request.
Parameters: extend (dict) – parameters extending basic parameters Returns: base parameters for POST request Return type: dict
report_status
(processed_requests=None, expected_requests=None)¶Reports request status when triggered
set_request_error
(error)¶Sets request error and HTTP/URL error message
Parameters: error (str) – HTTP/URL error
set_status_fail
()¶Set status if request failed
set_status_success
()¶Set status if request succeeded
start_stopwatch
()¶Starts the timer to measure request duration.
Requester module¶
Requester¶
- class
entrezpy.requester.requester.
Requester
(wait, max_retries=9, init_timeout=10, timeout_max=60, timeout_step=5)¶Requester implements the sending of HTTP POST requests and the receiving of the results. It checks for request connection errors and performs retries when possible. If the maximum number of retries is reached, the request is considered failed. In case of connection errors, the request is aborted if the error is not due to a timeout. The initial timeout is increased in steps until the maximum timeout has been reached.
Parameters:
- wait (float) – wait time in seconds between requests
- max_retries (int) – number of retries before giving up.
- init_timeout (int) – number of seconds before the initial request is considered a timeout error
- timeout_max (int) – maximum request timeout before giving up
- timeout_step (int) – value by which the timeout is increased after a timeout error
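The described retry behaviour can be pictured as a simple back-off loop. The sketch below only illustrates that description (the send callable and the structure are assumptions); it is not the actual Requester implementation.

# Illustration of the described retry behaviour, not the actual implementation.
def request_with_retries(send, max_retries=9, init_timeout=10,
                         timeout_max=60, timeout_step=5):
  timeout = init_timeout
  for _ in range(max_retries + 1):
    try:
      return send(timeout)      # hypothetical callable performing the HTTP POST
    except TimeoutError:        # non-timeout connection errors would abort instead
      if timeout >= timeout_max:
        break                   # maximum timeout reached: give up
      timeout += timeout_step   # increase the timeout in steps
  return None                   # request is considered failed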
request
(req)¶Sends the given request.
Parameters: req ( entrezpy.base.request.EutilsRequest
) – entrezpy request
run_one_request
(request, monitor)¶Processes one request from the queue and logs its progress.
Parameters: request ( entrezpy.base.request.EutilsRequest
) – single entrezpy request
Conduit module¶
Conduit¶
- class
entrezpy.conduit.
Conduit
(email, apikey=None, apikey_envar=None, threads=None)¶Conduit simplifies creating pipelines and queries for entrezpy. Conduit stores results from previous requests, allowing queries to be concatenated and obtained results to be retrieved later if required, reducing the need to re-download data. Conduit can use multiple threads to speed up downloads, but some external libraries, e.g. SQLite3, can break when threading is used.
Query instances in pipelines of Conduit.Pipeline are stored in the dictionary Conduit.queries with the query id as key and are accessible by all Conduit instances. A single Conduit.Pipeline stores only the query ids for its own queries.
Parameters:
- email (str) – user email
- apikey (str) – NCBI apikey
- apikey_envar (str) – environment variable storing the NCBI apikey
- threads (int) – set threads for multithreading
queries
= {}¶Query storage
analyzers
= {}¶Analyzed query storage
- class
Query
(function, parameter, dependency=None, analyzer=None)¶Entrezpy query for a Conduit pipeline. Conduit assembles pipelines using several Query() instances. If a dependency is given, its parameters are used as a basis via resolve_dependency().
Parameters:
- function (str) – Eutils function
- parameter (dict) – function parameters
- dependency (str) – query id from earlier query
- analyzer (
entrezpy.base.analyzer.EutilsAnalyzer
) – analyzer instance for this query
resolve_dependency
()¶Resolves dependencies to obtain parameters from an earlier query. Parameters passed to this instance overwrite dependency parameters.
dump
()¶
- class
Pipeline
¶The Pipeline class implements a query pipeline with several consecutive queries. New pipelines are obtained through Conduit.new_pipeline(). Query instances are stored in Conduit.queries and the corresponding query ids in queries. Every added query returns its id, which can be used to retrieve it later.
Variables: queries – queries for this Pipeline instance
add_search
(parameter=None, dependency=None, analyzer=None)¶Adds Esearch query
Parameters:
- parameter (dict) – Esearch E-Eutility parameters
- dependency (str) – query id from earlier query
- analyzer (entrezpy.base.analyzer.EutilsAnalyzer) – analyzer for this query
Returns: Conduit query
Return type:
ConduitQuery
add_link
(parameter=None, dependency=None, analyzer=None)¶Adds Elink query. Signature as
Conduit.Pipeline.add_search()
add_post
(parameter=None, dependency=None, analyzer=None)¶Adds Epost query. Signature as
Conduit.Pipeline.add_search()
add_summary
(parameter=None, dependency=None, analyzer=None)¶Adds Esummary query. Signature as
Conduit.Pipeline.add_search()
add_fetch
(parameter=None, dependency=None, analyzer=None)¶Adds Efetch query. Same signature as
Conduit.Pipeline.add_search()
but analyzer is required as this step obtains highly variable results.
add_query
(query)¶Adds query to own pipeline and storage
Parameters: query (Conduit.Query) – Conduit query
Returns: query id of added query Return type: str
run
(pipeline)¶Runs the queries in a pipeline one after another and checks for errors. If errors are encountered, the pipeline aborts.
Parameters: pipeline ( Conduit.Pipeline
) – Conduit pipeline
check_query
(query)¶Check for successful query.
Parameters: query ( Conduit.Query
) – Conduit query
get_result
(query_id)¶Returns a stored result from a previous run.
Parameters: query_id (str) – query id Returns: Result from this query Return type: entrezpy.base.result.EutilsResult
new_pipeline
()¶Returns a new Conduit pipeline.
Returns: Conduit pipeline Return type: Conduit.Pipeline
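A small pipeline sketch chaining a search to a summary via the dependency mechanism and retrieving the stored search result afterwards. The e-mail and query parameters are placeholders, and it is assumed, as the dependency parameter above suggests, that add_search() returns the query id.

import entrezpy.conduit

# Sketch only: e-mail and query parameters are placeholders.
c = entrezpy.conduit.Conduit('myemail@example.com')
pipe = c.new_pipeline()
sid = pipe.add_search({'db': 'pubmed', 'term': 'entrez[title]', 'retmax': 5})
pipe.add_summary(dependency=sid)  # reuses the search result via resolve_dependency()
c.run(pipe)

# Results from earlier queries remain stored in Conduit and can be
# retrieved by query id, e.g. the Esearch result of the first step.
search_result = c.get_result(sid)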
search
(query, analyzer=<class 'entrezpy.esearch.esearch_analyzer.EsearchAnalyzer'>)¶Configures and runs an Esearch query. Analyzers are passed as class references and instantiated here.
Parameters:
- query (Conduit.Query) – Conduit Query
- analyzer – reference to analyzer class
Returns: analyzer
Return type:
summarize
(query, analyzer=<class 'entrezpy.esummary.esummary_analyzer.EsummaryAnalyzer'>)¶Configures and runs an Esummary query. Analyzers are passed as class references and instantiated here.
Parameters:
- query (Conduit.Query) – Conduit Query
- analyzer – reference to analyzer class
Returns: analyzer
Return type:
entrezpy.esummary.esummary_analyzer.EsummaryAnalyzer
link
(query, analyzer=<class 'entrezpy.elink.elink_analyzer.ElinkAnalyzer'>)¶Configures and runs an Elink query. Analyzers are passed as class references and instantiated here.
Parameters:
- query (Conduit.Query) – Conduit Query
- analyzer – reference to analyzer class
Returns: analyzer
Return type:
post
(query, analyzer=<class 'entrezpy.epost.epost_analyzer.EpostAnalyzer'>)¶Configures and runs an Epost query. Analyzers are passed as class references and instantiated here.
Parameters:
- query (Conduit.Query) – Conduit Query
- analyzer – reference to analyzer class
Returns: analyzer
Return type:
fetch
(query, analyzer=<class 'entrezpy.efetch.efetch_analyzer.EfetchAnalyzer'>)¶Runs an Efetch query. The analyzer needs to be added to the query.
Parameters:
- query (Conduit.Query) – Conduit Query
- analyzer – reference to analyzer class
Returns: analyzer
Return type:
Glossary¶
- NCBI
- National Center for Biotechnology Information, https://www.ncbi.nlm.nih.gov
- E-Utilities
- E-Utility
- Collection of NCBI tools handling queries to Entrez
- Entrez
- NCBI database servers storing biomedical data and literature
- UID
- UIDs
- Document identifier unique within one Entrez database
- source database
- The database from which UIDs are linked
- target database
- The database to which UIDs are linked
- WebEnv
- String referencing an E-Utility query
- querykey
- query_key
- Number referencing a specific request for a WebEnv
- Entrezpy query
- Entrezpy queries
- entrezpy query
- entrezpy queries
- A query to one E-Utility function in entrezpy is considered one query, which can have several entrezpy requests.
- Entrezpy request
- Entrezpy requests
- entrezpy request
- entrezpy requests
- One request as part of an entrezpy query.