Entrezpy: NCBI Entrez databases at your fingertips


Synopsis

$ pip install entrezpy --user
>>> import entrezpy.conduit
>>> c = entrezpy.conduit.Conduit('myemail')
>>> fetch_influenza = c.new_pipeline()
>>> sid = fetch_influenza.add_search({'db' : 'nucleotide', 'term' : 'H3N2 [organism] AND HA', 'rettype':'count', 'sort' : 'Date Released', 'mindate': 2000, 'maxdate':2019, 'datetype' : 'pdat'})
>>> fid = fetch_influenza.add_fetch({'retmax' : 10, 'retmode' : 'text', 'rettype': 'fasta'}, dependency=sid)
>>> c.run(fetch_influenza)

Entrezpy is a dedicated Python library to interact with NCBI Entrez databases [Entrez2016] via the E-Utilities [Sayers2018]. Entrezpy facilitates the implementation of queries to search or download data from the Entrez databases, e.g. searching for specific sequences or publications or fetching your favorite genome. For more complex queries, entrezpy offers the class entrezpy.conduit.Conduit to run query pipelines or reuse previous queries.

Supported E-Utility functions:

Source code

git clone https://gitlab.com/ncbipy/entrezpy.git

Contact

To report bugs and/or errors, please open an issue at https://gitlab.com/ncbipy/entrezpy or contact me at: jan.buchmann@sydney.edu.au

Of course, feel free to fork the code, improve it, and/or open a pull request.

NCBI API key

NCBI offers API keys that allow more requests per second. For more details and the rationale, see [Sayers2018]. entrezpy checks for an NCBI API key as follows:

  • The NCBI API key can be passed as a parameter to entrezpy classes
  • Entrezpy checks for the environment variable $NCBI_API_KEY
  • The name of an environment variable storing the key, e.g. NCBI_API_KEY, can be passed via the apikey_var parameter to any class derived from entrezpy.base.query.EutilsQuery (see the sketch below)
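
A minimal sketch of the three options, using the Esearcher class whose signature (tool, email, apikey=None, apikey_var=None, ...) is shown later in this manual; the key value and the variable name MY_NCBI_KEY are placeholders:

$ export NCBI_API_KEY=0123456789abcdef

import entrezpy.esearch.esearcher

# Option 1: pass the key directly as a parameter
e = entrezpy.esearch.esearcher.Esearcher('mytool', 'my@email', apikey='0123456789abcdef')

# Option 2: rely on the environment variable $NCBI_API_KEY (nothing to pass)
e = entrezpy.esearch.esearcher.Esearcher('mytool', 'my@email')

# Option 3: point entrezpy to a differently named environment variable holding the key
e = entrezpy.esearch.esearcher.Esearcher('mytool', 'my@email', apikey_var='MY_NCBI_KEY')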

Work in progress

  • Easier logging configuration via file
  • Simplify Elink results
  • Cleaner testing setup
  • Request status indication

Manual

Installation

entrezpy can be installed or included in your own pipeline using one of two approaches: PyPi or Append to sys.path.

Requirements

  • Python version >= 3.6

  • Python Standard Library :

    The standard library is installed together with Python. For reference, the following modules from the Python Standard Library are required:

    • base64
    • io
    • json
    • logging
    • math
    • os
    • queue
    • random
    • socket
    • sys
    • threading
    • time
    • urllib
    • uuid
    • xml.etree.ElementTree

Test your Python version

Test if at least Python 3.6 is installed:

$ python
>>> import sys
>>> sys.version_info
sys.version_info(major=3, minor=6, micro=6, releaselevel='final', serial=0)
                       ^        ^

PyPi

Install entrezpy via PyPi and check:

$ pip install entrezpy --user

Test if we can import entrezpy:

$ python
>>> import entrezpy

Append to sys.path

Add entrezpy to your pipeline via sys.path. This requires cloning the source code and adjusting sys.path.

Assuming the following directory structure, where entrezpy was cloned into include:

$ git clone https://gitlab.com/ncbipy/entrezpy.git project_root/include

project_root
|
|-- src
|   `-- pipeline.py
`-- include
    `-- entrezpy
        `-- src
            `-- entrezpy
                `-- efetch

Import the module efetcher in pipeline.py by adjusting sys.path in project_root/src/pipeline.py:

import os
import sys

sys.path.insert(1, os.path.join(sys.path[0], '../include/entrezpy/src'))
import entrezpy.efetch.efetcher

ef = entrezpy.efetch.efetcher.Efetcher('toolname', 'email')

Test entrezpy

Run the examples in the git repository in entrezpy/examples, e.g:

$ ./path/to/entrezpy/examples/entrezpy-example.elink.py --email you@email

To adjust the examples for testing an installation via PyPi, remove the sys.path line from the examples prior to invoking them, e.g.:

for i in entrezpy/examples/*.py; do                 \
  fname=$(basename $i | sed 's/\.py/\.adjust.py/'); \
  sed '/sys.path.insert/d' $i > $fname;             \
  chmod +x $fname;                                  \
done;

The examples print the results to standard output and additional information to standard error. We currently recommend running the examples with standard error redirected to a file. For example, to test efetch, run examples/entrezpy-example.efetch.py as follows:

./examples/entrezpy-example.efetch.py --email you@email 2> efetch.stderr

efetch.stderr can be monitored as follows:

tail -f efetch.stderr

Entrezpy tutorials

Esearch

Esearch searches the specified Entrez database for data records matching the query. It can return either the found UIDs or a WebEnv/query_key pair referencing the UIDs on the History server.

Esearch returning UIDs

Search the nucleotide database for virus sequences and fetch the first 110,000 UIDs.

  1. Create an Esearcher instance
  2. Run the query and store the analyzer
  3. Print the fetched UIDs
1  import entrezpy.esearch.esearcher
2
3  e = entrezpy.esearch.esearcher.Esearcher('esearcher', 'email')
4  a = e.inquire({'db':'nucleotide', 'term':'viruses[orgn]', 'retmax': 110000, 'rettype': 'uilist'})
5  print(a.get_result().uids)

Line 1: Import the esearcher module

Line 3: Instantiate an Esearcher instance with the required parameters tool (using ‘esearcher’) and email

Line 4: Run the query to search the database nucleotide, using the term viruses[orgn], limit the result to the first 110,000 UIDs, and request UIDs. Store the returned default analyzer in a.

Line 5: Print the fetched UIDs

Esearch returning a History server reference to UIDs

The same example as above, but WebEnv and query_key are returned in place of UIDs. By default, entrezpy uses the History server (setting the POST parameter usehistory=y), so this does not need to be passed as a parameter explicitly.

1  import entrezpy.esearch.esearcher
2
3  e = entrezpy.esearch.esearcher.Esearcher('esearcher', 'email')
4  a = e.inquire({'db':'nucleotide', 'term':'viruses[orgn]', 'retmax': 110000})
5  print(a.size())
6  print(a.reference().webenv, a.reference().querykey)

Line 1: Import the esearcher module

Line 3: Instantiate an Esearcher instance with the required parameters tool (using ‘esearcher’) and email

Line 4: Run the query to search the database nucleotide, using the term viruses[orgn] and limit the result to the first 110,000 UIDs. Store the returned default analyzer in a

Line 5: Print the number of fetched UIDs, which should be 0

Line 6: Print the WebEnv and query_key

Conduit

The Conduit module facilitates creating pipelines that link individual E-Utility requests, e.g. linking the results of an Esearch to the corresponding nucleotide data records.

Conduit pipelines

Conduit pipelines store a sequence of E-Utility queries. Let’s create a simple Conduit pipeline to fetch virus nucleotide sequences. This requires (i) searching the nucleotide database, which returns the found UIDs (data records), and (ii) fetching those UIDs.

  1. The first step in the pipeline is to search the Entrez nucleotide database for virus sequences (Line 6). We add a search query to the pipeline and store its id for later use. We set the parameter rettype to count to avoid downloading the UIDs. The result will tell us how many UIDs were found and includes a reference to the Entrez History server which we can use later to fetch the sequences.
  2. The second step in our pipeline is the actual download of the found sequences. We add a fetch step to our pipeline and use the id of the search step as its dependency. Conduit will automatically set the ‘db’, ‘WebEnv’ and ‘query_key’ parameters for the fetch step. In addition, we specify that we want the sequences as FASTA in text format.
  3. The last step is to run the queries in the pipeline. This is done by passing the pipeline to Conduit’s run method, which requests the queries. If no request errors have occurred, Conduit returns the default analyzer for this type of query. Since this uses the default Efetch analyzer, the results are just printed to standard output.
 1  import entrezpy.conduit
 2
 3  w = entrezpy.conduit.Conduit('email')
 4  get_sequences = w.new_pipeline()
 5
 6  sid = get_sequences.add_search({'db' : 'nucleotide', 'term' : 'viruses[Organism]', 'rettype' : 'count'})
 7
 8  get_sequences.add_fetch({'retmode' : 'text', 'rettype' : 'fasta'}, dependency=sid)
 9
10  analyzer = w.run(get_sequences)

Line 1: Import the conduit module

Line 3: Create a Conduit instance with the required email address

Line 4: Create a new pipeline and store it in get_sequences

Line 6: Add the search query to the pipeline and store its id in sid

Line 8: Add the fetch query to the pipeline

Line 10: Run the pipeline and store the resulting analyzer

Linking within and between Entrez databases

Using multiple links in a Conduit pipeline requires running an Esearch after each Elink step to keep track of the proper UIDs. This is a quirk of the E-Utilities (EDirect uses the same trick).

  1. Search the PubMed Entrez database
  2. Increase the number of possible UIDs by searching PubMed again, using the first UIDs to find publications linked to the initial search
  3. Link the PubMed UIDs to nuccore UIDs
  4. Fetch the found UIDs from nuccore

The following code shows how to use multiple links within a Conduit pipeline.

 1  import entrezpy.conduit
 2
 3  w = entrezpy.conduit.Conduit(args.email)
 4  find_genomes = w.new_pipeline()
 5
 6  sid = find_genomes.add_search({'db':'pubmed', 'term' : 'capsid AND infection', 'rettype':'count'})
 7
 8  lid1 = find_genomes.add_link({'cmd':'neighbor_history', 'db':'pubmed'}, dependency=sid)
 9  lid1 = find_genomes.add_search({'rettype': 'count', 'cmd':'neighbor_history'}, dependency=lid1)
10
11  lid2 = find_genomes.add_link({'db':'nuccore', 'cmd':'neighbor_history'}, dependency=lid1)
12  lid2 = find_genomes.add_search({'rettype': 'count', 'cmd':'neighbor_history'}, dependency=lid2)
13
14  find_genomes.add_fetch({'retmode':'xml', 'rettype':'fasta'}, dependency=lid2)
15  a = w.run(find_genomes)

Lines 1 - 4: Analogous to the steps shown in Conduit pipelines

Line 6: Add a search query in the Entrez database pubmed to the Conduit pipeline without downloading UIDs and store it in sid

Line 8: Add a link query to the Conduit pipeline to link the UIDs found in search sid within pubmed and store the result on the History server. Store the query in lid1

Line 9: Update the link results for later use and store them on the History server. Overwrite lid1 with the updated query.

Line 11: Link the pubmed UIDs to nuccore and store the result on the History server. Store the query in lid2.

Line 12: Update the link results for later use and store them on the History server. Overwrite lid2 with the updated query.

Line 14: Add a fetch step to the Conduit pipeline with the last link result as dependency. Request the data as FASTA sequences in XML format (TinySeq XML).

Line 15: Run the pipeline.

Extending entrezpy

entrezpy can be extended by inheriting its base classes. This is typically the case when the final step is to fetch data records and do something with them, e.g. processing them for a database or parsing them for specific information.

Fetching publication information from Entrez

Prerequisites

Acknowledgment

I’d like to thank Pedram Hosseini (pdr[dot]hosseini[at]gmail[dot]com) for pointing out the requirement for this tutorial.

Overview

This tutorial explains how to write a simple PubMed data record fetcher using entrezpy.conduit.Conduit and by adjusting entrezpy.base.result.EutilsResult and entrezpy.base.analyzer.EutilsAnalyzer.

Outline

The Efetch Entrez Utility is NCBI’s utility responsible for fetching data records. Its manual lists all possible databases and which records (Record type) can be fetched in which format. For the first example, we’ll fetch PubMed data in XML, specifically the UID, authors, title, abstract, and citations. We will test and develop the pipeline using the article with PubMed ID (PMID) 26378223 because it has all the required fields. In the end we will see that not all fields are always present.

In entrezpy, a result is the sum of all individual requests required to answer the whole query. If you want to analyze the number of citations for a specific author, the result is the number of citations obtained through the query. To obtain the final number, you have to parse several PubMed records. Therefore, entrezpy requires a result class derived from entrezpy.base.result.EutilsResult to store the partial results obtained from a query.

A quick note on virtual functions

entrezpy is heavily based on virtual methods [1]. A virtual method is declared in the base class but implemented in the derived class. Every class inheriting the base class has to implement the virtual methods using the same signature and return the same result type as the base class. To implement the method in the inherited class, look up the method in the base class documentation. A minimal illustration is sketched below.
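
A minimal, generic illustration of the pattern (Base and Derived are illustrative names only, not entrezpy classes):

class Base:

  def size(self):
    """Virtual method: declared in the base class, must be implemented by derived classes."""
    raise NotImplementedError

class Derived(Base):

  def size(self):
    """Implements the virtual method with the same signature and return type (an int)."""
    return 0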

PubMed data structure

Before we start writing our implementation, we need to understand the structure of the received data. This can be done using the EDirect tools from NCBI. The result is printed to standard output. For examination, it can either be stored in a file or, preferably, piped to a pager, e.g. less [2] or more [3]. These are usually installed on most *NIX systems.

Fetching PubMed data record for PMID 26378223 using EDirect’s efetch
$ efetch -db pubmed -id 26378223 -mode XML | less

The entry should start and end as shown in Listing 2.

XML PubMed data record for publication PMID26378223. Data not related to authors, abstract, title, and references has been removed for clarity.
<?xml version="1.0" ?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2019//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">
<PubmedArticleSet>
<PubmedArticle>
    <!-- SKIPPED DATA -->
        <Article PubModel="Print">
            <!-- SKIPPED DATA -->
            <ArticleTitle>Cell Walls and the Convergent Evolution of the Viral Envelope.</ArticleTitle>
            <!-- SKIPPED DATA -->
            <Abstract>
                <AbstractText>Why some viruses are enveloped while others lack an outer lipid bilayer is a major question in viral evolution but one that has received relatively little attention. The viral envelope serves several functions, including protecting the RNA or DNA molecule(s), evading recognition by the immune system, and facilitating virus entry. Despite these commonalities, viral envelopes come in a wide variety of shapes and configurations. The evolution of the viral envelope is made more puzzling by the fact that nonenveloped viruses are able to infect a diverse range of hosts across the tree of life. We reviewed the entry, transmission, and exit pathways of all (101) viral families on the 2013 International Committee on Taxonomy of Viruses (ICTV) list. By doing this, we revealed a strong association between the lack of a viral envelope and the presence of a cell wall in the hosts these viruses infect. We were able to propose a new hypothesis for the existence of enveloped and nonenveloped viruses, in which the latter represent an adaptation to cells surrounded by a cell wall, while the former are an adaptation to animal cells where cell walls are absent. In particular, cell walls inhibit viral entry and exit, as well as viral transport within an organism, all of which are critical waypoints for successful infection and spread. Finally, we discuss how this new model for the origin of the viral envelope impacts our overall understanding of virus evolution. </AbstractText>
                <CopyrightInformation>Copyright © 2015, American Society for Microbiology. All Rights Reserved.</CopyrightInformation>
            </Abstract>
            <AuthorList CompleteYN="Y">
                <Author ValidYN="Y">
                    <LastName>Buchmann</LastName>
                    <ForeName>Jan P</ForeName>
                    <Initials>JP</Initials>
                    <AffiliationInfo>
                        <Affiliation>Marie Bashir Institute for Infectious Diseases and Biosecurity, Charles Perkins Centre, School of Biological Sciences, and Sydney Medical School, The University of Sydney, Sydney, New South Wales, Australia.</Affiliation>
                    </AffiliationInfo>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Holmes</LastName>
                    <ForeName>Edward C</ForeName>
                    <Initials>EC</Initials>
                    <AffiliationInfo>
                        <Affiliation>Marie Bashir Institute for Infectious Diseases and Biosecurity, Charles Perkins Centre, School of Biological Sciences, and Sydney Medical School, The University of Sydney, Sydney, New South Wales, Australia edward.holmes@sydney.edu.au.</Affiliation>
                    </AffiliationInfo>
                </Author>
            </AuthorList>
            <!-- SKIPPED DATA -->
        </Article>
        <!-- SKIPPED DATA -->
        <ReferenceList>
            <Reference>
                <Citation>Nature. 2014 Jan 16;505(7483):432-5</Citation>
                <ArticleIdList>
                    <ArticleId IdType="pubmed">24336205</ArticleId>
                </ArticleIdList>
            </Reference>
            <Reference>
                <Citation>Crit Rev Microbiol. 1988;15(4):339-89</Citation>
                <ArticleIdList>
                    <ArticleId IdType="pubmed">3060317</ArticleId>
                </ArticleIdList>
            </Reference>
            <!-- SKIPPED DATA -->
        </ReferenceList>
    </PubmedData>
</PubmedArticle>

</PubmedArticleSet>

This shows us the XML fields, specifically the tags, present in a typical PubMed record. The root tag for each batch of fetched data records is <PubmedArticleSet> and each individual data record is described in the nested tags <PubmedArticle>. We are interested in the following tags nested within <PubmedArticle>:

  • <ArticleTitle>
  • <Abstract>
  • <AuthorList>
  • <ReferenceList>
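
Before wiring the parsing into entrezpy, the tags of interest can be checked with a few lines of xml.etree.ElementTree.iterparse; a minimal standalone sketch, assuming the record fetched above was saved to the hypothetical local file pubmed-26378223.xml:

import xml.etree.ElementTree

# Iterate over the XML incrementally and print two of the tags of interest
for event, elem in xml.etree.ElementTree.iterparse('pubmed-26378223.xml', events=['start', 'end']):
  if event == 'end' and elem.tag == 'ArticleTitle':
    print('Title:', elem.text)
  if event == 'end' and elem.tag == 'LastName':
    print('Author last name:', elem.text)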

The first step is to write a program to fetch the requested records. This can be done using the entrezpy.conduit.Conduit class.

Simple Conduit pipeline to fetch PubMed Records

We will write a simple entrezpy pipeline named pubmed-fetcher.py using entrezpy.conduit.Conduit to test and run our implementations. A simple entrezpy.conduit.Conduit pipeline requires two arguments:

  • user email
  • PMID (here 15430309)
Basic entrezpy.conduit.Conduit pipeline to fetch PubMed data records. The required arguments are positional arguments given at the command line.
#!/usr/bin/env python3


import os
import sys


"""
If entrezpy is installed using PyPi, uncomment the line 'import entrezpy' and
comment out the 'sys.path.insert(...)' line.
"""
# import entrezpy
sys.path.insert(1, os.path.join(sys.path[0], '../../../src'))
# Import the required entrezpy modules
import entrezpy.conduit


def main():
  c = entrezpy.conduit.Conduit(sys.argv[1])
  fetch_pubmed = c.new_pipeline()
  fetch_pubmed.add_fetch({'db':'pubmed', 'id':[sys.argv[2]], 'retmode':'xml'})
  c.run(fetch_pubmed)
  return 0

if __name__ == '__main__':
  main()

Let’s test this program to see if all modules are found and Conduit works.

$ python pubmed-fetcher.py your@email 15430309

Since we didn’t specify an analyzer yet, we expect the raw XML output to be printed to standard output. So far, this produces the same output as Listing 1.

If this command fails and/or no output is printed to the standard output, something went wrong. Possible issues may include no internet connection, wrongly installed entrezpy, wrong import statements, or bad permissions.

If everything went smoothly, we have written a basic but working pipeline to fetch PubMed data from NCBI’s Entrez database. We can now start to implement our specific entrezpy.base.result.EutilsResult and entrezpy.base.analyzer.EutilsAnalyzer classes. However, before we implement these classes, we need to decide how we want to store a PubMed data record.

How to store PubMed data records

The data records can be stored in different ways, but using a class facilitates collecting and retrieving the requested data. We implement a simple class (analogous to a C/C++ struct [4]) to represent a PubMed record.

Implementing a PubMed data record
class PubmedRecord:
  """Simple data class to store individual Pubmed records. Individual authors will
  be stored as dict('lname':last_name, 'fname': first_name) in authors.
  Citations are stored as string elements in the list references."""

  def __init__(self):
    self.pmid = None
    self.title = None
    self.abstract = None
    self.authors = []
    self.references = []

Further, we will use the dict pubmed_records as an attribute of PubmedResult to store PubmedRecord instances, using the PMID as key to avoid duplicates.

Defining PubmedResult and PubmedAnalyzer

From the documentation or publication, we know that entrezpy.base.analyzer.EutilsAnalyzer parses responses and stores results in entrezpy.base.result.EutilsResult. Therefore, we need to derive and adjust these classes for our PubmedResult and PubmedAnalyzer classes. We will add these classes to our program pubmed-fetcher.py. The documentation tells us the required parameters for each class and the virtual methods we need to implement.

Implement PubmedResult

We have to extend the virtual methods declared in entrezpy.base.result.EutilsResult. The documentation informs us about the required parameters and expected return values.

In addition, we declare the method PubmedResult.add_pubmed_record() to handle adding new PubMed data record instances as defined in Listing 4. The PubmedResult methods in this tutorial are trivial and we can implement the class in one go.

Implementing PubmedResult
class PubmedResult(entrezpy.base.result.EutilsResult):
  """Derive class entrezpy.base.result.EutilsResult to store Pubmed queries.
  Individual Pubmed records are implemented in :class:`PubmedRecord` and
  stored in :ivar:`pubmed_records`.

  :param response: inspected response from :class:`PubmedAnalyzer`
  :param request: the request for the current response
  :ivar dict pubmed_records: storing PubmedRecord instances"""

  def __init__(self, response, request):
    super().__init__(request.eutil, request.query_id, request.db)
    self.pubmed_records = {}

  def size(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.size`
    returning the number of stored data records."""
    return len(self.pubmed_records)

  def isEmpty(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.isEmpty`
    to query if any records have been stored at all."""
    if not self.pubmed_records:
      return True
    return False

  def get_link_parameter(self, reqnum=0):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.get_link_parameter`.
    Fetching a pubmed record has no intrinsic elink capabilities and therefore
    should inform users about this."""
    print("{} has no elink capability".format(self))
    return {}

  def dump(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.dump`.

    :return: instance attributes
    :rtype: dict
    """
    return {self:{'dump':{'pubmed_records':[x for x in self.pubmed_records],
                              'query_id': self.query_id, 'db':self.db,
                              'eutil':self.function}}}

  def add_pubmed_record(self, pubmed_record):
    """The only non-virtual and therefore PubmedResult-specific method to handle
    adding new data records"""
    self.pubmed_records[pubmed_record.pmid] = pubmed_record

Note

Linking PubMed records for subsequent searches is better handled by a pipeline performing esearch queries followed by elink queries and a final efetch query. The fetch result for PubMed records has no WebEnv value and is missing the originating database, since efetch is usually the last query within a series of E-Utility queries. You can test this using the following EDirect pipeline:

$ efetch -db pubmed -id 20148030 | elink -target nuccore

Therefore, we implement a warning informing the user that linking is not possible. Nevertheless, the method could return any parsed information, e.g. nucleotide UIDs, to be used as parameters for a subsequent fetch. However, some features could not be used, e.g. the Entrez History server.

Implementing PubmedAnalyzer

We have to extend the virtual methods declared in entrezpy.base.analyzer.EutilsAnalyzer. The documentation informs us about the required parameters and expected return values.

Implementing PubmedAnalyzer
class PubmedAnalyzer(entrezpy.base.analyzer.EutilsAnalyzer):
  """Derived class of :class:`entrezpy.base.analyzer.EutilsAnalyzer` to analyze and
  parse PubMed responses and requests."""

  def __init__(self):
    super().__init__()

  def init_result(self, response, request):
    """Implemented virtual method :meth:`entrezpy.base.analyzer.init_result`.
    This method initiates a result instance when analyzing the first response"""
    if self.result is None:
      self.result = PubmedResult(response, request)

  def analyze_error(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_error`. Since
    we expect XML errors, just print the error to STDOUT for
    logging/debugging."""
    print(json.dumps({__name__:{'Response': {'dump' : request.dump(),
                                             'error' : response.getvalue()}}}))

  def analyze_result(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_result`.
    Parse PubMed  XML line by line to extract authors and citations.
    xml.etree.ElementTree.iterparse
    (https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse)
    reads the XML file incrementally. Each  <PubmedArticle> is cleared after processing.

    ..note::  Adjust this method to include more/different tags to extract.
              Remember to adjust :class:`.PubmedRecord` as well."""
    self.init_result(response, request)
    isAuthorList = False
    isAuthor = False
    isRefList = False
    isRef = False
    isArticle = False
    medrec = None
    for event, elem in xml.etree.ElementTree.iterparse(response, events=["start", "end"]):
      if event == 'start':
        if elem.tag == 'PubmedArticle':
          medrec = PubmedRecord()
        if elem.tag == 'AuthorList':
          isAuthorList = True
        if isAuthorList and elem.tag == 'Author':
          isAuthor = True
          medrec.authors.append({'fname': None, 'lname': None})
        if elem.tag == 'ReferenceList':
          isRefList = True
        if isRefList and elem.tag == 'Reference':
          isRef = True
        if elem.tag == 'Article':
          isArticle = True
      else:
        if elem.tag == 'PubmedArticle':
          self.result.add_pubmed_record(medrec)
          elem.clear()
        if elem.tag == 'AuthorList':
          isAuthorList = False
        if isAuthorList and elem.tag == 'Author':
          isAuthor = False
        if elem.tag == 'ReferenceList':
          isRefList = False
        if elem.tag == 'Reference':
          isRef = False
        if elem.tag == 'Article':
          isArticle = False
        if elem.tag == 'PMID':
          medrec.pmid = elem.text.strip()
        if isAuthor and elem.tag == 'LastName':
          medrec.authors[-1]['lname'] = elem.text.strip()
        if isAuthor and elem.tag == 'ForeName':
          medrec.authors[-1]['fname'] = elem.text.strip()
        if isRef and elem.tag == 'Citation':
          medrec.references.append(elem.text.strip())
        if isArticle and elem.tag == 'AbstractText':
          if not medrec.abstract:
            medrec.abstract = elem.text.strip()
          else:
            medrec.abstract += elem.text.strip()
        if isArticle and elem.tag == 'ArticleTitle':
          medrec.title = elem.text.strip()

The XML parser is the critical, and probably most complex, piece to implement. However, if you want to parse your Entrez results, you need to develop a parser anyway. If you already have a parser, you can use an object composition approach [5], as sketched below. Further, you can add a method call in analyze_result to store the processed data in a database or to implement checkpoints.
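
A minimal sketch of such a composition approach, reusing the PubmedResult class from above; MyPubmedParser is a hypothetical pre-existing parser with a parse() method yielding PubmedRecord instances and is not part of entrezpy:

class ComposedAnalyzer(entrezpy.base.analyzer.EutilsAnalyzer):
  """Analyzer that delegates the XML parsing to an existing parser instance."""

  def __init__(self, parser):
    super().__init__()
    self.parser = parser  # composition: reuse an existing parser, e.g. MyPubmedParser()

  def init_result(self, response, request):
    if self.result is None:
      self.result = PubmedResult(response, request)

  def analyze_error(self, response, request):
    print(json.dumps({__name__: {'Response': {'dump': request.dump(),
                                              'error': response.getvalue()}}}))

  def analyze_result(self, response, request):
    self.init_result(response, request)
    # Delegate the parsing and store whatever records the parser returns
    for record in self.parser.parse(response):
      self.result.add_pubmed_record(record)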

Note

Explaining the XML parser is beyond the scope of this tutorial (and there are likely better approaches, anyways).

Putting everything together

The completed implementation is shown in Listing 7.

Complete PubMed fetcher to extract authors and citations.
#!/usr/bin/env python3


import os
import sys
import json
import xml.etree.ElementTree


# If entrezpy is installed using PyPi, uncomment the line 'import entrezpy'
# and comment out the 'sys.path.insert(...)' line
# import entrezpy
sys.path.insert(1, os.path.join(sys.path[0], '../../../src'))
# Import the required entrezpy modules
import entrezpy.conduit
import entrezpy.base.result
import entrezpy.base.analyzer


class PubmedRecord:
  """Simple data class to store individual Pubmed records. Individual authors will
  be stored as dict('lname':last_name, 'fname': first_name) in authors.
  Citations are stored as string elements in the list references."""

  def __init__(self):
    self.pmid = None
    self.title = None
    self.abstract = None
    self.authors = []
    self.references = []

class PubmedResult(entrezpy.base.result.EutilsResult):
  """Derive class entrezpy.base.result.EutilsResult to store Pubmed queries.
  Individual Pubmed records are implemented in :class:`PubmedRecord` and
  stored in :ivar:`pubmed_records`.

  :param response: inspected response from :class:`PubmedAnalyzer`
  :param request: the request for the current response
  :ivar dict pubmed_records: storing PubmedRecord instances"""

  def __init__(self, response, request):
    super().__init__(request.eutil, request.query_id, request.db)
    self.pubmed_records = {}

  def size(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.size`
    returning the number of stored data records."""
    return len(self.pubmed_records)

  def isEmpty(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.isEmpty`
    to query if any records have been stored at all."""
    if not self.pubmed_records:
      return True
    return False

  def get_link_parameter(self, reqnum=0):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.get_link_parameter`.
    Fetching a pubmed record has no intrinsic elink capabilities and therefore
    should inform users about this."""
    print("{} has no elink capability".format(self))
    return {}

  def dump(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.dump`.

    :return: instance attributes
    :rtype: dict
    """
    return {self:{'dump':{'pubmed_records':[x for x in self.pubmed_records],
                              'query_id': self.query_id, 'db':self.db,
                              'eutil':self.function}}}

  def add_pubmed_record(self, pubmed_record):
    """The only non-virtual and therefore PubmedResult-specific method to handle
    adding new data records"""
    self.pubmed_records[pubmed_record.pmid] = pubmed_record

class PubmedAnalyzer(entrezpy.base.analyzer.EutilsAnalyzer):
  """Derived class of :class:`entrezpy.base.analyzer.EutilsAnalyzer` to analyze and
  parse PubMed responses and requests."""

  def __init__(self):
    super().__init__()

  def init_result(self, response, request):
    """Implemented virtual method :meth:`entrezpy.base.analyzer.init_result`.
    This method initiates a result instance when analyzing the first response"""
    if self.result is None:
      self.result = PubmedResult(response, request)

  def analyze_error(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_error`. Since
    we expect XML errors, just print the error to STDOUT for
    logging/debugging."""
    print(json.dumps({__name__:{'Response': {'dump' : request.dump(),
                                             'error' : response.getvalue()}}}))

  def analyze_result(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_result`.
    Parse PubMed  XML line by line to extract authors and citations.
    xml.etree.ElementTree.iterparse
    (https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse)
    reads the XML file incrementally. Each  <PubmedArticle> is cleared after processing.

    ..note::  Adjust this method to include more/different tags to extract.
              Remember to adjust :class:`.PubmedRecord` as well."""
    self.init_result(response, request)
    isAuthorList = False
    isAuthor = False
    isRefList = False
    isRef = False
    isArticle = False
    medrec = None
    for event, elem in xml.etree.ElementTree.iterparse(response, events=["start", "end"]):
      if event == 'start':
        if elem.tag == 'PubmedArticle':
          medrec = PubmedRecord()
        if elem.tag == 'AuthorList':
          isAuthorList = True
        if isAuthorList and elem.tag == 'Author':
          isAuthor = True
          medrec.authors.append({'fname': None, 'lname': None})
        if elem.tag == 'ReferenceList':
          isRefList = True
        if isRefList and elem.tag == 'Reference':
          isRef = True
        if elem.tag == 'Article':
          isArticle = True
      else:
        if elem.tag == 'PubmedArticle':
          self.result.add_pubmed_record(medrec)
          elem.clear()
        if elem.tag == 'AuthorList':
          isAuthorList = False
        if isAuthorList and elem.tag == 'Author':
          isAuthor = False
        if elem.tag == 'ReferenceList':
          isRefList = False
        if elem.tag == 'Reference':
          isRef = False
        if elem.tag == 'Article':
          isArticle = False
        if elem.tag == 'PMID':
          medrec.pmid = elem.text.strip()
        if isAuthor and elem.tag == 'LastName':
          medrec.authors[-1]['lname'] = elem.text.strip()
        if isAuthor and elem.tag == 'ForeName':
          medrec.authors[-1]['fname'] = elem.text.strip()
        if isRef and elem.tag == 'Citation':
          medrec.references.append(elem.text.strip())
        if isArticle and elem.tag == 'AbstractText':
          if not medrec.abstract:
            medrec.abstract = elem.text.strip()
          else:
            medrec.abstract += elem.text.strip()
        if isArticle and elem.tag == 'ArticleTitle':
          medrec.title = elem.text.strip()

def main():
  c = entrezpy.conduit.Conduit(sys.argv[1])
  fetch_pubmed = c.new_pipeline()
  fetch_pubmed.add_fetch({'db':'pubmed', 'id':sys.argv[2].split(','),
                          'retmode':'xml'}, analyzer=PubmedAnalyzer())

  a = c.run(fetch_pubmed)

  #print(a)
  # Testing PubmedResult
  #print("DUMP: {}".format(a.get_result().dump()))
  #print("SIZE: {}".format(a.get_result().size()))
  #print("LINK: {}".format(a.get_result().get_link_parameter()))

  res = a.get_result()
  print("PMID","Title","Abstract","Authors","RefCount", "References", sep='=')
  for i in res.pubmed_records:
    print("{}={}={}={}={}={}".format(res.pubmed_records[i].pmid, res.pubmed_records[i].title,
                                  res.pubmed_records[i].abstract,
                                  ';'.join(str(x['lname']+","+x['fname'].replace(' ', '')) for x in res.pubmed_records[i].authors),
                                  len(res.pubmed_records[i].references),
                                  ';'.join(x for x in res.pubmed_records[i].references)))
  return 0

if __name__ == '__main__':
  main()
  • In the add_fetch call in main(): the PMIDs given on the command line are split at commas to allow several comma-separated PMIDs, and our implemented PubmedAnalyzer is added as the analyzer parameter to analyze the results, as described in entrezpy.conduit.Conduit.Pipeline.add_fetch()
  • a = c.run(fetch_pubmed): run the pipeline and store the analyzer in a
  • The commented-out print statements: methods for testing the PubmedResult implementation
  • res = a.get_result(): get the PubmedResult instance
  • The final print loop: process the fetched data records into columns

The implementation can be invoked as shown in Listing 8.

Fetching and formatting data records for several different PMIDs
$ python pubmed-fetcher.py you@email 6,15430309,31077305,27880757,26378223| column -s= -t |less

You’ll notice that not all data records have all fields, because either the fields are missing from these records or some tags have different names.
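
One way to make the parser more tolerant of missing or empty tags is a small helper that extracts element text defensively; a minimal sketch (the helper name get_text is our own, not part of entrezpy):

def get_text(elem):
  """Return the stripped element text, or None if the tag has no text."""
  if elem.text is None:
    return None
  return elem.text.strip()

Inside analyze_result, calls such as medrec.title = elem.text.strip() could then be replaced by medrec.title = get_text(elem).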

Running pubmed-fetcher.py with UID 20148030 will fail (Listing 9).

Fetching the data record PMID20148030 results in an error
$ python pubmed-fetcher.py you@email 20148030

The reason for this can be found in the requested XML. Running the command in Listing 10 hints at the problem. Adjusting and fixing it is a task left for interested readers.

Hint to find the reason why PMID 20148030 fails
$ efetch -db pubmed -id 20148030  -mode xml | grep -A7 \<AuthorList

Footnotes

[1]https://en.wikipedia.org/wiki/Virtual_function
[2]http://www.greenwoodsoftware.com/less/
[3]https://mirrors.edge.kernel.org/pub/linux/utils/util-linux/
[4]https://en.cppreference.com/w/c/language/struct
[5]https://en.wikipedia.org/wiki/Object_composition

Fetching sequence metadata from Entrez

Prerequisites

Overview

This tutorial explains how to write a simple sequence docsum fetcher using entrezpy.conduit.Conduit and by adjusting entrezpy.base.result.EutilsResult and entrezpy.base.analyzer.EutilsAnalyzer. It is based on an esearch step followed by fetching the data as docsum JSON. This tutorial is very similar to Fetching publication information from Entrez, the main differences being parsing JSON and using two steps in entrezpy.conduit.Conduit. Since the main steps are very similar, the reader should look there for more details.

Outline

The Efetch Entrez Utility is NCBI’s utility responsible for fetching data records. Its manual lists all possible databases and which records (Record type) can be fetched in which format. We’ll fetch docsum data in JSON using the E-Utility esummary after performing an esearch step using accession numbers as the query. Instead of using efetch, we will use esummary and replace the default analyzer with our own.

In entrezpy, a result is the sum of all individual requests required to answer the whole query. esummary fetches data in batches. In this example, all batches are collected prior to printing the information to standard output. The method DocsumAnalyzer.analyze_result() can be adjusted to store or analyze the results from each batch as soon as it is fetched.

A quick note on virtual functions

entrezpy is heavily based on virtual methods [1]. A virtual method is declared in the base class but implemented in the derived class. Every class inheriting the base class has to implement the virtual methods using the same signature and return the same result type as the base class. To implement the method in the inherited class, look up the method in the base class documentation.

Docsum data structure

Before we start writing our implementation, we need to understand the structure of the received data. This can be done using the EDirect tools from NCBI. The result is printed to standard output. For examination, it can either be stored in a file or, preferably, piped to a pager, e.g. less [2] or more [3]. These are usually installed on most *NIX systems.

Fetching the Docsum data record for accession HOU142311 using EDirect’s esearch and esummary.
$ esearch -db nuccore -query HOU142311 | esummary -mode json

The entry should start and end as shown in Listing 12.

JSON Docsum data record for accession HOU142311. Only the first few attribute lines are shown for brevity.
{
    "header": {
        "type": "esummary",
        "version": "0.3"
    },
    "result": {
        "uids": [
            "1110864597"
        ],
        "1110864597": {
            "uid": "1110864597",
            "caption": "KX883530",
            "title": "Beihai levi-like virus 30 strain HOU142311 hypothetical protein genes, complete cds",
            "extra": "gi|1110864597|gb|KX883530.1|",
            "gi": 1110864597,
            "createdate": "2016/12/10",
            "updatedate": "2016/12/10",
            "flags": "",
            "taxid": 1922417,
            "slen": 4084,
            "biomol": "genomic",
            "moltype": "rna",
            "topology": "linear",
            "sourcedb": "insd",
            "segsetsize": "",
            "projectid": "0",
            "genome": "genomic",
            "subtype": "strain|host|country|collection_date",
            "subname": "HOU142311|horseshoe crab|China|2014",
            "assemblygi": "",
            "assemblyacc": "",
            "tech": "",
            "completeness": "",
            "geneticcode": "1",
            "strand": "",
            "organism": "Beihai levi-like virus 30",
            "strain": "HOU142311",
            "biosample": "",
        }
    }
}

The first step is to write a program to fetch the requested records. This can be done using the entrezpy.conduit.Conduit class.

Simple Conduit pipeline to fetch Docsum Records

We will write a simple entrezpy pipeline named seqmetadata-fetcher.py using entrezpy.conduit.Conduit to test and run our implementations. A simple entrezpy.conduit.Conduit pipeline requires two arguments:

  • user email
  • accession numbers
Basic entrezpy.conduit.Conduit pipeline to fetch Docsum data records. The required arguments are parsed by ArgumentParser.
#!/usr/bin/env python3


import os
import sys
import json
import argparse


# If entrezpy is installed using PyPi, uncomment the line 'import entrezpy'
# and comment out the 'sys.path.insert(...)' line
# import entrezpy
sys.path.insert(1, os.path.join(sys.path[0], '../../../src'))
# Import the required entrezpy modules
import entrezpy.conduit
import entrezpy.base.result
import entrezpy.base.analyzer


def main():
  ap = argparse.ArgumentParser(description='Simple Sequence Metadata Fetcher. \
  Accessions are parsed from STDIN, one accession per line')
  ap.add_argument('--email',
                  type=str,
                  required=True,
                  help='email required by NCBI'),
  ap.add_argument('--apikey',
                  type=str,
                  default=None,
                  help='NCBI apikey (optional)')
  ap.add_argument('-db',
                  type=str,
                  required=True,
                  help='Database to search ')
  args = ap.parse_args()

  c = entrezpy.conduit.Conduit(args.email)
  fetch_docsum = c.new_pipeline()
  sid = fetch_docsum.add_search({'db':args.db, 'term':','.join([str(x.strip()) for x in sys.stdin])})
  fetch_docsum.add_summary({'rettype':'docsum', 'retmode':'json'},
                            dependency=sid, analyzer=DocsumAnalyzer())

We need to implement the DocsumAnalyzer, but first we have to design a Docsum data structure.

How to store Docsum data records

The data records can be stored in different ways, but using a class facilitates collecting and retrieving the requested data. We implement a simple class (analogous to a C/C++ struct [4]) to represent a Docsum record. Because we fetch the data in JSON format, the class performs rather dull parsing. The nested Subtype class handles the subtype and subname attributes in a Docsum response.

Implementing a Docsum data record
class Docsum:
  """Simple data class to store individual sequence Docsum records."""

  class Subtype:

    def __init__(self, subtype, subname):
      self.strain = None
      self.host = None
      self.country = None
      self.collection = None
      self.collection_date = None

      for i in range(len(subtype)):
        if subtype[i] == 'strain':
          self.strain = subname[i]
        if subtype[i] == 'host':
          self.host = subname[i]
        if subtype[i] == 'country':
          self.country = subname[i]
        if subtype[i] == 'collection_date':
          self.collection_date = subname[i]

  def __init__(self, json_docsum):
    self.uid = int(json_docsum['uid'])
    self.caption = json_docsum['caption']
    self.title = json_docsum['title']
    self.extra = json_docsum['extra']
    self.gi = int(json_docsum['gi'])
    self.taxid = int(json_docsum['taxid'])
    self.slen =  int(json_docsum['slen'])
    self.biomol =  json_docsum['biomol']
    self.moltype =  json_docsum['moltype']
    self.topology = json_docsum['topology']
    self.sourcedb = json_docsum['sourcedb']
    self.segsetsize = json_docsum['segsetsize']
    self.projectid = int(json_docsum['projectid'])
    self.genome = json_docsum['genome']
    self.subtype = Docsum.Subtype(json_docsum['subtype'].split('|'),
                                  json_docsum['subname'].split('|'))
    self.assemblygi = json_docsum['assemblygi']
    self.assemblyacc = json_docsum['assemblyacc']
    self.tech = json_docsum['tech']
    self.completeness = json_docsum['completeness']
    self.geneticcode = int(json_docsum['geneticcode'])
    self.strand = json_docsum['strand']
    self.organism = json_docsum['organism']
    self.strain = json_docsum['strain']
    self.accessionversion = json_docsum['accessionversion']

Implement DocsumResult

We have to extend the virtual methods declared in entrezpy.base.result.EutilsResult. The documentation informs us about the required parameters and expected return values.

In addition, we declare the method DocsumResult.add_docsum() to handle adding new Docsum data record instances as defined in Listing 14. The DocsumResult methods in this tutorial are trivial and we can implement the class in one go.

Implementing DocsumResult
class DocsumResult(entrezpy.base.result.EutilsResult):
  """Derive class entrezpy.base.result.EutilsResult to store Docsum queries.
  Individual Docsum records are implemented in :class:`Docsum` and
  stored in :ivar:`docsums`.

  :param response: inspected response from :class:`DocsumAnalyzer`
  :param request: the request for the current response
  :ivar dict docsums: storing Docsum instances"""

  def __init__(self, response, request):
    super().__init__(request.eutil, request.query_id, request.db)
    self.docsums = {}

  def size(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.size`
    returning the number of stored data records."""
    return len(self.docsums)

  def isEmpty(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.isEmpty`
    to query if any records have been stored at all."""
    if not self.docsums:
      return True
    return False

  def get_link_parameter(self, reqnum=0):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.get_link_parameter`.
    Fetching summary record has no intrinsic elink capabilities and therefore
    should inform users about this."""
    print("{} has no elink capability".format(self))
    return {}

  def dump(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.dump`.

    :return: instance attributes
    :rtype: dict
    """
    return {self:{'dump':{'docsum_records':[x for x in self.docsums],
                              'query_id': self.query_id, 'db':self.db,
                              'eutil':self.function}}}

  def add_docsum(self, docsum):
    """The only non-virtual and therefore DocsumResult-specific method to handle
    adding new data records"""
    self.docsums[docsum.uid] = docsum

Note

The fetch result for Docsum records has no WebEnv value and is missing the originating database, since esummary is usually the last query within a series of E-Utility queries. Therefore, we implement a warning informing the user that linking is not possible.

Implementing DocsumAnalyzer

We have to extend the virtual methods declared in entrezpy.base.analyzer.EutilsAnalyzer. The documentation informs us about the required parameters and expected return values.

Implementing DocsumAnalyzer
class DocsumAnalyzer(entrezpy.base.analyzer.EutilsAnalyzer):
  """Derived class of :class:`entrezpy.base.analyzer.EutilsAnalyzer` to analyze and
  parse Docsum responses and requests."""

  def __init__(self):
    super().__init__()

  def init_result(self, response, request):
    """Implemented virtual method :meth:`entrezpy.base.analyzer.init_result`.
    This method initiates a result instance when analyzing the first response"""
    if self.result is None:
      self.result = DocsumResult(response, request)

  def analyze_error(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_error`. Since
    we expect JSON, just print the error to STDOUT as string."""
    print(json.dumps({__name__:{'Response': {'dump' : request.dump(),
                                             'error' : response}}}))

  def analyze_result(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_result`.
    The result is a JSON structure and allows easy parsing"""
    self.init_result(response, request)
    for i in response['result']['uids']:
      self.result.add_docsum(Docsum(response['result'][i]))

Compared to the PubMed analyzer, parsing the JSON output is very easy. If you already have a parser, you can use an object composition approach [5]. Further, you can add a method call in analyze_result to store the processed data in a database or to implement checkpoints, as sketched below.
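
A minimal sketch of such a checkpointing variant of DocsumAnalyzer.analyze_result(); the output file docsums.jsonl is our own choice and not part of entrezpy:

  def analyze_result(self, response, request):
    """Checkpointing variant: append each batch as JSON lines while collecting it."""
    self.init_result(response, request)
    with open('docsums.jsonl', 'a') as checkpoint:
      for i in response['result']['uids']:
        self.result.add_docsum(Docsum(response['result'][i]))
        # Append the raw docsum for this UID as one JSON line
        checkpoint.write(json.dumps(response['result'][i]) + '\n')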

Putting everything together

The completed implementation is shown in Listing 17.

Complete Docsum fetcher
#!/usr/bin/env python3

import os
import sys
import json
import argparse


# If entrezpy is installed using PyPi, uncomment the line 'import entrezpy'
# and comment out the 'sys.path.insert(...)' line
# import entrezpy
sys.path.insert(1, os.path.join(sys.path[0], '../../../src'))
# Import the required entrezpy modules
import entrezpy.conduit
import entrezpy.base.result
import entrezpy.base.analyzer


class Docsum:
  """Simple data class to store individual sequence Docsum records."""

  class Subtype:

    def __init__(self, subtype, subname):
      self.strain = None
      self.host = None
      self.country = None
      self.collection = None
      self.collection_date = None

      for i in range(len(subtype)):
        if subtype[i] == 'strain':
          self.strain = subname[i]
        if subtype[i] == 'host':
          self.host = subname[i]
        if subtype[i] == 'country':
          self.country = subname[i]
        if subtype[i] == 'collection_date':
          self.collection_date = subname[i]

  def __init__(self, json_docsum):
    self.uid = int(json_docsum['uid'])
    self.caption = json_docsum['caption']
    self.title = json_docsum['title']
    self.extra = json_docsum['extra']
    self.gi = int(json_docsum['gi'])
    self.taxid = int(json_docsum['taxid'])
    self.slen =  int(json_docsum['slen'])
    self.biomol =  json_docsum['biomol']
    self.moltype =  json_docsum['moltype']
    self.topology = json_docsum['topology']
    self.sourcedb = json_docsum['sourcedb']
    self.segsetsize = json_docsum['segsetsize']
    self.projectid = int(json_docsum['projectid'])
    self.genome = json_docsum['genome']
    self.subtype = Docsum.Subtype(json_docsum['subtype'].split('|'),
                                  json_docsum['subname'].split('|'))
    self.assemblygi = json_docsum['assemblygi']
    self.assemblyacc = json_docsum['assemblyacc']
    self.tech = json_docsum['tech']
    self.completeness = json_docsum['completeness']
    self.geneticcode = int(json_docsum['geneticcode'])
    self.strand = json_docsum['strand']
    self.organism = json_docsum['organism']
    self.strain = json_docsum['strain']
    self.accessionversion = json_docsum['accessionversion']

class DocsumResult(entrezpy.base.result.EutilsResult):
  """Derive class entrezpy.base.result.EutilsResult to store Docsum queries.
  Individual Docsum records are implemented in :class:`Docsum` and
  stored in :ivar:`docsums`.

  :param response: inspected response from :class:`DocsumAnalyzer`
  :param request: the request for the current response
  :ivar dict docsums: storing Docsum instances"""

  def __init__(self, response, request):
    super().__init__(request.eutil, request.query_id, request.db)
    self.docsums = {}

  def size(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.size`
    returning the number of stored data records."""
    return len(self.docsums)

  def isEmpty(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.isEmpty`
    to query if any records have been stored at all."""
    if not self.docsums:
      return True
    return False

  def get_link_parameter(self, reqnum=0):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.get_link_parameter`.
    Fetching summary record has no intrinsic elink capabilities and therefore
    should inform users about this."""
    print("{} has no elink capability".format(self))
    return {}

  def dump(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.dump`.

    :return: instance attributes
    :rtype: dict
    """
    return {self:{'dump':{'docsum_records':[x for x in self.docsums],
                              'query_id': self.query_id, 'db':self.db,
                              'eutil':self.function}}}

  def add_docsum(self, docsum):
    """The only non-virtual and therefore DocsumResult-specific method to handle
    adding new data records"""
    self.docsums[docsum.uid] = docsum

class DocsumAnalyzer(entrezpy.base.analyzer.EutilsAnalyzer):
  """Derived class of :class:`entrezpy.base.analyzer.EutilsAnalyzer` to analyze and
  parse Docsum responses and requests."""

  def __init__(self):
    super().__init__()

  def init_result(self, response, request):
    """Implemented virtual method :meth:`entrezpy.base.analyzer.init_result`.
    This method initiates a result instance when analyzing the first response"""
    if self.result is None:
      self.result = DocsumResult(response, request)

  def analyze_error(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_error`. Since
    we expect JSON, just print the error to STDOUT as string."""
    print(json.dumps({__name__:{'Response': {'dump' : request.dump(),
                                             'error' : response}}}))

  def analyze_result(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_result`.
    The result is a JSON structure and allows easy parsing"""
    self.init_result(response, request)
    for i in response['result']['uids']:
      self.result.add_docsum(Docsum(response['result'][i]))

def main():
  ap = argparse.ArgumentParser(description='Simple Sequence Metadata Fetcher. \
  Accessions are parsed from STDIN, one accession per line')
  ap.add_argument('--email',
                  type=str,
                  required=True,
                  help='email required by NCBI')
  ap.add_argument('--apikey',
                  type=str,
                  default=None,
                  help='NCBI apikey (optional)')
  ap.add_argument('-db',
                  type=str,
                  required=True,
                  help='Database to search ')
  args = ap.parse_args()

  c = entrezpy.conduit.Conduit(args.email)
  fetch_docsum = c.new_pipeline()
  sid = fetch_docsum.add_search({'db':args.db, 'term':','.join([str(x.strip()) for x in sys.stdin])})
  fetch_docsum.add_summary({'rettype':'docsum', 'retmode':'json'},
                            dependency=sid, analyzer=DocsumAnalyzer())
  docsums = c.run(fetch_docsum).get_result().docsums
  for i in docsums:
    print(i, docsums[i].uid, docsums[i].caption, docsums[i].strain, docsums[i].subtype.host)
  return 0

if __name__ == '__main__':
  main()

The implementation can be invoked as shown in Listing 18.

Fetching Docsum data for several accessions
$ cat "NC_016134.3" > accs
$ cat "HOU142311" >> accs
$ cat accs | python seqmetadata-fetcher.py --email email -db nuccore

Footnotes

[1]https://en.wikipedia.org/wiki/Virtual_function
[2]http://www.greenwoodsoftware.com/less/
[3]https://mirrors.edge.kernel.org/pub/linux/utils/util-linux/
[4]https://en.cppreference.com/w/c/language/struct
[5]https://en.wikipedia.org/wiki/Object_composition

Entrezpy E-Utility functions

Logging

entrezpy uses the Python logging module for logging. The base classes only log at the levels ‘ERROR’ and ‘DEBUG’. The module entrezpy.log.logger contains all methods related to logging. A basic configuration of the logger is given in entrezpy.log.conf.

Applications using entrezpy can set the level of logging as shown in Listing 19. Logging calls can be made in classes inheriting entrezpy classes as shown in Listing 20. The function entrezpy.log.logger.get_class_logger() requires the class as its input.

Add logging to applications using entrezpy

Import the entrezpy logging module and set the level.

Setting the logging level for an application using the entrezpy library
import entrezpy.log.logger

entrezpy.log.logger.set_level('DEBUG')

def main():
  """
  your application using entrezpy
  """

Add logging to a class inheriting a entrezpy base class

Example of creating a class level entrezpy logger.
import json

import entrezpy.base.query
import entrezpy.log.logger


class Esearcher(entrezpy.base.query.EutilsQuery):

  def __init__(self, tool, email, apikey=None, apikey_var=None, threads=None, qid=None):
    super().__init__('esearch.fcgi', tool, email, apikey=apikey, apikey_var=apikey_var, threads=threads, qid=qid)
    self.logger = entrezpy.log.logger.get_class_logger(Esearcher)
    self.logger.debug(json.dumps({'init':self.dump()}))

Esearch

entrezpy.esearch.esearcher.Esearcher implements the E-Utility ESearch [0]. Esearcher queries return UIDs for data in the requested Entrez database or WebEnv/QueryKey references from the Entrez History server.

Usage

import entrezpy.esearch.esearcher

e = entrezpy.esearch.esearcher.Esearcher(tool,
                                         email,
                                         apikey=None,
                                         apikey_var=None,
                                         threads=None,
                                         qid=None)
analyzer = e.inquire({'db' : 'nucleotide',
                      'term' : 'viruses[orgn]',
                      'retmax' : 99,
                      'idtype' : 'acc'})
print(analyzer.get_result().uids)

This creates an Esearcher instance with the following parameters:

Esearcher

entrezpy.esearch.esearcher.Esearcher

Parameters:
  • tool (str) – string with no internal spaces uniquely identifying the software producing the request, i.e. your tool/pipeline
  • email (str) – a complete and valid e-mail address of the software developer, not that of a third-party end user; entrezpy is a library, not a tool
  • apikey (str) – NCBI API key
  • apikey_var (str) – environment variable storing an NCBI API key
  • threads (int) – number of threads (not processors)
  • qid (str) – unique Esearch query id; will be generated if not given

Supported E-Utility parameter

Parameters are passed as dictionary to entrezpy.esearch.esearcher.Esearcher.inquire() and are expected to be the same as those for the E-Utility [0]. For example:

{'db' : 'nucleotide', 'term' : 'viruses[orgn]', 'reqsize' : 100, 'retmax' : 99, 'idtype' : 'acc'}

Esearcher introduces one additional parameter, reqsize, which sets the size of a single request. Values greater than the maximum allowed by NCBI will be set to the maximum.

Parameter   Type
E-Utility    
db str
WebEnv str
query_key int
uilist bool
retmax int
retstart int
usehistory bool
term str
sort str
field str
reldate int
datetype str (YYYY/MM/DD, YYYY/MM, YYYY)
mindate str (YYYY/MM/DD, YYYY/MM, YYYY)
maxdate str (YYYY/MM/DD, YYYY/MM, YYYY)
idtype str
retmode json, enforced by Esearcher
Esearcher reqsize int

Result

Instance of entrezpy.esearch.esearch_result.EsearchResult.

If retmax = 0 or rettype = count, no UIDs are returned. If usehistory is True (default), the WebEnv and query_key for the request are returned. A short sketch for reading these attributes follows the list below.

  • count : number of found UIDs for request
  • retmax : number of UIDs to retrieve
  • retstart : number of first UID to retrieve
  • uids : list of fetched UIDs
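
A minimal sketch of reading these attributes, assuming the Esearcher instance and analyzer from the Usage section above:

# Hedged sketch: inspect the Esearch result via the returned analyzer.
# Assumes 'analyzer' is the EsearchAnalyzer returned by e.inquire() above.
if analyzer and analyzer.isSuccess():
  result = analyzer.get_result()
  print(result.count, result.retmax, result.retstart)  # request statistics
  print(result.uids)                                    # retrieved UIDs/accessions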

Approach

  1. Parameters are checked and the request size is configured
  2. The initial search request is sent
  3. If more requests are required, parameters are adjusted and the remaining requests are performed
  4. If no errors were encountered, the analyzer with the results of all requests is returned

Efetch

entrezpy.efetch.efetcher.Efetcher implements the E-Utility EFetch [0]. Efetcher queries fetch data records from the requested Entrez database, either for given UIDs or for a WebEnv/query_key reference on the Entrez History server.

Usage

import entrezpy.efetch.efetcher

e = entrezpy.efetch.efetcher.Efetcher(tool,
                                      email,
                                      apikey=None,
                                      apikey_var=None,
                                      threads=None,
                                      qid=None)
analyzer = e.inquire({'db' : 'pubmed',
                      'id' : [17284678, 9997],
                      'retmode' : 'text',
                      'rettype' : 'abstract'})

The default EfetchAnalyzer performs only a basic analysis of the fetched data. For most applications, a dedicated analyzer derived from entrezpy.base.analyzer.EutilsAnalyzer should be passed to inquire() (see EfetchAnalyzer below).

Efetcher

entrezpy.efetch.efetcher.Efetcher

Parameters:
  • tool (str) – string with no internal spaces uniquely identifying the software producing the request, i.e. your tool/pipeline
  • email (str) – a complete and valid e-mail address of the software developer, not that of a third-party end user; entrezpy is a library, not a tool
  • apikey (str) – NCBI API key
  • apikey_var (str) – environment variable storing an NCBI API key
  • threads (int) – number of threads
  • qid (str) – unique Efetch query id; will be generated if not given

Supported E-Utility parameter

Parameters are passed as dictionary to entrezpy.efetch.efetcher.Efetcher.inquire() and are expected to be the same as those for the E-Utility [0]. For example:

{'db' : 'pubmed', 'id' : [17284678, 9997], 'retmode' : 'text', 'rettype' : 'abstract'}

Efetcher also accepts the additional parameter reqsize, which sets the size of a single request. Values greater than the maximum allowed by NCBI will be set to the maximum.

Parameter   Type
E-Utility    
db str
WebEnv str
query_key int
uilist bool
retmax int
retstart int
usehistory bool
term str
sort str
field str
reldate int
datetype str (YYYY/MM/DD, YYYY/MM, YYYY)
mindate str (YYYY/MM/DD, YYYY/MM, YYYY)
maxdate str (YYYY/MM/DD, YYYY/MM, YYYY)
idtype str
retmode str (xml, text or json, depending on db; default xml)
Efetcher reqsize int

Result

Instance of entrezpy.efetch.efetch_result.EfetchResult.

The content of the result depends on the requested retmode and rettype and on the analyzer used to parse the responses (see EfetchAnalyzer below).

Approach

  1. Parameters are checked and the request size is configured
  2. The initial fetch request is sent
  3. If more requests are required, parameters are adjusted and the remaining requests are performed
  4. If no errors were encountered, the analyzer with the results of all requests is returned

Esummary

entrezpy.esummary.esummarizer.Esummarizer implements the E-Utility ESummary [0]. Esummarizer fetches document summaries for UIDs in the requested database. Summaries can contain abstracts, experimental details, etc.

Usage

import entrezpy.esummary.esummarizer

e = entrezpy.esummary.esummarizer.Esummarizer(tool,
                                              email,
                                              apikey=None,
                                              apikey_var=None,
                                              threads=None,
                                              qid=None)

analyzer = e.inquire({'db' : 'pubmed', 'id' : [11850928, 11482001]})
print(analyzer.get_result().summaries)

Esummarizer

entrezpy.esummary.esummarizer.Esummarizer

Parameters:
  • tool (str) – string with no internal spaces uniquely identifying the software producing the request, i.e. your tool/pipeline
  • email (str) – a complete and valid e-mail address of the software developer, not that of a third-party end user; entrezpy is a library, not a tool
  • apikey (str) – NCBI API key
  • apikey_var (str) – environment variable storing an NCBI API key
  • threads (int) – number of threads
  • qid (str) – unique Esummary query id; will be generated if not given

Supported E-Utility parameter

Parameters are passed as dictionary to entrezpy.esummary.esummarizer.Esummarizer.inquire() and are expected to be the same as those for the E-Utility [0]. For example:

{'db' : 'pubmed', 'id' : [11237011, 12466850]}

Parameter   Type
E-Utility    
db str
id list
WebEnv string
retstart int
retmax int
retmode JSON, enforced by entrezpy

Not supported E-Utility parameter

Parameter   Type
E-Utility    
retmode JSON, enforced by entrezpy
version XML specific parameter

Result

Instance of entrezpy.esummary.esummary_result.EsummaryResult.

The retrieved document summaries are stored in the summaries attribute of the result (see the Usage example above and the sketch below).
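
A minimal sketch for iterating the retrieved summaries, assuming the analyzer from the Usage section above and assuming that summaries is a dictionary keyed by UID (an assumption, not stated in this reference):

# Hedged sketch: iterate the retrieved document summaries.
summaries = analyzer.get_result().summaries
for uid in summaries:              # assumption: dictionary keyed by UID
  print(uid, summaries[uid])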

Approach

  1. Parameters are checked and the request size is configured
  2. UIDs are posted to NCBI
  3. If no errors were encountered, the analyzer with the result storing the document summaries is returned

Epost

entrezpy.epost.eposter.Eposter implements the E-Utility EPost [0]. Eposter queries post UIDs onto the Entrez History server and return the corresponding WebEnv and query_key. If an existing WebEnv is passed as parameter, the posted UIDs will be added to this WebEnv by increasing its query_key.

Usage

import entrezpy.epost.eposter

e = entrezpy.epost.eposter.Eposter(tool,
                                   email,
                                   apikey=None,
                                   apikey_var=None,
                                   threads=None,
                                   qid=None)

analyzer = e.inquire({'db' : 'pubmed', 'id' : [12466850]})
print(analyzer.get_result().get_link_parameter())

Eposter

entrezpy.epost.eposter.Eposter

Parameters:
  • tool (str) – string with no internal spaces uniquely identifying the software producing the request, i.e. your tool/pipeline
  • email (str) – a complete and valid e-mail address of the software developer, not that of a third-party end user; entrezpy is a library, not a tool
  • apikey (str) – NCBI API key
  • apikey_var (str) – environment variable storing an NCBI API key
  • threads (int) – number of threads
  • qid (str) – unique Epost query id; will be generated if not given

Supported E-Utility parameter

Parameters are passed as dictionary to entrezpy.epost.eposter.Eposter.inquire() and are expected to be the same as those for the E-Utility [0]. For example:

{'db' : 'pubmed', 'id' : [11237011, 12466850]}

Parameter   Type
E-Utility    
db str
id list
WebEnv string

Result

Instance of entrezpy.epost.epost_result.EpostResult.

The result stores the WebEnv and query_key referencing the posted UIDs on the Entrez History server; these can be reused in follow-up queries (see entrezpy.epost.epost_result.EpostResult below).

Approach

  1. Parameters are checked and the request size is configured.
  2. UIDs are posted to NCBI.
  3. If no errors were encountered, returns the analyzer with the result storing the WebEnv and query_key for the UIDs.

Entrezpy In-depth

Entrezpy architecture

Queries and requests

Entrezpy queries are built from at least one request. A search for all virus sequences in the Entrez database ‘nucleotide’ is one query and has one initial request, the search itself. However, this search will return more UIDs than can be fetched in one go, and several requests are required to obtain all UIDs.
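
As a rough illustration (a sketch, not the exact calculation used inside entrezpy), the number of requests for a query follows from the number of expected records and the request size:

import math

count = 250000    # records matching the search (hypothetical)
reqsize = 100000  # records retrieved per request

expected_requests = math.ceil(count / reqsize)
print(expected_requests)  # 3: two full requests and one smaller final request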

Basic functions

Each function is a collection of inherited classes interacting with each other. Each class implements a specific task of a query. The basic classes required for an entrezpy query are found in src/entrezpy/base of the repository.

Each query starts by passing E-Utils parameters as a dictionary to the inquire() method of the query, which is derived from entrezpy.base.query.EutilsQuery.inquire().

The first step in inquire() is to instantiate a parameter object derived from entrezpy.base.parameter.EutilsParameter. The parameters are checked for errors and, if none are found, an instance of entrezpy.base.parameter.EutilsParameter is returned. The attributes of entrezpy.base.parameter.EutilsParameter configure the query, and the required number of entrezpy.base.request.EutilsRequest instances are added to the queue.

Each request is sent to the corresponding E-Utility and its response received. All responses within a query are analyzed by the same instance of an entrezpy.base.analyzer.EutilsAnalyzer. The analyzer stores results in an instance of entrezpy.base.result.EutilsResult.

Error handling

The primary approach of entrezpy is to abort if an error has been encountered, since it is not known what the developer had in mind when deploying entrezpy.

entrezpy aborts if:

  • errors are found in the parameters
  • HTTP error 400

entrezpy continues, but warns, if:

  • empty result
  • a request could not be obtained after 10 retries

Logging

WIP

E-Utilities by entrezpy

Entrezpy assembles POST parameters [1], [2], creates the corresponding requests to interact with the E-Utilities, and reads the received responses. Entrezpy implements E-Utility functions as queries consisting of at least one request:

       Query
 +...............+
 |               |
 0 1 2 3 4 5 6 7 8
 |     | |     | |
 +-----+ +-----+ +
    R0     R1    R2
     \     |     /
      +----+----+
           |
           v
entrezpy.base.analyzer.EutilsAnalyzer()

The example depicts the relation between a query and requests in Entrezpy. The example query consists of 9 data records. Using a request size of 4 data records, Entrezpy resolves this query using two requests (R0, R1) with the given size and adjusts the size of the last request (R2).

Each query passes all requests and responses through the same instance of its corresponding entrezpy.base.analyzer.EutilsAnalyzer. The analyzer can be passed as argument to each entrezpy query. Each request is analyzed as soon as it is received. The analyzer base class entrezpy.base.analyzer.EutilsAnalyzer can be inherited and adjusted for specific formats or tasks.

Entrezpy offers default analyzers, but most likely you will want, or have to, implement a specific Efetch analyzer. You can use entrezpy.efetch.efetch_analyzer.EfetchAnalyzer as a template.
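
Below is a minimal, hypothetical sketch of such an analyzer. It only echoes text responses (e.g. FASTA) to STDOUT and mirrors the virtual-method contract described in the Analyzer reference (init_result(), analyze_error(), analyze_result()); it is not part of entrezpy itself.

import json
import entrezpy.base.analyzer


class TextEchoAnalyzer(entrezpy.base.analyzer.EutilsAnalyzer):
  """Hypothetical Efetch analyzer printing retmode=text responses to STDOUT."""

  def __init__(self):
    super().__init__()

  def init_result(self, response, request):
    # No dedicated result class in this sketch; collect raw responses instead
    if self.result is None:
      self.result = []

  def analyze_error(self, response, request):
    print(json.dumps({__name__: {'Response': {'dump': request.dump(),
                                              'error': response}}}))

  def analyze_result(self, response, request):
    self.init_result(response, request)
    # retmode=text responses arrive as io.StringIO (see convert_response())
    self.result.append(response.getvalue())
    print(self.result[-1])

Such an analyzer would then be passed as the analyzer argument to Efetcher.inquire() or to Conduit's add_fetch().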

E-Utilities History server

The E-Utilities offer to store queries on the NCBI servers, returning a WebEnv and query_key referencing such queries. This can skip unnecessary data downloads or be used to modify queries on the NCBI servers.
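
The hedged sketch below illustrates the idea: UIDs are posted with Eposter and the returned WebEnv/query_key reference is reused in an Efetch query. It assumes that get_link_parameter() on the EPost result returns a dictionary holding the WebEnv and query_key parameters.

import entrezpy.epost.eposter
import entrezpy.efetch.efetcher

# Post UIDs to the History server ('mytool'/'myemail' are placeholders)
poster = entrezpy.epost.eposter.Eposter('mytool', 'myemail')
post_analyzer = poster.inquire({'db': 'pubmed', 'id': [11237011, 12466850]})

# Assumption: the WebEnv/query_key reference for the posted UIDs
reference = post_analyzer.get_result().get_link_parameter()

# Reuse the reference to fetch the posted records without resending the UIDs
fetcher = entrezpy.efetch.efetcher.Efetcher('mytool', 'myemail')
fetcher.inquire({'db': 'pubmed',
                 'retmode': 'text',
                 'rettype': 'abstract',
                 **reference})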

Reference

Logging module

Functions

entrezpy.log.logger.CONFIG = {'level': 'INFO', 'propagate': True, 'quiet': True}

Store logger settings

entrezpy.log.logger.get_root()

Returns the module root

entrezpy.log.logger.resolve_class_namespace(cls)

Resolves namespace for logger

entrezpy.log.logger.get_class_logger(cls)

Prepares logger for given class

entrezpy.log.logger.set_level(level)

Sets logging level for applications using entrezpy.

Configuration

entrezpy.log.conf.default_config = {'disable_existing_loggers': False, 'formatters': {'default': {'format': '%(asctime)s %(threadName)s [%(levelname)s] %(name)s: %(message)s'}}, 'handlers': {'console': {'class': 'logging.StreamHandler', 'formatter': 'default', 'stream': 'ext://sys.stderr'}}, 'loggers': {'': {'handlers': ['console']}}, 'version': 1}

Dictionary to store logger configuration

Base modules

Query

class entrezpy.base.query.EutilsQuery(eutil, tool, email, apikey=None, apikey_var=None, threads=None, qid=None)

EutilsQuery implements the base class for all entrezpy queries to E-Utils. It handles the information required by every query, e.g. base query url, email address, allowed requests per second, apikey, etc. It declares the virtual method inquire() which needs to be implemented by every derived query, since queries differ among E-Utilities.

An NCBI API key will be set as follows:

  • passed as argument during initialization
  • check the environment variable passed as argument
  • check the environment variable NCBI_API_KEY

Upon initialization, the following parameters are set:

  • set unique query id
  • check for / set NCBI apikey
  • initialize entrezpy.requester.requester.Requester with allowed requests per second
  • assemble the E-Utility URL for the desired E-Utils function
  • initialize Multithreading queue and register query at entrezpy.base.monitor.QueryMonitor for logging

Multithreading is handled using the nested classes entrezpy.base.query.EutilsQuery.RequestPool and entrezpy.base.query.EutilsQuery.ThreadedRequester.

Inits EutilsQuery instance with eutil, toolname, email, apikey, apikey_envar, threads and qid.

Parameters:
  • eutil (str) – name of eutil function on EUtils server
  • tool (str) – tool name
  • email (str) – user email
  • apikey (str) – NCBI apikey
  • apikey_var (str) – environment variable storing the NCBI apikey
  • threads (int) – set threads for multithreading
  • qid (str) – unique query id
Variables:
  • id – unique query id
  • base_url – base URL for E-Utility requests
  • requests_per_sec (int) – default limit of requests/sec (set by NCBI)
  • max_requests_per_sec (int) – maximum requests/sec with an API key (set by NCBI)
  • url (str) – full URL for the E-Utility function
  • contact (str) – user email (required by NCBI)
  • tool (str) – tool name (required by NCBI)
  • apikey (str) – NCBI apikey
  • num_threads (int) – number of threads to use
  • failed_requests (list) – store failed requests for analysis if desired
  • request_pool – entrezpy.base.query.EutilsQuery.RequestPool instance
  • request_counter (int) – request counter for a EutilsQuery instance
base_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils'

Base url for all Eutil request

inquire(parameter, analyzer)

Virtual function starting query. Each query requires its own implementation.

Parameters:
  • parameter (dict) – E-Utilities parameters
  • analyzer (entrezpy.base.analyzer.EutilsAnalyzer) – query response analyzer
Returns:

analyzer

Return type:

entrezpy.base.analyzer.EutilsAnalyzer

check_requests()

Virtual function testing and handling failed requests. These requests fail due to HTTP/URL issues and are stored in entrezpy.base.query.EutilsQuery.failed_requests

check_ncbi_apikey(apikey=None, env_var=None)

Checks and sets NCBI apikey.

Parameters:
  • apikey (str) – NCBI apikey
  • env_var (str) – environment variable storing the NCBI apikey
prepare_request(request)

Prepares a request for sending to the E-Utilities with the required query attributes.

Parameters:request (entrezpy.base.request.EutilsRequest) – entrezpy request instance
Returns:request instance with EUtils parameters
Return type:entrezpy.base.request.EutilsRequest
add_request(request, analyzer)

Adds one request and corresponding analyzer to the request pool.

Parameters:
monitor_start(query_parameters)

Starts query monitoring

Parameters:query_parameters (entrezpy.base.parameter.EutilsParameter) – query parameters
monitor_stop()

Stops query monitoring

monitor_update(updated_query_parameters)

Updates query monitoring parameters if follow up requests are required.

Parameters:updated_query_parameters (entrezpy.base.parameter.EutilsParameter) – updated query parameters
hasFailedRequests()

Reports if at least one request failed.

dump()

Dump all attributes

isGoodQuery()

Tests for request errors

rtype:bool

Parameter

class entrezpy.base.parameter.EutilsParameter(parameter=None)

EutilsParameter set and check parameters for each query. EutilsParameter is populated from a dictionary with valid E-Utilities parameters for the corresponding query. It declares virtual functions where necessary.

Simple helper functions are presented to test the common parameters db, WebEnv, query_key and usehistory.

Note

usehistory is the parameter used for Entrez History server queries and is set to True (use it) by default. It can be set to False to omit History server use.

haveExpectedRequests() tests if the number of requests has been calculated.

The virtual methods check() and dump() need their own implementation since they can vary between queries.

Warning

check() is expected to run after all parameters have been set.

Parameters:

parameter (dict) – Eutils query parameters

Variables:
  • db (str) – Entrez database name
  • webenv (str) – WebEnv
  • querykey (int) – querykey
  • expected_request (int) – number of expected requests for the query
  • doseq (bool) – use id= parameter for each uid in POST
haveDb()

Check for required db parameter

Return type:bool
haveWebenv()

Check for required WebEnv parameter

Return type:bool
haveQuerykey()

Check for required QueryKey parameter

Return type:bool
useHistory()

Check if history server should be used.

Return type:bool
haveExpectedRequets()

Check for expected requests. Hints at an error if no requests are expected.

Return type:bool
check()

Virtual function to run a check before starting the query. This is a crucial step and should abort upon failing.

Raises:NotImplementedError – if not implemented
dump()

Dump instance attributes

Return type:dict
Raises:NotImplementedError – if not implemented
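
As an illustration, a minimal, hypothetical subclass might implement check() and dump() as sketched below, assuming the base class populates db, webenv and querykey from the passed dictionary as described above; real entrezpy parameter classes are more involved.

import sys
import entrezpy.base.parameter


class UidParameter(entrezpy.base.parameter.EutilsParameter):
  """Hypothetical parameter class for a single-request, UID-based query."""

  def __init__(self, parameter):
    super().__init__(parameter)
    self.uids = parameter.get('id', [])
    self.expected_requests = 1  # assumption: one request suffices here
    self.check()                # run check() after all parameters are set

  def check(self):
    # Abort early, as recommended, if required parameters are missing
    if not self.haveDb():
      sys.exit('{}: missing db parameter'.format(__name__))
    if not self.uids and not (self.haveWebenv() and self.haveQuerykey()):
      sys.exit('{}: missing id or WebEnv/query_key parameters'.format(__name__))

  def dump(self):
    return {'db': self.db,
            'uids': self.uids,
            'webenv': self.webenv,
            'querykey': self.querykey,
            'expected_requests': self.expected_requests}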

Request

class entrezpy.base.request.EutilsRequest(eutil, db)

EutilsRequest is the base class for requests from entrezpy.base.query.EutilsQuery.

EutilsRequests are instantiated in entrezpy.base.query.EutilsQuery.inquire() before being added to the request pool by entrezpy.base.query.EutilsQuery.add_request(). Each EutilsRequest triggers an answer from the NCBI Entrez servers if no connection errors occur.

EutilsRequest stores the information required for POST requests. Its status can be queried from outside via entrezpy.base.request.EutilsRequest.get_observation(). EutilsRequest instances store information that is not present in the server response but is required by entrezpy.base.analyzer.EutilsAnalyzer to parse responses and errors correctly. Several instance attributes are not required for a POST request but help debugging.

Each request is automatically assigned an id to identify and trace requests using the query id and request id.

Parameters:
  • eutil (str) – eutil function for this request, e.g. efetch.fcgi
  • db (str) – database for request

Initializes a new request with initial attributes as part of a query in entrezpy.base.query.EutilsQuery.

Variables:
  • tool (str) – tool name to which this request belongs
  • url (str) – full Eutil url
  • contact (str) – user email
  • apikey (str) – NCBI apikey
  • query_id (str) – entrezpy.base.query.EutilsQuery.query_id which initiated this request
  • status (int) – request status: 0 = success, 1 = fail, 2 = queued
  • size (int) – size of request, e.g. number of UIDs
  • start_time (float) – start time of request in seconds since epoch
  • duration – duration for this request in seconds
  • doseq – set doseq parameter in entrezpy.request.Request.request()

Note

status is work in progress.

get_post_parameter()

Virtual function returning the POST parameters for the request from required attributes.

Return type:dict
Raises:NotImplementedError
prepare_base_qry(extend=None)

Returns instance attributes required for every POST request.

Parameters:extend (dict) – parameters extending basic parameters
Returns:base parameters for POST request
Return type:dict
set_status_success()

Set status if request succeeded

set_status_fail()

Set status if request failed

report_status(processed_requests=None, expected_requests=None)

Reports request status when triggered

get_request_id()
Returns:full request id
Return type:str
set_request_error(error)

Sets request error and HTTP/URL error message

Parameters:error (str) – HTTP/URL error
start_stopwatch()

Starts time to measure request duration.

calc_duration()

Calculates request duration

dump_internals(extend=None)

Dumps internal attributes for request.

Parameters:extend (dict) – extend dump with additional information

Analyzer

class entrezpy.base.analyzer.EutilsAnalyzer

EutilsAnalyzer is the base class for an entrezpy analyzer. It prepares the response based on the requested format and checks for E-Utilities errors. The function parse() is invoked after every request by the corresponding query class, e.g. Esearcher. This allows analyzing data as it arrives without waiting until larger queries have been fetched. This approach allows implementing analyzers which can store already downloaded data to establish checkpoints or trigger other actions based on the received data.

Two virtual methods, analyze_result() and analyze_error(), are the core and need their own implementation to support specific queries.

Note

Responses from NCBI are not very well documented and functions will be extended as new errors are encountered.

Inits EutilsAnalyzer with an unknown result type. The result needs to be set upon receiving the first response by init_result().

Variables:
  • hasErrorResponse (bool) – flag indicating error in response
  • result – result instance
known_fmts = {'json', 'text', 'xml'}

Store formats known to EutilsAnalyzer

init_result(response, request)

Virtual function to initialize the result instance. This allows setting attributes from the first response and request.

Parameters:response (dict or io.StringIO) – converted response from convert_response()
Raises:NotImplementedError – if implementation is missing
analyze_error(response, request)

Virtual function to handle error responses

Parameters:response (dict or io.StringIO) – converted response from convert_response()
Raises:NotImplementedError – if implementation is missing
analyze_result(response, request)

Virtual function to handle responses, i.e. parsing them and prepare them for entrezpy.base.result.EutilsResult

Parameters:response (dict or io.StringIO) – converted response from convert_response()
Raises:NotImplementedError – if implementation is missing
parse(raw_response, request)

Check for errors and calls parser for the raw response.

Parameters:
Raises:

NotImplementedError – if request format is not in EutilsAnalyzer.known_fmts

convert_response(raw_response_decoded, request)

Converts raw_response into the expected format, deduced from request and set via the retmode parameter.

Parameters:
Returns:

response in parseable format

Return type:

dict or io.stringIO

Note

Using threads without locks randomly loses the response, i.e. the raw response is emptied between requests. With locks it works, but threading is not much faster than non-threading. JSON seems more prone to this than XML.

isErrorResponse(response, request)

Checks for error messages in responses from the Entrez servers and sets the flag hasErrorResponse.

Parameters:
Returns:

error status

Return type:

bool

check_error_xml(response)

Checks for errors in XML responses

Parameters:response (io.stringIO) – XML response
Returns:if XML response has error message
Return type:bool
check_error_json(response)

Checks for errors in JSON responses. Not unified among Eutil functions.

Parameters:response (dict) – response
Returns:status if JSON response has error message
Return type:bool
isSuccess()

Test if response has errors

Return type:bool
get_result()

Return result

Returns:result instance
Return type:entrezpy.base.result.EutilsResult
follow_up()

Return follow-up parameters if available

Returns:Follow-up parameters
Return type:dict
isEmpty()

Test for empty result

Return type:bool

Result

class entrezpy.base.result.EutilsResult(function, qid, db, webenv=None, querykey=None)

EutilsResult is the base class for an entrezpy result. It sets the required result attributes common to all results and declares virtual functions to interact with other entrezpy classes. Empty results are considered successful results since no query error has been received. entrezpy.base.result.EutilsResult.size() is important to

  • determine if and how many follow-up requests are required
  • determine if it is an empty result
Parameters:
  • function (string) – EUtil function of the result
  • qid (string) – query id
  • db (string) – Entrez database name for result
  • webenv (string) – WebEnv of response
  • querykey (int) – querykey of response
size()

Returns result size in the corresponding ResultSize unit

Return type:int
Raises:NotImplementedError – if implementation is missing
dump()

Dumps all instance attributes

Return type:dict
Raises:NotImplementedError – if implementation is missing

get_link_parameter(reqnum=0)

Assembles parameters for automated follow-ups. Uses the query key from the first request by default.

Parameters:reqnum (int) – request number for which query_key should be returned
Returns:EUtils parameters
Return type:dict
Raises:NotImplementedError – if implementation is missing
isEmpty()

Indicates empty result.

Return type:bool
Raises:NotImplementedError – if implementation is missing

Monitor

EPost modules

Query

Inheritance diagram of entrezpy.epost.eposter
class entrezpy.epost.eposter.Eposter(tool, email, apikey=None, apikey_var=None, threads=None, qid=None)

Eposter implements Epost queries to E-Utilities [0]. EPost posts UIDs to the History server. Without a passed WebEnv, a new WebEnv and corresponding QueryKey are returned. With a given WebEnv, the posted UIDs are added to this WebEnv and the corresponding QueryKey is returned. All parameters described in [0] are accepted. [0]: https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EPost

Inits Eposter instance with given attributes.

Parameters:
  • tool (str) – tool name
  • email (str) – user email
  • apikey (str) – NCBI apikey
  • apikey_var (str) – environment variable storing the NCBI apikey
  • threads (int) – set threads for multithreading
  • qid (str) – unique query id
inquire(parameter, analyzer=<entrezpy.epost.epost_analyzer.EpostAnalyzer object>)
Implements entrezpy.base.query.EutilsQuery.inquire() and posts UIDs to Entrez.
Epost requires only one request.
Parameters:
Returns:

analyzer or None if request errors have been encountered

Return type:

entrezpy.base.analyzer.EntrezpyAnalyzer instance or None

Parameter

Inheritance diagram of entrezpy.epost.epost_parameter
class entrezpy.epost.epost_parameter.EpostParameter(parameter)

EpostParameter checks query specific parameters and configures an entrezpy.epost.epost_query.EpostQuery instance. XML is enforced since EPost only responds in XML. Epost requests don't have follow-ups.

Parameters:

parameter (dict) – Eutils Epost parameters

Variables:
  • uids (list) – UIDs to post
  • retmode (str) – fix retmode to XML
  • query_size (int) – size of query, here number of UIDs
  • request_size (int) – size of request, here the number of UIDs
  • expected_requests (int) – number of expected requests, here 1
check()

Implements entrezpy.base.parameter.EutilsParameter.check() by checking for missing database parameter and UIDs.

dump()

Dump instance variables

Return type:dict
haveDb()

Check for required db parameter

Return type:bool
haveExpectedRequets()

Check for expected requests. Hints at an error if no requests are expected.

Return type:bool
haveQuerykey()

Check for required QueryKey parameter

Return type:bool
haveWebenv()

Check for required WebEnv parameter

Return type:bool
useHistory()

Check if history server should be used.

Return type:bool

Request

Inheritance diagram of entrezpy.epost.epost_request
class entrezpy.epost.epost_request.EpostRequest(eutil, parameter)

EpostRequest implements a single request as part of an Epost query. It stores and prepares the parameters for a single request. See entrezpy.epost.epost_parameter.EpostParameter for parameter description.

Parameters:
get_post_parameter()

Implements entrezpy.base.request.EutilsRequest.get_post_parameter()

dump()

Dump instance attributes

Return type:dict
calc_duration()

Calculates request duration

dump_internals(extend=None)

Dumps internal attributes for request.

Parameters:extend (dict) – extend dump with additional information
get_request_id()
Returns:full request id
Return type:str
prepare_base_qry(extend=None)

Returns instance attributes required for every POST request.

Parameters:extend (dict) – parameters extending basic parameters
Returns:base parameters for POST request
Return type:dict
report_status(processed_requests=None, expected_requests=None)

Reports request status when triggered

set_request_error(error)

Sets request error and HTTP/URL error message

Parameters:error (str) – HTTP/URL error
set_status_fail()

Set status if request failed

set_status_success()

Set status if request succeeded

start_stopwatch()

Starts time to measure request duration.

Analyzer

Inheritance diagram of entrezpy.epost.epost_analyzer
class entrezpy.epost.epost_analyzer.EpostAnalyzer

EpostAnalyzer implements the analysis of EPost responses from E-Utils. Epost puts UIDs onto the History server and returns the corresponding WebEnv and QueryKey. EPost only responds in XML; therefore, a dictionary imitating a JSON input is assembled and passed as result to entrezpy.epost.epost_result.EpostResult

init_result(response, request)

Implements entrezpy.base.analyzer.EutilsAnalyzer.init_result() and inits entrezpy.epost.epost_result.EpostResult.

analyze_result(response, request)

Implements entrezpy.base.analyzer.EutilsAnalyzer.analyze_result(). The response is one WebEnv and QueryKey and the result can be initiated after parsing them.

Parameters:
  • response – EUtils response
  • request – entrezpy request
analyze_error(response, request)

Implements entrezpy.base.analyzer.EutilsAnalyzer.analyze_error().

Parameters:
  • response – EUtils response
  • request – entrezpy request

Result

Inheritance diagram of entrezpy.epost.epost_result
class entrezpy.epost.epost_result.EpostResult(response, request)

EpostResult stores WebEnv and QueryKey from posting UIDs to the History server. Since no limit is imposed on the number of UIDs which can be posted in one query, the size of the result is the size of the request and only one WebEnv and QueryKey are returned.

Parameters:
  • request – entrezpy Epost request instance
  • response (dict) – response
Request type:

entrezpy.epost.epost_request.EpostRequest

Variables:

uids (list) – posted UIDs

dump()

Dumps all instance attributes

Return type:dict
Raises:NotImplementedError – if implementation is missing

get_link_parameter(reqnum=0)

Assembles parameters for automated follow-ups. Uses the query key from the first request by default.

Parameters:reqnum (int) – request number for which query_key should be returned
Returns:EUtils parameters
Return type:dict
Raises:NotImplementedError – if implementation is missing
size()

Returns result size in the corresponding ResultSize unit

Return type:int
Raises:NotImplementedError – if implementation is missing
isEmpty()

Indicates empty result.

Return type:bool
Raises:NotImplementedError – if implementation is missing

Esearch modules

Esearcher

Inheritance diagram of entrezpy.esearch.esearcher
class entrezpy.esearch.esearcher.Esearcher(tool, email, apikey=None, apikey_var=None, threads=None, qid=None)

Bases: entrezpy.base.query.EutilsQuery

Esearcher implements ESearch queries to NCBI’s E-Utilities. Esearch queries return UIDs or WebEnv/QueryKey references to Entrez’ History server. Esearcher implements entrezpy.base.query.EutilsQuery.inquire() which analyzes the first result and automatically configures subsequent requests to get all queried UIDs if required.

inquire(parameter, analyzer=<entrezpy.esearch.esearch_analyzer.EsearchAnalyzer object>)

Implements entrezpy.base.query.EutilsQuery.inquire() and configures follow-up requests if required.

Parameters:
Returns:

analyzer instance or None if request errors have been encountered

Return type:

entrezpy.esearch.esearch_analyzer.EsearchAnalyzer or None

Does first request and triggers follow-up if required or possible.

Parameters:
Returns:

follow-up parameter or None

Return type:

entrezpy.esearch.esearch_parameter.EsearchParamater or None

isGoodQuery()

Tests for request errors

rtype:bool
entrezpy.esearch.esearcher.configure_follow_up(parameter, analyzer)

Adjusts the EsearchParameter for follow-up requests based on the initial Esearch result. Remaining UIDs are fetched using the History server.

Parameters:
  • analyzer (entrezpy.search.esearch_analyzer.EsearchAnalyzer) – Esearch analyzer instance
  • parameter – Initial Esearch parameter
entrezpy.esearch.esearcher.reachedLimit(parameter, analyzer)

Checks if the set limit has been reached

Return type:bool

EsearchParameter

Inheritance diagram of entrezpy.esearch.esearch_parameter
entrezpy.esearch.esearch_parameter.MAX_REQUEST_SIZE = 100000

Maximum number of UIDs for one request

class entrezpy.esearch.esearch_parameter.EsearchParameter(parameter)

Bases: entrezpy.base.parameter.EutilsParameter

EsearchParameter checks query specific parameters and configures an Esearch query. If more than one request is required the instance is reconfigured by entrezpy.esearch.esearcher.Esearcher.configure_follow_up().

Note

EsearchParameter works best when using the NCBI Entrez history server. If usehistory is not used, linking requests cannot be guaranteed.

goodDateparam()
Return type:bool
useMinMaxDate()
Return type:bool
set_uilist(rettype)
Return type:bool
adjust_retmax(retmax)

Adjusts retmax parameter. Order of check is crucial.

Parameters:retmax (int) – retmax value
Returns:adjusted retmax
Return type:int
adjust_reqsize(request_size)

Adjusts request size for low retmax

Returns:adjusted request size
Return type:int
calculate_expected_requests(qsize=None, reqsize=None)

Calculates and sets the expected number of requests. Uses internal parameters if none are provided.

Parameters:
  • qsize (int or None) – query size, i.e. expected number of data sets
  • reqsize (int) – number of data sets to fetch in one request
check()

Implements entrezpy.base.parameter.EutilsParameter.check to check for the minimum required parameters. Aborts if any check fails.

dump()

Dump instance attributes

Return type:dict
Raises:NotImplementedError – if not implemented
haveDb()

Check for required db parameter

Return type:bool
haveExpectedRequets()

Check for expected requests. Hints at an error if no requests are expected.

Return type:bool
haveQuerykey()

Check for required QueryKey parameter

Return type:bool
haveWebenv()

Check for required WebEnv parameter

Return type:bool
useHistory()

Check if history server should be used.

Return type:bool

EsearchAnalyzer

Inheritance diagram of entrezpy.esearch.esearch_analyzer
class entrezpy.esearch.esearch_analyzer.EsearchAnalyzer

Bases: entrezpy.base.analyzer.EutilsAnalyzer

EsearchAnalyzer implements the analysis of ESearch responses from E-Utils. JSON formatted data is enforced in responses. Results are stored as an entrezpy.esearch.esearch_result.EsearchResult instance.

Variables:result – entrezpy.esearch.esearch_result.EsearchResult
init_result(response, request)

Inits entrezpy.esearch.esearch_result.EsearchResult.

Returns:if result is initiated
Return type:bool
analyze_result(response, request)

Implements entrezpy.base.analyzer.EutilsAnalyzer.analyze_result().

Parameters:
analyze_error(response, request)

Implements entrezpy.base.analyzer.EutilsAnalyzer.analyze_error().

Parameters:
  • response (dict) – Esearch response
  • request (entrezpy.esearch.esearch_request.EsearchRequest) – Esearch request
size()

Returns number of analyzed UIDs in result

Return type:int
query_size()

Returns number of expected UIDs in result

Return type:int
reference()

Returns History Server references from result

Returns:History Server references
Return type:entrezpy.base.referencer.EutilReferencer.Reference
adjust_followup(parameter)

Adjusts result attributes from follow-up.

Parameters:
check_error_json(response)

Checks for errors in JSON responses. Not unified among Eutil functions.

Parameters:response (dict) – response
Returns:status if JSON response has error message
Return type:bool
check_error_xml(response)

Checks for errors in XML responses

Parameters:response (io.stringIO) – XML response
Returns:if XML response has error message
Return type:bool
convert_response(raw_response_decoded, request)

Converts raw_response into the expected format, deduced from request and set via the retmode parameter.

Parameters:
Returns:

response in parseable format

Return type:

dict or io.stringIO

Note

Using threads without locks randomly loses the response, i.e. the raw response is emptied between requests. With locks it works, but threading is not much faster than non-threading. JSON seems more prone to this than XML.

follow_up()

Return follow-up parameters if available

Returns:Follow-up parameters
Return type:dict
get_result()

Return result

Returns:result instance
Return type:entrezpy.base.result.EutilsResult
isEmpty()

Test for empty result

Return type:bool
isErrorResponse(response, request)

Checks for error messages in responses from the Entrez servers and sets the flag hasErrorResponse.

Parameters:
Returns:

error status

Return type:

bool

isSuccess()

Test if response has errors

Return type:bool
known_fmts = {'json', 'text', 'xml'}
parse(raw_response, request)

Check for errors and calls parser for the raw response.

Parameters:
Raises:

NotImplementedError – if request format is not in EutilsAnalyzer.known_fmts

EsearchRequest

Inheritance diagram of entrezpy.esearch.esearch_request
class entrezpy.esearch.esearch_request.EsearchRequest(eutil, parameter, start, size)

Bases: entrezpy.base.request.EutilsRequest

The EsearchRequest class implements a single request as part of an Esearch query. It stores and prepares the parameters for a single request. See entrezpy.esearch.esearch_parameter.EsearchParameter for parameter description. Request sizes are configured by setting a start, i.e. the index of the first UID to fetch, and a size, i.e. how many to fetch. These are set by entrezpy.esearch.esearcher.Esearcher.inquire().

Parameters:
get_post_parameter()

Virtual function returning the POST parameters for the request from required attributes.

Return type:dict
Raises:NotImplementedError
dump()
Return type:dict
calc_duration()

Calculates request duration

dump_internals(extend=None)

Dumps internal attributes for request.

Parameters:extend (dict) – extend dump with additional information
get_request_id()
Returns:full request id
Return type:str
prepare_base_qry(extend=None)

Returns instance attributes required for every POST request.

Parameters:extend (dict) – parameters extending basic parameters
Returns:base parameters for POST request
Return type:dict
report_status(processed_requests=None, expected_requests=None)

Reports request status when triggered

set_request_error(error)

Sets request error and HTTP/URL error message

Parameters:error (str) – HTTP/URL error
set_status_fail()

Set status if request failed

set_status_success()

Set status if request succeeded

start_stopwatch()

Starts time to measure request duration.

EsearchResult

Inheritance diagram of entrezpy.esearch.esearch_result
class entrezpy.esearch.esearch_result.EsearchResult(response, request)

Bases: entrezpy.base.result.EutilsResult

EsearchResult stores fetched UIDs and/or WebEnv-QueryKeys and creates follow-up parameters. UIDs are stored as strings, even when numeric, since responses can also contain accessions when using the idtype option.

Parameters:
Variables:

uids (list) – analyzed UIDs from response

dump()
Return type:dict

get_link_parameter(reqnum=0)

Assembles follow-up parameters for linking. The first request returns all required information, so its query key is used.

Return type:dict
isEmpty()

Empty search result has no webenv/querykey and/or no fetched UIDs.

size()

Returns number of analyzed UIDs.

Return type:int
query_size()

Returns number of all UIDs for search (count).

Return type:int
add_response(response)

Adds responses from individual requests.

Parameters:response (dict) – Esearch response

Efetch modules

Efetcher

Inheritance diagram of entrezpy.efetch.efetcher
class entrezpy.efetch.efetcher.Efetcher(tool, email, apikey=None, apikey_var=None, threads=None, qid=None)

Bases: entrezpy.base.query.EutilsQuery

Efetcher implements Efetch E-Utilities queries [0]. It implements entrezpy.base.query.EutilsQuery.inquire() to fetch data from NCBI Entrez servers. [0]: https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch [1]: https://www.ncbi.nlm.nih.gov/books/NBK25497/table/ chapter2.T._entrez_unique_identifiers_ui/?report=objectonly

Variables:result – entrezpy.base.result.EutilsResult
inquire(parameter, analyzer=<entrezpy.efetch.efetch_analyzer.EfetchAnalyzer object>)

Implements entrezpy.base.query.EutilsQuery.inquire() and configures fetch.

Note

Efetch prefers to know the number of UIDs to fetch, i.e. number of UIDs or retmax. If this information is missing, the max number of UIDs for the specific retmode and rettype are fetched.

Parameters:
Returns:

analyzer instance or None if request errors have been encountered

Return type:

entrezpy.base.analyzer.EutilsAnalyzer or None

EfetchParameter

Inheritance diagram of entrezpy.efetch.efetch_parameter
entrezpy.efetch.efetch_parameter.DEF_RETMODE = 'xml'

Default retmode for fetch requests

class entrezpy.efetch.efetch_parameter.EfetchParameter(param)

Bases: entrezpy.base.parameter.EutilsParameter

EfetchParameter implements checks and configures an Efetch query. A fetch query knows its size due to the id parameter or an earlier result stored on the Entrez History server using WebEnv and query_key. The default retmode (fetch format) is set to XML because all E-Utilities can return XML but, unfortunately, not all can return JSON.

req_limits = {'json': 500, 'text': 10000, 'xml': 10000}

Max number of UIDs to fetch per request mode

valid_retmodes = {'gene': {'text', 'xml'}, 'nuccore': {'text', 'xml'}, 'pmc': {'xml'}, 'poset': {'text', 'xml'}, 'protein': {'text', 'xml'}, 'pubmed': {'text', 'xml'}, 'sequences': {'text', 'xml'}}

Supported retmodes for fetch requests, by Entrez database

adjust_retmax(retmax)

Adjusts retmax parameter. Order of check is crucial.

Parameters:retmax (int) – retmax value
Returns:adjusted retmax or None if all UIDs are fetched
Return type:int or None
check_retmode(retmode)

Checks for valid retmode and retmode combination

Parameters:retmode (str) – retmode parameter
Returns:retmode
Return type:str
adjust_reqsize(reqsize)

Adjusts request size for query

Parameters:reqsize (str or None) – Request size parameter
Returns:adjusted request size
Return type:int
calculate_expected_requests(qsize=None, reqsize=None)

Calculates and sets the expected number of requests. Uses internal parameters if none are provided.

Parameters:
  • qsize (int or None) – query size, i.e. expected number of data sets
  • reqsize (int) – number of data sets to fetch in one request
haveDb()

Check for required db parameter

Return type:bool
haveExpectedRequets()

Check for expected requests. Hints at an error if no requests are expected.

Return type:bool
haveQuerykey()

Check for required QueryKey parameter

Return type:bool
haveWebenv()

Check for required WebEnv parameter

Return type:bool
useHistory()

Check if history server should be used.

Return type:bool
check()

Implements entrezpy.base.parameter.EutilsParameter.check to check for the minimum required parameters. Aborts if any check fails.

dump()

Dump instance attributes

Return type:dict
Raises:NotImplementedError – if not implemented

EfetchAnalyzer

Inheritance diagram of entrezpy.efetch.efetch_analyzer
class entrezpy.efetch.efetch_analyzer.EfetchAnalyzer

Bases: entrezpy.base.analyzer.EutilsAnalyzer

EfetchAnalyzer implements a basic analysis of Efetch E-Utils responses. Stores results in a entrezpy.efetch.efetch_result.EfetchResult instance.

Note

This is a very superficial analyzer for documentation and educational purposes. In almost all cases a more specific analyzer has to be implemented in inheriting entrezpy.base.analyzer.EutilsAnalyzer and implementing the virtual functions entrezpy.base.analyzer.EutilsAnalzyer.analyze_result() and entrezpy.base.analyzer.EutilsAnalzyer.analyze_error().

Variables:result – entrezpy.efetch.efetch_result.EfetchResult
init_result(response, request)

Should be implemented if used properly

analyze_result(response, request)

Virtual function to handle responses, i.e. parsing them and prepare them for entrezpy.base.result.EutilsResult

Parameters:response (dict or io.StringIO) – converted response from convert_response()
Raises:NotImplementedError – if implementation is missing
analyze_error(response, request)

Virtual function to handle error responses

Parameters:response (dict or io.StringIO) – converted response from convert_response()
Raises:NotImplementedError – if implementation is missing
norm_response(response, rettype=None)

Normalizes response for printing

Parameters:response (dict or io.StringIO) – efetch response
Returns:str or dict
isEmpty()

Test for empty result

Return type:bool
check_error_json(response)

Checks for errors in JSON responses. Not unified among Eutil functions.

Parameters:response (dict) – response
Returns:status if JSON response has error message
Return type:bool
check_error_xml(response)

Checks for errors in XML responses

Parameters:response (io.stringIO) – XML response
Returns:if XML response has error message
Return type:bool
convert_response(raw_response_decoded, request)

Converts raw_response into the expected format, deduced from request and set via the retmode parameter.

Parameters:
Returns:

response in parseable format

Return type:

dict or io.stringIO

Note

Using threads without locks randomly loses the response, i.e. the raw response is emptied between requests. With locks it works, but threading is not much faster than non-threading. JSON seems more prone to this than XML.

follow_up()

Return follow-up parameters if available

Returns:Follow-up parameters
Return type:dict
get_result()

Return result

Returns:result instance
Return type:entrezpy.base.result.EutilsResult
isErrorResponse(response, request)

Checks for error messages in responses from the Entrez servers and sets the flag hasErrorResponse.

Parameters:
Returns:

error status

Return type:

bool

isSuccess()

Test if response has errors

Return type:bool
known_fmts = {'json', 'text', 'xml'}
parse(raw_response, request)

Check for errors and calls parser for the raw response.

Parameters:
Raises:

NotImplementedError – if request format is not in EutilsAnalyzer.known_fmts

EfetchRequest

Inheritance diagram of entrezpy.efetch.efetch_request
class entrezpy.efetch.efetch_request.EfetchRequest(eutil, parameter, start, size)

Bases: entrezpy.base.request.EutilsRequest

The EfetchRequest class implements a single request as part of an Efetch query. It stores and prepares the parameters for a single request. entrezpy.efetch.efetcher.Efetcher.inquire() calculates start and size for a single request.

Parameters:
get_post_parameter()

Virtual function returning the POST parameters for the request from required attributes.

Return type:dict
Raises:NotImplementedError
dump()

Dumps instance attributes

calc_duration()

Calculates request duration

dump_internals(extend=None)

Dumps internal attributes for request.

Parameters:extend (dict) – extend dump with additional information
get_request_id()
Returns:full request id
Return type:str
prepare_base_qry(extend=None)

Returns instance attributes required for every POST request.

Parameters:extend (dict) – parameters extending basic parameters
Returns:base parameters for POST request
Return type:dict
report_status(processed_requests=None, expected_requests=None)

Reports request status when triggered

set_request_error(error)

Sets request error and HTTP/URL error message

Parameters:error (str) – HTTP/URL error
set_status_fail()

Set status if request failed

set_status_success()

Set status if request succeeded

start_stopwatch()

Starts time to measure request duration.

Requester module

Requester

class entrezpy.requester.requester.Requester(wait, max_retries=9, init_timeout=10, timeout_max=60, timeout_step=5)

Requester implements the sending of HTTP POST requests and the receiving of the results. It checks for request connection errors and performs retries when possible. If the maximum number of retries is reached, the request is considered failed. In case of connection errors, it aborts if the error is not due to a timeout. The initial timeout is increased in steps until the maximum timeout has been reached.

Parameters:
  • wait (float) – wait time in seconds between requests
  • max_retries (int) – number of retries before giving up
  • init_timeout (int) – number of seconds before the initial request is considered a timeout error
  • timeout_max (int) – maximum request timeout before giving up
  • timeout_step (int) – increase value for timeout errors
request(req)

Performs the request

Parameters:req (entrezpy.base.request.EutilsRequest) – entrezpy request
run_one_request(request, monitor)

Processes one request from the queue and logs its progress.

Parameters:request (entrezpy.base.request.EutilsRequest) – single entrezpy request

Conduit module

Conduit

Inheritance diagram of entrezpy.conduit
class entrezpy.conduit.Conduit(email, apikey=None, apikey_envar=None, threads=None)

Conduit simplifies creating pipelines and queries for entrezpy. Conduit stores results from previous requests, allowing queries to be concatenated and previously obtained results to be retrieved later, reducing the need to re-download data. Conduit can use multiple threads to speed up data download, but some external libraries can break, e.g. SQLite3.

Query instances in Conduit.Pipeline pipelines are stored in the dictionary Conduit.queries with the query id as key and are accessible by all Conduit instances. A single Conduit.Pipeline stores only the query ids for this instance.

Parameters:
  • email (str) – user email
  • apikey (str) – NCBI apikey
  • apikey_var (str) – environment variable storing the NCBI apikey
  • threads (int) – set threads for multithreading
queries = {}

Query storage

analyzers = {}

Analyzed query storage

class Query(function, parameter, dependency=None, analyzer=None)

Entrezpy query for a Conduit pipeline. Conduit assembles pipelines using several Query() instances. If a dependency is given, its parameters are used as a basis via resolve_dependency().

Parameters:
  • function (str) – Eutils function
  • parameter (dict) – function parameters
  • dependency (str) – query id from earlier query
  • analyzer (entrezpy.base.analyzer.EutilsAnalyzer) – analyzer instance for this query
resolve_dependency()

Resolves dependencies to obtain parameters from an earlier query. Parameters passed to this instance overwrite the dependency's parameters.
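
The effect of resolving a dependency can be pictured as a simple parameter merge; the following is only a conceptual sketch with made-up values, not the actual implementation.

# Conceptual sketch of dependency resolution; values are made up.
inherited = {'db': 'nucleotide', 'WebEnv': 'MCID_example', 'query_key': 1}  # from the earlier query
own = {'rettype': 'fasta', 'retmode': 'text'}                               # passed to this query
resolved = {**inherited, **own}  # parameters passed to this instance overwrite inherited ones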

dump()
class Pipeline

The Pipeline class implements a query pipeline with several consecutive queries. New pipelines are obtained through Conduit. Query instances are stored in Conduit.queries and the corresponding query ids in queries. Every added query returns its id, which can be used to retrieve it later.

Variables:queries – queries for this Pipeline instance

add_search(parameter=None, dependency=None, analyzer=None)

Adds Esearch query.

Parameters:
  • parameter (dict) – Esearch parameters
  • dependency (str) – query id from earlier query
  • analyzer (entrezpy.base.analyzer.EutilsAnalyzer) – analyzer instance for this query
Returns:Conduit query
Return type:ConduitQuery

add_link(parameter=None, dependency=None, analyzer=None)

Adds Elink query. Signature as Conduit.Pipeline.add_search()

add_post(parameter=None, dependency=None, analyzer=None)

Adds Epost query. Signature as Conduit.Pipeline.add_search()

add_summary(parameter=None, dependency=None, analyzer=None)

Adds Esummary query. Signature as Conduit.Pipeline.add_search()

add_fetch(parameter=None, dependency=None, analyzer=None)

Adds Efetch query. Same signature as Conduit.Pipeline.add_search(), but an analyzer is required because this step obtains highly variable results.
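
Since the fetch step requires an analyzer, a hedged sketch of adding an Efetch step with an explicit analyzer instance is shown below; pipeline and sid are placeholders for an existing pipeline and the id of an earlier step, and the default EfetchAnalyzer is used only as an example of an analyzer instance.

# Hypothetical sketch; 'pipeline' and 'sid' are placeholders from earlier steps.
import entrezpy.efetch.efetch_analyzer

fid = pipeline.add_fetch({'retmode': 'text', 'rettype': 'fasta'},
                         dependency=sid,
                         analyzer=entrezpy.efetch.efetch_analyzer.EfetchAnalyzer())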

add_query(query)

Adds query to own pipeline and storage

Parameters:query (Conduit.Query) – Conduit query
Returns:query id of added query
Return type:str
run(pipeline)

Runs the queries in the pipeline one after another and checks each for errors. If an error is encountered, the pipeline aborts.

Parameters:pipeline (Conduit.Pipeline) – Conduit pipeline
check_query(query)

Checks whether the query ran successfully.

Parameters:query (Conduit.Query) – Conduit query
get_result(query_id)

Returns stored result from a previous run.

Parameters:query_id (str) – query id
Returns:Result from this query
Return type:entrezpy.base.result.EutilsResult
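
A hedged usage sketch for retrieving a stored result by query id; it assumes c is an existing Conduit instance and that the value returned when adding a query serves as its query id. The search parameters are placeholders.

# Hypothetical sketch; 'c' is an existing Conduit instance.
pipeline = c.new_pipeline()
qid = pipeline.add_search({'db': 'pubmed', 'term': 'hemoglobin', 'rettype': 'uilist'})
c.run(pipeline)
result = c.get_result(qid)  # stored result from the query identified by qid
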
new_pipeline()

Returns a new Conduit pipeline.

Returns:Conduit pipeline
Return type:Conduit.Pipeline
search(query, analyzer=<class 'entrezpy.esearch.esearch_analyzer.EsearchAnalyzer'>)

Configures and runs an Esearch query. Analyzers are passed as class references and instantiated here.

Parameters:
  • query (Conduit.Query) – Conduit Query
  • analyzer – reference to analyzer class
Returns:analyzer
Return type:entrezpy.esearch.esearch_analyzer.EsearchAnalyzer

summarize(query, analyzer=<class 'entrezpy.esummary.esummary_analyzer.EsummaryAnalyzer'>)

Configures and runs an Esummary query. Analyzers are passed as class references and instantiated here.

Parameters:
  • query (Conduit.Query) – Conduit Query
  • analyzer – reference to analyzer class
Returns:analyzer
Return type:entrezpy.esummary.esummary_analyzer.EsummaryAnalyzer

link(query, analyzer=<class 'entrezpy.elink.elink_analyzer.ElinkAnalyzer'>)

Configures and runs an Elink query. Analyzers are passed as class references and instantiated here.

Parameters:
  • query (Conduit.Query) – Conduit Query
  • analyzer – reference to analyzer class
Returns:analyzer
Return type:entrezpy.elink.elink_analyzer.ElinkAnalyzer

post(query, analyzer=<class 'entrezpy.epost.epost_analyzer.EpostAnalyzer'>)

Configures and runs an Epost query. Analyzers are passed as class references and instantiated here.

Parameters:
  • query (Conduit.Query) – Conduit Query
  • analyzer – reference to analyzer class
Returns:analyzer
Return type:entrezpy.epost.epost_analyzer.EpostAnalyzer

fetch(query, analyzer=<class 'entrezpy.efetch.efetch_analyzer.EfetchAnalyzer'>)

Runs an Efetch query. The analyzer needs to be added to the query.

Parameters:
  • query (Conduit.Query) – Conduit Query
  • analyzer – reference to analyzer class
Returns:analyzer
Return type:entrezpy.efetch.efetch_analyzer.EfetchAnalyzer

Glossary

NCBI
National Center for Biotechnology Information, https://www.ncbi.nlm.nih.gov
E-Utilities
E-Utility
Collection of NCBI tools handling queries to Entrez
Entrez
NCBI database servers storing biomedical data and literature
UID
UIDs
Document identifier unique within one Entrez database
source database
The database from which UIDs are linked
target database
The database to which UIDs are linked
WebEnv
String referencing an E-Utility query
querykey
query_key
Number referencing a specific request for a WebEnv
Entrezpy query
Entrezpy queries
entrezpy query
entrezpy queries
A query to one E-Utility function in entrezpy is considered one query, which can consist of several entrezpy requests.
Entrezpy request
Entrezpy requests
entrezpy request
entrezpy requests
One request as part of an entrezpy query.

Indices and tables