Fetching sequence metadata from Entrez¶

Prerequisites

Python 3.6 or higher is assumed.
entrezpy is either installed via PyPI or cloned from the git repository (Installation).
Basic familiarity with object-oriented Python, i.e. inheritance, is assumed.
The tutorial Fetching publication information from Entrez has been read.
The full implementation can be found in the repository at examples/tutorials/seqmetadata/seqmetadata-fetcher.py

Overview¶

This tutorial explains how to write a simple sequence docsum fetcher using entrezpy.conduit.Conduit and by adjusting entrezpy.base.result.EutilsResult and entrezpy.base.analyzer.EutilsAnalyzer. It is based on an esearch followed by fetching the data as docsum JSON. This tutorial is very similar to Fetching publication information from Entrez, the main differences being parsing JSON and using two steps in entrezpy.conduit.Conduit. The main steps are very similar, and the reader should look there for more details.

Outline

develop an entrezpy.conduit.Conduit pipeline
implement a Docsum data structure
inherit entrezpy.base.result.EutilsResult and entrezpy.base.analyzer.EutilsAnalyzer
implement the required virtual methods
add methods to derived classes

EFetch is NCBI’s Entrez utility responsible for fetching data records. Its manual lists all possible databases and which records (Record type) can be fetched in which format. Instead of efetch, however, we will use the E-Utility esummary to fetch Docsum data in JSON after an esearch step that uses accession numbers as the query, and we will replace the default analyzer with our own.

In entrezpy, a result (or query) is the sum of all individual requests required to obtain the whole query. esummary fetches data in batches. In this example, all batches are collected prior to printing the information to standard output. The method DocsumAnalyzer.analyze_result() can be adjusted to store or analyze the results from each batch as soon as they are fetched.
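The difference between collecting all batches first and handling each batch as it arrives can be sketched without any network access. The class below is a hypothetical stand-in, not part of entrezpy; BatchCollector, analyze_batch and on_batch are names invented for this illustration.

```python
class BatchCollector:
  """Hypothetical stand-in for an analyzer; not part of entrezpy."""

  def __init__(self, on_batch=None):
    self.records = {}          # everything collected so far
    self.on_batch = on_batch   # optional callback invoked once per batch

  def analyze_batch(self, batch):
    # Merge the batch into the overall result ...
    self.records.update(batch)
    # ... and, if requested, hand the batch over immediately.
    if self.on_batch is not None:
      self.on_batch(batch)


seen = []
collector = BatchCollector(on_batch=lambda batch: seen.append(sorted(batch)))
collector.analyze_batch({1: 'a', 2: 'b'})
collector.analyze_batch({3: 'c'})
print(len(collector.records), seen)
```

Without a callback, the collector behaves like the example in this tutorial: nothing is reported until all batches have been merged.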

A quick note on virtual functions

entrezpy is heavily based on virtual methods [1]. A virtual method is declared in the base class but implemented in the derived class. Every class inheriting the base class has to implement the virtual methods using the same signature and return the same result type as the base class. To implement the method in the derived class, you need to look up the method in the base class.
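One common Python idiom for virtual methods is to raise NotImplementedError in the base class. The following minimal sketch uses two hypothetical classes, BaseResult and ListResult, which are not entrezpy classes; it only illustrates the pattern described above.

```python
class BaseResult:
  """Hypothetical base class declaring a virtual method."""

  def size(self):
    # Declared here, but deliberately left without an implementation.
    raise NotImplementedError("derived classes must implement size()")


class ListResult(BaseResult):
  """Hypothetical derived class implementing the virtual method."""

  def __init__(self, records):
    self.records = list(records)

  def size(self):
    # Same signature and return type (int) as declared in the base class.
    return len(self.records)


result = ListResult(['a', 'b', 'c'])
print(result.size())
```

Calling size() on a BaseResult instance raises NotImplementedError, which makes a missing implementation visible as soon as the method is used.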

`Docsum` data structure¶

Before we start to write our implementation, we need to understand the structure of the received data. This can be done using the EDirect tools from NCBI. The result is printed to standard output. To examine it, it can either be stored in a file or, preferably, piped to a pager, e.g. less [2] or more [3]. These are installed on most *NIX systems.

Listing 11 Fetching Docsum data record for accession HOU142311 using EDirect’s esearch and esummary.¶

$ esearch -db nuccore -query HOU142311 | esummary -mode json

The entry should start and end as shown in Listing 12.

Listing 12 JSON Docsum data record for accession HOU142311. Only the first few attribute lines are shown for brevity.¶

{
    "header": {
        "type": "esummary",
        "version": "0.3"
    },
    "result": {
        "uids": [
            "1110864597"
        ],
        "1110864597": {
            "uid": "1110864597",
            "caption": "KX883530",
            "title": "Beihai levi-like virus 30 strain HOU142311 hypothetical protein genes, complete cds",
            "extra": "gi|1110864597|gb|KX883530.1|",
            "gi": 1110864597,
            "createdate": "2016/12/10",
            "updatedate": "2016/12/10",
            "flags": "",
            "taxid": 1922417,
            "slen": 4084,
            "biomol": "genomic",
            "moltype": "rna",
            "topology": "linear",
            "sourcedb": "insd",
            "segsetsize": "",
            "projectid": "0",
            "genome": "genomic",
            "subtype": "strain|host|country|collection_date",
            "subname": "HOU142311|horseshoe crab|China|2014",
            "assemblygi": "",
            "assemblyacc": "",
            "tech": "",
            "completeness": "",
            "geneticcode": "1",
            "strand": "",
            "organism": "Beihai levi-like virus 30",
            "strain": "HOU142311",
            "biosample": "",
        }
    }
}
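The layout of the response can be explored with Python's json module. The sketch below uses a trimmed copy of the record shown above, keeping only a handful of attributes, and walks it the same way the analyzer later in this tutorial does: result.uids lists the keys under which the individual records are stored.

```python
import json

# Trimmed copy of the response shown above; only a few attributes are kept.
raw = """
{
  "header": {"type": "esummary", "version": "0.3"},
  "result": {
    "uids": ["1110864597"],
    "1110864597": {
      "uid": "1110864597",
      "caption": "KX883530",
      "taxid": 1922417,
      "slen": 4084,
      "subtype": "strain|host|country|collection_date",
      "subname": "HOU142311|horseshoe crab|China|2014"
    }
  }
}
"""

response = json.loads(raw)
# Each entry in "uids" is also a key of "result" holding one record.
for uid in response['result']['uids']:
  record = response['result'][uid]
  print(uid, record['caption'], record['slen'])
```

Note that some values arrive as strings ("uid") and others as numbers ("taxid", "slen"), which is why a data class may want to convert them explicitly.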

The first step is to write a program to fetch the requested records. This can be done using the entrezpy.conduit.Conduit class.

Simple Conduit pipeline to fetch `Docsum` Records¶

We will write a simple entrezpy pipeline named seqmetadata-fetcher.py using entrezpy.conduit.Conduit to test and run our implementations. A simple entrezpy.conduit.Conduit pipeline requires two arguments:

user email

accession numbers

Listing 13 Basic entrezpy.conduit.Conduit pipeline to fetch Docsum data records. The required arguments are parsed by ArgumentParser.¶

#!/usr/bin/env python3


import os
import sys
import json
import argparse


# If entrezpy is installed using PyPI, uncomment the line 'import entrezpy'
# and comment out the 'sys.path.insert(...)' line
# import entrezpy
sys.path.insert(1, os.path.join(sys.path[0], '../../../src'))
# Import required entrezpy modules
import entrezpy.conduit
import entrezpy.base.result
import entrezpy.base.analyzer


def main():
  ap = argparse.ArgumentParser(description='Simple Sequence Metadata Fetcher. \
  Accessions are parsed from STDIN, one accession per line')
  ap.add_argument('--email',
                  type=str,
                  required=True,
                  help='email required by NCBI')
  ap.add_argument('--apikey',
                  type=str,
                  default=None,
                  help='NCBI apikey (optional)')
  ap.add_argument('-db',
                  type=str,
                  required=True,
                  help='Database to search ')
  args = ap.parse_args()

  c = entrezpy.conduit.Conduit(args.email)
  fetch_docsum = c.new_pipeline()
  sid = fetch_docsum.add_search({'db':args.db, 'term':','.join([str(x.strip()) for x in sys.stdin])})
  fetch_docsum.add_summary({'rettype':'docsum', 'retmode':'json'},
                            dependency=sid, analyzer=DocsumAnalyzer())

Lines 1-17: import standard Python libraries and entrezpy modules.
Lines 21-35: set up the argument parser.
Line 37: create a new entrezpy.conduit.Conduit instance with an email address.
Line 38: create a new pipeline instance using entrezpy.conduit.Conduit.new_pipeline().
Line 39: add a search request to the pipeline, with the database name from the user-passed argument and a search string assembled from standard input, using entrezpy.conduit.Conduit.Pipeline.add_search(). Store the query id in sid.
Line 40: add a summary step with the search query as dependency, using entrezpy.conduit.Conduit.Pipeline.add_summary().
Not shown in Listing 13: run the pipeline using entrezpy.conduit.Conduit.run() (see Listing 17).
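The search term is assembled by stripping each line read from standard input and joining the accessions with commas. The following sketch isolates that step in a hypothetical helper, build_term, which is not part of the tutorial's script. Unlike the listing, it also skips blank lines; that is an addition made for this illustration.

```python
import io


def build_term(lines):
  """Join accessions (one per line) into a comma-separated search term.

  Blank lines are skipped, which the pipeline listing does not do.
  """
  return ','.join(line.strip() for line in lines if line.strip())


# io.StringIO stands in for sys.stdin here.
fake_stdin = io.StringIO("NC_016134.3\nHOU142311\n\n")
print(build_term(fake_stdin))
```

Keeping this step in its own function makes it easy to test the input handling without contacting NCBI.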

We need to implement the DocsumAnalyzer, but first we have to design a Docsum data structure.

How to store `Docsum` data records¶

The data records can be stored in different ways, but using a class facilitates collecting and retrieving the requested data. We implement a simple class (analogous to a C/C++ struct [4]) to represent a Docsum record. Because we fetch data in JSON format, the class only needs to perform some rather dull parsing. The nested Subtype class handles the subtype and subname attributes in a Docsum response.

Listing 14 Implementing a Docsum data record¶

class Docsum:
  """Simple data class to store individual sequence Docsum records."""

  class Subtype:

    def __init__(self, subtype, subname):
      self.strain = None
      self.host = None
      self.country = None
      self.collection = None
      self.collection_date = None

      for i in range(len(subtype)):
        if subtype[i] == 'strain':
          self.strain = subname[i]
        if subtype[i] == 'host':
          self.host = subname[i]
        if subtype[i] == 'country':
          self.country = subname[i]
        if subtype[i] == 'collection_date':
          self.collection_date = subname[i]

  def __init__(self, json_docsum):
    self.uid = int(json_docsum['uid'])
    self.caption = json_docsum['caption']
    self.title = json_docsum['title']
    self.extra = json_docsum['extra']
    self.gi = int(json_docsum['gi'])
    self.taxid = int(json_docsum['taxid'])
    self.slen =  int(json_docsum['slen'])
    self.biomol =  json_docsum['biomol']
    self.moltype =  json_docsum['moltype']
    self.topology = json_docsum['topology']
    self.sourcedb = json_docsum['sourcedb']
    self.segsetsize = json_docsum['segsetsize']
    self.projectid = int(json_docsum['projectid'])
    self.genome = json_docsum['genome']
    self.subtype = Docsum.Subtype(json_docsum['subtype'].split('|'),
                                  json_docsum['subname'].split('|'))
    self.assemblygi = json_docsum['assemblygi']
    self.assemblyacc = json_docsum['assemblyacc']
    self.tech = json_docsum['tech']
    self.completeness = json_docsum['completeness']
    self.geneticcode = int(json_docsum['geneticcode'])
    self.strand = json_docsum['strand']
    self.organism = json_docsum['organism']
    self.strain = json_docsum['strain']
    self.accessionversion = json_docsum['accessionversion']
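The subtype and subname attributes are two pipe-separated strings whose fields correspond by position. As an alternative to looping over indices, the pairing can be expressed with zip. This is a minimal sketch using the values from the example record, not the tutorial's implementation.

```python
subtype = "strain|host|country|collection_date"
subname = "HOU142311|horseshoe crab|China|2014"

# Pair each qualifier name with its value by position.
qualifiers = dict(zip(subtype.split('|'), subname.split('|')))
print(qualifiers['host'])
# Qualifiers absent from the record can be looked up safely with get().
print(qualifiers.get('isolate'))
```

A dictionary keeps every qualifier the record provides, whereas a class with fixed attributes only keeps the ones it names explicitly.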

Implement `DocsumResult`¶

We have to implement the virtual methods declared in entrezpy.base.result.EutilsResult. The documentation informs us about the required parameters and expected return values.

In addition, we declare the method DocsumResult.add_docsum() to handle adding new Docsum data record instances as defined in Listing 14. The methods in this tutorial are trivial and we can implement the class in one go.

Listing 15 Implementing DocsumResult¶

class DocsumResult(entrezpy.base.result.EutilsResult):
  """Derive class entrezpy.base.result.EutilsResult to store Docsum queries.
  Individual Docsum records are implemented in :class:`Docsum` and
  stored in :ivar:`docsums`.

  :param response: inspected response from :class:`DocsumAnalyzer`
  :param request: the request for the current response
  :ivar dict docsums: storing Docsum instances"""

  def __init__(self, response, request):
    super().__init__(request.eutil, request.query_id, request.db)
    self.docsums = {}

  def size(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.size`
    returning the number of stored data records."""
    return len(self.docsums)

  def isEmpty(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.isEmpty`
    to query if any records have been stored at all."""
    if not self.docsums:
      return True
    return False

  def get_link_parameter(self, reqnum=0):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.get_link_parameter`.
    Fetching summary record has no intrinsic elink capabilities and therefore
    should inform users about this."""
    print("{} has no elink capability".format(self))
    return {}

  def dump(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.dump`.

    :return: instance attributes
    :rtype: dict
    """
    return {self:{'dump':{'docsum_records':[x for x in self.docsums],
                              'query_id': self.query_id, 'db':self.db,
                              'eutil':self.function}}}

  def add_docsum(self, docsum):
    """The only non-virtual and therefore DocsumResult-specific method to handle
    adding new data records"""
    self.docsums[docsum.uid] = docsum

Line 1: inherit the base class entrezpy.base.result.EutilsResult
Lines 10-12: initialize the DocsumResult instance with the required parameters and attributes. We don’t need any information from the response, e.g. WebEnv.
Lines 14-17: implement entrezpy.base.result.EutilsResult.size()
Lines 19-24: implement entrezpy.base.result.EutilsResult.isEmpty()
Lines 26-31: implement entrezpy.base.result.EutilsResult.get_link_parameter()
Lines 33-41: implement entrezpy.base.result.EutilsResult.dump()
Lines 43-46: DocsumResult-specific method to store individual Docsum instances

Note

The fetch result for Docsum records has no WebEnv value and is missing the originating database, since esummary is usually the last query within a series of Eutils queries. Therefore, we implement a warning informing the user that linking is not possible.

Implementing `DocsumAnalyzer`¶

We have to implement the virtual methods declared in entrezpy.base.analyzer.EutilsAnalyzer. The documentation informs us about the required parameters and expected return values.

Listing 16 Implementing DocsumAnalyzer¶

class DocsumAnalyzer(entrezpy.base.analyzer.EutilsAnalyzer):
  """Derived class of :class:`entrezpy.base.analyzer.EutilsAnalyzer` to analyze and
  parse Docsum responses and requests."""

  def __init__(self):
    super().__init__()

  def init_result(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.init_result`.
    This method initiates a result instance when analyzing the first response"""
    if self.result is None:
      self.result = DocsumResult(response, request)

  def analyze_error(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_error`. Since
    we expect JSON, just print the error to STDOUT as string."""
    print(json.dumps({__name__:{'Response': {'dump' : request.dump(),
                                             'error' : response}}}))

  def analyze_result(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_result`.
    The result is a JSON structure and allows easy parsing"""
    self.init_result(response, request)
    for i in response['result']['uids']:
      self.result.add_docsum(Docsum(response['result'][i]))

Line 1: inherit the base class entrezpy.base.analyzer.EutilsAnalyzer
Lines 5-6: initialize the DocsumAnalyzer instance.
Lines 8-12: implement entrezpy.base.analyzer.EutilsAnalyzer.init_result()
Lines 14-18: implement entrezpy.base.analyzer.EutilsAnalyzer.analyze_error()
Lines 20-25: implement entrezpy.base.analyzer.EutilsAnalyzer.analyze_result()

Compared to the PubMed analyzer, parsing the JSON output is very easy. If you already have a parser, you can use an object composition approach [5]. Further, you can add a method in analyze_result to store the processed data in a database or to implement checkpoints.
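Storing processed records in a database could look roughly like the sketch below, which uses Python's built-in sqlite3 module. The helper store_docsums and the table layout are assumptions made for this illustration and are not part of entrezpy; such a function could be called from analyze_result with the records of the current batch.

```python
import sqlite3


def store_docsums(connection, records):
  """Store (uid, caption, slen) tuples; the table layout is hypothetical."""
  connection.execute(
    "CREATE TABLE IF NOT EXISTS docsums "
    "(uid INTEGER PRIMARY KEY, caption TEXT, slen INTEGER)")
  # INSERT OR REPLACE keeps the table consistent if a batch is stored twice.
  connection.executemany(
    "INSERT OR REPLACE INTO docsums (uid, caption, slen) VALUES (?, ?, ?)",
    records)
  connection.commit()


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(':memory:')
store_docsums(conn, [(1110864597, 'KX883530', 4084)])
print(conn.execute("SELECT caption, slen FROM docsums").fetchall())
```

Because the uid is the primary key, re-running a batch after an interruption does not create duplicate rows, which is one simple way to approach checkpoints.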

Putting everything together¶

The completed implementation is shown in Listing 17.

Listing 17 Complete Docsum fetcher¶

#!/usr/bin/env python3

import os
import sys
import json
import argparse


# If entrezpy is installed using PyPI, uncomment the line 'import entrezpy'
# and comment out the 'sys.path.insert(...)' line
# import entrezpy
sys.path.insert(1, os.path.join(sys.path[0], '../../../src'))
# Import required entrezpy modules
import entrezpy.conduit
import entrezpy.base.result
import entrezpy.base.analyzer


class Docsum:
  """Simple data class to store individual sequence Docsum records."""

  class Subtype:

    def __init__(self, subtype, subname):
      self.strain = None
      self.host = None
      self.country = None
      self.collection = None
      self.collection_date = None

      for i in range(len(subtype)):
        if subtype[i] == 'strain':
          self.strain = subname[i]
        if subtype[i] == 'host':
          self.host = subname[i]
        if subtype[i] == 'country':
          self.country = subname[i]
        if subtype[i] == 'collection_date':
          self.collection_date = subname[i]

  def __init__(self, json_docsum):
    self.uid = int(json_docsum['uid'])
    self.caption = json_docsum['caption']
    self.title = json_docsum['title']
    self.extra = json_docsum['extra']
    self.gi = int(json_docsum['gi'])
    self.taxid = int(json_docsum['taxid'])
    self.slen =  int(json_docsum['slen'])
    self.biomol =  json_docsum['biomol']
    self.moltype =  json_docsum['moltype']
    self.topology = json_docsum['topology']
    self.sourcedb = json_docsum['sourcedb']
    self.segsetsize = json_docsum['segsetsize']
    self.projectid = int(json_docsum['projectid'])
    self.genome = json_docsum['genome']
    self.subtype = Docsum.Subtype(json_docsum['subtype'].split('|'),
                                  json_docsum['subname'].split('|'))
    self.assemblygi = json_docsum['assemblygi']
    self.assemblyacc = json_docsum['assemblyacc']
    self.tech = json_docsum['tech']
    self.completeness = json_docsum['completeness']
    self.geneticcode = int(json_docsum['geneticcode'])
    self.strand = json_docsum['strand']
    self.organism = json_docsum['organism']
    self.strain = json_docsum['strain']
    self.accessionversion = json_docsum['accessionversion']

class DocsumResult(entrezpy.base.result.EutilsResult):
  """Derive class entrezpy.base.result.EutilsResult to store Docsum queries.
  Individual Docsum records are implemented in :class:`Docsum` and
  stored in :ivar:`docsums`.

  :param response: inspected response from :class:`DocsumAnalyzer`
  :param request: the request for the current response
  :ivar dict docsums: storing Docsum instances"""

  def __init__(self, response, request):
    super().__init__(request.eutil, request.query_id, request.db)
    self.docsums = {}

  def size(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.size`
    returning the number of stored data records."""
    return len(self.docsums)

  def isEmpty(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.isEmpty`
    to query if any records have been stored at all."""
    if not self.docsums:
      return True
    return False

  def get_link_parameter(self, reqnum=0):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.get_link_parameter`.
    Fetching summary record has no intrinsic elink capabilities and therefore
    should inform users about this."""
    print("{} has no elink capability".format(self))
    return {}

  def dump(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.dump`.

    :return: instance attributes
    :rtype: dict
    """
    return {self:{'dump':{'docsum_records':[x for x in self.docsums],
                              'query_id': self.query_id, 'db':self.db,
                              'eutil':self.function}}}

  def add_docsum(self, docsum):
    """The only non-virtual and therefore DocsumResult-specific method to handle
    adding new data records"""
    self.docsums[docsum.uid] = docsum

class DocsumAnalyzer(entrezpy.base.analyzer.EutilsAnalyzer):
  """Derived class of :class:`entrezpy.base.analyzer.EutilsAnalyzer` to analyze and
  parse Docsum responses and requests."""

  def __init__(self):
    super().__init__()

  def init_result(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.init_result`.
    This method initiates a result instance when analyzing the first response"""
    if self.result is None:
      self.result = DocsumResult(response, request)

  def analyze_error(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_error`. Since
    we expect JSON, just print the error to STDOUT as string."""
    print(json.dumps({__name__:{'Response': {'dump' : request.dump(),
                                             'error' : response}}}))

  def analyze_result(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_result`.
    The result is a JSON structure and allows easy parsing"""
    self.init_result(response, request)
    for i in response['result']['uids']:
      self.result.add_docsum(Docsum(response['result'][i]))

def main():
  ap = argparse.ArgumentParser(description='Simple Sequence Metadata Fetcher. \
  Accessions are parsed from STDIN, one accession per line')
  ap.add_argument('--email',
                  type=str,
                  required=True,
                  help='email required by NCBI')
  ap.add_argument('--apikey',
                  type=str,
                  default=None,
                  help='NCBI apikey (optional)')
  ap.add_argument('-db',
                  type=str,
                  required=True,
                  help='Database to search ')
  args = ap.parse_args()

  c = entrezpy.conduit.Conduit(args.email)
  fetch_docsum = c.new_pipeline()
  sid = fetch_docsum.add_search({'db':args.db, 'term':','.join([str(x.strip()) for x in sys.stdin])})
  fetch_docsum.add_summary({'rettype':'docsum', 'retmode':'json'},
                            dependency=sid, analyzer=DocsumAnalyzer())
  docsums = c.run(fetch_docsum).get_result().docsums
  for i in docsums:
    print(i, docsums[i].uid, docsums[i].caption,docsums[i].strain, docsums[i].subtype.host)
  return 0

if __name__ == '__main__':
  main()

The implementation can be invoked as shown in Listing 18.

Listing 18 Fetching Docsum data for several accessions¶

$ echo "NC_016134.3" > accs
$ echo "HOU142311" >> accs
$ cat accs | python seqmetadata-fetcher.py --email email -db nuccore

Footnotes

[1]	https://en.wikipedia.org/wiki/Virtual_function

[2]	http://www.greenwoodsoftware.com/less/

[3]	https://mirrors.edge.kernel.org/pub/linux/utils/util-linux/

[4]	https://en.cppreference.com/w/c/language/struct

[5]	https://en.wikipedia.org/wiki/Object_composition
