Fetching publication information from Entrez

Prerequisites

Python 3.6 or higher is assumed.
entrezpy is either installed via PyPI or cloned from the git repository (Installation).
Basic familiarity with object-oriented Python, i.e. inheritance
The full implementation can be found in the repository at examples/tutorials/pubmed/pubmed-fetcher.py

Acknowledgment

I’d like to thank Pedram Hosseini (pdr[dot]hosseini[at]gmail[dot]com) for pointing out the requirement for this tutorial.

Overview

This tutorial explains how to write a simple PubMed data record fetcher using entrezpy.conduit.Conduit and by adjusting entrezpy.base.result.EutilsResult and entrezpy.base.analyzer.EutilsAnalyzer.

Outline

develop an entrezpy.conduit.Conduit pipeline
implement a PubMed data structure
inherit entrezpy.base.result.EutilsResult and entrezpy.base.analyzer.EutilsAnalyzer
implement the required virtual methods
add methods to derived classes

The Efetch Entrez Utility is NCBI’s utility responsible for fetching data records. Its manual lists all possible databases and which records (Record type) can be fetched in which format. For the first example, we’ll fetch PubMed data in XML, specifically the UID, authors, title, abstract, and citations. We will test and develop the pipeline using the article with PubMed ID (PMID) 26378223 because it has all the required fields. In the end, we will see that not all fields are always present.

In entrezpy, a result (or query) is the sum of all individual requests required to obtain the whole query. For example, if you want to analyze the number of citations for a specific author, the result is the number of citations obtained using a query. To obtain the final number, you have to parse several PubMed records. Therefore, entrezpy requires a result class derived from entrezpy.base.result.EutilsResult to store the partial results obtained from a query.

A quick note on virtual functions

entrezpy is heavily based on virtual methods [1]. A virtual method is declared in the base class but implemented in the derived class. Every class inheriting the base class has to implement the virtual methods using the same signature and return the same result type as the base class. To implement the method in the inherited class, you need to look up the method in the base class.
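The idea can be sketched in plain Python without entrezpy. The class and method names below (BaseResult, ListResult, is_empty) are hypothetical and chosen only for illustration; they are not part of the entrezpy API.

```python
class BaseResult:
    """Hypothetical base class that declares two virtual methods."""

    def size(self):
        # Declared here, but the derived class has to provide the body.
        raise NotImplementedError("size() must be implemented by the derived class")

    def is_empty(self):
        raise NotImplementedError("is_empty() must be implemented by the derived class")


class ListResult(BaseResult):
    """Derived class implementing both methods with the same signatures."""

    def __init__(self):
        self.items = []

    def size(self):
        return len(self.items)

    def is_empty(self):
        return not self.items


result = ListResult()
result.items.append("record")
print(result.size(), result.is_empty())
```

Calling size() on BaseResult directly raises NotImplementedError, which is one common way to signal in Python that a method is meant to be overridden.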

PubMed data structure

Before we start to write our implementation, we need to understand the structure of the received data. This can be done using the EDirect tools from NCBI. The result is printed to the standard output. For its examination, it can be either stored into a file, or preferably, piped to a pager, e.g. less [2] or more [3]. These are usually installed on most *NIX systems.

Listing 1 Fetching PubMed data record for PMID 26378223 using EDirect’s efetch

$ efetch -db pubmed -id 26378223 -mode XML | less
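If EDirect is not installed, the same record can be requested from NCBI's public E-utilities efetch endpoint over HTTP. The following is a minimal sketch that only builds the request URL and does not send anything; the helper name efetch_url is an assumption made for this example.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities endpoint for efetch.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(db, uids, retmode="xml"):
    """Build an efetch URL for a database and a list of UIDs (no request is sent)."""
    query = urlencode({
        "db": db,
        "id": ",".join(str(uid) for uid in uids),
        "retmode": retmode,
    })
    return "{}?{}".format(EFETCH_URL, query)


print(efetch_url("pubmed", [26378223]))
```

The printed URL can be opened in a browser or passed to a download tool to inspect the XML.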

The entry should start and end as shown in Listing 2.

Listing 2 XML PubMed data record for publication PMID 26378223. Data not related to authors, abstract, title, and references has been removed for clarity.

<?xml version="1.0" ?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2019//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">
<PubmedArticleSet>
<PubmedArticle>
    <!- SKIPPED DATA ->
        <Article PubModel="Print">
            <!- SKIPPED DATA ->
            <ArticleTitle>Cell Walls and the Convergent Evolution of the Viral Envelope.</ArticleTitle>
            <!- SKIPPED DATA ->
            <Abstract>
                <AbstractText>Why some viruses are enveloped while others lack an outer lipid bilayer is a major question in viral evolution but one that has received relatively little attention. The viral envelope serves several functions, including protecting the RNA or DNA molecule(s), evading recognition by the immune system, and facilitating virus entry. Despite these commonalities, viral envelopes come in a wide variety of shapes and configurations. The evolution of the viral envelope is made more puzzling by the fact that nonenveloped viruses are able to infect a diverse range of hosts across the tree of life. We reviewed the entry, transmission, and exit pathways of all (101) viral families on the 2013 International Committee on Taxonomy of Viruses (ICTV) list. By doing this, we revealed a strong association between the lack of a viral envelope and the presence of a cell wall in the hosts these viruses infect. We were able to propose a new hypothesis for the existence of enveloped and nonenveloped viruses, in which the latter represent an adaptation to cells surrounded by a cell wall, while the former are an adaptation to animal cells where cell walls are absent. In particular, cell walls inhibit viral entry and exit, as well as viral transport within an organism, all of which are critical waypoints for successful infection and spread. Finally, we discuss how this new model for the origin of the viral envelope impacts our overall understanding of virus evolution. </AbstractText>
                <CopyrightInformation>Copyright © 2015, American Society for Microbiology. All Rights Reserved.</CopyrightInformation>
            </Abstract>
            <AuthorList CompleteYN="Y">
                <Author ValidYN="Y">
                    <LastName>Buchmann</LastName>
                    <ForeName>Jan P</ForeName>
                    <Initials>JP</Initials>
                    <AffiliationInfo>
                        <Affiliation>Marie Bashir Institute for Infectious Diseases and Biosecurity, Charles Perkins Centre, School of Biological Sciences, and Sydney Medical School, The University of Sydney, Sydney, New South Wales, Australia.</Affiliation>
                    </AffiliationInfo>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Holmes</LastName>
                    <ForeName>Edward C</ForeName>
                    <Initials>EC</Initials>
                    <AffiliationInfo>
                        <Affiliation>Marie Bashir Institute for Infectious Diseases and Biosecurity, Charles Perkins Centre, School of Biological Sciences, and Sydney Medical School, The University of Sydney, Sydney, New South Wales, Australia edward.holmes@sydney.edu.au.</Affiliation>
                    </AffiliationInfo>
                </Author>
            </AuthorList>
            <!- SKIPPED DATA ->
        </Article>
        <!- SKIPPED DATA ->
        <ReferenceList>
            <Reference>
                <Citation>Nature. 2014 Jan 16;505(7483):432-5</Citation>
                <ArticleIdList>
                    <ArticleId IdType="pubmed">24336205</ArticleId>
                </ArticleIdList>
            </Reference>
            <Reference>
                <Citation>Crit Rev Microbiol. 1988;15(4):339-89</Citation>
                <ArticleIdList>
                    <ArticleId IdType="pubmed">3060317</ArticleId>
                </ArticleIdList>
            </Reference>
            <!- SKIPPED DATA ->
        </ReferenceList>
    </PubmedData>
</PubmedArticle>

</PubmedArticleSet>

This shows us the XML fields, specifically the tags, present in a typical PubMed record. The root tag for each batch of fetched data records is <PubmedArticleSet> and each individual data record is described in the nested tags <PubmedArticle>. We are interested in the following tags nested within <PubmedArticle>:

<ArticleTitle>

<Abstract>

<AuthorList>

<ReferenceList>
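To get a feel for this layout before writing the fetcher, the tags can be explored with Python's xml.etree.ElementTree on a small sample. The sample below is hand-trimmed from Listing 2 and is not a complete PubMed record; the variable names are chosen for this sketch only.

```python
import xml.etree.ElementTree as ET

# Trimmed, hand-written sample modelled on Listing 2 (not a complete record).
SAMPLE = """<PubmedArticleSet>
<PubmedArticle>
  <Article>
    <ArticleTitle>Cell Walls and the Convergent Evolution of the Viral Envelope.</ArticleTitle>
    <AuthorList>
      <Author><LastName>Buchmann</LastName><ForeName>Jan P</ForeName></Author>
      <Author><LastName>Holmes</LastName><ForeName>Edward C</ForeName></Author>
    </AuthorList>
  </Article>
  <PubmedData>
    <ReferenceList>
      <Reference><Citation>Nature. 2014 Jan 16;505(7483):432-5</Citation></Reference>
    </ReferenceList>
  </PubmedData>
</PubmedArticle>
</PubmedArticleSet>"""

root = ET.fromstring(SAMPLE)
for article in root.iter("PubmedArticle"):
    # './/' searches all descendants, so the exact nesting depth does not matter.
    title = article.findtext(".//ArticleTitle")
    authors = [author.findtext("LastName") for author in article.iter("Author")]
    citations = [citation.text for citation in article.iter("Citation")]
    print(title, authors, citations, sep="\n")
```

This loads the whole document into memory, which is fine for a quick look but differs from the incremental parsing used later in this tutorial.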

The first step is to write a program to fetch the requested records. This can be done using the entrezpy.conduit.Conduit class.

Simple Conduit pipeline to fetch PubMed Records

We will write a simple entrezpy pipeline named pubmed-fetcher.py using entrezpy.conduit.Conduit to test and run our implementations. A simple entrezpy.conduit.Conduit pipeline requires two arguments:

user email

PMID (here 15430309)

Listing 3 Basic entrezpy.conduit.Conduit pipeline to fetch PubMed data records. The required arguments are positional arguments given at the command line.

#!/usr/bin/env python3


import os
import sys


"""
If entrezpy is installed using PyPi uncomment the line 'import entrezpy' and
comment the 'sys.path.insert(...)'
"""
# import entrezpy
sys.path.insert(1, os.path.join(sys.path[0], '../../../src'))
# Import required entrezpy modules
import entrezpy.conduit


def main():
  c = entrezpy.conduit.Conduit(sys.argv[1])
  fetch_pubmed = c.new_pipeline()
  fetch_pubmed.add_fetch({'db':'pubmed', 'id':[sys.argv[2]], 'retmode':'xml'})
  c.run(fetch_pubmed)
  return 0

if __name__ == '__main__':
  main()

Lines 3-4: import standard Python libraries
Lines 12-15: import the module entrezpy.conduit (adjust as necessary)
Line 19: create new entrezpy.conduit.Conduit instance with an email address from the first command line argument
Line 20: create new pipeline fetch_pubmed using entrezpy.conduit.Conduit.new_pipeline()
Line 21: add fetch request to the fetch_pubmed pipeline with the PMID from the second command line argument using entrezpy.conduit.Conduit.Pipeline.add_fetch()
Line 22: run pipeline using entrezpy.conduit.Conduit.run()
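Listing 3 reads sys.argv directly and therefore fails with an IndexError if an argument is missing. One possible alternative, sketched here with the standard argparse module, validates both positional arguments first. The function name parse_args and the numeric PMID check are assumptions of this sketch and not part of pubmed-fetcher.py.

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional arguments the pipeline expects."""
    parser = argparse.ArgumentParser(description="Fetch PubMed records (sketch)")
    parser.add_argument("email", help="user email passed to Conduit")
    parser.add_argument("pmid", help="PubMed ID to fetch")
    args = parser.parse_args(argv)
    if not args.pmid.isdigit():
        # parser.error() prints a usage message and exits with status 2.
        parser.error("PMID must be numeric, got {!r}".format(args.pmid))
    return args


args = parse_args(["you@email", "15430309"])
print(args.email, args.pmid)
```

In a script, calling parse_args() without arguments would read from the command line instead of the hard-coded list used here.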

Let’s test this program to see if all modules are found and conduit works.

$ python pubmed-fetcher.py your@email 15430309

Since we didn’t specify an analyzer yet, we expect the raw XML output to be printed to the standard output. So far, this is the same kind of raw XML output that Listing 1 produces.

If this command fails and/or no output is printed to the standard output, something went wrong. Possible issues may include no internet connection, wrongly installed entrezpy, wrong import statements, or bad permissions.
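To rule out import problems, a short check like the following can report whether a module can be located at all. This is a sketch using the standard importlib machinery; the helper name module_available is an assumption of this example.

```python
import importlib.util


def module_available(name):
    """Return True if the top-level module `name` can be located for import."""
    return importlib.util.find_spec(name) is not None


# 'xml' and 'json' ship with Python; 'entrezpy' depends on your installation.
for name in ("xml", "json", "entrezpy"):
    print(name, "found" if module_available(name) else "NOT found")
```

If entrezpy is reported as not found and you work from a cloned repository, the path given to sys.path.insert(...) in Listing 3 is the first thing to check.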

If everything went smoothly, we wrote a basic but working pipeline to fetch PubMed data from NCBI’s Entrez database. We can now start to implement our specific entrezpy.base.result.EutilsResult and entrezpy.base.analyzer.EutilsAnalyzer classes. However, before we implement these classes, we need to decide how we want to store a PubMed data record.

How to store PubMed data records

The data records can be stored in different ways, but using a class facilitates collecting and retrieving the requested data. We implement a simple class (analogous to a C/C++ struct [4]) to represent a PubMed record.

Listing 4 Implementing a PubMed data record

class PubmedRecord:
  """Simple data class to store individual Pubmed records. Individual authors will
  be stored as dict('lname':last_name, 'fname': first_name) in authors.
  Citations as string elements in the list references. """

  def __init__(self):
    self.pmid = None
    self.title = None
    self.abstract = None
    self.authors = []
    self.references = []

Further, we will use the dict pubmed_records as an attribute of PubmedResult to store PubmedRecord instances, using the PMID as key to avoid duplicates.
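The effect of keying by PMID can be seen in a small standalone sketch. PubmedRecord repeats the fields from Listing 4 so the example runs on its own; the module-level dict and function stand in for the PubmedResult attribute and method described in this tutorial.

```python
class PubmedRecord:
    """Same fields as Listing 4, repeated so the sketch runs on its own."""

    def __init__(self):
        self.pmid = None
        self.title = None
        self.abstract = None
        self.authors = []
        self.references = []


pubmed_records = {}


def add_pubmed_record(record):
    # A record with an already known PMID replaces the earlier entry.
    pubmed_records[record.pmid] = record


# The PMID 26378223 is added twice on purpose.
for pmid in ("26378223", "15430309", "26378223"):
    rec = PubmedRecord()
    rec.pmid = pmid
    add_pubmed_record(rec)

print(sorted(pubmed_records))
```

Note that with this approach the most recently added record wins when a PMID occurs more than once.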

Defining `PubmedResult` and `PubmedAnalyzer`

From the documentation or publication, we know that entrezpy.base.analyzer.EutilsAnalyzer parses responses and stores results in entrezpy.base.result.EutilsResult. Therefore, we need to derive and adjust these classes for our PubmedResult and PubmedAnalyzer classes. We will add these classes to our program pubmed-fetcher.py. The documentation tells us what the required parameters for each class are and the virtual methods we need to implement.

Implement `PubmedResult`

We have to extend the virtual methods declared in entrezpy.base.result.EutilsResult. The documentation informs us about the required parameters and expected return values.

In addition, we declare the method PubmedResult.add_pubmed_record() to handle adding new PubMed data record instances as defined in Listing 4. The PubmedResult methods in this tutorial are trivial, so we can implement the class in one go.

Listing 5 Implementing PubmedResult

class PubmedResult(entrezpy.base.result.EutilsResult):
  """Derive class entrezpy.base.result.EutilsResult to store Pubmed queries.
  Individual Pubmed records are implemented in :class:`PubmedRecord` and
  stored in :ivar:`pubmed_records`.

  :param response: inspected response from :class:`PubmedAnalyzer`
  :param request: the request for the current response
  :ivar dict pubmed_records: storing PubmedRecord instances"""

  def __init__(self, response, request):
    super().__init__(request.eutil, request.query_id, request.db)
    self.pubmed_records = {}

  def size(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.size`
    returning the number of stored data records."""
    return len(self.pubmed_records)

  def isEmpty(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.isEmpty`
    to query if any records have been stored at all."""
    if not self.pubmed_records:
      return True
    return False

  def get_link_parameter(self, reqnum=0):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.get_link_parameter`.
    Fetching a pubmed record has no intrinsic elink capabilities and therefore
    should inform users about this."""
    print("{} has no elink capability".format(self))
    return {}

  def dump(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.dump`.

    :return: instance attributes
    :rtype: dict
    """
    return {self:{'dump':{'pubmed_records':[x for x in self.pubmed_records],
                              'query_id': self.query_id, 'db':self.db,
                              'eutil':self.function}}}

  def add_pubmed_record(self, pubmed_record):
    """The only non-virtual and therefore PubmedResult-specific method to handle
    adding new data records"""
    self.pubmed_records[pubmed_record.pmid] = pubmed_record

Line 1: inherit the base class entrezpy.base.result.EutilsResult
Lines 10-12: initialize PubmedResult instance with the required parameters and attributes. We don’t need any information from the response, e.g. WebEnv.
Lines 14-17: implement entrezpy.base.result.EutilsResult.size()
Lines 19-24: implement entrezpy.base.result.EutilsResult.isEmpty()
Lines 26-31: implement entrezpy.base.result.EutilsResult.get_link_parameter()
Lines 33-41: implement entrezpy.base.result.EutilsResult.dump()
Lines 43-46: specific PubmedResult method to store individual PubmedRecord instances

Note

Linking PubMed records for subsequent searches is better handled by creating a pipeline performing esearch queries followed by elink queries and a final efetch query. The fetch result for PubMed records has no WebEnv value and is missing the originating database since efetch is usually the last query within a series of Eutils queries. You can test this using the following EDirect pipeline:

$ efetch -db pubmed -id 20148030 | elink -target nuccore

Therefore, we implement a warning informing the user that linking is not possible. Nevertheless, the method could return any parsed information, e.g. nucleotide UIDs, to be used as parameters for a subsequent fetch. However, some features, e.g. the Entrez history server, could not be used.

Implementing `PubmedAnalyzer`

We have to extend the virtual methods declared in entrezpy.base.analyzer.EutilsAnalyzer. The documentation informs us about the required parameters and expected return values.

Listing 6 Implementing PubmedAnalyzer

class PubmedAnalyzer(entrezpy.base.analyzer.EutilsAnalyzer):
  """Derived class of :class:`entrezpy.base.analyzer.EutilsAnalyzer` to analyze and
  parse PubMed responses and requests."""

  def __init__(self):
    super().__init__()

  def init_result(self, response, request):
    """Implemented virtual method :meth:`entrezpy.base.analyzer.init_result`.
    This method initiates a result instance when analyzing the first response"""
    if self.result is None:
      self.result = PubmedResult(response, request)

  def analyze_error(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_error`. Since
    we expect XML errors, just print the error to STDOUT for
    logging/debugging."""
    print(json.dumps({__name__:{'Response': {'dump' : request.dump(),
                                             'error' : response.getvalue()}}}))

  def analyze_result(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_result`.
    Parse PubMed  XML line by line to extract authors and citations.
    xml.etree.ElementTree.iterparse
    (https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse)
    reads the XML file incrementally. Each  <PubmedArticle> is cleared after processing.

    ..note::  Adjust this method to include more/different tags to extract.
              Remember to adjust :class:`.PubmedRecord` as well."""
    self.init_result(response, request)
    isAuthorList = False
    isAuthor = False
    isRefList = False
    isRef = False
    isArticle = False
    medrec = None
    for event, elem in xml.etree.ElementTree.iterparse(response, events=["start", "end"]):
      if event == 'start':
        if elem.tag == 'PubmedArticle':
          medrec = PubmedRecord()
        if elem.tag == 'AuthorList':
          isAuthorList = True
        if isAuthorList and elem.tag == 'Author':
          isAuthor = True
          medrec.authors.append({'fname': None, 'lname': None})
        if elem.tag == 'ReferenceList':
          isRefList = True
        if isRefList and elem.tag == 'Reference':
          isRef = True
        if elem.tag == 'Article':
          isArticle = True
      else:
        if elem.tag == 'PubmedArticle':
          self.result.add_pubmed_record(medrec)
          elem.clear()
        if elem.tag == 'AuthorList':
          isAuthorList = False
        if isAuthorList and elem.tag == 'Author':
          isAuthor = False
        if elem.tag == 'ReferenceList':
          isRefList = False
        if elem.tag == 'Reference':
          isRef = False
        if elem.tag == 'Article':
          isArticle = False
        if elem.tag == 'PMID':
          medrec.pmid = elem.text.strip()
        if isAuthor and elem.tag == 'LastName':
          medrec.authors[-1]['lname'] = elem.text.strip()
        if isAuthor and elem.tag == 'ForeName':
          medrec.authors[-1]['fname'] = elem.text.strip()
        if isRef and elem.tag == 'Citation':
          medrec.references.append(elem.text.strip())
        if isArticle and elem.tag == 'AbstractText':
          if not medrec.abstract:
            medrec.abstract = elem.text.strip()
          else:
            medrec.abstract += elem.text.strip()
        if isArticle and elem.tag == 'ArticleTitle':
          medrec.title = elem.text.strip()

Line 1: inherit the base class entrezpy.base.analyzer.EutilsAnalyzer
Lines 5-6: initialize PubmedAnalyzer instance
Lines 8-12: implement entrezpy.base.analyzer.EutilsAnalyzer.init_result()
Lines 14-19: implement entrezpy.base.analyzer.EutilsAnalyzer.analyze_error()
Lines 21-69: implement entrezpy.base.analyzer.EutilsAnalyzer.analyze_result()

The XML parser is the critical, and most likely the most complex, piece to implement. However, if you want to parse your Entrez results, you need to develop a parser anyway. If you already have a parser, you can use an object composition approach [5]. Further, you can add a method in analyze_result to store the processed data in a database or to implement checkpoints.
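A minimal sketch of the composition idea follows. It does not use entrezpy, so that it runs on its own: ComposedAnalyzer and parse_titles are hypothetical names, and in a real pipeline the analyzer would additionally derive from entrezpy.base.analyzer.EutilsAnalyzer and implement its other virtual methods.

```python
import io
import xml.etree.ElementTree as ET


def parse_titles(xml_stream):
    """Stand-in for an existing parser: return all ArticleTitle texts."""
    return [elem.text for elem in ET.parse(xml_stream).iter("ArticleTitle")]


class ComposedAnalyzer:
    """Hypothetical analyzer that owns a parser instead of containing parsing code."""

    def __init__(self, parser):
        self.parser = parser
        self.titles = []

    def analyze_result(self, response, request=None):
        # Delegate the parsing to the composed parser and keep its results.
        self.titles.extend(self.parser(response))


response = io.BytesIO(
    b"<PubmedArticleSet><PubmedArticle>"
    b"<ArticleTitle>A title</ArticleTitle>"
    b"</PubmedArticle></PubmedArticleSet>"
)
analyzer = ComposedAnalyzer(parse_titles)
analyzer.analyze_result(response)
print(analyzer.titles)
```

The design choice here is that the parser can be tested and replaced independently of the analyzer, which only decides what to do with the parsed data.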

Note

Explaining the XML parser is beyond the scope of this tutorial (and there are likely better approaches, anyways).
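Still, the event mechanism that the parser in Listing 6 relies on can be observed in isolation. The following sketch feeds a tiny hand-written snippet to xml.etree.ElementTree.iterparse and records the order of start and end events; the snippet and variable names are made up for this example.

```python
import io
import xml.etree.ElementTree as ET

xml_data = io.BytesIO(
    b"<AuthorList>"
    b"<Author><LastName>Buchmann</LastName></Author>"
    b"<Author><LastName>Holmes</LastName></Author>"
    b"</AuthorList>"
)

events = []
last_names = []
for event, elem in ET.iterparse(xml_data, events=["start", "end"]):
    events.append((event, elem.tag))
    # An element's text is only guaranteed to be complete at its 'end' event,
    # which is why values are read there and not at 'start'.
    if event == "end" and elem.tag == "LastName":
        last_names.append(elem.text)

print(events[:4])
print(last_names)
```

Each element produces one start event when its opening tag is read and one end event when its closing tag is read, so flags set at start and reset at end describe which enclosing tags are currently open.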

Putting everything together

The completed implementation is shown in Listing 7.

Listing 7 Complete PubMed fetcher to extract authors and citations.

#!/usr/bin/env python3


import os
import sys
import json
import xml.etree.ElementTree


# If entrezpy is installed using PyPi uncomment the line 'import entrezpy'
# and comment the 'sys.path.insert(...)'
# import entrezpy
sys.path.insert(1, os.path.join(sys.path[0], '../../../src'))
# Import required entrezpy modules
import entrezpy.conduit
import entrezpy.base.result
import entrezpy.base.analyzer


class PubmedRecord:
  """Simple data class to store individual Pubmed records. Individual authors will
  be stored as dict('lname':last_name, 'fname': first_name) in authors.
  Citations as string elements in the list references. """

  def __init__(self):
    self.pmid = None
    self.title = None
    self.abstract = None
    self.authors = []
    self.references = []

class PubmedResult(entrezpy.base.result.EutilsResult):
  """Derive class entrezpy.base.result.EutilsResult to store Pubmed queries.
  Individual Pubmed records are implemented in :class:`PubmedRecord` and
  stored in :ivar:`pubmed_records`.

  :param response: inspected response from :class:`PubmedAnalyzer`
  :param request: the request for the current response
  :ivar dict pubmed_records: storing PubmedRecord instances"""

  def __init__(self, response, request):
    super().__init__(request.eutil, request.query_id, request.db)
    self.pubmed_records = {}

  def size(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.size`
    returning the number of stored data records."""
    return len(self.pubmed_records)

  def isEmpty(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.isEmpty`
    to query if any records have been stored at all."""
    if not self.pubmed_records:
      return True
    return False

  def get_link_parameter(self, reqnum=0):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.get_link_parameter`.
    Fetching a pubmed record has no intrinsic elink capabilities and therefore
    should inform users about this."""
    print("{} has no elink capability".format(self))
    return {}

  def dump(self):
    """Implement virtual method :meth:`entrezpy.base.result.EutilsResult.dump`.

    :return: instance attributes
    :rtype: dict
    """
    return {self:{'dump':{'pubmed_records':[x for x in self.pubmed_records],
                              'query_id': self.query_id, 'db':self.db,
                              'eutil':self.function}}}

  def add_pubmed_record(self, pubmed_record):
    """The only non-virtual and therefore PubmedResult-specific method to handle
    adding new data records"""
    self.pubmed_records[pubmed_record.pmid] = pubmed_record

class PubmedAnalyzer(entrezpy.base.analyzer.EutilsAnalyzer):
  """Derived class of :class:`entrezpy.base.analyzer.EutilsAnalyzer` to analyze and
  parse PubMed responses and requests."""

  def __init__(self):
    super().__init__()

  def init_result(self, response, request):
    """Implemented virtual method :meth:`entrezpy.base.analyzer.init_result`.
    This method initiates a result instance when analyzing the first response"""
    if self.result is None:
      self.result = PubmedResult(response, request)

  def analyze_error(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_error`. Since
    we expect XML errors, just print the error to STDOUT for
    logging/debugging."""
    print(json.dumps({__name__:{'Response': {'dump' : request.dump(),
                                             'error' : response.getvalue()}}}))

  def analyze_result(self, response, request):
    """Implement virtual method :meth:`entrezpy.base.analyzer.analyze_result`.
    Parse PubMed  XML line by line to extract authors and citations.
    xml.etree.ElementTree.iterparse
    (https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse)
    reads the XML file incrementally. Each  <PubmedArticle> is cleared after processing.

    ..note::  Adjust this method to include more/different tags to extract.
              Remember to adjust :class:`.PubmedRecord` as well."""
    self.init_result(response, request)
    isAuthorList = False
    isAuthor = False
    isRefList = False
    isRef = False
    isArticle = False
    medrec = None
    for event, elem in xml.etree.ElementTree.iterparse(response, events=["start", "end"]):
      if event == 'start':
        if elem.tag == 'PubmedArticle':
          medrec = PubmedRecord()
        if elem.tag == 'AuthorList':
          isAuthorList = True
        if isAuthorList and elem.tag == 'Author':
          isAuthor = True
          medrec.authors.append({'fname': None, 'lname': None})
        if elem.tag == 'ReferenceList':
          isRefList = True
        if isRefList and elem.tag == 'Reference':
          isRef = True
        if elem.tag == 'Article':
          isArticle = True
      else:
        if elem.tag == 'PubmedArticle':
          self.result.add_pubmed_record(medrec)
          elem.clear()
        if elem.tag == 'AuthorList':
          isAuthorList = False
        if isAuthorList and elem.tag == 'Author':
          isAuthor = False
        if elem.tag == 'ReferenceList':
          isRefList = False
        if elem.tag == 'Reference':
          isRef = False
        if elem.tag == 'Article':
          isArticle = False
        if elem.tag == 'PMID':
          medrec.pmid = elem.text.strip()
        if isAuthor and elem.tag == 'LastName':
          medrec.authors[-1]['lname'] = elem.text.strip()
        if isAuthor and elem.tag == 'ForeName':
          medrec.authors[-1]['fname'] = elem.text.strip()
        if isRef and elem.tag == 'Citation':
          medrec.references.append(elem.text.strip())
        if isArticle and elem.tag == 'AbstractText':
          if not medrec.abstract:
            medrec.abstract = elem.text.strip()
          else:
            medrec.abstract += elem.text.strip()
        if isArticle and elem.tag == 'ArticleTitle':
          medrec.title = elem.text.strip()

def main():
  c = entrezpy.conduit.Conduit(sys.argv[1])
  fetch_pubmed = c.new_pipeline()
  fetch_pubmed.add_fetch({'db':'pubmed', 'id':sys.argv[2].split(','),
                          'retmode':'xml'}, analyzer=PubmedAnalyzer())

  a = c.run(fetch_pubmed)

  #print(a)
  # Testing PubmedResult
  #print("DUMP: {}".format(a.get_result().dump()))
  #print("SIZE: {}".format(a.get_result().size()))
  #print("LINK: {}".format(a.get_result().get_link_parameter()))

  res = a.get_result()
  print("PMID","Title","Abstract","Authors","RefCount", "References", sep='=')
  for i in res.pubmed_records:
    print("{}={}={}={}={}={}".format(res.pubmed_records[i].pmid, res.pubmed_records[i].title,
                                  res.pubmed_records[i].abstract,
                                  ';'.join(str(x['lname']+","+x['fname'].replace(' ', '')) for x in res.pubmed_records[i].authors),
                                  len(res.pubmed_records[i].references),
                                  ';'.join(x for x in res.pubmed_records[i].references)))
  return 0

if __name__ == '__main__':
  main()

Line 163: adjust argument processing to allow several comma-separated PMIDs
Line 164: add our implemented PubmedAnalyzer as parameter to analyze results as described in entrezpy.conduit.Conduit.Pipeline.add_fetch()
Line 166: run the pipeline and store the analyzer in a
Lines 168-172: testing methods
Line 174: get PubmedResult instance
Lines 175-181: process fetched data records into columns

The implementation can be invoked as shown in Listing 8.

Listing 8 Fetching and formatting data records for several different PMIDs

$ python pubmed-fetcher.py you@email 6,15430309,31077305,27880757,26378223| column -s= -t |less

You’ll notice that not all data records have all fields. This is because the corresponding tags are missing in these records or have different names.
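When formatting such records, missing values show up as None. One possible way to keep the column output well-formed is to replace them with empty strings, as in the following sketch. The function format_row is an assumption of this example and is not part of pubmed-fetcher.py.

```python
def format_row(pmid, title, abstract, authors, references, sep="="):
    """Join record fields into one row, replacing missing values with ''."""
    author_names = ";".join(
        "{},{}".format(a.get("lname") or "", (a.get("fname") or "").replace(" ", ""))
        for a in authors
    )
    fields = [
        pmid or "",
        title or "",
        abstract or "",
        author_names,
        str(len(references)),
        ";".join(references),
    ]
    return sep.join(fields)


# A record without an abstract and without references.
row = format_row("26378223", "A title", None,
                 [{"lname": "Buchmann", "fname": "Jan P"}], [])
print(row)
```

This only affects how records are printed; it does not change which tags the parser recognizes.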

Running pubmed-fetcher.py with UID 20148030 will fail (Listing 9).

Listing 9 Fetching the data record PMID 20148030 results in an error

$ python pubmed-fetcher.py you@email 20148030

The reason for this can be found in the requested XML. Running the command in Listing 10 hints at the problem. Adjusting and fixing is a task left for interested readers.

Listing 10 Hint to find the reason why PMID 20148030 fails

$ efetch -db pubmed -id 20148030  -mode xml | grep -A7 \<AuthorList

Footnotes

[1]	https://en.wikipedia.org/wiki/Virtual_function

[2]	http://www.greenwoodsoftware.com/less/

[3]	https://mirrors.edge.kernel.org/pub/linux/utils/util-linux/

[4]	https://en.cppreference.com/w/c/language/struct

[5]	https://en.wikipedia.org/wiki/Object_composition
