evcouplings.compare package

evcouplings.compare.distances module

Distance calculations on PDB 3D coordinates

Authors:
Thomas A. Hopf Anna G. Green (remap_complex_chains)
class evcouplings.compare.distances.DistanceMap(residues_i, residues_j, dist_matrix, symmetric)[source]

Bases: object

Compute, store and accesss pairwise residue distances in PDB 3D structures

classmethod aggregate(*matrices, intersect=False, agg_func=<MagicMock id='139762542125264'>)[source]

Aggregate with other distance map(s). Secondary structure will be aggregated by assigning the most frequent state across all distance matrices; if there are equal counts, H (helix) will be chosen over E (strand) over C (coil).

Parameters:
  • *matrices (DistanceMap) –

    *args-style list of DistanceMaps that will be aggregated.

    Note

    The id column of each axis may only contain numeric residue ids (and no characters such as insertion codes)

  • intersect (bool, optional (default: False)) – If True, intersect indices of the given distance maps. Otherwise, union of indices will be used.
  • agg_func (function (default: numpy.nanmin)) – Function that will be used to aggregate distance matrices. Needs to take a parameter “axis” to aggregate over all matrices.
Returns:

Aggregated distance map

Return type:

DistanceMap

Raises:

ValueError – If residue identifiers are not numeric, or if intersect is True, but positions on axis do not overlap.

contacts(max_dist=5.0, min_dist=None)[source]

Return list of pairs below distance threshold

Parameters:
  • max_dist (float, optional (default: 5.0)) – Maximum distance for any pair to be considered a contact
  • min_dist (float, optional (default: None)) – Minimum distance of any pair to be returned (may be useful if extracting different distance ranges from matrix). Distance has to be > min_dist, (not >=).
Returns:

contacts – Table with residue-residue contacts, with the following columns:

  1. id_i: identifier of residue in chain i
  2. id_j: identifier of residue in chain j
  3. dist: pair distance

Return type:

pandas.DataFrame

dist(i, j, raise_na=True)[source]

Return distance of residue pair

Parameters:
  • i (int or str) – Identifier of position on first axis
  • j (int or str) – Identifier of position on second axis
  • raise_na (bool, optional (default: True)) – Raise error if i or j is not contained in either axis. If False, returns np.nan for undefined entries.
Returns:

Distance of pair (i, j). If raise_na is False and identifiers are not valid, distance will be np.nan

Return type:

np.float

Raises:

KeyError – If index i or j is not a valid identifier for respective chain

classmethod from_coords(chain_i, chain_j=None)[source]

Compute distance matrix from PDB chain coordinates.

Parameters:
  • chain_i (Chain) – PDB chain to be used for first axis of matrix
  • chain_j (Chain, optional (default: None)) – PDB chain to be used for second axis of matrix. If not given, will be set to chain_i, resulting in a symmetric distance matrix
Returns:

Distance map computed from given coordinates

Return type:

DistanceMap

classmethod from_file(filename)[source]

Load existing distance map using filename prefix (each distance map consist of .csv and .npy file)

Parameters:filename (str) – Prefix of path to distance map files (excluding .csv/.npy)
Returns:Loaded distance map
Return type:DistanceMap
classmethod from_files(residue_table_file, distance_matrix_file)[source]

Load existing distance map with explicit paths to residue table (.csv) and distance matrix (.npy). Use DistanceMap.from_file to load using joint prefix of both files.

Parameters:
  • residue_table_file (str or file-like object) – Path to residue table file (prefix + .csv)
  • distance_matrix_file (str or file-like object) – Path to distance matrix file (prefix + .npy)
Returns:

Loaded distance map

Return type:

DistanceMap

structure_coverage()[source]

Find covered residue segments for individual structures that this distance map was computed from (either directly from structure or through aggregation of multiple structures). Only works if all residue identifiers of DistanceMap are numeric (i.e., do not have insertion codes)

Returns:coverage – Returns tuples of the form (coverage_i, coverage_j, coverage_id), where * coverage_i and coverage_j are lists of tuples
(segment_start, segment_end) of residue coverage along axis i and j, with segment_end being included in the range
  • coverage_id is the identifier of the individual substructure the coverage segments belong to (only set if an aggregated structure, None otherwise)
Return type:list of tuple
to_file(filename)[source]

Store distance map in files

Parameters:filename (str) – Prefix of distance map files (will create .csv and .npy file)
Returns:
  • residue_table_filename (str) – Path to residue table (will be filename + .csv)
  • dist_mat_filename (str) – Path to distance matrix file in numpy format (will be filename + .npy)
transpose()[source]

Transpose distance map (i.e. swap axes)

Returns:Transposed copy of distance map
Return type:DistanceMap
evcouplings.compare.distances.inter_dists(sifts_result_i, sifts_result_j, structures=None, atom_filter=None, intersect=False, output_prefix=None, model=0, raise_missing=True)[source]

Compute inter-chain distances (between different entities) in PDB file. Resulting distance map is typically not symmetric, with either axis corresponding to either chain. Inter-distances are calculated on all combinations of chains that have the same PDB id in sifts_result_i and sifts_result_j.

Parameters:
  • sifts_result_i (SIFTSResult) – Input structures and mapping to use for first axis of computed distance map
  • sifts_result_j (SIFTSResult) – Input structures and mapping to use for second axis of computed distance map
  • structures (str or dict, optional (default: None)) –
    • If str: Load structures from directory this string points to. Missing structures will be fetched from web.
    • If dict: dictionary with lower-case PDB ids as keys and PDB objects as values. This dictionary has to contain all necessary structures, missing ones will not be fetched. This dictionary can be created using pdb.load_structures.
  • atom_filter (str, optional (default: None)) – Filter coordinates to contain only these atoms. E.g. set to “CA” to compute C_alpha - C_alpha distances instead of minimum atom distance over all atoms in both residues.
  • intersect (bool, optional (default: False)) – If True, intersect indices of the given distance maps. Otherwise, union of indices will be used.
  • output_prefix (str, optional (default: None)) – If given, save individual contact maps to files prefixed with this string. The appended file suffixes map to row index in sifts_results.hits
  • model (int, optional (default: 0)) – Index of model in PDB structure that should be used
  • raise_missing (bool, optional (default: True)) – Raise a ResourceError if any of the input structures can not be loaded; otherwise, ignore missing entries.
Returns:

agg_distmap – Computed aggregated distance map across all input structures

If output_prefix is given, agg_distmap will have an additional attribute individual_distance_map_table:

pd.DataFrame with all individual distance maps that went into the aggregated distance map, with columns “sifts_table_index_i”, “sifts_table_index_j” (linking to SIFTS hit table) and “residue_table” and “distance_matrix” (file names of .csv and .npy files constituting the respective distance map).

Will be None if output_prefix is None.

Return type:

DistanceMap

Raises:
  • ValueError – If sifts_result_i or sifts_result_j is empty (no structure hits)
  • ResourceError – If any structure could not be loaded and raise_missing is True
evcouplings.compare.distances.intra_dists(sifts_result, structures=None, atom_filter=None, intersect=False, output_prefix=None, model=0, raise_missing=True)[source]

Compute intra-chain distances in PDB files.

Parameters:
  • sifts_result (SIFTSResult) – Input structures and mapping to use for distance map calculation
  • structures (str or dict, optional (default: None)) –

    If str: Load structures from directory this string points to. Missing structures will be fetched from web.

    If dict: dictionary with lower-case PDB ids as keys and PDB objects as values. This dictionary has to contain all necessary structures, missing ones will not be fetched. This dictionary can be created using pdb.load_structures.

  • atom_filter (str, optional (default: None)) – Filter coordinates to contain only these atoms. E.g. set to “CA” to compute C_alpha - C_alpha distances instead of minimum atom distance over all atoms in both residues.
  • intersect (bool, optional (default: False)) – If True, intersect indices of the given distance maps. Otherwise, union of indices will be used.
  • output_prefix (str, optional (default: None)) – If given, save individual contact maps to files prefixed with this string. The appended file suffixes map to row index in sifts_results.hits
  • model (int, optional (default: 0)) – Index of model in PDB structure that should be used
  • raise_missing (bool, optional (default: True)) – Raise a ResourceError if any of the input structures can not be loaded; otherwise, ignore missing entries.
Returns:

agg_distmap – Computed aggregated distance map across all input structures

Contains an additional attribute aggregated_residue_maps, a pd.DataFrame with the concatenated residue maps of all individual chains used to compute this DistanceMap. Individual chains are linked to the input sifts_results through the column sifts_table_index.

If output_prefix is given, agg_distmap will have an additional attribute individual_distance_map_table: pd.DataFrame with all individual distance maps that went into the aggregated distance map, with columns “sifts_table_index” (linking to SIFTS hit table) and “residue_table” and “distance_matrix” (file names of .csv and .npy files constituting the respective distance map). Will be None if output_prefix is None.

Return type:

DistanceMap

Raises:
  • ValueError – If sifts_result is empty (no structure hits)
  • ResourceError – If any structure could not be loaded and raise_missing is True
evcouplings.compare.distances.multimer_dists(sifts_result, structures=None, atom_filter=None, intersect=False, output_prefix=None, model=0, raise_missing=True)[source]

Compute homomultimer distances (between repeated copies of the same entity) in PDB file. Resulting distance matrix will be symmetric by minimization over upper and lower triangle of matrix, even if the complex structure is not symmetric.

Parameters:
  • sifts_result (SIFTSResult) – Input structures and mapping to use for distance map calculation
  • structures (str or dict, optional (default: None)) –

    If str: Load structures from directory this string points to. Missing structures will be fetched from web.

    If dict: dictionary with lower-case PDB ids as keys and PDB objects as values. This dictionary has to contain all necessary structures, missing ones will not be fetched. This dictionary can be created using pdb.load_structures.

  • atom_filter (str, optional (default: None)) – Filter coordinates to contain only these atoms. E.g. set to “CA” to compute C_alpha - C_alpha distances instead of minimum atom distance over all atoms in both residues.
  • intersect (bool, optional (default: False)) – If True, intersect indices of the given distance maps. Otherwise, union of indices will be used.
  • output_prefix (str, optional (default: None)) – If given, save individual contact maps to files prefixed with this string. The appended file suffixes map to row index in sifts_results.hits
  • model (int, optional (default: 0)) – Index of model in PDB structure that should be used
  • raise_missing (bool, optional (default: True)) – Raise a ResourceError if any of the input structures can not be loaded; otherwise, ignore missing entries.
Returns:

agg_distmap – Computed aggregated distance map across all input structures

If output_prefix is given, agg_distmap will have an additional attribute individual_distance_map_table: pd.DataFrame with all individual distance maps that went into the aggregated distance map, with columns “sifts_table_index_i”, “sifts_table_index_j” (linking to SIFTS hit table) and “residue_table” and “distance_matrix” (file names of .csv and .npy files constituting the respective distance map).

Will be None if output_prefix is None.

Return type:

DistanceMap

Raises:
  • ValueError – If sifts_result is empty (no structure hits)
  • ResourceError – If any structure could not be loaded and raise_missing is True
evcouplings.compare.distances.remap_chains(sifts_result, output_prefix, sequence=None, structures=None, atom_filter=('N', 'CA', 'C', 'O'), model=0, chain_name='A', raise_missing=True)[source]

Remap a set of PDB chains into the numbering scheme (and amino acid sequence) of a target sequence (a.k.a. the poorest homology model possible).

(This function is placed here because of close relationship to intra_dists and reusing functionality for it).

Parameters:
  • sifts_result (SIFTSResult) – Input structures and mapping to use for remapping
  • output_prefix (str) – Save remapped structures to files prefixed with this string
  • sequence (dict, optional (default: None)) –

    Mapping from sequence position (int or str) to residue. If this parameter is given, residues in the output structures will be renamed to the residues in this mapping.

    Note

    if side-chain residues are not taken off using atom_filter, this will e.g. happily label an actual glutamate as an alanine).

  • structures (str or dict, optional (default: None)) –
    • If str: Load structures from directory this string points to. Missing structures will be fetched from web.
    • If dict: dictionary with lower-case PDB ids as keys and PDB objects as values. This dictionary has to contain all necessary structures, missing ones will not be fetched. This dictionary can be created using pdb.load_structures.
  • atom_filter (str, optional (default: ("N", "CA", "C", "O"))) – Filter coordinates to contain only these atoms. If None, will retain all atoms; the default value will only keep backbone atoms.
  • model (int, optional (default: 0)) – Index of model in PDB structure that should be used
  • chain_name (str, optional (default: "A")) – Rename the PDB chain to this when saving the file. This will not affect the file name, only the name of the chain in the PDB object.
  • raise_missing (bool, optional (default: True)) – Raise a ResourceError if any of the input structures can not be loaded; otherwise, ignore missing entries.
Returns:

remapped – Mapping from index of each structure hit in sifts_results.hits to filename of stored remapped structure

Return type:

dict

evcouplings.compare.distances.remap_complex_chains(sifts_result_i, sifts_result_j, sequence_i=None, sequence_j=None, structures=None, atom_filter=('N', 'CA', 'C', 'O'), output_prefix=None, raise_missing=True, chain_name_i='A', chain_name_j='B', model=0)[source]

Remap a pair of PDB chains from the same structure into the numbering scheme (and amino acid sequence) of a target sequence.

Parameters:
  • sifts_result_i (SIFTSResult) – Input structures and mapping to use for remapping
  • sifts_result_j (SIFTSResult) – Input structures and mapping to use for remapping
  • output_prefix (str) – Save remapped structures to files prefixed with this string
  • sequence_i (dict, optional (default: None)) –

    Mapping from sequence position (int or str) in the first sequence to residue. If this parameter is given, residues in the output structures will be renamed to the residues in this mapping.

    Note

    if side-chain residues are not taken off using atom_filter, this will e.g. happily label an actual glutamate as an alanine).

  • sequence_j (dict, optional (default: None)) – Same as sequence_j for second sequence.
  • structures (str or dict, optional (default: None)) –
    • If str: Load structures from directory this string points to. Missing structures will be fetched from web.
    • If dict: dictionary with lower-case PDB ids as keys and PDB objects as values. This dictionary has to contain all necessary structures, missing ones will not be fetched. This dictionary can be created using pdb.load_structures.
  • atom_filter (str, optional (default: ("N", "CA", "C", "O"))) – Filter coordinates to contain only these atoms. If None, will retain all atoms; the default value will only keep backbone atoms.
  • model (int, optional (default: 0)) – Index of model in PDB structure that should be used
  • raise_missing (bool, optional (default: True)) – Raise a ResourceError if any of the input structures can not be loaded; otherwise, ignore missing entries.
  • chain_name_i (str, optional (default: "A")) – Renames the first chain to this string
  • chain_name_j (str, optional (default: "B")) – Renames the second chain to this string
Returns:

remapped – Mapping from index of each structure hit in sifts_results.hits to filename of stored remapped structure

Return type:

dict

Raises:
  • ValueError – If sifts_result_i or sifts_result_j is empty (no structure hits)
  • ResourceError – If any structure could not be loaded and raise_missing is True

evcouplings.compare.ecs module

Compare evolutionary couplings to distances in 3D structures

Authors:
Thomas A. Hopf
evcouplings.compare.ecs.add_distances(ec_table, dist_map, target_column='dist')[source]

Add pair distances to EC score table

Parameters:
  • ec_table (pandas.DataFrame) – List of evolutionary couplings, with pair positions in columns i and j
  • dist_map (DistanceMap) – Distance map that will be used to annotate distances in ec_table
  • target_column (str) – Name of column in which distances will be stored
Returns:

Couplings table with added distances in target_column. Pairs where no distance information is available will be np.nan

Return type:

pandas.DataFrame

evcouplings.compare.ecs.add_precision(ec_table, dist_cutoff=5, score='cn', min_sequence_dist=6, target_column='precision', dist_column='dist')[source]

Compute precision of evolutionary couplings as predictor of 3D structure contacts

Parameters:
  • ec_table (pandas.DataFrame) – List of evolutionary couplings
  • dist_cutoff (float, optional (default: 5)) – Upper distance cutoff (in Angstrom) for a pair to be considered a true positive contact
  • score (str, optional (default: "cn")) – Column which contains coupling score. Table will be sorted in descending order by this score.
  • min_sequence_dist (int, optional (default: 6)) – Minimal distance in primary sequence for an EC to be included in precision calculation
  • target_column (str, optional (default: "precision")) – Name of column in which precision will be stored
  • dist_column (str, optional (default: "dist")) – Name of column which contains pair distances
Returns:

EC table with added precision values as a function of EC rank (returned table will be sorted by score column)

Return type:

pandas.DataFrame

evcouplings.compare.ecs.coupling_scores_compared(ec_table, dist_map, dist_map_multimer=None, dist_cutoff=5, output_file=None, score='cn', min_sequence_dist=6)[source]

Utility function to create “CouplingScores.csv”-style table

Parameters:
  • ec_table (pandas.DataFrame) – List of evolutionary couplings
  • dist_map (DistanceMap) – Distance map that will be used to annotate distances in ec_table
  • dist_map_multimer (DistanceMap, optional (default: None)) – Additional multimer distance map. If given, the distance for any EC pair will be the minimum out of the monomer and multimer distances.
  • dist_cutoff (float, optional (default: 5)) – Upper distance cutoff (in Angstrom) for a pair to be considered a true positive contact
  • output_file (str, optional (default: None)) – Store final table to this file
  • score (str, optional (default: "cn")) – Column which contains coupling score. Table will be sorted in descending order by this score.
  • min_sequence_dist (int, optional (default: 6)) – Minimal distance in primary sequence for an EC to be included in precision calculation
Returns:

EC table with added distances, and precision if dist_cutoff is given.

Return type:

pandas.DataFrame

evcouplings.compare.mapping module

Index mapping for PDB structures

Authors:
Thomas A. Hopf Charlotta P. Schärfe
evcouplings.compare.mapping.alignment_index_mapping(alignment_file, format='stockholm', target_seq=None)[source]

Create index mapping table between sequence positions based on a sequence alignment.

Parameters:
  • alignment_file (str) – Path of alignment file containing sequences for which indices shoul dbe mapped
  • format ({"stockholm", "fasta"}) – Format of alignment file
  • target_seq (str, optional (default: None)) – Identifier of sequence around which the index mapping will be centered. If None, first sequence in alignment will be used.
Returns:

Mapping table containing assignment of

  1. index in target sequence (i)
  2. symbol in target sequence (A_i)

For all other sequences in alignment, the following two columns:

  1. index in second sequence (j_<sequence id>)
  2. symbol in second sequence (A_j_<sequence_id>)

Return type:

pandas.DataFrame

evcouplings.compare.mapping.map_indices(seq_i, start_i, end_i, seq_j, start_j, end_j, gaps=('-', '.'))[source]

Compute index mapping between positions in two aligned sequences

Parameters:
  • seq_i (str) – First aligned sequence
  • start_i (int) – Index of first position in first sequence
  • end_i (int) – Index of last position in first sequence (used for verification purposes only)
  • seq_j (str) – Second aligned sequence
  • start_j (int) – Index of first position in second sequence
  • end_j (int) – Index of last position in second sequence (used for verification purposes only)
Returns:

Mapping table containing assignment of

  1. index in first sequence (i)
  2. symbol in first sequence (A_i)
  3. index in second sequence (j)
  4. symbol in second sequence (A_j)

Return type:

pandas.DataFrame

evcouplings.compare.pdb module

PDB structure handling based on MMTF format

Authors:
Thomas A. Hopf
class evcouplings.compare.pdb.Chain(residues, coords)[source]

Bases: object

Container for PDB chain residue and coordinate information

filter_atoms(atom_name='CA')[source]

Filter coordinates of chain, e.g. to compute C_alpha-C_alpha distances

Parameters:atom_name (str or list-like, optional (default: "CA")) – Name(s) of atoms to keep
Returns:Chain containing only filtered atoms (and those residues that have such an atom)
Return type:Chain
filter_positions(positions)[source]

Select a subset of positions from the chain

Parameters:positions (list-like) – Set of residues that will be kept
Returns:Chain containing only the selected residues
Return type:Chain
remap(mapping, source_id='seqres_id')[source]

Remap chain into different numbering scheme (e.g. from seqres to uniprot numbering)

Parameters:
  • mapping (dict) –

    Mapping of residue identifiers from source_id (current main index of PDB chain) to new identifiers.

    mapping may either be:

    1. dict(str -> str) to map individual residue IDs. Keys and values of dictionary will be typecast to string before the mapping, so it is possible to pass in integer values too (if the source or target IDs are numbers)
    2. dict((int, int) -> (int, int)) to map ranges of numbers to ranges of numbers. This should typically be only used with RESSEQ or UniProt numbering. End index or range is *inclusive* Note that residue IDs in the end will still be handled as strings when mapping.
  • source_id ({"seqres_id", "coord_id", "id"}, optional (default: "seqres_id")) – Residue identifier in chain to map *from* (will be used as key to access mapping)
Returns:

Chain with remapped numbering (“id” column in residues DataFrame)

Return type:

Chain

to_file(fileobj, chain_id='A', end=True, first_atom_id=1)[source]

Write chain to a file in PDB format (mmCIF not yet supported).

Note that PDB files written this function may not be 100% compliant with the PDB format standards, in particular:

  • some HETATM records may turn into ATOM records when starting from an mmtf file, if the record has a one-letter code (such as MSE / M).
  • code does not print TER record at the end of a peptide chain
Parameters:
  • fileobj (file-like object) – Write to this file handle
  • chain_id (str, optional (default: "A")) – Assign this chain name in file (allows to redefine chain name from whatever chain was originally)
  • end (bool, optional (default: True)) – Print “END” record after chain (signals end of PDB file)
  • first_atom_id (int, optional (default: 1)) – Renumber atoms to start with this index (set to None to keep default indices)
Raises:

ValueError – If atom or residue numbers are too wide and cannot be written to old fixed-column PDB file format

to_seqres()[source]

Return copy of chain with main index set to SEQRES numbering. Residues that do not have a SEQRES id will be dropped.

Returns:Chain with seqres IDs as main index
Return type:Chain
class evcouplings.compare.pdb.ClassicPDB(structure)[source]

Bases: object

Class to handle “classic” PDB and mmCIF formats (for new mmtf format see PDB class above). Wraps around Biopython PDB functionality to provide a consistent interface.

Unlike the PDB class (based on mmtf), this object will not be able to extract SEQRES indices corresponding to ATOM-record residue indices.

classmethod from_file(filename, file_format='pdb')[source]

Initialize structure from PDB/mmCIF file

Parameters:
  • filename (str) – Path of file
  • file_format ({"pdb", "cif"}, optional (default: "pdb")) – Format of structure (old PDB format or mmCIF)
Returns:

Initialized PDB structure

Return type:

ClassicPDB

classmethod from_id(pdb_id)[source]

Initialize structure by PDB ID (fetches structure from RCSB servers)

Parameters:pdb_id (str) – PDB identifier (e.g. 1hzx)
Returns:initialized PDB structure
Return type:PDB
get_chain(chain, model=0)[source]

Extract residue information and atom coordinates for a given chain in PDB structure

Parameters:
  • chain (str) – Name of chain to be extracted (e.g. “A”)
  • model (int, optional (default: 0)) – Index of model to be extracted
Returns:

Chain object containing DataFrames listing residues and atom coordinates

Return type:

Chain

class evcouplings.compare.pdb.PDB(mmtf)[source]

Bases: object

Wrapper around PDB MMTF decoder object to access residue and coordinate information

classmethod from_file(filename)[source]

Initialize structure from MMTF file

Parameters:filename (str) – Path of MMTF file
Returns:initialized PDB structure
Return type:PDB
classmethod from_id(pdb_id)[source]

Initialize structure by PDB ID (fetches structure from RCSB servers)

Parameters:pdb_id (str) – PDB identifier (e.g. 1hzx)
Returns:initialized PDB structure
Return type:PDB
get_chain(chain, model=0)[source]

Extract residue information and atom coordinates for a given chain in PDB structure

Parameters:
  • chain (str) – Name of chain to be extracted (e.g. “A”)
  • model (int, optional (default: 0)) – Index of model to be extracted
Returns:

Chain object containing DataFrames listing residues and atom coordinates

Return type:

Chain

evcouplings.compare.pdb.load_structures(pdb_ids, structure_dir=None, raise_missing=True)[source]

Load PDB structures from files / web

Parameters:
  • pdb_ids (Iterable) – List / iterable containing PDB identifiers to be loaded.
  • structure_dir (str, optional (default: None)) – Path to directory with structures. Structures filenames must be in the format 5p21.mmtf. If a file can not be found, will try to fetch from web instead.
  • raise_missing (bool, optional (default: True)) – Raise a ResourceError exception if any of the PDB IDs cannot be loaded. If False, missing entries will be ignored.
Returns:

structures – Dictionary containing loaded structures. Keys (PDB identifiers) will be lower-case.

Return type:

dict(str -> PDB)

Raises:

ResourceError – Raised if raise_missing is True and any of the given PDB IDs cannot be loaded.

evcouplings.compare.protocol module

evcouplings.compare.sifts module

Uniprot to PDB structure identification and index mapping using the SIFTS database (https://www.ebi.ac.uk/pdbe/docs/sifts/)

This functionality is centered around the pdb_chain_uniprot.csv table available from SIFTS. (ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/flatfiles/csv/pdb_chain_uniprot.csv.gz)

Authors:
Thomas A. Hopf Anna G. Green (find_homologs) Chan Kang (find_homologs)
class evcouplings.compare.sifts.SIFTS(sifts_table_file, sequence_file=None)[source]

Bases: object

Provide Uniprot to PDB mapping data and functions starting from SIFTS mapping table.

by_alignment(min_overlap=20, reduce_chains=False, **kwargs)[source]

Find structures by sequence alignment between query sequence and sequences in PDB.

Parameters:
  • min_overlap (int, optional (default: 20)) – Require at least this many aligned positions with the target structure
  • reduce_chains (bool, optional (Default: True)) – If true, keep only first chain per PDB ID (i.e. remove redundant occurrences of same protein in PDB structures). Should be set to False to identify homomultimeric contacts.
  • **kwargs

    Defines the behaviour of find_homologs() function used to find homologs by sequence alignment: - which alignment method is used

    (pdb_alignment_method: {“jackhmmer”, “hmmsearch”}, default: “jackhmmer”),
    • parameters passed into the protocol for the selected alignment method (evcouplings.align.jackhmmer_search or evcouplings.align.hmmbuild_and_search).

      Default parameters are set in the HMMER_CONFIG string in this module, other parameters will need to be overriden; these minimally are: - for pdb_alignment_method == “jackhmmer”:

      • sequence_id : str, identifier of target sequence
      • jackhmmer : str, path to jackhmmer binary if not on path
      • for pdb_alignment_method == “hmmsearch”: - sequence_id : str, identifier of target sequence - raw_focus_alignment_file : str, path to input alignment file - hmmbuild : str, path to hmmbuild binary if not on path - hmmsearch : str, path to search binary if not on path
    • additionally, if “prefix” is given, individual mappings will be saved to files suffixed by the respective key in mapping table.
Returns:

Record of hits and mappings found for this query sequence by alignment. See by_pdb_id() for detailed explanation of fields.

Return type:

SIFTSResult

by_pdb_id(pdb_id, pdb_chain=None, uniprot_id=None)[source]

Find structures and mapping by PDB id and chain name

Parameters:
  • pdb_id (str) – 4-letter PDB identifier
  • pdb_chain (str, optional (default: None)) – PDB chain name (if not given, all chains for PDB entry will be returned)
  • uniprot_id (str, optional (default: None)) – Filter to keep only this Uniprot accession number or identifier (necessary for chimeras, or multi-chain complexes with different proteins)
Returns:

Identified hits plus index mappings to Uniprot

Return type:

SIFTSResult

Raises:

ValueError – If selected segments in PDB file do not unambigously map to one Uniprot entry

by_uniprot_id(uniprot_id, reduce_chains=False)[source]

Find structures and mapping by Uniprot access number.

Parameters:
  • uniprot_ac (str) – Find PDB structures for this Uniprot accession number. If sequence_file was given while creating the SIFTS object, Uniprot identifiers can also be used.
  • reduce_chains (bool, optional (Default: True)) – If true, keep only first chain per PDB ID (i.e. remove redundant occurrences of same protein in PDB structures). Should be set to False to identify homomultimeric contacts.
Returns:

Record of hits and mappings found for this Uniprot protein. See by_pdb_id() for detailed explanation of fields.

Return type:

SIFTSResult

create_sequence_file(output_file, chunk_size=1000, max_retries=100)[source]

Create FASTA sequence file containing all UniProt sequences of proteins in SIFTS. This file is required for homology-based structure identification and index remapping. This function will also automatically associate the sequence file with the SIFTS object.

Parameters:
  • output_file (str) – Path at which to store sequence file
  • chunk_size (int, optional (default: 1000)) – Retrieve sequences from UniProt in chunks of this size (too large chunks cause the mapping service to stall)
  • max_retries (int, optional (default: 100)) – Allow this many retries when fetching sequences from UniProt ID mapping service, which unfortunately often suffers from connection failures.
class evcouplings.compare.sifts.SIFTSResult(hits, mapping)[source]

Bases: object

Store results of SIFTS structure/mapping identification.

(Full class defined for easify modification of fields)

evcouplings.compare.sifts.fetch_uniprot_mapping(ids, from_='ACC', to='ACC', format='fasta')[source]

Fetch data from UniProt ID mapping service (e.g. download set of sequences)

Parameters:
  • ids (list(str)) – List of UniProt identifiers for which to retrieve mapping
  • from (str, optional (default: "ACC")) – Source identifier (i.e. contained in “ids” list)
  • to (str, optional (default: "ACC")) – Target identifier (to which source should be mapped)
  • format (str, optional (default: "fasta")) – Output format to request from Uniprot server
Returns:

Response from UniProt server

Return type:

str

evcouplings.compare.sifts.find_homologs(pdb_alignment_method='jackhmmer', **kwargs)[source]

Identify homologs using jackhmmer or hmmbuild/hmmsearch

Parameters:
  • pdb_alignment_method ({"jackhmmer", "hmmsearch"},) – optional (default: “jackhmmer”) Sequence alignment method used for searching the PDB
  • **kwargs – Passed into jackhmmer / hmmbuild_and_search protocol (see documentation for available options)
Returns:

  • ali (evcouplings.align.Alignment) – Alignment of homologs of query sequence in sequence database
  • hits (pandas.DataFrame) – Tabular representation of hits