Welcome to EVcouplings’s documentation!¶
Alignment¶
evcouplings.align package¶
evcouplings.align.alignment module¶
evcouplings.align.pfam module¶
Code for identifying Pfam domains and mapping Pfam alignments and ECs into target sequence mode.
- Authors:
- Thomas A. Hopf
Todo
- Write code to create list of family sizes
- Implement alignments against Pfam-HMM so precomputed results can be reused in focus mode
-
evcouplings.align.pfam.
create_family_size_table
(full_pfam_file, outfile=None)[source]¶ Parse family size table from Pfam flat file (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.full.gz)
Parameters: Returns: Parsed Pfam table.
Return type: pd.DataFrame
-
evcouplings.align.pfam.
pfam_hits
(query_file, hmm_database, prefix, clan_table_file, size_table_file, resolve_overlaps=True, **kwargs)[source]¶ Identify hits of Pfam HMMs in a set of sequences.
Parameters: - query_file (str) – File containing query sequence(s)
- hmm_database (str) – File containing HMM database (Pfam-A.hmm, with hmmpress applied)
- prefix (str) – Prefix path for output files. Folder structure in the prefix will be created if not existing.
- clan_table_file (str) – File with table linking Pfam families to clans (Pfam-A.clans.tsv). Set to None if not available, but resolve_overlaps cannot be True in that case.
- size_table_file (str) – File with table of family sizes. Create using create_family_size_table(). Set to None if not available.
- resolve_overlaps (bool) – Resolve overlapping hits by families from the same clan. Only possible if clan_table_file is given.
- **kwargs (dict) – kwargs passed on to evcouplings.align.tools.run_hmmscan
Returns: Pfam hit table
Return type: pd.DataFrame
-
evcouplings.align.pfam.
remove_clan_overlaps
(pfam_table)[source]¶ Remove overlapping Pfam hits from same Pfam clan (equivalent of PfamScan.pl). Currently only allows to remove overlaps by domain bitscore.
Todo
is bitscore the most sensible choice if different length hits?
Parameters: pfam_table (pd.DataFrame) – Pfam hit table as generated by pfam_hits() function (must contain Pfam clan annotation). Returns: Pfam hit table with lower-scoring overlaps removed Return type: pd.DataFrame
evcouplings.align.protocol module¶
Protein sequence alignment creation protocols/workflows.
- Authors:
- Thomas A. Hopf Anna G. Green - complex protocol, hmm_build_and_search Chan Kang - hmm_build_and_search
-
evcouplings.align.protocol.
complex
(**kwargs)[source]¶ Protocol:
Run monomer alignment protocol and postprocess it for EVcomplex calculations
Parameters: kwargs arguments (Mandatory) – See list below in code where calling check_required Returns: outcfg – Output configuration of the alignment protocol, and the following additional field: - genome_location_file : path to file containing
- the genomic locations for CDs’s corresponding to identifiers in the alignment.
Return type: dict
-
evcouplings.align.protocol.
cut_sequence
(sequence, sequence_id, region=None, first_index=None, out_file=None)[source]¶ Cut a given sequence to sub-range and save it in a file
Parameters: - sequence (str) – Full sequence that will be cut
- sequence_id (str) – Identifier of sequence, used to construct header in output file
- region (tuple(int, int), optional (default: None)) – Region that will be cut out of full sequence. If None, full sequence will be returned.
- first_index (int, optional (default: None)) – Define index of first position in sequence. Will be set to 1 if None.
- out_file (str, optional (default: None)) – Save sequence in a FASTA file (header: >sequence_id/start_region-end_region)
Returns: - str – Subsequence contained in region
- tuple(int, int) – Region. If no input region is given, this will be (1, len(sequence)); otherwise, the input region is returned.
Raises: InvalidParameterError
– Upon invalid region specification (violating boundaries of sequence)
-
evcouplings.align.protocol.
describe_coverage
(alignment, prefix, first_index, minimum_column_coverage)[source]¶ Produce “classical” buildali coverage statistics, i.e. number of sequences, how many residues have too many gaps, etc.
Only to be applied to alignments focused around the target sequence.
Parameters: - alignment (Alignment) – Alignment for which coverage statistics will be calculated
- prefix (str) – Prefix of alignment file that will be stored as identifier in table
- first_index (int) – Sequence index of first position of target sequence
- minimum_column_coverage (Iterable(float) or float) –
Minimum column coverage threshold(s) that will be tested (creating one row for each threshold in output table).
Note
int
values given to this function instead of a float will be divided by 100 to create the corresponding floating point representation. This parameter is 1.0 - maximum fraction of gaps per column.
Returns: Table with coverage statistics for different gap thresholds
Return type: pd.DataFrame
-
evcouplings.align.protocol.
describe_frequencies
(alignment, first_index, target_seq_index=None)[source]¶ Get parameters of alignment such as gaps, coverage, conservation and summarize.
Parameters: - alignment (Alignment) – Alignment for which description statistics will be calculated
- first_index (int) – Sequence index of first residue in target sequence
- target_seq_index (int, optional (default: None)) – If given, will add the symbol in the target sequence into a separate column of the output table
Returns: Table detailing conservation and symbol frequencies for all positions in the alignment
Return type: pandas.DataFrame
-
evcouplings.align.protocol.
describe_seq_identities
(alignment, target_seq_index=0)[source]¶ Calculate sequence identities of any sequence to target sequence and create result dataframe.
Parameters: alignment (Alignment) – Alignment for which description statistics will be calculated Returns: Table giving the identity to target sequence for each sequence in alignment (in order of occurrence) Return type: pandas.DataFrame
-
evcouplings.align.protocol.
existing
(**kwargs)[source]¶ Protocol:
Use external sequence alignment and extract all relevant information from there (e.g. sequence, region, etc.), then apply gap & fragment filtering as usual
Parameters: kwargs arguments (Mandatory) – See list below in code where calling check_required Returns: outcfg – Output configuration of the pipeline, including the following fields: - sequence_id (passed through from input)
- alignment_file
- raw_focus_alignment_file
- statistics_file
- sequence_file
- first_index
- target_sequence_file
- annotation_file (None)
- frequencies_file
- identities_file
- focus_mode
- focus_sequence
- segments
Return type: dict
-
evcouplings.align.protocol.
extract_header_annotation
(alignment, from_annotation=True)[source]¶ Extract Uniprot/Uniref sequence annotation from Stockholm file (as output by jackhmmer). This function may not work for other formats.
Parameters: - alignment (Alignment) – Multiple sequence alignment object
- from_annotation (bool, optional (default: True)) – Use annotation line (in Stockholm file) rather than sequence ID line (e.g. in FASTA file)
Returns: Table containing all annotation (one row per sequence in alignment, in order of occurrence)
Return type: pandas.DataFrame
-
evcouplings.align.protocol.
fetch_sequence
(sequence_id, sequence_file, sequence_download_url, out_file)[source]¶ Fetch sequence either from database based on identifier, or from input sequence file.
Parameters: - sequence_id (str) – Identifier of sequence that should be retrieved
- sequence_file (str) – File containing sequence. If None, sqeuence will be downloaded from sequence_download_url
- sequence_download_url (str) – URL from which to download missing sequence. Must contain “{}” at the position where sequence ID will be inserted into download URL (using str.format).
- out_file (str) – Output file in which sequence will be stored, if sequence_file is not existing.
Returns: - str – Path of file with stored sequence (can be sequence_file or out_file)
- tuple (str, str) – Identifier of sequence as stored in file, and sequence
-
evcouplings.align.protocol.
hmmbuild_and_search
(**kwargs)[source]¶ Protocol:
Build HMM from sequence alignment using hmmbuild and search against a sequence database using hmmsearch.
Parameters: kwargs arguments (Mandatory) – See list below in code where calling check_required Returns: outcfg – Output configuration of the protocol, including the following fields: - target_sequence_file
- sequence_file
- raw_alignment_file
- hittable_file
- focus_mode
- focus_sequence
- segments
Return type: dict
-
evcouplings.align.protocol.
jackhmmer_search
(**kwargs)[source]¶ Protocol:
Iterative jackhmmer search against a sequence database.
Parameters: kwargs arguments (Mandatory) – See list below in code where calling check_required :param .. todo::: explain meaning of parameters in detail.
Returns: outcfg – Output configuration of the protocol, including the following fields: - sequence_id (passed through from input)
- first_index (passed through from input)
- target_sequence_file
- sequence_file
- raw_alignment_file
- hittable_file
- focus_mode
- focus_sequence
- segments
Return type: dict
-
evcouplings.align.protocol.
modify_alignment
(focus_ali, target_seq_index, target_seq_id, region_start, **kwargs)[source]¶ Apply pairwise identity filtering, fragment filtering, and exclusion of columns with too many gaps to a sequence alignment. Also generates files describing properties of the alignment such as frequency distributions, conservation, and “old-style” alignment statistics files.
Note
assumes focus alignment (otherwise unprocessed) as input.
Todo
come up with something more clever to filter fragments than fixed width (e.g. use 95% quantile of length distribution as reference point)
Parameters: - focus_ali (Alignment) – Focus-mode input alignment
- target_seq_index (int) – Index of target sequence in alignment
- target_seq_id (str) – Identifier of target sequence (without range)
- region_start (int) – Index of first sequence position in target sequence
- kwargs (See required arguments in source code) –
Returns: - outcfg (Dict) – File products generated by the function:
- alignment_file
- statistics_file
- frequencies_file
- identities_file
- raw_focus_alignment_file
- ali (Alignment) – Final processed alignment
-
evcouplings.align.protocol.
run
(**kwargs)[source]¶ Run alignment protocol to generate multiple sequence alignment from input sequence.
Parameters: - kwargs arguments (Mandatory) – protocol: Alignment protocol to run prefix: Output prefix for all generated files
- Optional –
Returns: - Alignment
- Dictionary with results of stage in following fields (in brackets - not returned by all protocols) –
- alignment_file
- [raw_alignment_file]
- statistics_file
- target_sequence_file
- sequence_file
- [annotation_file]
- frequencies_file
- identities_file
- [hittable_file]
- focus_mode
- focus_sequence
- segments
-
evcouplings.align.protocol.
search_thresholds
(use_bitscores, seq_threshold, domain_threshold, seq_len)[source]¶ Set homology search inclusion parameters.
- HMMER hits get included in the HMM according to a two-step rule
- sequence passes sequence-level treshold
- domain passes domain-level threshold
- Therefore, search thresholds are set based on the following logic:
- If only sequence threshold is given, a MissingParameterException is raised
- If only bitscore threshold is given, sequence threshold is set to the same
- If both thresholds are given, they are according to defined values
- Valid inputs for bitscore thresholds:
- int or str: taken as absolute score threshold
- float: taken as relative threshold (absolute threshold derived by
multiplication with domain length)
- Valid inputs for integer thresholds:
- int: Used as negative exponent, threshold will be set to 1E-<exponent>
- float or str: Interpreted literally
Parameters: - use_bitscores (bool) – Use bitscore threshold instead of E-value threshold
- domain_threshold (str or int or float) – Domain-level threshold. See rules above.
- seq_threshold (str or int or float) – Sequence-level threshold. See rules above.
- seq_len (int) – Length of sequence. Used to calculate absolute bitscore threshold for relative bitscore thresholds.
Returns: Sequence- and domain-level thresholds ready to be fed into HMMER
Return type:
-
evcouplings.align.protocol.
standard
(**kwargs)[source]¶ Protocol:
Standard buildali4 workflow (run iterative jackhmmer search against sequence database, than determine which sequences and columns to include in the calculation based on coverage and maximum gap thresholds).
Parameters: kwargs arguments (Mandatory) – See list below in code where calling check_required Returns: - outcfg (dict) – Output configuration of the pipeline, including
the following fields:
- sequence_id (passed through from input)
- first_index (passed through from input)
- alignment_file
- raw_alignment_file
- raw_focus_alignment_file
- statistics_file
- target_sequence_file
- sequence_file
- annotation_file
- frequencies_file
- identities_file
- hittable_file
- focus_mode
- focus_sequence
- segments
- ali (Alignment) – Final sequence alignment
- outcfg (dict) – Output configuration of the pipeline, including
the following fields:
evcouplings.align.tools module¶
Wrappers for running external sequence alignment tools
- Authors:
- Thomas A. Hopf Anna G. Green - run_hmmbuild, run_hmmsearch Chan Kang - run_hmmbuild, run_hmmsearch
-
class
evcouplings.align.tools.
HmmbuildResult
(prefix, hmmfile, output)¶ Bases:
tuple
-
hmmfile
¶ Alias for field number 1
-
output
¶ Alias for field number 2
-
prefix
¶ Alias for field number 0
-
-
class
evcouplings.align.tools.
HmmscanResult
(prefix, output, tblout, domtblout, pfamtblout)¶ Bases:
tuple
-
domtblout
¶ Alias for field number 3
-
output
¶ Alias for field number 1
-
pfamtblout
¶ Alias for field number 4
-
prefix
¶ Alias for field number 0
-
tblout
¶ Alias for field number 2
-
-
class
evcouplings.align.tools.
HmmsearchResult
(prefix, alignment, output, tblout, domtblout)¶ Bases:
tuple
-
alignment
¶ Alias for field number 1
-
domtblout
¶ Alias for field number 4
-
output
¶ Alias for field number 2
-
prefix
¶ Alias for field number 0
-
tblout
¶ Alias for field number 3
-
-
class
evcouplings.align.tools.
JackhmmerResult
(prefix, alignment, output, tblout, domtblout)¶ Bases:
tuple
-
alignment
¶ Alias for field number 1
-
domtblout
¶ Alias for field number 4
-
output
¶ Alias for field number 2
-
prefix
¶ Alias for field number 0
-
tblout
¶ Alias for field number 3
-
-
evcouplings.align.tools.
read_hmmer_domtbl
(filename)[source]¶ Read a HMMER domtbl file into DataFrame.
Parameters: filename (str) – Path of domtbl file Returns: DataFrame with parsed domtbl Return type: pd.DataFrame
-
evcouplings.align.tools.
read_hmmer_tbl
(filename)[source]¶ Read a HMMER tbl file into DataFrame.
Parameters: filename (str) – Path of tbl file Returns: DataFrame with parsed tbl Return type: pd.DataFrame
-
evcouplings.align.tools.
run_hhfilter
(input_file, output_file, threshold=95, columns='a2m', binary='hhfilter')[source]¶ Redundancy-reduce a sequence alignment using hhfilter from the HHsuite alignment suite.
Parameters: - input_file (str) – Path to input alignment in A2M/FASTA format
- output_file (str) – Path to output alignment (will be in A3M format)
- threshold (int, optional (default: 95)) – Sequence identity threshold for maximum pairwise identity (between 0 and 100)
- columns ({"first", "a2m"}, optional (default: "a2m")) – Definition of match columns (based on first sequence or upper-case columns (a2m))
- binary (str) – Path to hhfilter binary
Returns: output_file
Return type: Raises: ResourceError
– If output alignment is non-existent/emptyValueError
– Upon invalid value of columns parameter
-
evcouplings.align.tools.
run_hmmbuild
(alignment_file, prefix, cpu=None, stdout_redirect=None, symfrac=None, binary='hmmbuild')[source]¶ Profile HMM construction from multiple sequence alignments Refer to HMMER documentation for details.
http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf
Parameters: - alignment_file (str) – File containing the multiple sequence alignment. Can be in Stockholm, a2m, or clustal formats, or any other format recognized by hmmer. Please note that ALL POSITIONS above the symfrac cutoff will be used in HMM construction (if the alignment contains columns that are insertions relative to the query sequence, this may be problematic for structure comparison)
- prefix (str) – Prefix path for output files. Folder structure in the prefix will be created if not existing.
- cpu (int, optional (default: None)) – Number of CPUs to use for search. Uses all if None.
- stdout_redirect (str, optional (default: None)) – Redirect bulky stdout instead of storing with rest of results (use “/dev/null” to dispose)
- symfrac (float, optional (default: None)) – range 0.0 - 1.0, HMMbuild will use columns with > symfrac percent gaps to construct the HMM. If None provided, HMMbuild internal default is 0.5. (Note: this is calculated after their internal sequence weighting is calculated)
- binary (str (default: "hmmbuild")) – Path to jackhmmer binary (put in PATH for default to work)
Returns: namedtuple with fields corresponding to the different output files (prefix, alignment, output, tblout, domtblout)
Return type: Raises: ExternalToolError, ResourceError
-
evcouplings.align.tools.
run_hmmscan
(query, database, prefix, use_model_threshold=True, threshold_type='cut_ga', use_bitscores=True, domain_threshold=None, seq_threshold=None, nobias=False, cpu=None, stdout_redirect=None, binary='hmmscan')[source]¶ Run hmmscan of HMMs in database against sequences in query to identify matches of these HMMs. Refer to HMMER Userguide for explanation of these parameters.
Parameters: - query (str) – File containing query sequence(s)
- database (str) – File containing HMM database (prepared with hmmpress)
- prefix (str) – Prefix path for output files. Folder structure in the prefix will be created if not existing.
- use_model_threshold (bool (default: True)) – Use model-specific inclusion thresholds from HMM database rather than global bitscore/E-value thresholds (use_bitscores, domain_threshold and seq_threshold are overriden by this flag).
- threshold-type ({"cut_ga", "cut_nc", "cut_tc"} (default: "cut_ga")) – Use gathering (default), noise or trusted cutoff to define scan hits. Please refer to HMMER manual for details.
- use_bitscores (bool) – Use bitscore inclusion thresholds rather than E-values. Overriden by use_model_threshold flag.
- domain_threshold (int or float or str) – Inclusion threshold applied on the domain level (e.g. “1E-03” or 0.001 or 50)
- seq_threshold (int or float or str) – Inclusion threshold applied on the sequence level (e.g. “1E-03” or 0.001 or 50)
- nobias (bool, optional (default: False)) – Turn of bias correction
- cpu (int, optional (default: None)) – Number of CPUs to use for search. Uses all if None.
- stdout_redirect (str, optional (default: None)) – Redirect bulky stdout instead of storing with rest of results (use “/dev/null” to dispose)
- binary (str (default: "hmmscan")) – Path to hmmscan binary (put in PATH for default to work)
Returns: namedtuple with fields corresponding to the different output files (prefix, output, tblout, domtblout, pfamtblout)
Return type: Raises: ExternalToolError, ResourceError
-
evcouplings.align.tools.
run_hmmsearch
(hmmfile, database, prefix, use_bitscores, domain_threshold, seq_threshold, nobias=False, cpu=None, stdout_redirect=None, binary='hmmsearch')[source]¶ Search profile(s) against a sequence database. Refer to HMMER documentation for details.
http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf
Parameters: - hmmfile (str) – File containing the profile(s)
- database (str) – File containing sequence database
- prefix (str) – Prefix path for output files. Folder structure in the prefix will be created if not existing.
- use_bitscores (bool) – Use bitscore inclusion thresholds rather than E-values.
- domain_threshold (int or float or str) – Inclusion threshold applied on the domain level (e.g. “1E-03” or 0.001 or 50)
- seq_threshold (int or float or str) – Inclusion threshold applied on the sequence level (e.g. “1E-03” or 0.001 or 50)
- nobias (bool, optional (default: False)) – Turn of bias correction
- cpu (int, optional (default: None)) – Number of CPUs to use for search. Uses all if None.
- stdout_redirect (str, optional (default: None)) – Redirect bulky stdout instead of storing with rest of results (use “/dev/null” to dispose)
- binary (str (default: "hmmsearch")) – Path to jackhmmer binary (put in PATH for default to work)
Returns: namedtuple with fields corresponding to the different output files (prefix, alignment, output, tblout, domtblout)
Return type: Raises: ExternalToolError, ResourceError
-
evcouplings.align.tools.
run_jackhmmer
(query, database, prefix, use_bitscores, domain_threshold, seq_threshold, iterations=5, nobias=False, cpu=None, stdout_redirect=None, checkpoints_hmm=False, checkpoints_ali=False, binary='jackhmmer')[source]¶ Run jackhmmer sequence search against target database. Refer to HMMER Userguide for explanation of these parameters.
Parameters: - query (str) – File containing query sequence
- database (str) – File containing sequence database
- prefix (str) – Prefix path for output files. Folder structure in the prefix will be created if not existing.
- use_bitscores (bool) – Use bitscore inclusion thresholds rather than E-values.
- domain_threshold (int or float or str) – Inclusion threshold applied on the domain level (e.g. “1E-03” or 0.001 or 50)
- seq_threshold (int or float or str) – Inclusion threshold applied on the sequence level (e.g. “1E-03” or 0.001 or 50)
- iterations (int) – number of jackhmmer search iterations
- nobias (bool, optional (default: False)) – Turn of bias correction
- cpu (int, optional (default: None)) – Number of CPUs to use for search. Uses all if None.
- stdout_redirect (str, optional (default: None)) – Redirect bulky stdout instead of storing with rest of results (use “/dev/null” to dispose)
- checkpoints_hmm (bool, optional (default: False)) – Store checkpoint HMMs to prefix.<iter>.hmm
- checkpoints_ali (bool, optional (default: False)) – Store checkpoint alignments to prefix.<iter>.sto
- binary (str (default: "jackhmmer")) – Path to jackhmmer binary (put in PATH for default to work)
Returns: namedtuple with fields corresponding to the different output files (prefix, alignment, output, tblout, domtblout)
Return type: Raises: ExternalToolError, ResourceError
Couplings Analysis¶
evcouplings.compare package¶
evcouplings.compare.distances module¶
Distance calculations on PDB 3D coordinates
- Authors:
- Thomas A. Hopf Anna G. Green (remap_complex_chains)
-
class
evcouplings.compare.distances.
DistanceMap
(residues_i, residues_j, dist_matrix, symmetric)[source]¶ Bases:
object
Compute, store and accesss pairwise residue distances in PDB 3D structures
-
classmethod
aggregate
(*matrices, intersect=False, agg_func=<MagicMock id='139675017884048'>)[source]¶ Aggregate with other distance map(s). Secondary structure will be aggregated by assigning the most frequent state across all distance matrices; if there are equal counts, H (helix) will be chosen over E (strand) over C (coil).
Parameters: - *matrices (DistanceMap) –
*args-style
list of DistanceMaps that will be aggregated.Note
The id column of each axis may only contain numeric residue ids (and no characters such as insertion codes)
- intersect (bool, optional (default: False)) – If True, intersect indices of the given distance maps. Otherwise, union of indices will be used.
- agg_func (function (default: numpy.nanmin)) – Function that will be used to aggregate distance matrices. Needs to take a parameter “axis” to aggregate over all matrices.
Returns: Aggregated distance map
Return type: Raises: ValueError
– If residue identifiers are not numeric, or if intersect is True, but positions on axis do not overlap.- *matrices (DistanceMap) –
-
contacts
(max_dist=5.0, min_dist=None)[source]¶ Return list of pairs below distance threshold
Parameters: Returns: contacts – Table with residue-residue contacts, with the following columns:
- id_i: identifier of residue in chain i
- id_j: identifier of residue in chain j
- dist: pair distance
Return type: pandas.DataFrame
-
dist
(i, j, raise_na=True)[source]¶ Return distance of residue pair
Parameters: Returns: Distance of pair (i, j). If raise_na is False and identifiers are not valid, distance will be np.nan
Return type: np.float
Raises: KeyError
– If index i or j is not a valid identifier for respective chain
-
classmethod
from_coords
(chain_i, chain_j=None)[source]¶ Compute distance matrix from PDB chain coordinates.
Parameters: Returns: Distance map computed from given coordinates
Return type:
-
classmethod
from_file
(filename)[source]¶ Load existing distance map using filename prefix (each distance map consist of .csv and .npy file)
Parameters: filename (str) – Prefix of path to distance map files (excluding .csv/.npy) Returns: Loaded distance map Return type: DistanceMap
-
classmethod
from_files
(residue_table_file, distance_matrix_file)[source]¶ Load existing distance map with explicit paths to residue table (.csv) and distance matrix (.npy). Use DistanceMap.from_file to load using joint prefix of both files.
Parameters: Returns: Loaded distance map
Return type:
-
structure_coverage
()[source]¶ Find covered residue segments for individual structures that this distance map was computed from (either directly from structure or through aggregation of multiple structures). Only works if all residue identifiers of DistanceMap are numeric (i.e., do not have insertion codes)
Returns: coverage – Returns tuples of the form (coverage_i, coverage_j, coverage_id), where * coverage_i and coverage_j are lists of tuples (segment_start, segment_end) of residue coverage along axis i and j, with segment_end being included in the range- coverage_id is the identifier of the individual substructure the coverage segments belong to (only set if an aggregated structure, None otherwise)
Return type: list of tuple
-
to_file
(filename)[source]¶ Store distance map in files
Parameters: filename (str) – Prefix of distance map files (will create .csv and .npy file) Returns: - residue_table_filename (str) – Path to residue table (will be filename + .csv)
- dist_mat_filename (str) – Path to distance matrix file in numpy format (will be filename + .npy)
-
transpose
()[source]¶ Transpose distance map (i.e. swap axes)
Returns: Transposed copy of distance map Return type: DistanceMap
-
classmethod
-
evcouplings.compare.distances.
inter_dists
(sifts_result_i, sifts_result_j, structures=None, atom_filter=None, intersect=False, output_prefix=None, model=0, raise_missing=True)[source]¶ Compute inter-chain distances (between different entities) in PDB file. Resulting distance map is typically not symmetric, with either axis corresponding to either chain. Inter-distances are calculated on all combinations of chains that have the same PDB id in sifts_result_i and sifts_result_j.
Parameters: - sifts_result_i (SIFTSResult) – Input structures and mapping to use for first axis of computed distance map
- sifts_result_j (SIFTSResult) – Input structures and mapping to use for second axis of computed distance map
- structures (str or dict, optional (default: None)) –
- If str: Load structures from directory this string points to. Missing structures will be fetched from web.
- If dict: dictionary with lower-case PDB ids as keys and PDB objects as values. This dictionary has to contain all necessary structures, missing ones will not be fetched. This dictionary can be created using pdb.load_structures.
- atom_filter (str, optional (default: None)) – Filter coordinates to contain only these atoms. E.g. set to “CA” to compute C_alpha - C_alpha distances instead of minimum atom distance over all atoms in both residues.
- intersect (bool, optional (default: False)) – If True, intersect indices of the given distance maps. Otherwise, union of indices will be used.
- output_prefix (str, optional (default: None)) – If given, save individual contact maps to files prefixed with this string. The appended file suffixes map to row index in sifts_results.hits
- model (int, optional (default: 0)) – Index of model in PDB structure that should be used
- raise_missing (bool, optional (default: True)) – Raise a ResourceError if any of the input structures can not be loaded; otherwise, ignore missing entries.
Returns: agg_distmap – Computed aggregated distance map across all input structures
If output_prefix is given, agg_distmap will have an additional attribute individual_distance_map_table:
pd.DataFrame with all individual distance maps that went into the aggregated distance map, with columns “sifts_table_index_i”, “sifts_table_index_j” (linking to SIFTS hit table) and “residue_table” and “distance_matrix” (file names of .csv and .npy files constituting the respective distance map).
Will be None if output_prefix is None.
Return type: Raises: ValueError
– If sifts_result_i or sifts_result_j is empty (no structure hits)ResourceError
– If any structure could not be loaded and raise_missing is True
-
evcouplings.compare.distances.
intra_dists
(sifts_result, structures=None, atom_filter=None, intersect=False, output_prefix=None, model=0, raise_missing=True)[source]¶ Compute intra-chain distances in PDB files.
Parameters: - sifts_result (SIFTSResult) – Input structures and mapping to use for distance map calculation
- structures (str or dict, optional (default: None)) –
If str: Load structures from directory this string points to. Missing structures will be fetched from web.
If dict: dictionary with lower-case PDB ids as keys and PDB objects as values. This dictionary has to contain all necessary structures, missing ones will not be fetched. This dictionary can be created using pdb.load_structures.
- atom_filter (str, optional (default: None)) – Filter coordinates to contain only these atoms. E.g. set to “CA” to compute C_alpha - C_alpha distances instead of minimum atom distance over all atoms in both residues.
- intersect (bool, optional (default: False)) – If True, intersect indices of the given distance maps. Otherwise, union of indices will be used.
- output_prefix (str, optional (default: None)) – If given, save individual contact maps to files prefixed with this string. The appended file suffixes map to row index in sifts_results.hits
- model (int, optional (default: 0)) – Index of model in PDB structure that should be used
- raise_missing (bool, optional (default: True)) – Raise a ResourceError if any of the input structures can not be loaded; otherwise, ignore missing entries.
Returns: agg_distmap – Computed aggregated distance map across all input structures
Contains an additional attribute aggregated_residue_maps, a pd.DataFrame with the concatenated residue maps of all individual chains used to compute this DistanceMap. Individual chains are linked to the input sifts_results through the column sifts_table_index.
If output_prefix is given, agg_distmap will have an additional attribute individual_distance_map_table: pd.DataFrame with all individual distance maps that went into the aggregated distance map, with columns “sifts_table_index” (linking to SIFTS hit table) and “residue_table” and “distance_matrix” (file names of .csv and .npy files constituting the respective distance map). Will be None if output_prefix is None.
Return type: Raises: ValueError
– If sifts_result is empty (no structure hits)ResourceError
– If any structure could not be loaded and raise_missing is True
-
evcouplings.compare.distances.
multimer_dists
(sifts_result, structures=None, atom_filter=None, intersect=False, output_prefix=None, model=0, raise_missing=True)[source]¶ Compute homomultimer distances (between repeated copies of the same entity) in PDB file. Resulting distance matrix will be symmetric by minimization over upper and lower triangle of matrix, even if the complex structure is not symmetric.
Parameters: - sifts_result (SIFTSResult) – Input structures and mapping to use for distance map calculation
- structures (str or dict, optional (default: None)) –
If str: Load structures from directory this string points to. Missing structures will be fetched from web.
If dict: dictionary with lower-case PDB ids as keys and PDB objects as values. This dictionary has to contain all necessary structures, missing ones will not be fetched. This dictionary can be created using pdb.load_structures.
- atom_filter (str, optional (default: None)) – Filter coordinates to contain only these atoms. E.g. set to “CA” to compute C_alpha - C_alpha distances instead of minimum atom distance over all atoms in both residues.
- intersect (bool, optional (default: False)) – If True, intersect indices of the given distance maps. Otherwise, union of indices will be used.
- output_prefix (str, optional (default: None)) – If given, save individual contact maps to files prefixed with this string. The appended file suffixes map to row index in sifts_results.hits
- model (int, optional (default: 0)) – Index of model in PDB structure that should be used
- raise_missing (bool, optional (default: True)) – Raise a ResourceError if any of the input structures can not be loaded; otherwise, ignore missing entries.
Returns: agg_distmap – Computed aggregated distance map across all input structures
If output_prefix is given, agg_distmap will have an additional attribute individual_distance_map_table: pd.DataFrame with all individual distance maps that went into the aggregated distance map, with columns “sifts_table_index_i”, “sifts_table_index_j” (linking to SIFTS hit table) and “residue_table” and “distance_matrix” (file names of .csv and .npy files constituting the respective distance map).
Will be None if output_prefix is None.
Return type: Raises: ValueError
– If sifts_result is empty (no structure hits)ResourceError
– If any structure could not be loaded and raise_missing is True
-
evcouplings.compare.distances.
remap_chains
(sifts_result, output_prefix, sequence=None, structures=None, atom_filter=('N', 'CA', 'C', 'O'), model=0, chain_name='A', raise_missing=True)[source]¶ Remap a set of PDB chains into the numbering scheme (and amino acid sequence) of a target sequence (a.k.a. the poorest homology model possible).
(This function is placed here because of close relationship to intra_dists and reusing functionality for it).
Parameters: - sifts_result (SIFTSResult) – Input structures and mapping to use for remapping
- output_prefix (str) – Save remapped structures to files prefixed with this string
- sequence (dict, optional (default: None)) –
Mapping from sequence position (int or str) to residue. If this parameter is given, residues in the output structures will be renamed to the residues in this mapping.
Note
if side-chain residues are not taken off using atom_filter, this will e.g. happily label an actual glutamate as an alanine).
- structures (str or dict, optional (default: None)) –
- If str: Load structures from directory this string points to. Missing structures will be fetched from web.
- If dict: dictionary with lower-case PDB ids as keys and PDB objects as values. This dictionary has to contain all necessary structures, missing ones will not be fetched. This dictionary can be created using pdb.load_structures.
- atom_filter (str, optional (default: ("N", "CA", "C", "O"))) – Filter coordinates to contain only these atoms. If None, will retain all atoms; the default value will only keep backbone atoms.
- model (int, optional (default: 0)) – Index of model in PDB structure that should be used
- chain_name (str, optional (default: "A")) – Rename the PDB chain to this when saving the file. This will not affect the file name, only the name of the chain in the PDB object.
- raise_missing (bool, optional (default: True)) – Raise a ResourceError if any of the input structures can not be loaded; otherwise, ignore missing entries.
Returns: remapped – Mapping from index of each structure hit in sifts_results.hits to filename of stored remapped structure
Return type:
-
evcouplings.compare.distances.
remap_complex_chains
(sifts_result_i, sifts_result_j, sequence_i=None, sequence_j=None, structures=None, atom_filter=('N', 'CA', 'C', 'O'), output_prefix=None, raise_missing=True, chain_name_i='A', chain_name_j='B', model=0)[source]¶ Remap a pair of PDB chains from the same structure into the numbering scheme (and amino acid sequence) of a target sequence.
Parameters: - sifts_result_i (SIFTSResult) – Input structures and mapping to use for remapping
- sifts_result_j (SIFTSResult) – Input structures and mapping to use for remapping
- output_prefix (str) – Save remapped structures to files prefixed with this string
- sequence_i (dict, optional (default: None)) –
Mapping from sequence position (int or str) in the first sequence to residue. If this parameter is given, residues in the output structures will be renamed to the residues in this mapping.
Note
if side-chain residues are not taken off using atom_filter, this will e.g. happily label an actual glutamate as an alanine).
- sequence_j (dict, optional (default: None)) – Same as sequence_j for second sequence.
- structures (str or dict, optional (default: None)) –
- If str: Load structures from directory this string points to. Missing structures will be fetched from web.
- If dict: dictionary with lower-case PDB ids as keys and PDB objects as values. This dictionary has to contain all necessary structures, missing ones will not be fetched. This dictionary can be created using pdb.load_structures.
- atom_filter (str, optional (default: ("N", "CA", "C", "O"))) – Filter coordinates to contain only these atoms. If None, will retain all atoms; the default value will only keep backbone atoms.
- model (int, optional (default: 0)) – Index of model in PDB structure that should be used
- raise_missing (bool, optional (default: True)) – Raise a ResourceError if any of the input structures can not be loaded; otherwise, ignore missing entries.
- chain_name_i (str, optional (default: "A")) – Renames the first chain to this string
- chain_name_j (str, optional (default: "B")) – Renames the second chain to this string
Returns: remapped – Mapping from index of each structure hit in sifts_results.hits to filename of stored remapped structure
Return type: Raises: ValueError
– If sifts_result_i or sifts_result_j is empty (no structure hits)ResourceError
– If any structure could not be loaded and raise_missing is True
evcouplings.compare.ecs module¶
Compare evolutionary couplings to distances in 3D structures
- Authors:
- Thomas A. Hopf
-
evcouplings.compare.ecs.
add_distances
(ec_table, dist_map, target_column='dist')[source]¶ Add pair distances to EC score table
Parameters: - ec_table (pandas.DataFrame) – List of evolutionary couplings, with pair positions in columns i and j
- dist_map (DistanceMap) – Distance map that will be used to annotate distances in ec_table
- target_column (str) – Name of column in which distances will be stored
Returns: Couplings table with added distances in target_column. Pairs where no distance information is available will be np.nan
Return type: pandas.DataFrame
-
evcouplings.compare.ecs.
add_precision
(ec_table, dist_cutoff=5, score='cn', min_sequence_dist=6, target_column='precision', dist_column='dist')[source]¶ Compute precision of evolutionary couplings as predictor of 3D structure contacts
Parameters: - ec_table (pandas.DataFrame) – List of evolutionary couplings
- dist_cutoff (float, optional (default: 5)) – Upper distance cutoff (in Angstrom) for a pair to be considered a true positive contact
- score (str, optional (default: "cn")) – Column which contains coupling score. Table will be sorted in descending order by this score.
- min_sequence_dist (int, optional (default: 6)) – Minimal distance in primary sequence for an EC to be included in precision calculation
- target_column (str, optional (default: "precision")) – Name of column in which precision will be stored
- dist_column (str, optional (default: "dist")) – Name of column which contains pair distances
Returns: EC table with added precision values as a function of EC rank (returned table will be sorted by score column)
Return type: pandas.DataFrame
-
evcouplings.compare.ecs.
coupling_scores_compared
(ec_table, dist_map, dist_map_multimer=None, dist_cutoff=5, output_file=None, score='cn', min_sequence_dist=6)[source]¶ Utility function to create “CouplingScores.csv”-style table
Parameters: - ec_table (pandas.DataFrame) – List of evolutionary couplings
- dist_map (DistanceMap) – Distance map that will be used to annotate distances in ec_table
- dist_map_multimer (DistanceMap, optional (default: None)) – Additional multimer distance map. If given, the distance for any EC pair will be the minimum out of the monomer and multimer distances.
- dist_cutoff (float, optional (default: 5)) – Upper distance cutoff (in Angstrom) for a pair to be considered a true positive contact
- output_file (str, optional (default: None)) – Store final table to this file
- score (str, optional (default: "cn")) – Column which contains coupling score. Table will be sorted in descending order by this score.
- min_sequence_dist (int, optional (default: 6)) – Minimal distance in primary sequence for an EC to be included in precision calculation
Returns: EC table with added distances, and precision if dist_cutoff is given.
Return type: pandas.DataFrame
evcouplings.compare.mapping module¶
Index mapping for PDB structures
- Authors:
- Thomas A. Hopf Charlotta P. Schärfe
-
evcouplings.compare.mapping.
alignment_index_mapping
(alignment_file, format='stockholm', target_seq=None)[source]¶ Create index mapping table between sequence positions based on a sequence alignment.
Parameters: - alignment_file (str) – Path of alignment file containing sequences for which indices shoul dbe mapped
- format ({"stockholm", "fasta"}) – Format of alignment file
- target_seq (str, optional (default: None)) – Identifier of sequence around which the index mapping will be centered. If None, first sequence in alignment will be used.
Returns: Mapping table containing assignment of
- index in target sequence (i)
- symbol in target sequence (A_i)
For all other sequences in alignment, the following two columns:
- index in second sequence (j_<sequence id>)
- symbol in second sequence (A_j_<sequence_id>)
Return type: pandas.DataFrame
-
evcouplings.compare.mapping.
map_indices
(seq_i, start_i, end_i, seq_j, start_j, end_j, gaps=('-', '.'))[source]¶ Compute index mapping between positions in two aligned sequences
Parameters: - seq_i (str) – First aligned sequence
- start_i (int) – Index of first position in first sequence
- end_i (int) – Index of last position in first sequence (used for verification purposes only)
- seq_j (str) – Second aligned sequence
- start_j (int) – Index of first position in second sequence
- end_j (int) – Index of last position in second sequence (used for verification purposes only)
Returns: Mapping table containing assignment of
- index in first sequence (i)
- symbol in first sequence (A_i)
- index in second sequence (j)
- symbol in second sequence (A_j)
Return type: pandas.DataFrame
evcouplings.compare.pdb module¶
PDB structure handling based on MMTF format
- Authors:
- Thomas A. Hopf
-
class
evcouplings.compare.pdb.
Chain
(residues, coords)[source]¶ Bases:
object
Container for PDB chain residue and coordinate information
-
filter_atoms
(atom_name='CA')[source]¶ Filter coordinates of chain, e.g. to compute C_alpha-C_alpha distances
Parameters: atom_name (str or list-like, optional (default: "CA")) – Name(s) of atoms to keep Returns: Chain containing only filtered atoms (and those residues that have such an atom) Return type: Chain
-
filter_positions
(positions)[source]¶ Select a subset of positions from the chain
Parameters: positions (list-like) – Set of residues that will be kept Returns: Chain containing only the selected residues Return type: Chain
-
remap
(mapping, source_id='seqres_id')[source]¶ Remap chain into different numbering scheme (e.g. from seqres to uniprot numbering)
Parameters: - mapping (dict) –
Mapping of residue identifiers from source_id (current main index of PDB chain) to new identifiers.
mapping may either be:
- dict(str -> str) to map individual residue IDs. Keys and values of dictionary will be typecast to string before the mapping, so it is possible to pass in integer values too (if the source or target IDs are numbers)
- dict((int, int) -> (int, int)) to map ranges of numbers to ranges of numbers. This should typically be only used with RESSEQ or UniProt numbering. End index or range is *inclusive* Note that residue IDs in the end will still be handled as strings when mapping.
- source_id ({"seqres_id", "coord_id", "id"}, optional (default: "seqres_id")) – Residue identifier in chain to map *from* (will be used as key to access mapping)
Returns: Chain with remapped numbering (“id” column in residues DataFrame)
Return type: - mapping (dict) –
-
to_file
(fileobj, chain_id='A', end=True, first_atom_id=1)[source]¶ Write chain to a file in PDB format (mmCIF not yet supported).
Note that PDB files written this function may not be 100% compliant with the PDB format standards, in particular:
- some HETATM records may turn into ATOM records when starting from an mmtf file, if the record has a one-letter code (such as MSE / M).
- code does not print TER record at the end of a peptide chain
Parameters: - fileobj (file-like object) – Write to this file handle
- chain_id (str, optional (default: "A")) – Assign this chain name in file (allows to redefine chain name from whatever chain was originally)
- end (bool, optional (default: True)) – Print “END” record after chain (signals end of PDB file)
- first_atom_id (int, optional (default: 1)) – Renumber atoms to start with this index (set to None to keep default indices)
Raises: ValueError
– If atom or residue numbers are too wide and cannot be written to old fixed-column PDB file format
-
-
class
evcouplings.compare.pdb.
ClassicPDB
(structure)[source]¶ Bases:
object
Class to handle “classic” PDB and mmCIF formats (for new mmtf format see PDB class above). Wraps around Biopython PDB functionality to provide a consistent interface.
Unlike the PDB class (based on mmtf), this object will not be able to extract SEQRES indices corresponding to ATOM-record residue indices.
-
classmethod
from_file
(filename, file_format='pdb')[source]¶ Initialize structure from PDB/mmCIF file
Parameters: - filename (str) – Path of file
- file_format ({"pdb", "cif"}, optional (default: "pdb")) – Format of structure (old PDB format or mmCIF)
Returns: Initialized PDB structure
Return type:
-
classmethod
from_id
(pdb_id)[source]¶ Initialize structure by PDB ID (fetches structure from RCSB servers)
Parameters: pdb_id (str) – PDB identifier (e.g. 1hzx) Returns: initialized PDB structure Return type: PDB
-
classmethod
-
class
evcouplings.compare.pdb.
PDB
(mmtf)[source]¶ Bases:
object
Wrapper around PDB MMTF decoder object to access residue and coordinate information
-
classmethod
from_file
(filename)[source]¶ Initialize structure from MMTF file
Parameters: filename (str) – Path of MMTF file Returns: initialized PDB structure Return type: PDB
-
classmethod
from_id
(pdb_id)[source]¶ Initialize structure by PDB ID (fetches structure from RCSB servers)
Parameters: pdb_id (str) – PDB identifier (e.g. 1hzx) Returns: initialized PDB structure Return type: PDB
-
classmethod
-
evcouplings.compare.pdb.
load_structures
(pdb_ids, structure_dir=None, raise_missing=True)[source]¶ Load PDB structures from files / web
Parameters: - pdb_ids (Iterable) – List / iterable containing PDB identifiers to be loaded.
- structure_dir (str, optional (default: None)) – Path to directory with structures. Structures filenames must be in the format 5p21.mmtf. If a file can not be found, will try to fetch from web instead.
- raise_missing (bool, optional (default: True)) – Raise a ResourceError exception if any of the PDB IDs cannot be loaded. If False, missing entries will be ignored.
Returns: structures – Dictionary containing loaded structures. Keys (PDB identifiers) will be lower-case.
Return type: dict(str -> PDB)
Raises: ResourceError
– Raised if raise_missing is True and any of the given PDB IDs cannot be loaded.
evcouplings.compare.protocol module¶
evcouplings.compare.sifts module¶
Uniprot to PDB structure identification and index mapping using the SIFTS database (https://www.ebi.ac.uk/pdbe/docs/sifts/)
This functionality is centered around the pdb_chain_uniprot.csv table available from SIFTS. (ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/flatfiles/csv/pdb_chain_uniprot.csv.gz)
- Authors:
- Thomas A. Hopf Anna G. Green (find_homologs) Chan Kang (find_homologs)
-
class
evcouplings.compare.sifts.
SIFTS
(sifts_table_file, sequence_file=None)[source]¶ Bases:
object
Provide Uniprot to PDB mapping data and functions starting from SIFTS mapping table.
-
by_alignment
(min_overlap=20, reduce_chains=False, **kwargs)[source]¶ Find structures by sequence alignment between query sequence and sequences in PDB.
Parameters: - min_overlap (int, optional (default: 20)) – Require at least this many aligned positions with the target structure
- reduce_chains (bool, optional (Default: True)) – If true, keep only first chain per PDB ID (i.e. remove redundant occurrences of same protein in PDB structures). Should be set to False to identify homomultimeric contacts.
- **kwargs –
Defines the behaviour of find_homologs() function used to find homologs by sequence alignment: - which alignment method is used
(pdb_alignment_method: {“jackhmmer”, “hmmsearch”}, default: “jackhmmer”),- parameters passed into the protocol for the selected
alignment method (evcouplings.align.jackhmmer_search or
evcouplings.align.hmmbuild_and_search).
Default parameters are set in the HMMER_CONFIG string in this module, other parameters will need to be overriden; these minimally are: - for pdb_alignment_method == “jackhmmer”:
- sequence_id : str, identifier of target sequence
- jackhmmer : str, path to jackhmmer binary if not on path
- for pdb_alignment_method == “hmmsearch”: - sequence_id : str, identifier of target sequence - raw_focus_alignment_file : str, path to input alignment file - hmmbuild : str, path to hmmbuild binary if not on path - hmmsearch : str, path to search binary if not on path
- additionally, if “prefix” is given, individual mappings will be saved to files suffixed by the respective key in mapping table.
- parameters passed into the protocol for the selected
alignment method (evcouplings.align.jackhmmer_search or
evcouplings.align.hmmbuild_and_search).
Returns: Record of hits and mappings found for this query sequence by alignment. See by_pdb_id() for detailed explanation of fields.
Return type:
-
by_pdb_id
(pdb_id, pdb_chain=None, uniprot_id=None)[source]¶ Find structures and mapping by PDB id and chain name
Parameters: - pdb_id (str) – 4-letter PDB identifier
- pdb_chain (str, optional (default: None)) – PDB chain name (if not given, all chains for PDB entry will be returned)
- uniprot_id (str, optional (default: None)) – Filter to keep only this Uniprot accession number or identifier (necessary for chimeras, or multi-chain complexes with different proteins)
Returns: Identified hits plus index mappings to Uniprot
Return type: Raises: ValueError
– If selected segments in PDB file do not unambigously map to one Uniprot entry
-
by_uniprot_id
(uniprot_id, reduce_chains=False)[source]¶ Find structures and mapping by Uniprot access number.
Parameters: - uniprot_ac (str) – Find PDB structures for this Uniprot accession number. If sequence_file was given while creating the SIFTS object, Uniprot identifiers can also be used.
- reduce_chains (bool, optional (Default: True)) – If true, keep only first chain per PDB ID (i.e. remove redundant occurrences of same protein in PDB structures). Should be set to False to identify homomultimeric contacts.
Returns: Record of hits and mappings found for this Uniprot protein. See by_pdb_id() for detailed explanation of fields.
Return type:
-
create_sequence_file
(output_file, chunk_size=1000, max_retries=100)[source]¶ Create FASTA sequence file containing all UniProt sequences of proteins in SIFTS. This file is required for homology-based structure identification and index remapping. This function will also automatically associate the sequence file with the SIFTS object.
Parameters: - output_file (str) – Path at which to store sequence file
- chunk_size (int, optional (default: 1000)) – Retrieve sequences from UniProt in chunks of this size (too large chunks cause the mapping service to stall)
- max_retries (int, optional (default: 100)) – Allow this many retries when fetching sequences from UniProt ID mapping service, which unfortunately often suffers from connection failures.
-
-
class
evcouplings.compare.sifts.
SIFTSResult
(hits, mapping)[source]¶ Bases:
object
Store results of SIFTS structure/mapping identification.
(Full class defined for easify modification of fields)
-
evcouplings.compare.sifts.
fetch_uniprot_mapping
(ids, from_='ACC', to='ACC', format='fasta')[source]¶ Fetch data from UniProt ID mapping service (e.g. download set of sequences)
Parameters: - ids (list(str)) – List of UniProt identifiers for which to retrieve mapping
- from (str, optional (default: "ACC")) – Source identifier (i.e. contained in “ids” list)
- to (str, optional (default: "ACC")) – Target identifier (to which source should be mapped)
- format (str, optional (default: "fasta")) – Output format to request from Uniprot server
Returns: Response from UniProt server
Return type:
-
evcouplings.compare.sifts.
find_homologs
(pdb_alignment_method='jackhmmer', **kwargs)[source]¶ Identify homologs using jackhmmer or hmmbuild/hmmsearch
Parameters: - pdb_alignment_method ({"jackhmmer", "hmmsearch"},) – optional (default: “jackhmmer”) Sequence alignment method used for searching the PDB
- **kwargs – Passed into jackhmmer / hmmbuild_and_search protocol (see documentation for available options)
Returns: - ali (evcouplings.align.Alignment) – Alignment of homologs of query sequence in sequence database
- hits (pandas.DataFrame) – Tabular representation of hits
evcouplings.complex package¶
evcouplings.complex.protocol module¶
Protocols for matching putatively interacting sequences in protein complexes to create a concatenated sequence alignment
- Authors:
- Anna G. Green Thomas A. Hopf
-
evcouplings.complex.protocol.
best_hit
(**kwargs)[source]¶ Protocol:
Concatenate alignments based on the best hit to the focus sequence in each species
Parameters: kwargs arguments (Mandatory) – See list below in code where calling check_required Returns: outcfg – Output configuration of the pipeline, including the following fields: alignment_file raw_alignment_file focus_mode focus_sequence segments frequencies_file identities_file num_sequences num_sites raw_focus_alignment_file statistics_file
Return type: dict
-
evcouplings.complex.protocol.
describe_concatenation
(annotation_file_1, annotation_file_2, genome_location_filename_1, genome_location_filename_2, outfile)[source]¶ Describes properties of concatenated alignment.
Writes a csv with the following columns
num_seqs_1 : number of sequences in the first monomer alignment num_seqs_2 : number of sequences in the second monomer alignment num_nonred_species_1 : number of unique species annotations in the
first monomer alignment- num_nonred_species_2 : number of unique species annotations in the
- second monomer alignment
num_species_overlap: number of unique species found in both alignments median_num_per_species_1 : median number of paralogs per species in the
first monomer alignmment- median_num_per_species_2 : median number of paralogs per species in
- the second monomer alignment
- num_with_embl_cds_1 : number of IDs for which we found an EMBL CDS in the
- first monomer alignment (relevant to distance concatention only)
- num_with_embl_cds_2 : number of IDs for which we found an EMBL CDS in the
- first monomer alignment (relevant to distance concatention only)
Parameters: - annotation_file_1 (str) – Path to annotation.csv file for first monomer alignment
- annotation_file_2 (str) – Path to annotation.csv file for second monomer alignment
- genome_location_filename_1 (str) – Path to genome location mapping file for first alignment
- genome_location_filename_2 (str) – Path to genome location mapping file for second alignment
- outfile (str) – Path to output file
-
evcouplings.complex.protocol.
genome_distance
(**kwargs)[source]¶ Protocol:
Concatenate alignments based on genomic distance
Parameters: kwargs arguments (Mandatory) – See list below in code where calling check_required Returns: outcfg – Output configuration of the pipeline, including the following fields: - alignment_file
- raw_alignment_file
- focus_mode
- focus_sequence
- segments
- frequencies_file
- identities_file
- num_sequences
- num_sites
- raw_focus_alignment_file
- statistics_file
Return type: dict
-
evcouplings.complex.protocol.
modify_complex_segments
(outcfg, **kwargs)[source]¶ Modifies the output configuration so that the segments are correct for a concatenated alignment
Parameters: outcfg (dict) – The output configuration Returns: outcfg – The output configuration, with a new field called “segments” Return type: dict
-
evcouplings.complex.protocol.
run
(**kwargs)[source]¶ Run alignment concatenation protocol
Parameters: kwargs arguments (Mandatory) – protocol: concatenation protocol to run prefix: Output prefix for all generated files Returns: outcfg – Output configuration of concatenation stage Dictionary with results in following fields: (in brackets: not mandatory) alignment_file raw_alignment_file focus_mode focus_sequence segments frequencies_file identities_file num_sequences num_sites raw_focus_alignment_file statistics_file
Return type: dict
evcouplings.couplings package¶
evcouplings.couplings.mapping module¶
Mapping indices for complexes / multi-domain sequences to internal model numbering.
- Authors:
- Thomas A. Hopf Anna G. Green (MultiSegmentCouplingsModel)
-
class
evcouplings.couplings.mapping.
MultiSegmentCouplingsModel
(filename, *segments, precision='float32', file_format='plmc_v2', **kwargs)[source]¶ Bases:
evcouplings.couplings.model.CouplingsModel
Complex specific Couplings Model that handles segments and provides the option to convert model into inter-segment only.
-
to_inter_segment_model
()[source]¶ Convert model to inter-segment only parameters, ie the J_ijs that correspond to inter-protein or inter-domain residue pairs. All other parameters are set to 0.
Returns: Copy of object turned into inter-only Epistatic model Return type: CouplingsModel
-
-
class
evcouplings.couplings.mapping.
Segment
(segment_type, sequence_id, region_start, region_end, positions=None, segment_id='A')[source]¶ Bases:
object
Represents a continuous stretch of sequence in a sequence alignment to infer evolutionary couplings (e.g. multiple domains, or monomers in a concatenated complex alignment)
-
default_chain_name
()[source]¶ Retrieve default PDB chain identifier the segment will be mapped to in 3D structures (by convention, segments in the pipeline are named A_1, A_2, …, B_1, B_2, …; the default chain identifier is anything before the underscore).
Returns: chain – Default PDB chain identifier the segment maps to Return type: str
-
classmethod
from_list
(segment)[source]¶ Create a segment object from list representation (e.g. from config).
Parameters: segment (list) – List representation of segment, with the following items: segment_id (str), segment_type (str), sequence_id (str), region_start (int), region_end (int), positions (list(int)) Returns: New Segment instance from list Return type: Segment
-
-
class
evcouplings.couplings.mapping.
SegmentIndexMapper
(focus_mode, first_index, *segments)[source]¶ Bases:
object
Map indices of one or more sequence segments into CouplingsModel internal numbering space. Can also be used to (trivially) remap indices for a single sequence.
-
patch_model
(model, inplace=True)[source]¶ Change numbering of CouplingModel object so that it uses segment-based numbering
Parameters: - model (CouplingsModel) – Model that will be updated to segment- based numbering
- inplace (bool, optional (default: True)) – If True, change passed model; otherwise returnnew object
Returns: Model with updated numbering (if inplace is False, this will point to original model)
Return type: Raises: ValueError
– If segment mapping does not match internal model numbering
-
-
evcouplings.couplings.mapping.
segment_map_ecs
(ecs, mapper)[source]¶ Map EC dataframe in model numbering into segment numbering
Parameters: ecs (pandas.DataFrame) – EC table (with columns i and j) Returns: Mapped EC table (with columns i and j mapped, and additional columns segment_i and segment_j) Return type: pandas.DataFrame
evcouplings.couplings.mean_field module¶
evcouplings.couplings.model module¶
Class to store parameters of undirected graphical model of sequences and perform calculations using the model (statistical energies, coupling scores).
- Authors:
- Thomas A. Hopf
-
class
evcouplings.couplings.model.
CouplingsModel
(model_file, precision='float32', file_format='plmc_v2', **kwargs)[source]¶ Bases:
object
Class to store parameters of pairwise undirected graphical model of sequences and compute evolutionary couplings, sequence statistical energies, etc.
-
Jij
(i=None, j=None, A_i=None, A_j=None)[source]¶ Quick access to J_ij matrix with automatic index mapping. See __4d_access for explanation of parameters.
-
classmethod
apc
(matrix)[source]¶ Apply average product correction (Dunn et al., Bioinformatics, 2008) to matrix
Parameters: matrix (np.array) – Symmetric L x L matrix which should be corrected by APC Returns: Symmetric L x L matrix with APC correction applied Return type: np.array
-
cn
(i=None, j=None)[source]¶ Quick access to cn_scores matrix with automatic index mapping. See __2d_access_score_matrix for explanation of parameters.
-
cn_scores
¶ L x L numpy matrix with CN (corrected norm) scores
-
convert_sequences
(sequences)[source]¶ Converts sequences in string format into internal symbol representation according to alphabet of model
Parameters: sequences (list of str) – List of sequences (must have same length and correspond to model states) Returns: Matrix of size len(sequences) x L of sequences converted to integer symbols Return type: np.array
-
delta_hamiltonian
(substitutions, verify_mutants=True)[source]¶ Calculate difference in statistical energy relative to self.target_seq by changing sequence according to list of substitutions
Parameters: - substitutions (list of tuple(pos, subs_from, subs_to)) – Substitutions to be applied to target sequence
- verify_mutants (bool, optional) – Test if subs_from is consistent with self.target_seq
Returns: Vector of length 3 with 1) total delta Hamiltonian, 2) delta J_ij, 3) delta h_i
Return type: np.array
-
dmm
(i=None, j=None, A_i=None, A_j=None)[source]¶ Access delta_Hamiltonian matrix of double mutants of target sequence
Parameters: Returns: 4D matrix containing energy differences for slices along both axes of double mutation matrix (axes 1/2: position, axis 3/4: substitutions).
Return type: np.array(float)
-
double_mut_mat
¶ Hamiltonian difference for all possible double mutant variants
L x L x num_symbol x num_symbol matrix containing delta Hamiltonians for all possible double mutants of target sequence
-
ecs
¶ DataFrame with evolutionary couplings, sorted by CN score (all scores: CN, FN, MI)
-
fi
(i=None, A_i=None)[source]¶ Quick access to f_i matrix with automatic index mapping. See __2d_access for explanation of parameters.
-
fij
(i=None, j=None, A_i=None, A_j=None)[source]¶ Quick access to f_ij matrix with automatic index mapping. See __4d_access for explanation of parameters.
-
fn
(i=None, j=None)[source]¶ Quick access to fn_scores matrix with automatic index mapping. See __2d_access_score_matrix for explanation of parameters.
-
fn_scores
¶ L x L numpy matrix with FN (Frobenius norm) scores
-
hamiltonians
(sequences)[source]¶ Calculates the Hamiltonians of the global probability distribution P(A_1, …, A_L) for the given sequences A_1,…,A_L from J_ij and h_i parameters
Parameters: sequences (list of str) – List of sequences for which Hamiltonian will be computed, or converted np.array obtained using convert_sequences method Returns: Float matrix of size len(sequences) x 3, where each row corresponds to the 1) total Hamiltonian of sequence and the 2) J_ij and 3) h_i sub-sums Return type: np.array
-
hi
(i=None, A_i=None)[source]¶ Quick access to h_i matrix with automatic index mapping. See __2d_access for explanation of parameters.
-
index_list
¶ Target/Focus sequence of model used for delta_hamiltonian calculations (including single and double mutation matrices)
-
mi_apc
(i=None, j=None)[source]¶ Quick access to mi_scores_apc matrix with automatic index mapping. See __2d_access_score_matrix for explanation of parameters.
-
mi_raw
(i=None, j=None)[source]¶ Quick access to mi_scores_raw matrix with automatic index mapping. See __2d_access_score_matrix for explanation of parameters.
-
mi_scores_apc
¶ L x L numpy matrix with MI (mutual information) scores with APC correction
-
mi_scores_raw
¶ L x L numpy matrix with MI (mutual information) scores without APC correction
-
mn
(i=None)[source]¶ Map model numbering to internal numbering
Parameters: i (Iterable(int) or int) – Position(s) to be mapped from model numbering space into internal numbering space Returns: Remapped position(s) Return type: Iterable(int) or int
-
seq
(i=None)[source]¶ Access target sequence of model
Parameters: i (Iterable(int) or int) – Position(s) for which symbol should be retrieved Returns: Sequence symbols Return type: Iterable(char) or char
-
single_mut_mat
¶ Hamiltonian difference for all possible single-site variants
L x num_symbol matrix (np.array) containing delta Hamiltonians for all possible single mutants of target sequence.
-
single_mut_mat_full
¶ Hamiltonian difference for all possible single-site variants
L x num_symbol x 3 matrix (np.array) containing delta Hamiltonians for all possible single mutants of target sequence. Third dimension: 1) full Hamiltonian, 2) J_ij, 3) h_i
-
smm
(i=None, A_i=None)[source]¶ Access delta_Hamiltonian matrix of single mutants of target sequence
Parameters: Returns: 2D matrix containing energy differences for slices along both axes of single mutation matrix (first axis: position, second axis: substitution).
Return type: np.array(float)
-
sn
(i=None)[source]¶ Map internal numbering to sequence numbering
Parameters: i (Iterable(int) or int) – Position(s) to be mapped from internal numbering space into sequence numbering space. Returns: Remapped position(s) Return type: Iterable(int) or int
-
target_seq
¶ Target/Focus sequence of model used for delta_hamiltonian calculations (including single and double mutation matrices)
-
to_file
(out_file, precision='float32', file_format='plmc_v2')[source]¶ Writes the potentially modified model again to binary file
Parameters: - out_file (str) – A string specifying the path to a file
- precision ({"float16", "float32", "float64"}, optional (default: "float32")) – Numerical NumPy data type specifying the precision used to write numerical values to file
- file_format ({"plmc_v1", "plmc_v2"}, optional (default: "plmc_v2")) – Available file formats
-
to_independent_model
()[source]¶ Estimate parameters of a single-site model using Gaussian prior/L2 regularization.
Returns: Copy of object turned into independent model Return type: CouplingsModel
-
evcouplings.couplings.pairs module¶
evcouplings.couplings.protocol module¶
evcouplings.couplings.tools module¶
evcouplings.mutate package¶
evcouplings.mutate.calculations module¶
High-level mutation calculation functions for EVmutation
Todo
implement segment handling
- Authors:
- Thomas A. Hopf Anna G. Green (generalization for multiple segments)
-
evcouplings.mutate.calculations.
extract_mutations
(mutation_string, offset=0, sep=', ')[source]¶ Turns a string containing mutations of the format I100V into a list of tuples with format (100, ‘I’, ‘V’) (index, from, to)
Parameters: Returns: List of tuples of the form (index+offset, from, to)
Return type: list of tuples
-
evcouplings.mutate.calculations.
predict_mutation_table
(model, table, output_column='prediction_epistatic', mutant_column='mutant', hamiltonian='full', segment=None)[source]¶ Predicts all mutants in a dataframe and adds predictions as a new column.
If mutant_column is None, the dataframe index is used, otherwise the given column.
Mutations which cannot be calculated (e.g. not covered by alignment, or invalid substitution) using object are set to NaN.
Parameters: - model (CouplingsModel) – CouplingsModel instance used to compute mutation effects
- table (pandas.DataFrame) – DataFrame with mutants to which delta of statistical energy will be added
- mutant_column (str) – Name of column in table that contains mutants
- output_column (str) – Name of column in returned dataframe that will contain computed effects
- hamiltonian ({"full", "couplings", "fields"},) – default: “full” Use full Hamiltonian of exponential model (default), or only couplings / fields for statistical energy calculation.
- segment (str, default: None) – Specificy a segment identifier to use for the positions in the mutation table. This will only be used if the mutation table doesn’t already have a segments column.
Returns: Dataframe with added column (mutant_column) that contains computed mutation effects
Return type: pandas.DataFrame
-
evcouplings.mutate.calculations.
single_mutant_matrix
(model, output_column='prediction_epistatic', exclude_self_subs=True)[source]¶ Create table with all possible single substitutions of target sequence in CouplingsModel object.
Parameters: - model (CouplingsModel) – Model that will be used to predict single mutants
- output_column (str, default: "prediction_epistatic") – Name of column in Dataframe that will contain predictions
- exclude_self_subs (bool, default: True) – Exclude self-substitutions (e.g. A100A) from results
Returns: DataFrame with predictions for all single mutants
Return type: pandas.DataFrame
-
evcouplings.mutate.calculations.
split_mutants
(x, mutant_column='mutant')[source]¶ Splits mutation strings into individual columns in DataFrame (wild-type symbol(s), position(s), substitution(s), number of mutations). This function is e.g. helpful when computing average effects per position using pandas groupby() operations
Parameters: - x (pandas.DataFrame) – Table with mutants
- mutant_column (str, default: "mutant") – Column which contains mutants, set to None to use index of DataFrame
Returns: DataFrame with added columns “num_subs”, “pos”, “wt” and “subs” that contain the number of mutations, and split mutation strings (if higher-order mutations, symbols/numbers are comma-separated)
Return type: pandas.DataFrame
evcouplings.mutate.protocol module¶
Sequence statistical energy and mutation effect computation protocols
- Authors:
- Thomas A. Hopf Anna G. Green (complex)
-
evcouplings.mutate.protocol.
complex
(**kwargs)[source]¶ Protocol: Mutation effect prediction and visualization for protein complexes
Parameters: kwargs arguments (Mandatory) – See list below in code where calling check_required Returns: outcfg – Output configuration of the pipeline, including the following fields: - mutation_matrix_file
- [mutation_dataset_predicted_file]
Return type: dict
-
evcouplings.mutate.protocol.
run
(**kwargs)[source]¶ Run mutation protocol
Parameters: kwargs arguments (Mandatory) – protocol: EC protocol to run prefix: Output prefix for all generated files Returns: outcfg – Output configuration of stage (see individual protocol for fields) Return type: dict
-
evcouplings.mutate.protocol.
standard
(**kwargs)[source]¶ Protocol: Mutation effect calculation and visualization for protein monomers
TODO: eventually merge with complexes to make a protocol agnostic to the number of segments
Parameters: kwargs arguments (Mandatory) – See list below in code where calling check_required Returns: outcfg – Output configuration of the pipeline, including the following fields: - mutation_matrix_file
- [mutation_dataset_predicted_file]
Return type: dict
Folding Analysis¶
evcouplings.fold package¶
evcouplings.fold.cns module¶
evcouplings.fold.filter module¶
Functions for detecting ECs that should not be included in 3D structure prediction
Most functions in this module are rewritten from older pipeline code in choose_CNS_constraint_set.m
- Authors:
- Thomas A. Hopf
-
evcouplings.fold.filter.
detect_secstruct_clash
(i, j, secstruct)[source]¶ Detect if an EC pair (i, j) is geometrically impossible given a predicted secondary structure
Based on direct port of the logic implemented in choose_CNS_constraint_set.m from original pipeline, lines 351-407.
Use secstruct_clashes() to annotate an entire table of ECs.
Parameters: Returns: clashes – True if (i, j) clashes with secondary structure
Return type:
-
evcouplings.fold.filter.
disulfide_clashes
(ec_pairs, output_column='cys_clash')[source]¶ Add disulfide bridge clashes to EC table (i.e. if any cysteine residue is coupled to another cysteine). This flag is necessary if disulfide bridges are created during folding, since only one bridge is possible per cysteine.
Parameters: - ec_pairs (pandas.DataFrame) – Table with EC pairs that will be tested for the occurrence of multiple cys-cys pairings (with columns i, j, A_i, A_j)
- output_column (str, optional (default: "cys_clash")) – Target column indicating if pair is in a clash or not
Returns: Annotated EC table with clashes
Return type: pandas.DataFrame
-
evcouplings.fold.filter.
secstruct_clashes
(ec_pairs, residues, output_column='ss_clash', secstruct_column='sec_struct_3state')[source]¶ Add secondary structure clashes to EC table
Parameters: - ec_pairs (pandas.DataFrame) – Table with EC pairs that will be tested for clashes with secondary structure (with columns i, j)
- residues (pandas.DataFrame) – Table with residues in sequence and their secondary structure (columns i, ss_pred).
- output_column (str, optional (default: "secstruct_clash")) – Target column indicating if pair is in a clash or not
- secstruct_column (str, optional (default: "sec_struct_3state")) – Source column in ec_pairs with secondary structure states (H, E, C)
Returns: Annotated EC table with clashes
Return type: pandas.DataFrame
evcouplings.fold.protocol module¶
evcouplings.fold.ranking module¶
evcouplings.fold.restraints module¶
Functions for generating distance restraints from evolutionary couplings and secondary structure predictions
- Authors:
- Thomas A. Hopf Anna G. Green (docking restraints)
-
evcouplings.fold.restraints.
docking_restraints
(ec_pairs, output_file, restraint_formatter, config_file=None)[source]¶ Create .tbl file with distance restraints for docking
Parameters: - ec_pairs (pandas.DataFrame) – Table with EC pairs that will be turned into distance restraints (with columns i, j, A_i, A_j, segment_i, segment_j)
- output_file (str) – Path to file in which restraints will be saved
- restraint_formatter (function) – Function called to create string representation of restraint
- config_file (str, optional (default: None)) – Path to config file with folding settings. If None, will use default settings included in package (restraints.yml).
-
evcouplings.fold.restraints.
ec_dist_restraints
(ec_pairs, output_file, restraint_formatter, config_file=None)[source]¶ Create .tbl file with distance restraints based on evolutionary couplings
Logic based on choose_CNS_constraint_set.m, lines 449-515
Parameters: - ec_pairs (pandas.DataFrame) – Table with EC pairs that will be turned into distance restraints (with columns i, j, A_i, A_j)
- output_file (str) – Path to file in which restraints will be saved
- restraint_formatter (function) – Function called to create string representation of restraint
- config_file (str, optional (default: None)) – Path to config file with folding settings. If None, will use default settings included in package (restraints.yml).
-
evcouplings.fold.restraints.
secstruct_angle_restraints
(residues, output_file, restraint_formatter, config_file=None, secstruct_column='sec_struct_3state')[source]¶ Create .tbl file with dihedral angle restraints based on secondary structure prediction
Logic based on make_cns_angle_constraints.pl
Parameters: - residues (pandas.DataFrame) – Table containing positions (column i), residue type (column A_i), and secondary structure for each position
- output_file (str) – Path to file in which restraints will be saved
- restraint_formatter (function, optional) – Function called to create string representation of restraint
- config_file (str, optional (default: None)) – Path to config file with folding settings. If None, will use default settings included in package (restraints.yml).
- secstruct_column (str, optional (default: sec_struct_3state)) – Column name in residues dataframe from which secondary structure will be extracted (has to be H, E, or C).
-
evcouplings.fold.restraints.
secstruct_dist_restraints
(residues, output_file, restraint_formatter, config_file=None, secstruct_column='sec_struct_3state')[source]¶ Create .tbl file with distance restraints based on secondary structure prediction
Logic based on choose_CNS_constraint_set.m, lines 519-1162
Parameters: - residues (pandas.DataFrame) – Table containing positions (column i), residue type (column A_i), and secondary structure for each position
- output_file (str) – Path to file in which restraints will be saved
- restraint_formatter (function) – Function called to create string representation of restraint
- config_file (str, optional (default: None)) – Path to config file with folding settings. If None, will use default settings included in package (restraints.yml).
- secstruct_column (str, optional (default: sec_struct_3state)) – Column name in residues dataframe from which secondary structure will be extracted (has to be H, E, or C).
evcouplings.fold.tools module¶
Wrappers for tools for 3D structure prediction from evolutionary couplings
- Authors:
- Thomas A. Hopf
-
evcouplings.fold.tools.
parse_maxcluster_clustering
(clustering_output)[source]¶ Parse maxcluster clustering output into a DataFrame
Parameters: clustering_output (str) – stdout output from maxcluster after clustering Returns: Parsed result table (columns: filename, cluster, cluster_size) Return type: pandas.DataFrame
-
evcouplings.fold.tools.
parse_maxcluster_comparison
(comparison_output)[source]¶ Parse maxcluster output into a DataFrame
Parameters: comparison_output (str) – stdout output from maxcluster after comparison Returns: Parsed result table (columns: filename, num_pairs, rmsd, maxsub, tm, msi), refer to maxcluster documentation for explanation of the score fields. Return type: pandas.DataFrame
-
evcouplings.fold.tools.
read_psipred_prediction
(filename, first_index=1)[source]¶ Read a psipred secondary structure prediction file in horizontal or vertical format (auto-detected).
Parameters: Returns: pred – Table containing secondary structure prediction, with the following columns:
- i: position
- A_i: amino acid
- sec_struct_3state: prediction (H, E, C)
If reading vformat, also contains columns for the individual (score_coil/helix/strand)
If reading hformat, also contains confidence score between 1 and 9 (sec_struct_conf)
Return type: pandas.DataFrame
-
evcouplings.fold.tools.
run_cns
(inp_script=None, inp_file=None, log_file=None, binary='cns')[source]¶ Run CNSsolve 1.21 (without worrying about environment setup)
Note that the user is responsible for verifying the output products of CNS, since their paths are determined by .inp scripts and hard to check automatically and in a general way.
Either input_script or input_file has to be specified.
Parameters: - inp_script (str, optional (default: None)) – CNS “.inp” input script (actual commands, not file)
- inp_file (str, optional (default: None)) – Path to .inp input script file. Will override inp_script if also specified.
- log_file (str, optional (default: None)) – Save CNS stdout output to this file
- binary (str, optional (default: "cns")) – Absolute path of CNS binary
Raises: ExternalToolError
– If call to CNS failsInvalidParameterError
– If no input script (file or string) given
-
evcouplings.fold.tools.
run_cns_13
(inp_script=None, inp_file=None, log_file=None, source_script=None, binary='cns')[source]¶ Run CNSsolve 1.3
Note that the user is responsible for verifying the output products of CNS, since their paths are determined by .inp scripts and hard to check automatically and in a general way.
Either input_script or input_file has to be specified.
Parameters: - inp_script (str, optional (default: None)) – CNS “.inp” input script (actual commands, not file)
- inp_file (str, optional (default: None)) – Path to .inp input script file. Will override inp_script if also specified.
- log_file (str, optional (default: None)) – Save CNS stdout output to this file
- source_script (str, optional (default: None)) – Script to set CNS environment variables. This should typically point to .cns_solve_env_sh in the CNS installation main directory (the shell script itself needs to be edited to contain the path of the installation)
- binary (str, optional (default: "cns")) – Name of CNS binary
Raises: ExternalToolError
– If call to CNS failsInvalidParameterError
– If no input script (file or string) given
-
evcouplings.fold.tools.
run_maxcluster_cluster
(predictions, method='average', rmsd=True, clustering_threshold=None, binary='maxcluster')[source]¶ Compare a set of predicted structures to an experimental structure using maxcluster.
For clustering functionality, use run_maxcluster_clustering() function.
Parameters: - predictions (list(str)) – List of PDB files that should be compared against experiment
- method ({"single", "average", "maximum", "pairs_min", "pairs_abs"}, optional (default: "average")) – Clustering method (single / average / maximum linkage, or min / absolute size neighbour pairs
- clustering_threshold (float (optional, default: None)) – Initial clustering threshold (maxcluster -T option)
- rmsd (bool, optional (default: True)) – Use RMSD-based clustering (faster)
- binary (str, optional (default: "maxcluster")) – Path to maxcluster binary
Returns: Clustering result table (see parse_maxcluster_clustering for more detailed explanation)
Return type: pandas.DataFrame
-
evcouplings.fold.tools.
run_maxcluster_compare
(predictions, experiment, normalization_length=None, distance_cutoff=None, binary='maxcluster')[source]¶ Compare a set of predicted structures to an experimental structure using maxcluster.
For clustering functionality, use run_maxcluster_clustering() function.
For a high-level wrapper around this function that removes problematic atoms and compares multiple models, please look at evcouplings.fold.protocol.compare_models_maxcluster().
Parameters: - predictions (list(str)) – List of PDB files that should be compared against experiment
- experiment (str) – Path of experimental structure PDB file. Note that the numbering and residues in this file must agree with the predicted structure, and that the structure may not contain duplicate atoms (multiple models, or alternative locations for the same atom).
- normalization_length (int, optional (default: None)) – Use this length to normalize the Template Modeling (TM) score (-N option of maxcluster). If None, will normalize by length of experiment.
- distance_cutoff (float, optional (default: None)) – Distance cutoff for MaxSub search (-d option of maxcluster). If None, will use maxcluster auto-calibration.
- binary (str, optional (default: "maxcluster")) – Path to maxcluster binary
Returns: Comparison result table (see parse_maxcluster_comparison for more detailed explanation)
Return type: pandas.DataFrame
-
evcouplings.fold.tools.
run_psipred
(fasta_file, output_dir, binary='runpsipred')[source]¶ Run psipred secondary structure prediction
psipred output file convention: run_psipred creates output files <rootname>.ss2 and <rootname2>.horiz in the current working directory, where <rootname> is extracted from the basename of the input file (e.g. /home/test/<rootname>.fa)
Parameters: Returns: - ss2_file (str) – Absolute path to prediction output in “VFORMAT”
- horiz_file (str) – Absolute path to prediction output in “HFORMAT”
Raises: ExternalToolError
– If call to psipred fails
Visualization¶
Utilities¶
evcouplings.utils package¶
evcouplings.utils.app module¶
evcouplings.utils.batch module¶
Looping through batches of jobs (former submit_job.py and buildali loop)
- Authors:
- Benjamin Schubert, Thomas A. Hopf
-
class
evcouplings.utils.batch.
AClusterSubmitter
[source]¶ Bases:
evcouplings.utils.batch.ASubmitter
Abstract subclass of a cluster submitter
-
cancel
(command)[source]¶ Consumes a list of jobIDs and trys to cancel them
Parameters: command (Command) – The Command jobejct to cancel Returns: If job was canceled Return type: bool
-
cancel_command
¶
-
db
¶ The persistent DB to keep track of all submitted jobs and their status
Returns: The Persistent DB Return type: PersistentDict
-
job_id_pattern
¶
-
monitor
(command)[source]¶ Returns the status of the consumed command
Parameters: command (Command) – The command object whose status is inquired Returns: The status of the Command Return type: Enum(Status)
-
monitor_command
¶
-
resource_flags
¶
-
-
class
evcouplings.utils.batch.
APluginRegister
(name, bases, nmspc)[source]¶ Bases:
abc.ABCMeta
This class allows automatic registration of new plugins.
-
class
evcouplings.utils.batch.
ASubmitter
[source]¶ Bases:
object
Interface for all submitters
-
cancel
(command)[source]¶ Consumes a list of jobIDs and trys to cancel them
Parameters: command (Command) – The Command jobejct to cancel Returns: If job was canceled Return type: bool
-
isBlocking
¶ Indicator whether the submitter is blocking or not
Returns: whether submitter blocks by calling join or not Return type: bool
-
monitor
(command)[source]¶ Returns the status of the consumed command
Parameters: command (Command) – The command object whose status is inquired Returns: The status of the Command Return type: Enum(Status)
-
registry
= {'local': <class 'evcouplings.utils.batch.LocalSubmitter'>, 'lsf': <class 'evcouplings.utils.batch.LSFSubmitter'>, 'sge': <class 'evcouplings.utils.batch.SGESubmitter'>, 'slurm': <class 'evcouplings.utils.batch.SlurmSubmitter'>}¶
-
-
class
evcouplings.utils.batch.
Command
(command, name=None, environment=None, workdir=None, resources=None)[source]¶ Bases:
object
Wrapper around the command parameters needed to execute a script
-
class
evcouplings.utils.batch.
EJob
[source]¶ Bases:
enum.Enum
An enumeration.
-
CANCEL
= 2¶
-
MONITOR
= 1¶
-
PID
= 5¶
-
STOP
= 3¶
-
SUBMIT
= 0¶
-
UPDATE
= 4¶
-
-
evcouplings.utils.batch.
EResource
¶ alias of
evcouplings.utils.batch.Enum
-
evcouplings.utils.batch.
EStatus
¶ alias of
evcouplings.utils.batch.Enum
-
class
evcouplings.utils.batch.
LSFSubmitter
(blocking=False, db_path=None)[source]¶ Bases:
evcouplings.utils.batch.AClusterSubmitter
Implements an LSF submitter
-
cancel_command
¶
-
db
¶ The persistent DB to keep track of all submitted jobs and their status
Returns: The Persistent DB Return type: PersistentDict
-
isBlocking
¶ Indicator whether the submitter is blocking or not
Returns: whether submitter blocks by calling join or not Return type: bool
-
job_id_pattern
¶
-
monitor_command
¶
-
resource_flags
¶
-
submit_command
¶
-
-
class
evcouplings.utils.batch.
LocalSubmitter
(blocking=True, db_path=None, ncpu=1)[source]¶ Bases:
evcouplings.utils.batch.ASubmitter
-
cancel
(command)[source]¶ Consumes a list of jobIDs and trys to cancel them
Parameters: command (Command) – The Command jobejct to cancel Returns: If job was canceled Return type: bool
-
isBlocking
¶ Indicator whether the submitter is blocking or not
Returns: whether submitter blocks by calling join or not Return type: bool
-
-
class
evcouplings.utils.batch.
SGESubmitter
(blocking=False, db_path=None)[source]¶ Bases:
evcouplings.utils.batch.AClusterSubmitter
Implements an LSF submitter
-
cancel_command
¶
-
db
¶ The persistent DB to keep track of all submitted jobs and their status
Returns: The Persistent DB Return type: PersistentDict
-
isBlocking
¶ Indicator whether the submitter is blocking or not
Returns: whether submitter blocks by calling join or not Return type: bool
-
job_id_pattern
¶
-
monitor_command
¶
-
resource_flags
¶
-
submit_command
¶
-
-
class
evcouplings.utils.batch.
SlurmSubmitter
(blocking=False, db_path=None)[source]¶ Bases:
evcouplings.utils.batch.AClusterSubmitter
Implements an LSF submitter
-
cancel_command
¶
-
db
¶ The persistent DB to keep track of all submitted jobs and their status
Returns: The Persistent DB Return type: PersistentDict
-
isBlocking
¶ Indicator whether the submitter is blocking or not
Returns: whether submitter blocks by calling join or not Return type: bool
-
job_id_pattern
¶
-
monitor_command
¶
-
resource_flags
¶
-
submit_command
¶
-
evcouplings.utils.calculations module¶
General calculation functions.
- Authors:
- Thomas A. Hopf
-
evcouplings.utils.calculations.
dihedral_angle
(p0, p1, p2, p3)[source]¶ Compute dihedral angle given four points
Adapted from the following source: http://stackoverflow.com/questions/20305272/dihedral-torsion-angle-from-four-points-in-cartesian-coordinates-in-python (answer by user Praxeolitic)
Parameters: - p0 (np.array) – Coordinates of first point
- p1 (np.array) – Coordinates of second point
- p2 (np.array) – Coordinates of third point
- p3 (np.array) – Coordinates of fourth point
Returns: Dihedral angle (in radians)
Return type: numpy.float
-
evcouplings.utils.calculations.
entropy
(X, normalize=False)[source]¶ Calculate entropy of distribution
Parameters: - X (np.array) – Vector for which entropy will be calculated
- normalize – Rescale entropy to range from 0 (“variable”, “flat”) to 1 (“conserved”)
Returns: Entropy of X
Return type:
-
evcouplings.utils.calculations.
entropy_map
(model, normalize=True)[source]¶ Compute dictionary of positional entropies for single-site frequencies in a CouplingsModel
Parameters: - model (CouplingsModel) – Model for which entropy of sequence alignment will be computed (based on single-site frequencies f_i(A_i) contained in model)
- normalize (bool, default: True) – Normalize entropy to range 0 (variable) to 1 (conserved) instead of raw values
Returns: Map from positions in sequence (int) to entropy of column (float) in alignment
Return type:
-
evcouplings.utils.calculations.
entropy_vector
(model, normalize=True)[source]¶ Compute vector of positional entropies for single-site frequencies in a CouplingsModel
Parameters: - model (CouplingsModel) – Model for which entropy of sequence alignment will be computed (based on single-site frequencies f_i(A_i) contained in model)
- normalize (bool, default: True) – Normalize entropy to range 0 (variable) to 1 (conserved) instead of raw values
Returns: Vector of length model.L containing entropy for each position
Return type: np.array
-
evcouplings.utils.calculations.
median_absolute_deviation
(x, scale=1.4826)[source]¶ Compute median absolute deviation of a set of numbers (median of deviations from median)
Parameters: - x (list-like of float) – Numbers for which median absolute deviation will be computed
- scale (float, optional (default: 1.4826)) – Rescale median absolute deviation by this factor; default value is such that median absolute deviation will match regular standard deviation of Gaussian distribution
evcouplings.utils.config module¶
Configuration handling
Todo
switch ruamel.yaml to round trip loading to preserver order and comments?
- Authors:
- Thomas A. Hopf
-
exception
evcouplings.utils.config.
InvalidParameterError
[source]¶ Bases:
Exception
Exception for invalid parameter settings
-
exception
evcouplings.utils.config.
MissingParameterError
[source]¶ Bases:
Exception
Exception for missing parameters
-
evcouplings.utils.config.
check_required
(params, keys)[source]¶ Verify if required set of parameters is present in configuration
Parameters: - params (dict) – Dictionary with parameters
- keys (list-like) – Set of parameters that has to be present in params
Raises:
-
evcouplings.utils.config.
iterate_files
(outcfg, subset=None)[source]¶ Generator function to iterate a list of file items in an outconfig
Parameters: Returns: Generator over tuples (file path, entry key, index). index will be None if this is a single file entry (i.e. ending with _file rather than _files).
Return type:
-
evcouplings.utils.config.
parse_config
(config_str, preserve_order=False)[source]¶ Parse a configuration string
Parameters: Returns: Configuration dictionary
Return type:
evcouplings.utils.constants module¶
Useful values and constants for all of package
- Authors:
- Thomas A. Hopf
evcouplings.utils.database module¶
evcouplings.utils.helpers module¶
Useful Python helpers
- Authors:
- Thomas A. Hopf, Benjamin Schubert
-
class
evcouplings.utils.helpers.
DefaultOrderedDict
(default_factory=None, **kwargs)[source]¶ Bases:
collections.OrderedDict
Source: http://stackoverflow.com/questions/36727877/inheriting-from-defaultddict-and-ordereddict Answer by http://stackoverflow.com/users/3555845/daniel
Maybe this one would be better? http://stackoverflow.com/questions/6190331/can-i-do-an-ordered-default-dict-in-python
-
class
evcouplings.utils.helpers.
PersistentDict
(filename, flag='c', mode=None, format='json', *args, **kwds)[source]¶ Bases:
dict
Persistent dictionary with an API compatible with shelve and anydbm.
The dict is kept in memory, so the dictionary operations run as fast as a regular dictionary.
Write to disk is delayed until close or sync (similar to gdbm’s fast mode).
Input file format is automatically discovered. Output file format is selectable between pickle, json, and csv. All three serialization formats are backed by fast C implementations.
-
class
evcouplings.utils.helpers.
Progressbar
(total_size, bar_length=60)[source]¶ Bases:
object
Progress bar for command line programs
Parameters:
-
evcouplings.utils.helpers.
find_segments
(data)[source]¶ Find consecutive number segments, based on Python 2.7 itertools recipe
Parameters: data (iterable) – Iterable in which to look for consecutive number segments (has to be in order)
-
evcouplings.utils.helpers.
range_overlap
(a, b)[source]¶ - Source: http://stackoverflow.com/questions/2953967/
- built-in-function-for-computing-overlap-in-python
Function assumes that start < end for a and b
Note
Ends of range are not inclusive
Parameters: Returns: Length of overlap between ranges a and b
Return type:
-
evcouplings.utils.helpers.
render_template
(template_file, mapping)[source]¶ Render a template using jinja2 and substitute values from mapping
Parameters: Returns: Rendered template
Return type:
-
evcouplings.utils.helpers.
retry
(func, retry_max_number=None, retry_wait=None, exceptions=None, retry_action=None, fail_action=None)[source]¶ Retry to execute a function as often as requested
Parameters: - func (callable) – Function to be executed until succcessful
- retry_max_number (int, optional (default: None)) – Maximum number of retries. If None, will retry forever.
- retry_wait (int, optional (default: None)) – Number of seconds to wait before attempting retry
- exceptions (exception or tuple(exception)) – Single or tuple of exceptions to catch for retrying (any other exception will cause immediate fail)
- retry_action (callable) – Function to execute upon a retry
- fail_action – Function to execute upon final failure
evcouplings.utils.pipeline module¶
evcouplings.utils.summarize module¶
evcouplings.utils.system module¶
System-level calls to external tools, directory creation, etc.
- Authors:
- Thomas A. Hopf
-
exception
evcouplings.utils.system.
ExternalToolError
[source]¶ Bases:
Exception
Exception for failing external calculations
-
exception
evcouplings.utils.system.
ResourceError
[source]¶ Bases:
Exception
Exception for missing resources (files, URLs, …)
-
evcouplings.utils.system.
create_prefix_folders
(prefix)[source]¶ Create a directory tree contained in a prefix.
- prefix : str
- Prefix containing directory tree
-
evcouplings.utils.system.
get
(url, output_path=None, allow_redirects=False)[source]¶ Download external resource
Parameters: Returns: r – Response object, use r.text to access text, r.json() to decode json, and r.content for raw bytestring
Return type: requests.models.Response
Raises:
-
evcouplings.utils.system.
get_urllib
(url, output_path)[source]¶ Download external resource to file using urllib. This function is intended for cases where get() implemented using requests can not be used, e.g. for download from an FTP server.
Parameters:
-
evcouplings.utils.system.
insert_dir
(prefix, *dirs, rootname_subdir=True)[source]¶ Create new path by inserting additional directories into the folder tree of prefix (but keeping the filename prefix at the end),
Parameters: Returns: Extended path
Return type:
-
evcouplings.utils.system.
makedirs
(directories)[source]¶ Create directory subtree, some or all of the folders may already exist.
Parameters: directories (str) – Directory subtree to create
-
evcouplings.utils.system.
run
(cmd, stdin=None, check_returncode=True, working_dir=None, shell=False, env=None)[source]¶ Run external program as subprocess.
Parameters: - cmd (str or list of str) – Command (and optional command line arguments)
- stdin (str or byte sequence, optional (default: None)) – Input to be sent to STDIN of the process
- check_returncode (bool, optional (default=True)) – Verify if call had returncode == 0, otherwise raise ExternalToolError
- working_dir (str, optional (default: None)) – Change to this directory before running command
- shell (bool, optional (default: False)) – Invoke shell when calling subprocess (default: False)
- env (dict, optional (default: None)) – Use this environment for executing the subprocess
Returns: - int – Return code of process
- stdout – Byte string with stdout output
- stderr – Byte string of stderr output
Raises:
-
evcouplings.utils.system.
temp
()[source]¶ Create a temporary file
Returns: Path of temporary file Return type: str
-
evcouplings.utils.system.
tempdir
()[source]¶ Create a temporary directory
Returns: Path of temporary directory Return type: str
-
evcouplings.utils.system.
valid_file
(file_path)[source]¶ Verify if a file exists and is not empty.
Parameters: file_path (str) – Path to file to check Returns: True if file exists and is non-zero size, False otherwise. Return type: bool
-
evcouplings.utils.system.
verify_resources
(message, *args)[source]¶ Verify if a set of files exists and is not empty.
Parameters: - message (str) – Message to display with raised ResourceError
- *args (List of str) – Path(s) of file(s) to be checked
Raises: ResourceError
– If any of the resources does not exist or is empty
evcouplings.utils.update_database module¶
command-line app to update the necessary databases
- Authors:
- Benjamin Schubert
-
evcouplings.utils.update_database.
download_ftp_file
(ftp_url, ftp_cwd, file_url, output_path, file_handling='wb', gziped=False, verbose=False)[source]¶ Downloads a gzip file from a remote ftp server and decompresses it on the fly into an output file
Parameters: - ftp_url (str) – the FTP server url
- ftp_cwd (str) – the FTP directory of the file to download
- file_url (str) – the file name that gets downloaded
- output_path (str) – the path to the output file on the local system
- file_handling (str) – the file handling mode (default: ‘wb’)
- verbose (bool) – determines whether a progressbar is printed