evcouplings.align package¶
evcouplings.align.alignment module¶
evcouplings.align.pfam module¶
Code for identifying Pfam domains and mapping Pfam alignments and ECs into target sequence mode.
- Authors:
- Thomas A. Hopf
Todo
- Write code to create list of family sizes
- Implement alignments against Pfam-HMM so precomputed results can be reused in focus mode
-
evcouplings.align.pfam.
create_family_size_table
(full_pfam_file, outfile=None)[source]¶ Parse family size table from Pfam flat file (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.full.gz)
Parameters: Returns: Parsed Pfam table.
Return type: pd.DataFrame
-
evcouplings.align.pfam.
pfam_hits
(query_file, hmm_database, prefix, clan_table_file, size_table_file, resolve_overlaps=True, **kwargs)[source]¶ Identify hits of Pfam HMMs in a set of sequences.
Parameters: - query_file (str) – File containing query sequence(s)
- hmm_database (str) – File containing HMM database (Pfam-A.hmm, with hmmpress applied)
- prefix (str) – Prefix path for output files. Folder structure in the prefix will be created if not existing.
- clan_table_file (str) – File with table linking Pfam families to clans (Pfam-A.clans.tsv). Set to None if not available, but resolve_overlaps cannot be True in that case.
- size_table_file (str) – File with table of family sizes. Create using create_family_size_table(). Set to None if not available.
- resolve_overlaps (bool) – Resolve overlapping hits by families from the same clan. Only possible if clan_table_file is given.
- **kwargs (dict) – kwargs passed on to evcouplings.align.tools.run_hmmscan
Returns: Pfam hit table
Return type: pd.DataFrame
-
evcouplings.align.pfam.
remove_clan_overlaps
(pfam_table)[source]¶ Remove overlapping Pfam hits from same Pfam clan (equivalent of PfamScan.pl). Currently only allows to remove overlaps by domain bitscore.
Todo
is bitscore the most sensible choice if different length hits?
Parameters: pfam_table (pd.DataFrame) – Pfam hit table as generated by pfam_hits() function (must contain Pfam clan annotation). Returns: Pfam hit table with lower-scoring overlaps removed Return type: pd.DataFrame
evcouplings.align.protocol module¶
Protein sequence alignment creation protocols/workflows.
- Authors:
- Thomas A. Hopf Anna G. Green - complex protocol, hmm_build_and_search Chan Kang - hmm_build_and_search
-
evcouplings.align.protocol.
complex
(**kwargs)[source]¶ Protocol:
Run monomer alignment protocol and postprocess it for EVcomplex calculations
Parameters: kwargs arguments (Mandatory) – See list below in code where calling check_required Returns: outcfg – Output configuration of the alignment protocol, and the following additional field: - genome_location_file : path to file containing
- the genomic locations for CDs’s corresponding to identifiers in the alignment.
Return type: dict
-
evcouplings.align.protocol.
cut_sequence
(sequence, sequence_id, region=None, first_index=None, out_file=None)[source]¶ Cut a given sequence to sub-range and save it in a file
Parameters: - sequence (str) – Full sequence that will be cut
- sequence_id (str) – Identifier of sequence, used to construct header in output file
- region (tuple(int, int), optional (default: None)) – Region that will be cut out of full sequence. If None, full sequence will be returned.
- first_index (int, optional (default: None)) – Define index of first position in sequence. Will be set to 1 if None.
- out_file (str, optional (default: None)) – Save sequence in a FASTA file (header: >sequence_id/start_region-end_region)
Returns: - str – Subsequence contained in region
- tuple(int, int) – Region. If no input region is given, this will be (1, len(sequence)); otherwise, the input region is returned.
Raises: InvalidParameterError
– Upon invalid region specification (violating boundaries of sequence)
-
evcouplings.align.protocol.
describe_coverage
(alignment, prefix, first_index, minimum_column_coverage)[source]¶ Produce “classical” buildali coverage statistics, i.e. number of sequences, how many residues have too many gaps, etc.
Only to be applied to alignments focused around the target sequence.
Parameters: - alignment (Alignment) – Alignment for which coverage statistics will be calculated
- prefix (str) – Prefix of alignment file that will be stored as identifier in table
- first_index (int) – Sequence index of first position of target sequence
- minimum_column_coverage (Iterable(float) or float) –
Minimum column coverage threshold(s) that will be tested (creating one row for each threshold in output table).
Note
int
values given to this function instead of a float will be divided by 100 to create the corresponding floating point representation. This parameter is 1.0 - maximum fraction of gaps per column.
Returns: Table with coverage statistics for different gap thresholds
Return type: pd.DataFrame
-
evcouplings.align.protocol.
describe_frequencies
(alignment, first_index, target_seq_index=None)[source]¶ Get parameters of alignment such as gaps, coverage, conservation and summarize.
Parameters: - alignment (Alignment) – Alignment for which description statistics will be calculated
- first_index (int) – Sequence index of first residue in target sequence
- target_seq_index (int, optional (default: None)) – If given, will add the symbol in the target sequence into a separate column of the output table
Returns: Table detailing conservation and symbol frequencies for all positions in the alignment
Return type: pandas.DataFrame
-
evcouplings.align.protocol.
describe_seq_identities
(alignment, target_seq_index=0)[source]¶ Calculate sequence identities of any sequence to target sequence and create result dataframe.
Parameters: alignment (Alignment) – Alignment for which description statistics will be calculated Returns: Table giving the identity to target sequence for each sequence in alignment (in order of occurrence) Return type: pandas.DataFrame
-
evcouplings.align.protocol.
existing
(**kwargs)[source]¶ Protocol:
Use external sequence alignment and extract all relevant information from there (e.g. sequence, region, etc.), then apply gap & fragment filtering as usual
Parameters: kwargs arguments (Mandatory) – See list below in code where calling check_required Returns: outcfg – Output configuration of the pipeline, including the following fields: - sequence_id (passed through from input)
- alignment_file
- raw_focus_alignment_file
- statistics_file
- sequence_file
- first_index
- target_sequence_file
- annotation_file (None)
- frequencies_file
- identities_file
- focus_mode
- focus_sequence
- segments
Return type: dict
-
evcouplings.align.protocol.
extract_header_annotation
(alignment, from_annotation=True)[source]¶ Extract Uniprot/Uniref sequence annotation from Stockholm file (as output by jackhmmer). This function may not work for other formats.
Parameters: - alignment (Alignment) – Multiple sequence alignment object
- from_annotation (bool, optional (default: True)) – Use annotation line (in Stockholm file) rather than sequence ID line (e.g. in FASTA file)
Returns: Table containing all annotation (one row per sequence in alignment, in order of occurrence)
Return type: pandas.DataFrame
-
evcouplings.align.protocol.
fetch_sequence
(sequence_id, sequence_file, sequence_download_url, out_file)[source]¶ Fetch sequence either from database based on identifier, or from input sequence file.
Parameters: - sequence_id (str) – Identifier of sequence that should be retrieved
- sequence_file (str) – File containing sequence. If None, sqeuence will be downloaded from sequence_download_url
- sequence_download_url (str) – URL from which to download missing sequence. Must contain “{}” at the position where sequence ID will be inserted into download URL (using str.format).
- out_file (str) – Output file in which sequence will be stored, if sequence_file is not existing.
Returns: - str – Path of file with stored sequence (can be sequence_file or out_file)
- tuple (str, str) – Identifier of sequence as stored in file, and sequence
-
evcouplings.align.protocol.
hmmbuild_and_search
(**kwargs)[source]¶ Protocol:
Build HMM from sequence alignment using hmmbuild and search against a sequence database using hmmsearch.
Parameters: kwargs arguments (Mandatory) – See list below in code where calling check_required Returns: outcfg – Output configuration of the protocol, including the following fields: - target_sequence_file
- sequence_file
- raw_alignment_file
- hittable_file
- focus_mode
- focus_sequence
- segments
Return type: dict
-
evcouplings.align.protocol.
jackhmmer_search
(**kwargs)[source]¶ Protocol:
Iterative jackhmmer search against a sequence database.
Parameters: kwargs arguments (Mandatory) – See list below in code where calling check_required :param .. todo::: explain meaning of parameters in detail.
Returns: outcfg – Output configuration of the protocol, including the following fields: - sequence_id (passed through from input)
- first_index (passed through from input)
- target_sequence_file
- sequence_file
- raw_alignment_file
- hittable_file
- focus_mode
- focus_sequence
- segments
Return type: dict
-
evcouplings.align.protocol.
modify_alignment
(focus_ali, target_seq_index, target_seq_id, region_start, **kwargs)[source]¶ Apply pairwise identity filtering, fragment filtering, and exclusion of columns with too many gaps to a sequence alignment. Also generates files describing properties of the alignment such as frequency distributions, conservation, and “old-style” alignment statistics files.
Note
assumes focus alignment (otherwise unprocessed) as input.
Todo
come up with something more clever to filter fragments than fixed width (e.g. use 95% quantile of length distribution as reference point)
Parameters: - focus_ali (Alignment) – Focus-mode input alignment
- target_seq_index (int) – Index of target sequence in alignment
- target_seq_id (str) – Identifier of target sequence (without range)
- region_start (int) – Index of first sequence position in target sequence
- kwargs (See required arguments in source code) –
Returns: - outcfg (Dict) – File products generated by the function:
- alignment_file
- statistics_file
- frequencies_file
- identities_file
- raw_focus_alignment_file
- ali (Alignment) – Final processed alignment
-
evcouplings.align.protocol.
run
(**kwargs)[source]¶ Run alignment protocol to generate multiple sequence alignment from input sequence.
Parameters: - kwargs arguments (Mandatory) – protocol: Alignment protocol to run prefix: Output prefix for all generated files
- Optional –
Returns: - Alignment
- Dictionary with results of stage in following fields (in brackets - not returned by all protocols) –
- alignment_file
- [raw_alignment_file]
- statistics_file
- target_sequence_file
- sequence_file
- [annotation_file]
- frequencies_file
- identities_file
- [hittable_file]
- focus_mode
- focus_sequence
- segments
-
evcouplings.align.protocol.
search_thresholds
(use_bitscores, seq_threshold, domain_threshold, seq_len)[source]¶ Set homology search inclusion parameters.
- HMMER hits get included in the HMM according to a two-step rule
- sequence passes sequence-level treshold
- domain passes domain-level threshold
- Therefore, search thresholds are set based on the following logic:
- If only sequence threshold is given, a MissingParameterException is raised
- If only bitscore threshold is given, sequence threshold is set to the same
- If both thresholds are given, they are according to defined values
- Valid inputs for bitscore thresholds:
- int or str: taken as absolute score threshold
- float: taken as relative threshold (absolute threshold derived by
multiplication with domain length)
- Valid inputs for integer thresholds:
- int: Used as negative exponent, threshold will be set to 1E-<exponent>
- float or str: Interpreted literally
Parameters: - use_bitscores (bool) – Use bitscore threshold instead of E-value threshold
- domain_threshold (str or int or float) – Domain-level threshold. See rules above.
- seq_threshold (str or int or float) – Sequence-level threshold. See rules above.
- seq_len (int) – Length of sequence. Used to calculate absolute bitscore threshold for relative bitscore thresholds.
Returns: Sequence- and domain-level thresholds ready to be fed into HMMER
Return type:
-
evcouplings.align.protocol.
standard
(**kwargs)[source]¶ Protocol:
Standard buildali4 workflow (run iterative jackhmmer search against sequence database, than determine which sequences and columns to include in the calculation based on coverage and maximum gap thresholds).
Parameters: kwargs arguments (Mandatory) – See list below in code where calling check_required Returns: - outcfg (dict) – Output configuration of the pipeline, including
the following fields:
- sequence_id (passed through from input)
- first_index (passed through from input)
- alignment_file
- raw_alignment_file
- raw_focus_alignment_file
- statistics_file
- target_sequence_file
- sequence_file
- annotation_file
- frequencies_file
- identities_file
- hittable_file
- focus_mode
- focus_sequence
- segments
- ali (Alignment) – Final sequence alignment
- outcfg (dict) – Output configuration of the pipeline, including
the following fields:
evcouplings.align.tools module¶
Wrappers for running external sequence alignment tools
- Authors:
- Thomas A. Hopf Anna G. Green - run_hmmbuild, run_hmmsearch Chan Kang - run_hmmbuild, run_hmmsearch
-
class
evcouplings.align.tools.
HmmbuildResult
(prefix, hmmfile, output)¶ Bases:
tuple
-
hmmfile
¶ Alias for field number 1
-
output
¶ Alias for field number 2
-
prefix
¶ Alias for field number 0
-
-
class
evcouplings.align.tools.
HmmscanResult
(prefix, output, tblout, domtblout, pfamtblout)¶ Bases:
tuple
-
domtblout
¶ Alias for field number 3
-
output
¶ Alias for field number 1
-
pfamtblout
¶ Alias for field number 4
-
prefix
¶ Alias for field number 0
-
tblout
¶ Alias for field number 2
-
-
class
evcouplings.align.tools.
HmmsearchResult
(prefix, alignment, output, tblout, domtblout)¶ Bases:
tuple
-
alignment
¶ Alias for field number 1
-
domtblout
¶ Alias for field number 4
-
output
¶ Alias for field number 2
-
prefix
¶ Alias for field number 0
-
tblout
¶ Alias for field number 3
-
-
class
evcouplings.align.tools.
JackhmmerResult
(prefix, alignment, output, tblout, domtblout)¶ Bases:
tuple
-
alignment
¶ Alias for field number 1
-
domtblout
¶ Alias for field number 4
-
output
¶ Alias for field number 2
-
prefix
¶ Alias for field number 0
-
tblout
¶ Alias for field number 3
-
-
evcouplings.align.tools.
read_hmmer_domtbl
(filename)[source]¶ Read a HMMER domtbl file into DataFrame.
Parameters: filename (str) – Path of domtbl file Returns: DataFrame with parsed domtbl Return type: pd.DataFrame
-
evcouplings.align.tools.
read_hmmer_tbl
(filename)[source]¶ Read a HMMER tbl file into DataFrame.
Parameters: filename (str) – Path of tbl file Returns: DataFrame with parsed tbl Return type: pd.DataFrame
-
evcouplings.align.tools.
run_hhfilter
(input_file, output_file, threshold=95, columns='a2m', binary='hhfilter')[source]¶ Redundancy-reduce a sequence alignment using hhfilter from the HHsuite alignment suite.
Parameters: - input_file (str) – Path to input alignment in A2M/FASTA format
- output_file (str) – Path to output alignment (will be in A3M format)
- threshold (int, optional (default: 95)) – Sequence identity threshold for maximum pairwise identity (between 0 and 100)
- columns ({"first", "a2m"}, optional (default: "a2m")) – Definition of match columns (based on first sequence or upper-case columns (a2m))
- binary (str) – Path to hhfilter binary
Returns: output_file
Return type: Raises: ResourceError
– If output alignment is non-existent/emptyValueError
– Upon invalid value of columns parameter
-
evcouplings.align.tools.
run_hmmbuild
(alignment_file, prefix, cpu=None, stdout_redirect=None, symfrac=None, binary='hmmbuild')[source]¶ Profile HMM construction from multiple sequence alignments Refer to HMMER documentation for details.
http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf
Parameters: - alignment_file (str) – File containing the multiple sequence alignment. Can be in Stockholm, a2m, or clustal formats, or any other format recognized by hmmer. Please note that ALL POSITIONS above the symfrac cutoff will be used in HMM construction (if the alignment contains columns that are insertions relative to the query sequence, this may be problematic for structure comparison)
- prefix (str) – Prefix path for output files. Folder structure in the prefix will be created if not existing.
- cpu (int, optional (default: None)) – Number of CPUs to use for search. Uses all if None.
- stdout_redirect (str, optional (default: None)) – Redirect bulky stdout instead of storing with rest of results (use “/dev/null” to dispose)
- symfrac (float, optional (default: None)) – range 0.0 - 1.0, HMMbuild will use columns with > symfrac percent gaps to construct the HMM. If None provided, HMMbuild internal default is 0.5. (Note: this is calculated after their internal sequence weighting is calculated)
- binary (str (default: "hmmbuild")) – Path to jackhmmer binary (put in PATH for default to work)
Returns: namedtuple with fields corresponding to the different output files (prefix, alignment, output, tblout, domtblout)
Return type: Raises: ExternalToolError, ResourceError
-
evcouplings.align.tools.
run_hmmscan
(query, database, prefix, use_model_threshold=True, threshold_type='cut_ga', use_bitscores=True, domain_threshold=None, seq_threshold=None, nobias=False, cpu=None, stdout_redirect=None, binary='hmmscan')[source]¶ Run hmmscan of HMMs in database against sequences in query to identify matches of these HMMs. Refer to HMMER Userguide for explanation of these parameters.
Parameters: - query (str) – File containing query sequence(s)
- database (str) – File containing HMM database (prepared with hmmpress)
- prefix (str) – Prefix path for output files. Folder structure in the prefix will be created if not existing.
- use_model_threshold (bool (default: True)) – Use model-specific inclusion thresholds from HMM database rather than global bitscore/E-value thresholds (use_bitscores, domain_threshold and seq_threshold are overriden by this flag).
- threshold-type ({"cut_ga", "cut_nc", "cut_tc"} (default: "cut_ga")) – Use gathering (default), noise or trusted cutoff to define scan hits. Please refer to HMMER manual for details.
- use_bitscores (bool) – Use bitscore inclusion thresholds rather than E-values. Overriden by use_model_threshold flag.
- domain_threshold (int or float or str) – Inclusion threshold applied on the domain level (e.g. “1E-03” or 0.001 or 50)
- seq_threshold (int or float or str) – Inclusion threshold applied on the sequence level (e.g. “1E-03” or 0.001 or 50)
- nobias (bool, optional (default: False)) – Turn of bias correction
- cpu (int, optional (default: None)) – Number of CPUs to use for search. Uses all if None.
- stdout_redirect (str, optional (default: None)) – Redirect bulky stdout instead of storing with rest of results (use “/dev/null” to dispose)
- binary (str (default: "hmmscan")) – Path to hmmscan binary (put in PATH for default to work)
Returns: namedtuple with fields corresponding to the different output files (prefix, output, tblout, domtblout, pfamtblout)
Return type: Raises: ExternalToolError, ResourceError
-
evcouplings.align.tools.
run_hmmsearch
(hmmfile, database, prefix, use_bitscores, domain_threshold, seq_threshold, nobias=False, cpu=None, stdout_redirect=None, binary='hmmsearch')[source]¶ Search profile(s) against a sequence database. Refer to HMMER documentation for details.
http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf
Parameters: - hmmfile (str) – File containing the profile(s)
- database (str) – File containing sequence database
- prefix (str) – Prefix path for output files. Folder structure in the prefix will be created if not existing.
- use_bitscores (bool) – Use bitscore inclusion thresholds rather than E-values.
- domain_threshold (int or float or str) – Inclusion threshold applied on the domain level (e.g. “1E-03” or 0.001 or 50)
- seq_threshold (int or float or str) – Inclusion threshold applied on the sequence level (e.g. “1E-03” or 0.001 or 50)
- nobias (bool, optional (default: False)) – Turn of bias correction
- cpu (int, optional (default: None)) – Number of CPUs to use for search. Uses all if None.
- stdout_redirect (str, optional (default: None)) – Redirect bulky stdout instead of storing with rest of results (use “/dev/null” to dispose)
- binary (str (default: "hmmsearch")) – Path to jackhmmer binary (put in PATH for default to work)
Returns: namedtuple with fields corresponding to the different output files (prefix, alignment, output, tblout, domtblout)
Return type: Raises: ExternalToolError, ResourceError
-
evcouplings.align.tools.
run_jackhmmer
(query, database, prefix, use_bitscores, domain_threshold, seq_threshold, iterations=5, nobias=False, cpu=None, stdout_redirect=None, checkpoints_hmm=False, checkpoints_ali=False, binary='jackhmmer')[source]¶ Run jackhmmer sequence search against target database. Refer to HMMER Userguide for explanation of these parameters.
Parameters: - query (str) – File containing query sequence
- database (str) – File containing sequence database
- prefix (str) – Prefix path for output files. Folder structure in the prefix will be created if not existing.
- use_bitscores (bool) – Use bitscore inclusion thresholds rather than E-values.
- domain_threshold (int or float or str) – Inclusion threshold applied on the domain level (e.g. “1E-03” or 0.001 or 50)
- seq_threshold (int or float or str) – Inclusion threshold applied on the sequence level (e.g. “1E-03” or 0.001 or 50)
- iterations (int) – number of jackhmmer search iterations
- nobias (bool, optional (default: False)) – Turn of bias correction
- cpu (int, optional (default: None)) – Number of CPUs to use for search. Uses all if None.
- stdout_redirect (str, optional (default: None)) – Redirect bulky stdout instead of storing with rest of results (use “/dev/null” to dispose)
- checkpoints_hmm (bool, optional (default: False)) – Store checkpoint HMMs to prefix.<iter>.hmm
- checkpoints_ali (bool, optional (default: False)) – Store checkpoint alignments to prefix.<iter>.sto
- binary (str (default: "jackhmmer")) – Path to jackhmmer binary (put in PATH for default to work)
Returns: namedtuple with fields corresponding to the different output files (prefix, alignment, output, tblout, domtblout)
Return type: Raises: ExternalToolError, ResourceError