Fred2.IO Module

IO.ADBAdapter

IO.EnsemblAdapter

IO.FileReader

Fred2.IO.FileReader.read_annovar_exonic(annovar_file, gene_filter=None, experimentalDesig=None)

Reads a gene-based ANNOVAR output file, generates Variant objects containing all annotated Transcript ids, and returns them as a list of Variant objects.

Parameters:
  • annovar_file (str) – The path to the ANNOVAR file
  • gene_filter (list(str)) – A list of gene names of interest (only variants associated with these genes are generated)
Returns:

List of fully annotated Fred2.Core.Variant.Variant objects

Return type:

list(Variant)
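To illustrate roughly what gene_filter does, the sketch below filters rows of a gene-based ANNOVAR file by gene name. It is not Fred2's implementation: the helper name filter_annovar_rows is hypothetical, the rows are made up, and the code assumes the usual tab-separated exonic_variant_function layout in which the third column holds comma-separated GENE:TRANSCRIPT:... annotations.

```python
def filter_annovar_rows(lines, gene_filter=None):
    """Yield (genes, transcripts, columns) for rows that match gene_filter.

    Assumption: column 3 looks like 'GENE:NM_0001:exon1:c.A1G:p.M1V,'.
    """
    wanted = set(gene_filter) if gene_filter else None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        columns = line.split("\t")
        # Drop the empty string left behind by the trailing comma.
        annotations = [a.split(":") for a in columns[2].split(",") if a]
        genes = sorted({parts[0] for parts in annotations})
        transcripts = [parts[1] for parts in annotations if len(parts) > 1]
        # Without a filter every row is kept.
        if wanted is None or wanted.intersection(genes):
            yield genes, transcripts, columns


# Made-up rows, for illustration only.
rows = [
    "line1\tnonsynonymous SNV\tTP53:NM_000546:exon5:c.G1A:p.V1M,\t17\t100\t100\tG\tA",
    "line2\tsynonymous SNV\tBRCA1:NM_007294:exon2:c.C3T:p.A1A,\t17\t200\t200\tC\tT",
]
kept = list(filter_annovar_rows(rows, gene_filter=["TP53"]))
```

Here kept holds only the TP53 row. The real function goes further and builds Variant objects from the remaining columns.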

Fred2.IO.FileReader.read_fasta(files, type=<class 'Fred2.Core.Peptide.Peptide'>, id_position=1)

Generator function:

Reads one or more peptide, protein, or RNA sequences from one or more FASTA files. The user needs to specify the correct type of the underlying sequences: Peptide, Protein, or Transcript (for RNA).

Parameters:
  • files (list(str) or str) – A file name or a list of file names to read in
  • type (Peptide or Transcript or Protein) – The type to read in
  • id_position (int) – The position of the id in the FASTA header, counted in fields separated by |
Returns:

A list of objects of the specified sequence type, derived from the sequences in the FASTA file(s).

Return type:

(list(type))

Raises ValueError:

if a file is not readable
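The role of id_position can be sketched with a small stand-alone FASTA generator. This is an illustration, not the code behind read_fasta: the names iter_fasta and pick_id are hypothetical, and the sketch assumes that id_position is a zero-based index into the header split on |, with the full header used as a fallback.

```python
def pick_id(header, id_position):
    """Return the |-separated field at id_position, or the whole header."""
    fields = header.split("|")
    return fields[id_position] if id_position < len(fields) else header


def iter_fasta(lines, id_position=1):
    """Lazily yield (identifier, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield pick_id(header, id_position), "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield pick_id(header, id_position), "".join(chunks)


fasta = [
    ">sp|P04637|P53_HUMAN",
    "MEEPQSDPSV",
    "EPPLSQETFS",
    ">plain_header",
    "acgu",
]
records = list(iter_fasta(fasta))
```

With the default id_position=1, the first record is identified by the accession field of its header; the second header has no | and is used whole.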

Fred2.IO.FileReader.read_lines(files, type=<class 'Fred2.Core.Peptide.Peptide'>)

Generator function:

Reads one sequence per line. The user needs to manually specify the correct type of the underlying data: Peptide, Protein, Transcript, or Allele.

Parameters:
Returns:

A list of the specified objects

Return type:

(list(type))

Raises IOError:

if a file is not readable
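The idea of passing a type into a line-based generator can be sketched as follows. The class Sequence and the function iter_lines are hypothetical stand-ins rather than Fred2 names, and skipping blank lines is an assumption of this sketch.

```python
class Sequence(str):
    """Hypothetical stand-in for a type such as Peptide or Allele."""


def iter_lines(lines, seq_type=Sequence):
    """Lazily yield one seq_type object per non-empty line."""
    for line in lines:
        text = line.strip()
        if text:  # assumption: blank lines are skipped
            yield seq_type(text)


entries = iter_lines(["SYFPEITHI\n", "\n", "HLA-A*02:01\n"])
first = next(entries)      # only the first line has been consumed so far
remaining = list(entries)  # the rest is read on demand
```

Because the function is a generator, nothing is read until the caller iterates, which keeps memory use flat for large input files.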

IO.MartsAdapter

class Fred2.IO.MartsAdapter.MartsAdapter(usr=None, host=None, pwd=None, db=None, biomart=None)

Bases: Fred2.IO.ADBAdapter.ADBAdapter

get_all_variant_gene(locations, _db='hsapiens_gene_ensembl', _dataset='gene_ensembl_config')

Fetches the important db ids and names for given chromosomal location

Parameters:
  • chrom – integer value of the chromosome in question
  • start – integer value of the variation start position on given chromosome
  • stop – integer value of the variation stop position on given chromosome
Returns:

The respective gene name, i.e. the first one reported

get_all_variant_ids(**kwargs)

Fetches the important db ids and names for a given gene or chromosomal location. The former is recommended. The result is a list of dicts, each with one of the three combinations:

  • ‘Ensembl Gene ID’, ‘Ensembl Transcript ID’, ‘Ensembl Protein ID’
  • ‘RefSeq Protein ID [e.g. NP_001005353]’, ‘RefSeq mRNA [e.g. NM_001195597]’, first triplet
  • ‘RefSeq Predicted Protein ID [e.g. XP_001720922]’, ‘RefSeq mRNA predicted [e.g. XM_001125684]’, first triplet
Parameters:
  • 'locations' – list of locations as triplets of integer values representing (chrom, start, stop)
  • 'genes' – list of genes as string value of the genes of variation
Returns:

The list of dicts of entries with transcript and protein ids (either NM+NP or XM+XP)
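Working with the returned dicts might look like the sketch below. It does not query BioMart: the entries are made up, the helper refseq_pair is hypothetical, and the dict keys are assumed to be exactly the field names listed above.

```python
NM_KEY = "RefSeq mRNA [e.g. NM_001195597]"
NP_KEY = "RefSeq Protein ID [e.g. NP_001005353]"
XM_KEY = "RefSeq mRNA predicted [e.g. XM_001125684]"
XP_KEY = "RefSeq Predicted Protein ID [e.g. XP_001720922]"


def refseq_pair(entry):
    """Return (transcript_id, protein_id), preferring NM+NP over XM+XP."""
    for transcript_key, protein_key in ((NM_KEY, NP_KEY), (XM_KEY, XP_KEY)):
        if entry.get(transcript_key) and entry.get(protein_key):
            return entry[transcript_key], entry[protein_key]
    return None  # entry carries only Ensembl ids


# Shape of the keyword arguments described above (values are made up).
query = {"locations": [(1, 100, 200)]}  # (chrom, start, stop) triplets

# Made-up result entries, for illustration only.
entries = [
    {"Ensembl Gene ID": "G1", NM_KEY: "NM_0001", NP_KEY: "NP_0001"},
    {"Ensembl Gene ID": "G2", XM_KEY: "XM_0002", XP_KEY: "XP_0002"},
    {"Ensembl Gene ID": "G3"},
]
pairs = [refseq_pair(entry) for entry in entries]
```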

get_product_sequence(product_refseq, _db='hsapiens_gene_ensembl', _dataset='gene_ensembl_config')

Fetches product sequence for the given id

Parameters: product_refseq – given refseq id
Returns: list of dictionaries of the requested sequence, the respective strand and the associated gene name

get_protein_sequence_from_protein_id(**kwargs)

Returns the protein sequence for a given protein ID, which can be a RefSeq, UniProt, or Ensembl id

Parameters:kwargs
Returns:
get_transcript_information(transcript_refseq, _db='hsapiens_gene_ensembl', _dataset='gene_ensembl_config')

It also already uses the Field-Enum for DBAdapters

Fetches transcript sequence for the given id

Parameters: transcript_refseq
Returns: list of dictionaries of the requested sequence, the respective strand and the associated gene name

get_transcript_information_from_protein_id(**kwargs)

It also already uses the Field-Enum for DBAdapters

Fetches transcript sequence for the given id

Parameters: transcript_refseq
Returns: list of dictionaries of the requested sequence, the respective strand and the associated gene name

get_transcript_position(start, stop, gene_id, transcript_id, _db='hsapiens_gene_ensembl', _dataset='gene_ensembl_config')

If no transcript position is available for the variant

Parameters:
  • start
  • stop
  • gene_id
  • transcript_id
  • _db
  • _dataset
Returns:

get_transcript_sequence(transcript_refseq, _db='hsapiens_gene_ensembl', _dataset='gene_ensembl_config')

Fetches transcript sequence for the given id

Parameters: transcript_refseq
Returns: list of dictionaries of the requested sequence, the respective strand and the associated gene name

get_variant_gene(chrom, start, stop, _db='hsapiens_gene_ensembl', _dataset='gene_ensembl_config')

Fetches the important db ids and names for given chromosomal location

Parameters:
  • chrom – integer value of the chromosome in question
  • start – integer value of the variation start position on given chromosome
  • stop – integer value of the variation stop position on given chromosome
Returns:

The respective gene name, i.e. the first one reported

get_variant_id_from_gene_id(**kwargs)

Returns all information needed to instantiate a variation

Parameters: trans_id – A transcript ID (either Ensembl (ENS) or RefSeq (NM, XM))
Returns: list of dicts – containing all information needed for a variant initialization
get_variant_id_from_protein_id(**kwargs)

Returns all information needed to instantiate a variation

Parameters: trans_id – A transcript ID (either Ensembl (ENS) or RefSeq (NM, XM))
Returns: list of dicts – containing all information needed for a variant initialization
get_variant_ids(**kwargs)

Fetches the important db ids and names for a given gene or chromosomal location. The former is recommended. The result is a list of dicts, each with one of the three combinations:

  • ‘Ensembl Gene ID’, ‘Ensembl Transcript ID’, ‘Ensembl Protein ID’
  • ‘RefSeq Protein ID [e.g. NP_001005353]’, ‘RefSeq mRNA [e.g. NM_001195597]’, first triplet
  • ‘RefSeq Predicted Protein ID [e.g. XP_001720922]’, ‘RefSeq mRNA predicted [e.g. XM_001125684]’, first triplet
Parameters:
  • 'chrom' – integer value of the chromosome in question
  • 'start' – integer value of the variation start position on given chromosome
  • 'stop' – integer value of the variation stop position on given chromosome
  • 'gene' – string value of the gene of variation
  • 'transcript_id' – string value of the transcript id of variation
Returns:

The list of dicts of entries with transcript and protein ids (either NM+NP or XM+XP)

IO.RefSeqAdapter

class Fred2.IO.RefSeqAdapter.RefSeqAdapter(prot_file=None, prot_vers=None, mrna_file=None, mrna_vers=None)

Bases: Fred2.IO.ADBAdapter.ADBAdapter

get_product_sequence(product_refseq)

Fetches product sequence for the given id

Parameters: product_refseq – given refseq id
Returns: list of dictionaries of the requested sequence, the respective strand and the associated gene name

get_transcript_information(transcript_refseq)
get_transcript_sequence(transcript_refseq)

Fetches transcript sequence for the given id

Parameters: transcript_refseq
Returns: list of dictionaries of the requested sequence, the respective strand and the associated gene name

load(filename)

IO.UniProtAdapter

class Fred2.IO.UniProtAdapter.UniProtDB(name='fdb')
exists(seq)

Fast check whether the given sequence exists (as a subsequence) in the UniProtDB object's collection of sequences.

Parameters: seq – the subsequence to be searched for
Returns: True if it is found somewhere, False otherwise
read_seqs(sequence_file)

Reads sequences from UniProt files (.dat or .fasta) or from lists or dicts of BioPython SeqRecords and makes them available for fast search. Calling this function again appends to the existing collection.

Parameters: sequence_file – UniProt files (.dat or .fasta)
Returns:
search(seq)

Searches for the first occurrence of the given sequence(s) in the UniProtDB object's collection and returns, for each sequence, the front part of the FASTA header of that first occurrence.

Parameters: seq – a string interpreted as a single sequence, or a list (of str) interpreted as a collection of sequences
Returns: a dictionary of the given sequences to lists (of ids, ‘null’ if n/a)
search_all(seq)

Searches for all occurrences of the given sequence(s) in the UniProtDB object's collection and returns, for each sequence, the front part of the FASTA header of every occurrence.

Parameters: seq – a string interpreted as a single sequence, or a list (of str) interpreted as a collection of sequences
Returns: a dictionary of the given sequences to lists (of ids, ‘null’ if n/a)
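The behaviour of exists and search_all can be sketched with a plain dict that maps ids to sequences. This is a naive linear scan for illustration, not the fast lookup UniProtDB uses internally. The function names exists_in and search_all_in are hypothetical, and representing a miss as the list ['null'] is an assumption about how ‘null’ appears in the result.

```python
def exists_in(collection, seq):
    """Return True if seq occurs as a subsequence of any stored sequence."""
    return any(seq in stored for stored in collection.values())


def search_all_in(collection, seqs):
    """Map each query sequence to the ids of all entries containing it."""
    if isinstance(seqs, str):
        seqs = [seqs]  # a single string is treated as one query
    result = {}
    for query in seqs:
        hits = [sid for sid, stored in collection.items() if query in stored]
        result[query] = hits if hits else ["null"]  # assumed miss marker
    return result


# Made-up ids and sequences, for illustration only.
collection = {
    "ID_A": "MKTAYIAKQR",
    "ID_B": "GGTAYIAPLL",
    "ID_C": "WWWWWWWWWW",
}
found = search_all_in(collection, ["TAYIA", "CCCC"])
single = search_all_in(collection, "WWW")
```

A search for only the first occurrence would return the first element of each hit list instead of the whole list.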
write_seqs(name)

Writes all FASTA entries in the current object into one FASTA file

Parameters: name – the complete path, including the file name, where the FASTA file is going to be written