Title: | Basic Sequence Processing Tool for Biological Data |
---|---|
Description: | Primarily created as an easy and understandable way to perform basic sequence processing around the central dogma of molecular biology. |
Authors: | Ambu Vijayan [aut, cre] |
Maintainer: | Ambu Vijayan <[email protected]> |
License: | GPL-3 |
Version: | 0.1.1 |
Built: | 2025-02-28 03:30:47 UTC |
Source: | https://github.com/ambuvjyn/baseq |
This function reads a multi FASTA file containing DNA sequences, removes any characters other than A, T, G, and C, and writes the cleaned sequences to a new multi FASTA file. The output file name is generated from the input file name with the suffix '_clean.fasta'.
clean_DNA_file(input_file)
input_file | The name of the input multi FASTA file. |
A character string specifying the path to the output FASTA file.
sample_file_path_three <- system.file("extdata", "sample2_fa.fasta", package = "baseq")
clean_DNA_file(sample_file_path_three)
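The file-cleaning step described for clean_DNA_file can be sketched in base R as follows. This is a minimal illustration, not the package's own code: the helper name sketch_clean_fasta is hypothetical, and two details are assumptions, namely that sequences are upper-cased before filtering and that the output name is built by dropping the last extension and appending _clean.fasta.

```r
# Hypothetical helper: a base R sketch, not baseq's implementation.
sketch_clean_fasta <- function(input_file) {
  lines <- readLines(input_file)
  is_header <- startsWith(lines, ">")
  # Header lines are kept untouched; sequence lines lose every
  # character outside A, C, G, T (assumption: upper-case first).
  lines[!is_header] <- gsub("[^ACGT]", "", toupper(lines[!is_header]))
  # Assumed naming rule: drop the extension, append "_clean.fasta".
  output_file <- paste0(tools::file_path_sans_ext(input_file), "_clean.fasta")
  writeLines(lines, output_file)
  output_file
}

# Usage on a throwaway file in the session's temporary directory.
demo_in <- tempfile(fileext = ".fasta")
writeLines(c(">seq1", "ATGNNCGT", ">seq2", "ACGTRYAC"), demo_in)
demo_out <- sketch_clean_fasta(demo_in)
readLines(demo_out)
```

Returning the output path, as the documented function does, makes it easy to pass the cleaned file straight to a later step.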
This function takes a DNA sequence as input and removes any characters other than A, C, G, and T.
clean_DNA_sequence(sequence)
sequence | DNA sequence to be cleaned |
Cleaned DNA sequence
clean_DNA_sequence("ATGTCGTAGCTAGCTN") # Output: "ATGTCGTAGCTAGCT"
This function reads a multi FASTA file containing RNA sequences, removes any characters other than A, U, G, and C, and writes the cleaned sequences to a new multi FASTA file. The output file name is generated from the input file name with the suffix '_clean.fasta'.
clean_RNA_file(input_file)
input_file | The name of the input multi FASTA file. |
A character string specifying the path to the output FASTA file.
sample_file_path_three <- system.file("extdata", "sample2_fa.fasta", package = "baseq")
clean_RNA_file(sample_file_path_three)
This function takes an RNA sequence as input and removes any characters other than A, C, G, and U.
clean_RNA_sequence(sequence)
sequence | RNA sequence to be cleaned |
Cleaned RNA sequence
clean_RNA_sequence("AUGUCGTAGCTAGCTN") # Output: "AUGUCGAGCAGC"
This function takes a DNA or RNA sequence as input and removes any characters that are not A, C, G, T (for DNA) or A, C, G, U (for RNA).
clean_sequence(sequence, type = "DNA")
sequence | A character string containing the DNA or RNA sequence to be cleaned. |
type | A character string indicating the type of sequence. The default is "DNA". If set to "RNA", the function will remove any characters that are not A, C, G, U. |
A character string containing the cleaned DNA or RNA sequence.
clean_sequence("atgcNnRYMK") # Returns "ATGC"
clean_sequence("auggcuuNnRYMK", type = "RNA") # Returns "AUGGCUU"
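The filtering rule above can be sketched with a single regular expression in base R. The helper name sketch_clean_sequence is hypothetical and this is only one way to reproduce the documented results, not necessarily how the package does it. Upper-casing before filtering is inferred from the documented examples.

```r
# Hypothetical helper mirroring the documented behaviour of clean_sequence().
sketch_clean_sequence <- function(sequence, type = "DNA") {
  allowed <- switch(type,
    DNA = "ACGT",
    RNA = "ACGU",
    stop("type must be \"DNA\" or \"RNA\"")
  )
  # Upper-case first, then drop everything outside the allowed alphabet.
  gsub(paste0("[^", allowed, "]"), "", toupper(sequence))
}

sketch_clean_sequence("atgcNnRYMK")                  # "ATGC"
sketch_clean_sequence("auggcuuNnRYMK", type = "RNA") # "AUGGCUU"
```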
This function takes a single argument, a DNA sequence as a character string, and counts the number of A's, C's, G's, and T's in the sequence. The counts are returned as a named vector.
count_bases(sequence)
sequence | a character string containing a DNA sequence |
a named integer vector containing the counts of A's, C's, G's, and T's
sequence <- "ATCGAGCTAGCTAGCTAGCTAGCT"
count_bases(sequence)
# A C G T
# 6 6 6 6
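As an illustration of how such a named count vector can be produced, here is a small base R sketch. The name sketch_count_bases is hypothetical, and treating lower-case input as equivalent to upper-case is an assumption.

```r
# Hypothetical helper: counts A, C, G and T in one sequence string.
sketch_count_bases <- function(sequence) {
  chars <- strsplit(toupper(sequence), "")[[1]]
  bases <- c(A = "A", C = "C", G = "G", T = "T")
  # vapply() keeps the names of `bases`, giving a named integer vector.
  vapply(bases, function(b) sum(chars == b), integer(1))
}

sketch_count_bases("ATCGAGCTAGCTAGCTAGCTAGCT")
# A C G T
# 6 6 6 6
```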
This function counts the frequency of a specific character or pattern in a given sequence.
count_seq_pattern(seq, pattern)
seq | A character vector representing the sequence to count the pattern in. |
pattern | A character string representing the pattern to count in the sequence. |
An integer representing the count of the pattern in the sequence.
seq <- "ATGGTGCTCCGTGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCTACGTAG" count_seq_pattern(seq, "CG") # [1] 31
seq <- "ATGGTGCTCCGTGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCTACGTAG" count_seq_pattern(seq, "CG") # [1] 31
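One way to count a literal pattern in base R is with gregexpr(). The sketch below uses a hypothetical helper name and counts non-overlapping matches. Whether count_seq_pattern treats overlapping matches the same way is not stated in the documentation, so that is an assumption of this sketch.

```r
# Hypothetical helper: counts non-overlapping literal matches of `pattern`.
sketch_count_pattern <- function(seq, pattern) {
  hits <- gregexpr(pattern, seq, fixed = TRUE)[[1]]
  # gregexpr() returns -1 when there is no match at all.
  if (hits[1] == -1L) 0L else length(hits)
}

sketch_count_pattern("ACGCGTCG", "CG") # 3
sketch_count_pattern("AAAA", "AA")     # 2 (matches do not overlap)
sketch_count_pattern("ACGT", "TT")     # 0
```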
This function takes a DNA sequence as input and translates it in all six reading frames.
dna_to_protein(sequence)
sequence | A character string representing a DNA sequence. |
A list of character strings representing the translated protein sequences in all six frames.
sequence <- "ATCGAGCTAGCTAGCTAGCTAGCT"
dna_to_protein(sequence)
# Returns a list containing the translated protein sequences in all six frames:
# $`Frame F1`
# [1] "IELAS"
#
# $`Frame F2`
# [1] "SS"
#
# $`Frame F3`
# [1] "RAS"
#
# $`Frame R1`
# [1] "S"
#
# $`Frame R2`
# [1] "AS"
#
# $`Frame R3`
# [1] "LAS"
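Six-frame translation can be sketched in base R as three offsets on the forward strand plus three on the reverse complement. All names below are hypothetical and this is not the package's code. The sketch assumes the standard genetic code, cleaned A/C/G/T input, and truncation at the first stop codon, which is the behaviour the documented output suggests.

```r
# Standard genetic code, codons ordered T, C, A, G at each position.
bases <- c("T", "C", "A", "G")
grid <- expand.grid(third = bases, second = bases, first = bases,
                    stringsAsFactors = FALSE)
codon_table <- strsplit(
  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", ""
)[[1]]
names(codon_table) <- paste0(grid$first, grid$second, grid$third)

# Reverse complement of a DNA string.
sketch_revcomp <- function(dna) {
  chars <- strsplit(chartr("ACGT", "TGCA", toupper(dna)), "")[[1]]
  paste(rev(chars), collapse = "")
}

# Translate one frame; offset 0, 1 or 2 skips that many leading bases.
sketch_translate_frame <- function(dna, offset) {
  n <- nchar(dna)
  if (n - offset < 3) return("")
  starts <- seq(offset + 1, n - 2, by = 3)
  aa <- unname(codon_table[substring(dna, starts, starts + 2)])
  # Assumption: translation stops at the first stop codon ("*").
  stop_at <- match("*", aa)
  if (!is.na(stop_at)) aa <- aa[seq_len(stop_at - 1)]
  paste(aa, collapse = "")
}

sketch_six_frames <- function(dna) {
  dna <- toupper(dna)
  rc <- sketch_revcomp(dna)
  frames <- c(
    vapply(0:2, function(o) sketch_translate_frame(dna, o), character(1)),
    vapply(0:2, function(o) sketch_translate_frame(rc, o), character(1))
  )
  as.list(setNames(frames, paste("Frame", c("F1", "F2", "F3", "R1", "R2", "R3"))))
}

sketch_six_frames("ATCGAGCTAGCTAGCTAGCTAGCT")
```

For the example sequence this sketch yields the same six strings as the documented output, which supports (but does not prove) the stop-codon assumption.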
This function takes a DNA sequence as input and returns its RNA transcript.
dna_to_rna(sequence)
sequence | A character string representing a DNA sequence. |
A character string representing the RNA transcript of the input DNA sequence.
sequence <- "ATCGAGCTAGCTAGCTAGCTAGCT"
dna_to_rna(sequence)
# Returns "AUCGAGCUAGCUAGCUAGCUAGCU"
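Judging by the documented example, the conversion amounts to replacing every T with U (the coding-strand convention). A base R sketch with hypothetical helper names, including the inverse direction, could look like this:

```r
# Hypothetical helpers: swap T and U, preserving case.
sketch_dna_to_rna <- function(sequence) chartr("Tt", "Uu", sequence)
sketch_rna_to_dna <- function(sequence) chartr("Uu", "Tt", sequence)

rna <- sketch_dna_to_rna("ATCGAGCTAGCTAGCTAGCTAGCT")
rna                    # "AUCGAGCUAGCUAGCUAGCUAGCU"
sketch_rna_to_dna(rna) # "ATCGAGCTAGCTAGCTAGCTAGCT"
```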
This function converts a FASTQ file to a FASTA file. The output file has the same name as the input FASTQ file, but with the extension changed to .fasta. This function removes the @ symbol at the beginning of FASTQ sequence names and replaces it with the > symbol for the FASTA format.
fastq_to_fasta(fastq_file)
fastq_file | A character string specifying the path to the input FASTQ file. |
A character string specifying the path to the output FASTA file.
sample_file_path_two <- system.file("extdata", "sample_fq.fastq", package = "baseq")
fastq_to_fasta(sample_file_path_two)
# Output: "path/to/library/baseq/extdata/sample_fq.fasta"
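The conversion described above can be sketched in base R as follows. The helper name is hypothetical, and the sketch assumes a well-formed FASTQ file with exactly four lines per record and a .fastq extension. It is an illustration rather than the package's implementation.

```r
# Hypothetical helper: converts a four-line-per-record FASTQ file to FASTA.
sketch_fastq_to_fasta <- function(fastq_file) {
  lines <- readLines(fastq_file)
  stopifnot(length(lines) > 0, length(lines) %% 4 == 0)
  first <- seq(1, length(lines), by = 4)
  headers <- sub("^@", ">", lines[first]) # "@name" becomes ">name"
  seqs <- lines[first + 1]                # the line after each header
  output_file <- sub("\\.fastq$", ".fasta", fastq_file)
  # rbind() + as.vector() interleaves header and sequence lines.
  writeLines(as.vector(rbind(headers, seqs)), output_file)
  output_file
}

# Usage on a throwaway file.
demo_fq <- tempfile(fileext = ".fastq")
writeLines(c("@read1", "ACGT", "+", "IIII", "@read2", "GGCA", "+", "JJJJ"), demo_fq)
readLines(sketch_fastq_to_fasta(demo_fq))
```

The quality and separator lines are simply dropped, since FASTA has no place for them.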
Calculates the percentage of nucleotides in a DNA sequence that are either guanine (G) or cytosine (C).
gc_content(sequence)
sequence | A character string containing the DNA sequence. |
A numeric value representing the percentage of nucleotides in the sequence that are G or C.
sequence <- "ATCGAGCTAGCTAGCTAGCTAGCT"
gc_content(sequence)
# 50
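The percentage is the number of G and C bases divided by the sequence length, times 100. A base R sketch with a hypothetical helper name follows. Returning NA for an empty string is a choice made for this sketch, not documented package behaviour.

```r
# Hypothetical helper: GC percentage of a single sequence string.
sketch_gc_content <- function(sequence) {
  chars <- strsplit(toupper(sequence), "")[[1]]
  if (length(chars) == 0) return(NA_real_) # sketch-only choice for empty input
  100 * sum(chars %in% c("G", "C")) / length(chars)
}

sketch_gc_content("ATCGAGCTAGCTAGCTAGCTAGCT") # 50
```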
This function calculates the GC content of the sequences in a multi FASTA file and writes the results to a new FASTA file.
gc_content_file(input_file)
input_file | A string indicating the path and name of the input multi-FASTA file |
sample_file_path <- system.file("extdata", "sample_fa.fasta", package = "baseq")
gc_content_file(sample_file_path)
This function reads a fasta file and creates a dataframe with two columns: Header and Sequence. The dataframe is then assigned in the environment under the same name as the fasta file, without the .fasta extension.
read.fasta_to_df(fasta_file)
fasta_file | The path to the fasta file to be read. |
This function does not return anything. It assigns the resulting dataframe to the environment.
sample_file_path <- system.file("extdata", "sample_fa.fasta", package = "baseq")
read.fasta_to_df(sample_file_path)
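Parsing a multi-FASTA file into a Header/Sequence data frame can be sketched in base R as below. The helper name is hypothetical. Unlike the documented function, this sketch returns the data frame instead of assigning it in an environment, and it assumes the leading > is dropped from headers and that sequences may span several lines.

```r
# Hypothetical helper: returns a data frame rather than assigning one.
sketch_fasta_to_df <- function(fasta_file) {
  lines <- readLines(fasta_file)
  lines <- lines[nzchar(lines)]   # ignore blank lines
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)     # record number for every line
  headers <- sub("^>", "", lines[is_header])
  # Group sequence lines by record; empty records give "".
  groups <- split(lines[!is_header],
                  factor(record[!is_header], levels = seq_along(headers)))
  seqs <- vapply(groups, paste, character(1), collapse = "")
  data.frame(Header = headers, Sequence = unname(seqs),
             stringsAsFactors = FALSE)
}

# Usage on a throwaway file.
demo_fa <- tempfile(fileext = ".fasta")
writeLines(c(">seq1 demo", "ACGT", "TTGA", ">seq2", "GGCC"), demo_fa)
sketch_fasta_to_df(demo_fa)
```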
This function reads a fasta file and creates a list with two columns: Header and Sequence. The list is then assigned in the environment under the same name as the fasta file, without the .fasta extension.
read.fasta_to_list(fasta_file)
fasta_file | The path to the fasta file to be read. |
This function does not return anything. It assigns the resulting list to the environment.
sample_file_path <- system.file("extdata", "sample_fa.fasta", package = "baseq")
read.fasta_to_list(sample_file_path)
# Access a specific sequence by name
sample_fa[["sample_seq.1"]]
This function reads a Fastq file and stores it as a dataframe with three columns: Header, Sequence, and QualityScore.
read.fastq_to_df(fastq_file)
fastq_file | A character string specifying the path to the Fastq file to be read. |
This function returns a dataframe with three columns: Header, Sequence, and QualityScore.
sample_file_path_two <- system.file("extdata", "sample_fq.fastq", package = "baseq")
read.fastq_to_df(sample_file_path_two)
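A FASTQ record occupies four lines (header, sequence, separator, quality), so the three columns can be picked out by position. The base R sketch below uses a hypothetical helper name and assumes a well-formed file. Keeping the leading @ in the Header column is a choice made for this sketch, since the documentation does not say whether the package strips it.

```r
# Hypothetical helper: reads a four-line-per-record FASTQ file.
sketch_fastq_to_df <- function(fastq_file) {
  lines <- readLines(fastq_file)
  stopifnot(length(lines) > 0, length(lines) %% 4 == 0)
  first <- seq(1, length(lines), by = 4)
  data.frame(
    Header = lines[first],           # "@..." line, kept as-is here
    Sequence = lines[first + 1],
    QualityScore = lines[first + 3], # the "+" line at first + 2 is skipped
    stringsAsFactors = FALSE
  )
}

# Usage on a throwaway file.
demo_fq <- tempfile(fileext = ".fastq")
writeLines(c("@read1", "ACGT", "+", "IIII", "@read2", "GGCA", "+", "JJJJ"), demo_fq)
sketch_fastq_to_df(demo_fq)
```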
This function reads a Fastq file and stores it as a list with three columns: Header, Sequence, and QualityScore.
read.fastq_to_list(fastq_file)
fastq_file | A character string specifying the path to the Fastq file to be read. |
This function returns a list with three columns: Header, Sequence, and QualityScore.
# Read in sequences from a FASTQ file
sample_file_path_two <- system.file("extdata", "sample_fq.fastq", package = "baseq")
read.fastq_to_list(sample_file_path_two)
Given a DNA sequence, the function generates the reverse complement of the sequence and returns it.
reverse_complement(sequence)
sequence | A character string containing the DNA sequence to be reversed and complemented |
A character string containing the reverse complement of the input DNA sequence
sequence <- "ATCGAGCTAGCTAGCTAGCTAGCT"
reverse_complement(sequence)
# [1] "AGCTAGCTAGCTAGCTAGCTCGAT"
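Reverse complementing is two steps: swap each base for its partner, then reverse the string. A base R sketch follows. The helper name and its rna argument are hypothetical, introduced only to show that the same idea covers RNA with U in place of T.

```r
# Hypothetical helper: reverse complement for DNA (default) or RNA.
sketch_reverse_complement <- function(sequence, rna = FALSE) {
  complemented <- if (rna) {
    chartr("ACGU", "UGCA", toupper(sequence))
  } else {
    chartr("ACGT", "TGCA", toupper(sequence))
  }
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

sketch_reverse_complement("ATCGAGCTAGCTAGCTAGCTAGCT")
# "AGCTAGCTAGCTAGCTAGCTCGAT"
sketch_reverse_complement("AUCGAGCUAGCUAGCUAGCUAGCU", rna = TRUE)
# "AGCUAGCUAGCUAGCUAGCUCGAU"
```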
Given an RNA sequence, the function generates the reverse complement of the sequence and returns it.
rna_reverse_complement(sequence)
sequence | A character string containing the RNA sequence to be reversed and complemented |
A character string containing the reverse complement of the input RNA sequence
sequence <- "AUCGAGCUAGCUAGCUAGCUAGCU"
rna_reverse_complement(sequence)
# [1] "AGCUAGCUAGCUAGCUAGCUCGAU"
This function takes an RNA sequence as input and returns the corresponding DNA sequence.
rna_to_dna(sequence)
sequence | A character string representing an RNA sequence. |
A character string representing the DNA sequence corresponding to the input RNA sequence.
sequence <- "AUCGAGCUAGCUAGCUAGCUAGCU"
rna_to_dna(sequence)
# Returns "ATCGAGCTAGCTAGCTAGCTAGCT"
This function takes an RNA sequence as input and translates it in all six reading frames.
rna_to_protein(sequence)
sequence | A character string representing an RNA sequence. |
A list of character strings representing the translated protein sequences in all six frames.
sequence <- "AUCGAGCUAGCUAGCUAGCUAGCU"
rna_to_protein(sequence)
# Returns a list containing the translated protein sequences in all six frames:
# $`Frame F1`
# [1] "IELAS"
#
# $`Frame F2`
# [1] "SS"
#
# $`Frame F3`
# [1] "RAS"
#
# $`Frame R1`
# [1] "S"
#
# $`Frame R2`
# [1] "AS"
#
# $`Frame R3`
# [1] "LAS"
This function writes a data frame to a fasta file with the same name as the data frame. The data frame is assumed to have two columns, "Header" and "Sequence", which represent the header and sequence lines of each fasta record, respectively.
write.df_to_fasta(df)
df | A data frame containing fasta records with "Header" and "Sequence" columns. |
This function does not return a value, but writes a fasta file to the working directory.
sample_file_path <- system.file("extdata", "sample_fa.fasta", package = "baseq")
read.fasta_to_df(sample_file_path)
write.df_to_fasta(sample_fa)
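Naming the output file after the data frame relies on capturing the name of the argument as the caller wrote it, which base R offers through deparse(substitute()). The sketch below is a hypothetical helper, not the package's code. Its dir argument is an addition so the example writes to a temporary directory, and it assumes the Header column does not already carry the leading >.

```r
# Hypothetical helper: writes <name of df>.fasta into `dir`.
sketch_write_df_to_fasta <- function(df, dir = tempdir()) {
  # deparse(substitute(df)) recovers the variable name used by the caller.
  output_file <- file.path(dir, paste0(deparse(substitute(df)), ".fasta"))
  headers <- paste0(">", as.character(df$Header)) # assumes no ">" in Header
  # Interleave header and sequence lines, one record after another.
  writeLines(as.vector(rbind(headers, as.character(df$Sequence))), output_file)
  output_file
}

my_reads <- data.frame(Header = c("seq1", "seq2"),
                       Sequence = c("ACGT", "GGCC"),
                       stringsAsFactors = FALSE)
out_path <- sketch_write_df_to_fasta(my_reads)
basename(out_path) # "my_reads.fasta"
readLines(out_path)
```

A side effect of this design is that the function must be called with a plain variable name. Passing an expression would produce an awkward file name.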
Write a FASTQ file from a dataframe of reads
write.df_to_fastq(df)
df | A dataframe containing reads in the format "Header", "Sequence", and "QualityScore". |
A FASTQ file with the same name as the input dataframe.
sample_file_path_two <- system.file("extdata", "sample_fq.fastq", package = "baseq")
read.fastq_to_df(sample_file_path_two)
write.df_to_fastq(sample_fq)
This function takes a list of sequences and writes them to a FASTA file. The name of the list is used as the base name for the output file with the .fasta extension. Each sequence in the list is written to the output file in FASTA format with the sequence name as the header.
write.list_to_fasta(sequence_list)
sequence_list | A list of sequences where each element of the list is a character string representing a single sequence. |
sequences <- list("ACGT", "ATCG")
write.list_to_fasta(sequences)
This function takes a list of sequences and quality scores and writes them to a FASTQ file. The name of the list is used as the base name for the output file with the .fastq extension. Each sequence in the list is written to the output file in FASTQ format with the sequence name as the header and the quality scores on the following line.
write.list_to_fastq(sequence_list)
sequence_list | A list of sequences where each element of the list is a named list containing "Sequence" and "QualityScore" elements. |
sequences <- list("ACGT", "ATCG")
quality_scores <- list("IIII", "JJJJ")
sequences_list <- list(seq1=list(Sequence=sequences[[1]], QualityScore=quality_scores[[1]]),
                       seq2=list(Sequence=sequences[[2]], QualityScore=quality_scores[[2]]))
write.list_to_fastq(sequences_list)
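Writing such a nested list as FASTQ means emitting four lines per element: @ plus the element name, the sequence, a + separator, and the quality string. The base R sketch below uses a hypothetical helper name and a hypothetical dir argument (so the example writes to a temporary directory). It illustrates the format rather than the package's implementation.

```r
# Hypothetical helper: writes <name of list>.fastq into `dir`.
sketch_write_list_to_fastq <- function(sequence_list, dir = tempdir()) {
  output_file <- file.path(
    dir, paste0(deparse(substitute(sequence_list)), ".fastq")
  )
  records <- lapply(names(sequence_list), function(nm) {
    rec <- sequence_list[[nm]]
    # Four FASTQ lines: header, sequence, separator, quality.
    c(paste0("@", nm), rec$Sequence, "+", rec$QualityScore)
  })
  writeLines(as.character(unlist(records)), output_file)
  output_file
}

reads <- list(
  seq1 = list(Sequence = "ACGT", QualityScore = "IIII"),
  seq2 = list(Sequence = "ATCG", QualityScore = "JJJJ")
)
out_path <- sketch_write_list_to_fastq(reads)
readLines(out_path)
```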