| Title: | Basic Sequence Processing Tool for Biological Data |
|---|---|
| Description: | Primarily created as an easy and understandable way to perform basic sequence operations surrounding the central dogma of molecular biology. |
| Authors: | Ambu Vijayan [aut, cre] |
| Maintainer: | Ambu Vijayan <[email protected]> |
| License: | GPL-3 |
| Version: | 0.2.0 |
| Built: | 2026-06-10 10:35:56 UTC |
| Source: | https://github.com/ambuvjyn/baseq |
Creates an S3 object of class baseq_dna.
as_baseq_dna(s)

| s | A character string containing the sequence |
|---|---|

A baseq_dna object
Creates an S3 object of class baseq_rna.
as_baseq_rna(s)

| s | A character string containing the sequence |
|---|---|

A baseq_rna object
Converts baseq sequences to Biostrings format.
as_Biostrings(s)

| s | A character vector or list of sequences |
|---|---|

A DNAStringSet object
Computes N50, L50, and other assembly statistics.
calculate_assembly_stats(seqs)

| seqs | A character vector or list of sequences (contigs) |
|---|---|

A named numeric vector of statistics

contigs <- c("ATGC", "ATGCATGC", "ATGCATGCATGC")
calculate_assembly_stats(contigs)
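
To make the N50 and L50 definitions concrete, here is a minimal base-R sketch. `n50_sketch` and the names of its output elements are hypothetical and are not part of baseq; the statistics actually returned by `calculate_assembly_stats()` may differ.

```r
# Hypothetical helper illustrating N50/L50; not the baseq implementation.
n50_sketch <- function(seqs) {
  lens <- sort(nchar(unlist(seqs)), decreasing = TRUE)  # contig lengths, longest first
  cum <- cumsum(lens)                                   # running total of bases
  idx <- which(cum >= sum(lens) / 2)[1]                 # first contig reaching half the total
  c(N50 = lens[idx], L50 = idx,
    total_length = sum(lens), n_contigs = length(lens))
}

contigs <- c("ATGC", "ATGCATGC", "ATGCATGCATGC")
n50_sketch(contigs)
```

With these three contigs (lengths 12, 8, and 4), the longest contig alone already covers half of the 24 bases, so N50 is 12 and L50 is 1.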
Calculates the net electrical charge of a protein at a given pH.
calculate_charge(s, ph = 7.4)

| s | A character string containing the protein sequence |
|---|---|
| ph | Numeric pH value (default: 7.4) |

Numeric net charge
Calculates Relative Synonymous Codon Usage (RSCU).
calculate_codon_usage(s)

| s | A character string containing the coding DNA sequence |
|---|---|

A dataframe with codon statistics

data(sars_fragment)
calculate_codon_usage(sars_fragment)
Compares two sequences of equal length.
calculate_identity(s1, s2)

| s1 | First sequence |
|---|---|
| s2 | Second sequence |

A list with Identity percentage and Hamming Distance

calculate_identity("ATGC", "ATGG")
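
As an illustration of the two quantities named above, the following base-R sketch counts mismatching positions (the Hamming distance) and converts that count to a percentage identity. `identity_sketch` and its element names are hypothetical; the list returned by `calculate_identity()` may be named differently.

```r
# Hypothetical helper; assumes both inputs are single strings of equal length.
identity_sketch <- function(s1, s2) {
  stopifnot(nchar(s1) == nchar(s2))
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  hamming <- sum(a != b)                            # number of mismatching positions
  list(identity = 100 * (1 - hamming / length(a)),  # percent of matching positions
       hamming = hamming)
}

identity_sketch("ATGC", "ATGG")  # one mismatch in four positions: 75% identity
```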
Calculates the molecular weight of a protein sequence.
calculate_mw(s)

| s | A character string containing the protein sequence |
|---|---|

Numeric molecular weight in Daltons
Estimates the isoelectric point of a protein sequence.
calculate_pi(s)

| s | A character string containing the protein sequence |
|---|---|

Numeric pI value
Calculates the melting temperature of a primer sequence.
calculate_tm(s, salt = 50)

| s | A character string containing the sequence |
|---|---|
| salt | Numeric salt concentration in mM (default: 50) |

Numeric Tm in Celsius
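
The manual does not state which formula `calculate_tm()` uses. As a rough illustration of how salt concentration and GC content enter a Tm estimate, the sketch below applies one widely used salt-adjusted approximation, Tm = 81.5 + 16.6 log10([Na+]) + 0.41 (%GC) - 675 / N, with [Na+] in mol/L and N the primer length. `tm_sketch` is a hypothetical helper and its results need not match baseq.

```r
# Hypothetical helper using a common salt-adjusted approximation;
# this is an assumption, not necessarily the formula baseq applies.
tm_sketch <- function(s, salt = 50) {
  bases <- strsplit(toupper(s), "")[[1]]
  n <- length(bases)
  gc_pct <- 100 * sum(bases %in% c("G", "C")) / n   # percent G + C
  na_molar <- salt / 1000                           # convert mM to mol/L
  81.5 + 16.6 * log10(na_molar) + 0.41 * gc_pct - 675 / n
}

tm_sketch("ATGCATGCATGCATGCATGC")              # 20-mer, 50% GC, 50 mM salt
tm_sketch("ATGCATGCATGCATGCATGC", salt = 100)  # higher salt gives a higher estimate
```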
Cleans all sequences in a FASTA or FASTQ file.
clean_file(input_file, type = "auto", output_dir = "")

| input_file | Path to input file |
|---|---|
| type | Sequence type ("DNA", "RNA", or "auto") |
| output_dir | Optional output directory |

Path to the cleaned file
Removes non-standard characters from DNA or RNA sequences.
clean_seq(sequence, type = "auto")

| sequence | A character string containing the sequence |
|---|---|
| type | A string "DNA", "RNA", or "auto" |

A character string of the cleaned sequence
Returns a frequency table of the bases in a sequence.
count_bases(s)

| s | A character string containing the sequence |
|---|---|

A table object with base counts

data(sars_fragment)
count_bases(sars_fragment)
Counts all possible substrings of length k.
count_kmers(s, k = 3)

| s | A character string containing the sequence |
|---|---|
| k | Integer length of k-mer |

A table of k-mer counts

data(sars_fragment)
count_kmers(sars_fragment, k = 3)
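
To show what counting all substrings of length k involves, here is a minimal base-R sketch that slides a window of width k along the sequence one position at a time and tabulates the results. `kmer_sketch` is a hypothetical helper; it is not the baseq implementation.

```r
# Hypothetical helper: overlapping k-mers via a sliding window of width k.
kmer_sketch <- function(s, k = 3) {
  n <- nchar(s)
  if (n < k) return(table(character(0)))      # sequence too short for any k-mer
  starts <- seq_len(n - k + 1)                # one window per start position
  table(substring(s, starts, starts + k - 1))
}

kmer_sketch("ATGATG", k = 3)  # windows: ATG, TGA, GAT, ATG
```

A sequence of length n yields n - k + 1 overlapping k-mers, so the counts in the table always sum to that number.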
Counts the occurrences of a specific pattern in a sequence.
count_pattern(s, p)

| s | A character string containing the sequence |
|---|---|
| p | A character string containing the pattern to count |

Integer count of occurrences

data(sars_fragment)
count_pattern(sars_fragment, "ATTA")
Translates a DNA sequence into protein in all 6 reading frames.
dna_to_protein(s, table = 1)

| s | A character string containing the DNA sequence |
|---|---|
| table | Integer indicating the NCBI genetic code table (default: 1) |

A list of translated protein sequences
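
Six-frame translation means reading the sequence at offsets 0, 1, and 2 on the forward strand and again on its reverse complement. The base-R sketch below illustrates this with the standard genetic code (NCBI table 1). All helper names and the frame labels `F1` to `R3` are hypothetical; the list structure that `dna_to_protein()` returns may differ.

```r
# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
bases <- c("T", "C", "A", "G")
grid <- expand.grid(third = bases, second = bases, first = bases,
                    stringsAsFactors = FALSE)
code <- strsplit(
  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", "")[[1]]
names(code) <- paste0(grid$first, grid$second, grid$third)

# Translate one reading frame starting `offset` bases into the sequence.
translate_frame <- function(s, offset) {
  n_codons <- (nchar(s) - offset) %/% 3
  if (n_codons < 1) return("")
  starts <- offset + 1 + 3 * (seq_len(n_codons) - 1)
  aa <- code[substring(s, starts, starts + 2)]
  aa[is.na(aa)] <- "X"                      # codons containing non-ACGT characters
  paste(aa, collapse = "")
}

# Reverse complement of a DNA string.
rev_comp_dna <- function(s) {
  paste(rev(strsplit(chartr("ACGT", "TGCA", s), "")[[1]]), collapse = "")
}

# Hypothetical six-frame wrapper: three forward and three reverse frames.
six_frame_sketch <- function(s) {
  s <- toupper(s)
  rc <- rev_comp_dna(s)
  frames <- c(vapply(0:2, function(o) translate_frame(s, o), character(1)),
              vapply(0:2, function(o) translate_frame(rc, o), character(1)))
  as.list(setNames(frames, c("F1", "F2", "F3", "R1", "R2", "R3")))
}

six_frame_sketch("ATGGCCTAA")  # F1 reads ATG GCC TAA, i.e. "MA*"
```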
Transcribes a DNA sequence into RNA.
dna_to_rna(s)

| s | A character string containing the DNA sequence |
|---|---|

A character string of the RNA sequence
Converts a FASTQ file to FASTA format.
fastq_to_fasta(fastq_file)

| fastq_file | Path to input FASTQ |
|---|---|

Path to output FASTA
Filters FASTQ reads based on average quality score.
filter_fastq_quality(input_file, output_file, min_avg_quality = 20, phred_offset = 33)

| input_file | Path to input FASTQ |
|---|---|
| output_file | Path to output FASTQ |
| min_avg_quality | Minimum average Phred score (default: 20) |
| phred_offset | Phred offset (default: 33) |

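
The `phred_offset` and `min_avg_quality` arguments relate to how FASTQ encodes quality: each character's ASCII code minus the offset is that base's Phred score. The base-R sketch below shows this decoding for a single quality string. `mean_phred` and `keep_read` are hypothetical helpers, and using `>=` at the threshold is an assumption about how ties are handled.

```r
# Hypothetical helpers; they operate on one FASTQ quality string at a time.
mean_phred <- function(qual, phred_offset = 33) {
  mean(utf8ToInt(qual) - phred_offset)   # ASCII code minus offset = Phred score
}

keep_read <- function(qual, min_avg_quality = 20, phred_offset = 33) {
  # Assumption: a read exactly at the threshold is kept.
  mean_phred(qual, phred_offset) >= min_avg_quality
}

mean_phred("IIII")  # 'I' is ASCII 73, so each base has Phred 40
keep_read("!!!!")   # '!' is ASCII 33, so each base has Phred 0 and the read is dropped
```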
Identifies candidate CpG islands in a DNA sequence.
find_cpg_islands(s, window = 200)

| s | A character string containing the DNA sequence |
|---|---|
| window | Sliding window size (default: 200) |

A dataframe with start and end positions
Scans a DNA sequence in all 6 reading frames to find the longest open reading frame.
find_longest_orf(s)

| s | A character string containing the DNA sequence |
|---|---|

A character string of the longest translated protein sequence
Calculates the percentage of G and C bases in a DNA sequence.
gc_content(s)

| s | A character string containing the sequence |
|---|---|

Numeric percentage of GC content

data(sars_fragment)
gc_content(sars_fragment)
Returns a mapping of codons to amino acids.
get_genetic_code(table = 1)

| table | Integer NCBI genetic code table index |
|---|---|

A named character vector
Visualizes the amino acid composition categorized by biochemical properties.
plot_aa_composition(s)

| s | A character string containing the protein sequence |
|---|---|

A ggplot object

prot <- "MKFLVLALAL"
plot_aa_composition(prot)
Generates a dot plot comparison of two sequences.
plot_dotplot(s1, s2, window = 1)

| s1 | First sequence |
|---|---|
| s2 | Second sequence |
| window | Integer word size for matching (default: 1) |

A ggplot object

s1 <- "ATGCATGCATGC"
s2 <- "ATGCGTGCATGC"
plot_dotplot(s1, s2, window = 3)
Generates a sliding window plot of GC skew (G-C)/(G+C).
plot_gc_skew(s, window = 100)

| s | A character string containing the DNA sequence |
|---|---|
| window | Integer window size (default: 100) |

A ggplot object

data(sars_fragment)
plot_gc_skew(sars_fragment, window = 100)
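
The values behind such a plot can be computed without any plotting code. The base-R sketch below evaluates (G-C)/(G+C) in each window and returns a data frame of window starts and skew values. `gc_skew_sketch` and its `step` argument are hypothetical; the manual does not say how far `plot_gc_skew()` advances the window between points.

```r
# Hypothetical helper: per-window GC skew, without plotting.
gc_skew_sketch <- function(s, window = 100, step = 1) {
  bases <- strsplit(toupper(s), "")[[1]]
  if (length(bases) < window) {
    return(data.frame(start = integer(0), skew = numeric(0)))
  }
  starts <- seq(1, length(bases) - window + 1, by = step)
  skew <- vapply(starts, function(i) {
    w <- bases[i:(i + window - 1)]
    g <- sum(w == "G")
    cc <- sum(w == "C")
    if (g + cc == 0) NA_real_ else (g - cc) / (g + cc)  # undefined when no G or C
  }, numeric(1))
  data.frame(start = starts, skew = skew)
}

gc_skew_sketch("GGGGCCCC", window = 4, step = 2)
# windows GGGG, GGCC, CCCC give skews 1, 0, -1
```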
Generates a sliding window plot of protein hydrophobicity using the Kyte-Doolittle scale.
plot_hydrophobicity(s, window = 9)

| s | A character string containing the protein sequence |
|---|---|
| window | Integer window size (default: 9) |

A ggplot object

prot <- "MKFLVLALAL"
plot_hydrophobicity(prot, window = 3)
Reads a FASTA or FASTQ file and returns it as a dataframe or list.
read_seq(file, format = "df")

| file | Path to the input sequence file |
|---|---|
| format | A string indicating "df" (dataframe) or "list" (default: "df") |

A dataframe or list of the sequence data.
Generates the reverse complement of a DNA or RNA sequence.
rev_comp(sequence)

| sequence | A character string containing the sequence |
|---|---|

A character string of the reverse complement
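
A reverse complement is obtained by swapping each base for its partner and then reversing the string. The base-R sketch below does this for both alphabets. `rev_comp_sketch` is a hypothetical helper, and treating any sequence that contains `U` as RNA is an assumption about type detection, not documented baseq behavior.

```r
# Hypothetical helper; assumes any U in the input means RNA.
rev_comp_sketch <- function(sequence) {
  s <- toupper(sequence)
  is_rna <- grepl("U", s, fixed = TRUE)
  comp <- if (is_rna) chartr("ACGU", "UGCA", s) else chartr("ACGT", "TGCA", s)
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")  # reverse the complemented string
}

rev_comp_sketch("ATGC")  # "GCAT"
rev_comp_sketch("AUGC")  # "GCAU"
```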
Converts a protein sequence back into DNA using common codons.
reverse_translate(s)

| s | A character string containing the protein sequence |
|---|---|

A character string of the resulting DNA sequence
Reverse transcribes an RNA sequence into DNA.
rna_to_dna(s)

| s | A character string containing the RNA sequence |
|---|---|

A character string of the DNA sequence
Translates an RNA sequence into protein in all 6 reading frames.
rna_to_protein(s, table = 1)

| s | A character string containing the RNA sequence |
|---|---|
| table | Integer indicating the NCBI genetic code table (default: 1) |

A list of translated protein sequences
A small fragment of the SARS-CoV-2 genome used for examples and testing.
sars_fragment
A character string.
NCBI GenBank
Finds all occurrences of a motif in a sequence.
search_motif(s, p)

| s | A character string containing the sequence |
|---|---|
| p | A character string containing the motif (regex) |

A dataframe with the Start, End, and Match string
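
Because the motif is a regular expression, positions can be recovered with base R's `gregexpr()`. The sketch below builds a data frame with the same three columns named above. `motif_sketch` is a hypothetical helper; note that `gregexpr()` reports non-overlapping matches only, and whether `search_motif()` also reports overlapping ones is not stated here.

```r
# Hypothetical helper built on gregexpr(); finds non-overlapping matches.
motif_sketch <- function(s, p) {
  m <- gregexpr(p, s, perl = TRUE)[[1]]
  if (m[1] == -1) {                          # gregexpr() returns -1 when nothing matches
    return(data.frame(Start = integer(0), End = integer(0), Match = character(0)))
  }
  starts <- as.integer(m)
  ends <- starts + attr(m, "match.length") - 1L
  data.frame(Start = starts, End = ends, Match = substring(s, starts, ends))
}

motif_sketch("ATGAATTCGGGAATTC", "GAATTC")  # matches at 3-8 and 11-16
```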
Randomly permutes the characters of a sequence.
shuffle_sequence(s)

| s | A character string containing the sequence |
|---|---|

A character string of the shuffled sequence
Simulates restriction enzyme digestion.
simulate_digestion(s, p)

| s | A character string containing the DNA sequence |
|---|---|
| p | A character string containing the restriction site (regex) |

A numeric vector of fragment lengths
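
Fragment lengths follow from the positions of the recognition sites once a cut position within each site is chosen. The manual does not say where `simulate_digestion()` places the cut, so the base-R sketch below makes it explicit through a hypothetical `cut_offset` argument (the number of bases of the site left on the upstream fragment). `digest_sketch` is likewise hypothetical and assumes a linear molecule.

```r
# Hypothetical helper for a linear molecule; cut_offset is an assumed parameter.
digest_sketch <- function(s, p, cut_offset = 0) {
  m <- gregexpr(p, s, perl = TRUE)[[1]]
  if (m[1] == -1) return(nchar(s))           # no site: one uncut fragment
  cuts <- as.integer(m) - 1L + cut_offset    # bases to the left of each cut
  cuts <- cuts[cuts > 0 & cuts < nchar(s)]   # ignore cuts at the very ends
  diff(c(0, cuts, nchar(s)))                 # distances between consecutive cuts
}

# EcoRI recognizes GAATTC and cuts after the G, hence cut_offset = 1.
digest_sketch("ATGAATTCGGGAATTC", "GAATTC", cut_offset = 1)  # fragments of 3, 8, 5
```

The fragment lengths always sum to the length of the input sequence, which is a convenient sanity check.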
Generates a dummy FASTA dataset.
simulate_fasta(n_seq = 5, seq_len = 100, gc = NULL, type = "DNA", file = NULL)

| n_seq | Number of sequences |
|---|---|
| seq_len | Length of each sequence |
| gc | Target GC content |
| type | "DNA" or "RNA" |
| file | Optional file path to save |

A dataframe of simulated sequences
Generates a dummy FASTQ dataset.
simulate_fastq(n_reads = 5, read_len = 100, gc = NULL, type = "DNA", file = NULL)

| n_reads | Number of reads |
|---|---|
| read_len | Length of each read |
| gc | Target GC content |
| type | "DNA" or "RNA" |
| file | Optional file path to save |

A dataframe of simulated reads
Simulates a PCR reaction and predicts amplicon sizes.
simulate_pcr(template, fwd, rev_p)

| template | A character string containing the DNA template |
|---|---|
| fwd | A character string of the forward primer |
| rev_p | A character string of the reverse primer |

A numeric vector of amplicon sizes
Generates a random DNA or RNA sequence.
simulate_sequence(len, gc = NULL, type = "DNA")

| len | Integer length of the sequence |
|---|---|
| gc | Numeric target GC content (0 to 1) |
| type | "DNA" or "RNA" |

A character string of the simulated sequence
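
One simple way to honor a target GC content is to sample each base independently, giving G and C a combined probability of `gc`. The base-R sketch below does this. `simulate_sketch` is a hypothetical helper, and defaulting to equal base frequencies when `gc` is `NULL` is an assumption rather than documented baseq behavior.

```r
# Hypothetical helper: independent sampling with GC-weighted probabilities.
simulate_sketch <- function(len, gc = NULL, type = "DNA") {
  alphabet <- if (type == "RNA") c("A", "U", "G", "C") else c("A", "T", "G", "C")
  if (is.null(gc)) gc <- 0.5                 # assumption: uniform bases by default
  probs <- c((1 - gc) / 2, (1 - gc) / 2, gc / 2, gc / 2)
  paste(sample(alphabet, len, replace = TRUE, prob = probs), collapse = "")
}

set.seed(1)                     # for a reproducible draw
simulate_sketch(20, gc = 0.6)
simulate_sketch(20, type = "RNA")
```

With independent sampling, the realized GC fraction only approximates the target and fluctuates more for short sequences.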
Generates a comprehensive summary of a multi-FASTA file.
summarize_fasta(file)

| file | Path to the FASTA file |
|---|---|

A summary dataframe

# summarize_fasta("path/to/my.fasta")
Generic function to translate DNA or RNA to protein.
translate(x, ...)

| x | A baseq_dna or baseq_rna object |
|---|---|
| ... | Additional arguments |

A list of translated sequences
Writes a sequence object (dataframe or list) to a FASTA or FASTQ file.
write_seq(x, file)

| x | A sequence object (dataframe or list) |
|---|---|
| file | Path to the output sequence file |

Invisible TRUE