Title: | Fixing Data Made Easy for Statistical Analysis |
---|---|
Description: | Primarily created as an easy way to do basic data manipulations for statistical analysis. |
Authors: | Ambu Vijayan [aut, cre] |
Maintainer: | Ambu Vijayan <[email protected]> |
License: | GPL-3 |
Version: | 0.1.0 |
Built: | 2025-02-14 20:19:56 UTC |
Source: | https://github.com/ambuvjyn/fixr |
This function compares the column names and number of rows in two data frames and returns a message indicating whether the data is consistent or not.
check_data_consistency(df1, df2)
df1 | First data frame to compare |
df2 | Second data frame to compare |
A message indicating whether the data is consistent or not.
df1 <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
df2 <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
check_data_consistency(df1, df2)
# Data is consistent across the two sources.
df3 <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
check_data_consistency(df1, df3)
# Data is not consistent across the two sources.
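The comparison described above can be approximated in base R. This is a minimal sketch, not the package's implementation; the helper name `same_shape` is hypothetical, and it assumes "consistent" means identical column names (in the same order) and equal row counts.

```r
# Hypothetical helper, not part of fixr: TRUE when both data frames
# have identical column names (same order) and the same number of rows.
same_shape <- function(df1, df2) {
  identical(names(df1), names(df2)) && nrow(df1) == nrow(df2)
}

df1 <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
df2 <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
df3 <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

same_shape(df1, df2)  # same names and row count
same_shape(df1, df3)  # different column names
```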
This function checks if the data is normally distributed for each numeric column in a data frame.
check_data_distribution(df)
df | A data frame |
This function does not return a value; it only prints messages to the console.
df <- data.frame(x = c("a", "b", "c"), y = c(4, 5, 6), z = c(7, 8, 9))
check_data_distribution(df)
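The manual does not say which normality test the function applies. As an illustration only, the sketch below runs the Shapiro-Wilk test from base R's stats package on each numeric column; the helper name `normality_p_values` is hypothetical, and `shapiro.test()` requires between 3 and 5000 non-missing values that are not all identical.

```r
# Hypothetical helper, not part of fixr: Shapiro-Wilk p-value for each
# numeric column. Non-numeric columns are skipped.
normality_p_values <- function(df) {
  numeric_cols <- df[vapply(df, is.numeric, logical(1))]
  vapply(numeric_cols, function(col) shapiro.test(col)$p.value, numeric(1))
}

df <- data.frame(
  x = c("a", "b", "c", "d", "e"),
  y = c(4.1, 5.3, 6.2, 5.8, 4.9),
  z = c(1, 1, 2, 50, 400)
)
p <- normality_p_values(df)
p  # one p-value per numeric column; small values suggest non-normality
```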
This function performs a series of data quality checks on a given dataframe, including checking the data structure, missing values, data accuracy, negative values, outliers, sample size, duplicate rows, and duplicate columns.
check_data_quality(df)
df | A dataframe. |
A message indicating the results of each data quality check.
df <- data.frame(w = c(7, 8, 180, 7), x = c("a", "b", "c", "a"), y = c(4, NA, -6, 4), z = c(7, 8, 180, 7))
# Check the data quality of the example dataframe
check_data_quality(df)
This function checks for inter-rater or test-retest reliability between all pairs of numeric columns in a data frame by computing the correlation between each pair and reporting if it is less than 0.8.
check_data_reliability(df)
df | A data frame |
A message indicating whether the data is reliable or not between each pair of columns.
df <- data.frame(x = c("a", "b", "c"), y = c(4, 5, 6), z = c(7, 8, 180))
check_data_reliability(df)
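The pairwise check described above can be sketched with `cor()`. This is an assumption-laden illustration rather than the package's code: the helper name `low_correlation_pairs` is hypothetical, Pearson correlation is assumed, and the 0.8 cutoff is taken from the description.

```r
# Hypothetical helper, not part of fixr: list every pair of numeric
# columns whose Pearson correlation is below the threshold.
low_correlation_pairs <- function(df, threshold = 0.8) {
  numeric_cols <- df[vapply(df, is.numeric, logical(1))]
  cors <- cor(numeric_cols, use = "pairwise.complete.obs")
  # Look only at the upper triangle so each pair is reported once.
  idx <- which(upper.tri(cors) & cors < threshold, arr.ind = TRUE)
  data.frame(
    col1 = rownames(cors)[idx[, "row"]],
    col2 = colnames(cors)[idx[, "col"]],
    r = cors[idx],
    stringsAsFactors = FALSE
  )
}

df <- data.frame(a = c(1, 2, 3, 4), b = c(2, 4, 6, 8.5), c = c(4, 1, 3, 2))
res <- low_correlation_pairs(df)
res  # pairs involving column c fall below the threshold
```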
This function checks the structure of the given data frame and prints the number of rows, number of columns, column names, column data types, and number of missing values.
check_data_structure(df)
df | The data frame to be checked. |
None. The results are printed to the console.
df <- data.frame(
  id = 1:10,
  gender = c("male", "female", "male", "male", "male", "male", "male", "male", "female", "female"),
  age = c(25, 32, 45, 19, 27, 56, 38, 42, 33, NA),
  salary = c(50000, 60000, 75000, 45000, 55000, 90000, NA, 80000, 65000, 70000)
)
# Check the data structure of the example dataframe
check_data_structure(df)
This function checks if a data frame contains negative values and returns their indices if any are found.
check_for_negative_values(df)
df | The data frame to check for negative values. |
If negative values are found, the function returns their indices as an array index object. If no negative values are found, NULL is returned.
df <- data.frame(a = c(1, 2, 3), b = c(-1, 0, 1))
check_for_negative_values(df)
# [1] "Data frame contains negative values."
#      row col
# [1,]   1   2
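An "array index object" of this kind is what base R's `which(..., arr.ind = TRUE)` produces. The sketch below shows one way to obtain it; the helper name `negative_positions` is hypothetical and this is not necessarily how the package does it. Note that it restricts the search to numeric columns, so the reported column numbers count numeric columns only.

```r
# Hypothetical helper, not part of fixr: row/column positions of negative
# values among the numeric columns, or NULL when there are none.
negative_positions <- function(df) {
  numeric_cols <- df[vapply(df, is.numeric, logical(1))]
  hits <- which(as.matrix(numeric_cols) < 0, arr.ind = TRUE)
  if (nrow(hits) == 0) return(NULL)
  hits
}

df <- data.frame(a = c(1, 2, 3), b = c(-1, 0, 1))
hits <- negative_positions(df)
hits  # the only negative value sits in row 1 of the second column
```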
This function checks for missing values in a data frame and prints out the names of the columns with missing values and their counts.
check_missing_values(df)
df | A data frame to check for missing values. |
A message indicating if missing values were found or not.
df <- data.frame(w = c(7, 8, 180, 7), x = c("a", "b", "c", "a"), y = c(4, 5, -6, 4), z = c(7, 8, NA, 7))
check_missing_values(df)
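Per-column missing-value counts like those described above can be computed in base R with `colSums(is.na(df))`. The helper name `missing_counts` is hypothetical; this is a sketch of the idea, not the package's code.

```r
# Hypothetical helper, not part of fixr: named vector of NA counts,
# restricted to columns that contain at least one NA.
missing_counts <- function(df) {
  counts <- colSums(is.na(df))
  counts[counts > 0]
}

df <- data.frame(w = c(7, 8, 180, 7), x = c("a", "b", "c", "a"),
                 y = c(4, 5, -6, 4), z = c(7, 8, NA, 7))
res <- missing_counts(df)
res  # only column z has a missing value
```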
This function checks for outliers or extreme values in a given dataframe.
check_outliers(df)
df | A dataframe. |
A message indicating whether or not extreme values were found.
df <- data.frame(w = c(7, 8, 180, 7), x = c("a", "b", "c", "a"), y = c(4, 5, -6, 4), z = c(7, 8, NA, 7))
check_outliers(df)
This function checks if the sample size of a data frame is adequate for statistical analysis.
check_sample_size(df)
df | A data frame to be checked |
A message indicating if the sample size is adequate or too small
df <- data.frame(w = c(7, 8, 180, 7), x = c("a", "b", "c", "a"), y = c(4, 5, -6, 4), z = c(7, 8, 18, 7))
check_sample_size(df)
This function takes a data frame as input and checks for duplicate columns. A column is considered a duplicate of another column if all values in both columns are the same. If any duplicate columns are found, the function prints a message indicating which columns are duplicates of which other columns. If no duplicate columns are found, the function prints a message indicating that no duplicates were found.
find_duplicate_cols(df)
df | A data frame |
A message indicating which columns are duplicates of which other columns
df <- data.frame(w = c(7, 8, 180, 7), x = c("a", "b", "c", "a"), y = c(4, NA, -6, 4), z = c(7, 8, 180, 7))
find_duplicate_cols(df)
# Column 'z' is a duplicate of column 'w'
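Under the definition above (two columns are duplicates when all their values match), base R's `duplicated()` applied to the list of columns flags every column that repeats an earlier one. This is a sketch under that assumption; `duplicate_col_names` is a hypothetical name, not a fixr function.

```r
# Hypothetical helper, not part of fixr: names of columns whose values
# exactly repeat those of an earlier column.
duplicate_col_names <- function(df) {
  names(df)[duplicated(as.list(df))]
}

df <- data.frame(w = c(7, 8, 180, 7), x = c("a", "b", "c", "a"),
                 y = c(4, NA, -6, 4), z = c(7, 8, 180, 7))
dups <- duplicate_col_names(df)
dups  # z repeats w
```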
This function identifies and reports duplicate rows in a given data frame. It first removes any rows in which every cell is empty, and then compares each row to the rows that follow it. Two rows are duplicates when they have the same values in all columns. The function returns a message stating whether or not duplicate rows were found, and if so, the row numbers of the duplicate and original rows.
find_duplicate_rows(df)
df | A data frame to check for duplicate rows. |
A message stating whether or not duplicate rows were found, and if so, the row numbers of the duplicate and original rows.
# Create example data frame
df <- data.frame(w = c(7, 8, 180, 7), x = c("a", "b", "c", "a"), y = c(4, 5, -6, 4), z = c(7, 8, NA, 7))
# Find duplicate rows
find_duplicate_rows(df)
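For comparison, base R's `duplicated()` performs a similar row-wise check: it marks each row that repeats an earlier row. The sketch below is not the package's implementation, and it skips the step that drops all-empty rows.

```r
# Base-R sketch: positions of rows that repeat an earlier row.
df <- data.frame(w = c(7, 8, 180, 7), x = c("a", "b", "c", "a"),
                 y = c(4, 5, -6, 4), z = c(7, 8, NA, 7))

dup_rows <- which(duplicated(df))
dup_rows                      # row 4 repeats row 1

# Dropping the repeats keeps the first occurrence of each row.
deduplicated <- df[!duplicated(df), , drop = FALSE]
nrow(deduplicated)
```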
This function searches the CRAN repository for R packages that can be used to import a given file format.
find.packages(file_extension)
file_extension | The file extension for the file format to search for packages to import |
A character vector of package names that can be used to import the file format
This function takes a file path as input and searches the CRAN repository for R packages that can import the file format.
find.packages_path(file_path)
file_path | A character string specifying the file path of the file to be imported. |
A character string that lists the R packages that can be used to import the file format of the input file.
# Search for packages that can import a CSV file
find.packages_path("sample.csv")
# Search for packages that can import a JSON file
find.packages_path("sample.json")
This function replaces all empty string values ("") in a given data frame with NA values.
fix_blanks_with_na(df)
df | A data frame to be processed. |
The data frame with empty string values replaced with NAs.
df <- data.frame(x = c("", "foo", ""), y = c("", "", "bar"), z = c(1, 2, 3))
fix_blanks_with_na(df)
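The replacement described above can be sketched column by column in base R. The helper name `blanks_to_na` is hypothetical and this is not the package's code; the sketch only touches character columns and leaves existing NA values alone.

```r
# Hypothetical helper, not part of fixr: turn "" into NA in every
# character column and leave all other columns unchanged.
blanks_to_na <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.character(col)) col[!is.na(col) & col == ""] <- NA
    col
  })
  df
}

df <- data.frame(x = c("", "foo", ""), y = c("", "", "bar"), z = c(1, 2, 3),
                 stringsAsFactors = FALSE)
res <- blanks_to_na(df)
res
```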
This function takes a data frame as an argument and replaces all spaces in the column names with underscores.
fix_col_spaces(df)
df | A data frame |
A modified data frame with spaces in column names replaced by underscores.
# check.names = FALSE keeps the spaces; by default data.frame() would convert them to dots
my_data <- data.frame("Column Name 1" = c(1, 2, 3), "Column Name 2" = c(4, 5, 6), check.names = FALSE)
fix_col_spaces(my_data)
# Returns a data frame with column names where spaces are replaced by underscores.
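The renaming itself amounts to a single `gsub()` over the column names. This is a base-R sketch with a hypothetical helper name, `underscore_col_names`, not the package's implementation. The example passes `check.names = FALSE` because `data.frame()` would otherwise replace the spaces with dots before the helper ever sees them.

```r
# Hypothetical helper, not part of fixr: replace every space in the
# column names with an underscore.
underscore_col_names <- function(df) {
  names(df) <- gsub(" ", "_", names(df), fixed = TRUE)
  df
}

my_data <- data.frame("Column Name 1" = c(1, 2, 3),
                      "Column Name 2" = c(4, 5, 6),
                      check.names = FALSE)
new_names <- names(underscore_col_names(my_data))
new_names
```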
This function removes "X." or "X" from the beginning of column names and replaces any "." with "_". It also removes leading/trailing symbols and spaces, and ensures that there is only one underscore between two words. If there are duplicate column names, it appends a number to each duplicate column name to make it unique.
fix_column_names(data)
data | A data frame with improperly formatted column names. |
The modified data frame with fixed column names.
my_data <- data.frame(" Col1" = c(1, 2, 3), "Col.2" = c(4, 5, 6), check.names = FALSE)
fix_column_names(my_data)
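The cleaning steps listed in the description can be read as a short chain of regular-expression substitutions. The sketch below follows that reading. The helper name `clean_names`, the order of the steps, and the use of `make.unique()` for duplicates are all assumptions, so the package's actual output may differ.

```r
# Hypothetical helper, not part of fixr: apply the described cleaning
# steps to a character vector of names.
clean_names <- function(x) {
  x <- sub("^X\\.?", "", x)                           # drop a leading "X." or "X"
  x <- gsub(".", "_", x, fixed = TRUE)                # replace "." with "_"
  x <- gsub("^[^[:alnum:]]+|[^[:alnum:]]+$", "", x)   # trim leading/trailing symbols and spaces
  x <- gsub("_+", "_", x)                             # collapse repeated underscores
  make.unique(x, sep = "_")                           # number later duplicates
}

cleaned <- clean_names(c(" Col1", "Col.2", "X.Col..2", "X3rd"))
cleaned
```

Stripping a leading "X" unconditionally also shortens legitimate names that begin with "X", which is worth keeping in mind when applying this kind of rule.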
This function fixes the row and column names of a data frame by removing leading and trailing spaces, replacing spaces with underscores, and modifying duplicate names.
fix_data_names(df)
df | A data frame to be fixed |
A fixed data frame with modified row and column names
my_data <- data.frame(" Col1" = c(1, 2, 3), "Col.2" = c(4, 5, 6), check.names = FALSE)
rownames(my_data) <- c(" Row1", " Row.2", "Row.3 ")
fix_data_names(my_data)
This function removes duplicate columns from a data frame.
fix_duplicate_cols(df)
df | A data frame |
A data frame with duplicate columns removed
df <- data.frame(a = c(1, 1, 2), b = c(2, 2, 3))
fix_duplicate_cols(df)
This function removes duplicate rows from a data frame.
fix_duplicate_rows(df)
df | A data frame |
A data frame with duplicate rows removed
df <- data.frame(a = c(1, 1, 2), b = c(2, 2, 3))
fix_duplicate_rows(df)
This function imputes missing values in alphanumeric columns of a data frame. If a column is numeric, missing values are imputed with the column mean. Otherwise, missing values are imputed with the column mode (most common value).
fix_missing_alphanumeric_values(df)
df | A data frame with missing values. |
A data frame with imputed missing values.
df <- data.frame(w = c(7, 8, 180, 7), x = c("a", "b", "c", NA), y = c(4, 5, -6, 4), z = c(7, 8, NA, 7))
fix_missing_alphanumeric_values(df)
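The mean-or-mode rule described above might look like the following in base R. This is a sketch, not the package's code: the helper name `impute_mean_or_mode` is hypothetical, ties for the mode are broken by taking the first value in sorted order, and columns that are entirely NA are left untouched.

```r
# Hypothetical helper, not part of fixr: fill NAs with the column mean
# (numeric columns) or the most frequent value (all other columns).
impute_mean_or_mode <- function(df) {
  df[] <- lapply(df, function(col) {
    if (!anyNA(col) || all(is.na(col))) return(col)
    if (is.numeric(col)) {
      col[is.na(col)] <- mean(col, na.rm = TRUE)
    } else {
      counts <- table(col)  # table() ignores NA by default
      col[is.na(col)] <- names(counts)[which.max(counts)]
    }
    col
  })
  df
}

df <- data.frame(w = c(7, 8, 180, 7), x = c("a", "b", "a", NA),
                 z = c(7, 8, NA, 7), stringsAsFactors = FALSE)
res <- impute_mean_or_mode(df)
res  # x[4] becomes "a"; z[3] becomes the mean of 7, 8 and 7
```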
A function to fill missing values in numeric columns of a data frame with the mean of the column.
fix_missing_numeric_values(df)
df | A data frame with missing values. |
A data frame with missing numeric values filled with the column mean.
df <- data.frame(w = c(7, 8, 180, 7), x = c("a", "b", "c", "d"), y = c(4, 5, -6, 4), z = c(7, 8, NA, 7))
fix_missing_numeric_values(df)
This function removes outlier rows from a data frame by identifying rows with values that are more than 2 standard deviations away from the mean in any column.
fix_outliers(df)
df | A data frame to clean |
A cleaned data frame with outlier rows removed
df <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), y = c(1, 1, 1, 1, 1, 1, 1, 100, 1, 1))
fix_outliers(df)
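The two-standard-deviation rule can be sketched in base R as follows. The helper name `drop_outlier_rows` and the parameter `k` are hypothetical, the sketch only looks at numeric columns, and it assumes the data frame has more than one row; the package's own handling of edge cases is not documented here.

```r
# Hypothetical helper, not part of fixr: drop rows where any numeric
# column lies more than k standard deviations from that column's mean.
drop_outlier_rows <- function(df, k = 2) {
  numeric_cols <- df[vapply(df, is.numeric, logical(1))]
  is_outlier <- vapply(numeric_cols, function(col) {
    abs(col - mean(col, na.rm = TRUE)) > k * sd(col, na.rm = TRUE)
  }, logical(nrow(df)))
  keep <- rowSums(is_outlier, na.rm = TRUE) == 0
  df[keep, , drop = FALSE]
}

df <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                 y = c(1, 1, 1, 1, 1, 1, 1, 100, 1, 1))
cleaned <- drop_outlier_rows(df)
cleaned  # the row with y = 100 is removed
```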
This function removes any leading "X." or "X" from the row names of a data frame, replaces any "." with "_", removes any leading or trailing symbols and spaces, and ensures that there is only one underscore between two words. Additionally, if there are duplicate row names, the function appends a number to each duplicate row name to make it unique.
fix_row_names(data)
data | A data frame with improperly formatted row names |
A modified data frame with fixed row names.
my_data <- data.frame(" Col1" = c(1, 2, 3), "Col.2" = c(4, 5, 6), check.names = FALSE)
rownames(my_data) <- c(" Row1", " Row.2", "Row.3 ")
fix_row_names(my_data)
This function takes a data frame as an argument and replaces all spaces in the row names with underscores.
fix_row_spaces(df)
df | A data frame |
A modified data frame with spaces in row names replaced by underscores.
my_data <- data.frame("Column Name 1" = c(1, 2, 3), "Column Name 2" = c(4, 5, 6))
rownames(my_data) <- c("Row Name 1", "Row Name 2", "Row Name 3")
fix_row_spaces(my_data)
# Returns a data frame with row names where spaces are replaced by underscores.
This function removes non-alphanumeric characters from all non-numeric columns in a data frame and returns the modified data frame.
fix_special_characters_in_data(df)
df | A data frame. |
A modified data frame where all non-numeric columns have had non-alphanumeric characters removed.
df <- data.frame(a = c("A*B", "C&D"), b = c("1.2", "3.4"))
fix_special_characters_in_data(df)
# Output:
#    a   b
# 1 AB 1.2
# 2 CD 3.4
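A literal base-R reading of the description is a `gsub()` with the `[^[:alnum:]]` character class applied to each character column. The sketch below is not the package's code, and `strip_special` is a hypothetical name. Under this literal reading, a character column that stores numbers as text, such as "1.2", would also lose its decimal point, so the example keeps the numeric column as a true numeric.

```r
# Hypothetical helper, not part of fixr: remove every non-alphanumeric
# character from character columns and leave other columns unchanged.
strip_special <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.character(col)) gsub("[^[:alnum:]]", "", col) else col
  })
  df
}

df <- data.frame(a = c("A*B", "C&D"), n = c(1.2, 3.4), stringsAsFactors = FALSE)
res <- strip_special(df)
res
```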
This function removes any non-alphanumeric characters from both the row and column names of a given data frame.
fix_special_characters_in_names(df)
df | A data frame with non-alphanumeric characters in the column or row names. |
A data frame with all non-alphanumeric characters removed from the column and row names.
# check.names = FALSE keeps "!" and "?" in the column names; by default data.frame() would convert them to dots
df <- data.frame("Col1!" = c(1, 2, 3), "Col2?" = c(4, 5, 6), check.names = FALSE)
rownames(df) <- c("Row1@", "Row2#", "Row3$")
fix_special_characters_in_names(df)
This function applies several data cleaning functions from the fixr package to a given data frame. The fix_data_names, remove_spaces, remove_symbols_data, and replace_blanks_with_na functions are used to add "X_" before column and row names that start with a number, remove leading/trailing spaces, remove non-alphanumeric characters from the data, replace spaces with underscores in column and row names, and replace empty string values with NAs.
fix.data(df)
df | A data frame to be processed. |
The cleaned data frame.
df <- data.frame(" 1st col " = c("", "foo", ""), "2nd col" = c(" ", " ", "bar"), "3rd col" = c(1, 2, 3))
fix.data(df)