duplicate_count_colpair() takes a data frame and checks each pair of
columns for duplicates, i.e., values that appear in both columns. Results
are presented in a tibble, sorted by the number of duplicates in
descending order.
Value
A tibble (data frame) with these columns:
- x and y: Each line contains a unique combination of data's columns, stored in the x and y output columns.
- count: Number of "duplicates", i.e., values that are present in both x and y.
- total_x, total_y, rate_x, and rate_y (added by default): total_x is the number of non-missing values in the column named under x. Also, rate_x is the proportion of x values that are duplicated in y, i.e., count / total_x. Likewise with total_y and rate_y. The two rate_* columns will be equal unless NA values are present.
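To make the count and rate definitions concrete, here is a minimal base R sketch that recomputes them by hand for a single column pair, gear and carb from mtcars. It assumes that count is the number of non-missing x values that also occur in y. That reading is inferred from the definitions above, not taken from the function's source, and the variable names are illustrative only.

```r
# Sketch only: recompute the documented columns for one column pair.
# Assumption: "count" is the number of non-missing x values found in y.
x <- mtcars$gear
y <- mtcars$carb

x_known <- x[!is.na(x)]            # drop missing values before comparing
count   <- sum(x_known %in% y)     # x values that also appear in y
total_x <- length(x_known)         # non-missing values in x
rate_x  <- count / total_x         # proportion of x values duplicated in y

c(count = count, total_x = total_x, rate_x = rate_x)
```

Under that assumption, the result agrees with the gear/carb row in the Examples output (count 27, total_x 32, rate_x 0.844).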
Summaries with audit()
There is an S3 method for audit(),
so you can call audit() following duplicate_count_colpair(). It
returns a tibble with summary statistics.
See also
- duplicate_count() for a frequency table.
- duplicate_tally() to show instances of a value next to each instance.
- janitor::get_dupes() to search for duplicate rows.
- corrr::colpair_map(), a versatile tool for pairwise column analysis, which the present function wraps.
Examples
# Basic usage:
mtcars %>%
  duplicate_count_colpair()
#> # A tibble: 55 × 7
#>    x     y     count total_x total_y rate_x rate_y
#>    <chr> <chr> <int>   <int>   <int>  <dbl>  <dbl>
#>  1 cyl   carb     32      32      32 1      1
#>  2 vs    am       32      32      32 1      1
#>  3 gear  carb     27      32      32 0.844  0.844
#>  4 vs    carb     14      32      32 0.438  0.438
#>  5 am    carb     13      32      32 0.406  0.406
#>  6 cyl   gear     11      32      32 0.344  0.344
#>  7 drat  wt        3      32      32 0.0938 0.0938
#>  8 mpg   qsec      2      32      32 0.0625 0.0625
#>  9 drat  gear      1      32      32 0.0312 0.0312
#> 10 drat  carb      1      32      32 0.0312 0.0312
#> # ℹ 45 more rows
# Summaries with `audit()`:
mtcars %>%
  duplicate_count_colpair() %>%
  audit()
#> # A tibble: 5 × 8
#>   term       mean    sd median   min   max na_count na_rate
#>   <chr>     <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>   <dbl>
#> 1 count    2.47   7.38       0     0    32        0       0
#> 2 total_x 32      0         32    32    32        0       0
#> 3 total_y 32      0         32    32    32        0       0
#> 4 rate_x   0.0773 0.231      0     0     1        0       0
#> 5 rate_y   0.0773 0.231      0     0     1        0       0