duplicate_count_colpair() takes a data frame and checks each combination of
columns for duplicates. Results are presented in a tibble, ordered by the
number of duplicates.
Value
A tibble (data frame) with these columns:
x and y: Each line contains a unique combination of data's columns, stored in the x and y output columns.
count: Number of "duplicates", i.e., values that are present in both x and y.
total_x, total_y, rate_x, and rate_y (added by default): total_x is the number of non-missing values in the column named under x. Also, rate_x is the proportion of x values that are duplicated in y, i.e., count / total_x. Likewise with total_y and rate_y. The two rate_* columns will be equal unless NA values are present.
Summaries with audit()
There is an S3 method for audit(), so you can call audit() following
duplicate_count_colpair(). It returns a tibble with summary statistics.
See also
duplicate_count() for a frequency table.
duplicate_tally() to show instances of a value next to each instance.
janitor::get_dupes() to search for duplicate rows.
corrr::colpair_map(), a versatile tool for pairwise column analysis which the present function wraps.
Examples
# Basic usage:
mtcars %>%
  duplicate_count_colpair()
#> # A tibble: 55 × 7
#>    x     y     count total_x total_y rate_x rate_y
#>    <chr> <chr> <int>   <int>   <int>  <dbl>  <dbl>
#>  1 cyl   carb     32      32      32 1      1
#>  2 vs    am       32      32      32 1      1
#>  3 gear  carb     27      32      32 0.844  0.844
#>  4 vs    carb     14      32      32 0.438  0.438
#>  5 am    carb     13      32      32 0.406  0.406
#>  6 cyl   gear     11      32      32 0.344  0.344
#>  7 drat  wt        3      32      32 0.0938 0.0938
#>  8 mpg   qsec      2      32      32 0.0625 0.0625
#>  9 drat  gear      1      32      32 0.0312 0.0312
#> 10 drat  carb      1      32      32 0.0312 0.0312
#> # ℹ 45 more rows
# Summaries with `audit()`:
mtcars %>%
  duplicate_count_colpair() %>%
  audit()
#> # A tibble: 5 × 8
#>   term       mean    sd median   min   max na_count na_rate
#>   <chr>     <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>   <dbl>
#> 1 count    2.47   7.38       0     0    32        0       0
#> 2 total_x 32      0         32    32    32        0       0
#> 3 total_y 32      0         32    32    32        0       0
#> 4 rate_x   0.0773 0.231      0     0     1        0       0
#> 5 rate_y   0.0773 0.231      0     0     1        0       0