Count duplicates at each observation

For every value in a vector or data frame, duplicate_tally() counts how often it appears in total. Tallies are presented next to each value.

For summary statistics, call audit() on the results.

Usage

duplicate_tally(x, ignore = NULL, colname_end = "n")

Arguments

x: Vector or data frame.
ignore: Optionally, a vector of values that should not be checked. In the test result columns, they will be marked NA.
colname_end: String. Name ending of the logical test result columns. Default is "n".

Value

A tibble (data frame). It has all the columns from x, and to each of these columns' right, the corresponding tally column.

The tibble has the scr_dup_detect class, which is recognized by the audit() generic.

Details

This function is not very informative with many input values that only have a few characters each. Many of them may have duplicates just by chance. For example, in R's built-in iris data set, 99% of values have duplicates.

In general, the fewer values and the more characters per value, the more significant the results.

Summaries with `audit()`

There is an S3 method for the audit() generic, so you can call audit() following duplicate_tally(). It returns a tibble with summary statistics.

Examples

# Tally duplicate values in a data frame...
duplicate_tally(x = pigs4)
#> # A tibble: 5 × 6
#>   snout snout_n tail  tail_n wings wings_n
#>   <chr>   <int> <chr>  <int> <chr>   <int>
#> 1 4.73        1 6.88       1 6.09        1
#> 2 8.13        2 7.33       1 8.27        1
#> 3 4.22        2 5.17       3 4.40        1
#> 4 4.22        2 7.57       1 5.92        1
#> 5 5.17        3 8.13       2 5.17        3

# ...or in a single vector:
duplicate_tally(x = pigs4$snout)
#> # A tibble: 5 × 2
#>   value value_n
#>   <chr>   <int>
#> 1 4.73        1
#> 2 8.13        1
#> 3 4.22        2
#> 4 4.22        2
#> 5 5.17        1

# Summary statistics with `audit()`:
pigs4 %>%
  duplicate_tally() %>%
  audit()
#> # A tibble: 4 × 8
#>   term     mean    sd median   min   max na_count na_rate
#>   <chr>   <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>   <dbl>
#> 1 snout_n  2    0.707      2     1     3        0       0
#> 2 tail_n   1.6  0.894      1     1     3        0       0
#> 3 wings_n  1.4  0.894      1     1     3        0       0
#> 4 .total   1.67 0.816      1     1     3        0       0

# Any values can be ignored:
pigs4 %>%
  duplicate_tally(ignore = c(8.131, 7.574))
#> # A tibble: 5 × 6
#>   snout snout_n tail  tail_n wings wings_n
#>   <chr>   <int> <chr>  <int> <chr>   <int>
#> 1 4.73        1 6.88       1 6.09        1
#> 2 8.13        2 7.33       1 8.27        1
#> 3 4.22        2 5.17       3 4.40        1
#> 4 4.22        2 7.57       1 5.92        1
#> 5 5.17        3 8.13       2 5.17        3