Skip to contents

[Superseded]

duplicate_detect() is superseded because it's less informative than duplicate_tally() and duplicate_count(). Use these functions instead.

For every value in a vector or data frame, duplicate_detect() tests whether there is at least one identical value. Test results are presented next to every value.

This function is a blunt tool designed for initial data checking. Don't put too much weight on its results.

For summary statistics, call audit() on the results.

Usage

duplicate_detect(x, ignore = NULL, colname_end = "dup", numeric_only)

Arguments

x

Vector or data frame.

ignore

Optionally, a vector of values that should not be checked. In the test result columns, they will be marked NA.

colname_end

String. Name ending of the logical test result columns. Default is "dup".

numeric_only

[Deprecated] No longer used: All values are coerced to character.

Value

A tibble (data frame). It has all the columns from x, and to each of these columns' right, the corresponding test result column.

The tibble has the scr_dup_detect class, which is recognized by the audit() generic.

Details

This function is not very informative with many input values that only have a few characters each. Many of them may have duplicates just by chance. For example, in R's built-in iris data set, 99% of values have duplicates.

In general, the fewer values and the more characters per value, the more significant the results.

Summaries with audit()

There is an S3 method for the audit() generic, so you can call audit() following duplicate_detect(). It returns a tibble with these columns ---

  • term: The original data frame's variables.

  • dup_count: Number of "duplicated" values of that term variable: those which have at least one duplicate anywhere in the data frame.

  • total: Number of all non-NA values of that term variable.

  • dup_rate: Rate of "duplicated" values of that term variable.

The final row, .total, summarizes across all other rows: It adds up the dup_count and total_count columns, and calculates the mean of the dup_rate column.

See also

Examples

# Find duplicate values in a data frame...
duplicate_detect(x = pigs4)
#> # A tibble: 5 × 6
#>   snout snout_dup tail  tail_dup wings wings_dup
#>   <chr> <lgl>     <chr> <lgl>    <chr> <lgl>    
#> 1 4.73  FALSE     6.88  FALSE    6.09  FALSE    
#> 2 8.13  TRUE      7.33  FALSE    8.27  FALSE    
#> 3 4.22  TRUE      5.17  TRUE     4.40  FALSE    
#> 4 4.22  TRUE      7.57  FALSE    5.92  FALSE    
#> 5 5.17  TRUE      8.13  TRUE     5.17  TRUE     

# ...or in a single vector:
duplicate_detect(x = pigs4$snout)
#> # A tibble: 5 × 2
#>   value value_dup
#>   <chr> <lgl>    
#> 1 4.73  FALSE    
#> 2 8.13  FALSE    
#> 3 4.22  TRUE     
#> 4 4.22  TRUE     
#> 5 5.17  FALSE    

# Summary statistics with `audit()`:
pigs4 %>%
  duplicate_detect() %>%
  audit()
#> # A tibble: 4 × 4
#>   term   dup_count total_count dup_rate
#>   <chr>      <int>       <int>    <dbl>
#> 1 snout          4           5    0.8  
#> 2 tail           2           5    0.4  
#> 3 wings          1           5    0.2  
#> 4 .total         7          15    0.467

# Any values can be ignored:
pigs4 %>%
  duplicate_detect(ignore = c(8.131, 7.574))
#> # A tibble: 5 × 6
#>   snout snout_dup tail  tail_dup wings wings_dup
#>   <chr> <lgl>     <chr> <lgl>    <chr> <lgl>    
#> 1 4.73  FALSE     6.88  FALSE    6.09  FALSE    
#> 2 8.13  TRUE      7.33  FALSE    8.27  FALSE    
#> 3 4.22  TRUE      5.17  TRUE     4.40  FALSE    
#> 4 4.22  TRUE      7.57  FALSE    5.92  FALSE    
#> 5 5.17  TRUE      8.13  TRUE     5.17  TRUE