duplicate_count() returns a frequency table. When x is a data frame, values are counted across all columns together, so each frequency is the total number of times a value appears anywhere in the data frame.

This function is a blunt tool designed for initial data checking. It is not very informative if many values have only a few characters each.

For summary statistics, call audit() on the results.

Usage

duplicate_count(x, ignore = NULL, locations_type = c("character", "list"))

Arguments

x

Vector or data frame.

ignore

Optionally, a vector of values that should not be counted.

locations_type

String. One of "character" or "list". With "list", each locations value is a vector of column names, which is better for further programming. By default ("character"), the column names are pasted into a string, which is more readable.
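
The difference between the two locations_type formats can be illustrated in base R. This is only a sketch of the two shapes described above, not output produced by duplicate_count(); the objects locations_chr and locations_lst are hypothetical stand-ins for a single locations entry.

```r
# Hypothetical stand-ins for one `locations` entry in each format
locations_chr <- "Sepal.Width, Petal.Length"           # "character": pasted string
locations_lst <- list(c("Sepal.Width", "Petal.Length")) # "list": vector of names

# With a pasted string, the column names must be split apart first
in_chr <- "Petal.Length" %in% strsplit(locations_chr, ", ", fixed = TRUE)[[1]]

# With a list, the column names are directly available as a vector
in_lst <- "Petal.Length" %in% locations_lst[[1]]
```

Both checks give the same answer, but the list form avoids parsing, which matters if a column name could itself contain a comma.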

Value

If x is a data frame or another named vector, a tibble with four columns. If x isn't named, only the first two columns appear:

  • value: All the values from x.

  • frequency: Absolute frequency of each value in x, in descending order.

  • locations: Names of all columns from x in which value appears.

  • locations_n: Number of columns named in locations.

The tibble has the scr_dup_count class, which is recognized by the audit() generic.
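
To make the four columns concrete, here is a rough base-R approximation of such a frequency table for a data frame. It is a sketch under simplifying assumptions, not the package's implementation: count_values_sketch() is a hypothetical helper, it returns a plain data frame without the scr_dup_count class, and it has no ignore argument.

```r
# Hypothetical helper, not part of the package: approximates the
# value / frequency / locations / locations_n layout for a data frame.
count_values_sketch <- function(df) {
  # Flatten all columns into one character vector (column by column)
  values <- unlist(lapply(df, as.character), use.names = FALSE)
  # Record the column each flattened value came from
  columns <- rep(names(df), each = nrow(df))

  # Count each distinct value across all columns
  freq <- table(values)
  # For each distinct value, collect the columns in which it appears
  locs <- lapply(split(columns, values), unique)

  out <- data.frame(
    value = names(freq),
    frequency = as.integer(freq),
    stringsAsFactors = FALSE
  )
  out$locations <- vapply(
    locs[out$value], paste, character(1), collapse = ", ",
    USE.NAMES = FALSE
  )
  out$locations_n <- unname(lengths(locs[out$value]))

  # Most frequent values first
  out <- out[order(out$frequency, decreasing = TRUE), ]
  rownames(out) <- NULL
  out
}

res <- count_values_sketch(iris)
head(res)
```

The ordering of ties may differ from what duplicate_count() returns; only the descending order of frequency is reproduced here.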

Summaries with audit()

There is an S3 method for the audit() generic, so you can call audit() following duplicate_count(). It returns a tibble with summary statistics for the two numeric columns, frequency and locations_n (or, if x isn't named, only for frequency).
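
The kind of summary that audit() reports can be sketched in base R. The column names below are taken from the example output in the Examples section; summarize_sketch() is a hypothetical helper and makes no claim about how the audit() method computes its results.

```r
# Hypothetical helper, not the package's audit() method:
# one row of summary statistics for a numeric vector.
summarize_sketch <- function(x, term) {
  data.frame(
    term = term,
    mean = mean(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    min = min(x, na.rm = TRUE),
    max = max(x, na.rm = TRUE),
    na_count = sum(is.na(x)),   # number of missing values
    na_rate = mean(is.na(x)),   # share of missing values
    stringsAsFactors = FALSE
  )
}

# The first five frequencies from the iris example
frequency <- c(50, 50, 50, 29, 27)
s <- summarize_sketch(frequency, "frequency")
s
```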

Examples

# Count duplicate values...
iris %>%
  duplicate_count()
#> # A tibble: 77 × 4
#>    value      frequency locations                  locations_n
#>    <chr>          <int> <chr>                            <int>
#>  1 setosa            50 Species                              1
#>  2 versicolor        50 Species                              1
#>  3 virginica         50 Species                              1
#>  4 0.2               29 Petal.Width                          1
#>  5 3                 27 Sepal.Width, Petal.Length            2
#>  6 1.5               25 Petal.Length, Petal.Width            2
#>  7 1.4               21 Petal.Length, Petal.Width            2
#>  8 1.3               20 Petal.Length, Petal.Width            2
#>  9 5.1               17 Sepal.Length, Petal.Length           2
#> 10 5                 14 Sepal.Length, Petal.Length           2
#> # ℹ 67 more rows

# ...and compute summaries:
iris %>%
  duplicate_count() %>%
  audit()
#> # A tibble: 2 × 8
#>   term         mean    sd median   min   max na_count na_rate
#>   <chr>       <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>   <dbl>
#> 1 frequency    9.74 9.92       7     1    50        0       0
#> 2 locations_n  1.64 0.511      2     1     3        0       0

# Any values can be ignored:
iris %>%
  duplicate_count(ignore = c("setosa", "versicolor", "virginica"))
#> # A tibble: 74 × 4
#>    value frequency locations                  locations_n
#>    <chr>     <int> <chr>                            <int>
#>  1 0.2          29 Petal.Width                          1
#>  2 3            27 Sepal.Width, Petal.Length            2
#>  3 1.5          25 Petal.Length, Petal.Width            2
#>  4 1.4          21 Petal.Length, Petal.Width            2
#>  5 1.3          20 Petal.Length, Petal.Width            2
#>  6 5.1          17 Sepal.Length, Petal.Length           2
#>  7 5            14 Sepal.Length, Petal.Length           2
#>  8 2.8          14 Sepal.Width                          1
#>  9 3.2          13 Sepal.Width                          1
#> 10 5.6          12 Sepal.Length, Petal.Length           2
#> # ℹ 64 more rows