duplicate_count()
returns a frequency table. When searching a
data frame, it includes values from all columns for each frequency count.
This function is a blunt tool designed for initial data checking. It is not too informative if many values have few characters each.
For summary statistics, call audit()
on the results.
Usage
duplicate_count(x, ignore = NULL, locations_type = c("character", "list"))
Arguments
- x
Vector or data frame.
- ignore
Optionally, a vector of values that should not be counted.
- locations_type
String. One of
"character"
or"list"
. With"list"
, eachlocations
value is a vector of column names, which is better for further programming. By default ("character"
), the column names are pasted into a string, which is more readable.
Value
If x
is a data frame or another named vector, a tibble with four
columns. If x
isn't named, only the first two columns appear:
value
: All the values fromx
.frequency
: Absolute frequency of each value inx
, in descending order.locations
: Names of all columns fromx
in whichvalue
appears.locations_n
: Number of columns named inlocations
.
The tibble has the scr_dup_count
class, which is recognized by the
audit()
generic.
Summaries with audit()
There is an S3 method for the
audit()
generic, so you can call audit()
following
duplicate_count()
. It returns a tibble with summary statistics for the
two numeric columns, frequency
and locations_n
(or, if x
isn't named,
only for frequency
).
See also
duplicate_count_colpair()
to check each combination of columns for duplicates.duplicate_tally()
to show instances of a value next to each instance.janitor::get_dupes()
to search for duplicate rows.
Examples
# Count duplicate values...
iris %>%
duplicate_count()
#> # A tibble: 77 × 4
#> value frequency locations locations_n
#> <chr> <int> <chr> <int>
#> 1 setosa 50 Species 1
#> 2 versicolor 50 Species 1
#> 3 virginica 50 Species 1
#> 4 0.2 29 Petal.Width 1
#> 5 3 27 Sepal.Width, Petal.Length 2
#> 6 1.5 25 Petal.Length, Petal.Width 2
#> 7 1.4 21 Petal.Length, Petal.Width 2
#> 8 1.3 20 Petal.Length, Petal.Width 2
#> 9 5.1 17 Sepal.Length, Petal.Length 2
#> 10 5 14 Sepal.Length, Petal.Length 2
#> # ℹ 67 more rows
# ...and compute summaries:
iris %>%
duplicate_count() %>%
audit()
#> # A tibble: 2 × 8
#> term mean sd median min max na_count na_rate
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 frequency 9.74 9.92 7 1 50 0 0
#> 2 locations_n 1.64 0.511 2 1 3 0 0
# Any values can be ignored:
iris %>%
duplicate_count(ignore = c("setosa", "versicolor", "virginica"))
#> # A tibble: 74 × 4
#> value frequency locations locations_n
#> <chr> <int> <chr> <int>
#> 1 0.2 29 Petal.Width 1
#> 2 3 27 Sepal.Width, Petal.Length 2
#> 3 1.5 25 Petal.Length, Petal.Width 2
#> 4 1.4 21 Petal.Length, Petal.Width 2
#> 5 1.3 20 Petal.Length, Petal.Width 2
#> 6 5.1 17 Sepal.Length, Petal.Length 2
#> 7 5 14 Sepal.Length, Petal.Length 2
#> 8 2.8 14 Sepal.Width 1
#> 9 3.2 13 Sepal.Width 1
#> 10 5.6 12 Sepal.Length, Petal.Length 2
#> # ℹ 64 more rows