duplicate_count() returns a frequency table. When searching a data frame, it includes values from all columns for each frequency count.
This function is a blunt tool designed for initial data checking. It is not very informative if many values have only a few characters each.
For summary statistics, call audit() on the results.
Usage
duplicate_count(
  x,
  ignore = NULL,
  locations_type = c("character", "list"),
  numeric_only = deprecated()
)
Arguments
- x
Vector or data frame.
- ignore
Optionally, a vector of values that should not be counted.
- locations_type
String. One of "character" or "list". With "list", each locations value is a vector of column names, which is better for further programming. By default ("character"), the column names are pasted into a string, which is more readable.
- numeric_only
[Deprecated] No longer used: All values are coerced to character.
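The difference between the two locations_type options can be sketched in base R. This is only an illustration of the two representations, not output from duplicate_count(): the column names are made up for the example, and the ", " separator is an assumption taken from the printed examples on this page.

```r
# Illustrative values only; not produced by duplicate_count()
locations_list <- list(
  c("Petal.Length", "Petal.Width"),
  "Species"
)

# "list" style: each element is a character vector, so it can be queried directly
in_petal_width <- vapply(
  locations_list,
  function(cols) "Petal.Width" %in% cols,
  logical(1)
)

# "character" style: the same names pasted into one readable string per value
# (the ", " separator is an assumption based on the printed examples)
locations_chr <- vapply(locations_list, paste, character(1), collapse = ", ")

# Getting back to vectors from the string form requires splitting on the separator
locations_back <- strsplit(locations_chr, ", ", fixed = TRUE)

# Number of columns per value, analogous to what locations_n reports
locations_n <- lengths(locations_list)
```

The list form avoids the round trip through strsplit(), which is why it tends to be more convenient when the result feeds into other code.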
Value
If x is a data frame or another named vector, a tibble with four columns. If x isn't named, only the first two columns appear:
- value: All the values from x.
- frequency: Absolute frequency of each value in x, in descending order.
- locations: Names of all columns from x in which value appears.
- locations_n: Number of columns named in locations.
The tibble has the scr_dup_count class, which is recognized by the audit() generic.
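To make the four columns concrete, here is a minimal base R sketch that builds a comparable table for a small data frame. It is not the package's implementation: count_values_sketch() is a hypothetical helper, it returns a plain data frame without the scr_dup_count class, it does not handle missing values, and the order of tied frequencies may differ from what duplicate_count() produces.

```r
# Hypothetical helper, not part of the package: approximates the four
# columns described above using base R only.
count_values_sketch <- function(df, ignore = NULL) {
  # Coerce every column to character, as the documentation describes
  cols <- lapply(df, as.character)
  long <- data.frame(
    value = unlist(cols, use.names = FALSE),
    column = rep(names(cols), times = lengths(cols)),
    stringsAsFactors = FALSE
  )
  # Drop any values the caller asked to ignore
  long <- long[!(long$value %in% ignore), ]
  # Absolute frequency of each value across all columns
  freq <- table(long$value)
  # Names of the columns in which each value appears
  locs <- lapply(split(long$column, long$value), unique)
  out <- data.frame(
    value = names(freq),
    frequency = as.integer(freq),
    stringsAsFactors = FALSE
  )
  out$locations <- vapply(
    locs[out$value], paste, character(1),
    collapse = ", ", USE.NAMES = FALSE
  )
  out$locations_n <- lengths(locs[out$value], use.names = FALSE)
  # Sort by descending frequency
  out <- out[order(-out$frequency), ]
  rownames(out) <- NULL
  out
}

# Toy data: the value "2" appears three times, in columns a and b
df <- data.frame(a = c(1, 2, 2), b = c(2, 3, 1), c = c("x", "y", "x"))
res <- count_values_sketch(df)
res_ignored <- count_values_sketch(df, ignore = "x")
```

In this toy data, the first row of res is the value "2" with a frequency of 3, found in "a, b".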
Details
Don't use numeric_only. It no longer has any effect and will be removed in the future. The only reason for this argument was the risk of errors introduced by coercing values to numeric. This is no longer an issue because all values are now coerced to character, which is more appropriate for checking reported statistics.
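A short base R illustration of why character coercion can suit reported statistics better than numeric coercion. The example values are made up, and the trailing-zero case is one plausible reading of the paragraph above, not a statement about the package's internals.

```r
# Made-up reported values; "0.50" and "0.5" differ as printed in a report
reported <- c("0.50", "0.5", "1.20")

# Numeric coercion collapses strings that differ only in trailing zeros
n_distinct_numeric <- length(unique(as.numeric(reported)))  # 2

# Character comparison keeps them apart, exactly as they were reported
n_distinct_character <- length(unique(reported))            # 3

# Non-numeric entries turn into NA (with a warning) under numeric coercion
coerced <- suppressWarnings(as.numeric(c("12", "n/a")))
```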
Summaries with audit()
There is an S3 method for the audit() generic, so you can call audit() following duplicate_count(). It returns a tibble with summary statistics for the two numeric columns, frequency and locations_n (or, if x isn't named, only for frequency).
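As a rough idea of what such a summary contains, the following base R sketch computes one row of statistics per numeric column. It is not the audit() method itself: summarise_column_sketch() is a hypothetical helper, the column names are copied from the example output on this page, and removing missing values before summarising is an assumption.

```r
# Hypothetical helper: one summary row for a single numeric column.
# Column names mirror the audit() output printed in the examples.
summarise_column_sketch <- function(x, term) {
  data.frame(
    term = term,
    mean = mean(x, na.rm = TRUE),      # na.rm = TRUE is an assumption
    sd = sd(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    min = min(x, na.rm = TRUE),
    max = max(x, na.rm = TRUE),
    na_count = sum(is.na(x)),
    na_rate = mean(is.na(x))
  )
}

# Made-up columns standing in for a duplicate_count() result
frequency <- c(3L, 2L, 2L, 1L, 1L)
locations_n <- c(2L, 2L, 1L, 1L, 1L)

summary_sketch <- rbind(
  summarise_column_sketch(frequency, "frequency"),
  summarise_column_sketch(locations_n, "locations_n")
)
```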
See also
- duplicate_count_colpair() to check each combination of columns for duplicates.
- duplicate_tally() to show instances of a value next to each instance.
- janitor::get_dupes() to search for duplicate rows.
Examples
# Count duplicate values...
iris %>%
  duplicate_count()
#> # A tibble: 77 × 4
#>    value      frequency locations                  locations_n
#>    <chr>          <int> <chr>                            <int>
#>  1 setosa            50 Species                              1
#>  2 versicolor        50 Species                              1
#>  3 virginica         50 Species                              1
#>  4 0.2               29 Petal.Width                          1
#>  5 3                 27 Sepal.Width, Petal.Length            2
#>  6 1.5               25 Petal.Length, Petal.Width            2
#>  7 1.4               21 Petal.Length, Petal.Width            2
#>  8 1.3               20 Petal.Length, Petal.Width            2
#>  9 5.1               17 Sepal.Length, Petal.Length           2
#> 10 5                 14 Sepal.Length, Petal.Length           2
#> # ℹ 67 more rows
# ...and compute summaries:
iris %>%
  duplicate_count() %>%
  audit()
#> # A tibble: 2 × 8
#>   term         mean    sd median   min   max na_count na_rate
#>   <chr>       <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>   <dbl>
#> 1 frequency    9.74 9.92       7     1    50        0       0
#> 2 locations_n  1.64 0.511      2     1     3        0       0
# Any values can be ignored:
iris %>%
  duplicate_count(ignore = c("setosa", "versicolor", "virginica"))
#> # A tibble: 74 × 4
#>    value frequency locations                  locations_n
#>    <chr>     <int> <chr>                            <int>
#>  1 0.2          29 Petal.Width                          1
#>  2 3            27 Sepal.Width, Petal.Length            2
#>  3 1.5          25 Petal.Length, Petal.Width            2
#>  4 1.4          21 Petal.Length, Petal.Width            2
#>  5 1.3          20 Petal.Length, Petal.Width            2
#>  6 5.1          17 Sepal.Length, Petal.Length           2
#>  7 5            14 Sepal.Length, Petal.Length           2
#>  8 2.8          14 Sepal.Width                          1
#>  9 3.2          13 Sepal.Width                          1
#> 10 5.6          12 Sepal.Length, Petal.Length           2
#> # ℹ 64 more rows