---
title: "Introduction to santoku"
output:
rmarkdown::html_vignette:
toc: true
vignette: >
%\VignetteIndexEntry{Introduction to santoku}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
set.seed(23479)
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
options(digits = 4)
```
## Introduction
Santoku is a package for cutting data into intervals. It provides `chop()`,
a replacement for base R's `cut()` function, as well as several convenience
functions to cut different kinds of intervals.
To install santoku, run:
``` r
install.packages("santoku")
```
## Basic usage
Use `chop()` like `cut()`, to cut numeric data into intervals between a set of
`breaks`.
```{r}
library(santoku)
x <- runif(10, 0, 10)
(chopped <- chop(x, breaks = 0:10))
data.frame(x, chopped)
```
`chop()` returns a factor.
If data is beyond the limits of `breaks`, they will be extended automatically:
```{r}
chopped <- chop(x, breaks = 3:7)
data.frame(x, chopped)
```
To chop a single number into a separate category, put the number twice in
`breaks`:
```{r}
x_fives <- x
x_fives[1:5] <- 5
chopped <- chop(x_fives, c(2, 5, 5, 8))
data.frame(x_fives, chopped)
```
To quickly produce a table of chopped data, use `tab()`:
```{r}
tab(1:10, c(2, 5, 8))
```
## More ways to chop
To chop into fixed-width intervals, starting at the minimum value, use
`chop_width()`:
```{r}
chopped <- chop_width(x, 2)
data.frame(x, chopped)
```
To chop into a fixed number of intervals, each with the same width,
use `chop_evenly()`:
```{r}
chopped <- chop_evenly(x, intervals = 3)
data.frame(x, chopped)
```
To chop into groups with a fixed number of members, use `chop_n()`:
```{r}
chopped <- chop_n(x, 4)
table(chopped)
```
To chop into a fixed number of groups, each with the same number of elements,
use `chop_equally()`:
```{r}
chopped <- chop_equally(x, groups = 5)
table(chopped)
```
To chop data up by quantiles, use `chop_quantiles()`:
```{r}
chopped <- chop_quantiles(x, c(0.25, 0.5, 0.75))
data.frame(x, chopped)
```
To chop data up by proportions of the data range, use `chop_proportions()`:
```{r}
chopped <- chop_proportions(x, c(0.25, 0.5, 0.75))
data.frame(x, chopped)
```
You can think of these six functions as logically arranged in a table.
To chop into... | Sizing intervals by... |
:------------------------------|:--------------------------|:----------------------
| number of elements: | interval width:
a specific number of equal intervals... | `chop_equally()` | `chop_evenly()`
intervals of one specific size... | `chop_n()` | `chop_width()`
intervals of different specific sizes... | `chop_quantiles()` | `chop_proportions()`
: Different ways to chop by size
To chop data by standard deviations around the mean, use `chop_mean_sd()`:
```{r}
chopped <- chop_mean_sd(x)
data.frame(x, chopped)
```
To chop data into attractive intervals, use `chop_pretty()`. This
selects intervals which are a multiple of 2, 5 or 10. It's useful for producing
bar plots.
```{r}
chopped <- chop_pretty(x)
data.frame(x, chopped)
```
`tab_n()`, `tab_width()`, and friends act similarly to
`tab()`, calling the related `chop_*` function and then `table()` on the result.
```{r}
tab_n(x, 4)
tab_width(x, 2)
tab_evenly(x, 5)
tab_mean_sd(x)
```
## Specifying labels
By default, santoku labels intervals using mathematical notation:
* `[0, 1]` means all numbers between 0 and 1 inclusive.
* `(0, 1)` means all numbers _strictly_ between 0 and 1, not including the
endpoints.
* `[0, 1)` means all numbers between 0 and 1, including 0 but not 1.
* `(0, 1]` means all numbers between 0 and 1, including 1 but not 0.
* `{0}` means just the number 0.
To override these labels, provide names to the `breaks` argument:
```{r}
chopped <- chop(x, c(Lowest = 1, Low = 2, Higher = 5, Highest = 8))
data.frame(x, chopped)
```
Or, you can specify factor labels with the `labels` argument:
```{r}
chopped <- chop(x, c(2, 5, 8), labels = c("Lowest", "Low", "Higher", "Highest"))
data.frame(x, chopped)
```
You need as many labels as there are intervals - one fewer than `length(breaks)`
if your data doesn't extend beyond `breaks`, one more than `length(breaks)` if
it does.
To label intervals with a dash, use `lbl_dash()`:
```{r}
chopped <- chop(x, c(2, 5, 8), labels = lbl_dash())
data.frame(x, chopped)
```
To label integer data, use `lbl_discrete()`. It uses more informative right
endpoints:
```{r}
chopped <- chop(1:10, c(2, 5, 8), labels = lbl_discrete())
chopped2 <- chop(1:10, c(2, 5, 8), labels = lbl_dash())
data.frame(x = 1:10, lbl_discrete = chopped, lbl_dash = chopped2)
```
You can customize the first or last labels:
```{r}
chopped <- chop(x, c(2, 5, 8), labels = lbl_dash(first = "< 2", last = "8+"))
data.frame(x, chopped)
```
To label intervals in order use `lbl_seq()`:
```{r}
chopped <- chop(x, c(2, 5, 8), labels = lbl_seq())
data.frame(x, chopped)
```
You can use numerals or even roman numerals:
```{r}
chop(x, c(2, 5, 8), labels = lbl_seq("(1)"))
chop(x, c(2, 5, 8), labels = lbl_seq("i."))
```
Other labelling functions include:
* `lbl_endpoints()` - use left endpoints as labels
* `lbl_midpoints()` - use interval midpoints as labels
* `lbl_glue()` - specify labels flexibly with the `{glue}` package
## Specifying breaks
By default, `chop()` extends `breaks` if necessary. If you don't want
that, set `extend = FALSE`:
```{r}
chopped <- chop(x, c(3, 5, 7), extend = FALSE)
data.frame(x, chopped)
```
Data outside the range of `breaks` will become `NA`.
By default, intervals are closed on the left, i.e. they include their left
endpoints. If you want right-closed intervals, set `left = FALSE`:
```{r}
y <- 1:5
data.frame(
y = y,
left_closed = chop(y, 1:5),
right_closed = chop(y, 1:5, left = FALSE)
)
```
By default, the last interval is closed on both ends.
If you want to keep the last interval open at the end,
set `close_end = FALSE`:
```{r}
data.frame(
y = y,
end_closed = chop(y, 1:5),
end_open = chop(y, 1:5, close_end = FALSE)
)
```
# Chopping dates, times and other vectors
You can chop many kinds of vectors with santoku, including Date objects...
```{r}
y2k <- as.Date("2000-01-01") + 0:10 * 7
data.frame(
y2k = y2k,
chopped = chop(y2k, as.Date(c("2000-02-01", "2000-03-01")))
)
```
... and POSIXct (date-time) objects:
```{r}
# hours of the 2020 Crew Dragon flight:
crew_dragon <- seq(as.POSIXct("2020-05-30 18:00", tz = "GMT"),
length.out = 24, by = "hours")
liftoff <- as.POSIXct("2020-05-30 15:22", tz = "America/New_York")
dock <- as.POSIXct("2020-05-31 10:16", tz = "America/New_York")
data.frame(
crew_dragon = crew_dragon,
chopped = chop(crew_dragon, c(liftoff, dock),
labels = c("pre-flight", "flight", "docked"))
)
```
Note how santoku correctly handles the different timezones.
You can use `chop_width()` with objects from the `lubridate` package,
to chop by irregular periods such as months:
```{r}
library(lubridate)
data.frame(
y2k = y2k,
chopped = chop_width(y2k, months(1))
)
```
You can format labels using format strings from `strptime()`.
`lbl_discrete()` is useful here:
```{r}
data.frame(
y2k = y2k,
chopped = chop_width(y2k, months(1), labels = lbl_discrete(fmt = "%e %b"))
)
```
You can also chop vectors with units, using the `units` package:
```{r}
library(units)
x <- set_units(1:10 * 10, cm)
br <- set_units(1:3, ft)
data.frame(
x = x,
chopped = chop(x, br)
)
```
You should be able to chop anything that has a comparison operator. You can
even chop character data using lexical ordering. By default santoku emits a
warning in this case, to avoid accidentally misinterpreting results:
```{r}
chop(letters[1:10], c("d", "f"))
```
If you find a type of data that you can't chop, please
[file an issue](https://github.com/hughjonesd/santoku/issues).