---
title: "Introduction to santoku"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Introduction to santoku}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
set.seed(23479)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(digits = 4)
```

## Introduction

Santoku is a package for cutting data into intervals. It provides `chop()`, a
replacement for base R's `cut()` function, as well as several convenience
functions to cut different kinds of intervals.

To install santoku, run:

``` r
install.packages("santoku")
```

## Basic usage

Use `chop()` like `cut()`, to cut numeric data into intervals between a set of
`breaks`.

```{r}
library(santoku)

x <- runif(10, 0, 10)
(chopped <- chop(x, breaks = 0:10))
data.frame(x, chopped)
```

`chop()` returns a factor.

If data lies beyond the limits of `breaks`, the breaks are extended
automatically:

```{r}
chopped <- chop(x, breaks = 3:7)
data.frame(x, chopped)
```

To chop a single number into a separate category, put the number twice in
`breaks`:

```{r}
x_fives <- x
x_fives[1:5] <- 5
chopped <- chop(x_fives, c(2, 5, 5, 8))
data.frame(x_fives, chopped)
```

To quickly produce a table of chopped data, use `tab()`:

```{r}
tab(1:10, c(2, 5, 8))
```

## More ways to chop

To chop into fixed-width intervals, starting at the minimum value, use
`chop_width()`:

```{r}
chopped <- chop_width(x, 2)
data.frame(x, chopped)
```

To chop into a fixed number of intervals, each with the same width, use
`chop_evenly()`:

```{r}
chopped <- chop_evenly(x, intervals = 3)
data.frame(x, chopped)
```

To chop into groups with a fixed number of members, use `chop_n()`:

```{r}
chopped <- chop_n(x, 4)
table(chopped)
```

To chop into a fixed number of groups, each with the same number of elements,
use `chop_equally()`:

```{r}
chopped <- chop_equally(x, groups = 5)
table(chopped)
```

To chop data up by quantiles, use `chop_quantiles()`:

```{r}
chopped <- chop_quantiles(x, c(0.25, 0.5, 0.75))
data.frame(x, chopped)
```

To chop data up by proportions of the data range, use `chop_proportions()`:

```{r}
chopped <- chop_proportions(x, c(0.25, 0.5, 0.75))
data.frame(x, chopped)
```

You can think of these six functions as logically arranged in a table.

| To chop into...                          | Sizing intervals by number of elements | Sizing intervals by interval width |
|:-----------------------------------------|:---------------------------------------|:-----------------------------------|
| a specific number of equal intervals...  | `chop_equally()`                       | `chop_evenly()`                    |
| intervals of one specific size...        | `chop_n()`                             | `chop_width()`                     |
| intervals of different specific sizes... | `chop_quantiles()`                     | `chop_proportions()`               |

: Different ways to chop by size

To chop data by standard deviations around the mean, use `chop_mean_sd()`:

```{r}
chopped <- chop_mean_sd(x)
data.frame(x, chopped)
```

To chop data into attractive intervals, use `chop_pretty()`. This selects
intervals which are a multiple of 2, 5 or 10. It's useful for producing bar
plots.

```{r}
chopped <- chop_pretty(x)
data.frame(x, chopped)
```

`tab_n()`, `tab_width()`, and friends act similarly to `tab()`, calling the
related `chop_*` function and then `table()` on the result.

```{r}
tab_n(x, 4)
tab_width(x, 2)
tab_evenly(x, 5)
tab_mean_sd(x)
```

## Specifying labels

By default, santoku labels intervals using mathematical notation:

* `[0, 1]` means all numbers between 0 and 1 inclusive.
* `(0, 1)` means all numbers _strictly_ between 0 and 1, not including the
  endpoints.
* `[0, 1)` means all numbers between 0 and 1, including 0 but not 1.
* `(0, 1]` means all numbers between 0 and 1, including 1 but not 0.
* `{0}` means just the number 0.

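
As a rough illustration of this notation, the sketch below chops one small vector three ways: with the default left-closed intervals, with right-closed intervals via `left = FALSE`, and with a repeated break that creates a singleton. The vector `y` and the break values are made up for this example, and no output is shown; printing the data frame shows which label each value receives.

```r
library(santoku)

# Illustrative values only; `y` is not used elsewhere in this vignette's text.
y <- c(1, 2, 3, 4)

# Default: intervals include their left endpoint.
left_closed <- chop(y, c(1, 2, 4))

# Same breaks, but intervals include their right endpoint instead.
right_closed <- chop(y, c(1, 2, 4), left = FALSE)

# Repeating 2 in `breaks` gives it a category of its own.
singleton <- chop(y, c(1, 2, 2, 4))

data.frame(y, left_closed, right_closed, singleton)
```

Comparing the three columns row by row is a quick way to check which endpoint a given value has been assigned to.
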
To override these labels, provide names to the `breaks` argument:

```{r}
chopped <- chop(x, c(Lowest = 1, Low = 2, Higher = 5, Highest = 8))
data.frame(x, chopped)
```

Or, you can specify factor labels with the `labels` argument:

```{r}
chopped <- chop(x, c(2, 5, 8),
                labels = c("Lowest", "Low", "Higher", "Highest"))
data.frame(x, chopped)
```

You need as many labels as there are intervals - one fewer than
`length(breaks)` if your data doesn't extend beyond `breaks`, one more than
`length(breaks)` if it does.

To label intervals with a dash, use `lbl_dash()`:

```{r}
chopped <- chop(x, c(2, 5, 8), labels = lbl_dash())
data.frame(x, chopped)
```

To label integer data, use `lbl_discrete()`. It uses more informative right
endpoints:

```{r}
chopped  <- chop(1:10, c(2, 5, 8), labels = lbl_discrete())
chopped2 <- chop(1:10, c(2, 5, 8), labels = lbl_dash())
data.frame(x = 1:10, lbl_discrete = chopped, lbl_dash = chopped2)
```

You can customize the first or last labels:

```{r}
chopped <- chop(x, c(2, 5, 8), labels = lbl_dash(first = "< 2", last = "8+"))
data.frame(x, chopped)
```

To label intervals in order use `lbl_seq()`:

```{r}
chopped <- chop(x, c(2, 5, 8), labels = lbl_seq())
data.frame(x, chopped)
```

You can use numerals or even roman numerals:

```{r}
chop(x, c(2, 5, 8), labels = lbl_seq("(1)"))
chop(x, c(2, 5, 8), labels = lbl_seq("i."))
```

Other labelling functions include:

* `lbl_endpoints()` - use left endpoints as labels
* `lbl_midpoints()` - use interval midpoints as labels
* `lbl_glue()` - specify labels flexibly with the `{glue}` package

## Specifying breaks

By default, `chop()` extends `breaks` if necessary. If you don't want that,
set `extend = FALSE`:

```{r}
chopped <- chop(x, c(3, 5, 7), extend = FALSE)
data.frame(x, chopped)
```

Data outside the range of `breaks` will become `NA`.

By default, intervals are closed on the left, i.e. they include their left
endpoints.
If you want right-closed intervals, set `left = FALSE`:

```{r}
y <- 1:5
data.frame(
  y = y,
  left_closed = chop(y, 1:5),
  right_closed = chop(y, 1:5, left = FALSE)
)
```

By default, the last interval is closed on both ends. If you want to keep the
last interval open at the end, set `close_end = FALSE`:

```{r}
data.frame(
  y = y,
  end_closed = chop(y, 1:5),
  end_open = chop(y, 1:5, close_end = FALSE)
)
```

## Chopping dates, times and other vectors

You can chop many kinds of vectors with santoku, including Date objects...

```{r}
y2k <- as.Date("2000-01-01") + 0:10 * 7
data.frame(
  y2k = y2k,
  chopped = chop(y2k, as.Date(c("2000-02-01", "2000-03-01")))
)
```

... and POSIXct (date-time) objects:

```{r}
# hours of the 2020 Crew Dragon flight:
crew_dragon <- seq(as.POSIXct("2020-05-30 18:00", tz = "GMT"),
                   length.out = 24, by = "hours")
liftoff <- as.POSIXct("2020-05-30 15:22", tz = "America/New_York")
dock    <- as.POSIXct("2020-05-31 10:16", tz = "America/New_York")

data.frame(
  crew_dragon = crew_dragon,
  chopped = chop(crew_dragon, c(liftoff, dock),
                 labels = c("pre-flight", "flight", "docked"))
)
```

Note how santoku correctly handles the different timezones.

You can use `chop_width()` with objects from the `lubridate` package, to chop
by irregular periods such as months:

```{r}
library(lubridate)

data.frame(
  y2k = y2k,
  chopped = chop_width(y2k, months(1))
)
```

You can format labels using format strings from `strptime()`.
`lbl_discrete()` is useful here:

```{r}
data.frame(
  y2k = y2k,
  chopped = chop_width(y2k, months(1), labels = lbl_discrete(fmt = "%e %b"))
)
```

You can also chop vectors with units, using the `units` package:

```{r}
library(units)

x <- set_units(1:10 * 10, cm)
br <- set_units(1:3, ft)
data.frame(
  x = x,
  chopped = chop(x, br)
)
```

You should be able to chop anything that has a comparison operator. You can
even chop character data using lexical ordering.
By default santoku emits a warning in this case, to avoid accidentally
misinterpreting results:

```{r}
chop(letters[1:10], c("d", "f"))
```

If you find a type of data that you can't chop, please
[file an issue](https://github.com/hughjonesd/santoku/issues).