--- title: "Timing R Pipelines" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{timing_R_pipelines} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(pipetime) library(dplyr) library(data.table) library(ggplot2) ``` # Overview `pipetime` enables **inline** timing of R pipelines (`|>`), helping identify performance bottlenecks and compare different approaches without disrupting your workflow. We illustrate this with a classic comparison: `dplyr` vs `data.table` for a grouped aggregation pipeline. Both are excellent packages, but `data.table` is known for its speed advantage on larger datasets. - **Workflow A** 🐢 : Uses `dplyr` verbs. - **Workflow B** 🚀: Uses `data.table` syntax wrapped in a pipe-friendly style. ## Example Data ```{r} set.seed(42) n <- 1e6 sales <- data.frame( region = sample(state.abb, n, TRUE), product = sample(paste0("SKU_", 1:500), n, TRUE), revenue = round(runif(n, 1, 1000), 2), qty = sample(1:50, n, TRUE) ) sales_dt <- as.data.table(sales) head(sales, n = 3) ``` ## Timing Workflows We use the `log` argument so each workflow stores its timings separately. ```{r} library(pipetime) options(pipetime.console = FALSE) # Workflow A: dplyr wf_A <- sales |> filter(revenue > 100) |> time_pipe("filter", log = "dplyr") |> group_by(region) |> summarise(total = sum(revenue), avg_qty = mean(qty), .groups = "drop") |> time_pipe("group + summarise", log = "dplyr") |> arrange(desc(total)) |> time_pipe("arrange", log = "dplyr") # Workflow B: data.table wf_B <- sales_dt |> (\(dt) dt[revenue > 100])() |> time_pipe("filter", log = "data.table") |> (\(dt) dt[, .(total = sum(revenue), avg_qty = mean(qty)), by = region])() |> time_pipe("group + summarise", log = "data.table") |> (\(dt) dt[order(-total)])() |> time_pipe("arrange", log = "data.table") ``` # Results ```{r, dpi = 500} # Collect both logs logs <- get_log() |> bind_rows(.id = "workflow") |> group_by(workflow) |> # Add a starting point group_modify(~ add_row(.x, duration = 0, label = "start", .before = 1)) |> mutate(step = factor(row_number())) library(ggplot2) logs |> ggplot( aes( x = step, y = duration, colour = workflow, group = workflow ) ) + geom_line(linewidth = 1) + geom_point(size = 3) + geom_text(aes(label = label), vjust = -0.7, size = 3.5, show.legend = FALSE) + labs( x = "Step", y = "Cumulative time (sec)", title = "dplyr vs data.table", colour = "Workflow" ) + theme_classic() ``` `data.table`'s in-memory optimizations give it a consistent edge, especially on the grouped aggregation step. `pipetime` makes it easy to pinpoint exactly where the difference lies.