---
title: "Timing R Pipelines"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Timing R Pipelines}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(pipetime)
library(dplyr)
library(data.table)
library(ggplot2)
```

# Overview
`pipetime` times R pipelines (`|>`) **inline**: you insert `time_pipe()` between steps to identify performance bottlenecks and compare different approaches without restructuring your workflow.

We illustrate this with a classic comparison: `dplyr` vs `data.table` for a grouped aggregation pipeline. Both are excellent packages, but `data.table` is known for its speed advantage on larger datasets.

-   **Workflow A** 🐢: Uses `dplyr` verbs.

-   **Workflow B** 🚀: Uses `data.table` syntax wrapped in a pipe-friendly style.

## Example Data
```{r}
set.seed(42)
n <- 1e6

sales <- data.frame(
  region = sample(state.abb, n, TRUE),
  product = sample(paste0("SKU_", 1:500), n, TRUE),
  revenue = round(runif(n, 1, 1000), 2),
  qty = sample(1:50, n, TRUE)
)
sales_dt <- as.data.table(sales)
head(sales, n = 3)
```

## Timing Workflows
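We pass a different `log` name to `time_pipe()` in each workflow so that their timings are stored separately, and set `options(pipetime.console = FALSE)` so the timings are not also printed to the console.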

```{r}
library(pipetime)
options(pipetime.console = FALSE)

# Workflow A: dplyr
wf_A <- sales |>
  filter(revenue > 100) |>
  time_pipe("filter", log = "dplyr") |>
  group_by(region) |>
  summarise(total = sum(revenue), avg_qty = mean(qty), .groups = "drop") |>
  time_pipe("group + summarise", log = "dplyr") |>
  arrange(desc(total)) |>
  time_pipe("arrange", log = "dplyr")

# Workflow B: data.table
wf_B <- sales_dt |>
  (\(dt) dt[revenue > 100])() |>
  time_pipe("filter", log = "data.table") |>
  (\(dt) dt[, .(total = sum(revenue), avg_qty = mean(qty)), by = region])() |>
  time_pipe("group + summarise", log = "data.table") |>
  (\(dt) dt[order(-total)])() |>
  time_pipe("arrange", log = "data.table")

```
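
Each `time_pipe()` call can report time elapsed since the start of the pipeline because `x |> f() |> g()` is simply `g(f(x))`, and R evaluates function arguments lazily. The chunk below is a minimal base R sketch of that idea, not `pipetime`'s actual implementation; `checkpoint()` and `timing_store` are hypothetical names introduced only for illustration.

```{r}
# Hypothetical store: one data frame of timings per log name.
timing_store <- new.env()

# Hypothetical stand-in for a timing step (an assumption, not pipetime's code).
checkpoint <- function(.data, label, log) {
  start <- Sys.time()       # runs before .data is evaluated (lazy evaluation)
  result <- force(.data)    # forcing .data runs every upstream pipeline step
  elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))

  # Append this checkpoint to the named log, creating the log if needed.
  entry <- data.frame(label = label, duration = elapsed)
  previous <- if (exists(log, envir = timing_store, inherits = FALSE)) {
    get(log, envir = timing_store)
  } else {
    NULL
  }
  assign(log, rbind(previous, entry), envir = timing_store)

  result                    # pass the data through unchanged
}

toy_result <- 1:1e5 |>
  sqrt() |>
  checkpoint("sqrt", log = "toy") |>
  cumsum() |>
  checkpoint("cumsum", log = "toy")

timing_store[["toy"]]
```

The outermost `checkpoint()` starts its timer first and finishes last, so its duration covers the whole pipeline, while inner checkpoints cover only the steps before them. That is why the recorded durations grow from one checkpoint to the next.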

# Results
```{r, dpi = 500}
# Collect both logs
logs <- get_log() |>
  bind_rows(.id = "workflow") |>
  group_by(workflow) |>
  # Add a starting point
  group_modify(~ add_row(.x, duration = 0, label = "start", .before = 1)) |>
  mutate(step = factor(row_number()))

library(ggplot2)
logs |>
  ggplot(
    aes(
      x = step,
      y = duration,
      colour = workflow,
      group = workflow
    )
  ) +
  geom_line(linewidth = 1) +
  geom_point(size = 3) +
  geom_text(aes(label = label), vjust = -0.7, size = 3.5, show.legend = FALSE) +
  labs(
    x = "Step",
    y = "Cumulative time (sec)",
    title = "dplyr vs data.table",
    colour = "Workflow"
  ) +
  theme_classic()
```

In this example, `data.table` keeps a consistent edge over `dplyr`, especially on the grouped aggregation step. Because `pipetime` records a timing at each checkpoint rather than only for the whole pipeline, it is easy to pinpoint exactly where the difference lies.
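
Since the plotted durations are cumulative, the cost of an individual step is the difference between consecutive checkpoints. The following is a small base R sketch, assuming a log with `label` and `duration` columns like the one plotted above; `step_durations()` is a hypothetical helper, and the numbers are made-up placeholders rather than measurements.

```{r}
# Hypothetical helper: convert cumulative checkpoint times to per-step times.
step_durations <- function(log_df) {
  # Prepend 0 so the first step is measured from the pipeline start.
  log_df$step_time <- diff(c(0, log_df$duration))
  log_df
}

# Placeholder values for illustration only (not real timings).
example_log <- data.frame(
  label = c("filter", "group + summarise", "arrange"),
  duration = c(0.25, 0.75, 1.0)
)

step_durations(example_log)
```

The same helper could be applied per workflow, for example with `lapply(get_log(), step_durations)`, assuming `get_log()` returns one data frame per log as the `bind_rows()` call above suggests.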