Documentation

1. Installation

Download the installer for your operating system from Gumroad. On macOS, open the .dmg file and drag Scarab into your Applications folder. On Windows, run the .exe installer. On Linux, run the .AppImage directly.

2. Loading Data

Start a session by loading a dataset. Click the "load" button, or drag and drop a .csv or Excel (.xlsx) file into the app. Scarab runs entirely on your machine: it reads the file into memory and makes no telemetry or remote calls.

Supported formats: .csv and .xlsx. Column headers must be present in the first row. Scarab auto-detects numeric, string, and date column types on load.

3. ScarabQL Overview

ScarabQL is a plain-text query language for exploratory data analysis. The general structure of a query is:

[action] [target] [where filters...] [|> pipe operations...]
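To make the structure concrete, the following Python sketch splits a query string into the four parts named above. It is an illustration only, not Scarab's actual parser: the function name split_query is hypothetical, and the naive splitting would misread a quoted value that itself contains "where" or "then" (then is the keyword form of |>, covered under Data Transformations and Pipelines).

```python
import re


def split_query(query: str) -> dict:
    """Split a ScarabQL-style query into action, target, filters, and pipes.

    Hypothetical helper for illustration; not part of Scarab.
    """
    # Separate the main clause from pipe stages ("|>" or the keyword "then").
    head, *pipes = re.split(r"\s*\|>\s*|\s+then\s+", query.strip())
    # Split the main clause at the first "where", if one is present.
    main, _, filters = head.partition(" where ")
    # The first word is the action; the remainder is the target.
    action, _, target = main.partition(" ")
    return {
        "action": action,
        "target": target.strip(),
        "where": filters.strip(),
        "pipes": [p.strip() for p in pipes],
    }


filtered = split_query('find mean(salary) where role = "Engineer" and age > 30')
piped = split_query("trend revenue over order_date then smooth 7")
```

Here filtered["where"] holds the filter expression and piped["pipes"] holds the single stage "smooth 7". The column names revenue and order_date are made up for the example.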

Basic Exploration

peek [n] | Preview the top n rows of the dataset. Example: peek 10.
missing | Display a report of missing (null) values in every column, sorted by severity.
describe * | Generate standard statistical summaries (mean, std, quartiles, etc.) for all numeric columns.
describe [target] | Generate a deep summary profile for a single column.
find [stat(target)] [by col] | Calculate aggregate statistics. Supported functions: mean, median, sum, std, variance, min, max, count. Example: find mean(income) by department.
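For readers who think in code, a query such as find mean(income) by department corresponds roughly to a group-then-aggregate step. The sketch below shows that idea in plain Python with made-up rows; the helper mean_by is hypothetical and is not the code Scarab generates.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample rows standing in for a loaded dataset.
rows = [
    {"department": "Sales", "income": 50000},
    {"department": "Sales", "income": 60000},
    {"department": "Engineering", "income": 90000},
]


def mean_by(rows, value_col, group_col):
    """Average value_col within each distinct value of group_col."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row[value_col])
    return {key: mean(values) for key, values in groups.items()}


# Rough equivalent of: find mean(income) by department
result = mean_by(rows, "income", "department")
```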

Advanced Analytics & Machine Learning

predict [target] from [col1, col2...] | Automatically fit a regression model predicting the target from the provided inputs.
explain [target] | Automatically discover and rank the strongest predictors (drivers) of a target column.
outliers [target] | Detect anomalies in a column using robust statistical thresholds (IQR fences and Z-scores).
compare [target] where [filterA] vs [filterB] | Perform group-based statistical tests (Welch's t-test or Mann-Whitney U) between two custom data subsets. Example: compare income where gender = "M" vs gender = "F".
correlate [col] with [col] | Evaluate Pearson or Spearman correlation and statistical significance between two columns.
correlate * | Generate a pairwise correlation matrix across all numeric columns in the dataset.
cluster by [col1, col2...] [into k groups] | Perform unsupervised K-Means clustering across specified features. Default k=3.
segment [target] into [n] groups [by quartile|width] | Auto-segment numerical data using quartile grouping (default) or equal-width binning.
trend [target] over [date_col] | Group the target metric chronologically by a date/time column to visualize its trend over time.
rank [target] [top|bottom N] | Show the top or bottom N rows sorted by value, with rank and percentile. Defaults to top 10. Example: rank income top 5 or rank score bottom 20.
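As one example of what these commands compute, the IQR fences mentioned under outliers can be sketched in a few lines of Python. The multiplier k=1.5 is the conventional Tukey choice and is an assumption here; the documentation does not state which threshold Scarab uses, and the function name iqr_outliers is hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside the IQR fences.

    k=1.5 is the conventional multiplier, assumed for illustration.
    """
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up sample with one obvious anomaly.
sample = [10, 12, 11, 13, 12, 11, 95]
flagged = iqr_outliers(sample)
```

For this sample the quartiles are 11 and 13, so the fences sit at 8 and 16 and only the value 95 is flagged.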

Visualizations

histogram [target] [into N bins] | Plot a histogram of the target variable. Optionally specify the bin count (2–200).
scatter [x] vs [y] | Render a scatterplot of the relationship between two columns.
crosstab [col1] by [col2] | Produce a frequency cross-tabulation showing how often each pair of categories occurs together.
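A cross-tabulation is simply a count of how often each pair of values occurs. The sketch below illustrates the idea with made-up rows and a hypothetical crosstab helper; it is not Scarab's implementation, and the column names are invented.

```python
from collections import Counter

# Made-up sample rows standing in for a loaded dataset.
rows = [
    {"region": "North", "plan": "Basic"},
    {"region": "North", "plan": "Pro"},
    {"region": "South", "plan": "Basic"},
    {"region": "North", "plan": "Basic"},
]


def crosstab(rows, row_col, col_col):
    """Count the occurrences of each (row_col, col_col) value pair."""
    return Counter((row[row_col], row[col_col]) for row in rows)


# Rough equivalent of: crosstab region by plan
table = crosstab(rows, "region", "plan")
```

Because Counter returns 0 for missing keys, pairs that never occur (such as South with Pro here) read as empty cells.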

Filtering Expressions

Add a where clause after any primary action to filter the dataset before the computation runs. Supported operators include =, !=, >, <, >=, <=, in, is null, and is not null.

Example: find mean(salary) where role = "Engineer" and age > 30
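The example above filters first and aggregates second. A minimal Python sketch of the same logic, using made-up employee records, looks like this; it illustrates the semantics only and is not the code Scarab exports.

```python
from statistics import mean

# Made-up sample rows standing in for a loaded dataset.
employees = [
    {"role": "Engineer", "age": 35, "salary": 100000},
    {"role": "Engineer", "age": 28, "salary": 80000},
    {"role": "Engineer", "age": 41, "salary": 120000},
    {"role": "Analyst", "age": 45, "salary": 70000},
]

# where role = "Engineer" and age > 30
matching = [e for e in employees if e["role"] == "Engineer" and e["age"] > 30]

# find mean(salary), computed over the filtered rows only
average_salary = mean([e["salary"] for e in matching])
```

Only the two engineers older than 30 contribute to the average.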

Data Transformations and Pipelines (|>)

Use the pipe operator (|>), or the equivalent keyword then, to pass the output of a query into a statistical test or transformation. Pipe operations can be chained.

|> test significance | Run an automatic hypothesis test (e.g., Student's t-test) on a grouped comparison query.
|> test normality | Test whether the piped values are consistent with a normal distribution.
|> test equal variance | Test homogeneity of variance across the split groups.
|> transform [log|sqrt|square|normalize|minmax|winsorize|boxcox] | Apply the named transformation to the piped values.
|> smooth [n] | Apply rolling-window smoothing to trend output using a window of size n.
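Two of these pipe operations are easy to picture in code. The sketch below assumes that smooth is a simple moving average and that minmax rescales values to the range 0 to 1; both are common definitions, but the documentation does not confirm how Scarab aligns the window or handles edge cases. The function names are hypothetical.

```python
def rolling_mean(values, n):
    """Simple moving average over full windows of size n (assumed behavior)."""
    if n < 1 or n > len(values):
        raise ValueError("window size must be between 1 and len(values)")
    # One output per complete window, so the result is shorter than the input.
    return [sum(values[i:i + n]) / n for i in range(len(values) - n + 1)]


def minmax(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot rescale a constant series")
    return [(v - lo) / (hi - lo) for v in values]


smoothed = rolling_mean([1, 2, 3, 4, 5], 3)
scaled = minmax([2, 4, 6])
```

This sketch drops the incomplete windows at the start of the series; a tool could instead pad or center the window, which would change the length and alignment of the output.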

Code Generation: Scarab automatically generates portable Python and R code for every operation described above. Instead of writing dataframe code by hand, you can query with ScarabQL and export the equivalent code into your own scripts.