{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# PySIPS Tutorial\n", "\n", "This tutorial demonstrates key features of PySIPS through practical examples and common use cases.\n", "\n", "**Note:** This is an interactive tutorial notebook. You can download and run it locally to experiment with the examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table of Contents\n", "\n", "1. [Basic Symbolic Regression](#1.-Basic-Symbolic-Regression)\n", "2. [Multivariate Regression](#2.-Multivariate-Regression)\n", "3. [Custom Operators](#3.-Customizing-Available-Operators)\n", "4. [Model Selection Strategies](#4.-Model-Selection-Strategies)\n", "5. [Checkpointing for Long Runs](#5.-Checkpointing-for-Long-Runs)\n", "6. [Hyperparameter Tuning](#6.-Hyperparameter-Tuning)\n", "7. [Analyzing the Posterior Distribution](#7.-Analyzing-the-Posterior-Distribution)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Basic Symbolic Regression\n", "\n", "Let's start with a simple univariate symbolic regression problem." 
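, "\n", "\n", "If PySIPS is not installed yet, it can typically be installed from PyPI (the package name `pysips` is assumed here; see the [GitHub repository](https://github.com/nasa/pysips) for authoritative installation instructions):\n", "\n", "```bash\n", "pip install pysips\n", "```"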
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from pysips import PysipsRegressor\n", "import matplotlib.pyplot as plt\n", "\n", "# Generate data: y = sin(x) + noise\n", "np.random.seed(42)\n", "X = np.linspace(-np.pi, np.pi, 100).reshape(-1, 1)\n", "y = np.sin(X[:, 0]) + np.random.normal(0, 0.1, size=X.shape[0])\n", "\n", "# Create and fit the regressor\n", "regressor = PysipsRegressor(\n", " operators=['+', '-', '*', '/'],\n", " max_complexity=15,\n", " num_particles=100,\n", " num_mcmc_samples=10,\n", " max_time=120,\n", " random_state=42\n", ")\n", "\n", "regressor.fit(X, y)\n", "\n", "# Get results\n", "expression = regressor.get_expression()\n", "y_pred = regressor.predict(X)\n", "\n", "print(f\"Discovered expression: {expression}\")\n", "\n", "# Visualize results\n", "plt.figure(figsize=(10, 6))\n", "plt.scatter(X, y, alpha=0.5, label='Data')\n", "plt.plot(X, y_pred, 'r-', linewidth=2, label='Predicted')\n", "plt.xlabel('x')\n", "plt.ylabel('y')\n", "plt.legend()\n", "plt.title(f'Discovered: {expression}')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Multivariate Regression\n", "\n", "PySIPS handles multiple input variables naturally." 
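, "\n", "\n", "Each column of `X` supplies one candidate input variable for the search, and discovered expressions refer to columns by index (the exact display names, e.g. `X_0` and `X_1`, are an assumption here). A minimal sketch of assembling a multivariate feature matrix, independent of PySIPS itself:\n", "\n", "```python\n", "import numpy as np\n", "\n", "rng = np.random.default_rng(0)\n", "x0 = rng.normal(size=200)          # first feature  -> variable 0\n", "x1 = rng.uniform(-1, 1, size=200)  # second feature -> variable 1\n", "X = np.column_stack([x0, x1])      # shape (200, 2): one column per variable\n", "y = x0**2 + 2.0 * x1               # target depends on both columns\n", "print(X.shape, y.shape)            # (200, 2) (200,)\n", "```"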
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import r2_score, mean_squared_error\n", "\n", "# Generate data: y = x0^2 + 2*x1 - 3*x2 + noise\n", "np.random.seed(42)\n", "n_samples = 200\n", "X = np.random.randn(n_samples, 3)\n", "y = X[:, 0]**2 + 2*X[:, 1] - 3*X[:, 2] + np.random.normal(0, 0.1, n_samples)\n", "\n", "# Split the data\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.2, random_state=42\n", ")\n", "\n", "# Create regressor with power operator\n", "regressor = PysipsRegressor(\n", " operators=['+', '-', '*', 'pow'],\n", " max_complexity=20,\n", " num_particles=150,\n", " num_mcmc_samples=10,\n", " max_time=180,\n", " random_state=42\n", ")\n", "\n", "# Fit and evaluate\n", "regressor.fit(X_train, y_train)\n", "y_pred = regressor.predict(X_test)\n", "\n", "expression = regressor.get_expression()\n", "r2 = r2_score(y_test, y_pred)\n", "mse = mean_squared_error(y_test, y_pred)\n", "\n", "print(f\"Discovered expression: {expression}\")\n", "print(f\"R\u00b2 score: {r2:.4f}\")\n", "print(f\"MSE: {mse:.4f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 
Customizing Available Operators\n", "\n", "You can control which mathematical operators are available.\n", "\n", "### Available operators:\n", "\n", "- Arithmetic: `'+'`, `'-'`, `'*'`, `'/'`\n", "- Power: `'pow'`, `'sqrt'`\n", "- Trigonometric: `'sin'`, `'cos'`, `'tan'`\n", "- Exponential/Logarithmic: `'exp'`, `'log'`\n", "- Other: `'abs'`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Example 1: Simple arithmetic only\n", "regressor_simple = PysipsRegressor(\n", " operators=['+', '-', '*'], # No division\n", " max_complexity=10\n", ")\n", "\n", "# Example 2: Include division and power\n", "regressor_extended = PysipsRegressor(\n", " operators=['+', '-', '*', '/', 'pow'],\n", " max_complexity=15\n", ")\n", "\n", "# Example 3: Include trigonometric functions\n", "regressor_trig = PysipsRegressor(\n", " operators=['+', '-', '*', 'sin', 'cos'],\n", " max_complexity=15\n", ")\n", "\n", "# Example 4: Include exponential and logarithm\n", "regressor_exp = PysipsRegressor(\n", " operators=['+', '-', '*', 'exp', 'log'],\n", " max_complexity=15\n", ")\n", "\n", "print(\"Operators configured successfully!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. 
Model Selection Strategies\n", "\n", "PySIPS offers two strategies for selecting the final model:\n", "\n", "- **`'mode'`**: More robust, favors expressions that appear frequently in the posterior\n", "- **`'max_likelihood'`**: Favors the single best-scoring expression, may overfit" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = np.random.randn(100, 2)\n", "y = X[:, 0]**2 + X[:, 1] + np.random.normal(0, 0.1, 100)\n", "\n", "# Strategy 1: Mode selection (most frequently sampled)\n", "regressor_mode = PysipsRegressor(\n", " operators=['+', '*', 'pow'],\n", " model_selection='mode', # Default\n", " num_particles=100,\n", " random_state=42\n", ")\n", "regressor_mode.fit(X, y)\n", "print(f\"Mode selection: {regressor_mode.get_expression()}\")\n", "\n", "# Strategy 2: Maximum likelihood selection\n", "regressor_maxlik = PysipsRegressor(\n", " operators=['+', '*', 'pow'],\n", " model_selection='max_likelihood',\n", " num_particles=100,\n", " random_state=42\n", ")\n", "regressor_maxlik.fit(X, y)\n", "print(f\"Max likelihood selection: {regressor_maxlik.get_expression()}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Checkpointing for Long Runs\n", "\n", "For long-running experiments, use checkpointing to save progress." 
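, "\n", "\n", "Before launching a long run, it can be worth checking whether a checkpoint from an earlier session is present (a small housekeeping sketch using only the standard library; the filename matches the example below, and the actual resume logic is handled by PySIPS itself):\n", "\n", "```python\n", "from pathlib import Path\n", "\n", "ckpt = Path(\"tutorial_experiment.checkpoint\")\n", "if ckpt.exists():\n", "    print(f\"Found checkpoint ({ckpt.stat().st_size} bytes); fit() should resume from it\")\n", "else:\n", "    print(\"No checkpoint found; fit() will start from scratch\")\n", "```"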
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = np.random.randn(500, 3)\n", "y = np.sum(X**2, axis=1) + np.random.normal(0, 0.1, 500)\n", "\n", "# Specify a checkpoint file\n", "regressor = PysipsRegressor(\n", " operators=['+', '-', '*', 'pow'],\n", " max_complexity=25,\n", " num_particles=200,\n", " checkpoint_file='tutorial_experiment.checkpoint', # Save/load from this file\n", " max_time=600, # 10 minutes\n", " random_state=42\n", ")\n", "\n", "# First run: starts from scratch and saves progress\n", "regressor.fit(X, y)\n", "\n", "expression = regressor.get_expression()\n", "print(f\"Discovered expression: {expression}\")\n", "\n", "# Note: If interrupted, subsequent runs will resume from checkpoint" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Hyperparameter Tuning\n", "\n", "PySIPS is compatible with scikit-learn's hyperparameter tuning tools." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\n", "from sklearn.datasets import make_friedman1\n", "\n", "# Generate data\n", "X, y = make_friedman1(n_samples=200, n_features=5, noise=0.1, random_state=42)\n", "\n", "# Define parameter grid\n", "param_grid = {\n", " 'max_complexity': [10, 15, 20],\n", " 'num_particles': [50, 100, 150],\n", "}\n", "\n", "# Create base regressor (disable progress bar for cleaner output)\n", "base_regressor = PysipsRegressor(\n", " operators=['+', '-', '*'],\n", " num_mcmc_samples=10,\n", " max_time=60,\n", " show_progress_bar=False, # Important for grid search\n", " random_state=42\n", ")\n", "\n", "# Perform grid search\n", "grid_search = GridSearchCV(\n", " base_regressor,\n", " param_grid,\n", " cv=3,\n", " scoring='r2',\n", " n_jobs=1 # PySIPS doesn't support parallelization at this level\n", ")\n", "\n", "grid_search.fit(X, y)\n", "\n", "print(f\"Best parameters: {grid_search.best_params_}\")\n", "print(f\"Best 
score: {grid_search.best_score_:.4f}\")\n", "\n", "# Use best estimator\n", "best_expression = grid_search.best_estimator_.get_expression()\n", "print(f\"Best expression: {best_expression}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Analyzing the Posterior Distribution\n", "\n", "Access the full posterior distribution for uncertainty quantification." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Generate data\n", "np.random.seed(42)\n", "X = np.random.randn(100, 2)\n", "y = X[:, 0]**2 + 2*X[:, 1] + np.random.normal(0, 0.1, 100)\n", "\n", "# Fit regressor\n", "regressor = PysipsRegressor(\n", " operators=['+', '-', '*', 'pow'],\n", " max_complexity=15,\n", " num_particles=200,\n", " num_mcmc_samples=10,\n", " max_time=120,\n", " random_state=42\n", ")\n", "regressor.fit(X, y)\n", "\n", "# Get all sampled models and their likelihoods\n", "models, likelihoods = regressor.get_models()\n", "\n", "print(f\"Total unique models sampled: {len(models)}\")\n", "print(f\"\\nTop 5 models by likelihood:\")\n", "\n", "# Sort by likelihood\n", "sorted_indices = np.argsort(likelihoods)[::-1]\n", "\n", "for i in range(min(5, len(models))):\n", " idx = sorted_indices[i]\n", " print(f\"{i+1}. 
{models[idx]} (likelihood: {likelihoods[idx]:.4f})\")\n", "\n", "# Report the model chosen by 'mode' selection (the posterior mode)\n", "print(f\"\\nSelected model: {regressor.get_expression()}\")\n", "print(\"(This is the most frequently sampled model)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Advanced Topics\n", "\n", "### Mutation and Crossover Control\n", "\n", "Fine-tune the proposal mechanism probabilities:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regressor = PysipsRegressor(\n", " operators=['+', '-', '*'],\n", " # Mutation probabilities\n", " command_probability=0.3, # Probability of replacing a whole command (node plus its inputs)\n", " node_probability=0.3, # Probability of changing a node's operator\n", " parameter_probability=0.2, # Probability of changing edges in an expression graph\n", " prune_probability=0.1, # Probability of pruning (simplifying) a branch\n", " fork_probability=0.1, # Probability of forking (growing a branch)\n", " # Crossover settings\n", " crossover_pool_size=20, # Size of gene pool for crossover\n", " random_state=42\n", ")\n", "print(\"Advanced configuration set!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Expression Complexity Control" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regressor = PysipsRegressor(\n", " operators=['+', '-', '*'],\n", " max_complexity=10, # Maximum expression size\n", " terminal_probability=0.5, # Probability of leaf nodes\n", " constant_probability=0.3, # Probability of constants vs variables\n", " random_state=42\n", ")\n", "print(\"Complexity controls configured!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### SMC Sampling Control" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regressor = PysipsRegressor(\n", " operators=['+', '-', '*'],\n", " num_particles=100, # Population size\n", " num_mcmc_samples=10, # MCMC steps per SMC iteration\n", " target_ess=0.8, 
# Target effective sample size (0-1)\n", " random_state=42\n", ")\n", "print(\"SMC parameters configured!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Best Practices\n", "\n", "1. **Start simple**: Begin with basic operators and small complexity limits\n", "2. **Use checkpoints**: For experiments taking >5 minutes, always use checkpointing\n", "3. **Reproducibility**: Always set `random_state` for reproducible results\n", "4. **Progress monitoring**: Use `show_progress_bar=True` (default) for interactive use\n", "5. **Hyperparameter tuning**: Disable progress bar when using GridSearchCV\n", "6. **Model validation**: Always validate on held-out test data\n", "7. **Posterior inspection**: Check the full posterior distribution, not just the best model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Troubleshooting\n", "\n", "### Issue: Fitting is too slow\n", "\n", "- Reduce `num_particles` (e.g., 50-100)\n", "- Reduce `max_complexity` (e.g., 10-15)\n", "- Set `max_time` to limit runtime\n", "- Simplify the operator set\n", "\n", "### Issue: Poor model quality\n", "\n", "- Increase `num_particles` (e.g., 200-500)\n", "- Increase `num_mcmc_samples` (e.g., 20-50)\n", "- Increase `max_complexity` if needed\n", "- Add more relevant operators\n", "- Collect more/better training data\n", "\n", "### Issue: Overfitting\n", "\n", "- Use cross-validation for model selection\n", "- Reduce `max_complexity`\n", "- Use `model_selection='mode'` instead of `'max_likelihood'`\n", "\n", "### Issue: Checkpoint file corrupted\n", "\n", "- Delete the checkpoint file and start fresh\n", "- Ensure sufficient disk space" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Further Resources\n", "\n", "- [GitHub Repository](https://github.com/nasa/pysips)\n", "- [API Reference](https://nasa.github.io/pysips/api/modules.html)\n", "- [Paper: Bayesian Symbolic Regression via Posterior Sampling](https://arxiv.org/abs/2512.10849)\n", "\n", "For 
more examples, see the `demos/` directory in the repository." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" } }, "nbformat": 4, "nbformat_minor": 4 }