{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# PySIPS Tutorial\n", "\n", "This tutorial demonstrates key features of PySIPS through practical examples and common use cases.\n", "\n", "**Note:** This is an interactive tutorial notebook. You can download and run it locally to experiment with the examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table of Contents\n", "\n", "1. [Basic Symbolic Regression](#1.-Basic-Symbolic-Regression)\n", "2. [Multivariate Regression](#2.-Multivariate-Regression)\n", "3. [Custom Operators](#3.-Customizing-Available-Operators)\n", "4. [Model Selection Strategies](#4.-Model-Selection-Strategies)\n", "5. [Checkpointing for Long Runs](#5.-Checkpointing-for-Long-Runs)\n", "6. [Hyperparameter Tuning](#6.-Hyperparameter-Tuning)\n", "7. [Analyzing the Posterior Distribution](#7.-Analyzing-the-Posterior-Distribution)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Basic Symbolic Regression\n", "\n", "Let's start with a simple univariate symbolic regression problem." 
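, "\n", "\n", "If PySIPS is not installed yet, it can typically be installed from PyPI (the package name `pysips` is assumed here; see the [GitHub repository](https://github.com/nasa/pysips) for authoritative installation instructions):\n", "\n", "```bash\n", "pip install pysips\n", "```"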
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from pysips import PysipsRegressor\n", "import matplotlib.pyplot as plt\n", "\n", "# Generate data: y = sin(x) + noise\n", "np.random.seed(42)\n", "X = np.linspace(-np.pi, np.pi, 100).reshape(-1, 1)\n", "y = np.sin(X[:, 0]) + np.random.normal(0, 0.1, size=X.shape[0])\n", "\n", "# Create and fit the regressor\n", "regressor = PysipsRegressor(\n", " operators=['+', '-', '*', '/'],\n", " max_complexity=15,\n", " num_particles=100,\n", " num_mcmc_samples=10,\n", " max_time=120,\n", " random_state=42\n", ")\n", "\n", "regressor.fit(X, y)\n", "\n", "# Get results\n", "expression = regressor.get_expression()\n", "y_pred = regressor.predict(X)\n", "\n", "print(f\"Discovered expression: {expression}\")\n", "\n", "# Visualize results\n", "plt.figure(figsize=(10, 6))\n", "plt.scatter(X, y, alpha=0.5, label='Data')\n", "plt.plot(X, y_pred, 'r-', linewidth=2, label='Predicted')\n", "plt.xlabel('x')\n", "plt.ylabel('y')\n", "plt.legend()\n", "plt.title(f'Discovered: {expression}')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Multivariate Regression\n", "\n", "PySIPS handles multiple input variables naturally." 
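, "\n", "\n", "Each column of `X` supplies one candidate input variable for the search, and discovered expressions refer to columns by index (the exact display names, e.g. `X_0` and `X_1`, are an assumption here). A minimal sketch of assembling a multivariate feature matrix, independent of PySIPS itself:\n", "\n", "```python\n", "import numpy as np\n", "\n", "rng = np.random.default_rng(0)\n", "x0 = rng.normal(size=200)          # first feature  -> variable 0\n", "x1 = rng.uniform(-1, 1, size=200)  # second feature -> variable 1\n", "X = np.column_stack([x0, x1])      # shape (200, 2): one column per variable\n", "y = x0**2 + 2.0 * x1               # target depends on both columns\n", "print(X.shape, y.shape)            # (200, 2) (200,)\n", "```"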
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import r2_score, mean_squared_error\n", "\n", "# Generate data: y = x0^2 + 2*x1 - 3*x2 + noise\n", "np.random.seed(42)\n", "n_samples = 200\n", "X = np.random.randn(n_samples, 3)\n", "y = X[:, 0]**2 + 2*X[:, 1] - 3*X[:, 2] + np.random.normal(0, 0.1, n_samples)\n", "\n", "# Split the data\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.2, random_state=42\n", ")\n", "\n", "# Create regressor with power operator\n", "regressor = PysipsRegressor(\n", " operators=['+', '-', '*', 'pow'],\n", " max_complexity=20,\n", " num_particles=150,\n", " num_mcmc_samples=10,\n", " max_time=180,\n", " random_state=42\n", ")\n", "\n", "# Fit and evaluate\n", "regressor.fit(X_train, y_train)\n", "y_pred = regressor.predict(X_test)\n", "\n", "expression = regressor.get_expression()\n", "r2 = r2_score(y_test, y_pred)\n", "mse = mean_squared_error(y_test, y_pred)\n", "\n", "print(f\"Discovered expression: {expression}\")\n", "print(f\"R\u00b2 score: {r2:.4f}\")\n", "print(f\"MSE: {mse:.4f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 
Customizing Available Operators\n", "\n", "You can control which mathematical operators are available.\n", "\n", "### Available operators:\n", "\n", "- Arithmetic: `'+'`, `'-'`, `'*'`, `'/'`\n", "- Power: `'pow'`, `'sqrt'`\n", "- Trigonometric: `'sin'`, `'cos'`, `'tan'`\n", "- Exponential/Logarithmic: `'exp'`, `'log'`\n", "- Other: `'abs'`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Example 1: Simple arithmetic only\n", "regressor_simple = PysipsRegressor(\n", " operators=['+', '-', '*'], # No division\n", " max_complexity=10\n", ")\n", "\n", "# Example 2: Include division and power\n", "regressor_extended = PysipsRegressor(\n", " operators=['+', '-', '*', '/', 'pow'],\n", " max_complexity=15\n", ")\n", "\n", "# Example 3: Include trigonometric functions\n", "regressor_trig = PysipsRegressor(\n", " operators=['+', '-', '*', 'sin', 'cos'],\n", " max_complexity=15\n", ")\n", "\n", "# Example 4: Include exponential and logarithm\n", "regressor_exp = PysipsRegressor(\n", " operators=['+', '-', '*', 'exp', 'log'],\n", " max_complexity=15\n", ")\n", "\n", "print(\"Operators configured successfully!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. 
Model Selection Strategies\n", "\n", "PySIPS offers two strategies for selecting the final model:\n", "\n", "- **`'mode'`**: More robust, favors expressions that appear frequently in the posterior\n", "- **`'max_likelihood'`**: Favors the single best-scoring expression, may overfit" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = np.random.randn(100, 2)\n", "y = X[:, 0]**2 + X[:, 1] + np.random.normal(0, 0.1, 100)\n", "\n", "# Strategy 1: Mode selection (most frequently sampled)\n", "regressor_mode = PysipsRegressor(\n", " operators=['+', '*', 'pow'],\n", " model_selection='mode', # Default\n", " num_particles=100,\n", " random_state=42\n", ")\n", "regressor_mode.fit(X, y)\n", "print(f\"Mode selection: {regressor_mode.get_expression()}\")\n", "\n", "# Strategy 2: Maximum likelihood selection\n", "regressor_maxlik = PysipsRegressor(\n", " operators=['+', '*', 'pow'],\n", " model_selection='max_likelihood',\n", " num_particles=100,\n", " random_state=42\n", ")\n", "regressor_maxlik.fit(X, y)\n", "print(f\"Max likelihood selection: {regressor_maxlik.get_expression()}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Checkpointing for Long Runs\n", "\n", "For long-running experiments, use checkpointing to save progress." 
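, "\n", "\n", "Before launching a long run, it can be worth checking whether a checkpoint from an earlier session is present (a small housekeeping sketch using only the standard library; the filename matches the example below, and the actual resume logic is handled by PySIPS itself):\n", "\n", "```python\n", "from pathlib import Path\n", "\n", "ckpt = Path(\"tutorial_experiment.checkpoint\")\n", "if ckpt.exists():\n", "    print(f\"Found checkpoint ({ckpt.stat().st_size} bytes); fit() should resume from it\")\n", "else:\n", "    print(\"No checkpoint found; fit() will start from scratch\")\n", "```"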
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = np.random.randn(500, 3)\n", "y = np.sum(X**2, axis=1) + np.random.normal(0, 0.1, 500)\n", "\n", "# Specify a checkpoint file\n", "regressor = PysipsRegressor(\n", " operators=['+', '-', '*', 'pow'],\n", " max_complexity=25,\n", " num_particles=200,\n", " checkpoint_file='tutorial_experiment.checkpoint', # Save/load from this file\n", " max_time=600, # 10 minutes\n", " random_state=42\n", ")\n", "\n", "# First run: starts from scratch and saves progress\n", "regressor.fit(X, y)\n", "\n", "expression = regressor.get_expression()\n", "print(f\"Discovered expression: {expression}\")\n", "\n", "# Note: If interrupted, subsequent runs will resume from checkpoint" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Hyperparameter Tuning\n", "\n", "PySIPS is compatible with scikit-learn's hyperparameter tuning tools." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\n", "from sklearn.datasets import make_friedman1\n", "\n", "# Generate data\n", "X, y = make_friedman1(n_samples=200, n_features=5, noise=0.1, random_state=42)\n", "\n", "# Define parameter grid\n", "param_grid = {\n", " 'max_complexity': [10, 15, 20],\n", " 'num_particles': [50, 100, 150],\n", "}\n", "\n", "# Create base regressor (disable progress bar for cleaner output)\n", "base_regressor = PysipsRegressor(\n", " operators=['+', '-', '*'],\n", " num_mcmc_samples=10,\n", " max_time=60,\n", " show_progress_bar=False, # Important for grid search\n", " random_state=42\n", ")\n", "\n", "# Perform grid search\n", "grid_search = GridSearchCV(\n", " base_regressor,\n", " param_grid,\n", " cv=3,\n", " scoring='r2',\n", " n_jobs=1 # PySIPS doesn't support parallelization at this level\n", ")\n", "\n", "grid_search.fit(X, y)\n", "\n", "print(f\"Best parameters: {grid_search.best_params_}\")\n", "print(f\"Best 
score: {grid_search.best_score_:.4f}\")\n", "\n", "# Use best estimator\n", "best_expression = grid_search.best_estimator_.get_expression()\n", "print(f\"Best expression: {best_expression}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Analyzing the Posterior Distribution\n", "\n", "Access the full posterior distribution for uncertainty quantification." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Generate data\n", "np.random.seed(42)\n", "X = np.random.randn(100, 2)\n", "y = X[:, 0]**2 + 2*X[:, 1] + np.random.normal(0, 0.1, 100)\n", "\n", "# Fit regressor\n", "regressor = PysipsRegressor(\n", " operators=['+', '-', '*', 'pow'],\n", " max_complexity=15,\n", " num_particles=200,\n", " num_mcmc_samples=10,\n", " max_time=120,\n", " random_state=42\n", ")\n", "regressor.fit(X, y)\n", "\n", "# Get all sampled models and their likelihoods\n", "models, likelihoods = regressor.get_models()\n", "\n", "print(f\"Total unique models sampled: {len(models)}\")\n", "print(f\"\\nTop 5 models by likelihood:\")\n", "\n", "# Sort by likelihood\n", "sorted_indices = np.argsort(likelihoods)[::-1]\n", "\n", "for i in range(min(5, len(models))):\n", " idx = sorted_indices[i]\n", " print(f\"{i+1}. 
{models[idx]} (likelihood: {likelihoods[idx]:.4f})\")\n", "\n", "# Report the model chosen by 'mode' selection (the posterior mode)\n", "print(f\"\\nSelected model: {regressor.get_expression()}\")\n", "print(\"(This is the most frequently sampled model)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Advanced Topics\n", "\n", "### Mutation and Crossover Control\n", "\n", "Fine-tune the proposal mechanism probabilities:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regressor = PysipsRegressor(\n", " operators=['+', '-', '*'],\n", " # Mutation probabilities\n", " command_probability=0.3, # Probability of replacing a whole command (node plus its inputs)\n", " node_probability=0.3, # Probability of changing a node's operator\n", " parameter_probability=0.2, # Probability of changing edges in an expression graph\n", " prune_probability=0.1, # Probability of pruning (simplifying) a branch\n", " fork_probability=0.1, # Probability of forking (growing a branch)\n", " # Crossover settings\n", " crossover_pool_size=20, # Size of gene pool for crossover\n", " random_state=42\n", ")\n", "print(\"Advanced configuration set!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Expression Complexity Control" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regressor = PysipsRegressor(\n", " operators=['+', '-', '*'],\n", " max_complexity=10, # Maximum expression size\n", " terminal_probability=0.5, # Probability of leaf nodes\n", " constant_probability=0.3, # Probability of constants vs variables\n", " random_state=42\n", ")\n", "print(\"Complexity controls configured!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### SMC Sampling Control" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regressor = PysipsRegressor(\n", " operators=['+', '-', '*'],\n", " num_particles=100, # Population size\n", " num_mcmc_samples=10, # MCMC steps per SMC iteration\n", " target_ess=0.8, 
# Target effective sample size (0-1)\n", " random_state=42\n", ")\n", "print(\"SMC parameters configured!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Best Practices\n", "\n", "1. **Start simple**: Begin with basic operators and small complexity limits\n", "2. **Use checkpoints**: For experiments taking >5 minutes, always use checkpointing\n", "3. **Reproducibility**: Always set `random_state` for reproducible results\n", "4. **Progress monitoring**: Use `show_progress_bar=True` (default) for interactive use\n", "5. **Hyperparameter tuning**: Disable progress bar when using GridSearchCV\n", "6. **Model validation**: Always validate on held-out test data\n", "7. **Posterior inspection**: Check the full posterior distribution, not just the best model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Troubleshooting\n", "\n", "### Issue: Fitting is too slow\n", "\n", "- Reduce `num_particles` (e.g., 50-100)\n", "- Reduce `max_complexity` (e.g., 10-15)\n", "- Set `max_time` to limit runtime\n", "- Simplify the operator set\n", "\n", "### Issue: Poor model quality\n", "\n", "- Increase `num_particles` (e.g., 200-500)\n", "- Increase `num_mcmc_samples` (e.g., 20-50)\n", "- Increase `max_complexity` if needed\n", "- Add more relevant operators\n", "- Collect more/better training data\n", "\n", "### Issue: Overfitting\n", "\n", "- Use cross-validation for model selection\n", "- Reduce `max_complexity`\n", "- Use `model_selection='mode'` instead of `'max_likelihood'`\n", "\n", "### Issue: Checkpoint file corrupted\n", "\n", "- Delete the checkpoint file and start fresh\n", "- Ensure sufficient disk space" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Further Resources\n", "\n", "- [GitHub Repository](https://github.com/nasa/pysips)\n", "- [API Reference](https://nasa.github.io/pysips/api/modules.html)\n", "- [Paper: Bayesian Symbolic Regression via Posterior Sampling](https://arxiv.org/abs/2512.10849)\n", "\n", "For 
more examples, see the `demos/` directory in the repository." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" } }, "nbformat": 4, "nbformat_minor": 4 }