bingo.symbolic_regression package#

Subpackages#

Submodules#

bingo.symbolic_regression.atomic_potential_regression module#

Symbolic Regression of Inter-atomic Potentials

Symbolic regression of inter-atomic potentials is defined in this module as the search for a function, f, such that sum( f(r_i) ) = U, where U is the total potential energy of an atomic configuration whose atoms are separated by distances r_i.

The classes in this module encapsulate the parts of bingo evolutionary analysis that are unique to symbolic regression of inter-atomic potentials. Namely, these classes are an appropriate fitness evaluator and a corresponding training data container.

class bingo.symbolic_regression.atomic_potential_regression.PairwiseAtomicPotential(training_data=None, metric='mae')#

Bases: VectorBasedFunction

Fitness based on total potential energy of a set of configurations.

Pairwise atomic potential that is fit to the total potential energy of a set of configurations. Fitness measures how well the total potential energies are matched by the sum of pairwise energies calculated by the Equation individual:

fitness = sum(abs(sum(\(f(r_i)\)) - \(U_{{true}_i}\)))

for i in config

Parameters:

training_data (PairwiseAtomicTrainingData) – data that is used in fitness evaluation. Must have attributes r, potential_energy and config_lims_r.

evaluate_fitness_vector(individual)#

Evaluates the fitness of an individual based on how well training_data’s total potential energies are matched by the summation of pairwise energies calculated by the individual.

fitness = sum(abs(sum(\(f(r_i)\)) - \(U_{{true}_i}\)))

for i in config

Parameters:

individual (Equation) – individual whose fitness will be evaluated

Returns:

fitness – a vector of individual’s fitness values

Return type:

list of numeric
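The fitness vector described above can be sketched in plain NumPy, without bingo itself. The sketch assumes that config_lims_r holds cumulative offsets into the flat array of distances, so that configuration i owns r[config_lims_r[i]:config_lims_r[i + 1]]. That layout, the Lennard-Jones-style pair_energy function, and the toy numbers are illustrative assumptions, not the library's implementation.

```python
import numpy as np


def pair_energy(r):
    """Hypothetical candidate potential f(r) (Lennard-Jones form)."""
    return 4.0 * (r ** -12 - r ** -6)


def pairwise_fitness_vector(f, r, config_lims_r, potential_energy):
    """Return |sum_j f(r_j) - U_i| for each configuration i."""
    energies = f(r)
    errors = []
    for i, u_true in enumerate(potential_energy):
        start, stop = config_lims_r[i], config_lims_r[i + 1]
        errors.append(abs(energies[start:stop].sum() - u_true))
    return np.array(errors)


# Two toy configurations: the first has 3 pair distances, the second has 2.
r = np.array([1.0, 1.1, 1.2, 1.0, 1.5])
config_lims_r = np.array([0, 3, 5])  # assumed cumulative offsets into r
# Build the "true" energies from the same potential, so the error vanishes.
potential_energy = np.array([pair_energy(r[0:3]).sum(),
                             pair_energy(r[3:5]).sum()])

fitness_vector = pairwise_fitness_vector(
    pair_energy, r, config_lims_r, potential_energy)
```

One entry is produced per configuration, so a candidate that reproduces every total energy yields a vector of zeros.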

class bingo.symbolic_regression.atomic_potential_regression.PairwiseAtomicTrainingData(potential_energy, configurations=None, r_list=None, config_lims_r=None)#

Bases: TrainingData

PairwiseAtomicTrainingData:

Training data of this type contains distances (r) between atoms in several atomic configurations. Each configuration has an associated potential energy. The r values belonging to each configuration are bounded by configuration limits (config_lims_r).

Parameters:
  • potential_energy (1d numpy array) – potential energy for each configuration

  • configurations (list of tuples (structure, period, r_cutoff), optional) – structure is an array of x, y, z locations of atoms; period is the periodic size of the configuration; r_cutoff is the cutoff distance beyond which the pairwise interaction does not affect the potential energy.

  • r_list (2d numpy array) – (optional) list of all pairwise distances

  • config_lims_r (1d numpy array) – (optional) bounds of all of the r_indices corresponding to each configuration

Notes

Initialization must be performed with either configurations or a combination of r_list and config_lims_r.

bingo.symbolic_regression.equation module#

The base of equation chromosomes in bingo.

This module defines the basis of equations in bingo evolutionary analyses. Equations are commonly used in symbolic regression, a specific application of genetic programming.

class bingo.symbolic_regression.equation.Equation(genetic_age=0, fitness=None, fit_set=False)#

Bases: Chromosome

Base representation of an equation

This class is the base of equations used in symbolic regression analyses in bingo.

abstract evaluate_equation_at(x)#

Evaluate the equation.

Get value of the equation at points x.

Parameters:

x (MxD array of numeric.) – Values at which to evaluate the equations. D is the number of dimensions in x and M is the number of data points in x.

Returns:

\(f(x)\)

Return type:

Mx1 array of numeric

abstract evaluate_equation_with_local_opt_gradient_at(x)#

Evaluate equation and get its derivatives.

Get the value of the equation at x and its gradient with respect to optimization parameters.

Parameters:

x (MxD array of numeric.) – Values at which to evaluate the equations. D is the number of dimensions in x and M is the number of data points in x.

Returns:

\(f(x)\) and \(df(x)/dc_i\). L is the number of optimization parameters.

Return type:

tuple(Mx1 array of numeric, MxL array of numeric)

abstract evaluate_equation_with_x_gradient_at(x)#

Evaluate equation and get its derivatives.

Get the value of the equation at x and its gradient with respect to x.

Parameters:

x (MxD array of numeric.) – Values at which to evaluate the equations. D is the number of dimensions in x and M is the number of data points in x.

Returns:

\(f(x)\) and \(df(x)/dx_i\)

Return type:

tuple(Mx1 array of numeric, MxD array of numeric)

abstract get_complexity()#

Calculate complexity of equation.

Returns:

complexity measure of equation

Return type:

numeric
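To make the array shapes in this interface concrete, here is a minimal stand-in that mirrors the four abstract methods for the equation f(x) = c0 * x0 + c1. The class LinearToyEquation is hypothetical and deliberately does not subclass Equation; its complexity value is likewise an arbitrary illustration.

```python
import numpy as np


class LinearToyEquation:
    """Stand-in for an Equation: f(x) = c0 * x0 + c1 (not a bingo class)."""

    def __init__(self, constants):
        self.constants = np.asarray(constants, dtype=float)  # L = 2

    def evaluate_equation_at(self, x):
        c0, c1 = self.constants
        return c0 * x[:, [0]] + c1  # shape (M, 1)

    def evaluate_equation_with_x_gradient_at(self, x):
        c0, _ = self.constants
        df_dx = np.zeros_like(x, dtype=float)  # shape (M, D)
        df_dx[:, 0] = c0
        return self.evaluate_equation_at(x), df_dx

    def evaluate_equation_with_local_opt_gradient_at(self, x):
        # df/dc0 = x0 and df/dc1 = 1, stacked column-wise: shape (M, L)
        df_dc = np.hstack([x[:, [0]], np.ones((x.shape[0], 1))])
        return self.evaluate_equation_at(x), df_dc

    def get_complexity(self):
        return 5  # hypothetical node count: c0, x0, *, c1, +


x = np.arange(6, dtype=float).reshape(3, 2)  # M = 3 points, D = 2 dimensions
eq = LinearToyEquation([2.0, 1.0])
f_x = eq.evaluate_equation_at(x)
_, df_dx = eq.evaluate_equation_with_x_gradient_at(x)
_, df_dc = eq.evaluate_equation_with_local_opt_gradient_at(x)
```

Here f_x has shape (3, 1), df_dx has shape (3, 2) because D = 2, and df_dc has shape (3, 2) because L = 2.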

bingo.symbolic_regression.equation_regressor module#

This module contains a wrapper around the bingo Equation object to match the regressor interface in scikit-learn. Calling fit performs a fitting of the numerical constants in the equation.

class bingo.symbolic_regression.equation_regressor.EquationRegressor(equation, metric='mse', algo='lm', tol=1e-06, fit_retries=5)#

Bases: RegressorMixin, BaseEstimator

A thin scikit learn wrapper around bingo equations

Parameters:
  • equation (Equation) – equation that will be wrapped

  • metric (str, optional) – metric used for local optimization on parameters during fit, by default “mse”

  • algo (str, optional) – algorithm used for local optimization on parameters during fit, by default “lm”

  • tol (float, optional) – tolerance used for local optimization on parameters during fit, by default 1e-6

  • fit_retries (int, optional) – number of times to attempt to fit parameters. This is a hedge against the variability of selecting a random starting point for the local optimization, by default 5
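The idea behind fit_retries can be illustrated with a small random-restart loop. This is only a sketch: it uses plain gradient descent on a hypothetical model f(x) = sin(c * x) as a stand-in for the wrapper's local optimizer, and the helper names fit_constant and fit_with_retries are made up for this example.

```python
import numpy as np


def fit_constant(x, y, c_start, steps=200, learning_rate=0.05):
    """Gradient descent on the MSE of f(x) = sin(c * x) from one start."""
    c = c_start
    for _ in range(steps):
        residual = np.sin(c * x) - y
        grad = np.mean(2.0 * residual * np.cos(c * x) * x)
        c -= learning_rate * grad
    mse = float(np.mean((np.sin(c * x) - y) ** 2))
    return c, mse


def fit_with_retries(x, y, fit_retries=5, seed=0):
    """Run several randomly started fits and keep the lowest-error one."""
    rng = np.random.default_rng(seed)
    attempts = [fit_constant(x, y, rng.uniform(-5.0, 5.0))
                for _ in range(fit_retries)]
    return min(attempts, key=lambda attempt: attempt[1]), attempts


x = np.linspace(0.0, 3.0, 50)
y = np.sin(2.0 * x)
(best_c, best_mse), attempts = fit_with_retries(x, y)
```

Because the error surface of a nonlinear model can have several local minima, individual starts may end at different constants; keeping the best of several attempts reduces the dependence on any single starting point.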

property complexity#

Complexity of equation

fit(X, y, sample_weight=None)#

Fit constants in equation to the given data.

Parameters:
  • X (MxD numpy array of numeric) – Input values. D is the number of dimensions and M is the number of data points.

  • y (Mx1 numpy array of numeric) – Target/output values. M is the number of data points.

  • sample_weight (Mx1 numpy array of numeric, optional) – Weights per sample/data point. M is the number of data points. Not currently supported.

property fitness#

Fitness of equation

predict(X)#

Evaluate the equation to predict the outputs of X.

Parameters:

X (MxD numpy array of numeric) – Input values. D is the number of dimensions and M is the number of data points.

Returns:

pred_y – Predicted target/output values. M is the number of data points.

Return type:

Mx1 numpy array of numeric

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') EquationRegressor#

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') EquationRegressor#

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object

bingo.symbolic_regression.explicit_regression module#

Explicit Symbolic Regression

Explicit symbolic regression is the search for a function, f, such that f(x) = y.

The classes in this module encapsulate the parts of bingo evolutionary analysis that are unique to explicit symbolic regression. Namely, these classes are an appropriate fitness evaluator and a corresponding training data container.

class bingo.symbolic_regression.explicit_regression.ExplicitRegression(training_data, metric='mae', relative=False, use_linear_correction=False)#

Bases: VectorGradientMixin, VectorBasedFunction

The traditional fitness evaluation for symbolic regression. fitness = M(y - f(x)) where x and y are in the training_data (i.e. training_data.x and training_data.y) and the function f is defined by the input Equation individual. M is an aggregation metric such as mean squared error.

Parameters:
  • training_data (ExplicitTrainingData) – data that is used in fitness evaluation.

  • metric (str) – String defining the measure of error to use. Available options are: ‘mean absolute error’, ‘mean squared error’, ‘root mean squared error’, and ‘negative nmll laplace’.

  • relative (bool) – Whether to use relative, pointwise normalization of errors. Default: False.

  • use_linear_correction (bool) – Whether to adjust outputs of equations by a least squares linear correction. Default: False.

evaluate_fitness_vector(individual)#

Traditional fitness evaluation for symbolic regression

fitness = y - f(x) where x and y are in the training_data (i.e. training_data.x and training_data.y) and the function f is defined by the input Equation individual.

Parameters:

individual (Equation) – individual whose fitness is evaluated on training_data

Returns:

the fitness of the input Equation individual

Return type:

float
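The relationship between the fitness vector y - f(x) and the aggregation metric M can be sketched in NumPy as follows. The short metric keys, the candidate function, and the interpretation of relative as division by y are assumptions made for illustration; they are not taken from the bingo source.

```python
import numpy as np


def fitness_vector(f, x, y, relative=False):
    """Pointwise error y - f(x), optionally divided by y (assumed meaning)."""
    error = y - f(x)
    return error / y if relative else error


# Aggregation metrics M applied to the fitness vector.
METRICS = {
    "mae": lambda v: np.mean(np.abs(v)),
    "mse": lambda v: np.mean(v ** 2),
    "rmse": lambda v: np.sqrt(np.mean(v ** 2)),
}

x = np.array([[1.0], [2.0], [3.0]])
y = np.array([[2.0], [4.0], [7.0]])
candidate = lambda x: 2.0 * x  # hypothetical equation f(x) = 2x

vector = fitness_vector(candidate, x, y)
fitness = {name: float(metric(vector)) for name, metric in METRICS.items()}
relative_vector = fitness_vector(candidate, x, y, relative=True)
```

Only the third point is missed by the candidate, so the vector has a single nonzero entry and each metric condenses it into one number.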

get_fitness_vector_and_jacobian(individual)#

Fitness and jacobian evaluation of individual

fitness = y - f(x) where x and y are in the training_data (i.e. training_data.x and training_data.y) and the function f is defined by the input Equation individual.

jacobian = [[\(df_1/dc_1\), \(df_1/dc_2\), …],

[\(df_2/dc_1\), \(df_2/dc_2\), …], …]

where \(f_\#\) is the fitness function corresponding with the #th fitness vector entry and \(c_\#\) is the corresponding constant of the individual

Parameters:

individual (Equation) – individual whose fitness will be evaluated on training_data and whose constants will be used for evaluating the jacobian

Returns:

the vectorized fitness of the individual and the partial derivatives of each fitness function with respect to the individual’s constants

Return type:

fitness_vector, jacobian
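As a worked illustration of the Jacobian layout (one row per fitness vector entry, one column per constant), the sketch below uses a hypothetical equation f(x) = c0 * exp(c1 * x) and follows the sign convention stated above, fitness = y - f(x). The analytic Jacobian is compared against central finite differences; none of these helper names come from bingo.

```python
import numpy as np


def model(constants, x):
    """Hypothetical equation f(x) = c0 * exp(c1 * x)."""
    c0, c1 = constants
    return c0 * np.exp(c1 * x)


def fitness_vector_and_jacobian(constants, x, y):
    """Residuals y - f(x) and their partial derivatives w.r.t. c0 and c1."""
    c0, c1 = constants
    residual = (y - model(constants, x)).ravel()
    # d(y - f)/dc = -df/dc, one row per data point, one column per constant
    jacobian = -np.hstack([np.exp(c1 * x), c0 * x * np.exp(c1 * x)])
    return residual, jacobian


x = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
y = 3.0 * np.exp(0.5 * x)
constants = np.array([2.0, 1.0])
residual, jacobian = fitness_vector_and_jacobian(constants, x, y)

# Central finite differences as an independent check of the analytic Jacobian.
step = 1e-6
numeric = np.column_stack([
    ((y - model(constants + step * e, x))
     - (y - model(constants - step * e, x))).ravel() / (2.0 * step)
    for e in np.eye(2)
])
```

Gradient-based local optimizers consume exactly this pair: the residual vector and its derivatives with respect to the constants being tuned.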

class bingo.symbolic_regression.explicit_regression.ExplicitTrainingData(x, y)#

Bases: TrainingData

ExplicitTrainingData: Training data of this type contains an input array of data (x) and an output array of data (y). Both must be two-dimensional numpy arrays.

Parameters:
  • x (2D numpy array) – independent variable

  • y (2D numpy array) – dependent variable

property x#

independent x data

property y#

dependent y data
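Since both arrays must be two-dimensional, one-dimensional data typically needs reshaping before it is handed to ExplicitTrainingData(x, y). The following NumPy-only sketch shows one way to prepare such arrays; the sample data is made up.

```python
import numpy as np

x_raw = np.linspace(-1.0, 1.0, 20)   # shape (20,): one input variable
y_raw = x_raw ** 2 + 0.5             # shape (20,): one output

# reshape(-1, 1) turns a 1D vector into a single-column 2D array
x = x_raw.reshape(-1, 1)
y = y_raw.reshape(-1, 1)

# Several input variables go side by side as columns: shape (20, 2)
x_two_vars = np.column_stack([x_raw, np.sin(x_raw)])
```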

bingo.symbolic_regression.implicit_regression module#

Implicit Symbolic Regression

Implicit symbolic regression is the search for a function, f, such that f(x) = constant. One of the most difficult parts of this task is avoiding trivial solutions like f(x) = 0*x.

The classes in this module encapsulate the parts of bingo evolutionary analysis that are unique to implicit symbolic regression. Namely, these classes are appropriate fitness evaluators, a corresponding training data container, and two helper functions.

class bingo.symbolic_regression.implicit_regression.ImplicitRegression(training_data, required_params=None)#

Bases: VectorBasedFunction

Implicit Regression, version 2

Fitness of this metric is related to the cosine of the angle between \(df_dx(x)\) and \(dx_dt\). \(df_dx(x)\) is calculated through derivatives of the input Equation individual at training_data.x. \(dx_dt\) is from training_data.dx_dt.

Different normalization and error checking are available.

Parameters:
  • training_data (ImplicitTrainingData) – data that is used in fitness evaluation.

  • required_params (int) – (optional) minimum number of nonzero components of dot

evaluate_fitness_vector(individual)#

Evaluates the fitness of an implicit individual

Evaluates the fitness of the input Equation individual based on the cos of the angle between \(df_dx(x)\) and \(dx_dt\). Where \(df_dx\) comes from the equation’s output w.r.t. training_data.x and \(dx_dt\) is training_data.dx_dt.

Parameters:

individual (Equation) – individual whose fitness is evaluated on training_data

Returns:

the fitness of the input Equation individual

Return type:

float
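The geometric idea can be sketched with points moving around the unit circle: if f is constant along the trajectory, its gradient df/dx is orthogonal to dx/dt, so the cosine of the angle between them is zero. The function below is a simplified illustration; the exact normalization and error checking used by bingo are not reproduced here.

```python
import numpy as np


def cosine_fitness_vector(df_dx, dx_dt):
    """|cos(angle)| between each row of df_dx and the matching row of dx_dt."""
    dot = np.sum(df_dx * dx_dt, axis=1)
    norms = np.linalg.norm(df_dx, axis=1) * np.linalg.norm(dx_dt, axis=1)
    return np.abs(dot / norms)


# Points moving around the unit circle: x = (cos t, sin t)
t = np.linspace(0.1, 3.0, 30)
x = np.column_stack([np.cos(t), np.sin(t)])
dx_dt = np.column_stack([-np.sin(t), np.cos(t)])

# f(x) = x0^2 + x1^2 is constant on the circle, so df/dx is normal to motion
good = cosine_fitness_vector(2.0 * x, dx_dt)

# f(x) = x0 + x1 is not conserved, so its gradient is not orthogonal to dx_dt
bad = cosine_fitness_vector(np.ones_like(x), dx_dt)
```

A trivial candidate such as f(x) = 0*x has a zero gradient everywhere, which makes the cosine undefined; this is one reason implementations need additional checks.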

class bingo.symbolic_regression.implicit_regression.ImplicitTrainingData(x, dx_dt=None)#

Bases: TrainingData

ImplicitTrainingData: Training data of this type contains an input array of data (x) and its time derivative (dx_dt). Both must be two-dimensional numpy arrays.

Parameters:
  • x (2D numpy array) – independent variable

  • dx_dt (2D numpy array) – (optional) time derivative of x. If not provided dx_dt is calculated from x.

x#

independent variable

Type:

2D numpy array

dx_dt#

time derivative of x

Type:

2D numpy array

Notes

If dx_dt partials are calculated, smoothing of the data will occur. Because accuracy of the derivatives degrades near the boundaries, the first 3 and last 4 points are removed from the dataset.

If the dataset is broken into multiple trajectories (i.e., there are break points where it doesn't make sense to calculate partial derivatives), they should be split in the input x by a row of np.nan.
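The np.nan convention for separating trajectories can be sketched as follows. The helper split_trajectories is hypothetical and only shows how such a marker row partitions the data; it does not reproduce the smoothing or derivative calculation performed by ImplicitTrainingData.

```python
import numpy as np

# Two separate trajectories of a two-variable system
traj_a = np.column_stack([np.linspace(0.0, 1.0, 10), np.linspace(1.0, 2.0, 10)])
traj_b = np.column_stack([np.linspace(5.0, 6.0, 8), np.linspace(3.0, 2.0, 8)])

# A row of np.nan marks the break between them in the combined input
nan_row = np.full((1, 2), np.nan)
x = np.vstack([traj_a, nan_row, traj_b])


def split_trajectories(x):
    """Split x at rows containing NaN and drop the marker rows."""
    break_rows = np.flatnonzero(np.isnan(x).any(axis=1))
    pieces = np.split(x, break_rows)
    return [piece[~np.isnan(piece).any(axis=1)] for piece in pieces]


segments = split_trajectories(x)
```

Derivatives computed within each segment never mix points from different trajectories.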

property dx_dt#

derivative of x data

property x#

x data

bingo.symbolic_regression.implicit_regression_schmidt module#

Implicit Regression, Adapted from Schmidt and Lipson papers

Fitness in this method is the difference of partial derivatives pairs calculated with the data and the input Equation individual.

This may not be a correct implementation of this algorithm. Importantly, it couldn’t reproduce the results in the papers. Currently, there is no effort to maintain functionality of this module.

class bingo.symbolic_regression.implicit_regression_schmidt.ImplicitRegressionSchmidt(training_data=None, metric='mae')#

Bases: VectorBasedFunction

Implicit Regression, Adapted from Schmidt and Lipson papers

Fitness in this method is the difference of partial derivatives pairs calculated with the data and the input Equation individual.

See “Symbolic Regression of Implicit Equations” by Michael Schmidt and Hod Lipson for more info.

Parameters:

training_data (ImplicitTrainingData) – data that is used in fitness evaluation. Must have attributes x and dx_dt.

evaluate_fitness_vector(individual)#

Evaluates the fitness of an implicit individual

Evaluates the fitness of the input Equation individual based on the ratio of partial derivatives between pairs of variables.

fitness = \(-\frac{1}{N} \sum_{i=1}^N \log \left(1 + | \frac{\Delta x_i}{\Delta y_i} + \frac{\delta x_i}{\delta y_i}| \right)\) for each \(x\) and \(y\) pair in training_data.x where \(N\) is the length of the training_data, \(\frac{\Delta x_i}{\Delta y_i} = \frac{dx/dt}{dy/dt}\) from training_data.dx_dt, and \(\frac{\delta x_i}{\delta y_i} = \frac{\delta f / \delta y}{\delta f / \delta x}\) from the input Equation individual’s output on training_data.x.

Parameters:

individual (Equation) – individual whose fitness is evaluated on training_data

Returns:

the fitness of the input Equation individual

Return type:

float
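For a two-variable system, the formula above can be evaluated directly in NumPy. The sketch below is an illustration of the formula as written, not the module's code: it uses motion on the unit circle, where the ratio from the data and the ratio from a correct equation cancel exactly. The function name schmidt_fitness is made up for this example.

```python
import numpy as np


def schmidt_fitness(dx_dt, df_dx):
    """-mean(log(1 + |data ratio + equation ratio|)) for two variables."""
    data_ratio = dx_dt[:, 0] / dx_dt[:, 1]      # (dx/dt) / (dy/dt)
    equation_ratio = df_dx[:, 1] / df_dx[:, 0]  # (df/dy) / (df/dx)
    return -np.mean(np.log(1.0 + np.abs(data_ratio + equation_ratio)))


# Motion on the unit circle, avoiding angles where a derivative is zero
t = np.linspace(0.2, 1.3, 25)
xy = np.column_stack([np.cos(t), np.sin(t)])
dx_dt = np.column_stack([-np.sin(t), np.cos(t)])

# f = x^2 + y^2 describes the circle: the two ratios cancel
fitness_circle = schmidt_fitness(dx_dt, 2.0 * xy)

# f = x + y does not: the log term is positive, so the fitness is negative
fitness_line = schmidt_fitness(dx_dt, np.ones_like(xy))
```

With this sign convention the best attainable value is zero, and poorer candidates score increasingly negative values.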

bingo.symbolic_regression.srbench_interface module#

An interface for Bingo with the symbolic regression benchmarking suite SRBENCH: github.com/cavalab/srbench

bingo.symbolic_regression.srbench_interface.eval_kwargs = {'test_params': {'generations': 2}}#
A dictionary of variables passed to the evaluate_model() function. Allows one to configure aspects of the training process.

Options#

test_params: dict, default = None

Used primarily to shorten run times during testing. Called as

est = est.set_params(**test_params)

max_train_samples: int, default = 0

If the training size is larger than this, sample it. If 0, use all training samples for fit.

scale_x: bool, default = True

Normalize the input data prior to fit.

scale_y: bool, default = True

Normalize the input label prior to fit.

pre_train: function, default = None

Adjust settings based on training data. Called prior to est.fit. The function signature should be (est, X, y).

est: sklearn regressor; the fitted model. X: pd.DataFrame; the training data. y: training labels.

type:

eval_kwargs
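A fuller eval_kwargs dictionary using the options listed above might look like the sketch below. The pre_train body, the 1000-sample cutoff, and the DummyEstimator class are hypothetical and exist only to make the example runnable; how SRBench consumes the dictionary is inferred from the option descriptions.

```python
def pre_train(est, X, y):
    """Hypothetical hook: shrink the search for small datasets."""
    if len(X) < 1000:
        est.set_params(population_size=100)


eval_kwargs = {
    "test_params": {"generations": 2},  # value shown in this module
    "max_train_samples": 0,             # 0 means use every training sample
    "scale_x": True,
    "scale_y": True,
    "pre_train": pre_train,
}


class DummyEstimator:
    """Minimal stand-in with a scikit-learn style set_params method."""

    def __init__(self):
        self.params = {"generations": 1000, "population_size": 500}

    def set_params(self, **params):
        self.params.update(params)
        return self


est = DummyEstimator()
est = est.set_params(**eval_kwargs["test_params"])
eval_kwargs["pre_train"](est, [[0.0]] * 10, [0.0] * 10)
```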

bingo.symbolic_regression.srbench_interface.get_best_solution(est)#

Return the best solution from the final model.

Return type:

A scikit-learn compatible estimator

bingo.symbolic_regression.srbench_interface.get_population(est)#

Return the final population of the model. This final population should be a list with at most 100 individuals. Each of the individuals must be compatible with scikit-learn, so they should have a predict method.

Also, it is expected that the model() function can operate with them, so they should have a way of getting a sympy string representation.

Return type:

A list of scikit-learn compatible estimators

bingo.symbolic_regression.srbench_interface.hyper_params = [{'population_size': [100], 'stack_size': [24]}, {'population_size': [500], 'stack_size': [24]}, {'population_size': [2500], 'stack_size': [32]}]#

a sklearn-compatible regressor.

Type:

est
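The hyper_params value is a list of parameter grids in the style accepted by scikit-learn grid searches, where each dictionary maps a parameter name to a list of candidate values. The pure-Python sketch below expands such a list into individual settings; expand_grid is a hypothetical helper written for this illustration.

```python
from itertools import product

hyper_params = [
    {"population_size": [100], "stack_size": [24]},
    {"population_size": [500], "stack_size": [24]},
    {"population_size": [2500], "stack_size": [32]},
]


def expand_grid(grids):
    """Yield every parameter combination described by a list of grids."""
    for grid in grids:
        names = sorted(grid)
        for values in product(*(grid[name] for name in names)):
            yield dict(zip(names, values))


candidates = list(expand_grid(hyper_params))
```

Each grid here lists a single value per parameter, so the three grids yield three candidate settings.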

bingo.symbolic_regression.srbench_interface.model(est, X=None)#

Return a sympy-compatible string of the final model.

Parameters:
  • est (sklearn regressor) – The fitted model.

  • X (pd.DataFrame, default=None) – The training data. This argument can be dropped if desired.

Return type:

A sympy-compatible string of the final model.

Notes

Ensure that the variable names appearing in the model are identical to those in the training data, X, which is a pd.DataFrame. If your method names variables some other way, e.g. [x_0 … x_m], you can specify a mapping in the model function such as:

```
def model(est, X):
    # Map generic names x_0 ... x_m onto the DataFrame's column names
    mapping = {'x_' + str(i): k for i, k in enumerate(X.columns)}
    new_model = est.model_
    for k, v in mapping.items():
        new_model = new_model.replace(k, v)
    return new_model
```

bingo.symbolic_regression.symbolic_regressor module#

This module contains the implementation of an object used for symbolic regression via a scikit-learn interface.

class bingo.symbolic_regression.symbolic_regressor.SymbolicRegressor(*, population_size=500, stack_size=32, operators=None, use_simplification=False, crossover_prob=0.4, mutation_prob=0.4, metric='mse', clo_alg='lm', generations=10000000000000000000, fitness_threshold=1e-16, max_time=1800, max_evals=10000000000000000000, evolutionary_algorithm=None, clo_threshold=1e-05, scale_max_evals=False, random_state=None)#

Bases: RegressorMixin, BaseEstimator

Class for performing symbolic regression using genetic programming.

Parameters:
  • population_size (int, optional) – The number of individuals in a population.

  • stack_size (int, optional) – The max number of commands per individual.

  • operators (iterable of str, optional) – Potential operations that can be used.

  • use_simplification (bool, optional) – Whether to use simplification to speed up evaluation or not.

  • crossover_prob (float, optional) – Probability in [0, 1] of crossover occurring on an individual.

  • mutation_prob (float, optional) – Probability in [0, 1] of mutation occurring on an individual.

  • metric (str, optional) – Error metric to use for fitness (e.g., “rmse”, “mse”, “mae”).

  • clo_alg (str, optional) – Algorithm to use for local optimization (e.g., “lm”, “BFGS”, etc.).

  • generations (int, optional) – Maximum number of generations allowed for evolution.

  • fitness_threshold (float, optional) – Error/fitness threshold to stop evolution at.

  • max_time (int, optional) – Number of seconds to stop evolution at.

  • max_evals (int, optional) – Number of fitness evaluation to stop evolution at.

  • evolutionary_algorithm (EvolutionaryAlgorithm, optional) – Evolutionary algorithm to use in evolution.

  • clo_threshold (float, optional) – Threshold/tolerance for local optimization.

  • scale_max_evals (bool, optional) – Whether to scale max_evals based on fitness predictors or not.

  • random_state (int, optional) – Seed for random processes.
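Several of these parameters (generations, fitness_threshold, max_time, max_evals) act as stopping limits. The sketch below shows one plausible way such limits could combine, stopping as soon as any one is reached. The should_stop helper and the simulated evolution loop are assumptions for illustration and do not reflect the actual control flow of SymbolicRegressor.

```python
import time


def should_stop(generation, best_fitness, evals, start_time, *,
                generations, fitness_threshold, max_time, max_evals):
    """Return the name of the first limit that has been reached, else None."""
    if generation >= generations:
        return "generations"
    if best_fitness <= fitness_threshold:
        return "fitness_threshold"
    if time.monotonic() - start_time >= max_time:
        return "max_time"
    if evals >= max_evals:
        return "max_evals"
    return None


limits = dict(generations=50, fitness_threshold=1e-3, max_time=1800,
              max_evals=10_000)
start = time.monotonic()
best_fitness, evals, generation, reason = 1.0, 0, 0, None
while reason is None:
    generation += 1
    evals += 100         # pretend each generation evaluates 100 individuals
    best_fitness *= 0.5  # pretend the best error halves every generation
    reason = should_stop(generation, best_fitness, evals, start, **limits)
```

In this simulated run the error falls below the threshold well before the generation, time, or evaluation limits are reached.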

fit(X, y, sample_weight=None)#

Fit this model to the given data.

Parameters:
  • X (MxD numpy array of numeric) – Input values. D is the number of dimensions and M is the number of data points.

  • y (Mx1 numpy array of numeric) – Target/output values. M is the number of data points.

  • sample_weight (Mx1 numpy array of numeric, optional) – Weights per sample/data point. M is the number of data points. Not currently supported.

Returns:

self – The fitted version of this object.

Return type:

SymbolicRegressor

get_best_individual()#

Gets the best model found from fit().

Returns:

best_individual – Model with the best fitness from fit().

Return type:

RegressorMixin

Raises:

ValueError – If fit() has not been called yet

get_best_population()#

Gets best group of models from fit()

Returns:

Models from pareto front and final population from fit().

Return type:

list of RegressorMixin

Raises:

ValueError – If fit() has not been called yet

get_pareto_front()#

Gets best group of models from fit()

Returns:

Models with the best fitnesses and complexities from fit().

Return type:

list of RegressorMixin

Raises:

ValueError – If fit() has not been called yet
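A Pareto front over fitness and complexity keeps every model that no other model beats on both measures at once. The pure-Python sketch below illustrates the concept on made-up (label, fitness, complexity) triples; it is not the selection code used by SymbolicRegressor.

```python
def pareto_front(models):
    """Keep models that no other model beats on both fitness and complexity."""
    front = []
    for name, fitness, complexity in models:
        dominated = any(
            other_fit <= fitness and other_cx <= complexity
            and (other_fit < fitness or other_cx < complexity)
            for _, other_fit, other_cx in models
        )
        if not dominated:
            front.append((name, fitness, complexity))
    return sorted(front, key=lambda model: model[2])


# (label, fitness, complexity) triples; lower is better for both numbers
models = [
    ("x", 0.90, 1),
    ("2*x", 0.40, 3),
    ("x + x*x", 0.45, 5),             # worse and more complex than "2*x"
    ("2*x + sin(x)", 0.05, 6),
    ("2*x + sin(x) + 0*x", 0.05, 9),  # same fitness, more complex
]
front = pareto_front(models)
```

The surviving models trade accuracy against simplicity, which lets a user pick the simplest equation that is accurate enough for the task.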

predict(X)#

Use the best individual to predict the outputs of X.

Parameters:

X (MxD numpy array of numeric) – Input values. D is the number of dimensions and M is the number of data points.

Returns:

pred_y – Predicted target/output values. M is the number of data points.

Return type:

Mx1 numpy array of numeric

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') SymbolicRegressor#

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') SymbolicRegressor#

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object

bingo.symbolic_regression.symbolic_regressor.agraph_similarity(ag_1, ag_2)#

a similarity metric between agraphs

Module contents#

Import the core names of bingo symbolic_regression library

Programs that want to build bingo symbolic regression apps without having to import specific modules can import this.