The repository contains an implementation of the Gene Expression Programming [1], whereby the internal representation of the equation is fully tokenized as a vector of integers. This representation allows a lower memory footprint, leading to faster processing of the application of the genetic operators. Moreover, the implementation also contains a mechanism for semantic backpropagation, ensuring dimensional homogeneity for physical units [2].
-
Install the package:
using Pkg Pkg.add(url="https://github.com/maxreiss123/GeneExpressionProgramming.jl.git")
# Min_example using GeneExpressionProgramming using Random Random.seed!(1) #Define the iterations for the algorithm and the population size epochs = 1000 population_size = 1000 #Number of features which needs to be inserted number_features = 2 x_data = randn(Float64, 100, number_features) y_data = @. x_data[:,1] * x_data[:,1] + x_data[:,1] * x_data[:,2] - 2 * x_data[:,1] * x_data[:,2] #define the regressor regressor = GepRegressor(number_features) #perform the regression by entering epochs, population_size, the feature cols, the target col and the loss function fit!(regressor, epochs, population_size, x_data', y_data; loss_fun="mse") pred = regressor(x_data') # Can be utilized to perform the prediction for further data @show regressor.best_models_[1].compiled_function @show regressor.best_models_[1].fitness
- Imagine you want to find
$J$ explaining superconductivity as$J=-\rho \frac{q}{m} A$ (Fyneman III 21.20) -
$J$ marking the electric current,$q$ the electric charge,$\rho$ the charge density,$m$ the mass and$A$ the magnetic vector potential
# Min_example
using GeneExpressionProgramming
using Random
Random.seed!(1)
#Define the iterations for the algorithm and the population size
epochs = 1000
population_size = 1000
#By loading the data we end up with 5 cols => 4 for the features and the last one for the target
data = Matrix(CSV.read("paper/srsd/feynman-III.21.20\$0.01.txt", DataFrame))
data = data[all.(x -> !any(isnan, x), eachrow(data)), :]
num_cols = size(data, 2) #num_cols =5
# Perform a simple train test split
x_train, y_train, x_test, y_test = train_test_split(data[:, 1:num_cols-1], data[:, num_cols]; consider=4)
#define a target dimension - here ampere - (units inspired by openFoam) - https://doc.cfd.direct/openfoam/user-guide-v6/basic-file-format
target_dim = Float16[0, -2, 0, 0, 0, 1, 0] # Aiming for electric conductivity (Ampere/m^2)
#define dims for the features
#header of the reveals rho_c_0,q,A_vec,m -> internally mapt on x_1 ... x_n
feature_dims = Dict{Symbol,Vector{Float16}}(
:x1 => Float16[0, -3, 1, 0, 0, 1, 0], #rho m^(-3) * s * A
:x2 => Float16[0, 0, 1, 0, 0, 1, 0], #q s*A
:x3 => Float16[1, 1, -2, 0, 0, -1, 0], #A kg*m*s^(-2)*A^(-1)
:x4 => Float16[1, 0, 0, 0, 0, 0, 0], #m kg
)
#define the features, here the numbers of the first two cols - here we add the feature dims and a maximum of permutations per tree high - rounds, referring to the tree high
regressor = GepRegressor(num_cols-1; considered_dimensions=feature_dims,max_permutations_lib=10000, rounds=7)
#perform the regression by entering epochs, population_size, the feature cols, the target col and the loss function
fit!(regressor, epochs, population_size, x_train', y_train; x_test=x_test', y_test=y_test, loss_fun="mse")
pred = regressor(x_data')
@show regressor.best_models_[1].compiled_function
@show regressor.best_models_[1].fitness
- Remark: Template for rerunning the test from the paper is located in the paper directory
- Remark: the tutorial folder contains notebook, that can be run with google-colab, while showing a step-by-step introduction
- To conduct a regression involving higher dimensional objects we swap the underlying evaluation from DynamicExpression.jl to Flux.jl
- Hint: By involving such objects, the performance deteriorates significantly
using GeneExpressionProgramming
using Random
using Tensors
Random.seed!(1)
#Define the iterations for the algorithm and the population size
epochs = 100
population_size = 1000
#Number of features which needs to be inserted
number_features = 5
#define the
regressor = GepTensorRegressor(number_features,
gene_count=2, #2 works quite reliable
head_len=3) # 5 works quite reliable
#create some testdata - testing simply on a few velocity vectors
size_test = 1000
u1 = [randn(Tensor{1,3}) for _ in 1:size_test]
u2 = [randn(Tensor{1,3}) for _ in 1:size_test]
u3 = [randn(Tensor{1,3}) for _ in 1:size_test]
x1 = [2.0 for _ in 1:size_test]
x2 = [0.0 for _ in 1:size_test]
a = 0.5 * u1 .+ x2 .* u2 + 2* u3
inputs = (x1,x2,u1,u2,u3)
@inline function loss_new(elem, validate::Bool)
if isnan(mean(elem.fitness)) || validate
model = elem.compiled_function
a_pred = model(inputs)
!isfinite(norm(a_pred)) && return (typemax(Float64),)
size(a_pred) != size(a) && return (typemax(Float64),)
size(a_pred[1]) != size(a[1]) && return (typemax(Float64),)
loss = norm(a_pred .- a)
return (loss,)
else
return (elem.fitness,)
end
end
fit!(regressor, epochs, population_size, loss_new)
- DynamicExpressions.jl
- Flux.jl --> in development
- [1] Ferreira, C. (2001). Gene Expression Programming: a New Adaptive Algorithm for Solving Problems. Complex Systems, 13.
- [2] Reissmann, M., Fang, Y., Ooi, A., & Sandberg, R. (2024). Constraining genetic symbolic regression via semantic backpropagation. arXiv. https://arxiv.org/abs/2409.07369
- The Coefficient optimization is inspired by https://github.com/MilesCranmer/SymbolicRegression.jl
- We employ the insane fast DynamicExpressions.jl for evaluating our expressions
- Documentation
- Naming conventions!
- Improve usability for user interaction
- Next operations: Tail flip, Connection symbol flip, wrapper class for easy usage, config class for predefinition, staggered exploration
- nice print flux
- constant node needs to be fixed