From 279ed71c8c684685b0a383fba003fd159fc75c61 Mon Sep 17 00:00:00 2001 From: Atharva Sehgal Date: Mon, 16 Sep 2024 11:26:43 -0700 Subject: [PATCH 01/17] update documentation --- CITATION.md | 2 ++ README.md | 50 +++++++++++++++++++++++++++++++++++++---------- src/LLMOptions.jl | 1 + 3 files changed, 43 insertions(+), 10 deletions(-) diff --git a/CITATION.md b/CITATION.md index 517ad4804..367d5089d 100644 --- a/CITATION.md +++ b/CITATION.md @@ -30,3 +30,5 @@ To cite symbolic distillation of neural networks, the following BibTeX entry can primaryClass={cs.LG} } ``` + +To cite Lang \ No newline at end of file diff --git a/README.md b/README.md index 144674c5e..dec25e782 100644 --- a/README.md +++ b/README.md @@ -1,15 +1,15 @@
-LaSR.jl accelerates the search for symbolic expressions using library learning. +LibraryAugmentedSymbolicRegression.jl accelerates the search for symbolic expressions using library learning. | Latest release | Website | Forums | Paper | | :---: | :---: | :---: | :---: | -| [![version](https://juliahub.com/docs/LaSR/version.svg)](https://juliahub.com/ui/Packages/LaSR/X2eIS) | [![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://trishullab.github.io/lasr-web/) | [![Discussions](https://img.shields.io/badge/discussions-github-informational)](https://github.com/trishullab/LaSR.jl/discussions) | [![Paper](https://img.shields.io/badge/arXiv-????.?????-b31b1b)](https://atharvas.net/static/lasr.pdf) | +| [![version](https://juliahub.com/docs/LaSR/version.svg)](https://juliahub.com/ui/Packages/LaSR/X2eIS) | [![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://trishullab.github.io/lasr-web/) | [![Discussions](https://img.shields.io/badge/discussions-github-informational)](https://github.com/trishullab/LibraryAugmentedSymbolicRegression.jl/discussions) | [![Paper](https://img.shields.io/badge/arXiv-????.?????-b31b1b)](https://atharvas.net/static/lasr.pdf) | | Build status | Coverage | | :---: | :---: | -| [![CI](https://github.com/trishullab/LaSR.jl/workflows/CI/badge.svg)](.github/workflows/CI.yml) | [![Coverage Status](https://coveralls.io/repos/github/trishullab/LaSR.jl/badge.svg?branch=master)](https://coveralls.io/github/trishullab/LaSR.jl?branch=master) | +| [![CI](https://github.com/trishullab/LibraryAugmentedSymbolicRegression.jl/workflows/CI/badge.svg)](.github/workflows/CI.yml) | [![Coverage Status](https://coveralls.io/repos/github/trishullab/LibraryAugmentedSymbolicRegression.jl/badge.svg?branch=master)](https://coveralls.io/github/trishullab/LibraryAugmentedSymbolicRegression.jl?branch=master) | LaSR is integrated with [SymbolicRegression.jl](https://github.com/MilesCranmer/SymbolicRegression.jl). 
Check out [PySR](https://github.com/MilesCranmer/PySR) for a Python frontend. @@ -22,6 +22,7 @@ a Python frontend. **Contents**: - [Quickstart](#quickstart) +- [Benchmarking](#benchmarking) - [Organization](#organization) - [LLM Utilities](#llm-utilities) @@ -34,7 +35,7 @@ using Pkg Pkg.add("LibraryAugmentedSymbolicRegression") ``` -LaSR uses the same interface as [SymbolicRegression.jl](https://github.com/MilesCranmer/SymbolicRegression.jl). The easiest way to use LaSR.jl +LaSR uses the same interface as [SymbolicRegression.jl](https://github.com/MilesCranmer/SymbolicRegression.jl). The easiest way to use LibraryAugmentedSymbolicRegression.jl is with [MLJ](https://github.com/alan-turing-institute/MLJ.jl). Let's see an example: @@ -105,27 +106,56 @@ where here we choose to evaluate the second equation. For fitting multiple outputs, one can use `MultitargetLaSRRegressor` (and pass an array of indices to `idx` in `predict` for selecting specific equations). -For a full list of options available to each regressor, see the [API page](https://astroautomata.com/LaSR.jl/dev/api/). +For a full list of options available to each regressor, see the [API page](https://astroautomata.com/LibraryAugmentedSymbolicRegression.jl/dev/api/). ### LLM Options LaSR uses PromptingTools.jl for zero shot prompting. If you wish to make changes to the prompting options, you can pass an `LLMOptions` object to the `LaSRRegressor` constructor. The options available are: ```julia llm_options = LLMOptions( - ... + active=true, # Whether to use LLM inference or not + weights=LLMWeights(llm_mutate=0.5, llm_crossover=0.3, llm_gen_random=0.2), # Probabilities of using LLM for mutation, crossover, and random generation + num_pareto_context=5, # Number of equations to sample from the Pareto frontier for summarization. + prompt_evol=true, # Whether to evolve natural language concepts through LLM calls. + prompt_concepts=true, # Whether to use natural language concepts in the search. 
+ api_key="token-abc123", # API key to OpenAI API compatible server. + model="meta-llama/Meta-Llama-3-8B-Instruct", # LLM model to use. + api_kwargs=Dict("url" => "http://localhost:11440/v1"), # Keyword arguments passed to server. + http_kwargs=Dict("retries" => 3, "readtimeout" => 3600), # Keyword arguments passed to HTTP requests. + llm_recorder_dir="lasr_runs/debug_0", # Directory to log LLM interactions. + llm_context="", # Natural language concept to start with. You should also be able to initialize with a list of concepts. + var_order=nothing, # Dict(variable_name => new_name). + idea_threshold=30 # Number of concepts to keep track of. ) ``` +## Benchmarking + +If you wish to compare against LaSR, we've archived the code we used to run LaSR on top of PySR and SymbolicRegression.jl in the `lasr-experiments` branch. Run +```bash +$ git switch lasr-experiments +``` +and follow the instructions in the README to reproduce our results. This directory contains the code for evaluating LaSR on the + +- [x] Feynman Equations dataset +- [x] Synthetic equations dataset + - [x] and generation code +- [x] Bigbench experiments + - [x] and evaluation code ## Organization -LaSR.jl development is kept independent from the main codebase. However, to ensure LaSR can be used easily, it is integrated into SymbolicRegression.jl via the `ext/SymbolicRegressionLaSRExt` extension module. This, in turn, is loaded into PySR. This cartoon summarizes the interaction between the different packages: +LibraryAugmentedSymbolicRegression.jl development is kept independent from the main codebase. However, to ensure LaSR can be used easily, it is integrated into SymbolicRegression.jl via the [`ext/SymbolicRegressionLaSRExt`](https://www.example.com) extension module. This, in turn, is loaded into PySR. 
This cartoon summarizes the interaction between the different packages: + +![LibraryAugmentedSymbolicRegression.jl organization](https://raw.githubusercontent.com/trishullab/lasr-web/main/static/lasr-code-interactions.svg) + +> [!NOTE] +> The `ext/SymbolicRegressionLaSRExt` module is not yet available in the released version of SymbolicRegression.jl. It will be available in the release `vX.X.X` of SymbolicRegression.jl. -![LaSR.jl organization](https://raw.githubusercontent.com/trishullab/lasr-web/main/static/lasr-code-interactions.svg) ## Code structure -LaSR.jl is organized roughly as follows. +LibraryAugmentedSymbolicRegression.jl is organized roughly as follows. Rounded rectangles indicate objects, and rectangles indicate functions. > (if you can't see this diagram being rendered, try pasting it into [mermaid-js.github.io/mermaid-live-editor](https://mermaid-js.github.io/mermaid-live-editor)) @@ -274,4 +304,4 @@ done | vims -l 'f a--> ' | sort ## Search options -See https://astroautomata.com/LaSR.jl/stable/api/#Options +Other than `LLMOptions`, we have the same search options as SymbolicRegression.jl. See https://astroautomata.com/SymbolicRegression.jl/stable/api/#Options diff --git a/src/LLMOptions.jl b/src/LLMOptions.jl index 5f7a01432..4a9814c10 100644 --- a/src/LLMOptions.jl +++ b/src/LLMOptions.jl @@ -46,6 +46,7 @@ this module serves as the entry point to define new options for the LLM inferenc - `llm_recorder_dir::String`: File to save LLM logs to. Useful for debugging. - `llm_context::AbstractString`: Context string for LLM. - `var_order::Union{Dict,Nothing}`: Variable order for LLM. (default: nothing) +- `idea_threshold::UInt32`: Number of concepts to keep track of.
(default: 30) """ Base.@kwdef mutable struct LLMOptions active::Bool = false From 6fc72b25bc6d9bee642489aa56e399b85b1e7832 Mon Sep 17 00:00:00 2001 From: Atharva Sehgal Date: Mon, 16 Sep 2024 12:50:35 -0700 Subject: [PATCH 02/17] formatting docs --- ext/SymbolicRegressionSymbolicUtilsExt.jl | 6 ++++-- src/Configure.jl | 4 +++- src/LLMFunctions.jl | 7 ++++--- src/LLMOptions.jl | 10 ++-------- src/LibraryAugmentedSymbolicRegression.jl | 8 ++++++-- src/MLJInterface.jl | 4 +++- src/Mutate.jl | 10 ++++++++-- src/RegularizedEvolution.jl | 6 +++++- src/SingleIteration.jl | 2 +- test/runtests.jl | 2 +- test/test_expression_derivatives.jl | 3 ++- test/test_graph_nodes.jl | 6 ++++-- test/test_lasr_integration.jl | 14 +++++++------ test/test_nested_constraints.jl | 24 +++++++++++++++++------ test/test_operators.jl | 12 +++++++++--- test/test_prob_pick_first.jl | 5 +++-- test/test_units.jl | 15 +++++++++----- 17 files changed, 91 insertions(+), 47 deletions(-) diff --git a/ext/SymbolicRegressionSymbolicUtilsExt.jl b/ext/SymbolicRegressionSymbolicUtilsExt.jl index 5fbefeb9a..2a7f32cc7 100644 --- a/ext/SymbolicRegressionSymbolicUtilsExt.jl +++ b/ext/SymbolicRegressionSymbolicUtilsExt.jl @@ -1,8 +1,10 @@ module LaSRSymbolicUtilsExt using SymbolicUtils: Symbolic -using LibraryAugmentedSymbolicRegression: AbstractExpressionNode, AbstractExpression, Node, Options -using LibraryAugmentedSymbolicRegression.MLJInterfaceModule: AbstractSRRegressor, get_options +using LibraryAugmentedSymbolicRegression: + AbstractExpressionNode, AbstractExpression, Node, Options +using LibraryAugmentedSymbolicRegression.MLJInterfaceModule: + AbstractSRRegressor, get_options using DynamicExpressions: get_tree, get_operators import LibraryAugmentedSymbolicRegression: node_to_symbolic, symbolic_to_node diff --git a/src/Configure.jl b/src/Configure.jl index a256a1ee0..1440ba7ad 100644 --- a/src/Configure.jl +++ b/src/Configure.jl @@ -257,7 +257,9 @@ function test_module_on_workers(procs, options::Options, 
verbosity) for proc in procs push!( futures, - @spawnat proc LibraryAugmentedSymbolicRegression.gen_random_tree(3, options, 5, TEST_TYPE) + @spawnat proc LibraryAugmentedSymbolicRegression.gen_random_tree( + 3, options, 5, TEST_TYPE + ) ) end for future in futures diff --git a/src/LLMFunctions.jl b/src/LLMFunctions.jl index 67ead9a8d..c5bc5b5f0 100644 --- a/src/LLMFunctions.jl +++ b/src/LLMFunctions.jl @@ -893,7 +893,7 @@ function llm_crossover_trees( String(strip(cross_tree_options[1], [' ', '\n', '"', ',', '.', '[', ']'])), options, ) - + llm_recorder(options.llm_options, tree_to_expr(t, options), "crossover") return t, tree2 @@ -934,10 +934,11 @@ function llm_crossover_trees( ) end - recording_str = tree_to_expr(cross_tree1, options) * " && " * tree_to_expr(cross_tree2, options) + recording_str = + tree_to_expr(cross_tree1, options) * " && " * tree_to_expr(cross_tree2, options) llm_recorder(options.llm_options, recording_str, "crossover") return cross_tree1, cross_tree2 end -end \ No newline at end of file +end diff --git a/src/LLMOptions.jl b/src/LLMOptions.jl index 4a9814c10..1dc8aa3a9 100644 --- a/src/LLMOptions.jl +++ b/src/LLMOptions.jl @@ -56,9 +56,7 @@ Base.@kwdef mutable struct LLMOptions prompt_evol::Bool = false api_key::String = "" model::String = "" - api_kwargs::Dict = Dict( - "max_tokens" => 1000 - ) + api_kwargs::Dict = Dict("max_tokens" => 1000) http_kwargs::Dict = Dict("retries" => 3, "readtimeout" => 3600) llm_recorder_dir::String = "lasr_runs/" prompts_dir::String = "prompts/" @@ -95,8 +93,6 @@ function validate_llm_options(options::LLMOptions) end end - - # """Sample LLM mutation, given the weightings.""" # function sample_llm_mutation(w::LLMWeights) # weights = convert(Vector, w) @@ -105,8 +101,6 @@ end end # module - - # sample invocation following: # python -m experiments.main --use_llm --use_prompt_evol --model "meta-llama/Meta-Llama-3-8B-Instruct" --api_key "vllm_api.key" --model_url "http://localhost:11440/v1" --exp_idx 0 --dataset_path 
FeynmanEquations.csv --start_idx 0 # options = LLMOptions( @@ -123,4 +117,4 @@ end # module # llm_context="", # var_order=nothing, # idea_threshold=30 -# ) \ No newline at end of file +# ) diff --git a/src/LibraryAugmentedSymbolicRegression.jl b/src/LibraryAugmentedSymbolicRegression.jl index c91d6717a..865316792 100644 --- a/src/LibraryAugmentedSymbolicRegression.jl +++ b/src/LibraryAugmentedSymbolicRegression.jl @@ -922,7 +922,9 @@ function _main_search_loop!( window_size=options.populations * 2 * nout, ) n_iterations = 0 - llm_recorder(options.llm_options, string(div(n_iterations, options.populations)), "n_iterations") + llm_recorder( + options.llm_options, string(div(n_iterations, options.populations)), "n_iterations" + ) worst_members = Vector{PopMember}() while sum(state.cycles_remaining) > 0 kappa += 1 @@ -1135,7 +1137,9 @@ function _main_search_loop!( end ################################################################ end - llm_recorder(options.llm_options, string(div(n_iterations, options.populations)), "n_iterations") + llm_recorder( + options.llm_options, string(div(n_iterations, options.populations)), "n_iterations" + ) return nothing end function _tear_down!(state::SearchState, ropt::RuntimeOptions, options::Options) diff --git a/src/MLJInterface.jl b/src/MLJInterface.jl index 04204d0d9..c86c3da40 100644 --- a/src/MLJInterface.jl +++ b/src/MLJInterface.jl @@ -455,7 +455,9 @@ end function get_equation_strings_for(::LaSRRegressor, trees, options, variable_names) return (t -> string_tree(t, options; variable_names=variable_names)).(trees) end -function get_equation_strings_for(::MultitargetLaSRRegressor, trees, options, variable_names) +function get_equation_strings_for( + ::MultitargetLaSRRegressor, trees, options, variable_names +) return [ (t -> string_tree(t, options; variable_names=variable_names)).(ts) for ts in trees ] diff --git a/src/Mutate.jl b/src/Mutate.jl index f377174f2..761e04d04 100644 --- a/src/Mutate.jl +++ b/src/Mutate.jl @@ -485,11 
+485,17 @@ function crossover_generation( check_constraints(child_tree2, options, curmaxsize, afterSize2) if successful_crossover - recorder_str = tree_to_expr(child_tree1, options) * " && " * tree_to_expr(child_tree2, options) + recorder_str = + tree_to_expr(child_tree1, options) * + " && " * + tree_to_expr(child_tree2, options) llm_recorder(options.llm_options, recorder_str, "crossover") llm_skip = true else - recorder_str = tree_to_expr(child_tree1, options) * " && " * tree_to_expr(child_tree2, options) + recorder_str = + tree_to_expr(child_tree1, options) * + " && " * + tree_to_expr(child_tree2, options) llm_recorder(options.llm_options, recorder_str, "crossover|failed") child_tree1, child_tree2 = crossover_trees(tree1, tree2) end diff --git a/src/RegularizedEvolution.jl b/src/RegularizedEvolution.jl index 913df0d89..b2a08c990 100644 --- a/src/RegularizedEvolution.jl +++ b/src/RegularizedEvolution.jl @@ -93,7 +93,11 @@ function reg_evol_cycle( allstar2 = best_of_sample(pop, running_search_statistics, options) baby1, baby2, crossover_accepted, tmp_num_evals = crossover_generation( - allstar1, allstar2, dataset, curmaxsize, options; + allstar1, + allstar2, + dataset, + curmaxsize, + options; dominating=dominating, idea_database=idea_database, ) diff --git a/src/SingleIteration.jl b/src/SingleIteration.jl index ce420e5b0..ae1b3dad2 100644 --- a/src/SingleIteration.jl +++ b/src/SingleIteration.jl @@ -53,7 +53,7 @@ function s_r_cycle( curmaxsize, running_search_statistics, options, - record, + record; dominating=dominating, idea_database=idea_database, ) diff --git a/test/runtests.jl b/test/runtests.jl index 52a941def..0a484650d 100644 --- a/test/runtests.jl +++ b/test/runtests.jl @@ -178,4 +178,4 @@ end @testitem "LLM Integration tests" tags = [:part3, :llm] begin include("test_lasr_integration.jl") -end \ No newline at end of file +end diff --git a/test/test_expression_derivatives.jl b/test/test_expression_derivatives.jl index 78cfddbef..e9d2a0a89 100644 --- 
a/test/test_expression_derivatives.jl +++ b/test/test_expression_derivatives.jl @@ -37,7 +37,8 @@ end @testitem "Test derivatives during optimization" tags = [:part1] begin using LibraryAugmentedSymbolicRegression - using LibraryAugmentedSymbolicRegression.ConstantOptimizationModule: Evaluator, GradEvaluator + using LibraryAugmentedSymbolicRegression.ConstantOptimizationModule: + Evaluator, GradEvaluator using DynamicExpressions using Zygote: Zygote using Random: MersenneTwister diff --git a/test/test_graph_nodes.jl b/test/test_graph_nodes.jl index 82a612d60..4f3ebb2fa 100644 --- a/test/test_graph_nodes.jl +++ b/test/test_graph_nodes.jl @@ -59,7 +59,8 @@ end @testitem "GraphNode break connection mutation" tags = [:part1] begin using LibraryAugmentedSymbolicRegression - using LibraryAugmentedSymbolicRegression.MutationFunctionsModule: break_random_connection! + using LibraryAugmentedSymbolicRegression.MutationFunctionsModule: + break_random_connection! using Random: MersenneTwister options = Options(; @@ -92,7 +93,8 @@ end @testitem "GraphNode form connection mutation" tags = [:part1] begin using LibraryAugmentedSymbolicRegression - using LibraryAugmentedSymbolicRegression.MutationFunctionsModule: form_random_connection! + using LibraryAugmentedSymbolicRegression.MutationFunctionsModule: + form_random_connection! 
using Random: MersenneTwister options = Options(; diff --git a/test/test_lasr_integration.jl b/test/test_lasr_integration.jl index 0274bc103..f40cfbc52 100644 --- a/test/test_lasr_integration.jl +++ b/test/test_lasr_integration.jl @@ -1,13 +1,13 @@ using LibraryAugmentedSymbolicRegression: LLMOptions, Options # test that we can partially specify LLMOptions -op1 = LLMOptions(active=false) +op1 = LLMOptions(; active=false) @test op1.active == false # test that we can fully specify LLMOptions -op2 = LLMOptions( +op2 = LLMOptions(; active=true, - weights=LLMWeights(llm_mutate=0.5, llm_crossover=0.3, llm_gen_random=0.2), + weights=LLMWeights(; llm_mutate=0.5, llm_crossover=0.3, llm_gen_random=0.2), num_pareto_context=5, prompt_evol=true, prompt_concepts=true, @@ -18,12 +18,14 @@ op2 = LLMOptions( llm_recorder_dir="test/", llm_context="test", var_order=nothing, - idea_threshold=30 + idea_threshold=30, ) @test op2.active == true # test that we can pass LLMOptions to Options -llm_opt = LLMOptions(active=false) -op = Options(; optimizer_options=(iterations=16, f_calls_limit=100, x_tol=1e-16), llm_options=llm_opt) +llm_opt = LLMOptions(; active=false) +op = Options(; + optimizer_options=(iterations=16, f_calls_limit=100, x_tol=1e-16), llm_options=llm_opt +) @test isa(op.llm_options, LLMOptions) println("Passed.") diff --git a/test/test_nested_constraints.jl b/test/test_nested_constraints.jl index 59e89863a..c4527b3ed 100644 --- a/test/test_nested_constraints.jl +++ b/test/test_nested_constraints.jl @@ -34,21 +34,33 @@ tree = cos(exp(Node("x1")) + exp(exp(Node("x1") + exp(exp(exp(Node("x1"))))))) x1 = Node("x1") options = create_options(nothing) tree = cos(cos(x1)) + cos(x1) + exp(cos(x1)) -@test !LibraryAugmentedSymbolicRegression.CheckConstraintsModule.flag_illegal_nests(tree, options) +@test !LibraryAugmentedSymbolicRegression.CheckConstraintsModule.flag_illegal_nests( + tree, options +) options = create_options([cos => [cos => 0]]) -@test 
LibraryAugmentedSymbolicRegression.CheckConstraintsModule.flag_illegal_nests(tree, options) +@test LibraryAugmentedSymbolicRegression.CheckConstraintsModule.flag_illegal_nests( + tree, options +) options = create_options([cos => [cos => 1]]) -@test !LibraryAugmentedSymbolicRegression.CheckConstraintsModule.flag_illegal_nests(tree, options) +@test !LibraryAugmentedSymbolicRegression.CheckConstraintsModule.flag_illegal_nests( + tree, options +) options = create_options([cos => [exp => 0]]) -@test !LibraryAugmentedSymbolicRegression.CheckConstraintsModule.flag_illegal_nests(tree, options) +@test !LibraryAugmentedSymbolicRegression.CheckConstraintsModule.flag_illegal_nests( + tree, options +) options = create_options([exp => [cos => 0]]) -@test LibraryAugmentedSymbolicRegression.CheckConstraintsModule.flag_illegal_nests(tree, options) +@test LibraryAugmentedSymbolicRegression.CheckConstraintsModule.flag_illegal_nests( + tree, options +) options = create_options([(+) => [(+) => 0]]) -@test LibraryAugmentedSymbolicRegression.CheckConstraintsModule.flag_illegal_nests(tree, options) +@test LibraryAugmentedSymbolicRegression.CheckConstraintsModule.flag_illegal_nests( + tree, options +) println("Passed.") diff --git a/test/test_operators.jl b/test/test_operators.jl index 1d3bf614e..384d4c668 100644 --- a/test/test_operators.jl +++ b/test/test_operators.jl @@ -79,7 +79,9 @@ end ], ) for T in types_to_test - @test_nowarn LibraryAugmentedSymbolicRegression.assert_operators_well_defined(T, options) + @test_nowarn LibraryAugmentedSymbolicRegression.assert_operators_well_defined( + T, options + ) end end @@ -90,7 +92,9 @@ end unary_operators=[square, cube, log, log2, log10, log1p, sqrt, acosh, neg], ) for T in types_to_test - @test_nowarn LibraryAugmentedSymbolicRegression.assert_operators_well_defined(T, options) + @test_nowarn LibraryAugmentedSymbolicRegression.assert_operators_well_defined( + T, options + ) end end @@ -115,7 +119,9 @@ end @test_throws "returned an output of 
type" LibraryAugmentedSymbolicRegression.assert_operators_well_defined( Float64, options ) - @test_nowarn LibraryAugmentedSymbolicRegression.assert_operators_well_defined(Float32, options) + @test_nowarn LibraryAugmentedSymbolicRegression.assert_operators_well_defined( + Float32, options + ) end @testset "Turbo mode should be the same" begin diff --git a/test/test_prob_pick_first.jl b/test/test_prob_pick_first.jl index dbd5cc1cc..9b21b7a06 100644 --- a/test/test_prob_pick_first.jl +++ b/test/test_prob_pick_first.jl @@ -38,8 +38,9 @@ for reverse in [false, true] options=options ) best_pop_member = [ - LibraryAugmentedSymbolicRegression.best_of_sample(pop, dummy_running_stats, options).score for - j in 1:100 + LibraryAugmentedSymbolicRegression.best_of_sample( + pop, dummy_running_stats, options + ).score for j in 1:100 ] mean_value = sum(best_pop_member) / length(best_pop_member) diff --git a/test/test_units.jl b/test/test_units.jl index c7f85130e..6046f8d21 100644 --- a/test/test_units.jl +++ b/test/test_units.jl @@ -1,7 +1,8 @@ @testitem "Dimensional analysis" tags = [:part3] begin using LibraryAugmentedSymbolicRegression using LibraryAugmentedSymbolicRegression.InterfaceDynamicQuantitiesModule: get_units - using LibraryAugmentedSymbolicRegression.DimensionalAnalysisModule: violates_dimensional_constraints + using LibraryAugmentedSymbolicRegression.DimensionalAnalysisModule: + violates_dimensional_constraints using DynamicQuantities using DynamicQuantities: DEFAULT_DIM_BASE_TYPE @@ -102,7 +103,8 @@ end @testitem "Search with dimensional constraints" tags = [:part3] begin using LibraryAugmentedSymbolicRegression - using LibraryAugmentedSymbolicRegression.DimensionalAnalysisModule: violates_dimensional_constraints + using LibraryAugmentedSymbolicRegression.DimensionalAnalysisModule: + violates_dimensional_constraints using Random: MersenneTwister rng = MersenneTwister(0) @@ -392,7 +394,8 @@ end @testitem "Dimensionless constants" tags = [:part3] begin using 
LibraryAugmentedSymbolicRegression - using LibraryAugmentedSymbolicRegression.DimensionalAnalysisModule: violates_dimensional_constraints + using LibraryAugmentedSymbolicRegression.DimensionalAnalysisModule: + violates_dimensional_constraints using DynamicQuantities include("utils.jl") @@ -435,9 +438,11 @@ end @testitem "Miscellaneous tests of unit interface" tags = [:part3] begin using LibraryAugmentedSymbolicRegression using DynamicQuantities - using LibraryAugmentedSymbolicRegression.DimensionalAnalysisModule: @maybe_return_call, WildcardQuantity + using LibraryAugmentedSymbolicRegression.DimensionalAnalysisModule: + @maybe_return_call, WildcardQuantity using LibraryAugmentedSymbolicRegression.MLJInterfaceModule: unwrap_units_single - using LibraryAugmentedSymbolicRegression.InterfaceDynamicQuantitiesModule: get_dimensions_type + using LibraryAugmentedSymbolicRegression.InterfaceDynamicQuantitiesModule: + get_dimensions_type using MLJModelInterface: MLJModelInterface as MMI function test_return_call(op::Function, w...) 
From 7b35b3995207c8b18635b2472637c68acb1f67d1 Mon Sep 17 00:00:00 2001
From: Atharva Sehgal
Date: Mon, 16 Sep 2024 12:55:05 -0700
Subject: [PATCH 03/17] update dependency for PromptingTools.jl

---
 Project.toml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Project.toml b/Project.toml
index d9765a717..a79bf5906 100644
--- a/Project.toml
+++ b/Project.toml
@@ -64,7 +64,7 @@ Pkg = "<0.0.1, 1"
 PrecompileTools = "1"
 Printf = "<0.0.1, 1"
 ProgressBars = "~1.4, ~1.5"
-PromptingTools = "0.54.0"
+PromptingTools = "~0.54, ~0.55"
 Random = "<0.0.1, 1"
 Reexport = "1"
 SpecialFunctions = "0.10.1, 1, 2"
From 1b7ca81e8e027589b401149214a0a64c98e6303e Mon Sep 17 00:00:00 2001
From: Atharva Sehgal
Date: Mon, 16 Sep 2024 13:12:29 -0700
Subject: [PATCH 04/17] update extensions

---
 ext/{SymbolicRegressionEnzymeExt.jl => LaSREnzymeExt.jl} | 0
 ext/{SymbolicRegressionJSON3Ext.jl => LaSRJSON3Ext.jl} | 0
 ...licRegressionSymbolicUtilsExt.jl => LaSRSymbolicUtilsExt.jl} | 0
 src/Population.jl | 2 +-
 4 files changed, 1 insertion(+), 1 deletion(-)
 rename ext/{SymbolicRegressionEnzymeExt.jl => LaSREnzymeExt.jl} (100%)
 rename ext/{SymbolicRegressionJSON3Ext.jl => LaSRJSON3Ext.jl} (100%)
 rename ext/{SymbolicRegressionSymbolicUtilsExt.jl => LaSRSymbolicUtilsExt.jl} (100%)

diff --git a/ext/SymbolicRegressionEnzymeExt.jl b/ext/LaSREnzymeExt.jl
similarity index 100%
rename from ext/SymbolicRegressionEnzymeExt.jl
rename to ext/LaSREnzymeExt.jl
diff --git a/ext/SymbolicRegressionJSON3Ext.jl b/ext/LaSRJSON3Ext.jl
similarity index 100%
rename from ext/SymbolicRegressionJSON3Ext.jl
rename to ext/LaSRJSON3Ext.jl
diff --git a/ext/SymbolicRegressionSymbolicUtilsExt.jl b/ext/LaSRSymbolicUtilsExt.jl
similarity index 100%
rename from ext/SymbolicRegressionSymbolicUtilsExt.jl
rename to ext/LaSRSymbolicUtilsExt.jl
diff --git a/src/Population.jl b/src/Population.jl
index 547c8b81e..e12514e09 100644
--- a/src/Population.jl
+++ b/src/Population.jl
@@ -41,7 +41,7 @@ end

 Create random population and score
them on the dataset. """ -function Population( +@unstable function Population( dataset::Dataset{T,L}; options::Options, population_size=nothing, From ed996fdf5658e08377855bdcd79357fa16d3cd6c Mon Sep 17 00:00:00 2001 From: Atharva Sehgal Date: Mon, 16 Sep 2024 14:22:17 -0700 Subject: [PATCH 05/17] adding dispatch doctor ignore statements --- Project.toml | 2 +- src/Mutate.jl | 5 +- src/Population.jl | 14 +- src/SearchUtils.jl | 2 +- src/SymbolicRegression.jl | 1249 ------------------------------------- 5 files changed, 15 insertions(+), 1257 deletions(-) delete mode 100644 src/SymbolicRegression.jl diff --git a/Project.toml b/Project.toml index a79bf5906..03a45cd78 100644 --- a/Project.toml +++ b/Project.toml @@ -71,7 +71,7 @@ SpecialFunctions = "0.10.1, 1, 2" StatsBase = "0.33, 0.34" SymbolicUtils = "0.19, ^1.0.5, 2, 3" TOML = "<0.0.1, 1" -julia = "1.6" +julia = "^1.6" [extras] Enzyme = "7da242da-08ed-463a-9acd-ee780be4f1d9" diff --git a/src/Mutate.jl b/src/Mutate.jl index 761e04d04..496b8a337 100644 --- a/src/Mutate.jl +++ b/src/Mutate.jl @@ -1,5 +1,6 @@ module MutateModule +using DispatchDoctor: @unstable using DynamicExpressions: AbstractExpressionNode, AbstractExpression, @@ -94,7 +95,7 @@ end # Go through one simulated options.annealing mutation cycle # exp(-delta/T) defines probability of accepting a change -function next_generation( +@unstable function next_generation( dataset::D, member::P, temperature, @@ -430,7 +431,7 @@ function next_generation( end """Generate a generation via crossover of two members.""" -function crossover_generation( +@unstable function crossover_generation( member1::P, member2::P, dataset::D, diff --git a/src/Population.jl b/src/Population.jl index e12514e09..abd8d0937 100644 --- a/src/Population.jl +++ b/src/Population.jl @@ -26,7 +26,13 @@ function Population(pop::Vector{<:PopMember}) return Population(pop, size(pop, 1)) end -function gen_random_tree_pop(nlength, options, nfeatures, T, idea_database) +@unstable function 
gen_random_tree_pop( + nlength::Int, + options::Options, + nfeatures::Int, + ::Type{T}, + idea_database::Union{Vector{String},String,Nothing}, +) where {T<:DATA_TYPE} if options.llm_options.active && (rand() < options.llm_options.weights.llm_gen_random) gen_llm_random_tree(nlength, options, nfeatures, T, idea_database) else @@ -37,9 +43,9 @@ end """ Population(dataset::Dataset{T,L}; population_size, nlength::Int=3, options::Options, - nfeatures::Int) + nfeatures::Int, idea_database::Vector{String}) -Create random population and score them on the dataset. +Create random population with LLM and RNG and score them on the dataset. """ @unstable function Population( dataset::Dataset{T,L}; @@ -48,7 +54,7 @@ Create random population and score them on the dataset. nlength::Int=3, nfeatures::Int, npop=nothing, - idea_database=nothing, + idea_database::Union{Vector{String},String,Nothing}=nothing, ) where {T,L} @assert (population_size !== nothing) ⊻ (npop !== nothing) population_size = if npop === nothing diff --git a/src/SearchUtils.jl b/src/SearchUtils.jl index d8f5bd382..ff57e0eeb 100644 --- a/src/SearchUtils.jl +++ b/src/SearchUtils.jl @@ -125,7 +125,7 @@ macro sr_spawner(expr, kws...) 
end |> esc end -function init_dummy_pops( +@unstable function init_dummy_pops( npops::Int, datasets::Vector{D}, options::Options ) where {T,L,D<:Dataset{T,L}} prototype = Population( diff --git a/src/SymbolicRegression.jl b/src/SymbolicRegression.jl deleted file mode 100644 index d08d49780..000000000 --- a/src/SymbolicRegression.jl +++ /dev/null @@ -1,1249 +0,0 @@ -module LibraryAugmentedSymbolicRegression - -# Types -export Population, - PopMember, - HallOfFame, - Options, - Dataset, - MutationWeights, - LLMWeights, - LLMOptions, - Node, - GraphNode, - ParametricNode, - Expression, - ParametricExpression, - StructuredExpression, - NodeSampler, - AbstractExpression, - AbstractExpressionNode, - LaSRRegressor, - MultitargetLaSRRegressor, - LOSS_TYPE, - DATA_TYPE, - - #Functions: - equation_search, - s_r_cycle, - calculate_pareto_frontier, - count_nodes, - compute_complexity, - @parse_expression, - parse_expression, - print_tree, - string_tree, - eval_tree_array, - eval_diff_tree_array, - eval_grad_tree_array, - differentiable_eval_tree_array, - set_node!, - copy_node, - node_to_symbolic, - node_type, - symbolic_to_node, - simplify_tree!, - tree_mapreduce, - combine_operators, - gen_random_tree, - gen_random_tree_fixed_size, - @extend_operators, - get_tree, - get_contents, - get_metadata, - - #Operators - plus, - sub, - mult, - square, - cube, - pow, - safe_pow, - safe_log, - safe_log2, - safe_log10, - safe_log1p, - safe_acosh, - safe_sqrt, - neg, - greater, - cond, - relu, - logical_or, - logical_and, - - # special operators - gamma, - erf, - erfc, - atanh_clip - -using Distributed -using Printf: @printf, @sprintf -using PackageExtensionCompat: @require_extensions -using Pkg: Pkg -using TOML: parsefile -using Random: seed!, shuffle! 
-using Reexport -using DynamicExpressions: - Node, - GraphNode, - ParametricNode, - Expression, - ParametricExpression, - StructuredExpression, - NodeSampler, - AbstractExpression, - AbstractExpressionNode, - @parse_expression, - parse_expression, - copy_node, - set_node!, - string_tree, - print_tree, - count_nodes, - get_constants, - get_scalar_constants, - set_constants!, - set_scalar_constants!, - index_constants, - NodeIndex, - eval_tree_array, - differentiable_eval_tree_array, - eval_diff_tree_array, - eval_grad_tree_array, - node_to_symbolic, - symbolic_to_node, - combine_operators, - simplify_tree!, - tree_mapreduce, - set_default_variable_names!, - node_type, - get_tree, - get_contents, - get_metadata -using DynamicExpressions: with_type_parameters -@reexport using LossFunctions: - MarginLoss, - DistanceLoss, - SupervisedLoss, - ZeroOneLoss, - LogitMarginLoss, - PerceptronLoss, - HingeLoss, - L1HingeLoss, - L2HingeLoss, - SmoothedL1HingeLoss, - ModifiedHuberLoss, - L2MarginLoss, - ExpLoss, - SigmoidLoss, - DWDMarginLoss, - LPDistLoss, - L1DistLoss, - L2DistLoss, - PeriodicLoss, - HuberLoss, - EpsilonInsLoss, - L1EpsilonInsLoss, - L2EpsilonInsLoss, - LogitDistLoss, - QuantileLoss, - LogCoshLoss - -# https://discourse.julialang.org/t/how-to-find-out-the-version-of-a-package-from-its-module/37755/15 -const PACKAGE_VERSION = try - root = pkgdir(@__MODULE__) - if root == String - let project = parsefile(joinpath(root, "Project.toml")) - VersionNumber(project["version"]) - end - else - VersionNumber(0, 0, 0) - end -catch - VersionNumber(0, 0, 0) -end - -function deprecate_varmap(variable_names, varMap, func_name) - if varMap !== nothing - Base.depwarn("`varMap` is deprecated; use `variable_names` instead", func_name) - @assert variable_names === nothing "Cannot pass both `varMap` and `variable_names`" - variable_names = varMap - end - return variable_names -end - -using DispatchDoctor: @stable - -@stable default_mode = "disable" begin - include("Utils.jl") - 
include("InterfaceDynamicQuantities.jl") - include("Core.jl") - include("InterfaceDynamicExpressions.jl") - include("Recorder.jl") - include("Complexity.jl") - include("DimensionalAnalysis.jl") - include("CheckConstraints.jl") - include("AdaptiveParsimony.jl") - include("MutationFunctions.jl") - include("LLMFunctions.jl") - include("LossFunctions.jl") - include("PopMember.jl") - include("ConstantOptimization.jl") - include("Population.jl") - include("HallOfFame.jl") - include("Mutate.jl") - include("RegularizedEvolution.jl") - include("SingleIteration.jl") - include("ProgressBars.jl") - include("Migration.jl") - include("SearchUtils.jl") - include("ExpressionBuilder.jl") -end - -using .CoreModule: - MAX_DEGREE, - BATCH_DIM, - FEATURE_DIM, - DATA_TYPE, - LOSS_TYPE, - RecordType, - Dataset, - Options, - MutationWeights, - LLMOptions, - LLMWeights, - plus, - sub, - mult, - square, - cube, - pow, - safe_pow, - safe_log, - safe_log2, - safe_log10, - safe_log1p, - safe_sqrt, - safe_acosh, - neg, - greater, - cond, - relu, - logical_or, - logical_and, - gamma, - erf, - erfc, - atanh_clip, - create_expression -using .UtilsModule: is_anonymous_function, recursive_merge, json3_write, @ignore -using .ComplexityModule: compute_complexity -using .CheckConstraintsModule: check_constraints -using .AdaptiveParsimonyModule: - RunningSearchStatistics, update_frequencies!, move_window!, normalize_frequencies! -using .MutationFunctionsModule: - gen_random_tree, - gen_random_tree_fixed_size, - random_node, - random_node_and_parent, - crossover_trees -using .LLMFunctionsModule: update_idea_database - -using .InterfaceDynamicExpressionsModule: @extend_operators -using .LossFunctionsModule: eval_loss, score_func, update_baseline_loss! -using .PopMemberModule: PopMember, reset_birth! 
-using .PopulationModule: Population, best_sub_pop, record_population, best_of_sample -using .HallOfFameModule: - HallOfFame, calculate_pareto_frontier, string_dominating_pareto_curve -using .SingleIterationModule: s_r_cycle, optimize_and_simplify_population -using .ProgressBarsModule: WrappedProgressBar -using .RecorderModule: @recorder, find_iteration_from_record -using .MigrationModule: migrate! -using .SearchUtilsModule: - SearchState, - RuntimeOptions, - WorkerAssignments, - DefaultWorkerOutputType, - assign_next_worker!, - get_worker_output_type, - extract_from_worker, - @sr_spawner, - StdinReader, - watch_stream, - close_reader!, - check_for_user_quit, - check_for_loss_threshold, - check_for_timeout, - check_max_evals, - ResourceMonitor, - record_channel_state!, - estimate_work_fraction, - update_progress_bar!, - print_search_state, - init_dummy_pops, - load_saved_hall_of_fame, - load_saved_population, - construct_datasets, - save_to_file, - get_cur_maxsize, - update_hall_of_fame! -using .ExpressionBuilderModule: embed_metadata, strip_metadata - -@stable default_mode = "disable" begin - include("deprecates.jl") - include("Configure.jl") -end - -""" - equation_search(X, y[; kws...]) - -Perform a distributed equation search for functions `f_i` which -describe the mapping `f_i(X[:, j]) ≈ y[i, j]`. Options are -configured using LibraryAugmentedSymbolicRegression.Options(...), -which should be passed as a keyword argument to options. -One can turn off parallelism with `numprocs=0`, -which is useful for debugging and profiling. - -# Arguments -- `X::AbstractMatrix{T}`: The input dataset to predict `y` from. - The first dimension is features, the second dimension is rows. -- `y::Union{AbstractMatrix{T}, AbstractVector{T}}`: The values to predict. The first dimension - is the output feature to predict with each equation, and the - second dimension is rows. -- `niterations::Int=10`: The number of iterations to perform the search. 
- More iterations will improve the results. -- `weights::Union{AbstractMatrix{T}, AbstractVector{T}, Nothing}=nothing`: Optionally - weight the loss for each `y` by this value (same shape as `y`). -- `options::Options=Options()`: The options for the search, such as - which operators to use, evolution hyperparameters, etc. -- `variable_names::Union{Vector{String}, Nothing}=nothing`: The names - of each feature in `X`, which will be used during printing of equations. -- `display_variable_names::Union{Vector{String}, Nothing}=variable_names`: Names - to use when printing expressions during the search, but not when saving - to an equation file. -- `y_variable_names::Union{String,AbstractVector{String},Nothing}=nothing`: The - names of each output feature in `y`, which will be used during printing - of equations. -- `parallelism=:multithreading`: What parallelism mode to use. - The options are `:multithreading`, `:multiprocessing`, and `:serial`. - By default, multithreading will be used. Multithreading uses less memory, - but multiprocessing can handle multi-node compute. If using `:multithreading` - mode, the number of threads available to julia are used. If using - `:multiprocessing`, `numprocs` processes will be created dynamically if - `procs` is unset. If you have already allocated processes, pass them - to the `procs` argument and they will be used. - You may also pass a string instead of a symbol, like `"multithreading"`. -- `numprocs::Union{Int, Nothing}=nothing`: The number of processes to use, - if you want `equation_search` to set this up automatically. By default - this will be `4`, but can be any number (you should pick a number <= - the number of cores available). -- `procs::Union{Vector{Int}, Nothing}=nothing`: If you have set up - a distributed run manually with `procs = addprocs()` and `@everywhere`, - pass the `procs` to this keyword argument. 
-- `addprocs_function::Union{Function, Nothing}=nothing`: If using multiprocessing - (`parallelism=:multithreading`), and are not passing `procs` manually, - then they will be allocated dynamically using `addprocs`. However, - you may also pass a custom function to use instead of `addprocs`. - This function should take a single positional argument, - which is the number of processes to use, as well as the `lazy` keyword argument. - For example, if set up on a slurm cluster, you could pass - `addprocs_function = addprocs_slurm`, which will set up slurm processes. -- `heap_size_hint_in_bytes::Union{Int,Nothing}=nothing`: On Julia 1.9+, you may set the `--heap-size-hint` - flag on Julia processes, recommending garbage collection once a process - is close to the recommended size. This is important for long-running distributed - jobs where each process has an independent memory, and can help avoid - out-of-memory errors. By default, this is set to `Sys.free_memory() / numprocs`. -- `runtests::Bool=true`: Whether to run (quick) tests before starting the - search, to see if there will be any problems during the equation search - related to the host environment. -- `saved_state=nothing`: If you have already - run `equation_search` and want to resume it, pass the state here. - To get this to work, you need to have set return_state=true, - which will cause `equation_search` to return the state. The second - element of the state is the regular return value with the hall of fame. - Note that you cannot change the operators or dataset, but most other options - should be changeable. -- `return_state::Union{Bool, Nothing}=nothing`: Whether to return the - state of the search for warm starts. By default this is false. -- `loss_type::Type=Nothing`: If you would like to use a different type - for the loss than for the data you passed, specify the type here. - Note that if you pass complex data `::Complex{L}`, then the loss - type will automatically be set to `L`. 
-- `verbosity`: Whether to print debugging statements or not. -- `progress`: Whether to use a progress bar output. Only available for - single target output. -- `X_units::Union{AbstractVector,Nothing}=nothing`: The units of the dataset, - to be used for dimensional constraints. For example, if `X_units=["kg", "m"]`, - then the first feature will have units of kilograms, and the second will - have units of meters. -- `y_units=nothing`: The units of the output, to be used for dimensional constraints. - If `y` is a matrix, then this can be a vector of units, in which case - each element corresponds to each output feature. - -# Returns -- `hallOfFame::HallOfFame`: The best equations seen during the search. - hallOfFame.members gives an array of `PopMember` objects, which - have their tree (equation) stored in `.tree`. Their score (loss) - is given in `.score`. The array of `PopMember` objects - is enumerated by size from `1` to `options.maxsize`. -""" -function equation_search( - X::AbstractMatrix{T}, - y::AbstractMatrix{T}; - niterations::Int=10, - weights::Union{AbstractMatrix{T},AbstractVector{T},Nothing}=nothing, - options::Options=Options(), - variable_names::Union{AbstractVector{String},Nothing}=nothing, - display_variable_names::Union{AbstractVector{String},Nothing}=variable_names, - y_variable_names::Union{String,AbstractVector{String},Nothing}=nothing, - parallelism=:multithreading, - numprocs::Union{Int,Nothing}=nothing, - procs::Union{Vector{Int},Nothing}=nothing, - addprocs_function::Union{Function,Nothing}=nothing, - heap_size_hint_in_bytes::Union{Integer,Nothing}=nothing, - runtests::Bool=true, - saved_state=nothing, - return_state::Union{Bool,Nothing,Val}=nothing, - loss_type::Type{L}=Nothing, - verbosity::Union{Integer,Nothing}=nothing, - progress::Union{Bool,Nothing}=nothing, - X_units::Union{AbstractVector,Nothing}=nothing, - y_units=nothing, - extra::NamedTuple=NamedTuple(), - v_dim_out::Val{DIM_OUT}=Val(nothing), - # Deprecated: - 
multithreaded=nothing, - varMap=nothing, -) where {T<:DATA_TYPE,L,DIM_OUT} - if multithreaded !== nothing - error( - "`multithreaded` is deprecated. Use the `parallelism` argument instead. " * - "Choose one of :multithreaded, :multiprocessing, or :serial.", - ) - end - variable_names = deprecate_varmap(variable_names, varMap, :equation_search) - - if weights !== nothing - @assert length(weights) == length(y) - weights = reshape(weights, size(y)) - end - - datasets = construct_datasets( - X, - y, - weights, - variable_names, - display_variable_names, - y_variable_names, - X_units, - y_units, - extra, - L, - ) - - return equation_search( - datasets; - niterations=niterations, - options=options, - parallelism=parallelism, - numprocs=numprocs, - procs=procs, - addprocs_function=addprocs_function, - heap_size_hint_in_bytes=heap_size_hint_in_bytes, - runtests=runtests, - saved_state=saved_state, - return_state=return_state, - verbosity=verbosity, - progress=progress, - v_dim_out=Val(DIM_OUT), - ) -end - -function equation_search( - X::AbstractMatrix{T1}, y::AbstractMatrix{T2}; kw... -) where {T1<:DATA_TYPE,T2<:DATA_TYPE} - U = promote_type(T1, T2) - return equation_search( - convert(AbstractMatrix{U}, X), convert(AbstractMatrix{U}, y); kw... - ) -end - -function equation_search( - X::AbstractMatrix{T1}, y::AbstractVector{T2}; kw... -) where {T1<:DATA_TYPE,T2<:DATA_TYPE} - return equation_search(X, reshape(y, (1, size(y, 1))); kw..., v_dim_out=Val(1)) -end - -function equation_search(dataset::Dataset; kws...) 
- return equation_search([dataset]; kws..., v_dim_out=Val(1)) -end - -function equation_search( - datasets::Vector{D}; - niterations::Int=10, - options::Options=Options(), - parallelism=:multithreading, - numprocs::Union{Int,Nothing}=nothing, - procs::Union{Vector{Int},Nothing}=nothing, - addprocs_function::Union{Function,Nothing}=nothing, - heap_size_hint_in_bytes::Union{Integer,Nothing}=nothing, - runtests::Bool=true, - saved_state=nothing, - return_state::Union{Bool,Nothing,Val}=nothing, - verbosity::Union{Int,Nothing}=nothing, - progress::Union{Bool,Nothing}=nothing, - v_dim_out::Val{DIM_OUT}=Val(nothing), -) where {DIM_OUT,T<:DATA_TYPE,L<:LOSS_TYPE,D<:Dataset{T,L}} - concurrency = if parallelism in (:multithreading, "multithreading") - :multithreading - elseif parallelism in (:multiprocessing, "multiprocessing") - :multiprocessing - elseif parallelism in (:serial, "serial") - :serial - else - error( - "Invalid parallelism mode: $parallelism. " * - "You must choose one of :multithreading, :multiprocessing, or :serial.", - ) - :serial - end - not_distributed = concurrency in (:multithreading, :serial) - not_distributed && - procs !== nothing && - error( - "`procs` should not be set when using `parallelism=$(parallelism)`. Please use `:multiprocessing`.", - ) - not_distributed && - numprocs !== nothing && - error( - "`numprocs` should not be set when using `parallelism=$(parallelism)`. Please use `:multiprocessing`.", - ) - - _return_state = if return_state isa Val - first(typeof(return_state).parameters) - else - if options.return_state === Val(nothing) - return_state === nothing ? false : return_state - else - @assert( - return_state === nothing, - "You cannot set `return_state` in both the `Options` and in the passed arguments." - ) - first(typeof(options.return_state).parameters) - end - end - - dim_out = if DIM_OUT === nothing - length(datasets) > 1 ? 
2 : 1 - else - DIM_OUT - end - _numprocs::Int = if numprocs === nothing - if procs === nothing - 4 - else - length(procs) - end - else - if procs === nothing - numprocs - else - @assert length(procs) == numprocs - numprocs - end - end - - _verbosity = if verbosity === nothing && options.verbosity === nothing - 1 - elseif verbosity === nothing && options.verbosity !== nothing - options.verbosity - elseif verbosity !== nothing && options.verbosity === nothing - verbosity - else - error( - "You cannot set `verbosity` in both the search parameters `Options` and the call to `equation_search`.", - ) - 1 - end - _progress::Bool = if progress === nothing && options.progress === nothing - (_verbosity > 0) && length(datasets) == 1 - elseif progress === nothing && options.progress !== nothing - options.progress - elseif progress !== nothing && options.progress === nothing - progress - else - error( - "You cannot set `progress` in both the search parameters `Options` and the call to `equation_search`.", - ) - false - end - - _addprocs_function = addprocs_function === nothing ? addprocs : addprocs_function - - exeflags = if VERSION >= v"1.9" && concurrency == :multiprocessing - heap_size_hint_in_megabytes = floor( - Int, ( - if heap_size_hint_in_bytes === nothing - (Sys.free_memory() / _numprocs) - else - heap_size_hint_in_bytes - end - ) / 1024^2 - ) - _verbosity > 0 && - heap_size_hint_in_bytes === nothing && - @info "Automatically setting `--heap-size-hint=$(heap_size_hint_in_megabytes)M` on each Julia process. You can configure this with the `heap_size_hint_in_bytes` parameter." 
- - `--heap-size=$(heap_size_hint_in_megabytes)M` - else - `` - end - - # Underscores here mean that we have mutated the variable - return _equation_search( - datasets, - RuntimeOptions(; - niterations=niterations, - total_cycles=options.populations * niterations, - numprocs=_numprocs, - init_procs=procs, - addprocs_function=_addprocs_function, - exeflags=exeflags, - runtests=runtests, - verbosity=_verbosity, - progress=_progress, - parallelism=Val(concurrency), - dim_out=Val(dim_out), - return_state=Val(_return_state), - ), - options, - saved_state, - ) -end - -@noinline function _equation_search( - datasets::Vector{D}, ropt::RuntimeOptions, options::Options, saved_state -) where {D<:Dataset} - # PROMPT EVOLUTION - idea_database_all = [Vector{String}() for j in 1:length(datasets)] - - _validate_options(datasets, ropt, options) - state = _create_workers(datasets, ropt, options) - _initialize_search!(state, datasets, ropt, options, saved_state, idea_database_all) - _warmup_search!(state, datasets, ropt, options, idea_database_all) - _main_search_loop!(state, datasets, ropt, options, idea_database_all) - _tear_down!(state, ropt, options) - return _format_output(state, datasets, ropt, options) -end - -function _validate_options( - datasets::Vector{D}, ropt::RuntimeOptions, options::Options -) where {T,L,D<:Dataset{T,L}} - example_dataset = first(datasets) - nout = length(datasets) - @assert nout >= 1 - @assert (nout == 1 || ropt.dim_out == 2) - @assert options.populations >= 1 - if ropt.progress - @assert(nout == 1, "You cannot display a progress bar for multi-output searches.") - @assert(ropt.verbosity > 0, "You cannot display a progress bar with `verbosity=0`.") - end - if options.node_type <: GraphNode && ropt.verbosity > 0 - @warn "The `GraphNode` interface and mutation operators are experimental and will change in future versions." 
- end - if ropt.runtests - test_option_configuration(ropt.parallelism, datasets, options, ropt.verbosity) - test_dataset_configuration(example_dataset, options, ropt.verbosity) - end - for dataset in datasets - update_baseline_loss!(dataset, options) - end - if options.define_helper_functions - set_default_variable_names!(first(datasets).variable_names) - end - if options.seed !== nothing - seed!(options.seed) - end - return nothing -end -@stable default_mode = "disable" function _create_workers( - datasets::Vector{D}, ropt::RuntimeOptions, options::Options -) where {T,L,D<:Dataset{T,L}} - stdin_reader = watch_stream(stdin) - - record = RecordType() - @recorder record["options"] = "$(options)" - - nout = length(datasets) - example_dataset = first(datasets) - example_ex = create_expression(zero(T), options, example_dataset) - NT = typeof(example_ex) - PopType = Population{T,L,NT} - HallOfFameType = HallOfFame{T,L,NT} - WorkerOutputType = get_worker_output_type( - Val(ropt.parallelism), PopType, HallOfFameType - ) - ChannelType = ropt.parallelism == :multiprocessing ? RemoteChannel : Channel - - # Pointers to populations on each worker: - worker_output = Vector{WorkerOutputType}[WorkerOutputType[] for j in 1:nout] - # Initialize storage for workers - tasks = [Task[] for j in 1:nout] - # Set up a channel to send finished populations back to head node - channels = [[ChannelType(1) for i in 1:(options.populations)] for j in 1:nout] - (procs, we_created_procs) = if ropt.parallelism == :multiprocessing - configure_workers(; - procs=ropt.init_procs, - ropt.numprocs, - ropt.addprocs_function, - options, - project_path=splitdir(Pkg.project().path)[1], - file=@__FILE__, - ropt.exeflags, - ropt.verbosity, - example_dataset, - ropt.runtests, - ) - else - Int[], false - end - # Get the next worker process to give a job: - worker_assignment = WorkerAssignments() - # Randomly order which order to check populations: - # This is done so that we do work on all nout equally. 
- task_order = [(j, i) for j in 1:nout for i in 1:(options.populations)] - shuffle!(task_order) - - # Persistent storage of last-saved population for final return: - last_pops = init_dummy_pops(options.populations, datasets, options) - # Best 10 members from each population for migration: - best_sub_pops = init_dummy_pops(options.populations, datasets, options) - # TODO: Should really be one per population too. - all_running_search_statistics = [ - RunningSearchStatistics(; options=options) for j in 1:nout - ] - # Records the number of evaluations: - # Real numbers indicate use of batching. - num_evals = [[0.0 for i in 1:(options.populations)] for j in 1:nout] - - halls_of_fame = Vector{HallOfFameType}(undef, nout) - - cycles_remaining = [ropt.total_cycles for j in 1:nout] - cur_maxsizes = [ - get_cur_maxsize(; options, ropt.total_cycles, cycles_remaining=cycles_remaining[j]) - for j in 1:nout - ] - - return SearchState{T,L,typeof(example_ex),WorkerOutputType,ChannelType}(; - procs=procs, - we_created_procs=we_created_procs, - worker_output=worker_output, - tasks=tasks, - channels=channels, - worker_assignment=worker_assignment, - task_order=task_order, - halls_of_fame=halls_of_fame, - last_pops=last_pops, - best_sub_pops=best_sub_pops, - all_running_search_statistics=all_running_search_statistics, - num_evals=num_evals, - cycles_remaining=cycles_remaining, - cur_maxsizes=cur_maxsizes, - stdin_reader=stdin_reader, - record=Ref(record), - ) -end -function _initialize_search!( - state::SearchState{T,L,N}, - datasets, - ropt::RuntimeOptions, - options::Options, - saved_state, - idea_database_all, -) where {T,L,N} - nout = length(datasets) - - init_hall_of_fame = load_saved_hall_of_fame(saved_state) - if init_hall_of_fame === nothing - for j in 1:nout - state.halls_of_fame[j] = HallOfFame(options, datasets[j]) - end - else - # Recompute losses for the hall of fame, in - # case the dataset changed: - for j in eachindex(init_hall_of_fame, datasets, state.halls_of_fame) - 
hof = strip_metadata(init_hall_of_fame[j], options, datasets[j]) - for member in hof.members[hof.exists] - score, result_loss = score_func(datasets[j], member, options) - member.score = score - member.loss = result_loss - end - state.halls_of_fame[j] = hof - end - end - - for j in 1:nout, i in 1:(options.populations) - worker_idx = assign_next_worker!( - state.worker_assignment; out=j, pop=i, parallelism=ropt.parallelism, state.procs - ) - saved_pop = load_saved_population(saved_state; out=j, pop=i) - new_pop = - if saved_pop !== nothing && length(saved_pop.members) == options.population_size - _saved_pop = strip_metadata(saved_pop, options, datasets[j]) - ## Update losses: - for member in _saved_pop.members - score, result_loss = score_func(datasets[j], member, options) - member.score = score - member.loss = result_loss - end - copy_pop = copy(_saved_pop) - @sr_spawner( - begin - (copy_pop, HallOfFame(options, datasets[j]), RecordType(), 0.0) - end, - parallelism = ropt.parallelism, - worker_idx = worker_idx - ) - else - if saved_pop !== nothing && ropt.verbosity > 0 - @warn "Recreating population (output=$(j), population=$(i)), as the saved one doesn't have the correct number of members." 
- end - @sr_spawner( - begin - ( - Population( - datasets[j]; - population_size=options.population_size, - nlength=3, - options=options, - nfeatures=datasets[j].nfeatures, - idea_database=idea_database_all[j], - ), - HallOfFame(options, datasets[j]), - RecordType(), - Float64(options.population_size), - ) - end, - parallelism = ropt.parallelism, - worker_idx = worker_idx - ) - # This involves population_size evaluations, on the full dataset: - end - push!(state.worker_output[j], new_pop) - end - return nothing -end -function _warmup_search!( - state::SearchState{T,L,N}, - datasets, - ropt::RuntimeOptions, - options::Options, - idea_database_all, -) where {T,L,N} - nout = length(datasets) - for j in 1:nout, i in 1:(options.populations) - dataset = datasets[j] - running_search_statistics = state.all_running_search_statistics[j] - cur_maxsize = state.cur_maxsizes[j] - @recorder state.record[]["out$(j)_pop$(i)"] = RecordType() - worker_idx = assign_next_worker!( - state.worker_assignment; out=j, pop=i, parallelism=ropt.parallelism, state.procs - ) - - # TODO - why is this needed?? - # Multi-threaded doesn't like to fetch within a new task: - c_rss = deepcopy(running_search_statistics) - last_pop = state.worker_output[j][i] - updated_pop = @sr_spawner( - begin - in_pop = first( - extract_from_worker(last_pop, Population{T,L,N}, HallOfFame{T,L,N}) - ) - _dispatch_s_r_cycle( - in_pop, - dataset, - options; - pop=i, - out=j, - iteration=0, - ropt.verbosity, - cur_maxsize, - running_search_statistics=c_rss, - idea_database=idea_database_all[j], - )::DefaultWorkerOutputType{Population{T,L,N},HallOfFame{T,L,N}} - end, - parallelism = ropt.parallelism, - worker_idx = worker_idx - ) - state.worker_output[j][i] = updated_pop - end - return nothing -end -function _main_search_loop!( - state::SearchState{T,L,N}, - datasets, - ropt::RuntimeOptions, - options::Options, - idea_database_all, -) where {T,L,N} - ropt.verbosity > 0 && @info "Started!" 
- nout = length(datasets) - start_time = time() - if ropt.progress - #TODO: need to iterate this on the max cycles remaining! - sum_cycle_remaining = sum(state.cycles_remaining) - progress_bar = WrappedProgressBar( - 1:sum_cycle_remaining; width=options.terminal_width - ) - end - last_print_time = time() - last_speed_recording_time = time() - num_evals_last = sum(sum, state.num_evals) - num_evals_since_last = sum(sum, state.num_evals) - num_evals_last # i.e., start at 0 - print_every_n_seconds = 5 - equation_speed = Float32[] - - if ropt.parallelism in (:multiprocessing, :multithreading) - for j in 1:nout, i in 1:(options.populations) - # Start listening for each population to finish: - t = @async put!(state.channels[j][i], fetch(state.worker_output[j][i])) - push!(state.tasks[j], t) - end - end - kappa = 0 - resource_monitor = ResourceMonitor(; - # Storing n times as many monitoring intervals as populations seems like it will - # help get accurate resource estimates: - max_recordings=options.populations * 100 * nout, - start_reporting_at=options.populations * 3 * nout, - window_size=options.populations * 2 * nout, - ) - n_iterations = 0 - if options.llm_options.active - open(options.llm_options.llm_recorder_dir * "n_iterations.txt", "a") do file - write(file, "- " * string(div(n_iterations, options.populations)) * "\n") - end - end - worst_members = Vector{PopMember}() - while sum(state.cycles_remaining) > 0 - kappa += 1 - if kappa > options.populations * nout - kappa = 1 - end - # nout, populations: - j, i = state.task_order[kappa] - idea_database = idea_database_all[j] - - # Check if error on population: - if ropt.parallelism in (:multiprocessing, :multithreading) - if istaskfailed(state.tasks[j][i]) - fetch(state.tasks[j][i]) - error("Task failed for population") - end - end - # Non-blocking check if a population is ready: - population_ready = if ropt.parallelism in (:multiprocessing, :multithreading) - # TODO: Implement type assertions based on parallelism. 
- isready(state.channels[j][i]) - else - true - end - record_channel_state!(resource_monitor, population_ready) - - # Don't start more if this output has finished its cycles: - # TODO - this might skip extra cycles? - population_ready &= (state.cycles_remaining[j] > 0) - if population_ready - if n_iterations % options.populations == 0 - worst_members = Vector{PopMember}() - end - n_iterations += 1 - # Take the fetch operation from the channel since its ready - (cur_pop, best_seen, cur_record, cur_num_evals) = if ropt.parallelism in - ( - :multiprocessing, :multithreading - ) - take!( - state.channels[j][i] - ) - else - state.worker_output[j][i] - end::DefaultWorkerOutputType{Population{T,L,N},HallOfFame{T,L,N}} - state.last_pops[j][i] = copy(cur_pop) - state.best_sub_pops[j][i] = best_sub_pop(cur_pop; topn=options.topn) - @recorder state.record[] = recursive_merge(state.record[], cur_record) - state.num_evals[j][i] += cur_num_evals - dataset = datasets[j] - cur_maxsize = state.cur_maxsizes[j] - - worst_member = nothing - for member in cur_pop.members - if worst_member == nothing || worst_member.loss < member.loss - worst_member = member - end - size = compute_complexity(member, options) - update_frequencies!(state.all_running_search_statistics[j]; size) - end - if worst_member != nothing && worst_member.loss > 100 # if the worst of population is good then thats still good to keep - push!(worst_members, worst_member) - end - #! format: off - update_hall_of_fame!(state.halls_of_fame[j], cur_pop.members, options) - update_hall_of_fame!(state.halls_of_fame[j], best_seen.members[best_seen.exists], options) - #! 
format: on - - # Dominating pareto curve - must be better than all simpler equations - dominating = calculate_pareto_frontier(state.halls_of_fame[j]) - if options.llm_options.active && - options.llm_options.prompt_evol && - (n_iterations % options.populations == 0) - update_idea_database(idea_database, dominating, worst_members, options) - end - - if options.save_to_file - save_to_file(dominating, nout, j, dataset, options) - end - ################################################################### - # Migration ####################################################### - if options.migration - best_of_each = Population([ - member for pop in state.best_sub_pops[j] for member in pop.members - ]) - migrate!( - best_of_each.members => cur_pop, options; frac=options.fraction_replaced - ) - end - if options.hof_migration && length(dominating) > 0 - migrate!(dominating => cur_pop, options; frac=options.fraction_replaced_hof) - end - ################################################################### - - state.cycles_remaining[j] -= 1 - if state.cycles_remaining[j] == 0 - break - end - worker_idx = assign_next_worker!( - state.worker_assignment; - out=j, - pop=i, - parallelism=ropt.parallelism, - state.procs, - ) - iteration = if options.use_recorder - key = "out$(j)_pop$(i)" - find_iteration_from_record(key, state.record[]) + 1 - else - 0 - end - - c_rss = deepcopy(state.all_running_search_statistics[j]) - in_pop = copy(cur_pop::Population{T,L,N}) - state.worker_output[j][i] = @sr_spawner( - begin - _dispatch_s_r_cycle( - in_pop, - dataset, - options; - pop=i, - out=j, - iteration, - ropt.verbosity, - cur_maxsize, - running_search_statistics=c_rss, - dominating=dominating, - idea_database=idea_database, - ) - end, - parallelism = ropt.parallelism, - worker_idx = worker_idx - ) - if ropt.parallelism in (:multiprocessing, :multithreading) - state.tasks[j][i] = @async put!( - state.channels[j][i], fetch(state.worker_output[j][i]) - ) - end - - state.cur_maxsizes[j] = 
get_cur_maxsize(; - options, ropt.total_cycles, cycles_remaining=state.cycles_remaining[j] - ) - move_window!(state.all_running_search_statistics[j]) - if ropt.progress - head_node_occupation = estimate_work_fraction(resource_monitor) - update_progress_bar!( - progress_bar, - only(state.halls_of_fame), - only(datasets), - options, - equation_speed, - head_node_occupation, - ropt.parallelism, - ) - end - end - yield() - - ################################################################ - ## Search statistics - elapsed_since_speed_recording = time() - last_speed_recording_time - if elapsed_since_speed_recording > 1.0 - num_evals_since_last, num_evals_last = let s = sum(sum, state.num_evals) - s - num_evals_last, s - end - current_speed = num_evals_since_last / elapsed_since_speed_recording - push!(equation_speed, current_speed) - average_over_m_measurements = 20 # 20 second running average - if length(equation_speed) > average_over_m_measurements - deleteat!(equation_speed, 1) - end - last_speed_recording_time = time() - end - ################################################################ - - ################################################################ - ## Printing code - elapsed = time() - last_print_time - # Update if time has passed - if elapsed > print_every_n_seconds - if ropt.verbosity > 0 && !ropt.progress && length(equation_speed) > 0 - - # Dominating pareto curve - must be better than all simpler equations - head_node_occupation = estimate_work_fraction(resource_monitor) - print_search_state( - state.halls_of_fame, - datasets; - options, - equation_speed, - ropt.total_cycles, - state.cycles_remaining, - head_node_occupation, - parallelism=ropt.parallelism, - width=options.terminal_width, - ) - end - last_print_time = time() - end - ################################################################ - - ################################################################ - ## Early stopping code - if any(( - check_for_loss_threshold(state.halls_of_fame, 
options), - check_for_user_quit(state.stdin_reader), - check_for_timeout(start_time, options), - check_max_evals(state.num_evals, options), - )) - break - end - ################################################################ - end - if options.llm_options.active - open(options.llm_options.llm_recorder_dir * "n_iterations.txt", "a") do file - write(file, "- " * string(div(n_iterations, options.populations)) * "\n") - end - end - return nothing -end -function _tear_down!(state::SearchState, ropt::RuntimeOptions, options::Options) - close_reader!(state.stdin_reader) - # Safely close all processes or threads - if ropt.parallelism == :multiprocessing - state.we_created_procs && rmprocs(state.procs) - elseif ropt.parallelism == :multithreading - nout = length(state.worker_output) - for j in 1:nout, i in eachindex(state.worker_output[j]) - wait(state.worker_output[j][i]) - end - end - @recorder json3_write(state.record[], options.recorder_file) - return nothing -end -function _format_output( - state::SearchState, datasets, ropt::RuntimeOptions, options::Options -) - nout = length(datasets) - out_hof = if ropt.dim_out == 1 - embed_metadata(only(state.halls_of_fame), options, only(datasets)) - else - map(j -> embed_metadata(state.halls_of_fame[j], options, datasets[j]), 1:nout) - end - if ropt.return_state - return ( - map(j -> embed_metadata(state.last_pops[j], options, datasets[j]), 1:nout), - out_hof, - ) - else - return out_hof - end -end - -@stable default_mode = "disable" function _dispatch_s_r_cycle( - in_pop::Population{T,L,N}, - dataset::Dataset, - options::Options; - pop::Int, - out::Int, - iteration::Int, - verbosity, - cur_maxsize::Int, - running_search_statistics, - dominating=nothing, - idea_database=nothing, -) where {T,L,N} - record = RecordType() - @recorder record["out$(out)_pop$(pop)"] = RecordType( - "iteration$(iteration)" => record_population(in_pop, options) - ) - num_evals = 0.0 - normalize_frequencies!(running_search_statistics) - out_pop, 
best_seen, evals_from_cycle = s_r_cycle(
-        dataset,
-        in_pop,
-        options.ncycles_per_iteration,
-        cur_maxsize,
-        running_search_statistics;
-        verbosity=verbosity,
-        options=options,
-        record=record,
-        dominating=dominating,
-        idea_database=idea_database,
-    )
-    num_evals += evals_from_cycle
-    out_pop, evals_from_optimize = optimize_and_simplify_population(
-        dataset, out_pop, options, cur_maxsize, record
-    )
-    num_evals += evals_from_optimize
-    if options.batching
-        for i_member in 1:(options.maxsize + MAX_DEGREE)
-            score, result_loss = score_func(dataset, best_seen.members[i_member], options)
-            best_seen.members[i_member].score = score
-            best_seen.members[i_member].loss = result_loss
-            num_evals += 1
-        end
-    end
-    return (out_pop, best_seen, record, num_evals)
-end
-
-include("MLJInterface.jl")
-using .MLJInterfaceModule: LaSRRegressor, MultitargetLaSRRegressor
-
-function __init__()
-    @require_extensions
-end
-
-# Hack to get static analysis to work from within tests:
-@ignore include("../test/runtests.jl")
-
-# TODO: Hack to force ConstructionBase version
-using ConstructionBase: ConstructionBase as _
-
-include("precompile.jl")
-redirect_stdout(devnull) do
-    redirect_stderr(devnull) do
-        do_precompilation(Val(:precompile))
-    end
-end
-
-end #module SR
From 88f0868fbebe3af54bbf2791f91b7d075b9ba91a Mon Sep 17 00:00:00 2001
From: Atharva Sehgal
Date: Mon, 16 Sep 2024 22:25:59 +0000
Subject: [PATCH 06/17] more dispatch doctor ignore statements

---
 src/LibraryAugmentedSymbolicRegression.jl | 2 +-
 src/Population.jl                         | 4 ++--
 test/test_deterministic.jl                | 1 +
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/src/LibraryAugmentedSymbolicRegression.jl b/src/LibraryAugmentedSymbolicRegression.jl
index 865316792..27e99b5bb 100644
--- a/src/LibraryAugmentedSymbolicRegression.jl
+++ b/src/LibraryAugmentedSymbolicRegression.jl
@@ -762,7 +762,7 @@ function _initialize_search!(
     ropt::RuntimeOptions,
     options::Options,
     saved_state,
-    idea_database_all,
+
idea_database_all::Vector{Vector{String}},
 ) where {T,L,N}
     nout = length(datasets)
 
diff --git a/src/Population.jl b/src/Population.jl
index abd8d0937..75d0b75c2 100644
--- a/src/Population.jl
+++ b/src/Population.jl
@@ -31,7 +31,7 @@ end
     options::Options,
     nfeatures::Int,
     ::Type{T},
-    idea_database::Union{Vector{String},String,Nothing},
+    idea_database::Union{Vector{String},Nothing},
 ) where {T<:DATA_TYPE}
     if options.llm_options.active && (rand() < options.llm_options.weights.llm_gen_random)
         gen_llm_random_tree(nlength, options, nfeatures, T, idea_database)
@@ -54,7 +54,7 @@ Create random population with LLM and RNG and score them on the dataset.
     nlength::Int=3,
     nfeatures::Int,
     npop=nothing,
-    idea_database::Union{Vector{String},String,Nothing}=nothing,
+    idea_database::Union{Vector{String},Nothing}=nothing,
 ) where {T,L}
     @assert (population_size !== nothing) ⊻ (npop !== nothing)
     population_size = if npop === nothing
diff --git a/test/test_deterministic.jl b/test/test_deterministic.jl
index deee1e8f7..395e022af 100644
--- a/test/test_deterministic.jl
+++ b/test/test_deterministic.jl
@@ -31,3 +31,4 @@ for i in 1:2
 end
 
 @test string(all_outputs[1]) == string(all_outputs[2])
+w
\ No newline at end of file
From b90bbd3f84ca3f822f161c9d8a0eb8f5d4c4648e Mon Sep 17 00:00:00 2001
From: Atharva Sehgal
Date: Mon, 16 Sep 2024 22:28:39 +0000
Subject: [PATCH 07/17] typo

---
 test/test_deterministic.jl | 1 -
 1 file changed, 1 deletion(-)

diff --git a/test/test_deterministic.jl b/test/test_deterministic.jl
index 395e022af..deee1e8f7 100644
--- a/test/test_deterministic.jl
+++ b/test/test_deterministic.jl
@@ -31,4 +31,3 @@ for i in 1:2
 end
 
 @test string(all_outputs[1]) == string(all_outputs[2])
-w
\ No newline at end of file
From 8b8a0a7931b9e0ffdf042a4ba44a080cd0f541d0 Mon Sep 17 00:00:00 2001
From: Atharva Sehgal
Date: Mon, 16 Sep 2024 22:31:39 +0000
Subject: [PATCH 08/17] bumping Julia version

---
 .github/workflows/CI.yml | 4 ++--
 Project.toml             | 2 +-
 2 files changed, 3
insertions(+), 3 deletions(-)

diff --git a/.github/workflows/CI.yml b/.github/workflows/CI.yml
index cb1c9ea86..dcf5a1a8b 100644
--- a/.github/workflows/CI.yml
+++ b/.github/workflows/CI.yml
@@ -30,8 +30,8 @@ jobs:
           - "part2"
           - "part3"
         julia-version:
-          - "1.6"
-          - "1.8"
+          - "1.9"
+          - "1.10"
           - "1"
         os:
           - ubuntu-latest
diff --git a/Project.toml b/Project.toml
index 03a45cd78..45165978a 100644
--- a/Project.toml
+++ b/Project.toml
@@ -71,7 +71,7 @@ SpecialFunctions = "0.10.1, 1, 2"
 StatsBase = "0.33, 0.34"
 SymbolicUtils = "0.19, ^1.0.5, 2, 3"
 TOML = "<0.0.1, 1"
-julia = "^1.6"
+julia = "1.9,1.10"
 
 [extras]
 Enzyme = "7da242da-08ed-463a-9acd-ee780be4f1d9"
From a742a3bac7d3a6790d08a41eb244d763567ec5de Mon Sep 17 00:00:00 2001
From: Atharva Sehgal
Date: Mon, 16 Sep 2024 22:46:12 +0000
Subject: [PATCH 09/17] bump project numbers

---
 Project.toml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Project.toml b/Project.toml
index 45165978a..58f101ba0 100644
--- a/Project.toml
+++ b/Project.toml
@@ -52,7 +52,7 @@ Distributed = "<0.0.1, 1"
 DynamicExpressions = "0.19.3"
 DynamicQuantities = "0.10, 0.11, 0.12, 0.13, 0.14, 1"
 Enzyme = "0.12"
-JSON = "0.21.4"
+JSON = "0.21"
 JSON3 = "1"
 LineSearches = "7"
 LossFunctions = "0.10, 0.11"
@@ -64,7 +64,7 @@ Pkg = "<0.0.1, 1"
 PrecompileTools = "1"
 Printf = "<0.0.1, 1"
 ProgressBars = "~1.4, ~1.5"
-PromptingTools = "~0.54, ~0.55"
+PromptingTools = "0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.40, 0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.50, 0.51, 0.52, 0.53, 0.54"
 Random = "<0.0.1, 1"
 Reexport = "1"
 SpecialFunctions = "0.10.1, 1, 2"
From 8f3432e1a92c08870b3954e2940fb042299d94ad Mon Sep 17 00:00:00 2001
From: Atharva Sehgal
Date: Mon, 16 Sep 2024 22:46:39 +0000
Subject: [PATCH 10/17] Revert "bumping Julia version"

This reverts commit 8b8a0a7931b9e0ffdf042a4ba44a080cd0f541d0.
---
 .github/workflows/CI.yml | 4 ++--
 Project.toml             | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/.github/workflows/CI.yml b/.github/workflows/CI.yml
index dcf5a1a8b..cb1c9ea86 100644
--- a/.github/workflows/CI.yml
+++ b/.github/workflows/CI.yml
@@ -30,8 +30,8 @@ jobs:
           - "part2"
           - "part3"
         julia-version:
-          - "1.9"
-          - "1.10"
+          - "1.6"
+          - "1.8"
           - "1"
         os:
           - ubuntu-latest
diff --git a/Project.toml b/Project.toml
index 58f101ba0..45d832d9f 100644
--- a/Project.toml
+++ b/Project.toml
@@ -71,7 +71,7 @@ SpecialFunctions = "0.10.1, 1, 2"
 StatsBase = "0.33, 0.34"
 SymbolicUtils = "0.19, ^1.0.5, 2, 3"
 TOML = "<0.0.1, 1"
-julia = "1.9,1.10"
+julia = "^1.6"
 
 [extras]
 Enzyme = "7da242da-08ed-463a-9acd-ee780be4f1d9"
From b6b0ad149f2ad25b8d299bda898321edbed47084 Mon Sep 17 00:00:00 2001
From: Atharva Sehgal
Date: Mon, 16 Sep 2024 22:47:13 +0000
Subject: [PATCH 11/17] bump project numbers

---
 Project.toml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Project.toml b/Project.toml
index 45d832d9f..29671dc29 100644
--- a/Project.toml
+++ b/Project.toml
@@ -71,7 +71,7 @@ SpecialFunctions = "0.10.1, 1, 2"
 StatsBase = "0.33, 0.34"
 SymbolicUtils = "0.19, ^1.0.5, 2, 3"
 TOML = "<0.0.1, 1"
-julia = "^1.6"
+julia = "1.6"
 
 [extras]
 Enzyme = "7da242da-08ed-463a-9acd-ee780be4f1d9"
From 63ff7e8b63fdee791413ce7b5753430365090b80 Mon Sep 17 00:00:00 2001
From: Atharva Sehgal
Date: Mon, 16 Sep 2024 16:20:19 -0700
Subject: [PATCH 12/17] bumping Julia version to 1.11

---
 .github/workflows/CI.yml | 3 +--
 Project.toml             | 4 ++--
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/.github/workflows/CI.yml b/.github/workflows/CI.yml
index cb1c9ea86..c11a80b0b 100644
--- a/.github/workflows/CI.yml
+++ b/.github/workflows/CI.yml
@@ -30,8 +30,7 @@ jobs:
           - "part2"
           - "part3"
         julia-version:
-          - "1.6"
-          - "1.8"
+          - "1.11"
           - "1"
         os:
           - ubuntu-latest
diff --git a/Project.toml b/Project.toml
index 29671dc29..abade9939 100644
--- a/Project.toml
+++ b/Project.toml
@@
-64,14 +64,14 @@ Pkg = "<0.0.1, 1"
 PrecompileTools = "1"
 Printf = "<0.0.1, 1"
 ProgressBars = "~1.4, ~1.5"
-PromptingTools = "0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.40, 0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.50, 0.51, 0.52, 0.53, 0.54"
+PromptingTools = "0.53, 0.54"
 Random = "<0.0.1, 1"
 Reexport = "1"
 SpecialFunctions = "0.10.1, 1, 2"
 StatsBase = "0.33, 0.34"
 SymbolicUtils = "0.19, ^1.0.5, 2, 3"
 TOML = "<0.0.1, 1"
-julia = "1.6"
+julia = "1.11"
 
 [extras]
 Enzyme = "7da242da-08ed-463a-9acd-ee780be4f1d9"
From 5efac1be7206719e903d2dfdbf80e3fa415a1385 Mon Sep 17 00:00:00 2001
From: Atharva Sehgal
Date: Mon, 16 Sep 2024 16:26:32 -0700
Subject: [PATCH 13/17] bumping Julia to 1.10

---
 .github/workflows/CI.yml | 2 +-
 Project.toml             | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/CI.yml b/.github/workflows/CI.yml
index c11a80b0b..5e1e61f3c 100644
--- a/.github/workflows/CI.yml
+++ b/.github/workflows/CI.yml
@@ -30,7 +30,7 @@ jobs:
           - "part2"
           - "part3"
         julia-version:
-          - "1.11"
+          - "1.10"
           - "1"
         os:
           - ubuntu-latest
diff --git a/Project.toml b/Project.toml
index abade9939..bc4879131 100644
--- a/Project.toml
+++ b/Project.toml
@@ -71,7 +71,7 @@ SpecialFunctions = "0.10.1, 1, 2"
 StatsBase = "0.33, 0.34"
 SymbolicUtils = "0.19, ^1.0.5, 2, 3"
 TOML = "<0.0.1, 1"
-julia = "1.11"
+julia = "1.10"
 
 [extras]
 Enzyme = "7da242da-08ed-463a-9acd-ee780be4f1d9"
From c8d3afabda9a36c93f7464be0e2776040ed9ebb2 Mon Sep 17 00:00:00 2001
From: Atharva Sehgal
Date: Tue, 17 Sep 2024 00:54:18 +0000
Subject: [PATCH 14/17] Resolve corner cases LLMFunctions

---
 README.md           |  4 ++--
 src/LLMFunctions.jl | 11 ++++++++---
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index dec25e782..59c95222b 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,11 @@
-LibraryAugmentedSymbolicRegression.jl accelerates the search for symbolic expressions using library learning.
+LibraryAugmentedSymbolicRegression.jl (LaSR.jl) accelerates the search for symbolic expressions using library learning.
 
 | Latest release | Website | Forums | Paper |
 | :---: | :---: | :---: | :---: |
-| [![version](https://juliahub.com/docs/LaSR/version.svg)](https://juliahub.com/ui/Packages/LaSR/X2eIS) | [![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://trishullab.github.io/lasr-web/) | [![Discussions](https://img.shields.io/badge/discussions-github-informational)](https://github.com/trishullab/LibraryAugmentedSymbolicRegression.jl/discussions) | [![Paper](https://img.shields.io/badge/arXiv-????.?????-b31b1b)](https://atharvas.net/static/lasr.pdf) |
+| [![version](https://juliahub.com/docs/LibraryAugmentedSymbolicRegression/version.svg)](https://juliahub.com/ui/Packages/LibraryAugmentedSymbolicRegression/X2eIS) | [![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://trishullab.github.io/lasr-web/) | [![Discussions](https://img.shields.io/badge/discussions-github-informational)](https://github.com/trishullab/LibraryAugmentedSymbolicRegression.jl/discussions) | [![Paper](https://img.shields.io/badge/arXiv-????.?????-b31b1b)](https://atharvas.net/static/lasr.pdf) |
 
 | Build status | Coverage |
 | :---: | :---: |
diff --git a/src/LLMFunctions.jl b/src/LLMFunctions.jl
index c5bc5b5f0..2557af19e 100644
--- a/src/LLMFunctions.jl
+++ b/src/LLMFunctions.jl
@@ -21,7 +21,7 @@ using DynamicExpressions: AbstractOperatorEnum
 using Compat: Returns, @inline
 
 using ..CoreModule: Options, DATA_TYPE, binopmap, unaopmap, LLMOptions
-using ..MutationFunctionsModule: gen_random_tree_fixed_size
+using ..MutationFunctionsModule: gen_random_tree_fixed_size, random_node_and_parent
 
 using PromptingTools:
     SystemMessage,
@@ -58,12 +58,13 @@ function convertDict(d)::NamedTuple
 end
 
 function get_vars(options::Options)::String
-    variable_names = ["x", "y", "z",
"k", "j", "l", "m", "n", "p", "a", "b"] - if !isnothing(options.llm_options.var_order) + if !isnothing(options.llm_options) && !isnothing(options.llm_options.var_order) variable_names = [ options.llm_options.var_order[key] for key in sort(collect(keys(options.llm_options.var_order))) ] + else + variable_names = ["x", "y", "z", "k", "j", "l", "m", "n", "p", "a", "b"] end return join(variable_names, ", ") end @@ -105,6 +106,7 @@ function construct_prompt( # if n_occurrences is less than |element_list|, add the missing elements after the last occurrence if n_occurrences < length(element_list) last_occurrence = findlast(x -> occursin(pattern, x), lines) + @assert last_occurrence !== nothing "No occurrences of the element_id_tag found in the user prompt." for i in reverse((n_occurrences + 1):length(element_list)) new_line = replace(lines[last_occurrence], string(n_occurrences) => string(i)) insert!(lines, last_occurrence + 1, new_line) @@ -544,6 +546,9 @@ function parse_msg_content(msg_content) try out = parse(content) # json parse + if out === nothing + return [] + end if out isa Dict return [out[key] for key in keys(out)] end From 6e7e8d4c611df2079aa6a06c01b4c55791017858 Mon Sep 17 00:00:00 2001 From: Atharva Sehgal Date: Tue, 17 Sep 2024 02:56:40 +0000 Subject: [PATCH 15/17] update README --- README.md | 35 ++++++++++++++++++----------------- 1 file changed, 18 insertions(+), 17 deletions(-) diff --git a/README.md b/README.md index 59c95222b..ff899406f 100644 --- a/README.md +++ b/README.md @@ -5,7 +5,7 @@ LibraryAugmentedSymbolicRegression.jl (LaSR.jl) accelerates the search for symbo | Latest release | Website | Forums | Paper | | :---: | :---: | :---: | :---: | -| [![version](https://juliahub.com/docs/LibraryAugmentedSymbolicRegression/version.svg)](https://juliahub.com/ui/Packages/LibraryAugmentedSymbolicRegression/X2eIS) | [![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://trishullab.github.io/lasr-web/) | 
[![Discussions](https://img.shields.io/badge/discussions-github-informational)](https://github.com/trishullab/LibraryAugmentedSymbolicRegression.jl/discussions) | [![Paper](https://img.shields.io/badge/arXiv-????.?????-b31b1b)](https://atharvas.net/static/lasr.pdf) |
+| [![version](https://juliahub.com/docs/LibraryAugmentedSymbolicRegression/version.svg)](https://juliahub.com/ui/Packages/LibraryAugmentedSymbolicRegression/X2eIS) | [![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://trishullab.github.io/lasr-web/) | [![Discussions](https://img.shields.io/badge/discussions-github-informational)](https://github.com/trishullab/LibraryAugmentedSymbolicRegression.jl/discussions) | [![Paper](https://img.shields.io/badge/arXiv-2409.09359-b31b1b)](https://arxiv.org/abs/2409.09359) |
 
 | Build status | Coverage |
 | :---: | :---: |
@@ -14,18 +14,33 @@ LibraryAugmentedSymbolicRegression.jl (LaSR.jl) accelerates the search for symbo
 LaSR is integrated with [SymbolicRegression.jl](https://github.com/MilesCranmer/SymbolicRegression.jl).
 Check out [PySR](https://github.com/MilesCranmer/PySR) for a Python frontend.
 
-[Cite this software](https://arxiv.org/abs/????.?????)
+[Cite this software](https://arxiv.org/abs/2409.09359)
 **Contents**:
 
-- [Quickstart](#quickstart)
 - [Benchmarking](#benchmarking)
+- [Quickstart](#quickstart)
 - [Organization](#organization)
 - [LLM Utilities](#llm-utilities)
 
+## Benchmarking
+
+If you'd like to compare with LaSR, we've archived the code used in the paper in the `lasr-experiments` branch. Clone this repository and run:
+```bash
+$ git switch lasr-experiments
+```
+to switch to the branch and follow the instructions in the README to reproduce our results. This directory contains the code for evaluating LaSR on the
+
+- [x] Feynman Equations dataset
+- [x] Synthetic equations dataset
+  - [x] and generation code
+- [x] Bigbench experiments
+  - [x] and evaluation code
+
+
 ## Quickstart
 
 Install in Julia with:
@@ -129,20 +144,6 @@ llm_options = LLMOptions(
 )
 ```
 
-## Benchmarking
-
-If you wish to compare against LaSR, we've archived the code we used to run LaSR on top of PySR and SymbolicRegression.jl in the `lasr-experiments` branch. Run
-```bash
-$ git switch lasr-experiments
-```
-and follow the instructions in the README to reproduce our results. This directory contains the code for evaluating LaSR on the
-
-- [x] Feynman Equations dataset
-- [x] Synthetic equations dataset
-  - [x] and generation code
-- [x] Bigbench experiments
-  - [x] and evaluation code
-
 ## Organization
 
 LibraryAugmentedSymbolicRegression.jl development is kept independent from the main codebase. However, to ensure LaSR can be used easily, it is integrated into SymbolicRegression.jl via the [`ext/SymbolicRegressionLaSRExt`](https://www.example.com) extension module. This, in turn, is loaded into PySR.
This cartoon summarizes the interaction between the different packages:
From e09afeabc566df74a45a1864ce9c9746235dc591 Mon Sep 17 00:00:00 2001
From: Atharva Sehgal
Date: Tue, 17 Sep 2024 03:01:46 +0000
Subject: [PATCH 16/17] update README

---
 README.md | 149 ------------------------------------------------------
 1 file changed, 149 deletions(-)

diff --git a/README.md b/README.md
index ff899406f..6cb0bc4b3 100644
--- a/README.md
+++ b/README.md
@@ -154,155 +154,6 @@ LibraryAugmentedSymbolicRegression.jl development is kept independent from the m
 
 > The `ext/SymbolicRegressionLaSRExt` module is not yet available in the released version of SymbolicRegression.jl. It will be available in the release `vX.X.X` of SymbolicRegression.jl.
 
-## Code structure
-
-LibraryAugmentedSymbolicRegression.jl is organized roughly as follows.
-Rounded rectangles indicate objects, and rectangles indicate functions.
-
-> (if you can't see this diagram being rendered, try pasting it into [mermaid-js.github.io/mermaid-live-editor](https://mermaid-js.github.io/mermaid-live-editor))
-
-```mermaid
-flowchart TB
-    op([Options])
-    d([Dataset])
-    op --> ES
-    d --> ES
-    subgraph ES[equation_search]
-        direction TB
-        IP[sr_spawner]
-        IP --> p1
-        IP --> p2
-        subgraph p1[Thread 1]
-            direction LR
-            pop1([Population])
-            pop1 --> src[s_r_cycle]
-            src --> opt[optimize_and_simplify_population]
-            opt --> pop1
-        end
-        subgraph p2[Thread 2]
-            direction LR
-            pop2([Population])
-            pop2 --> src2[s_r_cycle]
-            src2 --> opt2[optimize_and_simplify_population]
-            opt2 --> pop2
-        end
-        pop1 --> hof
-        pop2 --> hof
-        hof([HallOfFame])
-        hof --> migration
-        pop1 <-.-> migration
-        pop2 <-.-> migration
-        migration[migrate!]
-    end
-    ES --> output([HallOfFame])
-```
-
-The `HallOfFame` objects store the expressions with the lowest loss seen at each complexity.
-
-The dependency structure of the code itself is as follows:
-
-```mermaid
-stateDiagram-v2
-    AdaptiveParsimony --> Mutate
-    AdaptiveParsimony --> Population
-    AdaptiveParsimony --> RegularizedEvolution
-    AdaptiveParsimony --> SingleIteration
-    AdaptiveParsimony --> LaSR
-    CheckConstraints --> Mutate
-    CheckConstraints --> LaSR
-    Complexity --> CheckConstraints
-    Complexity --> HallOfFame
-    Complexity --> LossFunctions
-    Complexity --> Mutate
-    Complexity --> Population
-    Complexity --> SearchUtils
-    Complexity --> SingleIteration
-    Complexity --> LaSR
-    ConstantOptimization --> Mutate
-    ConstantOptimization --> SingleIteration
-    Core --> AdaptiveParsimony
-    Core --> CheckConstraints
-    Core --> Complexity
-    Core --> ConstantOptimization
-    Core --> HallOfFame
-    Core --> InterfaceDynamicExpressions
-    Core --> LossFunctions
-    Core --> Migration
-    Core --> Mutate
-    Core --> MutationFunctions
-    Core --> PopMember
-    Core --> Population
-    Core --> Recorder
-    Core --> RegularizedEvolution
-    Core --> SearchUtils
-    Core --> SingleIteration
-    Core --> LaSR
-    Dataset --> Core
-    HallOfFame --> SearchUtils
-    HallOfFame --> SingleIteration
-    HallOfFame --> LaSR
-    InterfaceDynamicExpressions --> LossFunctions
-    InterfaceDynamicExpressions --> LaSR
-    LossFunctions --> ConstantOptimization
-    LossFunctions --> HallOfFame
-    LossFunctions --> Mutate
-    LossFunctions --> PopMember
-    LossFunctions --> Population
-    LossFunctions --> LaSR
-    Migration --> LaSR
-    Mutate --> RegularizedEvolution
-    MutationFunctions --> Mutate
-    MutationFunctions --> Population
-    MutationFunctions --> LaSR
-    Operators --> Core
-    Operators --> Options
-    Options --> Core
-    OptionsStruct --> Core
-    OptionsStruct --> Options
-    PopMember --> ConstantOptimization
-    PopMember --> HallOfFame
-    PopMember --> Migration
-    PopMember --> Mutate
-    PopMember --> Population
-    PopMember --> RegularizedEvolution
-    PopMember --> SingleIteration
-    PopMember --> LaSR
-    Population --> Migration
-    Population --> RegularizedEvolution
-    Population --> SearchUtils
-    Population --> SingleIteration
-    Population --> LaSR
-    ProgramConstants --> Core
-    ProgramConstants --> Dataset
-    ProgressBars --> SearchUtils
-    ProgressBars --> LaSR
-    Recorder --> Mutate
-    Recorder --> RegularizedEvolution
-    Recorder --> SingleIteration
-    Recorder --> LaSR
-    RegularizedEvolution --> SingleIteration
-    SearchUtils --> LaSR
-    SingleIteration --> LaSR
-    Utils --> CheckConstraints
-    Utils --> ConstantOptimization
-    Utils --> Options
-    Utils --> PopMember
-    Utils --> SingleIteration
-    Utils --> LaSR
-```
-
-Bash command to generate dependency structure from `src` directory (requires `vim-stream`):
-
-```bash
-echo 'stateDiagram-v2'
-IFS=$'\n'
-for f in *.jl; do
-    for line in $(cat $f | grep -e 'import \.\.' -e 'import \.'); do
-        echo $(echo $line | vims -s 'dwf:d$' -t '%s/^\.*//g' '%s/Module//g') $(basename "$f" .jl);
-    done;
-done | vims -l 'f a--> ' | sort
-```
-
 ## Search options
 
 Other than `LLMOptions`, We have the same search options as SymbolicRegression.jl.
See https://astroautomata.com/SymbolicRegression.jl/stable/api/#Options
From 9e017c286911e1d92876698c783645e9abf8e01e Mon Sep 17 00:00:00 2001
From: Atharva Sehgal
Date: Tue, 17 Sep 2024 03:03:25 +0000
Subject: [PATCH 17/17] add CITATIONS

---
 CITATION.md | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/CITATION.md b/CITATION.md
index 367d5089d..7272f07a3 100644
--- a/CITATION.md
+++ b/CITATION.md
@@ -31,4 +31,16 @@ To cite symbolic distillation of neural networks, the following BibTeX entry can
 }
 ```
 
-To cite Lang
\ No newline at end of file
+To cite LaSR, please use the following BibTeX entry:
+
+```bibtex
+@misc{grayeli2024symbolicregressionlearnedconcept,
+      title={Symbolic Regression with a Learned Concept Library},
+      author={Arya Grayeli and Atharva Sehgal and Omar Costilla-Reyes and Miles Cranmer and Swarat Chaudhuri},
+      year={2024},
+      eprint={2409.09359},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG},
+      url={https://arxiv.org/abs/2409.09359},
+}
+```
\ No newline at end of file