This repository contains the code for fine-tuning Mistral, Llama2, Baichuan2, and Falcon for geocoding or toponym resolution tasks. The models are fine-tuned to estimate a toponym's unambiguous reference or full address (e.g., city, state, country), from which its geo-coordinates are then obtained via geocoders, as shown in the figure below.
conda create --name myenv python=3.10
conda activate myenv
pip install -r requirements.txt
OUT='lora_weights_save_dir'
mkdir $OUT
python finetune_llm.py \
--data_file "data/training_data.json" \
--R 16 \
--batch 32 \
--ALPHA 16 \
--dropout 0.1 \
--BASE_MODEL "kittn/mistral-7B-v0.1-hf" \
--OUTPUT_DIR=$OUT \
--LEARNING_RATE 3e-3 \
--neftune_noise_alpha 0.1 \
--TRAIN_STEPS 500
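The schema of data/training_data.json is defined by finetune_llm.py; the snippet below is only a hypothetical illustration of what an instruction-style record might look like (field names and content are assumptions), not this project's actual format.
cat <<'EOF'   # hypothetical record; check finetune_llm.py for the real schema
{
  "instruction": "Resolve each toponym in the text to an unambiguous reference (city, state, country).",
  "input": "Heavy rain caused flooding near Paris on Tuesday.",
  "output": "Paris, Texas, US"
}
EOF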
Fine-tuning Baichuan2 uses the official Baichuan2 project. Replace the fine-tune.py file in its fine-tune folder with the fine-tune.py provided in this repository and install that project's dependencies (a setup sketch is given below), then run the following command.
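A minimal setup sketch, assuming the official repository URL and directory layout (adjust the paths to your environment):
git clone https://github.com/baichuan-inc/Baichuan2.git
cp fine-tune.py Baichuan2/fine-tune/fine-tune.py   # overwrite with the version provided in this repository
pip install -r Baichuan2/requirements.txt          # install the Baichuan2 project's dependencies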
OUT='lora_weights_save_dir'
mkdir $OUT
hostfile=""
deepspeed --hostfile=$hostfile fine-tune.py \
--report_to "none" \
--data_path "data/training_data_baichuan.json" \
--model_name_or_path "baichuan-inc/Baichuan2-7B-Base" \
--output_dir $OUT \
--model_max_length 512 \
--per_device_train_batch_size 32 \
--gradient_accumulation_steps 1 \
--max_steps 500 \
--save_strategy 'steps' \
--learning_rate 3e-3 \
--save_steps 2 \
--eval_steps 2 \
--lr_scheduler_type constant \
--adam_beta1 0.9 \
--adam_beta2 0.98 \
--adam_epsilon 1e-8 \
--max_grad_norm 1.0 \
--weight_decay 1e-4 \
--warmup_ratio 0.0 \
--logging_steps 2 \
--gradient_checkpointing True \
--deepspeed ds_config.json \
--bf16 True \
--tf32 True \
--use_lora True
Fine-tuning Falcon uses the lit-gpt project. Install that project's dependencies (a setup sketch is given below), then run the following command.
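One possible setup sketch, assuming the repository URL and the download/convert scripts described in the lit-gpt README (adjust to the lit-gpt version you use):
git clone https://github.com/Lightning-AI/lit-gpt.git
cd lit-gpt
pip install -r requirements.txt
python scripts/download.py --repo_id tiiuae/falcon-7b                                    # download the Falcon weights
python scripts/convert_hf_checkpoint.py --checkpoint_dir checkpoints/tiiuae/falcon-7b    # convert to lit-gpt format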
Out='lora_weights_save_dir'
mkdir $Out
python finetune/lora.py --checkpoint_dir checkpoints/tiiuae/falcon-7b \
--data_dir data/7 \
--out_dir $Out \
--device 1 \
--precision bf16-true
We have provided the fine-tuned models (LoRA weights) for your convenience. The available models are:
Unzip the test_data.zip to the appropriate location.
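For example (the target directory is an assumption; place the files wherever the prediction scripts expect them):
unzip test_data.zip -d data/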
For Mistral, Llama2, and Baichuan2, execute the following code:
BASE_MODEL="kittn/mistral-7B-v0.1-hf"
LORA_WEIGHTS="path_of_the_lora_weights"
python prediction.py \
--load_8bit False \
--base_model "$BASE_MODEL" \
--lora_weights "$LORA_WEIGHTS"
For Falcon, use the lit-gpt project to execute the following code. Ensure that the provided falcon_prediction.py file is placed in the generate folder of the lit-gpt project, and point --lora_path to the saved LoRA checkpoint (e.g., iter-183552-ckpt.pth).
python generate/lora_location.py \
--checkpoint_dir checkpoints/tiiuae/falcon-7b \
--lora_path "path_of_the_lora_weights" \
--top_k 40 \
--max_seq_length 2048
This step estimates the unambiguous reference of toponyms in a text, such as 'Paris, Texas, US' for 'Paris'. By then querying geocoding services such as Nominatim, GeoNames, the ArcGIS API, or a combination of these geocoders, you can determine the geo-coordinates and other properties of the toponym, such as its population and type; a query sketch is given below.
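As an illustration, a resolved reference can be geocoded with Nominatim's public search API (endpoint and parameters as documented by Nominatim; the User-Agent value is a placeholder required by its usage policy):
curl -s -H "User-Agent: toponym-resolution-demo" \
"https://nominatim.openstreetmap.org/search?q=Paris%2C+Texas%2C+US&format=json&limit=1"
# The response is a JSON array whose first element contains "lat" and "lon" for the best match.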
If you use the code or data, please cite the following publication:
@article{hu2024toponym,
title={Toponym resolution leveraging lightweight and open-source large language models and geo-knowledge},
author={Hu, Xuke and Kersten, Jens and Klan, Friederike and Farzana, Sheikh Mastura},
journal={International Journal of Geographical Information Science},
pages={1--28},
year={2024},
publisher={Taylor \& Francis}
}