FuseMine is a comprehensive pipeline designed to link data across multiple mineral site databases, such as USMIN and MRDS, and mineral site records extracted from reports and tables through tools created by TA2 Inferlink and USC. It facilitates the reconciliation process by utilizing both geospatial data and semantic similarities in textual attributes, such as site name and commodity.
To run FuseMine, ensure the following requirements are met:
- Hardware: A local device with GPU support is recommended.
- Software: Docker must be installed. Follow the installation steps on the official Docker website.
- Memory: At least 10 GB of available memory is suggested (actual requirements depend on the data size).
- Network: An active internet connection is required during execution.
FuseMine simplifies usage by providing shell scripts for common workflows, including processing raw data into schema format and linking mineral sites. These scripts:
- Clone the MinMod data repository (if not already cloned).
- Create and run a Docker container for the specified task.
- Export the resulting data from the container
- Push the data to the data repository.
Ensure you are running these scripts on a system capable of executing Bash scripts. If you would like to run through Docker, use follow the commands under Command (with Docker).
- Clone the repository
git clone https://github.com/DARPA-CRITICALMAAS/umn-ta2-mineral-site-linkage.git
cd umn-ta2-mineral-site-linkage
The data processing step requires a manually curated attribute map file, which maps the headers of the raw structured data to the MinMod KG schema. An example of an attribute map CSV is provided: sample_mapfile.csv. Attributes that FuseMine can map to are listed under 'attribute_label'. If there is no label in the raw data representing certain label (e.g., there is no column representing 'deposit_type_candidate') leave the 'corresponding_attribute_label' empty.
./run_processing.sh <path_to_raw_data> <path_to_attribute_map>
path_to_raw_data
: Path (either file or directory) to the raw mineral site database.path_to_attribute_map
: Path to the CSV file containing label mapping information (refer to sample_mapfile.csv for guidance)
- Build and start the Docker container
docker build -t ta2-processing .
docker run -dit ta2-processing
docker exec -it <container_id> /bin/bash
container_id
: Container ID of the container. Find by runningdocker ps
- Move data into the container
docker cp <path_to_raw_data> <container_id>:/umn-ta2-mineral-site-linkage
docker cp <path_to_attribute_map> <container_id>:/umn-ta2-mineral-site-linkage
path_to_raw_data
: Path (either file or directory) to the raw mineral site database.path_to_attribute_map
: Path to the CSV file containing label mapping information (refer to sample_mapfile.csv for guidance)
- Run the command
poetry run python3 process_data_to_schema.py --raw_data ./<raw_data_filename> --attribute_map ./<attribute_map_filename> --schema_output_filename <file_name>
file_name
: Desired filename of the processed raw data
- Move the data from Docker container to local
exit
docker cp <container_id>:/umn-ta2-mineral-site-linkage/outputs/<file_name>.json <save_path>
save_path
: Path of directory to store the processed raw data
Processing Time: Several hours depending on the size of the raw data.
Before running FuseMine, ensure that all data is available on the MinMod Knowledge Graph (MinMod KG). The last update time and version of MinMod KG can be checked here.
./run_linking.sh <commodity>
commodity
: The specific commodity to focus on. The commodity name must match one of the MRDS commodity names listed here.
- Build Docker container
docker build -t ta2-linking .
docker run -dit ta2-linking
docker exec -it <container_id> /bin/bash
container_id
: Container ID of the created Docker container. Can be found by runningdocker ps
- Run the command
poetry run python3 fusemine.py --commodity <commodity>
commodity
: The specific commodity to focus on. The commodity name must match one of the MRDS commodity names listed here.
- Move data from Docker container to local
exit
docker cp <container_id>:/umn-ta2-mineral-site-linkage/outputs/<commodity>_sameas.csv <save_path>
save_path
: Path of directory to store the same_as links
- Stop Docker container
docker stop <container_id>
FuseMine generates a log file that reports the number of available records on the MinMod KG, the number of identified mineral sites, and the run time. These log files are saved in the ./logs
directory.
Once the data is merged into the main branch, it typically takes 5 to 10 minutes for the changes to be reflected on the KG.