Named Entity Recogniton (NER) of biological terms in BioSample records using LLMs
See also the documentation by ollama
docker pull ollama/ollama:0.5.4
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:0.5.4
docker exec ollama ollama pull llama3.1:70b
docker network create network_ollama
docker network connect network_ollama ollama
docker pull shikeda/bsllmner:latest
In the extraction mode, the program extracts strings of a specified type from the input json. The detail is defined in the prompt. As a common procedure, an input JSON object, which is given as an item in the JSON list provided as the input, is appended to the last prompt.
docker run --rm --network network_ollama -v `pwd`:/data/ shikeda/bsllmner:latest -m llama3.1:70b -i 5,2,6,7 -v -u http://ollama:11434 extract /data/input.json
-m llama3.1:70b
: Specify LLM model-i 5,2,6,7
: Specify the prompt indices. Each number corresponds to an index number of a prompt defined inbsllmner/prompt/prompt.yaml
. The input to LLM is constructed as the array in this order. If you want to use a customized prompt, you can specify a yaml file with the-p
option.-v
: display progress-u http://ollama:11434
: Specify the URL of ollama serverextract
: Extraction mode/data/input.json
: input json
The input json is like below. For each sample, the accession
attribute is required as the identifier of the sample.
[
{
"accession": "SAMD00123367",
"cell line": "H1299",
"organism": "Homo sapiens",
"sample name": "ATAC-seq_H1299_48h_G11_GSK1210151A (Inhibitor_BET)_0.1",
"title": "ATAC-seq_H1299_48h_G11_GSK1210151A (Inhibitor_BET)_0.1"
},
{
"accession": "SAMD00235411",
"cell line": "SKNO-1",
"organism": "Homo sapiens",
"phenotype": "shRNA_2 against human KDM4B",
"sample name": "SKNO1 4B sh2",
"title": "ATAC-seq 4B sh2"
}
]
Also, the list of json output by the EBI BioSamples API (example) is also available.
[
{
"accession": "SAMD00123367",
"taxId": 9606,
"characteristics": {
"cell line": [{"text": "H1299"}],
"organism": [{"text":"Homo sapiens"}],
"sample name": [{"text": "ATAC-seq_H1299_48h_G11_GSK1210151A (Inhibitor_BET)_0.1"}],
"title": [{"text": "ATAC-seq_H1299_48h_G11_GSK1210151A (Inhibitor_BET)_0.1"}]
}
}
]
Each output file of the API includes a single object. The jq
command can be used to merge the files like: jq -s '.' *json
.
For details of the EBI BioSamples API, please see the document.
The result of the extraction mode is output as json-lines like below. The output_full
attribute contains the raw output of LLM for the sample. The conclusion of LLM is assumed to be JSON format and is output as the output
value.
{"accession": "SAMD00123367", "characteristics": {"cell_line": ["text": "H1299"]}, "output": {"cell_line": "H1299"}, "output_full": "Let's break it down... Therefore, my output will be:\n\n{\"cell_line\": \"H1299\"}", "taxId": 9606}
{"accession": "SAMD00235411", "characteristics": {"cell_line": ["text": "SKNO-1"]}, "output": {"cell_line": "SKNO-1"}, "output_full": "Let's break it down... Here is my output:\n\n{\"cell_line\": \"SKNO-1\"}", "taxId": 9606}
characteristics
and taxId
are used in ontology-mapping with the MetaSRA pipeline. (This output json-lines can be directly used as an input for the pipeline.)
As a result of the ontology mapping process, multiple ontology terms can be found as candidates to represent a single BioSample record. In the selection mode, the program selects a term that is most likely to represent the sample among candidates.
docker run --rm --network network_ollama -v `pwd`:/data/ shikeda/bsllmner:latest -m llama3:8b -i 5,2,6,7,14 -r /data/metasraout.tsv -l /data/llmout.jsonl -u http://ollama:11434 select /data/input.json
-i 5,2,6,7,14
: Specify the prompt indices. Each number corresponds to an index number of a prompt defined inbsllmner/prompt/prompt.yaml
. The input to LLM is constructed as the array in this order. If you want to use a customized prompt, you can specify a yaml file with the-p
option. The last prompt is assumed to describe the selection task. The rest ones are same as indices that were used in the extraction mode.-r /data/metasraout.tsv
: Specify TSV file output by MetaSRA-l /data/llmout.jsonl
: Specify json-lines file output by the extraction mode ofbsllmner
select
: Selection mode/data/input.json
: input json (the same file as the input of the extraction mode)
The result is output as json-lines like below. The output_full
attribute contains the raw output of LLM for the sample. The conclusion of LLM is assumed to be JSON format and is output as the output
value.
{"accession": "SAMN08200557", "output": {"cell_line_id": "CVCL:9773"}, "output_full": "Let's compare each term... Output: `{\"cell_line_id\": \"CVCL:9773\"}`"}
{"accession": "SAMN12541232", "output": {"cell_line_id": "CVCL:7735"}, "output_full": "Let's compare each term... Based on the confidence scores, I would output:\n\n{\"cell_line_id\": \"CVCL:7735\"}"}