In this section we demonstrate an end-to-end example for
BLS in Python backend. The
model repository
should contain square model. The square model
will send 'n' responses where 'n' is the value of input IN
. For each response,
output OUT
will equal the value of IN
. This example is broken into two
sections. The first section demonstrates how to perform synchronous BLS requests
and the second section shows how to execute asynchronous BLS requests.
The goal of bls_decoupled_sync
model is to calculate the sum of the responses
returned from the square model and return the summation as the final response. The value of input 'IN' will be passed as an input to the
square model which determines how many responses the
square model will generate.
- Create the model repository:
mkdir -p models/bls_decoupled_sync/1
mkdir -p models/square_int32/1
# Copy the Python models
cp examples/bls_decoupled/sync_model.py models/bls_decoupled_sync/1/model.py
cp examples/bls_decoupled/sync_config.pbtxt models/bls_decoupled_sync/config.pbtxt
cp examples/decoupled/square_model.py models/square_int32/1/model.py
cp examples/decoupled/square_config.pbtxt models/square_int32/config.pbtxt
- Start the tritonserver:
tritonserver --model-repository `pwd`/models
- Send inference requests to server:
python3 examples/bls_decoupled/sync_client.py
You should see an output similar to the output below:
==========model result==========
The square value of [4] is [16]
==========model result==========
The square value of [2] is [4]
==========model result==========
The square value of [0] is [0]
==========model result==========
The square value of [1] is [1]
PASS: BLS Decoupled Sync
The sync_model.py model file is heavily commented with explanations about each of the function calls.
The client.py sends 4 inference requests to the
bls_decoupled_sync
model with the input as: [4], [2], [0] and [1]
respectively. In compliance with the behavior of the sync BLS model,
it will expect the output to be the square value of the input.
In this section we explain how to send multiple BLS requests without waiting for their response. Asynchronous execution of BLS requests will not block your model execution and can lead to speedups under certain conditions.
The bls_decoupled_async
model will perform two async BLS requests on the
square model. Then, it will wait until the inference requests
are completed. It will calculate the sum of the output OUT
from the
square model in both two requests to construct the final
inference response object using these tensors.
- Create the model repository:
mkdir -p models/bls_decoupled_async/1
mkdir -p models/square_int32/1
# Copy the Python models
cp examples/bls_decoupled/async_model.py models/bls_decoupled_async/1/model.py
cp examples/bls_decoupled/async_config.pbtxt models/bls_decoupled_async/config.pbtxt
cp examples/decoupled/square_model.py models/square_int32/1/model.py
cp examples/decoupled/square_config.pbtxt models/square_int32/config.pbtxt
- Start the tritonserver:
tritonserver --model-repository `pwd`/models
- Send inference requests to server:
python3 examples/bls_decoupled/async_client.py
You should see an output similar to the output below:
==========model result==========
Two times the square value of [4] is [32]
==========model result==========
Two times the square value of [2] is [8]
==========model result==========
Two times the square value of [0] is [0]
==========model result==========
Two times the square value of [1] is [2]
PASS: BLS Decoupled Async
The async_model.py model file is heavily commented with explanations about each of the function calls.
The client.py sends 4 inference requests to the 'bls_decoupled_sync' model with the input as: [4], [2], [0] and [1] respectively. In compliance with the behavior of sync BLS model model, it will expect the output to be two time the square value of the input.