
Md files need to have only one heading for rst files to show proper titles.

e.g. `Python with HuggingFace <../tutorials/Quick_Deploy/HuggingFaceTransformers/README.md>` will not show as "Python with HuggingFace" in the user guides if the README has multiple headers.
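As an illustrative aside, the single-heading structure this commit moves the README toward can be sketched as follows (the H1 text is an assumption, not taken from this diff; the subheadings mirror the changes below):

```md
# Deploying Hugging Face Transformer Models in Triton

## Next Steps

### Loading Cached Models

### Triton Tool Ecosystem
```

With exactly one top-level heading, Sphinx can take it as the page title when the file is pulled in from an RST toctree; multiple top-level headings split the document and break the title.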
statiraju committed Jan 9, 2025
1 parent 2cb9deb commit 792afc4
Showing 1 changed file with 6 additions and 6 deletions.
12 changes: 6 additions & 6 deletions Quick_Deploy/HuggingFaceTransformers/README.md
@@ -1,5 +1,5 @@
<!--
-# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2023-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -176,10 +176,10 @@ Using this technique you should be able to serve any transformer models supporte
hugging face with Triton.


-# Next Steps
+## Next Steps
The following sections expand on the base tutorial and provide guidance for future sandboxing.

-## Loading Cached Models
+### Loading Cached Models
In the previous steps, we downloaded the falcon-7b model from hugging face when we
launched the Triton server. We can avoid this lengthy download process in subsequent runs
by loading cached models into Triton. By default, the provided `model.py` files will cache
@@ -206,14 +206,14 @@ command from earlier (making sure to replace `${HOME}` with the path to your ass
-v ${HOME}/.cache/huggingface:/root/.cache/huggingface
```
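As an illustrative aside, a complete launch command with the cache mount added might look like the sketch below; the image tag, repository mount, and server arguments are assumptions, not taken from this diff:

```bash
# Hypothetical full command; only the huggingface cache mount is from the diff.
docker run -it --net=host --gpus all --shm-size=1G \
  -v ${PWD}/model_repository:/opt/tritonserver/model_repository \
  -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
  nvcr.io/nvidia/tritonserver:24.12-py3 \
  tritonserver --model-repository=/opt/tritonserver/model_repository
```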

-## Triton Tool Ecosystem
+### Triton Tool Ecosystem
Deploying models in Triton also comes with the benefit of access to a fully-supported suite
of deployment analyzers to help you better understand and tailor your systems to fit your
needs. Triton currently has two options for deployment analysis:
- [Performance Analyzer](https://docs.nvidia.com/deeplearning/triton-inference-server/archives/triton-inference-server-2310/user-guide/docs/user_guide/perf_analyzer.html): An inference performance optimizer.
- [Model Analyzer](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_analyzer.html): A GPU memory and compute utilization optimizer.

-### Performance Analyzer
+#### Performance Analyzer
To use the performance analyzer, please remove the persimmon8b model from `model_repository` and restart
the Triton server using the `docker run` command from above.
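As an illustrative aside, a minimal Performance Analyzer run against the remaining model might look like the sketch below; the model name and concurrency range are assumptions:

```bash
# Hypothetical invocation; sweeps request concurrency from 1 to 4
# and reports latency and throughput at each step.
perf_analyzer -m falcon7b --concurrency-range 1:4
```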

@@ -289,7 +289,7 @@ guide.
For more information regarding dynamic batching in Triton, please see [this](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html#dynamic-batcher)
guide.
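As an illustrative aside, enabling the dynamic batcher is a small addition to a model's `config.pbtxt`; the delay value below is an assumption:

```
# Hypothetical config.pbtxt snippet; batches individual requests
# server-side, waiting up to 100 microseconds to form larger batches.
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```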

-### Model Analyzer
+#### Model Analyzer

In the performance analyzer section, we used intuition to increase our throughput by changing
a subset of variables and measuring the difference in performance. However, we only changed
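As an illustrative aside, a systematic sweep with Model Analyzer might look like the sketch below; the repository path and model name are assumptions, not taken from this tutorial:

```bash
# Hypothetical invocation; profiles the model across configuration
# variants and records the measured results for comparison.
model-analyzer profile \
  --model-repository /path/to/model_repository \
  --profile-models falcon7b
```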
