
Fix trpc device map #2112

Closed
wants to merge 11 commits
README.md (42 changes: 19 additions & 23 deletions)
@@ -1,44 +1,40 @@
 
-# FEDML Open Source: A Unified and Scalable Machine Learning Library for Running Training and Deployment Anywhere at Any Scale
 
-Backed by FEDML Nexus AI: Next-Gen Cloud Services for LLMs & Generative AI (https://fedml.ai)
+Backed by TensorOpera AI: Your Generative AI Platform at Scale (https://TensorOpera.ai)
 
 <div align="center">
- <img src="docs/images/fedml_logo_light_mode.png" width="400px">
+ <img src="docs/images/TensorOpera_arch.png" width="600px">
 </div>
 
-FedML Documentation: https://doc.fedml.ai
+TensorOpera Documentation: https://docs.TensorOpera.ai
 
-FedML Homepage: https://fedml.ai/ \
-FedML Blog: https://blog.fedml.ai/ \
-FedML Medium: https://medium.com/@FedML \
-FedML Research: https://fedml.ai/research-papers/
+TensorOpera Homepage: https://TensorOpera.ai/ \
+TensorOpera Blog: https://blog.TensorOpera.ai/
 
-Join the Community: \
+Join the Community:
 Slack: https://join.slack.com/t/fedml/shared_invite/zt-havwx1ee-a1xfOUrATNfc9DFqU~r34w \
 Discord: https://discord.gg/9xkW8ae6RV
 
-FEDML® stands for Foundational Ecosystem Design for Machine Learning. [FEDML Nexus AI](https://fedml.ai) is the next-gen cloud service for LLMs & Generative AI. It helps developers to *launch* complex model *training*, *deployment*, and *federated learning* anywhere on decentralized GPUs, multi-clouds, edge servers, and smartphones, *easily, economically, and securely*.
+TensorOpera® AI (https://TensorOpera.ai) is the next-gen cloud service for LLMs & Generative AI. It helps developers launch complex model training, deployment, and federated learning anywhere on decentralized GPUs, multi-clouds, edge servers, and smartphones, easily, economically, and securely.
 
-Highly integrated with [FEDML open source library](https://github.com/fedml-ai/fedml), FEDML Nexus AI provides holistic support of three interconnected AI infrastructure layers: user-friendly MLOps, a well-managed scheduler, and high-performance ML libraries for running any AI jobs across GPU Clouds.
+Highly integrated with the TensorOpera open source library, TensorOpera AI provides holistic support of three interconnected AI infrastructure layers: user-friendly MLOps, a well-managed scheduler, and high-performance ML libraries for running any AI job across GPU clouds.
 
-![fedml-nexus-ai-overview.png](./docs/images/fedml-nexus-ai-overview.png)
 
-A typical workflow is showing in figure above. When developer wants to run a pre-built job in Studio or Job Store, FEDML®Launch swiftly pairs AI jobs with the most economical GPU resources, auto-provisions, and effortlessly runs the job, eliminating complex environment setup and management. When running the job, FEDML®Launch orchestrates the compute plane in different cluster topologies and configuration so that any complex AI jobs are enabled, regardless model training, deployment, or even federated learning. FEDML®Open Source is unified and scalable machine learning library for running these AI jobs anywhere at any scale.
+A typical workflow is shown in the figure above. When a developer wants to run a pre-built job in Studio or Job Store, TensorOpera® Launch swiftly pairs the job with the most economical GPU resources, auto-provisions them, and runs the job, eliminating complex environment setup and management. While the job runs, TensorOpera® Launch orchestrates the compute plane across different cluster topologies and configurations, enabling any complex AI job, whether model training, deployment, or even federated learning. TensorOpera® Open Source is a unified and scalable machine learning library for running these AI jobs anywhere at any scale.
 
-In the MLOps layer of FEDML Nexus AI
-- **FEDML® Studio** embraces the power of Generative AI! Access popular open-source foundational models (e.g., LLMs), fine-tune them seamlessly with your specific data, and deploy them scalably and cost-effectively using the FEDML Launch on GPU marketplace.
-- **FEDML® Job Store** maintains a list of pre-built jobs for training, deployment, and federated learning. Developers are encouraged to run directly with customize datasets or models on cheaper GPUs.
+In the MLOps layer of TensorOpera AI
+- **TensorOpera® Studio** embraces the power of Generative AI! Access popular open-source foundational models (e.g., LLMs), fine-tune them seamlessly with your specific data, and deploy them scalably and cost-effectively using TensorOpera Launch on the GPU marketplace.
+- **TensorOpera® Job Store** maintains a list of pre-built jobs for training, deployment, and federated learning. Developers are encouraged to run them directly with customized datasets or models on cheaper GPUs.
 
-In the scheduler layer of FEDML Nexus AI
-- **FEDML® Launch** swiftly pairs AI jobs with the most economical GPU resources, auto-provisions, and effortlessly runs the job, eliminating complex environment setup and management. It supports a range of compute-intensive jobs for generative AI and LLMs, such as large-scale training, serverless deployments, and vector DB searches. FEDML Launch also facilitates on-prem cluster management and deployment on private or hybrid clouds.
+In the scheduler layer of TensorOpera AI
+- **TensorOpera® Launch** swiftly pairs AI jobs with the most economical GPU resources, auto-provisions, and effortlessly runs the job, eliminating complex environment setup and management. It supports a range of compute-intensive jobs for generative AI and LLMs, such as large-scale training, serverless deployments, and vector DB searches. TensorOpera Launch also facilitates on-prem cluster management and deployment on private or hybrid clouds.
 
-In the Compute layer of FEDML Nexus AI
-- **FEDML® Deploy** is a model serving platform for high scalability and low latency.
-- **FEDML® Train** focuses on distributed training of large and foundational models.
-- **FEDML® Federate** is a federated learning platform backed by the most popular federated learning open-source library and the world’s first FLOps (federated learning Ops), offering on-device training on smartphones and cross-cloud GPU servers.
-- **FEDML® Open Source** is unified and scalable machine learning library for running these AI jobs anywhere at any scale.
+In the Compute layer of TensorOpera AI
+- **TensorOpera® Deploy** is a model serving platform for high scalability and low latency.
+- **TensorOpera® Train** focuses on distributed training of large and foundational models.
+- **TensorOpera® Federate** is a federated learning platform backed by the most popular federated learning open-source library and the world’s first FLOps (federated learning Ops), offering on-device training on smartphones and cross-cloud GPU servers.
+- **TensorOpera® Open Source** is a unified and scalable machine learning library for running these AI jobs anywhere at any scale.

# Contributing
FedML embraces and thrives through open source. We welcome all kinds of contributions from the community. Kudos to all of <a href="https://github.com/fedml-ai/fedml/graphs/contributors" target="_blank">our amazing contributors</a>!
Binary file added docs/images/TensorOpera_arch.png
python/fedml/core/distributed/communication/trpc/utils.py (4 changes: 2 additions & 2 deletions)

@@ -7,6 +7,6 @@
 def set_device_map(options, worker_idx, device_list):
     local_device = device_list[worker_idx]
     for index, remote_device in enumerate(device_list):
-        logging.warn(f"Setting device map for client {index} as {remote_device}")
+        logging.warn(f"Setting device map for client {index} as {device_list[remote_device]}")
         if index != worker_idx:
-            options.set_device_map(WORKER_NAME.format(index), {local_device: remote_device})
+            options.set_device_map(WORKER_NAME.format(index), {local_device: device_list[remote_device]})
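For context, `set_device_map` tells an RPC backend how tensors travelling between two workers should hop between their GPUs: for each peer it registers a map from this worker's local device to the peer's device. Below is a minimal, stdlib-only sketch of that accumulation logic. The `Options` class is a stand-in for the real `torch.distributed.rpc` backend-options object, `WORKER_NAME` is assumed to be a format string like `"worker{}"`, and logging is omitted; this illustrates the shape of the mapping, not the exact semantics the PR settles on:

```python
WORKER_NAME = "worker{}"  # assumed peer-name format string


class Options:
    """Stand-in for an RPC backend-options object (e.g. TensorPipeRpcBackendOptions)."""

    def __init__(self):
        self.device_maps = {}

    def set_device_map(self, worker_name, device_map):
        # Accumulate {local_device: remote_device} entries per peer worker.
        self.device_maps.setdefault(worker_name, {}).update(device_map)


def set_device_map(options, worker_idx, device_list):
    # For every other worker, record where tensors from our local device land.
    local_device = device_list[worker_idx]
    for index, remote_device in enumerate(device_list):
        if index != worker_idx:
            options.set_device_map(WORKER_NAME.format(index), {local_device: remote_device})


opts = Options()
set_device_map(opts, 1, [0, 1, 2])  # worker 1 of three, one device id each
print(opts.device_maps)  # {'worker0': {1: 0}, 'worker2': {1: 2}}
```

Note that the sketch skips the entry for the worker's own index, matching the `if index != worker_idx` guard in the patched function.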
@@ -25,10 +25,13 @@ bootstrap: |
     pip install PyYAML==5.3.1 -i https://pypi.org/simple
     pip install fedml==0.8.29
     pip install -U typing_extensions -i https://pypi.org/simple
+    pip install -U pydantic
+    pip install -U fastapi
     echo "Bootstrap finished."
 
 computing:
-  resource_type: RTX-4090       # e.g., A100-80G, please check the resource type list by "fedml show-resource-type" or visiting URL: https://open.fedml.ai/accelerator_resource_type
+  #resource_type: RTX-4090      # e.g., A100-80G, please check the resource type list by "fedml show-resource-type" or visiting URL: https://open.fedml.ai/accelerator_resource_type
+  resource_type: A100-80GB-SXM
   minimum_num_gpus: 1           # minimum # of GPUs to provision
   maximum_cost_per_hour: $10    # max cost per hour of all machines for your job
   # device_type: GPU            # GPU or CPU
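The `computing` block above can be sanity-checked on the client side before submitting a job. Here is a hypothetical, stdlib-only validator: the field names (`resource_type`, `minimum_num_gpus`, `maximum_cost_per_hour`) come from the YAML snippet, but the function itself is illustrative and not part of the fedml CLI:

```python
def validate_computing(cfg):
    """Return a list of problems found in a parsed 'computing' section (empty if OK)."""
    errors = []
    if "resource_type" not in cfg:
        errors.append("resource_type is required (e.g. A100-80GB-SXM)")
    if cfg.get("minimum_num_gpus", 0) < 1:
        errors.append("minimum_num_gpus must be >= 1")
    if not str(cfg.get("maximum_cost_per_hour", "")).startswith("$"):
        errors.append("maximum_cost_per_hour should look like '$10'")
    return errors


cfg = {
    "resource_type": "A100-80GB-SXM",
    "minimum_num_gpus": 1,
    "maximum_cost_per_hour": "$10",
}
print(validate_computing(cfg))  # []
```

An empty section would fail all three checks, which makes misconfigured job files easy to catch before provisioning starts.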