🍱 Add linkedin, tweet, fix typo
falexwolf committed Apr 3, 2024
1 parent bad1cb8 commit a973cb8
Showing 1 changed file with 3 additions and 1 deletion.
4 changes: 3 additions & 1 deletion docs/arrayloader-benchmarks.md
@@ -18,6 +18,8 @@ affiliation:
falexwolf: Lamin Labs, Munich
db: https://lamin.ai/laminlabs/arrayloader-benchmarks
repo: https://github.com/laminlabs/arrayloader-benchmarks
+tweet: https://twitter.com/falexwolf/status/1775476575011553500
+linkedin: https://www.linkedin.com/posts/falexwolf_whats-a-good-way-of-organizing-scrna-seq-activity-7181245277415079937-caSw
---

---
@@ -72,7 +74,7 @@ Here, `MappedCollection` is a [map-style PyTorch data loader](https://lamin.ai/d

![](https://lamin-site-assets.s3.amazonaws.com/.lamindb/n9cf1yZzUpMNiPmZqo3m.svg)

-**Figure 1 ([source](https://lamin.ai/laminlabs/arrayloader-benchmarks/transform/faAhgiIDemaP4BB5))**: We compared NVIDIA Merlin based on a local collection of parquet files, `MappedCollection` based on a local collection of h5ad files, and `cellxgene_census` based on a `tiledbsoma` store in the cloud. Shown is the batch loading time (standard boxplot, **left**), the time per epoch (barplot, **center**), and the number of samples loaded per second (barplot, **right**) with statistics gathered across ~50k batch loading operations during 5 epochs for each method. The raw data consists of 138 `.h5ad` files hosted by CZI and was transformed into parquet files [here](https://lamin.ai/laminlabs/arrayloader-benchmarks/transform/GjHlkZOA4wKp5zKv). For `cellxgene_census`, we use the concatenated version `tiledbsoma` store hosted by CZI and access from within the same AWS data center `us-west-2` for maximal streaming speed ([benchmark](https://lamin.ai/laminlabs/arrayloader-benchmarks/transform/Md9ea0bLFozt65cN)). Outside of `us-west-2`, the speed is _much_ slower. We ran all benchmarks on AWS SageMaker using a `ml.g4dn.2xlarge` EC2 instance. NVIDIA Merlin runs into memory overflow during the benchmark, and we manually triggered the garbage collector.
+**Figure 1 ([source](https://lamin.ai/laminlabs/arrayloader-benchmarks/transform/faAhgiIDemaP4BB5))**: We compared NVIDIA Merlin based on a local collection of parquet files, `MappedCollection` based on a local collection of h5ad files, and `cellxgene_census` based on a `tiledbsoma` store in the cloud. Shown is the batch loading time (standard boxplot, **left**), the time per epoch (barplot, **center**), and the number of samples loaded per second (barplot, **right**) with statistics gathered across ~50k batch loading operations during 5 epochs for each method. The raw data consists of 138 `.h5ad` files hosted by CZI and was transformed into parquet files [here](https://lamin.ai/laminlabs/arrayloader-benchmarks/transform/GjHlkZOA4wKp5zKv). For `cellxgene_census`, we use the concatenated `tiledbsoma` store hosted by CZI and access it from within the same AWS data center `us-west-2` for maximal streaming speed ([benchmark](https://lamin.ai/laminlabs/arrayloader-benchmarks/transform/Md9ea0bLFozt65cN)). Outside of `us-west-2`, the speed is _much_ slower. We ran all benchmarks on AWS SageMaker using a `ml.g4dn.2xlarge` EC2 instance. NVIDIA Merlin runs into memory overflow during the benchmark, and we manually triggered the garbage collector.

### Sampling batches from large array collections

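As an aside on the figure caption changed above: the following is a minimal sketch, not taken from the benchmark repository, of how the three quantities reported in Figure 1 (per-batch loading time, time per epoch, samples per second) could be gathered for any map-style PyTorch data loader. The in-memory `TensorDataset` and all sizes are hypothetical stand-ins for the `MappedCollection`, Merlin, and `cellxgene_census` loaders actually benchmarked.

```python
# Sketch only: time batch loading over several epochs for a generic
# map-style PyTorch data loader. The TensorDataset below is a placeholder
# for the real h5ad / parquet / tiledbsoma-backed loaders.
import time

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10_000, 2_000))  # assumed shape, not the real data
loader = DataLoader(dataset, batch_size=128, shuffle=True)

n_epochs = 5
batch_times = []  # per-batch loading times across all epochs (boxplot in Figure 1)
for epoch in range(n_epochs):
    epoch_start = time.perf_counter()
    t0 = time.perf_counter()
    for (batch,) in loader:
        batch_times.append(time.perf_counter() - t0)  # time spent loading this batch
        t0 = time.perf_counter()
    epoch_time = time.perf_counter() - epoch_start  # time per epoch (center panel)
    samples_per_sec = len(dataset) / epoch_time  # samples loaded per second (right panel)
    print(f"epoch {epoch}: {epoch_time:.1f} s, {samples_per_sec:.0f} samples/s")
```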
