Issue Summary:
The quantization process using llama.cpp takes roughly 100 times longer on Windows (Dockerized via WSL2) than on Linux, with the container using only about 1% of the CPU despite having access to all available cores.
Steps to Reproduce:
Run llama-quantize on Windows and Linux with the same model (I am doing this in SpongeQuant, with the CPU-only option; an example invocation is sketched after these steps).
Observe the CPU usage during the quantization process (using htop on Linux and Task Manager or docker stats on Windows).
Compare the time taken for the same operation between Windows and Linux.
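For reference, the reproduction boils down to something like the following. The model filenames and the Q4_K_M type are illustrative placeholders, not the exact SpongeQuant invocation; the binary path matches the build in the Dockerfile below.

# Inside the container:
/app/llama_cpp/build/bin/llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# While it runs, watch CPU usage from the host:
htop            # Linux
docker stats    # Windows (or Task Manager)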
Expected Behavior:
The quantization process should perform similarly on both Windows and Linux, utilizing the available CPU resources efficiently.
Actual Behavior:
On Windows:
The process is extremely slow, taking about 100 times longer than on Linux.
The process uses only 1% of the CPU throughout the execution (but spikes to around 80% CPU when it starts).
On Linux, the process runs as expected with full CPU utilization and normal speed.
Additional Information:
The issue persists regardless of the Docker container configuration or resource limits.
The issue appears to be related to CPU usage or performance throttling, but no specific throttling setting has been identified on Windows.
I have tested with the stress package (see the example below), and the container is capable of using 100% of the CPU, yet quantization remains limited.
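For example, a stress run like the following (the duration is arbitrary) drives the container to full CPU:

# Spawn one spinning worker per core for 60 seconds;
# docker stats shows ~100% CPU for the duration.
stress --cpu $(nproc) --timeout 60s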
Container Setup:
The issue occurs in a container based on Ubuntu 22.04, with Python dependencies installed and llama.cpp compiled from source. On Windows, Docker runs through the WSL2 backend.
# -------------------------------------------------------------------------------
# Use a plain Ubuntu image for CPU-only mode.
# -------------------------------------------------------------------------------
FROM ubuntu:22.04

# -------------------------------------------------------------------------------
# Disable interactive prompts.
# -------------------------------------------------------------------------------
ENV DEBIAN_FRONTEND=noninteractive

# -------------------------------------------------------------------------------
# Install required system dependencies.
# -------------------------------------------------------------------------------
RUN apt-get update && apt-get install -y \
    build-essential \
    cmake \
    git \
    curl \
    wget \
    ninja-build \
    python3 \
    python3-pip \
    libssl-dev \
    libffi-dev \
    && rm -rf /var/lib/apt/lists/*

# -------------------------------------------------------------------------------
# Set the working directory.
# -------------------------------------------------------------------------------
WORKDIR /app

# -------------------------------------------------------------------------------
# Create a cache directory and set environment variables.
# -------------------------------------------------------------------------------
RUN mkdir -p /app/.cache && chmod -R 777 /app/.cache
ENV HF_HOME=/app/.cache
ENV HOME=/app

# -------------------------------------------------------------------------------
# Copy the requirements file.
# -------------------------------------------------------------------------------
COPY ./app/requirements.cpu.txt /app/

# -------------------------------------------------------------------------------
# Upgrade pip.
# -------------------------------------------------------------------------------
RUN python3 -m pip install --upgrade pip==25.0

# -------------------------------------------------------------------------------
# Force-install torch first so that auto-gptq's metadata generation finds it.
# -------------------------------------------------------------------------------
RUN python3 -m pip install torch==2.6.0

# -------------------------------------------------------------------------------
# Install the rest of the Python dependencies.
# -------------------------------------------------------------------------------
RUN python3 -m pip install -r requirements.cpu.txt

# -------------------------------------------------------------------------------
# Clone and build llama_cpp (for GGUF quantization).
# -------------------------------------------------------------------------------
RUN git clone https://github.com/ggerganov/llama.cpp.git /app/llama_cpp
WORKDIR /app/llama_cpp
RUN mkdir build && cd build && \
    cmake -DCMAKE_BUILD_TYPE=Release \
          -G Ninja .. && \
    ninja -j$(nproc)

# -------------------------------------------------------------------------------
# Copy the rest of your application files.
# -------------------------------------------------------------------------------
COPY ./app /app
WORKDIR /app

# -------------------------------------------------------------------------------
# Expose the port (for the Gradio UI, for example) and set the entrypoint.
# -------------------------------------------------------------------------------
EXPOSE 7860
CMD ["python3", "app.py"]
Name and Version
llama-quantize, build = 4691 (369be55)
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-quantize
Command line
Problem description & steps to reproduce
(See the issue summary, steps to reproduce, and container setup above.)
First Bad Commit
No response
Relevant log output