Running llama.cpp + Gemma GGUF on ConoHaVPS (2 GB RAM / 3 cores)

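This guide walks through building llama.cpp on Ubuntu 24.04 LTS on a ConoHaVPS instance (2 GB RAM / 3 cores), downloading the unsloth/gemma-4-E2B-it-GGUF model, and confirming that it runs. Because the setup works even on a low-spec VPS, it is well suited to testing lightweight models.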
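Before starting, it may help to confirm that the machine matches the specs assumed here. The following is a minimal sketch for Linux that reads the core count and total memory; `mem_total_mib` is a hypothetical helper name, not part of any tool used in this guide.

```shell
# Print total memory in MiB, read from a meminfo-style file.
# Defaults to /proc/meminfo, which reports MemTotal in kB.
mem_total_mib() {
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# Show the number of available cores and the total memory.
echo "cores: $(nproc), memory: $(mem_total_mib) MiB"
```

On the VPS described here, the output should show 3 cores and a little under 2048 MiB, since part of the memory is reserved by the kernel.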

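1. Updating the system and installing dependencies

First, update the OS and install the packages needed to build llama.cpp. A fresh Ubuntu 24.04 installation does not include all of the required development tools, so the following commands set them up.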


$ sudo apt update && sudo apt upgrade -y
$ sudo apt install -y \
  build-essential cmake git curl wget \
  python3-pip python3-venv libopenblas-dev

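Key points:

build-essential and cmake are required to build llama.cpp
Installing libopenblas-dev provides OpenBLAS, so the CPU build can use BLAS-accelerated matrix operations
python3-venv is also needed because a Python virtual environment is used later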
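After installation, a quick check can confirm that the tools are on the PATH. This is a minimal sketch; `missing_tools` is a hypothetical helper name, and the list of commands is an assumption based on the packages installed above.

```shell
# Print each given command that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing=$(missing_tools gcc g++ make cmake git curl wget python3 pip3)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All build tools found."
fi
```

If anything is reported as missing, re-running the apt install command above is the first thing to try.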

2. Building llama.cpp

Clone llama.cpp from GitHub and build it for the CPU with OpenBLAS enabled.


$ git clone https://github.com/ggerganov/llama.cpp
$ cd llama.cpp

# CPU-only build
$ cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
$ cmake --build build --config Release -j$(nproc)

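Notes:

The VPS has no GPU, so a CPU build is all that is needed
-DGGML_BLAS=ON enables BLAS for matrix operations; this mainly speeds up prompt processing, while token generation speed is largely unaffected
-j$(nproc) runs the build in parallel, with one job per CPU core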
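Once the build finishes, it is worth confirming that the binary used later actually exists. This is a minimal sketch, assuming it is run from the root of the llama.cpp checkout and that the binaries land in build/bin as in the commands above; `check_binaries` is a hypothetical helper name.

```shell
# Report whether each named binary exists and is executable in the given directory.
# Returns non-zero if any of them is missing.
check_binaries() {
  dir=$1
  shift
  rc=0
  for bin in "$@"; do
    if [ -x "$dir/$bin" ]; then
      echo "ok: $bin"
    else
      echo "missing: $bin"
      rc=1
    fi
  done
  return $rc
}

check_binaries build/bin llama-cli || echo "Build output incomplete; check the cmake --build log."
```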

3. Downloading the model

Download the GGUF model from the Hugging Face Hub. Create a Python virtual environment and use the hf command (the successor to huggingface-cli).


$ python3 -m venv env3.12
$ . env3.12/bin/activate
(env3.12)$ pip3 install huggingface-hub
(env3.12)$ cd ~/llama.cpp/build/bin
(env3.12)$ hf download unsloth/gemma-4-E2B-it-GGUF \
  gemma-4-E2B-it-Q4_K_M.gguf \
  --local-dir ./models

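Notes:

Q4_K_M is a compact 4-bit quantization format that is relatively easy to run even on a VPS with 2 GB of RAM
hf download replaces the older huggingface-cli download command
The model is saved under build/bin/models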
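A partially downloaded file can produce confusing errors later, so a quick sanity check may be useful. GGUF files begin with the four ASCII bytes "GGUF"; the sketch below checks only that header and the file size, not the integrity of the whole file. `is_gguf` is a hypothetical helper name, and the path assumes the download location used above.

```shell
# Succeed if the file starts with the 4-byte GGUF magic.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

model=./models/gemma-4-E2B-it-Q4_K_M.gguf
if is_gguf "$model"; then
  echo "GGUF header found: $model"
  ls -lh "$model"   # compare the file size against the available memory
else
  echo "File not found or not a GGUF file: $model"
fi
```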

4. Verifying that it runs (CLI)

Use llama-cli, the command-line tool that ships with llama.cpp, to check that the model runs. The thread count is set to 3 to match the VPS's three CPU cores.


$ cd ~/llama.cpp/build/bin
$ ./llama-cli \
  -m ./models/gemma-4-E2B-it-Q4_K_M.gguf \
  -p "こんにちは。" \
  -n 512 \
  -t 3 \
  -c 2048

Main options:

-p: prompt
-n: number of tokens to generate
-t: number of threads
-c: context length
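The same command can be adapted to other VPS plans by deriving the -t value from the core count rather than hard-coding 3. This is a minimal sketch; `detect_threads` is a hypothetical helper name, and the command is only printed so it can be reviewed before being run.

```shell
# Use the number of online cores as the thread count; fall back to 1 if it cannot be detected.
detect_threads() {
  n=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
  echo "$n"
}

MODEL=./models/gemma-4-E2B-it-Q4_K_M.gguf
THREADS=$(detect_threads)

# Print the command instead of running it.
echo ./llama-cli -m "$MODEL" -p "こんにちは。" -n 512 -t "$THREADS" -c 2048
```

On the 3-core VPS used here, this yields the same -t 3 as the command above.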

Sample output

Below is part of the actual output.


> こんにちは。

[Start thinking]
Thinking Process:

1.  **Analyze the input:** The input is "こんにちは。" (Konnichiwa.) - This means "Hello" or "Good afternoon." It's a standard, polite greeting.
2.  **Determine the context/intent:** The user is initiating a conversation with a simple greeting.
3.  **Formulate an appropriate response (in Japanese):**
    *   Acknowledge the greeting.
    *   Return a reciprocal greeting.
    *   (Optional but good) Ask how the user is doing or inquire about what they need.
4.  **Draft the response options:**
    *   *Option 1 (Simple return):* こんにちは。 (Konnichiwa.)
    *   *Option 2 (Polite return + inquiry):* こんにちは。何かお手伝いできることはありますか？ (Konnichiwa. Nanika otetsudai dekiru koto wa arimasu ka? - Hello. Is there anything I can help you with?)
    *   *Option 3 (Friendly return):* こんにちは！お元気ですか？ (Konnichiwa! Ogenki desu ka? - Hello! How are you?)
5.  **Select the best option:** Option 2 is generally the most useful for an AI assistant, as it sets the stage for the actual task. Option 3 is also very warm. I will use a slightly more versatile version.

6.  **Final Output Generation (in Japanese):** こんにちは。どのようなご用件でしょうか？ (Konnichiwa. Dono yōna goyōken deshō ka? - Hello. What kind of business/request do you have?) or simply a warm acknowledgment. (I will go with a standard, welcoming reply.)
[End thinking]

こんにちは！

何かお手伝いできることはありますか？😊

[ Prompt: 24.1 t/s | Generation: 8.0 t/s ]
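The final line reports throughput for prompt processing and for generation, in tokens per second. As a rough reading of these figures, the time to generate N tokens is about N divided by the generation rate. The sketch below assumes the rate stays constant, which may not hold as the context grows; `gen_seconds` is a hypothetical helper name.

```shell
# Estimated seconds = tokens / tokens-per-second.
gen_seconds() {
  awk -v n="$1" -v tps="$2" 'BEGIN { printf "%.1f\n", n / tps }'
}

# Time to generate the full -n 512 tokens at the 8.0 t/s measured above.
gen_seconds 512 8.0
```

This prints 64.0, so a response that uses the whole 512-token budget would take about a minute on this VPS.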

Published April 17, 2026, 00:08.

SASAGAWA .TOKYO WEB (https://sasagawa.tokyo/): an activity hub and behind-the-scenes development magazine for Finance × ICT, woven together with AI
