第23章: パフォーマンス最適化 - 速度を追求する

学習目標

プロファイリングツールの使い方を習得する
ベンチマークの正しい書き方を学ぶ
メモリ使用量の最適化テクニックを理解する
CPU使用率の改善方法を学ぶ
コンパイラ最適化を活用する方法を知る

---

23.1 最適化の基本原則

23.1.1 計測なくして最適化なし

┌─────────────────────────────────────────────────────────┐
│          最適化のゴールデンルール                         │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  1. 計測する（Measure）                                  │
│     ↓ どこがボトルネックか特定                           │
│                                                         │
│  2. 最適化する（Optimize）                               │
│     ↓ 最も効果的な部分を改善                             │
│                                                         │
│  3. 検証する（Verify）                                   │
│     ↓ 本当に速くなったか確認                             │
│                                                         │
│  4. 繰り返す（Iterate）                                  │
│                                                         │
└─────────────────────────────────────────────────────────┘

有名な格言：

> "Premature optimization is the root of all evil." > （時期尚早な最適化は諸悪の根源） > — Donald Knuth

23.1.2 最適化の優先順位

// ❌ 悪い例：計測せずに最適化
fn process_data(data: &[i32]) -> i32 {
    // 「たぶん速いだろう」という思い込みで複雑な実装
    unsafe {
        // unsafeを使えば速い？（間違い）
        // ...
    }
}

// ✅ 良い例：まずシンプルに実装
fn process_data(data: &[i32]) -> i32 {
    data.iter().sum()  // シンプルで読みやすい
}

// 計測後、本当に遅ければ最適化

---

23.2 プロファイリングツール

23.2.1 cargo-flamegraph

インストール：

cargo install flamegraph

# Linux: パーミッション設定
sudo sysctl kernel.perf_event_paranoid=1

使用方法：

# フレームグラフ生成
cargo flamegraph

# ブラウザで flamegraph.svg を開く

読み方：

┌─────────────────────────────────────────────────────────┐
│              フレームグラフの見方                         │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  横幅 = 時間の割合（広いほど時間がかかっている）          │
│  縦軸 = コールスタック（下から上へ呼び出し）              │
│  色   = ランダム（意味なし）                             │
│                                                         │
│  [       main        ]   ← プログラム全体                │
│  [  process_data  ]      ← ボトルネックが見つかる        │
│  [ heavy_function ]      ← ここを最適化すべき            │
│                                                         │
└─────────────────────────────────────────────────────────┘

23.2.2 perf（Linux）

# プロファイリング実行
perf record ./target/release/myapp

# レポート表示
perf report

# 特定の関数のみ
perf record -e cycles -g -- ./target/release/myapp

# フレームグラフ生成
perf script | stackcollapse-perf.pl | flamegraph.pl > perf.svg

23.2.3 Instruments（macOS）

# Xcodeに付属のInstrumentsを使用

# コマンドラインから起動
instruments -t "Time Profiler" ./target/release/myapp

23.2.4 valgrind（Linux）

メモリプロファイリング：

# メモリリーク検出
valgrind --leak-check=full ./target/release/myapp

# キャッシュプロファイリング
valgrind --tool=cachegrind ./target/release/myapp

# Massif（ヒーププロファイラ）
valgrind --tool=massif ./target/release/myapp
massif-visualizer massif.out.*

---

23.3 ベンチマーク

23.3.1 Criterion.rs

セットアップ：

# Cargo.toml

[dev-dependencies]
criterion = "0.5"

[[bench]]
name = "my_benchmark"
harness = false

基本的なベンチマーク：

// benches/my_benchmark.rs

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn fibonacci_recursive(n: u64) -> u64 {
    match n {
        0 => 1,
        1 => 1,
        n => fibonacci_recursive(n - 1) + fibonacci_recursive(n - 2),
    }
}

fn fibonacci_iterative(n: u64) -> u64 {
    let mut a = 0;
    let mut b = 1;
    for _ in 0..n {
        let temp = a;
        a = b;
        b = temp + b;
    }
    b
}

fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("fib_recursive 20", |b| {
        b.iter(|| fibonacci_recursive(black_box(20)))
    });

    c.bench_function("fib_iterative 20", |b| {
        b.iter(|| fibonacci_iterative(black_box(20)))
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

実行：

cargo bench

# 結果例
# fib_recursive 20    time:   [26.531 µs 26.644 µs 26.774 µs]
# fib_iterative 20    time:   [8.9234 ns 8.9456 ns 8.9712 ns]

23.3.2 パラメータ化ベンチマーク

use criterion::{BenchmarkId, Criterion};

fn bench_sort(c: &mut Criterion) {
    let mut group = c.benchmark_group("sort");

    for size in [10, 100, 1000, 10000].iter() {
        group.bench_with_input(
            BenchmarkId::from_parameter(size),
            size,
            |b, &size| {
                let mut data: Vec<i32> = (0..size).rev().collect();
                b.iter(|| {
                    data.sort();
                    black_box(&data);
                });
            },
        );
    }

    group.finish();
}

23.3.3 比較ベンチマーク

fn bench_comparison(c: &mut Criterion) {
    let data: Vec<i32> = (0..1000).collect();

    let mut group = c.benchmark_group("iteration");

    group.bench_function("for_loop", |b| {
        b.iter(|| {
            let mut sum = 0;
            for &x in &data {
                sum += x;
            }
            sum
        });
    });

    group.bench_function("iterator", |b| {
        b.iter(|| data.iter().sum::<i32>());
    });

    group.bench_function("fold", |b| {
        b.iter(|| data.iter().fold(0, |acc, &x| acc + x));
    });

    group.finish();
}

---

23.4 メモリ最適化

23.4.1 アロケーション削減

❌ 悪い例：不要なアロケーション

fn process_lines(text: &str) -> Vec<String> {
    let mut results = Vec::new();
    for line in text.lines() {
        // 毎回String::fromでアロケーション
        results.push(line.to_string().to_uppercase());
    }
    results
}

✅ 良い例：アロケーション削減

fn process_lines(text: &str) -> Vec<String> {
    text.lines()
        .map(|line| line.to_uppercase())
        .collect()
}

// さらに良い：イテレータを返す
fn process_lines_lazy(text: &str) -> impl Iterator<Item = String> + '_ {
    text.lines().map(|line| line.to_uppercase())
}

23.4.2 サイズ最適化

構造体のサイズ削減：

use std::mem::size_of;

// ❌ 悪い例：パディングが多い（24バイト）
struct BadLayout {
    a: u8,   // 1バイト + 7バイトのパディング
    b: u64,  // 8バイト
    c: u32,  // 4バイト + 4バイトのパディング
}

// ✅ 良い例：最適なレイアウト（16バイト）
struct GoodLayout {
    b: u64,  // 8バイト
    c: u32,  // 4バイト
    a: u8,   // 1バイト + 3バイトのパディング
}

fn main() {
    println!("BadLayout: {} bytes", size_of::<BadLayout>());   // 24
    println!("GoodLayout: {} bytes", size_of::<GoodLayout>()); // 16
}

23.4.3 Cow（Clone on Write）

use std::borrow::Cow;

fn process_text(text: Cow<str>) -> Cow<str> {
    if text.contains("error") {
        // 変更が必要な場合のみクローン
        Cow::Owned(text.replace("error", "ERROR"))
    } else {
        // 変更不要ならそのまま返す
        text
    }
}

fn main() {
    let text = "Hello, world!";
    let result = process_text(Cow::Borrowed(text));
    // クローンは発生しない

    let error_text = "An error occurred";
    let result = process_text(Cow::Borrowed(error_text));
    // ここでのみクローンが発生
}

---

23.5 CPU最適化

23.5.1 SIMD（Single Instruction Multiple Data）

// 標準的な実装
fn sum_scalar(data: &[f32]) -> f32 {
    data.iter().sum()
}

// SIMD版（portable_simd feature required）
#![feature(portable_simd)]
use std::simd::*;

fn sum_simd(data: &[f32]) -> f32 {
    let chunks = data.chunks_exact(4);
    let remainder = chunks.remainder();

    let sum = chunks
        .map(|chunk| {
            let simd = f32x4::from_slice(chunk);
            simd
        })
        .fold(f32x4::splat(0.0), |acc, x| acc + x);

    sum.horizontal_sum() + remainder.iter().sum::<f32>()
}

23.5.2 並列化

Rayon の使用：

use rayon::prelude::*;

// シーケンシャル版
fn process_sequential(data: &[i32]) -> Vec<i32> {
    data.iter().map(|&x| expensive_operation(x)).collect()
}

// 並列版
fn process_parallel(data: &[i32]) -> Vec<i32> {
    data.par_iter()
        .map(|&x| expensive_operation(x))
        .collect()
}

fn expensive_operation(x: i32) -> i32 {
    // 重い計算
    (0..1000).fold(x, |acc, i| acc + i)
}

ベンチマーク結果例：

process_sequential  time:   [50.123 ms 50.456 ms 50.789 ms]
process_parallel    time:   [12.345 ms 12.567 ms 12.789 ms]
                           ↑ 約4倍高速（4コアCPU）

23.5.3 キャッシュフレンドリーなコード

❌ 悪い例：キャッシュミスが多い

// 列優先アクセス（キャッシュに悪い）
fn sum_columns_bad(matrix: &[[i32; 1000]; 1000]) -> [i32; 1000] {
    let mut sums = [0; 1000];
    for col in 0..1000 {
        for row in 0..1000 {
            sums[col] += matrix[row][col];  // キャッシュミス
        }
    }
    sums
}

✅ 良い例：キャッシュフレンドリー

// 行優先アクセス（キャッシュに優しい）
fn sum_columns_good(matrix: &[[i32; 1000]; 1000]) -> [i32; 1000] {
    let mut sums = [0; 1000];
    for row in 0..1000 {
        for col in 0..1000 {
            sums[col] += matrix[row][col];  // 連続アクセス
        }
    }
    sums
}

---

23.6 コンパイラ最適化

23.6.1 インライン化

// 自動的にインライン化される（小さい関数）
fn add(a: i32, b: i32) -> i32 {
    a + b
}

// 明示的にインライン化を指示
#[inline]
fn multiply(a: i32, b: i32) -> i32 {
    a * b
}

// 常にインライン化（慎重に使用）
#[inline(always)]
fn critical_function(x: i32) -> i32 {
    x * x + 2 * x + 1
}

// インライン化しない
#[inline(never)]
fn large_function() {
    // 大きな関数はインライン化しない方が良い場合も
}

23.6.2 LTO（Link Time Optimization）

# Cargo.toml

[profile.release]
lto = true          # 完全なLTO
# lto = "thin"      # 高速なLTO（バランス型）

効果：

バイナリサイズ削減（10-20%）
実行速度向上（5-15%）
ビルド時間増加（2-5倍）

23.6.3 コード生成単位

[profile.release]
codegen-units = 1   # 最大最適化（遅いビルド）
# codegen-units = 16  # デフォルト（速いビルド）

トレードオフ：

設定	ビルド時間	実行速度	バイナリサイズ
`codegen-units = 1`	遅い	速い	小さい
`codegen-units = 16`	速い	遅い	大きい

---

23.7 最適化テクニック集

23.7.1 String vs &str

// ❌ 遅い
fn process(s: String) -> String {
    s.to_uppercase()
}

// ✅ 速い
fn process(s: &str) -> String {
    s.to_uppercase()
}

23.7.2 Vec の事前確保

// ❌ 遅い：複数回の再アロケーション
fn create_vec() -> Vec<i32> {
    let mut v = Vec::new();
    for i in 0..1000 {
        v.push(i);  // 再アロケーションが発生
    }
    v
}

// ✅ 速い：事前に容量を確保
fn create_vec_fast() -> Vec<i32> {
    let mut v = Vec::with_capacity(1000);
    for i in 0..1000 {
        v.push(i);  // 再アロケーションなし
    }
    v
}

23.7.3 SmallVec の使用

use smallvec::SmallVec;

// スタックに4要素まで、それ以上はヒープ
type SmallVec4<T> = SmallVec<[T; 4]>;

fn process() -> SmallVec4<i32> {
    let mut v = SmallVec4::new();
    v.push(1);
    v.push(2);
    v.push(3);
    // ここまではヒープアロケーションなし
    v
}

23.7.4 HashMap vs BTreeMap

use std::collections::{HashMap, BTreeMap};

// 高速な検索（ランダムアクセス）
let mut hash_map = HashMap::new();
hash_map.insert("key", "value");

// ソート済み（順序付きイテレーション）
let mut btree_map = BTreeMap::new();
btree_map.insert("key", "value");

選択基準：

操作	HashMap	BTreeMap
挿入	O(1)	O(log n)
検索	O(1)	O(log n)
ソート済みイテレーション	×	○
メモリ使用量	多い	少ない

---

23.8 実践的な最適化例

23.8.1 JSON パースの最適化

// ❌ 遅い：毎回パース
fn get_user_name(json: &str) -> String {
    let value: serde_json::Value = serde_json::from_str(json).unwrap();
    value["name"].as_str().unwrap().to_string()
}

// ✅ 速い：構造体に直接デシリアライズ
#[derive(Deserialize)]
struct User {
    name: String,
}

fn get_user_name_fast(json: &str) -> String {
    let user: User = serde_json::from_str(json).unwrap();
    user.name
}

// ✅✅ さらに速い：simd-jsonを使用
use simd_json;

fn get_user_name_fastest(json: &mut [u8]) -> String {
    let user: User = simd_json::from_slice(json).unwrap();
    user.name
}

23.8.2 文字列結合の最適化

// ❌ 非常に遅い
fn concat_slow(items: &[&str]) -> String {
    let mut result = String::new();
    for item in items {
        result = result + item;  // 毎回アロケーション
    }
    result
}

// ✅ 速い
fn concat_fast(items: &[&str]) -> String {
    let mut result = String::new();
    for item in items {
        result.push_str(item);  // アロケーションを最小化
    }
    result
}

// ✅✅ さらに速い
fn concat_fastest(items: &[&str]) -> String {
    let total_len: usize = items.iter().map(|s| s.len()).sum();
    let mut result = String::with_capacity(total_len);
    for item in items {
        result.push_str(item);
    }
    result
}

---

23.9 まとめ

学んだこと

プロファイリング

- flamegraph, perf, valgrind - ボトルネックの特定

ベンチマーク

- Criterion.rs - 正確な測定方法

最適化テクニック

- メモリ削減 - CPU最適化 - コンパイラ最適化

最適化チェックリスト

[ ] プロファイリングでボトルネックを特定
[ ] ベンチマークで効果を測定
[ ] アロケーションを最小化
[ ] 適切なデータ構造を選択
[ ] コンパイラ最適化を活用
[ ] 並列化を検討

次のステップ

次の章では、これまで学んだすべてを統合した実践プロジェクトに取り組みます。

---

参考資料

The Rust Performance Book
Criterion.rs
Rayon