GPT in just 60 lines of code?

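Someone has built GPT in just 60 lines of code. The code is in a GitHub repository, and since it is so short, I also got it running in Colab.
Try feeding it your own prompts and changing the parameter values as you test it. Because the code is so short, it is also good material for learning how GPT works.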

Here is an explanation of the code that runs in Colab.

The code consists of a function that runs the GPT-2 model plus the utility functions it relies on. The main functions and what they do:

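gelu(x): Implements the GELU (Gaussian Error Linear Unit) activation function, using the tanh approximation.
softmax(x): Implements the softmax function. It converts the values along the last axis of the input array into a probability distribution, subtracting the maximum first for numerical stability.
layer_norm(x, g, b, eps): Performs layer normalization. It normalizes the input over the last axis to mean 0 and variance 1, then applies the scale (g) and offset (b).
linear(x, w, b): Performs a linear transformation. It multiplies the input array by the weight matrix and adds the bias vector.
ffn(x, c_fc, c_proj): Implements the feed-forward network. It projects the input up to 4*n_embd with a linear transformation, applies the GELU activation, then projects back down to n_embd with a second linear transformation.
attention(q, k, v, mask): Implements the attention mechanism. It computes scaled dot-product attention weights from the query, key, and mask, then returns the weighted sum of the values.
mha(x, c_attn, c_proj, n_head): Performs multi-head attention. It linearly transforms the input into queries, keys, and values, splits each into n_head heads, runs causally masked attention on each head, merges the heads back together, and applies a final linear transformation.
transformer_block(x, mlp, attn, ln_1, ln_2, n_head): Implements a transformer block. It contains multi-head attention and the feed-forward network; each sublayer is applied to the layer-normalized input and its output is added back to the input (a residual connection).
gpt2(inputs, wte, wpe, blocks, ln_f, n_head): Runs the GPT-2 model. It adds the token and position embeddings, passes the result through the transformer blocks in order, applies a final layer normalization, and projects onto the vocabulary with the transposed token embedding matrix to produce logits.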
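To make the causal mask used by mha and attention more concrete, here is a minimal sketch on toy-sized random arrays. The helper name `attention_weights`, the seed, and the sizes are assumptions made for illustration and are not part of the original code; the mask construction is the same as in mha.

```python
import numpy as np


def softmax(x):
    # subtract the row max before exponentiating for numerical stability
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)


def attention_weights(q, k, mask):
    # hypothetical helper: scaled dot-product scores plus an additive mask,
    # normalized per query row
    return softmax(q @ k.T / np.sqrt(q.shape[-1]) + mask)


rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n_seq, d_k = 4, 8  # toy sizes, not GPT-2's real dimensions
q = rng.normal(size=(n_seq, d_k))
k = rng.normal(size=(n_seq, d_k))
v = rng.normal(size=(n_seq, d_k))

# same construction as in mha: 0 on and below the diagonal, -1e10 above it
causal_mask = (1 - np.tri(n_seq)) * -1e10

weights = attention_weights(q, k, causal_mask)  # [n_seq, n_seq]
out = weights @ v  # [n_seq, d_k]

print(np.round(weights, 3))
```

Every row of `weights` sums to 1 and everything above the diagonal is 0, so each position attends only to itself and earlier positions. The first position can only see itself, so its output is simply the first row of `v`.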
`generate(inputs, params, n_head, n_tokens_to_generate)` takes an input sequence (inputs), the model parameters (params), the number of heads (n_head), and the number of tokens to generate (n_tokens_to_generate), and generates text as token IDs. Inside the function, the following happens:
1. The `tqdm` library displays a progress bar labeled "generating".

2. A loop runs once per token to generate (n_tokens_to_generate). This loop performs autoregressive decoding.

3. `gpt2(inputs, **params, n_head=n_head)` is called to run the model's forward pass and obtain the logits.

4. The index with the largest value among the logits at the last position is chosen as the next token. This is greedy decoding.

5. The predicted next token is appended to the input sequence (inputs).

6. After the loop, the last n_tokens_to_generate tokens are sliced off the input sequence (inputs) so that only the generated tokens are returned.

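So calling `generate` uses the input sequence (inputs) and the model parameters (params) to generate the requested number of tokens and returns them. The returned values are token IDs; they are turned back into text in a later step, which in `main` is the call to `encoder.decode`.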
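The decoding steps above can be sketched without downloading any weights. In the block below, `toy_model` is a hypothetical stand-in for `gpt2` that always favors "previous token + 1", and `greedy_generate` is an illustrative name, so this is only a sketch of the loop structure, not the actual model. Unlike the original `generate`, it copies the input list instead of appending to it in place.

```python
import numpy as np


def toy_model(inputs, n_vocab=5):
    # Hypothetical stand-in for gpt2(): returns logits of shape [n_seq, n_vocab].
    # Each position puts the highest logit on (token + 1) % n_vocab.
    logits = np.zeros((len(inputs), n_vocab))
    for pos, tok in enumerate(inputs):
        logits[pos, (tok + 1) % n_vocab] = 1.0
    return logits


def greedy_generate(inputs, n_tokens_to_generate):
    inputs = list(inputs)  # copy so the caller's list is not modified
    for _ in range(n_tokens_to_generate):
        logits = toy_model(inputs)  # forward pass over the whole sequence
        next_id = int(np.argmax(logits[-1]))  # only the last position is used
        inputs.append(next_id)  # feed the prediction back in
    return inputs[len(inputs) - n_tokens_to_generate:]  # generated ids only


prompt_ids = [0, 1, 2]
generated = greedy_generate(prompt_ids, 4)
print(generated)  # [3, 4, 0, 1]
```

Note that the whole sequence is re-run through the model on every step, just as in the original loop. That is simple to read but does redundant work, which is fine for a teaching implementation.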

The code is below.

# Not running this on a local machine, so the contents of picoGPT/gpt2.py are pasted in as-is so they run here
import numpy as np


def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))


def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)


def layer_norm(x, g, b, eps: float = 1e-5):
    mean = np.mean(x, axis=-1, keepdims=True)
    variance = np.var(x, axis=-1, keepdims=True)
    x = (x - mean) / np.sqrt(variance + eps)  # normalize x to have mean=0 and var=1 over last axis
    return g * x + b  # scale and offset with gamma/beta params


def linear(x, w, b):  # [m, in], [in, out], [out] -> [m, out]
    return x @ w + b


def ffn(x, c_fc, c_proj):  # [n_seq, n_embd] -> [n_seq, n_embd]
    # project up
    a = gelu(linear(x, **c_fc))  # [n_seq, n_embd] -> [n_seq, 4*n_embd]

    # project back down
    x = linear(a, **c_proj)  # [n_seq, 4*n_embd] -> [n_seq, n_embd]

    return x


def attention(q, k, v, mask):  # [n_q, d_k], [n_k, d_k], [n_k, d_v], [n_q, n_k] -> [n_q, d_v]
    return softmax(q @ k.T / np.sqrt(q.shape[-1]) + mask) @ v


def mha(x, c_attn, c_proj, n_head):  # [n_seq, n_embd] -> [n_seq, n_embd]
    # qkv projection
    x = linear(x, **c_attn)  # [n_seq, n_embd] -> [n_seq, 3*n_embd]

    # split into qkv
    qkv = np.split(x, 3, axis=-1)  # [n_seq, 3*n_embd] -> [3, n_seq, n_embd]

    # split into heads
    qkv_heads = list(map(lambda x: np.split(x, n_head, axis=-1), qkv))  # [3, n_seq, n_embd] -> [3, n_head, n_seq, n_embd/n_head]

    # causal mask to hide future inputs from being attended to
    causal_mask = (1 - np.tri(x.shape[0], dtype=x.dtype)) * -1e10  # [n_seq, n_seq]

    # perform attention over each head
    out_heads = [attention(q, k, v, causal_mask) for q, k, v in zip(*qkv_heads)]  # [3, n_head, n_seq, n_embd/n_head] -> [n_head, n_seq, n_embd/n_head]

    # merge heads
    x = np.hstack(out_heads)  # [n_head, n_seq, n_embd/n_head] -> [n_seq, n_embd]

    # out projection
    x = linear(x, **c_proj)  # [n_seq, n_embd] -> [n_seq, n_embd]

    return x


def transformer_block(x, mlp, attn, ln_1, ln_2, n_head):  # [n_seq, n_embd] -> [n_seq, n_embd]
    # multi-head causal self attention
    x = x + mha(layer_norm(x, **ln_1), **attn, n_head=n_head)  # [n_seq, n_embd] -> [n_seq, n_embd]

    # position-wise feed forward network
    x = x + ffn(layer_norm(x, **ln_2), **mlp)  # [n_seq, n_embd] -> [n_seq, n_embd]

    return x


def gpt2(inputs, wte, wpe, blocks, ln_f, n_head):  # [n_seq] -> [n_seq, n_vocab]
    # token + positional embeddings
    x = wte[inputs] + wpe[range(len(inputs))]  # [n_seq] -> [n_seq, n_embd]

    # forward pass through n_layer transformer blocks
    for block in blocks:
        x = transformer_block(x, **block, n_head=n_head)  # [n_seq, n_embd] -> [n_seq, n_embd]

    # projection to vocab
    x = layer_norm(x, **ln_f)  # [n_seq, n_embd] -> [n_seq, n_embd]
    return x @ wte.T  # [n_seq, n_embd] -> [n_seq, n_vocab]


def generate(inputs, params, n_head, n_tokens_to_generate):
    from tqdm import tqdm

    for _ in tqdm(range(n_tokens_to_generate), "generating"):  # auto-regressive decode loop
        logits = gpt2(inputs, **params, n_head=n_head)  # model forward pass
        next_id = np.argmax(logits[-1])  # greedy sampling
        inputs.append(int(next_id))  # append prediction to input

    return inputs[len(inputs) - n_tokens_to_generate :]  # only return generated ids


def main(prompt: str, n_tokens_to_generate: int = 40, model_size: str = "124M", models_dir: str = "models"):
    from utils import load_encoder_hparams_and_params

    # load encoder, hparams, and params from the released open-ai gpt-2 files
    encoder, hparams, params = load_encoder_hparams_and_params(model_size, models_dir)

    # encode the input string using the BPE tokenizer
    input_ids = encoder.encode(prompt)

    # make sure we are not surpassing the max sequence length of our model
    assert len(input_ids) + n_tokens_to_generate < hparams["n_ctx"]

    # generate output ids
    output_ids = generate(input_ids, params, hparams["n_head"], n_tokens_to_generate)

    # decode the ids back into a string
    output_text = encoder.decode(output_ids)

    return output_text
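The head split and merge inside `mha` above can be checked in isolation. The following is a small sketch with toy sizes chosen purely for illustration; it only shows that `np.split` along the last axis followed by `np.hstack` restores the original layout, which is why the merged heads line up with the output projection.

```python
import numpy as np

n_seq, n_embd, n_head = 3, 8, 2  # toy sizes, not GPT-2's real dimensions
x = np.arange(n_seq * n_embd, dtype=float).reshape(n_seq, n_embd)

# split the embedding axis into n_head chunks of width n_embd // n_head
heads = np.split(x, n_head, axis=-1)  # list of n_head arrays, each [n_seq, n_embd/n_head]

# merging with hstack concatenates the chunks back along the last axis
merged = np.hstack(heads)  # [n_seq, n_embd]

print(heads[0].shape, merged.shape)  # (3, 4) (3, 8)
```

`np.split` raises an error if `n_embd` is not evenly divisible by `n_head`, so the two values have to be compatible.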
