Starter Kit: Build Your Own LLM – TensorFlow


Building your own LLM can be a transformative step in research, product innovation, or domain-specific automation. Depending on your goals, computational resources, and available data, there are two primary paths you can take:


Choose Your Track First

| Goal | Track | Description |
| --- | --- | --- |
| 🔧 Full control, academic/research | Build from Scratch | You define architecture, tokenizer, train from raw text |
| 🚀 Quick results, domain-specific | Fine-tune Pretrained LLM | Use existing models like LLaMA, Mistral, GPT-Neo |

Full Roadmap to Build Your Own LLM

1. Define Objectives

  • What do you want your LLM to do? E.g., a chatbot, Q&A system, coding assistant, or legal summarizer

2. Collect and Prepare Data

  • Data sources: Wikipedia, books, Common Crawl, academic papers, code, chat logs
  • Cleaning: remove boilerplate, HTML tags, duplicates
  • Tokenization-ready corpus (plain .txt or .jsonl)

📦 Tools: datasets, BeautifulSoup, langchain, pdfminer, Apache Tika (a short cleaning sketch follows below)
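
A minimal cleaning sketch, assuming the raw pages live in a local raw_pages.html file and the cleaned corpus is written to corpus.txt (both file names are placeholders): it strips HTML with BeautifulSoup and drops duplicate lines before tokenization.

from bs4 import BeautifulSoup

def clean_html(raw_html: str) -> str:
    # Drop tags and keep one piece of text per block element
    return BeautifulSoup(raw_html, "html.parser").get_text(separator="\n")

with open("raw_pages.html", "r", encoding="utf-8") as f:
    lines = [line.strip() for line in clean_html(f.read()).splitlines()]

# Deduplicate non-empty lines while preserving order
seen, unique_lines = set(), []
for line in lines:
    if line and line not in seen:
        seen.add(line)
        unique_lines.append(line)

with open("corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(unique_lines))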

3. Tokenization

  • Choose subword technique:
    • Byte-Level BPE (GPT-2, GPT-J)
    • WordPiece (BERT)
    • Unigram LM (T5, XLNet)

📦 Tools: SentencePiece, Hugging Face Tokenizers

💡 Tip: Train your tokenizer on your own dataset if you're starting from scratch (see the sketch below)
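
A sketch of training your own byte-level BPE tokenizer with the Hugging Face tokenizers library; corpus.txt, the vocabulary size, and the output directory are illustrative choices, not requirements.

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=32000,
    min_frequency=2,
    special_tokens=["<pad>", "<unk>", "<bos>", "<eos>"],
)
tokenizer.save_model("my_tokenizer")  # writes vocab.json and merges.txt

print(tokenizer.encode("Building my own LLM").tokens)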

4. Design or Choose Model Architecture

  • If from scratch:
    • Build transformer blocks (multi-head attention + feedforward)
    • Decide: depth, width, heads, position embeddings
  • If fine-tuning:
    • Choose from: LLaMA 2, Mistral, GPT-Neo, BLOOM, etc.

📦 Frameworks: PyTorch, TensorFlow, Hugging Face Transformers, nanoGPT, minGPT, Megatron-LM

5. Train or Fine-Tune

  • From scratch: use massive datasets (100GB+), train on 8+ GPUs or TPUs
  • Fine-tune: smaller datasets (100MB–10GB), use parameter-efficient techniques:
    • LoRA, QLoRA, PEFT, Adapters

📦 Tools: transformers.Trainer, accelerate, DeepSpeed, Ray, ColossalAI (a LoRA fine-tuning sketch follows below)
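
A hedged sketch of parameter-efficient fine-tuning with LoRA via the peft library; the base model (GPT-Neo 125M), rank, and target modules are illustrative and depend on your model and hardware.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m")

lora_config = LoraConfig(
    r=8,                                  # low-rank adapter dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in GPT-Neo
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights will train
# Then train as usual, e.g. with transformers.Trainer on your domain dataset.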

6. Evaluate the Model

  • Tasks: Text generation, question answering, summarization
  • Metrics:
    • Perplexity
    • BLEU, ROUGE, Exact Match
    • Human evaluation
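
Perplexity is just the exponential of the average token-level cross-entropy, so it can be computed directly on a held-out set. A minimal sketch for the Keras models built later in this guide, assuming eval_dataset yields (input, target) batches like the training pipeline:

import numpy as np
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def perplexity(model, eval_dataset):
    # Average the per-batch cross-entropy, then exponentiate
    batch_losses = []
    for inputs, targets in eval_dataset:
        logits = model(inputs, training=False)
        batch_losses.append(loss_fn(targets, logits).numpy())
    return float(np.exp(np.mean(batch_losses)))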

7. Save and Export

  • Save model weights, tokenizer, config:
    • model.save_pretrained()
    • Convert to ONNX, TorchScript if needed
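
A short sketch of the save/export step with Hugging Face Transformers; the model name and output paths are placeholders, and the ONNX export is optional.

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m")

model.save_pretrained("my_llm")      # weights + config.json
tokenizer.save_pretrained("my_llm")  # tokenizer files

# Optional ONNX export with the optimum CLI (pip install optimum[exporters]):
#   optimum-cli export onnx --model my_llm my_llm_onnx/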

8. Deploy

  • Build an API (Flask/FastAPI/Gradio)
  • Use Streamlit or LangChain for interface
  • Host on:
    • Cloud (AWS, GCP, Azure)
    • Local GPU server
    • Hugging Face Spaces
    • Docker container
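
A minimal FastAPI sketch of the API idea above. The generate_text helper here is a stub; swap in the generation function from the Keras examples later in this guide or a Transformers pipeline.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 200

def generate_text(prompt: str, num_generate: int) -> str:
    # Stub: replace with your trained model's generation routine
    return prompt + " ..."

@app.post("/generate")
def generate(req: Prompt):
    return {"completion": generate_text(req.text, req.max_new_tokens)}

# Run with: uvicorn your_module:app --port 8000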

9. Post-Deployment Monitoring

  • Track hallucinations, drift, user feedback
  • Use prompt engineering or RAG (retrieval-augmented generation) to improve relevance
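
A toy retrieval-augmented prompting sketch: retrieve the most similar document with TF-IDF and prepend it as context before calling the model. The documents, question, and prompt format are illustrative only.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm.",
]
question = "How long do I have to return an item?"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([question])

# Pick the document closest to the question and build a grounded prompt
best_doc = documents[cosine_similarity(query_vector, doc_vectors).argmax()]
prompt = f"Context: {best_doc}\n\nQuestion: {question}\nAnswer:"
print(prompt)  # this prompt, not the bare question, is what the LLM sees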

βš™οΈ Summary Table

| Step | Description | Tools |
| --- | --- | --- |
| 1. Define | Task + scope | – |
| 2. Collect Data | Clean, deduplicate | datasets, bs4, pandas |
| 3. Tokenize | Subword or byte-based | tokenizers, SentencePiece |
| 4. Model | Build or load | transformers, PyTorch, nanoGPT |
| 5. Train | Scratch or fine-tune | Trainer, accelerate, DeepSpeed |
| 6. Evaluate | Perplexity, BLEU | evaluate, custom scripts |
| 7. Save | Store weights, tokenizer | model.save_pretrained() |
| 8. Deploy | API or frontend | Streamlit, FastAPI, Docker |
| 9. Monitor | Feedback + improvements | LangChain, telemetry |

NLP Pipeline Used to Build a Model


NLP Pipeline Enriched Table

| Step | Type of Process | Name / Technique | Where It's Used | Year Developed |
| --- | --- | --- | --- | --- |
| 1 | Raw Input | Raw Text | All NLP tasks | – |
| 2 | Text Cleaning | Lowercasing, Stopword Removal, Lemmatization | Preprocessing, traditional NLP | 1990s–2000s |
| 3 | Tokenization | Word-Level | NLTK, SpaCy, classical ML | ~2000 |
| 3 | Tokenization | Subword BPE | GPT-2, RoBERTa | 2015 (Sennrich) |
| 3 | Tokenization | WordPiece | BERT, ALBERT | 2016 (Google) |
| 3 | Tokenization | Byte-Level BPE | GPT-2, GPT-Neo | 2019 (OpenAI) |
| 3 | Tokenization | Unigram LM (SentencePiece) | T5, XLNet, multilingual NLP | 2018 (Google) |
| 4 | Vectorization | One-Hot Encoding | Classical ML, small DL models | ~1980s–1990s |
| 4 | Vectorization | TF-IDF | Text classification, IR, ML models | 1972 (Jones) |
| 4 | Vectorization | Count Vectorizer | Naive Bayes, SVMs | ~1990s |
| 4 | Vectorization | Word2Vec | Static embeddings | 2013 (Google) |
| 4 | Vectorization | GloVe | Static embeddings | 2014 (Stanford) |
| 4 | Vectorization | FastText | Static + subword embeddings | 2016 (Facebook) |
| 4 | Vectorization | Transformer Embeddings | GPT, BERT, LLaMA | 2017 (Google) |
| 5 | Feature Selection | Chi-Squared, PCA, SelectKBest | Classical ML pipelines | 1990s–2000s |
| 5 | Feature Selection | Attention-based selection | Neural LLMs (implicitly) | 2017+ |
| 6 | Modeling | Logistic Regression, SVM | Traditional ML | ~1950s–1990s |
| 6 | Modeling | LSTM / GRU | RNN-based NLP | 2014–2015 |
| 6 | Modeling | Transformer (Self-Attention) | BERT, GPT, T5, LLaMA | 2017 (Vaswani) |
| 7 | Evaluation | Accuracy, F1, BLEU, Perplexity | Model assessment | Ongoing |

Notes:

  • TF-IDF is one of the oldest vectorization techniques and was foundational to early information retrieval systems.
  • Word2Vec, GloVe, and FastText introduced semantic similarity to embeddings.
  • BPE, WordPiece, and SentencePiece are critical to subword-based tokenization used in most LLMs today.
  • Transformer-based embeddings (like those in BERT, GPT) revolutionized NLP starting in 2017.

Types of Tokenization Techniques

Here's a categorized overview of the most popular tokenization techniques, from basic word-level to advanced subword models like Byte-Level BPE.

Whitespace/Word Tokenization

  • Method: Splits on spaces and punctuation.
  • Example: "Hello, world!" → ["Hello", ",", "world", "!"]
  • Pros: Simple, fast.
  • Cons: Poor handling of unknown/misspelled words.

🟡 Used in: Traditional NLP (before the deep learning era)

Character-Level Tokenization

  • Method: Breaks text into individual characters.
  • Example: "Chat" → ["C", "h", "a", "t"]
  • Pros: Handles unknown words perfectly.
  • Cons: Long sequences, weak semantics.

🟡 Used in: Very small or character-sensitive models (e.g., some speech models)

Subword Tokenization (Most Common in LLMs)

a. Byte-Pair Encoding (BPE)

  • Method: Starts with characters and merges frequent pairs.
  • Example: "lower", "lowest" → ["low", "er"], ["low", "est"]
  • Pros: Handles rare words better than word-level.
  • Cons: Doesn't consider context.

Used in: GPT-2, GPT-3, RoBERTa


b. Byte-Level BPE

  • Method: BPE applied at byte level (i.e., raw UTF-8), not characters.
  • Example: "hello" → ["h", "e", "l", "l", "o"] → merged into tokens like "he", "llo"
  • Pros: Handles all languages & symbols without pre-tokenization.
  • Cons: More tokens per input compared to BPE.

Used in: GPT-2, GPT-Neo, RoBERTa


c. WordPiece

  • Method: Similar to BPE but uses a greedy likelihood-based merging strategy.
  • Example: "unaffordable" → ["un", "##afford", "##able"]
  • Pros: Better modeling of morphemes.
  • Cons: Vocabulary often English-biased.

Used in: BERT, ALBERT, DistilBERT


d. Unigram Language Model (ULM)

  • Method: Chooses subwords based on a probabilistic model, not just frequency.
  • Example: Picks most probable tokenization among many options.
  • Pros: More flexible; allows multiple ways to tokenize.
  • Cons: Slightly more complex.

Used in: T5, XLNet, SentencePiece tokenizer

Byte-Level Unicode Tokenization

  • Method: Tokenizes input at byte level using Unicode bytes.
  • Pros: Universal for all languages, emojis, code, etc.
  • Cons: Long token sequences.

Used in: GPT-J, BigScience BLOOM, newer models with multi-language support.

Character + Subword Hybrid

  • Mixes character and subword tokens to balance robustness and sequence length.

Used in: Some experimental multilingual or speech models.


📊 Quick Comparison

| Technique | Handles OOV | Language-Agnostic | Compression | Used In |
| --- | --- | --- | --- | --- |
| Word/Whitespace | ❌ | ❌ | ❌ | NLTK, SpaCy (basic) |
| Char-Level | ✅ | ✅ | ❌ | Speech, OCR |
| BPE | ✅ | ⚠️ (not always) | ✅ | GPT-2, RoBERTa |
| Byte-Level BPE | ✅ | ✅ | ✅ | GPT-2, GPT-J |
| WordPiece | ✅ | ⚠️ | ✅ | BERT, ALBERT |
| Unigram LM | ✅ | ✅ | ✅ | T5, XLNet |

Byte-Level BPE is widely used in GPT models because it's compact, Unicode-friendly, and doesn't need pre-tokenization. But other methods like WordPiece and Unigram LM are common in BERT, T5, etc.
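
To see these differences concretely, the sketch below runs one sentence through a byte-level BPE tokenizer (GPT-2) and a WordPiece tokenizer (BERT) using Hugging Face Transformers; it downloads the tokenizer files on first run.

from transformers import AutoTokenizer

sentence = "Tokenization of unaffordable words"
for name in ["gpt2", "bert-base-uncased"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, "->", tokenizer.tokenize(sentence))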


Embeddings – How Are They Different from Vectors?

Embeddings are dense vector representations of words, subwords, or tokens, where:

  • Similar meanings → similar vectors (semantic proximity)
  • Fixed size (e.g., 300-dim or 768-dim) regardless of vocab size
  • Can be pretrained or learned during training

Types of Embeddings

| Type | Description | Used In |
| --- | --- | --- |
| Static Word Embeddings | One fixed vector per word | Word2Vec, GloVe, FastText |
| Contextual Embeddings | Varies by sentence context | BERT, GPT, LLaMA, T5 |
| Learned Embeddings | Initialized randomly & trained with the model | Most neural models |

One-Hot Encoding vs Embeddings

| Feature | One-Hot | Embeddings |
| --- | --- | --- |
| Vector type | Sparse (mostly zeros) | Dense |
| Semantic similarity | ❌ No | ✅ Yes (similar words → similar vecs) |
| Size | Vocab-size dimensional | Usually 100–1000 dimensions |
| Scalable to large vocab | ❌ Poor | ✅ Excellent |
| LLMs | ❌ Not used | ✅ Core input layer |

Embedding in Deep Learning

When you build a model in TensorFlow or PyTorch, you'll typically include an embedding layer:

tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim)

This maps token IDs to dense vectors before passing them into transformers, RNNs, etc.
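
A tiny demo of that lookup, with an arbitrary vocabulary size and embedding width: token IDs go in, one dense vector per token comes out.

import tensorflow as tf

embedding = tf.keras.layers.Embedding(input_dim=1000, output_dim=8)
token_ids = tf.constant([[5, 42, 7]])  # a batch with one 3-token sequence
vectors = embedding(token_ids)
print(vectors.shape)                   # (1, 3, 8): one 8-dim vector per token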


Where Embeddings Fall in the Pipeline

| Step | Process | Embeddings Fall Here? |
| --- | --- | --- |
| 1 | Raw Text | – |
| 2 | Text Cleaning | – |
| 3 | Tokenization | Prepares token IDs |
| 4 | Vectorization | ✅ Embeddings are used here |
| 5 | Feature Selection | Rare, embeddings are end-to-end |
| 6 | Modeling | Input to neural models |
| 7 | Evaluation | – |

Below are code snippets to build a basic language model from scratch using Keras (TensorFlow). This first example walks you through training a character-level GRU model on your own text data; a GPT-style Transformer version follows later.

This is educational-level code, not production-scale. A full LLM requires huge datasets and distributed training.


Step-by-Step: Build a Mini LLM with Keras


🔹 Step 1: Install Required Libraries

pip install tensorflow numpy

🔹 Step 2: Prepare the Dataset

For simplicity, we'll use a plain .txt file as training data. You can replace this with a large corpus.

import tensorflow as tf
import numpy as np

# Load text data
with open("your_dataset.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Create character-level vocabulary
vocab = sorted(set(text))
char2idx = {u: i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

# Convert to int
text_as_int = np.array([char2idx[c] for c in text])

# Define sequence length
seq_length = 100
examples_per_epoch = len(text) // seq_length

# Create training dataset
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

# Batch size
BATCH_SIZE = 64
BUFFER_SIZE = 10000
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

🔹 Step 3: Define the Model (Embedding + GRU)

from tensorflow.keras import layers

# Model config
vocab_size = len(vocab)
embedding_dim = 256
rnn_units = 1024

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    # A stateful GRU needs a fixed batch size, so batch_input_shape is set on
    # the first layer (this mirrors the classic TF 2.x text-generation tutorial).
    model = tf.keras.Sequential([
        layers.Embedding(vocab_size, embedding_dim,
                         batch_input_shape=[batch_size, None]),
        layers.GRU(rnn_units,
                   return_sequences=True,
                   stateful=True,
                   recurrent_initializer='glorot_uniform'),
        layers.Dense(vocab_size)
    ])
    return model

model = build_model(vocab_size, embedding_dim, rnn_units, BATCH_SIZE)
model.summary()

🔹 Step 4: Compile and Train

def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

model.compile(optimizer='adam', loss=loss)

# Checkpoint callback
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath="./checkpoints/ckpt_{epoch}",
    save_weights_only=True
)

EPOCHS = 10
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

🔹 Step 5: Generate Text

To generate text, we load the trained weights and use a loop to predict one character at a time.

model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint('./checkpoints'))
model.build(tf.TensorShape([1, None]))

def generate_text(model, start_string, num_generate=500, temperature=1.0):
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)
    text_generated = []

    model.reset_states()
    for _ in range(num_generate):
        predictions = model(input_eval)
        predictions = predictions[:, -1, :] / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(idx2char[predicted_id])

    return start_string + ''.join(text_generated)

print(generate_text(model, start_string="Once upon a time, "))

🔚 Final Thoughts

| What We Just Built | Notes |
| --- | --- |
| Mini LLM using GRU + Embedding | Can be replaced by full Transformer blocks |
| Character-level generation | Can be upgraded to word/subword level with a tokenizer |
| Trained on local text | For better results, use larger cleaned datasets |

🚀 Upgrade Paths

  • Use Transformer blocks (MultiHeadAttention, LayerNorm).
  • Train on tokenized text using BPE (e.g., sentencepiece).
  • Train with TPUs or multi-GPU setups.
  • Switch to huggingface/transformers for scalable workflows.

Here's a rewritten version of the mini LLM from scratch using full Transformer layers in Keras, inspired by a GPT-like architecture (decoder-only Transformer, causal masking, etc.).


Build a GPT-Style Transformer LLM from Scratch Using Keras

This implementation uses:

  • Positional Embeddings
  • Multi-Head Self-Attention
  • Causal Masking
  • Feedforward Layers
  • Layer Normalization & Residual Connections

✅ Step-by-Step Guide

🔹 Step 1: Install and Import Libraries

pip install tensorflow numpy
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

🔹 Step 2: Prepare Dataset (Character-Level)

# Load your dataset
with open("your_dataset.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Build vocabulary
vocab = sorted(set(text))
char2idx = {u: i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

# Encode text
text_as_int = np.array([char2idx[c] for c in text])

# Create input-target pairs
seq_length = 128
examples_per_epoch = len(text_as_int) // seq_length

char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)

def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)
BATCH_SIZE = 64
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

🔹 Step 3: Define Transformer Components

class PositionalEmbedding(layers.Layer):
    def __init__(self, vocab_size, d_model, max_len):
        super().__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=d_model)
        self.pos_emb = layers.Embedding(input_dim=max_len, output_dim=d_model)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions

class CausalSelfAttention(layers.Layer):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.mha = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
        self.layernorm = layers.LayerNormalization(epsilon=1e-6)
        self.dropout = layers.Dropout(0.1)

    def call(self, x, training):
        attn_output = self.mha(query=x, value=x, key=x, attention_mask=self._causal_mask(tf.shape(x)[1]))
        attn_output = self.dropout(attn_output, training=training)
        return self.layernorm(x + attn_output)

    def _causal_mask(self, size):
        # Lower-triangular mask: position i may only attend to positions j <= i
        i = tf.range(size)[:, None]
        j = tf.range(size)
        mask = tf.cast(i >= j, dtype=tf.int32)
        return mask[None, None, :, :]

class FeedForward(layers.Layer):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.seq = keras.Sequential([
            layers.Dense(d_ff, activation='relu'),
            layers.Dense(d_model),
            layers.Dropout(0.1)
        ])
        self.layernorm = layers.LayerNormalization(epsilon=1e-6)

    def call(self, x, training):
        out = self.seq(x, training=training)
        return self.layernorm(x + out)

class TransformerBlock(layers.Layer):
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.att = CausalSelfAttention(d_model, num_heads)
        self.ff = FeedForward(d_model, d_ff)

    def call(self, x, training):
        x = self.att(x, training=training)
        x = self.ff(x, training=training)
        return x

🔹 Step 4: Define the GPT-like Model

def build_gpt_model(vocab_size, seq_len, d_model=256, num_heads=4, d_ff=512, num_layers=4):
    inputs = layers.Input(shape=(seq_len,), dtype="int32")  # token IDs
    x = PositionalEmbedding(vocab_size, d_model, seq_len)(inputs)

    for _ in range(num_layers):
        x = TransformerBlock(d_model, num_heads, d_ff)(x)

    outputs = layers.Dense(vocab_size)(x)
    return keras.Model(inputs=inputs, outputs=outputs)

🔹 Step 5: Compile & Train

model = build_gpt_model(
    vocab_size=len(vocab),
    seq_len=seq_length,
    d_model=256,
    num_heads=4,
    d_ff=512,
    num_layers=4
)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

model.compile(optimizer="adam", loss=loss_fn)
model.summary()

EPOCHS = 10
model.fit(dataset, epochs=EPOCHS)

🔹 Step 6: Generate Text

def generate_text(model, start_string, gen_length=500, temperature=1.0):
    input_indices = [char2idx[s] for s in start_string]
    input_tensor = tf.expand_dims(input_indices, 0)

    generated = []
    for _ in range(gen_length):
        # Keep only the last seq_length tokens; shorter prompts are left-padded with 0
        input_padded = tf.keras.preprocessing.sequence.pad_sequences(
            input_tensor.numpy(), maxlen=seq_length, padding='pre')
        predictions = model(tf.convert_to_tensor(input_padded))[:, -1, :]
        predictions /= temperature
        # Cast to a plain int so the concat below keeps a consistent dtype
        predicted_id = int(tf.random.categorical(predictions, num_samples=1)[0, 0])

        input_tensor = tf.concat([input_tensor, [[predicted_id]]], axis=-1)
        generated.append(idx2char[predicted_id])

    return start_string + ''.join(generated)

print(generate_text(model, start_string="Once upon a time, "))

🧩 Summary

| Component | Description |
| --- | --- |
| Embedding | Token + positional |
| Transformer | Decoder-only, causal masked |
| Attention | Multi-head self-attention |
| Training Data | Character-based, simple .txt |
| Output | Next-char prediction |

Here's a clean and reusable boilerplate for building a GPT-style Transformer LLM from scratch using Keras. It's structured into modular components and can be easily customized or extended for larger projects.


GPT-Style LLM Boilerplate (Keras / TensorFlow)

llm_gpt_keras/
├── data/
│   └── your_dataset.txt
├── model/
│   ├── transformer_blocks.py
│   └── gpt_model.py
├── train.py
├── generate.py
└── requirements.txt

📄 requirements.txt

tensorflow>=2.10
numpy

📄 model/transformer_blocks.py

import tensorflow as tf
from tensorflow.keras import layers

class PositionalEmbedding(layers.Layer):
    def __init__(self, vocab_size, d_model, max_len):
        super().__init__()
        self.token_emb = layers.Embedding(vocab_size, d_model)
        self.pos_emb = layers.Embedding(max_len, d_model)

    def call(self, x):
        positions = tf.range(start=0, limit=tf.shape(x)[-1], delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions

class CausalSelfAttention(layers.Layer):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.mha = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
        self.dropout = layers.Dropout(0.1)
        self.layernorm = layers.LayerNormalization(epsilon=1e-6)

    def call(self, x, training):
        attn = self.mha(query=x, value=x, key=x, attention_mask=self._causal_mask(tf.shape(x)[1]))
        x = self.layernorm(x + self.dropout(attn, training=training))
        return x

    def _causal_mask(self, size):
        i = tf.range(size)[:, None]
        j = tf.range(size)
        mask = tf.cast(i >= j, dtype=tf.int32)
        return mask[None, None, :, :]

class FeedForward(layers.Layer):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.ff = tf.keras.Sequential([
            layers.Dense(d_ff, activation='relu'),
            layers.Dense(d_model),
            layers.Dropout(0.1)
        ])
        self.layernorm = layers.LayerNormalization(epsilon=1e-6)

    def call(self, x, training):
        return self.layernorm(x + self.ff(x, training=training))

class TransformerBlock(layers.Layer):
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.att = CausalSelfAttention(d_model, num_heads)
        self.ff = FeedForward(d_model, d_ff)

    def call(self, x, training):
        x = self.att(x, training)
        x = self.ff(x, training)
        return x

📄 model/gpt_model.py

from tensorflow.keras import layers, Model, Input
from model.transformer_blocks import PositionalEmbedding, TransformerBlock

def build_gpt_model(vocab_size, seq_len, d_model=256, num_heads=4, d_ff=512, num_layers=4):
    inputs = Input(shape=(seq_len,), dtype="int32")  # token IDs
    x = PositionalEmbedding(vocab_size, d_model, seq_len)(inputs)

    for _ in range(num_layers):
        x = TransformerBlock(d_model, num_heads, d_ff)(x)

    outputs = layers.Dense(vocab_size)(x)
    return Model(inputs, outputs)

📄 train.py

import os

import tensorflow as tf
import numpy as np
from model.gpt_model import build_gpt_model

# Load data
with open("data/your_dataset.txt", "r", encoding="utf-8") as f:
    text = f.read()

vocab = sorted(set(text))
char2idx = {u: i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
text_as_int = np.array([char2idx[c] for c in text])

seq_length = 128
examples_per_epoch = len(text_as_int) // (seq_length + 1)
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)

def split_input_target(chunk):
    return chunk[:-1], chunk[1:]

dataset = sequences.map(split_input_target).shuffle(10000).batch(64, drop_remainder=True)

# Model setup
model = build_gpt_model(
    vocab_size=len(vocab),
    seq_len=seq_length,
    d_model=256,
    num_heads=4,
    d_ff=512,
    num_layers=4
)

model.compile(optimizer="adam", loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.fit(dataset, epochs=10)
os.makedirs("checkpoints", exist_ok=True)  # make sure the checkpoint dir exists
model.save_weights("checkpoints/gpt_small.h5")

📄 generate.py

import tensorflow as tf
from model.gpt_model import build_gpt_model
import numpy as np

# Load vocabulary
with open("data/your_dataset.txt", "r", encoding="utf-8") as f:
    text = f.read()

vocab = sorted(set(text))
char2idx = {u: i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

seq_length = 128
model = build_gpt_model(len(vocab), seq_length)
model.load_weights("checkpoints/gpt_small.h5")

def generate_text(start_string, num_generate=500, temperature=1.0):
    input_eval = [char2idx[c] for c in start_string]
    input_eval = tf.expand_dims(input_eval, 0)
    generated = []

    for _ in range(num_generate):
        # Keep only the last seq_length tokens; shorter prompts are left-padded with 0
        input_padded = tf.keras.preprocessing.sequence.pad_sequences(
            input_eval.numpy(), maxlen=seq_length, padding='pre'
        )
        predictions = model(tf.convert_to_tensor(input_padded))[:, -1, :]
        predictions = predictions / temperature
        # Cast to a plain int so the concat below keeps a consistent dtype
        predicted_id = int(tf.random.categorical(predictions, num_samples=1)[0, 0])
        input_eval = tf.concat([input_eval, [[predicted_id]]], axis=-1)
        generated.append(idx2char[predicted_id])

    return start_string + ''.join(generated)

print(generate_text("Once upon a time, "))
