Building your own LLM can be a transformative step in research, product innovation, or domain-specific automation. Depending on your goals, computational resources, and available data, there are two primary paths you can take:
Choose Your Track First
| Goal | Track | Description |
|---|---|---|
| 🧠 Full control, academic/research | Build from Scratch | You define architecture, tokenizer, train from raw text |
| 🚀 Quick results, domain-specific | Fine-tune Pretrained LLM | Use existing models like LLaMA, Mistral, GPT-Neo |
Full Roadmap to Build Your Own LLM
1. Define Objectives
- What do you want your LLM to do? E.g., Chatbot, Q&A system, coding assistant, legal summarizer
2. Collect and Prepare Data
- Data sources: Wikipedia, books, Common Crawl, academic papers, code, chat logs
- Cleaning: remove boilerplate, HTML tags, duplicates
- Tokenization-ready corpus (plain .txt or .jsonl)
📦 Tools: datasets, BeautifulSoup, langchain, pdfminer, Apache Tika (a minimal cleaning sketch follows)
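To make the cleaning step concrete, here is a minimal sketch (the file names and the simple line-level dedup are illustrative; production pipelines typically use fuzzy dedup such as MinHash):

from bs4 import BeautifulSoup

def clean_html(raw_html):
    # Strip tags, scripts, and styles; keep visible text only
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n")

def deduplicate(lines):
    # Exact-match dedup at the line level
    seen = set()
    for line in lines:
        line = line.strip()
        if line and line not in seen:
            seen.add(line)
            yield line

with open("raw_page.html", encoding="utf-8") as f:
    text = clean_html(f.read())
with open("corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(deduplicate(text.splitlines())))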
3. Tokenization
- Choose subword technique:
- Byte-Level BPE (GPT-2, GPT-J)
- WordPiece (BERT)
- Unigram LM (T5, XLNet)
📦 Tools: SentencePiece, Hugging Face Tokenizers
💡 Tip: Train your tokenizer on your own dataset if you're starting from scratch (a sketch follows)
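For example, with the Hugging Face tokenizers library you can train a byte-level BPE tokenizer in a few lines. A minimal sketch, assuming your cleaned corpus is in corpus.txt (the vocab size and special tokens are illustrative):

import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],       # your cleaned corpus
    vocab_size=32000,           # illustrative; tune to your data
    min_frequency=2,
    special_tokens=["<pad>", "<unk>", "<bos>", "<eos>"],
)
os.makedirs("my_tokenizer", exist_ok=True)
tokenizer.save_model("my_tokenizer")  # writes vocab.json and merges.txt
print(tokenizer.encode("Hello, world!").tokens)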
4. Design or Choose Model Architecture
- If from scratch:
- Build transformer blocks (multi-head attention + feedforward)
- Decide: depth, width, heads, position embeddings
- If fine-tuning:
- Choose from: LLaMA 2, Mistral, GPT-Neo, BLOOM, etc.
📦 Frameworks: PyTorch, TensorFlow, HuggingFace Transformers, nanoGPT, minGPT, Megatron-LM
5. Train or Fine-Tune
- From scratch: use massive datasets (100GB+), train on 8+ GPUs or TPUs
- Fine-tune: smaller datasets (100MB–10GB), use parameter-efficient techniques:
- LoRA, QLoRA, PEFT, Adapters
📦 Tools: transformers.Trainer, accelerate, DeepSpeed, Ray, ColossalAI (a LoRA sketch follows)
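As a sketch of the parameter-efficient route, here is a minimal LoRA setup with the peft library. The base model name and target_modules are assumptions (attention projection names vary by architecture), not a fixed recipe:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"   # assumed base model; any causal LM works
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

lora_config = LoraConfig(
    r=8,                                  # low-rank dimension
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # model-specific module names
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # usually well under 1% of all weights
# ...then fine-tune as usual, e.g. with transformers.Trainer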
6. Evaluate the Model
- Tasks: Text generation, question answering, summarization
- Metrics:
- Perplexity (exp of the average per-token cross-entropy; see the sketch after this list)
- BLEU, ROUGE, Exact Match
- Human evaluation
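Perplexity falls straight out of the training objective: it is the exponential of the average per-token cross-entropy. A minimal sketch (the eval_loss value is a placeholder):

import math

eval_loss = 2.31  # placeholder: average per-token cross-entropy (in nats)
perplexity = math.exp(eval_loss)
print(f"perplexity = {perplexity:.2f}")
# ~10.1: on average the model is as uncertain as a uniform
# choice among ~10 tokens at each step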
7. Save and Export
- Save model weights, tokenizer, config:
- model.save_pretrained() (see example below)
- Convert to ONNX, TorchScript if needed
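With Hugging Face models, saving and reloading is symmetric; a minimal sketch (the directory name is arbitrary):

# Save weights, config, and tokenizer into one directory
model.save_pretrained("./my_llm")
tokenizer.save_pretrained("./my_llm")

# Reload later, or on another machine
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("./my_llm")
tokenizer = AutoTokenizer.from_pretrained("./my_llm")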
8. Deploy
- Build an API (Flask/FastAPI/Gradio); a minimal FastAPI sketch follows this list
- Use Streamlit or LangChain for the interface
- Host on:
- Cloud (AWS, GCP, Azure)
- Local GPU server
- Hugging Face Spaces
- Docker container
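A minimal FastAPI sketch (assumes the model saved in step 7; omits batching, auth, and streaming):

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("./my_llm")
model = AutoModelForCausalLM.from_pretrained("./my_llm")

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 100

@app.post("/generate")
def generate(prompt: Prompt):
    inputs = tokenizer(prompt.text, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=prompt.max_new_tokens)
    return {"completion": tokenizer.decode(outputs[0], skip_special_tokens=True)}

# Run with: uvicorn app:app --port 8000  (assuming this file is app.py)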
9. Post-Deployment Monitoring
- Track hallucinations, drift, user feedback
- Use prompt engineering or RAG (retrieval-augmented generation) to improve relevance
Summary Table
| Step | Description | Tools |
|---|---|---|
| 1. Define | Task + scope | — |
| 2. Collect Data | Clean, deduplicate | datasets, bs4, pandas |
| 3. Tokenize | Subword or byte-based | tokenizers, SentencePiece |
| 4. Model | Build or load | transformers, PyTorch, nanoGPT |
| 5. Train | Scratch or fine-tune | Trainer, accelerate, DeepSpeed |
| 6. Evaluate | Perplexity, BLEU | evaluate, custom scripts |
| 7. Save | Store weights, tokenizer | model.save_pretrained() |
| 8. Deploy | API or frontend | Streamlit, FastAPI, Docker |
| 9. Monitor | Feedback + improvements | LangChain, telemetry |
NLP Pipeline Used to Build a Model
The enriched table below maps each pipeline step to representative techniques, where they are used, and roughly when they were developed.
| Step | Type of Process | Name / Technique | Where It's Used | Year Developed |
|---|---|---|---|---|
| 1 | Raw Input | Raw Text | All NLP tasks | — |
| 2 | Text Cleaning | Lowercasing, Stopword Removal, Lemmatization | Preprocessing, traditional NLP | 1990s–2000s |
| 3 | Tokenization | Word-Level | NLTK, SpaCy, classical ML | ~2000 |
| 3 | Tokenization | Subword BPE | GPT-2, RoBERTa | 2015 (Sennrich) |
| 3 | Tokenization | WordPiece | BERT, ALBERT | 2016 (Google) |
| 3 | Tokenization | Byte-Level BPE | GPT-2, GPT-Neo | 2019 (OpenAI) |
| 3 | Tokenization | Unigram LM (SentencePiece) | T5, XLNet, multilingual NLP | 2018 (Google) |
| 4 | Vectorization | One-Hot Encoding | Classical ML, small DL models | ~1980s–1990s |
| 4 | Vectorization | TF-IDF | Text classification, IR, ML models | 1972 (Jones) |
| 4 | Vectorization | Count Vectorizer | Naive Bayes, SVMs | ~1990s |
| 4 | Vectorization | Word2Vec | Static embeddings | 2013 (Google) |
| 4 | Vectorization | GloVe | Static embeddings | 2014 (Stanford) |
| 4 | Vectorization | FastText | Static + subword embeddings | 2016 (Facebook) |
| 4 | Vectorization | Transformer Embeddings | GPT, BERT, LLaMA | 2017 (Google) |
| 5 | Feature Selection | Chi-Squared, PCA, SelectKBest | Classical ML pipelines | 1990s–2000s |
| 5 | Feature Selection | Attention-based selection | Neural LLMs (implicitly) | 2017+ |
| 6 | Modeling | Logistic Regression, SVM | Traditional ML | ~1950s–1990s |
| 6 | Modeling | LSTM / GRU | RNN-based NLP | 2014–2015 |
| 6 | Modeling | Transformer (Self-Attention) | BERT, GPT, T5, LLaMA | 2017 (Vaswani) |
| 7 | Evaluation | Accuracy, F1, BLEU, Perplexity | Model assessment | Ongoing |
Notes:
- TF-IDF is one of the oldest vectorization techniques and was foundational to early information retrieval systems.
- Word2Vec, GloVe, and FastText introduced semantic similarity to embeddings.
- BPE, WordPiece, and SentencePiece are critical to subword-based tokenization used in most LLMs today.
- Transformer-based embeddings (like those in BERT, GPT) revolutionized NLP starting in 2017.
Types of Tokenization Techniques
Here's a categorized overview of the most popular tokenization techniques, from basic word-level to advanced subword models like Byte-Level BPE.
Whitespace/Word Tokenization
- Method: Splits on spaces and punctuation.
- Example: “Hello, world!” → [“Hello”, “,”, “world”, “!”]
- Pros: Simple, fast.
- Cons: Poor handling of unknown/misspelled words.
💡 Used in: Traditional NLP (before the deep learning era); example below
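For instance, with NLTK (the tokenizer data is a one-time download; newer NLTK versions may name it punkt_tab):

import nltk
nltk.download("punkt")  # one-time download of tokenizer data
from nltk.tokenize import word_tokenize

print(word_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']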
Character-Level Tokenization
- Method: Breaks text into individual characters.
- Example: “Chat” → [“C”, “h”, “a”, “t”]
- Pros: Handles unknown words perfectly.
- Cons: Long sequences, weak semantics.
💡 Used in: Very small or character-sensitive models (e.g., some speech models)
Subword Tokenization (Most Common in LLMs)
a. Byte-Pair Encoding (BPE)
- Method: Starts with characters and merges frequent pairs.
- Example: “lower”, “lowest” → [“low”, “er”], [“low”, “est”]
- Pros: Handles rare words better than word-level.
- Cons: Doesn't consider context.
Used in: the original GPT and early NMT systems (GPT-2, GPT-3, and RoBERTa use the byte-level variant below)
b. Byte-Level BPE
- Method: BPE applied at byte level (i.e., raw UTF-8), not characters.
- Example: “hello” → [“h”, “e”, “l”, “l”, “o”] → merged into tokens like “he”, “llo”
- Pros: Handles all languages & symbols without pre-tokenization.
- Cons: More tokens per input compared to BPE.
Used in: GPT-2, GPT-Neo, RoBERTa
c. WordPiece
- Method: Similar to BPE but uses a greedy likelihood-based merging strategy.
- Example: “unaffordable” → [“un”, “##afford”, “##able”]
- Pros: Better modeling of morphemes.
- Cons: Vocabulary often English-biased.
Used in: BERT, ALBERT, DistilBERT (example below)
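You can see the ## continuation pieces directly with a pretrained BERT tokenizer (sketch; the exact pieces depend on the model's vocabulary):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("unaffordable"))
# Prints subword pieces; the '##' prefix marks a piece that
# continues the previous token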
d. Unigram Language Model (ULM)
- Method: Chooses subwords based on a probabilistic model, not just frequency.
- Example: Picks most probable tokenization among many options.
- Pros: More flexible; allows multiple ways to tokenize.
- Cons: Slightly more complex.
Used in: T5, XLNet, SentencePiece tokenizer
Byte-Level Unicode Tokenization
- Method: Tokenizes input at byte level using Unicode bytes.
- Pros: Universal for all languages, emojis, code, etc.
- Cons: Long token sequences.
Used in: GPT-J, BigScience BLOOM, newer models with multi-language support.
Character + Subword Hybrid
- Mixes character and subword tokens to balance robustness and sequence length.
Used in: Some experimental multilingual or speech models.
Quick Comparison
| Technique | Handles OOV | Language-Agnostic | Compression | Used In |
|---|---|---|---|---|
| Word/Whitespace | ❌ | ❌ | ❌ | NLTK, SpaCy (basic) |
| Char-Level | ✅ | ✅ | ❌ | Speech, OCR |
| BPE | ✅ | ⚠️ (not always) | ✅ | GPT-2, RoBERTa |
| Byte-Level BPE | ✅ | ✅ | ✅ | GPT-2, GPT-J |
| WordPiece | ✅ | ⚠️ | ✅ | BERT, ALBERT |
| Unigram LM | ✅ | ✅ | ✅ | T5, XLNet |
Byte-Level BPE is widely used in GPT models because it's compact, Unicode-friendly, and doesn't need pre-tokenization. Other methods like WordPiece and Unigram LM are common in BERT, T5, etc.
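You can inspect byte-level BPE with the GPT-2 tokenizer (sketch; the Ġ character is how byte-level vocabularies encode a leading space):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("Hello, world!"))   # e.g. ['Hello', ',', 'Ġworld', '!']
print(tok.tokenize("🤖"))              # emoji still tokenize, as byte-level pieces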
Embeddings: How Are They Different from Plain Vectors?
Embeddings are dense vector representations of words, subwords, or tokens, where:
- Similar meanings → similar vectors (semantic proximity)
- Fixed size (e.g., 300-dim or 768-dim) regardless of vocab size
- Can be pretrained or learned during training
Types of Embeddings
| Type | Description | Used In |
|---|---|---|
| Static Word Embeddings | One fixed vector per word | Word2Vec, GloVe, FastText |
| Contextual Embeddings | Varies by sentence context | BERT, GPT, LLaMA, T5 |
| Learned Embeddings | Initialized randomly & trained with model | Most neural models |
One-Hot Encoding vs Embeddings
| Feature | One-Hot | Embeddings |
|---|---|---|
| Vector type | Sparse (mostly zeros) | Dense |
| Semantic similarity | ❌ No | ✅ Yes (similar words → similar vectors) |
| Size | Vocab-size dimensional | Usually 100–1000 dimensions |
| Scalable to large vocab | ❌ Poor | ✅ Excellent |
| LLMs | ❌ Not used | ✅ Core input layer |
Embedding in Deep Learning
When you build a model in TensorFlow or PyTorch, you'll typically have an embedding layer:
tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim)
This maps token IDs to dense vectors before passing them into transformers, RNNs, etc.
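A tiny runnable sketch of what this layer does (the vocabulary size and dimensions are arbitrary):

import tensorflow as tf

emb = tf.keras.layers.Embedding(input_dim=1000, output_dim=64)

token_ids = tf.constant([[3, 41, 7]])  # a batch of one 3-token sequence
vectors = emb(token_ids)
print(vectors.shape)  # (1, 3, 64): one dense 64-dim vector per token ID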
Where Embeddings Fall in the Pipeline
| Step | Process | Embedding Falls Here? |
|---|---|---|
| 1 | Raw Text | ❌ |
| 2 | Text Cleaning | ❌ |
| 3 | Tokenization | Prepares token IDs |
| 4 | Vectorization | ✅ Embeddings are used here |
| 5 | Feature Selection | Rare; embeddings are end-to-end |
| 6 | Modeling | Input to neural models |
| 7 | Evaluation | ❌ |
Below are code snippets for building a basic language model from scratch using Keras (TensorFlow). This example walks you through training a small character-level model on your own text data; a full GPT-style Transformer version follows in the next section.
This is educational-level code, not production scale. A full LLM requires huge datasets and distributed training.
Step-by-Step: Build a Mini LLM with Keras
🔹 Step 1: Install Required Libraries
pip install tensorflow numpy
🔹 Step 2: Prepare the Dataset
For simplicity, we'll use a plain .txt file as training data. You can replace this with a large corpus.
import tensorflow as tf
import numpy as np
# Load text data
with open("your_dataset.txt", "r", encoding="utf-8") as f:
text = f.read()
# Create character-level vocabulary
vocab = sorted(set(text))
char2idx = {u: i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
# Convert to int
text_as_int = np.array([char2idx[c] for c in text])
# Define sequence length
seq_length = 100
examples_per_epoch = len(text) // seq_length
# Create training dataset
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)
def split_input_target(chunk):
input_text = chunk[:-1]
target_text = chunk[1:]
return input_text, target_text
dataset = sequences.map(split_input_target)
# Batch size
BATCH_SIZE = 64
BUFFER_SIZE = 10000
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
🔹 Step 3: Define the Model (Embedding + GRU)
This starter version uses a GRU in place of attention blocks; the full Transformer rewrite appears in the next section.
from tensorflow.keras import layers
# Model config
vocab_size = len(vocab)
embedding_dim = 256
rnn_units = 1024
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
model = tf.keras.Sequential([
layers.Embedding(vocab_size, embedding_dim,
batch_input_shape=[batch_size, None]),
layers.GRU(rnn_units,
return_sequences=True,
stateful=True,
recurrent_initializer='glorot_uniform'),
layers.Dense(vocab_size)
])
return model
model = build_model(vocab_size, embedding_dim, rnn_units, BATCH_SIZE)
model.summary()
🔹 Step 4: Compile and Train
def loss(labels, logits):
return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
model.compile(optimizer='adam', loss=loss)
# Checkpoint callback
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
filepath="./checkpoints/ckpt_{epoch}",
save_weights_only=True
)
EPOCHS = 10
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])
🔹 Step 5: Generate Text
To generate text, we load the trained weights and use a loop to predict one character at a time.
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint('./checkpoints'))
model.build(tf.TensorShape([1, None]))
def generate_text(model, start_string, num_generate=500, temperature=1.0):
input_eval = [char2idx[s] for s in start_string]
input_eval = tf.expand_dims(input_eval, 0)
text_generated = []
model.reset_states()
for _ in range(num_generate):
predictions = model(input_eval)
predictions = predictions[:, -1, :] / temperature
predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()
input_eval = tf.expand_dims([predicted_id], 0)
text_generated.append(idx2char[predicted_id])
return start_string + ''.join(text_generated)
print(generate_text(model, start_string="Once upon a time, "))
Final Thoughts
| What we Just Built | Notes |
|---|---|
| Mini LLM using GRU + Embedding | Can be replaced by full Transformer blocks |
| Character-level generation | Can be upgraded to word/subword level with Tokenizer |
| Trained on local text | For better results, use larger cleaned datasets |
Upgrade Paths
- Use Transformer blocks (MultiHeadAttention, LayerNorm).
- Train on tokenized text using BPE (e.g., sentencepiece).
- Train with TPUs or multi-GPU setups.
- Switch to huggingface/transformers for scalable workflows.
Here's a rewritten version of the mini LLM using full Transformer layers in Keras, inspired by GPT-like architectures (decoder-only Transformer, causal masking, etc.).
Build a GPT-Style Transformer LLM from Scratch Using Keras
This implementation uses:
- Positional Embeddings
- Multi-Head Self-Attention
- Causal Masking
- Feedforward Layers
- Layer Normalization & Residual Connections
Step-by-Step Guide
🔹 Step 1: Install and Import Libraries
pip install tensorflow numpy
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
🔹 Step 2: Prepare Dataset (Character-Level)
# Load your dataset
with open("your_dataset.txt", "r", encoding="utf-8") as f:
text = f.read()
# Build vocabulary
vocab = sorted(set(text))
char2idx = {u: i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
# Encode text
text_as_int = np.array([char2idx[c] for c in text])
# Create input-target pairs
seq_length = 128
examples_per_epoch = len(text_as_int) // seq_length
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)
def split_input_target(chunk):
input_text = chunk[:-1]
target_text = chunk[1:]
return input_text, target_text
dataset = sequences.map(split_input_target)
BATCH_SIZE = 64
BUFFER_SIZE = 10000
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
🔹 Step 3: Define Transformer Components
class PositionalEmbedding(layers.Layer):
def __init__(self, vocab_size, d_model, max_len):
super().__init__()
self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=d_model)
self.pos_emb = layers.Embedding(input_dim=max_len, output_dim=d_model)
def call(self, x):
maxlen = tf.shape(x)[-1]
positions = tf.range(start=0, limit=maxlen, delta=1)
positions = self.pos_emb(positions)
x = self.token_emb(x)
return x + positions
class CausalSelfAttention(layers.Layer):
def __init__(self, d_model, num_heads):
super().__init__()
self.mha = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
self.layernorm = layers.LayerNormalization(epsilon=1e-6)
self.dropout = layers.Dropout(0.1)
    def call(self, x, training=None):  # default lets Keras pass training automatically
attn_output = self.mha(query=x, value=x, key=x, attention_mask=self._causal_mask(tf.shape(x)[1]))
attn_output = self.dropout(attn_output, training=training)
return self.layernorm(x + attn_output)
def _causal_mask(self, size):
i = tf.range(size)[:, None]
j = tf.range(size)
mask = tf.cast(i >= j, dtype=tf.int32)
return mask[None, None, :, :]
class FeedForward(layers.Layer):
def __init__(self, d_model, d_ff):
super().__init__()
self.seq = keras.Sequential([
layers.Dense(d_ff, activation='relu'),
layers.Dense(d_model),
layers.Dropout(0.1)
])
self.layernorm = layers.LayerNormalization(epsilon=1e-6)
    def call(self, x, training=None):
out = self.seq(x, training=training)
return self.layernorm(x + out)
class TransformerBlock(layers.Layer):
def __init__(self, d_model, num_heads, d_ff):
super().__init__()
self.att = CausalSelfAttention(d_model, num_heads)
self.ff = FeedForward(d_model, d_ff)
    def call(self, x, training=None):
x = self.att(x, training=training)
x = self.ff(x, training=training)
        return x
🔹 Step 4: Define the GPT-like Model
def build_gpt_model(vocab_size, seq_len, d_model=256, num_heads=4, d_ff=512, num_layers=4):
inputs = layers.Input(shape=(seq_len,))
x = PositionalEmbedding(vocab_size, d_model, seq_len)(inputs)
for _ in range(num_layers):
x = TransformerBlock(d_model, num_heads, d_ff)(x)
outputs = layers.Dense(vocab_size)(x)
    return keras.Model(inputs=inputs, outputs=outputs)
🔹 Step 5: Compile & Train
model = build_gpt_model(
vocab_size=len(vocab),
seq_len=seq_length,
d_model=256,
num_heads=4,
d_ff=512,
num_layers=4
)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer="adam", loss=loss_fn)
model.summary()
EPOCHS = 10
model.fit(dataset, epochs=EPOCHS)
🔹 Step 6: Generate Text
def generate_text(model, start_string, gen_length=500, temperature=1.0):
    # Keep the growing context as a plain Python list of token IDs
    input_ids = [char2idx[s] for s in start_string]
    generated = []
    for _ in range(gen_length):
        # Pad (or trim) the context to the fixed input length the model expects;
        # pre-padding with 0 reuses token id 0, which is acceptable for a demo
        padded = tf.keras.preprocessing.sequence.pad_sequences(
            [input_ids], maxlen=seq_length, padding='pre')
        predictions = model(tf.convert_to_tensor(padded))[:, -1, :]
        predictions = predictions / temperature
        # int() avoids an int32/int64 mismatch when appending to the context
        predicted_id = int(tf.random.categorical(predictions, num_samples=1)[0, 0])
        input_ids.append(predicted_id)
        generated.append(idx2char[predicted_id])
    return start_string + ''.join(generated)
print(generate_text(model, start_string="Once upon a time, "))
🧩 Summary
| Component | Description |
|---|---|
| Embedding | Token + Positional |
| Transformer | Decoder-only, causal masked |
| Attention | Multi-head self-attention |
| Training Data | Character-based, simple .txt |
| Output | Next-char prediction |
Here's a clean and reusable boilerplate for building a GPT-style Transformer LLM from scratch using Keras. It's structured into modular components and can be easily customized or extended for larger projects.
GPT-Style LLM Boilerplate (Keras / TensorFlow)
llm_gpt_keras/
├── data/
│   └── your_dataset.txt
├── model/
│   ├── transformer_blocks.py
│   └── gpt_model.py
├── train.py
├── generate.py
└── requirements.txt
📄 requirements.txt
tensorflow>=2.10
numpy
📄 model/transformer_blocks.py
import tensorflow as tf
from tensorflow.keras import layers
class PositionalEmbedding(layers.Layer):
def __init__(self, vocab_size, d_model, max_len):
super().__init__()
self.token_emb = layers.Embedding(vocab_size, d_model)
self.pos_emb = layers.Embedding(max_len, d_model)
def call(self, x):
positions = tf.range(start=0, limit=tf.shape(x)[-1], delta=1)
positions = self.pos_emb(positions)
x = self.token_emb(x)
return x + positions
class CausalSelfAttention(layers.Layer):
def __init__(self, d_model, num_heads):
super().__init__()
self.mha = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
self.dropout = layers.Dropout(0.1)
self.layernorm = layers.LayerNormalization(epsilon=1e-6)
    def call(self, x, training=None):
attn = self.mha(query=x, value=x, key=x, attention_mask=self._causal_mask(tf.shape(x)[1]))
x = self.layernorm(x + self.dropout(attn, training=training))
return x
def _causal_mask(self, size):
i = tf.range(size)[:, None]
j = tf.range(size)
mask = tf.cast(i >= j, dtype=tf.int32)
return mask[None, None, :, :]
class FeedForward(layers.Layer):
def __init__(self, d_model, d_ff):
super().__init__()
self.ff = tf.keras.Sequential([
layers.Dense(d_ff, activation='relu'),
layers.Dense(d_model),
layers.Dropout(0.1)
])
self.layernorm = layers.LayerNormalization(epsilon=1e-6)
    def call(self, x, training=None):
return self.layernorm(x + self.ff(x, training=training))
class TransformerBlock(layers.Layer):
def __init__(self, d_model, num_heads, d_ff):
super().__init__()
self.att = CausalSelfAttention(d_model, num_heads)
self.ff = FeedForward(d_model, d_ff)
    def call(self, x, training=None):
x = self.att(x, training)
x = self.ff(x, training)
        return x
📄 model/gpt_model.py
model/gpt_model.py
from tensorflow.keras import layers, Model, Input
from model.transformer_blocks import PositionalEmbedding, TransformerBlock
def build_gpt_model(vocab_size, seq_len, d_model=256, num_heads=4, d_ff=512, num_layers=4):
inputs = Input(shape=(seq_len,))
x = PositionalEmbedding(vocab_size, d_model, seq_len)(inputs)
for _ in range(num_layers):
x = TransformerBlock(d_model, num_heads, d_ff)(x)
outputs = layers.Dense(vocab_size)(x)
    return Model(inputs, outputs)
📄 train.py
train.py
import os
import tensorflow as tf
import numpy as np
from model.gpt_model import build_gpt_model
# Load data
with open("data/your_dataset.txt", "r", encoding="utf-8") as f:
text = f.read()
vocab = sorted(set(text))
char2idx = {u: i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
text_as_int = np.array([char2idx[c] for c in text])
seq_length = 128
examples_per_epoch = len(text_as_int) // (seq_length + 1)
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)
def split_input_target(chunk):
return chunk[:-1], chunk[1:]
dataset = sequences.map(split_input_target).shuffle(10000).batch(64, drop_remainder=True)
# Model setup
model = build_gpt_model(
vocab_size=len(vocab),
seq_len=seq_length,
d_model=256,
num_heads=4,
d_ff=512,
num_layers=4
)
model.compile(optimizer="adam", loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.fit(dataset, epochs=10)
os.makedirs("checkpoints", exist_ok=True)  # make sure the directory exists
model.save_weights("checkpoints/gpt_small.h5")
📄 generate.py
generate.py
import tensorflow as tf
from model.gpt_model import build_gpt_model
import numpy as np
# Load vocabulary
with open("data/your_dataset.txt", "r", encoding="utf-8") as f:
text = f.read()
vocab = sorted(set(text))
char2idx = {u: i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
seq_length = 128
model = build_gpt_model(len(vocab), seq_length)
model.load_weights("checkpoints/gpt_small.h5")
def generate_text(start_string, num_generate=500, temperature=1.0):
    # Keep the growing context as a plain Python list of token IDs
    input_ids = [char2idx[c] for c in start_string]
    generated = []
    for _ in range(num_generate):
        # Pad (or trim) the context to the fixed length the model expects
        padded = tf.keras.preprocessing.sequence.pad_sequences(
            [input_ids], maxlen=seq_length, padding='pre'
        )
        predictions = model(tf.convert_to_tensor(padded))[:, -1, :]
        predictions = predictions / temperature
        # int() avoids an int32/int64 mismatch when appending to the context
        predicted_id = int(tf.random.categorical(predictions, num_samples=1)[0, 0])
        input_ids.append(predicted_id)
        generated.append(idx2char[predicted_id])
    return start_string + ''.join(generated)
print(generate_text("Once upon a time, "))