Building your own LLM can be a transformative step in research, product innovation, or domain-specific automation. Depending on your goals, computational resources, and available data, there are two primary paths you can take:
Choose Your Track First
| Goal | Track | Description |
|---|---|---|
| 🧠 Full control, academic/research | Build from Scratch | You define architecture, tokenizer, train from raw text |
| 🚀 Quick results, domain-specific | Fine-tune Pretrained LLM | Use existing models like LLaMA, Mistral, GPT-Neo |
Full Roadmap to Build Your Own LLM
1. Define Objectives
- What do you want your LLM to do? E.g., Chatbot, Q&A system, coding assistant, legal summarizer
2. Collect and Prepare Data
- Data sources: Wikipedia, books, Common Crawl, academic papers, code, chat logs
- Cleaning: remove boilerplate, HTML tags, duplicates
- Tokenization-ready corpus (plain .txt or .jsonl)
📦 Tools: datasets, BeautifulSoup, langchain, pdfminer, Apache Tika (a minimal cleaning sketch follows)
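To make the cleaning step concrete, here is a minimal sketch (the file names and the simple line-level dedup are illustrative; production pipelines typically use fuzzy dedup such as MinHash):

from bs4 import BeautifulSoup

def clean_html(raw_html):
    # Strip tags, scripts, and styles; keep visible text only
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n")

def deduplicate(lines):
    # Exact-match dedup at the line level
    seen = set()
    for line in lines:
        line = line.strip()
        if line and line not in seen:
            seen.add(line)
            yield line

with open("raw_page.html", encoding="utf-8") as f:
    text = clean_html(f.read())
with open("corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(deduplicate(text.splitlines())))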
3. Tokenization
- Choose subword technique:
- Byte-Level BPE (GPT-2, GPT-J)
- WordPiece (BERT)
- Unigram LM (T5, XLNet)
📦 Tools: SentencePiece, Hugging Face Tokenizers
💡 Tip: Train your tokenizer on your own dataset if you're starting from scratch (a sketch follows)
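For example, with the Hugging Face tokenizers library you can train a byte-level BPE tokenizer in a few lines. A minimal sketch, assuming your cleaned corpus is in corpus.txt (the vocab size and special tokens are illustrative):

import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],       # your cleaned corpus
    vocab_size=32000,           # illustrative; tune to your data
    min_frequency=2,
    special_tokens=["<pad>", "<unk>", "<bos>", "<eos>"],
)
os.makedirs("my_tokenizer", exist_ok=True)
tokenizer.save_model("my_tokenizer")  # writes vocab.json and merges.txt
print(tokenizer.encode("Hello, world!").tokens)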
4. Design or Choose Model Architecture
- If from scratch:
- Build transformer blocks (multi-head attention + feedforward)
- Decide: depth, width, heads, position embeddings
- If fine-tuning:
- Choose from: LLaMA 2, Mistral, GPT-Neo, BLOOM, etc.
📦 Frameworks: PyTorch, TensorFlow, HuggingFace Transformers, nanoGPT, minGPT, Megatron-LM
5. Train or Fine-Tune
- From scratch: use massive datasets (100GB+), train on 8+ GPUs or TPUs
- Fine-tune: smaller datasets (100MB–10GB), use parameter-efficient techniques:
- LoRA, QLoRA, PEFT, Adapters
📦 Tools: transformers.Trainer, accelerate, DeepSpeed, Ray, ColossalAI (a LoRA sketch follows)
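As a sketch of the parameter-efficient route, here is a minimal LoRA setup with the peft library. The base model name and target_modules are assumptions (attention projection names vary by architecture), not a fixed recipe:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"   # assumed base model; any causal LM works
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

lora_config = LoraConfig(
    r=8,                                  # low-rank dimension
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # model-specific module names
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # usually well under 1% of all weights
# ...then fine-tune as usual, e.g. with transformers.Trainer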
6. Evaluate the Model
- Tasks: Text generation, question answering, summarization
- Metrics:
- Perplexity (exp of the average per-token cross-entropy; see the sketch after this list)
- BLEU, ROUGE, Exact Match
- Human evaluation
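Perplexity falls straight out of the training objective: it is the exponential of the average per-token cross-entropy. A minimal sketch (the eval_loss value is a placeholder):

import math

eval_loss = 2.31  # placeholder: average per-token cross-entropy (in nats)
perplexity = math.exp(eval_loss)
print(f"perplexity = {perplexity:.2f}")
# ~10.1: on average the model is as uncertain as a uniform
# choice among ~10 tokens at each step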
7. Save and Export
- Save model weights, tokenizer, config:
- model.save_pretrained() (see example below)
- Convert to ONNX, TorchScript if needed
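With Hugging Face models, saving and reloading is symmetric; a minimal sketch (the directory name is arbitrary):

# Save weights, config, and tokenizer into one directory
model.save_pretrained("./my_llm")
tokenizer.save_pretrained("./my_llm")

# Reload later, or on another machine
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("./my_llm")
tokenizer = AutoTokenizer.from_pretrained("./my_llm")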
8. Deploy
- Build an API (Flask/FastAPI/Gradio); a minimal FastAPI sketch follows this list
- Use Streamlit or LangChain for the interface
- Host on:
- Cloud (AWS, GCP, Azure)
- Local GPU server
- Hugging Face Spaces
- Docker container
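A minimal FastAPI sketch (assumes the model saved in step 7; omits batching, auth, and streaming):

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("./my_llm")
model = AutoModelForCausalLM.from_pretrained("./my_llm")

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 100

@app.post("/generate")
def generate(prompt: Prompt):
    inputs = tokenizer(prompt.text, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=prompt.max_new_tokens)
    return {"completion": tokenizer.decode(outputs[0], skip_special_tokens=True)}

# Run with: uvicorn app:app --port 8000  (assuming this file is app.py)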
9. Post-Deployment Monitoring
- Track hallucinations, drift, user feedback
- Use prompt engineering or RAG (retrieval-augmented generation) to improve relevance
Summary Table
| Step | Description | Tools |
|---|---|---|
| 1. Define | Task + scope | — |
| 2. Collect Data | Clean, deduplicate | datasets, bs4, pandas |
| 3. Tokenize | Subword or byte-based | tokenizers, SentencePiece |
| 4. Model | Build or load | transformers, PyTorch, nanoGPT |
| 5. Train | Scratch or fine-tune | Trainer, accelerate, DeepSpeed |
| 6. Evaluate | Perplexity, BLEU | evaluate, custom scripts |
| 7. Save | Store weights, tokenizer | model.save_pretrained() |
| 8. Deploy | API or frontend | Streamlit, FastAPI, Docker |
| 9. Monitor | Feedback + improvements | LangChain, telemetry |
NLP Pipeline Used to Build a Model
The enriched table below maps each pipeline step to representative techniques, where they are used, and roughly when they were developed.
| Step | Type of Process | Name / Technique | Where It's Used | Year Developed |
|---|---|---|---|---|
| 1 | Raw Input | Raw Text | All NLP tasks | — |
| 2 | Text Cleaning | Lowercasing, Stopword Removal, Lemmatization | Preprocessing, traditional NLP | 1990s–2000s |
| 3 | Tokenization | Word-Level | NLTK, SpaCy, classical ML | ~2000 |
| 3 | Tokenization | Subword BPE | GPT-2, RoBERTa | 2015 (Sennrich) |
| 3 | Tokenization | WordPiece | BERT, ALBERT | 2016 (Google) |
| 3 | Tokenization | Byte-Level BPE | GPT-2, GPT-Neo | 2019 (OpenAI) |
| 3 | Tokenization | Unigram LM (SentencePiece) | T5, XLNet, multilingual NLP | 2018 (Google) |
| 4 | Vectorization | One-Hot Encoding | Classical ML, small DL models | ~1980s–1990s |
| 4 | Vectorization | TF-IDF | Text classification, IR, ML models | 1972 (Jones) |
| 4 | Vectorization | Count Vectorizer | Naive Bayes, SVMs | ~1990s |
| 4 | Vectorization | Word2Vec | Static embeddings | 2013 (Google) |
| 4 | Vectorization | GloVe | Static embeddings | 2014 (Stanford) |
| 4 | Vectorization | FastText | Static + subword embeddings | 2016 (Facebook) |
| 4 | Vectorization | Transformer Embeddings | GPT, BERT, LLaMA | 2017 (Google) |
| 5 | Feature Selection | Chi-Squared, PCA, SelectKBest | Classical ML pipelines | 1990s–2000s |
| 5 | Feature Selection | Attention-based selection | Neural LLMs (implicitly) | 2017+ |
| 6 | Modeling | Logistic Regression, SVM | Traditional ML | ~1950s–1990s |
| 6 | Modeling | LSTM / GRU | RNN-based NLP | 2014–2015 |
| 6 | Modeling | Transformer (Self-Attention) | BERT, GPT, T5, LLaMA | 2017 (Vaswani) |
| 7 | Evaluation | Accuracy, F1, BLEU, Perplexity | Model assessment | Ongoing |
Notes:
- TF-IDF is one of the oldest vectorization techniques and was foundational to early information retrieval systems.
- Word2Vec, GloVe, and FastText introduced semantic similarity to embeddings.
- BPE, WordPiece, and SentencePiece are critical to subword-based tokenization used in most LLMs today.
- Transformer-based embeddings (like those in BERT, GPT) revolutionized NLP starting in 2017.
Types of Tokenization Techniques
Here's a categorized overview of the most popular tokenization techniques, from basic word-level to advanced subword models like Byte-Level BPE.
Whitespace/Word Tokenization
- Method: Splits on spaces and punctuation.
- Example: “Hello, world!” → [“Hello”, “,”, “world”, “!”]
- Pros: Simple, fast.
- Cons: Poor handling of unknown/misspelled words.
💡 Used in: Traditional NLP (before the deep learning era); example below
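For instance, with NLTK (the tokenizer data is a one-time download; newer NLTK versions may name it punkt_tab):

import nltk
nltk.download("punkt")  # one-time download of tokenizer data
from nltk.tokenize import word_tokenize

print(word_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']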
Character-Level Tokenization
- Method: Breaks text into individual characters.
- Example: “Chat” → [“C”, “h”, “a”, “t”]
- Pros: Handles unknown words perfectly.
- Cons: Long sequences, weak semantics.
💡 Used in: Very small or character-sensitive models (e.g., some speech models)
Subword Tokenization (Most Common in LLMs)
a. Byte-Pair Encoding (BPE)
- Method: Starts with characters and merges frequent pairs.
- Example: “lower”, “lowest” → [“low”, “er”], [“low”, “est”]
- Pros: Handles rare words better than word-level.
- Cons: Doesn't consider context.
Used in: the original GPT and early NMT systems (GPT-2, GPT-3, and RoBERTa use the byte-level variant below)
b. Byte-Level BPE
- Method: BPE applied at byte level (i.e., raw UTF-8), not characters.
- Example: “hello” → [“h”, “e”, “l”, “l”, “o”] → merged into tokens like “he”, “llo”
- Pros: Handles all languages & symbols without pre-tokenization.
- Cons: More tokens per input compared to BPE.
Used in: GPT-2, GPT-Neo, RoBERTa
c. WordPiece
- Method: Similar to BPE but uses a greedy likelihood-based merging strategy.
- Example: “unaffordable” → [“un”, “##afford”, “##able”]
- Pros: Better modeling of morphemes.
- Cons: Vocabulary often English-biased.
Used in: BERT, ALBERT, DistilBERT (example below)
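You can see the ## continuation pieces directly with a pretrained BERT tokenizer (sketch; the exact pieces depend on the model's vocabulary):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("unaffordable"))
# Prints subword pieces; the '##' prefix marks a piece that
# continues the previous token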
d. Unigram Language Model (ULM)
- Method: Chooses subwords based on a probabilistic model, not just frequency.
- Example: Picks most probable tokenization among many options.
- Pros: More flexible; allows multiple ways to tokenize.
- Cons: Slightly more complex.
Used in: T5, XLNet, SentencePiece tokenizer
Byte-Level Unicode Tokenization
- Method: Tokenizes input at byte level using Unicode bytes.
- Pros: Universal for all languages, emojis, code, etc.
- Cons: Long token sequences.
Used in: GPT-J, BigScience BLOOM, newer models with multi-language support.
Character + Subword Hybrid
- Mixes character and subword tokens to balance robustness and sequence length.
Used in: Some experimental multilingual or speech models.
Quick Comparison
| Technique | Handles OOV | Language-Agnostic | Compression | Used In |
|---|---|---|---|---|
| Word/Whitespace | ❌ | ❌ | ❌ | NLTK, SpaCy (basic) |
| Char-Level | ✅ | ✅ | ❌ | Speech, OCR |
| BPE | ✅ | ⚠️ (not always) | ✅ | GPT-2, RoBERTa |
| Byte-Level BPE | ✅ | ✅ | ✅ | GPT-2, GPT-J |
| WordPiece | ✅ | ⚠️ | ✅ | BERT, ALBERT |
| Unigram LM | ✅ | ✅ | ✅ | T5, XLNet |
Byte-Level BPE is widely used in GPT models because it's compact, Unicode-friendly, and doesn't need pre-tokenization. Other methods like WordPiece and Unigram LM are common in BERT, T5, etc.
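You can inspect byte-level BPE with the GPT-2 tokenizer (sketch; the Ġ character is how byte-level vocabularies encode a leading space):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("Hello, world!"))   # e.g. ['Hello', ',', 'Ġworld', '!']
print(tok.tokenize("🤖"))              # emoji still tokenize, as byte-level pieces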
Embeddings: How Are They Different from Plain Vectors?
Embeddings are dense vector representations of words, subwords, or tokens, where:
- Similar meanings → similar vectors (semantic proximity)
- Fixed size (e.g., 300-dim or 768-dim) regardless of vocab size
- Can be pretrained or learned during training
Types of Embeddings
| Type | Description | Used In |
|---|---|---|
| Static Word Embeddings | One fixed vector per word | Word2Vec, GloVe, FastText |
| Contextual Embeddings | Varies by sentence context | BERT, GPT, LLaMA, T5 |
| Learned Embeddings | Initialized randomly & trained with model | Most neural models |
One-Hot Encoding vs Embeddings
| Feature | One-Hot | Embeddings |
|---|---|---|
| Vector type | Sparse (mostly zeros) | Dense |
| Semantic similarity | ❌ No | ✅ Yes (similar words → similar vectors) |
| Size | Vocab-size dimensional | Usually 100–1000 dimensions |
| Scalable to large vocab | ❌ Poor | ✅ Excellent |
| LLMs | ❌ Not used | ✅ Core input layer |
Embedding in Deep Learning
When you build a model in TensorFlow or PyTorch, you'll typically have an embedding layer:
tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim)
This maps token IDs to dense vectors before passing them into transformers, RNNs, etc.
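A tiny runnable sketch of what this layer does (the vocabulary size and dimensions are arbitrary):

import tensorflow as tf

emb = tf.keras.layers.Embedding(input_dim=1000, output_dim=64)

token_ids = tf.constant([[3, 41, 7]])  # a batch of one 3-token sequence
vectors = emb(token_ids)
print(vectors.shape)  # (1, 3, 64): one dense 64-dim vector per token ID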
Where Embeddings Fall in the Pipeline
| Step | Process | Embedding Falls Here? |
|---|---|---|
| 1 | Raw Text | ❌ |
| 2 | Text Cleaning | ❌ |
| 3 | Tokenization | Prepares token IDs |
| 4 | Vectorization | ✅ Embeddings are used here |
| 5 | Feature Selection | Rare; embeddings are end-to-end |
| 6 | Modeling | Input to neural models |
| 7 | Evaluation | ❌ |
Below are code snippets for building a basic language model from scratch using Keras (TensorFlow). This example walks you through training a small character-level model on your own text data; a full GPT-style Transformer version follows in the next section.
This is educational-level code, not production scale. A full LLM requires huge datasets and distributed training.
Step-by-Step: Build a Mini LLM with Keras
🔹 Step 1: Install Required Libraries
pip install tensorflow numpy
🔹 Step 2: Prepare the Dataset
For simplicity, we'll use a plain .txt file as training data. You can replace this with a large corpus.
import tensorflow as tf
import numpy as np
# Load text data
with open("your_dataset.txt", "r", encoding="utf-8") as f:
text = f.read()
# Create character-level vocabulary
vocab = sorted(set(text))
char2idx = {u: i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
# Convert to int
text_as_int = np.array([char2idx[c] for c in text])
# Define sequence length
seq_length = 100
examples_per_epoch = len(text) // seq_length
# Create training dataset
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)
def split_input_target(chunk):
input_text = chunk[:-1]
target_text = chunk[1:]
return input_text, target_text
dataset = sequences.map(split_input_target)
# Batch size
BATCH_SIZE = 64
BUFFER_SIZE = 10000
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
🔹 Step 3: Define the Model (Embedding + GRU)
This starter version uses a GRU in place of attention blocks; the full Transformer rewrite appears in the next section.
from tensorflow.keras import layers
# Model config
vocab_size = len(vocab)
embedding_dim = 256
rnn_units = 1024
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
model = tf.keras.Sequential([
layers.Embedding(vocab_size, embedding_dim,
batch_input_shape=[batch_size, None]),
layers.GRU(rnn_units,
return_sequences=True,
stateful=True,
recurrent_initializer='glorot_uniform'),
layers.Dense(vocab_size)
])
return model
model = build_model(vocab_size, embedding_dim, rnn_units, BATCH_SIZE)
model.summary()
🔹 Step 4: Compile and Train
def loss(labels, logits):
return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
model.compile(optimizer='adam', loss=loss)
# Checkpoint callback
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
filepath="./checkpoints/ckpt_{epoch}",
save_weights_only=True
)
EPOCHS = 10
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])
🔹 Step 5: Generate Text
To generate text, we load the trained weights and use a loop to predict one character at a time.
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint('./checkpoints'))
model.build(tf.TensorShape([1, None]))
def generate_text(model, start_string, num_generate=500, temperature=1.0):
input_eval = [char2idx[s] for s in start_string]
input_eval = tf.expand_dims(input_eval, 0)
text_generated = []
model.reset_states()
for _ in range(num_generate):
predictions = model(input_eval)
predictions = predictions[:, -1, :] / temperature
predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()
input_eval = tf.expand_dims([predicted_id], 0)
text_generated.append(idx2char[predicted_id])
return start_string + ''.join(text_generated)
print(generate_text(model, start_string="Once upon a time, "))
Final Thoughts
| What we Just Built | Notes |
|---|---|
| Mini LLM using GRU + Embedding | Can be replaced by full Transformer blocks |
| Character-level generation | Can be upgraded to word/subword level with Tokenizer |
| Trained on local text | For better results, use larger cleaned datasets |
Upgrade Paths
- Use Transformer blocks (MultiHeadAttention, LayerNorm).
- Train on tokenized text using BPE (e.g., sentencepiece).
- Train with TPUs or multi-GPU setups.
- Switch to huggingface/transformers for scalable workflows.
Here's a rewritten version of the mini LLM using full Transformer layers in Keras, inspired by GPT-like architectures (decoder-only Transformer, causal masking, etc.).
Build a GPT-Style Transformer LLM from Scratch Using Keras
This implementation uses:
- Positional Embeddings
- Multi-Head Self-Attention
- Causal Masking
- Feedforward Layers
- Layer Normalization & Residual Connections
Step-by-Step Guide
🔹 Step 1: Install and Import Libraries
pip install tensorflow numpy
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
🔹 Step 2: Prepare Dataset (Character-Level)
# Load your dataset
with open("your_dataset.txt", "r", encoding="utf-8") as f:
text = f.read()
# Build vocabulary
vocab = sorted(set(text))
char2idx = {u: i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
# Encode text
text_as_int = np.array([char2idx[c] for c in text])
# Create input-target pairs
seq_length = 128
examples_per_epoch = len(text_as_int) // seq_length
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)
def split_input_target(chunk):
input_text = chunk[:-1]
target_text = chunk[1:]
return input_text, target_text
dataset = sequences.map(split_input_target)
BATCH_SIZE = 64
BUFFER_SIZE = 10000
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
🔹 Step 3: Define Transformer Components
class PositionalEmbedding(layers.Layer):
def __init__(self, vocab_size, d_model, max_len):
super().__init__()
self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=d_model)
self.pos_emb = layers.Embedding(input_dim=max_len, output_dim=d_model)
def call(self, x):
maxlen = tf.shape(x)[-1]
positions = tf.range(start=0, limit=maxlen, delta=1)
positions = self.pos_emb(positions)
x = self.token_emb(x)
return x + positions
class CausalSelfAttention(layers.Layer):
def __init__(self, d_model, num_heads):
super().__init__()
self.mha = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
self.layernorm = layers.LayerNormalization(epsilon=1e-6)
self.dropout = layers.Dropout(0.1)
    def call(self, x, training=None):  # default lets Keras pass training automatically
attn_output = self.mha(query=x, value=x, key=x, attention_mask=self._causal_mask(tf.shape(x)[1]))
attn_output = self.dropout(attn_output, training=training)
return self.layernorm(x + attn_output)
def _causal_mask(self, size):
i = tf.range(size)[:, None]
j = tf.range(size)
mask = tf.cast(i >= j, dtype=tf.int32)
return mask[None, None, :, :]
class FeedForward(layers.Layer):
def __init__(self, d_model, d_ff):
super().__init__()
self.seq = keras.Sequential([
layers.Dense(d_ff, activation='relu'),
layers.Dense(d_model),
layers.Dropout(0.1)
])
self.layernorm = layers.LayerNormalization(epsilon=1e-6)
    def call(self, x, training=None):
out = self.seq(x, training=training)
return self.layernorm(x + out)
class TransformerBlock(layers.Layer):
def __init__(self, d_model, num_heads, d_ff):
super().__init__()
self.att = CausalSelfAttention(d_model, num_heads)
self.ff = FeedForward(d_model, d_ff)
    def call(self, x, training=None):
x = self.att(x, training=training)
x = self.ff(x, training=training)
        return x
🔹 Step 4: Define the GPT-like Model
def build_gpt_model(vocab_size, seq_len, d_model=256, num_heads=4, d_ff=512, num_layers=4):
inputs = layers.Input(shape=(seq_len,))
x = PositionalEmbedding(vocab_size, d_model, seq_len)(inputs)
for _ in range(num_layers):
x = TransformerBlock(d_model, num_heads, d_ff)(x)
outputs = layers.Dense(vocab_size)(x)
    return keras.Model(inputs=inputs, outputs=outputs)
🔹 Step 5: Compile & Train
model = build_gpt_model(
vocab_size=len(vocab),
seq_len=seq_length,
d_model=256,
num_heads=4,
d_ff=512,
num_layers=4
)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer="adam", loss=loss_fn)
model.summary()
EPOCHS = 10
model.fit(dataset, epochs=EPOCHS)
🔹 Step 6: Generate Text
def generate_text(model, start_string, gen_length=500, temperature=1.0):
    # Keep the growing context as a plain Python list of token IDs
    input_ids = [char2idx[s] for s in start_string]
    generated = []
    for _ in range(gen_length):
        # Pad (or trim) the context to the fixed input length the model expects;
        # pre-padding with 0 reuses token id 0, which is acceptable for a demo
        padded = tf.keras.preprocessing.sequence.pad_sequences(
            [input_ids], maxlen=seq_length, padding='pre')
        predictions = model(tf.convert_to_tensor(padded))[:, -1, :]
        predictions = predictions / temperature
        # int() avoids an int32/int64 mismatch when appending to the context
        predicted_id = int(tf.random.categorical(predictions, num_samples=1)[0, 0])
        input_ids.append(predicted_id)
        generated.append(idx2char[predicted_id])
    return start_string + ''.join(generated)
print(generate_text(model, start_string="Once upon a time, "))
🧩 Summary
| Component | Description |
|---|---|
| Embedding | Token + Positional |
| Transformer | Decoder-only, causal masked |
| Attention | Multi-head self-attention |
| Training Data | Character-based, simple .txt |
| Output | Next-char prediction |
Here's a clean and reusable boilerplate for building a GPT-style Transformer LLM from scratch using Keras. It's structured into modular components and can be easily customized or extended for larger projects.
GPT-Style LLM Boilerplate (Keras / TensorFlow)
llm_gpt_keras/
├── data/
│   └── your_dataset.txt
├── model/
│   ├── transformer_blocks.py
│   └── gpt_model.py
├── train.py
├── generate.py
└── requirements.txt
📄 requirements.txt
tensorflow>=2.10
numpy
📄 model/transformer_blocks.py
import tensorflow as tf
from tensorflow.keras import layers
class PositionalEmbedding(layers.Layer):
def __init__(self, vocab_size, d_model, max_len):
super().__init__()
self.token_emb = layers.Embedding(vocab_size, d_model)
self.pos_emb = layers.Embedding(max_len, d_model)
def call(self, x):
positions = tf.range(start=0, limit=tf.shape(x)[-1], delta=1)
positions = self.pos_emb(positions)
x = self.token_emb(x)
return x + positions
class CausalSelfAttention(layers.Layer):
def __init__(self, d_model, num_heads):
super().__init__()
self.mha = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
self.dropout = layers.Dropout(0.1)
self.layernorm = layers.LayerNormalization(epsilon=1e-6)
    def call(self, x, training=None):
attn = self.mha(query=x, value=x, key=x, attention_mask=self._causal_mask(tf.shape(x)[1]))
x = self.layernorm(x + self.dropout(attn, training=training))
return x
def _causal_mask(self, size):
i = tf.range(size)[:, None]
j = tf.range(size)
mask = tf.cast(i >= j, dtype=tf.int32)
return mask[None, None, :, :]
class FeedForward(layers.Layer):
def __init__(self, d_model, d_ff):
super().__init__()
self.ff = tf.keras.Sequential([
layers.Dense(d_ff, activation='relu'),
layers.Dense(d_model),
layers.Dropout(0.1)
])
self.layernorm = layers.LayerNormalization(epsilon=1e-6)
    def call(self, x, training=None):
return self.layernorm(x + self.ff(x, training=training))
class TransformerBlock(layers.Layer):
def __init__(self, d_model, num_heads, d_ff):
super().__init__()
self.att = CausalSelfAttention(d_model, num_heads)
self.ff = FeedForward(d_model, d_ff)
    def call(self, x, training=None):
x = self.att(x, training)
x = self.ff(x, training)
        return x
📄 model/gpt_model.py
model/gpt_model.py
from tensorflow.keras import layers, Model, Input
from model.transformer_blocks import PositionalEmbedding, TransformerBlock
def build_gpt_model(vocab_size, seq_len, d_model=256, num_heads=4, d_ff=512, num_layers=4):
inputs = Input(shape=(seq_len,))
x = PositionalEmbedding(vocab_size, d_model, seq_len)(inputs)
for _ in range(num_layers):
x = TransformerBlock(d_model, num_heads, d_ff)(x)
outputs = layers.Dense(vocab_size)(x)
    return Model(inputs, outputs)
📄 train.py
train.py
import os
import tensorflow as tf
import numpy as np
from model.gpt_model import build_gpt_model
# Load data
with open("data/your_dataset.txt", "r", encoding="utf-8") as f:
text = f.read()
vocab = sorted(set(text))
char2idx = {u: i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
text_as_int = np.array([char2idx[c] for c in text])
seq_length = 128
examples_per_epoch = len(text_as_int) // (seq_length + 1)
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)
def split_input_target(chunk):
return chunk[:-1], chunk[1:]
dataset = sequences.map(split_input_target).shuffle(10000).batch(64, drop_remainder=True)
# Model setup
model = build_gpt_model(
vocab_size=len(vocab),
seq_len=seq_length,
d_model=256,
num_heads=4,
d_ff=512,
num_layers=4
)
model.compile(optimizer="adam", loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.fit(dataset, epochs=10)
os.makedirs("checkpoints", exist_ok=True)  # make sure the directory exists
model.save_weights("checkpoints/gpt_small.h5")
📄 generate.py
generate.py
import tensorflow as tf
from model.gpt_model import build_gpt_model
import numpy as np
# Load vocabulary
with open("data/your_dataset.txt", "r", encoding="utf-8") as f:
text = f.read()
vocab = sorted(set(text))
char2idx = {u: i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
seq_length = 128
model = build_gpt_model(len(vocab), seq_length)
model.load_weights("checkpoints/gpt_small.h5")
def generate_text(start_string, num_generate=500, temperature=1.0):
    # Keep the growing context as a plain Python list of token IDs
    input_ids = [char2idx[c] for c in start_string]
    generated = []
    for _ in range(num_generate):
        # Pad (or trim) the context to the fixed length the model expects
        padded = tf.keras.preprocessing.sequence.pad_sequences(
            [input_ids], maxlen=seq_length, padding='pre'
        )
        predictions = model(tf.convert_to_tensor(padded))[:, -1, :]
        predictions = predictions / temperature
        # int() avoids an int32/int64 mismatch when appending to the context
        predicted_id = int(tf.random.categorical(predictions, num_samples=1)[0, 0])
        input_ids.append(predicted_id)
        generated.append(idx2char[predicted_id])
    return start_string + ''.join(generated)
print(generate_text("Once upon a time, "))