ℹ️ Generation Settings

Temperature controls the “creativity” or randomness of output. Higher values (e.g., 0.7) produce more diverse results. Lower values (e.g., 0.2) produce focused, deterministic results.

Top_p is nucleus sampling. GPT considers tokens whose cumulative probability mass ≤ top_p.

Use Case	Temp	Top_p
Code Generation	0.2	0.1
Creative Writing	0.7	0.8
Chatbot Responses	0.5	0.5
Code Comment Gen	0.3	0.2
Data Analysis	0.2	0.1
Exploratory Coding	0.6	0.7

Roles

system: defines how the assistant should act. Prioritized ahead of user messages.
user: instructions from the end user. Main task input.
assistant: used only for model-generated messages.

From Prompting to Context Engineering

🧑‍💻 Prompt Engineering + ✅ Evals

Prompt Engineering is about designing prompts that guide the model's behavior.
Evals are about measuring how well those prompts or models perform.

You can't do serious LLM work without connecting the two.
Prompt Engineering is hypothesis.
Evals are evidence.

You guess what will work through prompt engineering.
You prove what works through evals.

Without evals:

Prompt engineering becomes trial-and-error without feedback.
You don't know if your changes helped or hurt performance.

Recommended Course

If you want to deepen your understanding of AI Evals, I highly recommend Hamel Husain's AI Evals For Engineers & PMs course. This hands-on 4-week cohort course covers practical approaches for improving AI applications, evaluation methods, and error analysis.

Learn from industry experts and join a strong community to build better, reliable AI systems.

Summary Table

Eval Type	What it Measures	Example Use Case
Ground Truth	Exact correctness	Math, QA, MCQ
Behavioral	Safety, tone, ethics	Refusing bad requests
Grading LLM Judge	Subjective quality, reasoning	Writing, summaries
Preference	Which of two responses is better	Fine-tuning, A/B prompt testing
Function Call	Structure + tool readiness	GPT-4 function calling
RAG Evaluations	Faithfulness to context	Document QA, search synthesis
Latency/Cost	Performance metrics	Model choice, deployment
Custom Task Evals	Anything domain-specific	Consistency, format checks

🔒 Study Real Jailbreak Cases — Prompt Injection & Guardrails

Study how jailbreaks work in theory: how models interpret roleplay, context switching, or indirect language.
Read research papers & red-teaming case studies from:
- OpenAI
- Anthropic
- Stanford CRFM
- Alignment Research Center
Look for: “Jailbreak ChatGPT” studies, “Prompt injection attacks”, “Red teaming LLMs”
Explore how to prevent jailbreaks in assistants and RAG systems.
Learn about prompt injection attacks in tools - LangChain, ChatGPT plugins, etc..

🧠 Try:
“Explain how prompt injections exploit model context windows and examples of mitigation strategies.”