    Essential AI Engineering Concepts for Software Developers

    AI Engineering Terms

    #AI Engineering

    #What is AI engineering?

    AI engineering is the practice of building applications on top of existing foundation models, adapting them with techniques such as prompting and other customization methods, and deploying them to end users. AI engineers focus on deploying, evaluating, monitoring, and maintaining these AI-powered applications.

    #What is a foundation model?

    A foundation model is a large AI model trained on massive datasets (text, images, videos) that can be adapted for many specific tasks. It’s a general-purpose model that serves as a building block, but it still needs to be tailored for particular use cases.

    #What are LLMs (Large Language Models)?

    LLMs are a type of foundation model trained to predict the next piece of text. They can summarize, answer questions, translate, and write code by generating probable text sequences.

    #What is LLM transformer architecture?

    The transformer is a neural network architecture that allows efficient parallel training, making it possible to train very large models. Transformers let each word in a sentence pay attention to all other words, not just its neighbors, improving understanding of long or complex sentences.

    #What is the LLM attention mechanism?

    Attention is how the transformer model decides which parts of the input matter most. Multiple attention heads can focus on various features simultaneously, like tracking references or tone.
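    As a minimal illustrative sketch (not any particular library’s implementation), scaled dot-product attention can be written in a few lines of NumPy; Q, K, and V below are tiny made-up matrices:

    ```python
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        """Each query row attends over all key rows; the weights then mix V."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)  # similarity of every query to every key
        weights = softmax(scores)        # attention weights sum to 1 per query
        return weights @ V               # weighted mix of the value vectors

    # Toy example: 3 tokens with 4-dimensional representations (random numbers)
    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
    print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
    ```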

    #What are LLM model parameters?

    Model parameters are the internal numbers in the model that are adjusted during training to control its behavior. More parameters generally mean the model can recognize more patterns, but they also make it more costly to store and run.

    #What is LLM temperature?

    Temperature is a sampling setting that acts like a creativity dial (a short sketch follows the list below):

    • Low temperature: more predictable and factual responses.
    • High temperature: more creative and varied, but less reliable for factual answers.
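    A minimal sketch of how temperature is typically applied: the model’s raw next-token scores (logits) are divided by the temperature before being turned into probabilities, so low values sharpen the distribution and high values flatten it. The logits here are invented:

    ```python
    import numpy as np

    def probs_with_temperature(logits, temperature):
        """Divide the logits by temperature, then apply softmax."""
        scaled = np.asarray(logits) / temperature
        e = np.exp(scaled - scaled.max())
        return e / e.sum()

    logits = [2.0, 1.0, 0.5]                    # made-up next-token scores
    print(probs_with_temperature(logits, 0.2))  # low T: nearly deterministic
    print(probs_with_temperature(logits, 1.5))  # high T: closer to uniform
    ```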

    #What are Top-K and Top-P in LLM sampling?

    • Top-K: Limits the model to choose from the K most likely next tokens.
    • Top-P: Expands the pool of candidate tokens until their cumulative probability reaches a threshold (e.g., 90%), offering a balance between consistency and creativity. Both filters are sketched below.
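    A hedged sketch of both filters over a toy probability distribution (the probabilities are invented for illustration):

    ```python
    import numpy as np

    def top_k_filter(probs, k):
        """Keep only the k most likely tokens, then renormalize."""
        keep = np.argsort(probs)[-k:]
        out = np.zeros_like(probs)
        out[keep] = probs[keep]
        return out / out.sum()

    def top_p_filter(probs, p):
        """Keep the smallest set of tokens whose cumulative probability >= p."""
        order = np.argsort(probs)[::-1]          # most likely first
        cum = np.cumsum(probs[order])
        cutoff = np.searchsorted(cum, p) + 1     # number of tokens to keep
        out = np.zeros_like(probs)
        out[order[:cutoff]] = probs[order[:cutoff]]
        return out / out.sum()

    probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])  # made-up token probabilities
    print(top_k_filter(probs, 2))    # only the 2 most likely tokens survive
    print(top_p_filter(probs, 0.8))  # tokens until cumulative prob reaches 80%
    ```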

    #What is a token?

    A token is a piece of text: a word, part of a word, or a punctuation mark. Models process language one token at a time, not one word at a time.
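    For example, using OpenAI’s open-source tiktoken tokenizer (one of many tokenizers; exact splits vary by model):

    ```python
    import tiktoken  # OpenAI's open-source tokenizer library (pip install tiktoken)

    enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by several OpenAI models

    text = "Tokenization isn't one word per token!"
    ids = enc.encode(text)                    # text -> list of integer token ids
    pieces = [enc.decode([i]) for i in ids]   # decode each id to see the split

    print(len(ids), "tokens:", pieces)        # note sub-word pieces and punctuation
    ```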

    #What is model context?

    Model context refers to how much information (in tokens) the model can “remember” or process at once. It includes conversation history, prompts, documents, and the response being generated. Context has a strict token limit.

    #What is prompt engineering?

    Prompt engineering is the art of crafting effective instructions for the model, specifying roles, format, and rules to get more consistent and desired outputs.

    #What is the difference between system vs user prompts?

    • System prompt: Sets the model’s overall behavior for the session (“house rules”).
    • User prompt: The specific instruction or question at the moment (see the sketch after this list).
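    As an illustration, most chat-style APIs accept a list of role-tagged messages; the shape below follows the widely copied OpenAI-style convention, with made-up content:

    ```python
    # Role-tagged chat messages (illustrative; this shape follows the widely
    # copied OpenAI-style convention, but check your provider's docs)
    messages = [
        {"role": "system", "content": "You are a concise assistant. Answer in one sentence."},
        {"role": "user", "content": "What does HTTP status code 404 mean?"},
    ]
    ```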

    #What is zero shot learning?

    Zero-shot learning is when you ask the model to perform a task without giving any examples—just an instruction, and the model figures it out on its own.
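    For example, a zero-shot prompt is just the instruction itself, with no demonstrations:

    ```python
    # Zero-shot: the instruction alone, with no worked examples of the task
    prompt = (
        "Classify the sentiment of this review as positive or negative: "
        "'The battery died after two days.'"
    )
    ```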

    #What is fine tuning and PEFT?

    • Fine-tuning: Retrains a model on your custom examples so it permanently behaves in a desired way; it modifies the model’s internal parameters for specialization.
    • PEFT (Parameter-Efficient Fine-Tuning): Updates only a small part of the model (such as added adapter layers) rather than the entire model, making fine-tuning easier, faster, and cheaper (sketched below).
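    A minimal NumPy sketch of one popular PEFT idea, LoRA-style low-rank adapters: the pretrained weight matrix W stays frozen, and only two small matrices A and B are trained (all sizes and numbers here are made up):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    d = 1024                     # layer width (made-up size)
    r = 8                        # adapter rank; r << d keeps trainable params tiny

    W = rng.normal(size=(d, d))         # frozen pretrained weights (never updated)
    A = rng.normal(size=(r, d)) * 0.01  # small trainable adapter matrix
    B = np.zeros((d, r))                # starts at zero, so initially W_eff == W

    def adapted_layer(x):
        # Effective weights = frozen W plus the low-rank update B @ A;
        # training adjusts only A and B
        return x @ (W + B @ A).T

    print(f"trainable fraction: {(A.size + B.size) / W.size:.4%}")  # ~1.56%
    ```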

    #What is quantization and distillation?

    • Quantization: Reduces the size of the model by storing numbers with fewer bits, making the model smaller and faster with minimal loss in quality (see the sketch after this list).
    • Distillation: Trains a smaller “student” model to replicate a larger “teacher” model, yielding a quicker, lighter model that retains much of the original’s knowledge.
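    A hedged sketch of the core idea behind symmetric int8 quantization, applied to made-up weights:

    ```python
    import numpy as np

    w = np.random.default_rng(0).normal(size=1000).astype(np.float32)  # fake weights

    # Symmetric int8 quantization: map [-max|w|, +max|w|] onto [-127, 127]
    scale = np.abs(w).max() / 127.0
    w_int8 = np.round(w / scale).astype(np.int8)    # 1 byte per weight instead of 4
    w_restored = w_int8.astype(np.float32) * scale  # dequantize for use in math

    print("max rounding error:", np.abs(w - w_restored).max())
    print("size ratio:", w_int8.nbytes / w.nbytes)  # 0.25, i.e., 4x smaller
    ```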

    #What is preference fine tuning?

    This process uses human feedback to indicate which model responses are preferred, so the model learns to give outputs people like (e.g., making it more helpful, safe, and polite).

    #What is RAG?

    RAG (Retrieval Augmented Generation): The model fetches relevant external documents or data to answer questions, keeping responses up-to-date and more accurate without retraining the whole model.
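    A high-level sketch of the RAG flow; embed(), vector_db.search(), and llm() are hypothetical stand-ins for whatever embedding model, vector store, and LLM you actually use:

    ```python
    def answer_with_rag(question, vector_db, embed, llm, k=3):
        """Hypothetical RAG loop: retrieve relevant chunks, then generate."""
        query_vector = embed(question)                    # embed the user's question
        chunks = vector_db.search(query_vector, top_k=k)  # nearest stored chunks

        context = "\n\n".join(chunks)
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
        return llm(prompt)                                # model answers from context
    ```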

    #What are embeddings and vector DB?

    • Embeddings: Represent text as lists of numbers where similar meanings are numerically close (sketched after this list).
    • Vector DB: Stores these embeddings and quickly finds the most semantically similar content to a given query.
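    A small sketch of the “numerically close” idea using cosine similarity; the three-dimensional vectors below are invented (real embeddings have hundreds or thousands of dimensions):

    ```python
    import numpy as np

    def cosine_similarity(a, b):
        """Near 1.0 = same direction (similar meaning), near 0 = unrelated."""
        a, b = np.asarray(a), np.asarray(b)
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    cat    = [0.90, 0.10, 0.00]  # invented 3-d "embeddings" for illustration
    kitten = [0.85, 0.15, 0.05]
    car    = [0.00, 0.20, 0.95]

    print(cosine_similarity(cat, kitten))  # high: related meanings
    print(cosine_similarity(cat, car))     # low: unrelated meanings
    ```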

    #What is chunking?

    Chunking divides large documents into smaller pieces (chunks) before storage and search, making retrieval manageable and optimizing the relevance of returned information.
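    A minimal sketch of fixed-size chunking with overlap (production pipelines often split on sentence or section boundaries instead):

    ```python
    def chunk_text(text, chunk_size=500, overlap=50):
        """Split text into overlapping fixed-size chunks (character-based)."""
        step = chunk_size - overlap  # overlap keeps context across boundaries
        return [text[i:i + chunk_size] for i in range(0, len(text), step)]

    doc = "word " * 400  # stand-in for a long document
    chunks = chunk_text(doc)
    print(len(chunks), "chunks;", len(chunks[0]), "chars in the first")
    ```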

    #What are encoders and decoders?

    • Encoder: Converts text into embeddings (numeric summaries that capture meaning).
    • Decoder: Converts those summaries back into human-readable text. Some models do only encoding, others only decoding, and some do both.

    #What are agents?

    Agents are AI assistants that can plan multiple steps and take actions to reach a goal (e.g., searching the web, running code, sending emails). They use external tools and can reason through tasks.
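    A deliberately simplified sketch of the agent loop; llm_decide() and the tool functions are hypothetical stand-ins:

    ```python
    def run_agent(goal, tools, llm_decide, max_steps=5):
        """Hypothetical agent loop: the model picks a tool, sees the result, repeats."""
        history = [f"Goal: {goal}"]
        for _ in range(max_steps):
            action = llm_decide(history)  # model plans the next step
            if action["tool"] == "finish":
                return action["answer"]   # model decides the goal is met
            result = tools[action["tool"]](action["input"])  # run the chosen tool
            history.append(f"{action['tool']} -> {result}")  # feed result back
        return "Stopped: step limit reached."
    ```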

    #What is inference and the difference between online vs batch?

    • Inference: Running the trained model to get predictions/outputs (generates one token at a time).
    • Online inference: Real-time, for live users (fast responses).
    • Batch inference: Processes many items at once, offline (higher throughput, lower cost).

    #What are model benchmarks?

    Benchmarks are standardized tests used to compare different models’ skills and capabilities across math, coding, reading, safety, etc.

    #What is perplexity?

    Perplexity measures how “surprised” a model is by text it hasn’t seen before. Lower perplexity means better prediction (less confusion).
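    As a worked sketch, perplexity is the exponential of the average negative log-probability the model assigned to each actual next token (the probabilities below are invented):

    ```python
    import math

    # Made-up probabilities the model assigned to each actual next token
    token_probs = [0.5, 0.25, 0.8, 0.1]

    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    perplexity = math.exp(avg_neg_log)

    print(perplexity)  # ~3.16: roughly as "confused" as choosing among ~3 options
    ```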

    #What is BLEU?

    BLEU (Bilingual Evaluation Understudy) is a metric for evaluating translation and summarization quality by comparing model outputs to reference answers and counting matching words and phrases (a simplified sketch of both BLEU and ROUGE follows the ROUGE entry below).

    #What is ROUGE?

    ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is another metric, focused on recall—how much important content from the reference was captured in the model’s summary or translation.
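    A deliberately simplified unigram sketch of the two ideas (real BLEU combines several n-gram sizes with a brevity penalty, and ROUGE has multiple variants):

    ```python
    def unigram_overlap(candidate, reference):
        """Clipped unigram matches, scored as precision and recall."""
        cand, ref = candidate.split(), reference.split()
        matches = sum(min(cand.count(w), ref.count(w)) for w in set(cand))
        precision = matches / len(cand)  # BLEU-flavored: matched / candidate length
        recall = matches / len(ref)      # ROUGE-flavored: matched / reference length
        return precision, recall

    p, r = unigram_overlap("the cat sat on the mat", "the cat is on the mat")
    print(f"precision={p:.2f} recall={r:.2f}")
    ```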

    #What is MCP?

    MCP (Model Context Protocol): A standard interface so models, apps, and tools can easily connect and work together, simplifying integrations and ensuring compatibility.

    Gopibabu Srungavarapu

    Gopibabu is a Product Engineer focusing on web application development. He enjoys exploring AI, PHP, JavaScript, Cloud, SQL, and ensuring application stability through robust testing.
