CS146S 第一周笔记总结：Introduction to Coding LLMs and AI Development

Stanford University · Fall 2025 · 讲师：Mihail Eric 课程网站：themodernsoftware.dev

📅 课程安排

日期	主题	内容
Mon 9/22	Introduction and How LLMs are Made	课程介绍 + LLM 原理
Fri 9/26	Power Prompting for LLMs	Prompt Engineering 技术详解

第一讲：课程介绍 & LLM 是如何制造的（9/22）

1. 课程定位与核心理念

这不是一门 "Vibe Coding" 课程。 Vibe coding（YOLO 式地盲目接受 AI 生成的代码）不足以构建真正好的软件。距离 vibe coding 能够可靠地构建生产级软件，可能还需要 2-10+ 年。

这门课的目标是让有经验的工程师真正成为 10x 工程师。

行业现状（2025）：

坏消息：软件工程行业正在被大幅重塑，CS 专业入学人数下降了 20%
好消息：软件开发者的生产力有潜力达到历史最高水平。借助 AI 编程，工程师可以以前所未有的速度掌握新的技术栈和工具

核心观点：

"You won't be replaced by AI. You'll be replaced by a competent engineer who knows how to use AI." （你不会被 AI 取代，你会被一个懂得使用 AI 的工程师取代。）

如果你唯一的价值是知道如何从 StackOverflow 复制粘贴，那你会被 AI 取代
如果你能进行系统性思维、理解业务上下文、设计健壮的架构和抽象，AI 会极大地提升你的生产力

📚 延伸阅读： OpenAI Codex 文章（阅读 5）用大量实例验证了这一理念——OpenAI 工程师不是被 Codex 取代，而是通过 Codex 实现 "一天开会还能合并 4 个 PR" 的生产力飞跃。

2. 核心理念（The Takeaway）

Human-Agent Engineering（人-智能体协同工程）：

不是 vibe coding，而是学会管理 AI agent
聚焦于那些 AI 系统尚未取代的技能：业务理解、架构设计、成为 Tech Lead
如今的"坏代码"不仅是架构上的错误决策，还包括由 LLM 幻觉导致的功能性错误软件

LLMs 的能力取决于你自身的水平：

好的上下文（context）→ 好的代码
如果你自己都无法理解你的代码库，LLM 也无法理解

📚 延伸阅读： Anthropic 视频（阅读 4）中也强调了这一点——"talking to a model is a lot like talking to a person"。你的沟通能力直接决定了 LLM 的输出质量。OpenAI Codex 文章中维护 AGENTS.md 的实践也是同一逻辑：你对代码库理解越深，提供给 AI 的上下文越好，输出质量越高。

要大量阅读和审查代码：

学会辨别好代码和坏代码
培养"好品味"（good taste）

积极实验：

目前还没有成熟的软件模式，所有人都在摸索
工具会过时，找到适合你自己的工作流

3. LLM 工作原理（5 张幻灯片精华版）

基础概念

LLM（大语言模型）是自回归模型，用于下一个 token 预测
处理流程：
1. Tokenization（分词）：使用固定词表将输入文本分割为 token
2. Embedding（嵌入）：将 token 转换为固定维度的数值向量（~1-3K 维）
3. Transformer 层（12-96+ 层）：使用 Self-Attention 机制（Vaswani et al., 2017）学习词与词之间的语法和语义关系
4. 输出：得到最可能的下一个 token 的概率分布

📚 延伸阅读： Karpathy 视频（阅读 1）用 3.5 小时详细讲解了完整流水线——从 FineWeb 数据集的 12 亿网页爬取、BPE 分词，到 GPT-2 vs LLaMA 3 的规模差异。课堂 5 张幻灯片是精华版，Karpathy 视频是完整版，建议结合观看。

训练流程（三阶段）

阶段	名称	方法	数据规模
Stage 1	自监督预训练	在大量公开数据上学习语言概念	千亿到万亿+ tokens
Stage 2	监督微调（SFT）	用高质量的 prompt-response 对教模型遵循指令	数万到数十万对
Stage 3	偏好调优（RLHF）	对齐模型输出与人类偏好（有用性、正确性、可读性）	数万到数十万对人类标注比较

训练数据来源举例： Common Crawl、Wikipedia、StackExchange、公开 GitHub 仓库等。英文维基百科约 30 亿 tokens，作为参考。

推理模型（Reasoning Models）：

使用 Chain-of-Thought 推理训练扩展
集成工具使用能力
通过强化学习学习如何评估推理过程、回溯等
通常模型名称带有 "-think"

📚 延伸阅读： Karpathy（阅读 1）用 AlphaGo "Move 37" 做类比——就像 AlphaGo 通过 RL 发现了超越人类棋手的策略，LLM 也能通过 RL 发展出新颖的推理技术。DeepSeek-R1 的 "aha moment" 就是这类涌现行为的实例。

模型规模参考： GPT-3/Claude 3.5 Sonnet: ~175B 参数；LLaMA 3.1: 405B；GPT-4: ~1.8T（传闻）

实际使用中的优势与局限

优势： 专家级代码补全、代码理解、代码修复

局限：

幻觉（Hallucinations）：生成不存在或过时的 API（可通过良好的上下文工程缓解）
上下文窗口限制：~100-200K tokens，但存在 primacy/recency bias 和 "lost-in-the-middle" 效应
延迟：每个请求数秒到数分钟（需要规划好任务委派，合理进行人类上下文切换）
成本：最佳模型输入约 $1-3/百万 tokens，输出 $10+/百万 tokens（但成本每年约下降 10 倍）

📚 延伸阅读： 关于幻觉，Karpathy（阅读 1）从训练层面解释了根本原因——标注者以自信语气回答导致模型也"自信地不知道"；课堂第二讲则从实用层面给出缓解手段（Tool Use、RAG、Self-Consistency）。两者形成完整的"原因→解决方案"链条。

第二讲：LLM 强力 Prompting 技术（9/26）

1. Prompting 背景

Prompt 是与 LLM 沟通的通用语言，也是对 LLM 进行编程的方式
在编程语言的演化中，prompting 是下一个阶段
Prompting 既是艺术也是科学：LLM 的黑箱本质意味着有效的 "LLM whispering" 需要一些技巧，但也存在经过实证验证的技术

📚 延伸阅读： Google Cloud（阅读 2）和 promptingguide.ai（阅读 3）分别从入门和进阶角度系统化了 prompting 技术。Anthropic 视频（阅读 4）则强调了实践层面的哲学——prompt engineering 本质是"与模型沟通"。这三份材料构成了课堂技术讲解的理论背景。

2. 七大核心 Prompting 技术

① Zero-Shot Prompting（零样本提示）

直接要求 LLM 完成任务，不给示例、不给支持
适合 LLM 已经熟悉的常见任务

💡 实战示例：

# 基础 zero-shot
Write a Python function that validates email addresses using regex.

# 更好的 zero-shot（加入明确约束）
Write a Python function called `validate_email` that:
- Takes a single string parameter
- Returns True if the string is a valid email, False otherwise
- Uses the `re` module
- Handles edge cases like missing @ or domain

📚 阅读关联： Google Cloud（阅读 2）指出 zero-shot 最适合直接、明确的任务。Anthropic 工程师（阅读 4）补充说，即使是 zero-shot，也要像"对一个没有上下文的人说话"一样清晰。

② K-Shot Prompting（少样本提示 / In-Context Learning）

给 LLM 提供一些如何完成任务的示例
常用 k = 1, 3, 5
适用场景： 特定领域的 API、LLM 较少见过的语言、需要特定代码风格/命名规范
避免： 通用编码任务、过度约束 LLM

💡 实战示例：

# 场景：让 LLM 遵循团队的 API 响应格式

Convert these API endpoints to match our standard response format.

<example>
Input: GET /users → returns { data: [...] }
Output: GET /users → returns { status: "ok", payload: { users: [...] }, metadata: { count: N } }
</example>

<example>
Input: GET /orders → returns { orders: [...] }
Output: GET /orders → returns { status: "ok", payload: { orders: [...] }, metadata: { count: N } }
</example>

Now convert: POST /products → returns { id: "123", name: "Widget" }

# 场景：特定命名规范的代码生成（课堂原始示例）

Write a for-loop iterating over a list of strings using the naming convention in our repo.
Here are some examples of how we typically format variable names.
<example>var StRaRrAy = ['cat', 'dog', 'wombat']</example>
<example>def func CaPiTaLiZeStR = () => {}</example>

📚 阅读关联： Google Cloud（阅读 2）的 Few-Shot 策略强调"展示期望的风格、语调和详细程度"。OpenAI Codex 文章（阅读 5）中的最佳实践 "Implement this the same way as [module X]" 本质上就是 K-Shot——用已有代码作为示例引导生成。

③ Chain-of-Thought (CoT) Prompting（思维链提示）

展示或要求推理步骤
Multi-Shot CoT： 提供带推理过程的示例
Zero-Shot CoT： 简单加一句 "Let's think step-by-step"
也可要求在 <reasoning> 标签中显式推理
适用场景： 需要多步逻辑的编程和数学任务
这是很多推理模型的核心技术

💡 实战示例：

# Zero-Shot CoT
This function is supposed to find the longest palindromic substring,
but it returns wrong results for "cbbd". Debug it step by step.

def longestPalindrome(s):
    # ... buggy code ...

Think through the logic step by step:
1. What should the expected output be?
2. Trace through the code with input "cbbd"
3. Where does the logic break?
4. Provide the fix.

# Multi-Shot CoT（提供带推理的示例）
Given a SQL query, analyze its time complexity.

<example>
Query: SELECT * FROM users WHERE id = 5
Reasoning: This is a point lookup on the primary key. With a B-tree index,
this is O(log n). Without index, it's O(n) full table scan.
Complexity: O(log n) with index, O(n) without
</example>

Now analyze: SELECT * FROM orders o JOIN users u ON o.user_id = u.id WHERE u.country = 'US'

# 使用 XML 标签强制推理（课堂推荐格式）
Review this React component for potential performance issues.
Before giving your answer, show your reasoning in <reasoning> tags.

<reasoning>
[Model will think through the component's render behavior, 
state management, and potential re-render triggers here]
</reasoning>

Then provide your recommendations.

📚 阅读关联： Karpathy（阅读 1）从理论层面解释了 CoT 为何有效——LLM 的推理分布在多个 token 上，每个 token 的计算预算是固定的，因此"让模型多说话"相当于给它更多计算预算。Anthropic（阅读 4）从实践角度补充：要求模型先解释再回答可以显著提高准确率。promptingguide.ai（阅读 3）进一步扩展了 CoT 到 Tree of Thoughts——并行探索多条推理路径。

④ Self-Consistency Prompting（自一致性提示）

多次采样输出（通常结合 CoT），取最常见的结果
通过对多样化推理路径的模型集成，减少幻觉和错误答案
实质是一种**模型集成（ensembling）**方法

💡 实战示例：

# 概念：对同一问题多次采样，取多数答案
# 在实践中，可以手动多次提交相同 prompt 并比较结果

# 也可以在 prompt 中要求模型自行尝试多种方法：
Solve this concurrency bug using three different approaches.
For each approach:
1. Explain the reasoning
2. Provide the fix
3. Note potential tradeoffs

Then recommend which approach is most robust and why.

📚 阅读关联： OpenAI Codex 文章（阅读 5）的 "Best-of-N" 实践就是 Self-Consistency 的工业级应用——同时生成多个响应，选最好的一个或组合多个优点。promptingguide.ai（阅读 3）将其定义为"通过多样化推理路径的多数投票来减少错误"。

⑤ Tool Use（工具使用）

允许 LLM 调用外部系统
减少幻觉、增强 LLM 自主能力最重要的技术之一

💡 实战示例：

# 课堂原始示例（提供测试工具）
Fix the IndexError in src/parser.py. Ensure the CI tests still pass.
Here are the available tools:
<tools>
pytest -s tests/unit/test_parser.py
pytest -v tests/integration/
python -m mypy src/parser.py
</tools>

# 更复杂的场景（多工具协作）
Investigate why the /api/orders endpoint returns 500 errors intermittently.
<tools>
curl -X GET http://localhost:8080/api/orders     # 测试 endpoint
docker logs app-server --tail 100                 # 查看服务器日志
psql -c "SELECT count(*) FROM orders WHERE created_at > now() - interval '1 hour'"  # 检查数据库
</tools>
Use these tools to diagnose the issue, then propose a fix.

📚 阅读关联： Karpathy（阅读 1）从训练层面解释了 Tool Use 的原理——模型学会在输出中生成特殊 token 来触发工具调用。promptingguide.ai（阅读 3）的 ReAct 框架（Thought → Action → Observation）就是 Tool Use 的结构化实现。OpenAI Codex（阅读 5）的"迭代改进开发环境"实践——设置启动脚本、环境变量——本质上是在为 AI agent 提供更好的工具集。

⑥ Retrieval Augmented Generation (RAG)（检索增强生成）

为 LLM 注入上下文数据
保持 LLM 信息更新（无需重训练），迭代更快
免费获得可解释性和引用
减少幻觉
在 Cursor/Windsurf 等工具中使用 @context 就是在利用 RAG

💡 实战示例：

# 在 Cursor/Windsurf 中使用 RAG（@context 引用）
@src/auth/middleware.ts @src/auth/types.ts
Add rate limiting to our authentication middleware.
Follow the existing error handling patterns in the codebase.

# 手动模拟 RAG（在 prompt 中嵌入检索到的上下文）
Here is our current database schema:
<context>
CREATE TABLE users (id SERIAL PRIMARY KEY, email VARCHAR UNIQUE, role VARCHAR);
CREATE TABLE orders (id SERIAL PRIMARY KEY, user_id INT REFERENCES users(id), total DECIMAL);
CREATE INDEX idx_orders_user ON orders(user_id);
</context>

Write a query to find users who placed more than 5 orders last month,
optimized for our existing indexes.

📚 阅读关联： 这是所有阅读材料的交叉重点。Karpathy（阅读 1）解释了为什么 RAG 必要——模型的知识在预训练时就固定了，RAG 让模型能获取训练后的新信息。promptingguide.ai（阅读 3）将 RAG 的优势总结为：时效性、可解释性、减少幻觉。OpenAI Codex（阅读 5）的 AGENTS.md 本质上是 RAG 的一种形式——为 AI 提供无法从代码直接推断的业务知识。

⑦ Reflexion（反思）

让 LLM 反思自己的输出
环境信号的反馈被重新融入上下文
多轮 prompting： 第一轮给出初步答案 → 第二轮要求反思和修正
这是自主编码 agent 的核心机制，也叫 "self-critique"
现代编码 IDE 中实现完全 agentic 行为的关键

💡 实战示例：

# 第一轮：生成初始代码
Write a function to merge two sorted linked lists.

# 第二轮：要求反思（课堂推荐的 Reflexion 模式）
Now critique your solution:
- Does it handle edge cases (empty lists, single element)?
- What is the time and space complexity?
- Is there a more elegant approach?
If you find issues, provide an improved version.

# 在 system prompt 中内置反思机制
<system>
After providing any code solution, always:
1. List potential edge cases you might have missed
2. Rate your confidence (1-10) in the solution's correctness
3. If confidence < 8, revise your solution
</system>

# 多轮对话中的 Reflexion 实战
User: Write a Redis caching layer for our user service.
Assistant: [generates initial implementation]
User: The tests are failing with this error:
      <error>ConnectionResetError: Redis connection pool exhausted</error>
      Reflect on what went wrong and fix it.
Assistant: [analyzes error, identifies missing connection pool config, provides fix]

📚 阅读关联： Anthropic（阅读 4）的"当模型犯错时，问它为什么错了"就是 Reflexion 的手动版本。promptingguide.ai（阅读 3）将 Reflexion 形式化为 Observe → Reflect → Extend Prompt 循环。OpenAI Codex（阅读 5）的 "Ask Mode → Code Mode" 两步流程也体现了 Reflexion 思想——先让 AI 思考计划，再执行代码。

3. 重要术语

术语	定义
System Prompt	提供给 LLM 的第一条消息（通常用户不可见），定义角色、规则、输出风格
User Prompt	用户的实际请求或指令
Assistant	LLM 实际生成的回复

💡 System Prompt 实战示例：

<system>
You are a senior backend engineer specializing in Python and PostgreSQL.
When writing code:
- Always include type hints
- Follow PEP 8 strictly
- Prefer SQLAlchemy ORM over raw SQL
- Include docstrings for all public functions

When reviewing code:
- Focus on security vulnerabilities first
- Then performance, then readability
- Always explain WHY something is an issue, not just WHAT

If you are unsure about something, say "I'm not certain" rather than guessing.
Output format: Use <code> tags for code blocks and <explanation> for reasoning.
</system>

📚 阅读关联： Anthropic（阅读 4）强调 system prompt 中应"积极使用 Role Prompting"，且要像代码一样进行版本控制。OpenAI Codex（阅读 5）的 AGENTS.md 本质上就是一个持久化的 system prompt——提供命名规范、业务逻辑等 AI 需要的背景信息。

4. Best Practices（最佳实践）

使用 prompt improvement 工具（如 Anthropic 的 prompt improver）
清晰提示：把 prompt 给一个没有上下文的人看——如果他们困惑，LLM 也会困惑
积极使用 Role Prompting：通过 system prompt 赋予角色，使输出更强大
结构化格式：使用 XML 标签（如 <log>, <e> 等）组织 prompt 结构
明确指定你想要什么：语言、技术栈、库、约束条件
任务分解（Decompose Tasks）：把大任务拆分成小步骤（后续课程会深入讨论）

💡 Best Practices 综合示例——好的 Prompt vs 坏的 Prompt：

# ❌ 模糊、缺乏上下文的 prompt
Fix my code, it's not working.

# ✅ 清晰、结构化、有上下文的 prompt
<context>
Language: TypeScript
Framework: Express.js + Prisma ORM
Issue: The `createOrder` endpoint returns 400 for valid requests
</context>

<code>
// src/routes/orders.ts
async function createOrder(req: Request, res: Response) {
  const { userId, items } = req.body;
  // ... code here ...
}
</code>

<error>
POST /api/orders with body {"userId": 1, "items": [{"id": 5, "qty": 2}]}
Response: 400 Bad Request - "Invalid items format"
</error>

<task>
1. Identify why valid requests are being rejected
2. Fix the validation logic
3. Add a test case that covers this scenario
</task>

📚 延伸阅读： Anthropic（阅读 4）的六条实践建议，以及 Google Cloud（阅读 2）的六大优化策略，都是对这些 best practices 的更详细展开。OpenAI Codex（阅读 5）的"像写 GitHub Issue 一样写 Prompt"是这些原则在工业环境中的最佳实践体现。

📚 第一周推荐阅读

#	材料	与课堂的关系
1	Deep Dive into LLMs (Karpathy)	第一讲 LLM 原理的完整版，详细覆盖训练流水线和 LLM "心理学"
2	Prompt Engineering Overview (Google Cloud)	第二讲 prompting 技术的入门补充，六大优化策略对应课堂 best practices
3	Prompt Engineering Guide	第二讲技术的进阶扩展，覆盖课堂未详述的 ReAct、ToT、Prompt Chaining
4	AI Prompt Engineering: A Deep Dive (Anthropic)	第二讲 best practices 的工程师视角，强调实战中的迭代和沟通
5	How OpenAI Uses Codex	课程核心理念 Human-Agent Engineering 的真实案例集，展示 AI 辅助工程的实际工作流

🛠️ 第一周作业：LLM Prompting Playground

仓库：github.com/mihail911/modern-software-dev-assignments/tree/master/week1
使用 Ollama 本地运行模型（mistral-nemo:12b, llama3.1:8b）
实践多种 prompting 技术
设计并迭代 prompt 直到测试通过
重在实验和探索，而非"标准答案"