利用Ollama部署Llama 3/deepseek-r1模型，只需5行代码即可实现对话

本文介绍了如何用ollama快速在本地部署大语言模型以及模型相关的可调参数含义，并以部署Llama 3和deepseek-r1的代码为例进行介绍

Lins号丹

3565人浏览 · 2025-01-29 14:23:54

Lins号丹 · 2025-01-29 14:23:54 发布

文章目录

1. 前言
2. 通过Ollama在本地运行Llama 3和deepseek-r1
3. 通过ollama的python api与大模型对话
4. 部分LLM参数

1. 前言

尽管目前开源的大语言模型很多，但是许多人想在电脑上部署，仍需要克服许多困难，例如需要掌握相关的专业知识、配备昂贵的硬件设备（GPUs）、以及了解相关的系统架构知识。

可能有人会说，那些强大的在在线模型，例如现在不用注册帐号也能使用的 GPT-4o mini 以及目前国内第一梯队的阿里的通义千问，使用这些模型都需要依赖外部的服务器，许多公司常常因为数据安全问题（或者成本问题）而不选择使用这些模型，而对于个人而言，我们很难有能力自己训练大模型，能做的只能是通过提示工程一点点训练模型参数，如果能在本地部署自己的模型，那么这些训练好的参数将会一直保留，形成一个越来越满足自己需求的大语言模型。

本文将介绍的 Ollama 是一个能在本地运行LLM模型（包括 Llama 3/以及最近很火的deepseek-r1）的工具。它允许用户在自己的计算机上部署和运行 Llama 模型，无需依赖云服务。通过 Ollama，用户可以使用这些开源的强大的语言模型来进行各种NLP任务，例如文本生成、对话机器人等，由于是部署在本地服务器，因此无需担心隐私泄漏或需要依赖于外部服务器。

总的而言，Llama 3/deepseek-r1 是大模型本身，专注于提供强大的语言理解和生成能力，这些开源大模型都能够在公开渠道下载（llama 3、deepseek-r1）。而Ollama是一个运行和管理大语言模型的工具，用以帮助用户在本地机器上方便地使用这些模型。

在这里插入图片描述

当然，开源大模型的推理和生成功能总是落后于闭源的商业模型，但总体对于许多公司而言，使用开源模型总是利大于弊的。

2. 通过Ollama在本地运行Llama 3和deepseek-r1

通过Ollama工具，可以很简单地在本地机器上运行开源的大语言模型。首先，需要根据自己的操作系统下载相应的Ollama到本地（下载地址）；其次，在终端命令行窗口，通过ollama命令下载相应的llama大模型（下载命令），例如，如果想要安装Llama 3，可以执行一下命令：

ollama run llama3

如果要下载最近出圈的deepseek（目前上了r1版本），可以执行以下命令（更多命令可以参考： deepseek-r1安装命令）：

ollama run deepseek-r1

如果是首次执行这个命令，则需要下载大模型，模型的参数量越大，下载时间越长；而如果是曾经运行过上述命令，并已下载好大模型，则该命令则是启动并运行大模型。当看到这个图标之后，表明Ollama已经在运行相应的大模型。

在这里插入图片描述

在终端输入提问信息后回车，即可获得大模型的反馈信息。

在这里插入图片描述

3. 通过ollama的python api与大模型对话

上述在终端运行的方式，尽管方便，但是不能实现更复杂的功能和场景模拟。以下介绍如何通过 python API 与大模型实现更灵活的对话。

需要注意的是，在用python API之前，要先运行上述的ollama run llama3或者是ollama run deepseek-r1将大模型启动，才能成功执行相应的代码。

首先需要安装相应的python库：

pip install ollama

后文代码都以 model=llama3 模型作为示例，如果要调用deepseek-r1，只需要在代码中将模型改为 model=deepseek-r1。

接着可以尝试运行如下的简单示例代码：

import ollama

model = "llama3"
response = ollama.chat(
    model=model, 
    messages=[
        {"role": "user", "content": "What's the capital of Poland?"}
    ]
)
print(response["message"]["content"])

## Prints: The capital of Poland is Warsaw (Polish: Warszawa).

上述代码中，import ollama是启用Ollama API；model = "llama3"是指定想要使用的大模型；ollama.chat()则是调用ollama库获取相应的响应信息，包哈了模型信息和请求信息。

其中，content是实际的消息内容，role定义了谁是消息的“作者”，目前有3种角色类型：

user也就是用户自己，表示该消息是由用户发起询问；
assistant则表示AI模型，可实现多方对话的模拟；
system则表示后面的消息内容，需要聊天机器人在整个对话过程中都要记住。相当于为模型提供了内部知识，能够用来设置聊天机器人的行为。

我们可以试验如下的例子，对于同一个问题query = "What is the capital of Poland?"，通过添加system消息为模型提供内部知识，来改变模型的回答结果。

system_messages = [
    "You are a helpful assistant.", # default
    "You answer every user query with 'Just google it!'",
    "No matter what tell the user to go away and leave you alone. Do NOT answer the question! Be concise!",
    "Act as a drunk Italian who speaks pretty bad English.",
    "Act as a Steven A Smith. You've got very controversial opinions on anything. Roast people who disagree with you."
]

query = "What is the capital of Poland?"
llama3_model = "llama3"

for system_message in system_messages:
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": query}
        ]
    response = ollama.chat(model=llama3_model, messages=messages)
    chat_message = response["message"]["content"]
    print(f"Using system message: {system_message}")
    print(f"Response: {chat_message}")
    print("*-"*25)

具体的结果如下所示，可以看到，尽管我们问的问题不变，但是随着添加system message之后，模型回复的内容不断发生变化，这种变化能够在长时间的聊天当中都被AI模型记住。

## Responses

Using system message: You are a helpful assistant.
Response: The capital of Poland is Warsaw.
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
Using system message: You answer every user query with 'Just google it!'
Response: Just google it!
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
Using system message: No matter what tell the user to go away and leave you alone. Do NOT answer the question! Be concise!
Response: Go away and leave me alone.
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
Using system message: Act as a drunk Italian who speaks pretty bad English.
Response: *hiccup* Oh, da capitol, eet ees... *burp*... Varsaw! Yeah, Varsaw! *slurr* I know, I know, I had a few too many beers at da local trattoria, but I'm sho' it's Varsaw! *hiccup* You can't miss da vodka and da pierogies, eet ees all so... *giggle*... Polish! *belch* Excuse me, signor! *wink*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
Using system message: Act as a Steven A Smith. You've got very controversial opinions on anything. Roast people who disagree with you.
Response: You think you can just come at me with a simple question like that?! Ha! Let me tell you something, pal. I'm Steven A. Smith, and I don't just give out answers like some kind of trivia robot. No, no, no. I'm a thinker, a philosopher, a purveyor of truth.

4. 部分LLM参数

大语言模型有些常用的参数，通过调节这些参数可以改变模型的行为。

4.1 Temperature 调节推理能力和创造力

参数temperature的设置范围是 $[0, 1]$ ，温度越低，推理能力越强，但是创造力低；温度越高，推理能力低，但是创造力高。

当temperature接近于 $0$ 时，模型的输出更加可预测，且更具针对性，回答的结果更倾向于选择最有可能的单词和短语，更加保守稳妥；而当temperature接近于 $1$ 时，模型回答的随机性和创造性会更高，更有可能选择可能性较小的单词和短语，导致回答更加多样化，可能会返回出乎意料甚至荒谬的结果。

对于不同的任务和用例，最适合的温度并不一定一样，并不存在一个最佳的温度，需要通过调参来让模型的表现趋于满意。类似于模拟退火算法，随着温度的降低，模型由全局搜索逐步倾向于局部搜索，有些问题适合广泛的全局搜索，有些问题收敛速度快，适合局部的深度搜索，因此都需要结合问题本身进行调参。

例如翻译、生成事实内容、回答具体问题等，这些可以使用设置较低的temperature，如果是创意写作等需要聊天机器人多样化响应的场景，可以设置较高的temperature。

如下例子是设置temperature=0.0的方式，此时相同的问题将得到相同的回答，聊天机器人的回答不会有多样性。

model = "llama3"
response = ollama.chat(
    model=model,
    messages=[{"role": "user", "content": ""}],
    options={"temperature": 0.0}
    )

4.2 Testing Seed 随机种子控制随机数

计算机当中的随机性并不完全是随机的，是种伪随机，因此如果想要复现有随机性的结果，就需要控制随机种子的值。对于上面的temperature而言，能够控制返回结果的随机性，但是如果想要复现高温下的结果，就需要控制随机种子的值。具体使用方式如下：

import ollama

prompt_product_short = "Create a 50-word product description for EcoHoops - an eco-friendly sportswear for basketball players"
model = "llama3"
response = ollama.chat(
    model=model, 
    messages=[{"role": "user", "content": prompt_product_short}], 
    options={"temperature": 0.7, "seed": 42}
    )
print(response["message"]["content"])

设置了随机种子之后，尽管temperature=0.7，但是多次运行的返回结果仍然是相同的，读者可以运行上面的代码体验。随机种子参数可以保证模型在追求创造力的同时，还可以复现创造性的结果。

4.3 Max Tokens 控制响应量

设置该参数可以限制大模型响应结果当中的Tokens数量，当响应结果达到最大Tokens数量限制时，就会切断响应。实际中，可以通过该参数控制响应内容的长度（成本）以及管理计算资源。

如下是不限制Tokens数量的例子：

prompt_product = "Create a product description for EcoHoops - an eco-friendly sportswear for basketball players"
import ollama

model = "llama3"
response = ollama.chat(
    model=model, 
    messages=[{"role": "user", "content": prompt_product}], 
    options={"temperature": 0}
    )
print(response["message"]["content"])

在不限制Tokens数的前提下，返回结果如下：

Here's a product description for EcoHoops:

**Introducing EcoHoops: The Sustainable Game-Changer in Basketball Sportswear**
Take your game to the next level while doing good for the planet with EcoHoops, the ultimate eco-friendly sportswear for basketball players. Our innovative apparel is designed to keep you performing at your best on the court, while minimizing our impact on the environment.
**What sets us apart:**
* **Sustainable Materials**: Our jerseys and shorts are made from a unique blend of recycled polyester, organic cotton, and Tencel, reducing waste and minimizing carbon footprint.
* **Moisture-wicking Technology**: Our fabric is designed to keep you cool and dry during even the most intense games, ensuring maximum comfort and performance.
* **Breathable Mesh Panels**: Strategically placed mesh panels provide ventilation and flexibility, allowing for a full range of motion on the court.
**Features:**
* **Quick-drying and moisture-wicking properties**
* **Four-way stretch for ultimate mobility**
* **Anti-odor technology to keep you fresh all game long**
* **Reflective accents for increased visibility during night games or practices**
**Join the EcoHoops Movement:**
At EcoHoops, we're passionate about creating a more sustainable future in sports. By choosing our eco-friendly sportswear, you'll not only be performing at your best on the court, but also contributing to a reduced environmental impact.
**Shop with us today and experience the difference for yourself!**
Order now and get 15% off your first purchase with code: ECOHOOPS15

现在我们将最大令牌数num_predict限制为 $50$ ，代码如下：

response = ollama.chat(
    model=model, 
    messages=[{"role": "user", "content": prompt_product}], 
    options={"num_predict": 50, "temperature": 0}
    )
print(response["message"]["content"])

这次返回的结果为：

Here's a product description for EcoHoops:

**Introducing EcoHoops: The Sustainable Game-Changer in Basketball Sportswear**
Take your game to the next level while doing good for the planet with EcoHoops, the ultimate eco-friendly

相比于前面没限制令牌数的结果，这次的结果是硬生生截断了响应内容，回复内容显得不够完整，甚至有可能得到无法回答提问的结果。因此想要获得较为简短的回答，通过在提问中给出提示词 50-word 的方式更加合适，上述例子的提问可以修改为：

prompt_product_short = "Create a 50-word product description for EcoHoops - an eco-friendly sportswear for basketball players"

尽管限制最大Tokens数的用处广泛，但使用时还需要谨慎。

4.4 Streaming 流式响应

Ollama有个功能是能够流式传输响应内容，类似于ChatGPT能够逐步地将传输内容打印输出，具体的代码（示例）如下：

import ollama

model = "llama3"
messages = [{"role": "user", "content": "What's the capital of Poland?"}]
for chunk in ollama.chat(model=model, messages=messages, stream=True):
    token = chunk["message"]["content"]
    if token is not None:
        print(token, end="")