
Hands-On – Deploying and Experimenting with a Foundation Model

API-Based Deployment: Using OpenAI’s GPT-4

A. Setup and Installation

First, install OpenAI’s Python package, along with python-dotenv, which the example below uses to load the API key from a .env file:

pip install openai python-dotenv

B. Code Example: Generating Text with GPT-4

import os
from dotenv import load_dotenv
from openai import OpenAI

# Load environment variables from .env file
load_dotenv()

# Create a client; pass the API key explicitly from the environment
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def generate_text(prompt, max_tokens=100):
    """
    Generate text using GPT-4 based on the provided prompt.

    Args:
        prompt (str): The input prompt for text generation
        max_tokens (int): Maximum number of tokens to generate

    Returns:
        str: Generated text
    """
    try:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=max_tokens,
            temperature=0.7,
            top_p=1.0,
            frequency_penalty=0.0,
            presence_penalty=0.0
        )
        # Extract the generated text from the response
        return response.choices[0].message.content
    except Exception as e:
        return f"Error generating text: {e}"

# Example usage
if __name__ == "__main__":
    prompt = "Write a short paragraph about artificial intelligence."
    generated_text = generate_text(prompt)
    print("Generated Text:")
    print(generated_text)

Output:

Generated Text:
Artificial Intelligence (AI) refers to the simulation of human intelligence processes by machines, especially computer systems. These processes include learning, reasoning, problem-solving, perception, and language understanding. AI technology is now frequently applied in areas such as robotics, voice recognition, image recognition, natural language processing, and many others. It has the potential to greatly impact various sectors of society, from healthcare and education to business and entertainment, by automating tasks and providing insightful data analysis.
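In practice, API calls occasionally fail with transient errors such as rate limits, so it is worth wrapping `generate_text` in a retry loop with exponential backoff. Below is a minimal sketch of such a wrapper; the `with_retries` helper and the `flaky` stand-in function are illustrative names, not part of the OpenAI SDK, and the delay values are arbitrary:

```python
import time

def with_retries(call, max_attempts=3, base_delay=1.0):
    """Retry a zero-argument callable with exponential backoff.

    Delays grow as base_delay * 2**attempt (1s, 2s, 4s, ...).
    The last failure is re-raised so the caller can handle it.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(base_delay * (2 ** attempt))

# Stand-in for an API call that fails twice, then succeeds
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # prints "ok" after two retries
```

In real code you would pass `lambda: generate_text(prompt)` as the callable, and ideally catch only the SDK's rate-limit and timeout exceptions rather than bare `Exception`.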

Local Deployment: Running LLaMA 2 on Your Machine

A. Installing the Required Libraries

To run Meta’s LLaMA 2 model locally, install:

pip install transformers torch accelerate
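Before loading a 7B-parameter model, it helps to check what hardware `torch` can see: with `device_map="auto"` the model silently falls back to CPU when no GPU is available, which is slow, and float16 support on CPU is limited. A small sketch of a device/dtype check (the helper name is our own, not a library function):

```python
import torch

def pick_dtype_and_device():
    """Choose a sensible dtype/device pair for local inference."""
    if torch.cuda.is_available():
        # Half precision saves memory on GPU (a 7B model needs roughly 14 GB in fp16)
        return torch.float16, "cuda"
    # On CPU, stick with float32; fp16 CPU kernels are limited
    return torch.float32, "cpu"

dtype, device = pick_dtype_and_device()
print(f"Using {device} with dtype {dtype}")
```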

B. Loading and Running LLaMA 2 Locally

Let me walk you through the entire process of downloading and using LLaMA 2 models from scratch:

Step 1: Request Access to LLaMA 2

LLaMA 2 is a gated model: you must first request access from Meta.

  • Go to Meta’s request form: Meta AI LLaMA Downloads
  • Fill out the required information (personal/professional details)
  • Accept the license terms
  • Submit your request
  • Wait for approval (typically takes a few days)

Step 2: Download the Model Files

Once you’re approved, you have two options to get the model files:

Option A: Direct Download from Meta

  • Meta will email you a download link.
  • Download the model files to your computer (several GB).
  • Extract the files to a folder (e.g., C:\AI\llama-2-7b-chat).

Option B: Using Hugging Face with Approval

  1. Link your Meta approval with your Hugging Face account.
  2. Install the Hugging Face CLI: pip install huggingface_hub
  3. Login to Hugging Face: huggingface-cli login
  4. Enter your token when prompted
  5. Download the model: huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir ./my-llama-model
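Whichever option you choose, the target folder should end up containing a tokenizer/model config plus the weight files that `from_pretrained` expects. A quick sanity check before moving on can save a confusing error later; this is a heuristic sketch, and exact filenames vary between releases:

```python
from pathlib import Path

def looks_like_hf_model_dir(path):
    """Heuristic check that a folder contains a Hugging Face model snapshot."""
    p = Path(path)
    has_config = (p / "config.json").is_file()
    # Weights may be sharded across .safetensors or .bin files
    has_weights = any(p.glob("*.safetensors")) or any(p.glob("*.bin"))
    return has_config and has_weights

print(looks_like_hf_model_dir("./my-llama-model"))
```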

Step 3: Set Up Your Python Code

Create a new Python file (e.g., llama_inference.py) with this code:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Path to your locally downloaded model files
local_model_path = "path/to/downloaded/llama-2-7b-chat"  # Change this to your actual path

# Initialize tokenizer from local files
tokenizer = AutoTokenizer.from_pretrained(local_model_path, use_fast=False)

# Initialize model from local files
model = AutoModelForCausalLM.from_pretrained(
    local_model_path,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Prepare input using LLaMA 2 chat template
input_text = "Explain reinforcement learning."
chat = [{"role": "user", "content": input_text}]
prompt = tokenizer.apply_chat_template(chat, tokenize=False)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate response
output = model.generate(
    **inputs,  # passes input_ids and attention_mask together
    max_new_tokens=500,
    temperature=0.7,
    do_sample=True,
    top_p=0.9,
)

# Decode and print the response
response = tokenizer.decode(output[0], skip_special_tokens=True)
print(response)