Retrieval-Augmented Generation (RAG) in Rails: A Practical Implementation

10:38 PM September 15 2025 LLM

I wanted to generate study material for specific topics within one of my applications. Time to code a RAG feature, and maybe take some inspiration from NotebookLM to design the prompts for it.

A week ago I started a new project because I wanted to get a better feel for and knowledge of using inertia.js to integrate react within ruby on rails.

It began as a small one-day app: a job application tracker/CRM to help me organize my interactions from LinkedIn messages and emails, along with relevant job descriptions, contacts, and company information.

It was just a barebones CRUD app, but my main goal was to play with inertia.js to integrate react in the same app. It was super fun, and I quickly found myself using it daily.

Planning the RAG feature

So this past weekend I decided to add RubyLLM and code a basic RAG feature that generates interview prep material based on job descriptions and messages for a specific job opening.

RAG is basically a technique that helps you generate a prompt with relevant context from your domain based on a user query.

This requires two main steps: data ingestion (chunking and embedding those chunks), and then retrieving that data to build a prompt with the most relevant context.

Here’s the plan:

  • Write the logic to generate chunks of data from each data source
  • Generate embeddings and store them in a vector column in postgres with pgvector
  • Generate the embedding of the query
  • Use the query embedding to perform a similarity search over the chunk embeddings
  • Take the most relevant chunks and add them as context in the pre-designed RAG prompt
  • I would also like to link citations to the relevant data, so I’m requesting structured outputs

Data ingestion

For the first iteration I decided to include just job descriptions and tracked messages from LinkedIn.

class IngestService
  def self.ingest_job_description(job_opening)
    text = "#{job_opening.title} (#{job_opening.tech_stack})\n#{job_opening.description}"
    documents = []
    Chunker.split(text).each_with_index do |chunk, index|
      documents << Document.create!(
        chunk_text: chunk,
        metadata: {
          source_type: "job_opening",
          job_opening_id: job_opening.id,
          index: index
        }
      )
    end
    job_opening.update!(ingested_at: Time.current)
    documents
  end

  def self.ingest_direct_message(direct_message)
    text = "#{direct_message.direction == 'received' ? 'Recruiter' : 'Candidate'}: #{direct_message.content}"
    Chunker.split(text).each_with_index do |chunk, index|
      Document.create!(
        chunk_text: chunk,
        metadata: {
          source_type: "direct_message",
          direct_message_id: direct_message.id,
          direction: direct_message.direction,
          index: index
        }
      )
    end
    direct_message.update!(ingested_at: Time.current)
  end
end
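The service above assumes a Document model with a jsonb metadata column and a vector column for the embedding. Here is a minimal sketch of the migration I would expect behind it; the column names and the 1536 dimensions come from the code in this post, while the rest (migration class, enabling the extension here) is an assumption based on the neighbor gem’s t.vector helper and pgvector being installed in postgres:

class CreateDocuments < ActiveRecord::Migration[7.1]
  def change
    # pgvector extension must be available in the database
    enable_extension "vector"

    create_table :documents do |t|
      t.text :chunk_text, null: false
      t.jsonb :metadata, null: false, default: {}
      # matches the dimensions requested from the embedding model
      t.vector :embedding, limit: 1536
      t.timestamps
    end
  end
end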

I went for the simplest chunking strategy: 2000 characters with 10% overlap. For this use case that works well enough and maintains some context across chunk boundaries.

class Chunker
  CHUNK_SIZE = 2000
  OVERLAP = 200

  def self.split(text)
    return [] if text.blank?

    chunks = []
    start = 0
    while start < text.length
      finish = [ start + CHUNK_SIZE, text.length ].min
      chunk = text[start...finish]
      chunks << chunk.strip
      start += (CHUNK_SIZE - OVERLAP)
    end
    chunks
  end
end
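A quick sanity check in the console shows the overlap behavior (the input here is just a made-up 4,500-character string):

# Two full 2000-char chunks plus a 900-char tail; each chunk starts
# 1800 characters after the previous one, giving 200 characters of overlap.
Chunker.split("a" * 4500).map(&:length)
# => [2000, 2000, 900]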

I will write about chunking strategies and implementations in a future article.

RubyLLM and generating embeddings

In this application I will be using RubyLLM instead of manually calling each LLM API. It has really cool abstractions built in, so I don’t need to reinvent them (I will be defining tools and writing agentic features next weekend). It also makes switching model providers really easy, so doing different evals and benchmarks won’t be too time consuming.

I have the chunks persisted as documents in my db. To generate the embeddings of these chunks I am using sidekiq-cron to run this in the background periodically.

class EmbedDocumentsService
  EMBEDDING_MODEL = "text-embedding-3-large"

  def self.in_batches(limit: 100)
    docs = Document.where(embedding: nil).limit(limit)
    return if docs.empty?

    inputs = docs.map(&:chunk_text)

    # Request embeddings in batch
    response = RubyLLM.embed(
      inputs,
      model: EMBEDDING_MODEL,
      dimensions: 1536
    )

    embeddings = response.vectors

    docs.zip(embeddings).each do |doc, embedding|
      doc.update!(embedding: embedding)
    end

    docs.size
  end
end

OpenAI’s embedding endpoint accepts both strings and arrays as input. If you send an array, you can embed multiple chunks in a single request. RubyLLM supports this seamlessly: you just pass an array as the input to RubyLLM.embed() and it sends the request as expected. This approach is much better than sending each chunk separately, since it reduces API requests and lowers cost.
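For the periodic run, a thin job wrapping the service is enough. The job class name and the schedule below are my own assumptions, not something prescribed by sidekiq-cron:

class EmbedDocumentsJob
  include Sidekiq::Job

  def perform
    # Embed up to 100 pending documents per run
    EmbedDocumentsService.in_batches(limit: 100)
  end
end

# e.g. in config/initializers/sidekiq.rb: run every 10 minutes
Sidekiq::Cron::Job.create(
  name: "Embed pending documents",
  cron: "*/10 * * * *",
  class: "EmbedDocumentsJob"
)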

Orchestration Service: Bringing it all together

For this feature I will pre-construct the query in the backend instead of letting the user write it.

In the UI it’s just a “Generate Interview Prep” button. Let’s create the query and generate its embedding vector.

query = "Technical topics, frameworks, and challenges mentioned in job description and messages for #{job_opening.title} at #{job_opening.company.name}."

query_embedding = RubyLLM.embed(
  query,
  model: EmbedDocumentsService::EMBEDDING_MODEL,
  dimensions: 1536
).vectors

Now the retrieval part. I am using the neighbor gem to generate the db query. Let’s pull the top 8 most relevant documents for this query.

retrieved_chunks = Document.nearest_neighbors(:embedding, query_embedding, distance: "euclidean").first(8)
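For nearest_neighbors to be available on Document, the model needs the neighbor gem’s declaration; a minimal sketch:

class Document < ApplicationRecord
  # exposes the nearest_neighbors scope backed by pgvector
  has_neighbors :embedding
end

Since OpenAI embeddings are normalized to length 1, euclidean and cosine distance produce the same ranking here, so either choice works for this search.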

And here comes the artistic part of the feature: designing the prompt.

context_text = retrieved_chunks.map.with_index(1) do |doc, index|
  source_id = doc.metadata["job_opening_id"] || doc.metadata["direct_message_id"]
  source_info = [
    "Source Type: #{doc.metadata["source_type"]}",
    "Source ID: #{source_id}",
    metadata["date"]
  ].compact.join(" | ")

  "[#{index}] #{source_info}\n#{doc.chunk_text}"
end.join("\n\n")

user_prompt = <<~PROMPT
  /context/
  #{context_text}

  Prepare a concise interview prep for:
  - Company: #{@company}
  - Role: #{@role}
  - Interview date: #{@interview_date}

  Produce:
  1) 30-second company summary
  2) Top 5 likely technical questions specific to the role. Don’t include generic or commonly asked questions
  3) Top 3 behavioral questions + STAR outline examples
  4) 5 smart questions to ask the interviewer
  5) Exact 2~3 source snippets (with metadata) the candidate should re-read

  Keep it <= 650 words. Cite sources under each fact using (Source [index]). Markdown format using headings, subheadings, and bold text for questions. Questions should be listed with numbers; answers should go on the next line without bullet points.

  This is an example format:

  ## 2) Top 5 Likely Technical Questions and Model Answers:
  - **When would you choose MySQL vs. DynamoDB, and how would ElasticSearch, Redis, or Memcached fit in?**
    Select MySQL or DynamoDB per data/scale needs; leverage ElasticSearch and in-memory stores to meet performance and search requirements. (Source [1], [2])
PROMPT

system_prompt = <<~SYSTEM
  You are an expert interview coach. Use ONLY the provided context to produce answers. Cite sources, and say 'Not in your documents' if info is missing.
  The provided context has an index, source type, and source ID for each snippet.
SYSTEM

You’ll need to experiment to see what works best for your use case, but here are some tips:

  • Enforce context-only rules
  • Prefer positive instructions over negative instructions. For example, my prompt says “Use ONLY the provided context to produce answers”; the negative form would be something like “Do not rely on outside knowledge or make assumptions to produce answers”. You can use both if needed.
  • Proofread your prompt to avoid ambiguity and contradictions
  • Set a clear role using a system prompt
  • Adding examples helps guide the output format
  • Explicit text-length limits work better than asking for “concise output”. You can reinforce this by adding explicit limits to each part of the requested response. It won’t be perfect, but I’ve had good results with this approach.

On top of that, I also used RubyLLM::Schema, which helps define a structured output schema; structured outputs are supported by many model providers.

class InterviewPrepSchema < RubyLLM::Schema
  string :content, description: "Interview preparation text"

  array :snippets do
    object do
      string :source_id, description: "ID of the source document"
      string :snippet, description: "Text snippet from the source document"
      string :source_type, enum: [ "job_opening", "direct_message" ]
    end
  end
end

chat = Chat.create!(model: "gpt-5")
chat.with_temperature(0.3)
chat.with_schema(InterviewPrepSchema)
chat.with_instructions(system_prompt)
chat.ask(user_prompt)
@job_opening.update!(interview_prep: chat)
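One thing worth noting: chat.ask returns the response object, which I’m currently discarding. With a schema attached, RubyLLM parses the model’s JSON output, so (as I understand the gem) the content comes back as a hash matching InterviewPrepSchema. A quick sketch of how I could capture it and pull out the snippets for the citation links; the variable names here are my own:

response = chat.ask(user_prompt)

# With with_schema, the content should already be parsed JSON (a Hash)
prep_text = response.content["content"]
citations = response.content["snippets"].map do |snippet|
  {
    source_type: snippet["source_type"], # "job_opening" or "direct_message"
    source_id: snippet["source_id"],
    text: snippet["snippet"]
  }
end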

Wrapping up

It’s really easy and straightforward to implement a RAG strategy for content generation in rails, and the ecosystem is active, growing, and well documented.

Adding this feature to the UI was fun too. Next weekend I will experiment a bit with streaming the responses so the interaction feels more “AI’ish” instead of just showing a spinner while waiting for the model to finish.

I also plan to write a shorter version of this implementation as notes, along with deeper dives into specific topics around implementing LLM features in our apps.