
Automate Document Ingestion & RAG System with Google Drive, Sheets & OpenAI

by Mohamed Abdelwahab · Last update: 23 days ago · Source: n8n.io

Getting Started

1. Overview

The IngestionDocs workflow is a fully automated document ingestion and
knowledge management system built with n8n. Its purpose is to continuously
ingest organizational documents from Google Drive, transform them into vector
embeddings using OpenAI, store them in Pinecone, and make them searchable and
retrievable through an AI-powered Q&A interface.

This ensures that employees always have access to the most up-to-date
knowledge base without requiring manual intervention.


2. Key Objectives

  • Automated Ingestion → Seamlessly process new and updated
    documents from Google Drive.
  • Change Detection → Track and differentiate between new, updated,
    and previously processed documents.
  • Knowledge Base Construction → Convert documents into embeddings
    for semantic search.
  • AI-Powered Assistance → Provide an intelligent Q&A system for
    employees to query manuals.
  • Scalable & Maintainable → Modular design using n8n, LangChain,
    and Pinecone.

3. Workflow Breakdown

A. Document Monitoring and Retrieval

  • The workflow begins with two Google Drive triggers:
    • File Created Trigger → Fires when a new document is
      uploaded.
    • File Updated Trigger → Fires when an existing document is
      modified.
  • A search operation lists the files in the designated Google
    Drive folder.
  • Non-downloadable items (e.g., subfolders) are filtered out.
  • For valid files:
    • The file is downloaded.
    • A SHA256 hash is generated to uniquely identify the file's
      content.
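
The filtering and hashing steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the workflow's actual node code: the listing shape and the helper names `downloadable_files` and `sha256_of_bytes` are assumptions made for the example. The folder MIME type is the one Google Drive assigns to folders.

```python
import hashlib

# Google Drive marks folders with this MIME type; folders cannot be downloaded.
FOLDER_MIME_TYPE = "application/vnd.google-apps.folder"


def downloadable_files(items):
    """Keep only listing entries that are not folders (hypothetical helper)."""
    return [item for item in items if item.get("mimeType") != FOLDER_MIME_TYPE]


def sha256_of_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of a file's raw content."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical listing, shaped loosely like a Drive folder search result.
listing = [
    {"id": "f1", "name": "manual.pdf", "mimeType": "application/pdf"},
    {"id": "d1", "name": "archive", "mimeType": FOLDER_MIME_TYPE},
]
files = downloadable_files(listing)

# In the real workflow the bytes would come from the Drive download step.
digest = sha256_of_bytes(b"example file content")
```

Because the digest depends only on the file's bytes, renaming a file leaves the hash unchanged, while any edit to the content produces a different one.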

B. Record Management (Google Sheets Integration)

To keep track of ingestion states, the workflow uses a Google
Sheets-based Record Manager:

  • Each file entry contains:
    • Id (Google Drive file ID)
    • Name (file name)
    • hashId (SHA256 checksum)
  • The workflow compares the current file's hash with the stored one:
    • New Document → File not found in records → Inserted into the
      Record Manager.
    • Already Processed → File exists and hash matches → Skipped.
    • Updated Document → File exists but hash differs → Record is
      updated.

This guarantees that only new or modified content is processed, avoiding
duplication.
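
The three-way decision can be expressed as a small function. The sketch below assumes the sheet rows have been loaded into a dictionary keyed by Drive file ID. That dictionary, the `Status` enum, and the function names are illustrative stand-ins, not part of the workflow. Only the column names Id, Name, and hashId come from the description above.

```python
from enum import Enum


class Status(Enum):
    NEW = "new"
    UNCHANGED = "already processed"
    UPDATED = "updated"


def classify(file_id, file_hash, records):
    """Compare a file's hash with the stored one, looked up by Drive file ID."""
    stored = records.get(file_id)
    if stored is None:
        return Status.NEW          # file not found in records
    if stored["hashId"] == file_hash:
        return Status.UNCHANGED    # same content, nothing to do
    return Status.UPDATED          # same file, different content


def record_file(file_id, name, file_hash, records):
    """Insert or update the record and report what happened.

    `records` is an in-memory stand-in for the Google Sheets rows.
    """
    status = classify(file_id, file_hash, records)
    if status is not Status.UNCHANGED:
        records[file_id] = {"Id": file_id, "Name": name, "hashId": file_hash}
    return status
```

Only files that come back as NEW or UPDATED would continue to the embedding stage; UNCHANGED files stop here.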


C. Document Processing and Vectorization

Once a document is marked as new or updated:

  1. Default Data Loader extracts its content (binary files
     supported).
    • Pages are split into individual chunks.
    • Metadata such as file ID and name is attached.
  2. Recursive Character Text Splitter divides the content into
     manageable segments with overlap.
  3. OpenAI Embeddings (text-embedding-3-large) transforms each text
     chunk into a semantic vector.
  4. Pinecone Vector Store stores these vectors in the configured
     index:
    • For new documents, embeddings are inserted into a namespace based
      on the file name.
    • For updated documents, the namespace is cleared first, then
      re-ingested with fresh embeddings.

This process builds a scalable and queryable knowledge base.
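
To make the overlap and the namespace handling concrete, here is a simplified sketch. It uses a plain fixed-size sliding window, not LangChain's recursive splitter (which also tries to break on separators such as paragraphs and sentences), and a dictionary in place of a Pinecone index. The default sizes, the `embed` callback, and both function names are assumptions for illustration only.

```python
def split_with_overlap(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size windows that share `overlap` characters.

    The default sizes are illustrative; the workflow's settings are not stated.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def store_document(index, namespace, chunks, embed, metadata):
    """Clear the namespace, then insert freshly embedded chunks.

    `index` is a dict standing in for a vector index and `embed` is a
    caller-supplied function that maps text to a vector.
    """
    index.pop(namespace, None)  # drops vectors left over from an older version
    index[namespace] = [
        {"vector": embed(chunk), "text": chunk, **metadata}
        for chunk in chunks
    ]
```

Clearing before inserting keeps a namespace from mixing chunks of the old and new versions of a document. For a new document the clear step is simply a no-op.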


D. Knowledge Base Q&A Interface

The workflow also provides an interactive form-based user interface:

  • Form Trigger → Collects employee questions.
  • LangChain AI Agent:
    • Receives the question.
    • Retrieves relevant context from Pinecone using vector similarity
      search.
    • Processes the response using OpenAI Chat Model (gpt-4.1-mini).
  • Answer Formatting:
    • Responses are returned in HTML format for readability.
    • A custom CSS theme ensures a modern, user-friendly design.
    • Answers may include references to page numbers when available.

This creates a self-service knowledge base assistant that employees
can query in natural language.
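
The retrieval step can be illustrated without any external service. The sketch below ranks stored chunks by cosine similarity to a query vector and assembles a prompt from the best matches. Cosine is assumed here for illustration, since the metric a Pinecone index uses is a configuration choice, and the record shape, `top_k`, `build_prompt`, and the prompt wording are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, records, k=3):
    """Return the k records whose vectors are most similar to the query."""
    ranked = sorted(
        records,
        key=lambda r: cosine_similarity(query_vector, r["vector"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, context_records):
    """Combine retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(r["text"] for r in context_records)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

In the workflow itself, the query vector would come from the same embedding model used at ingestion time, and the assembled context would be passed to the chat model by the agent.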


4. Technologies Used

  • n8n → Orchestration of the entire workflow.
  • Google Drive API → File monitoring, listing, and downloading.
  • Google Sheets API → Record manager for tracking file states.
  • OpenAI API:
    • text-embedding-3-large for semantic vector creation.
    • gpt-4.1-mini for conversational Q&A.
  • Pinecone → Vector database for embedding storage and retrieval.
  • LangChain → Document loaders, text splitters, vector store
    connectors, and agent logic.
  • Crypto (SHA256) → File hash generation for change detection.
  • Form Trigger + Form Node → Employee-facing Q&A submission and
    answer display.
  • Custom CSS → Provides a modern, responsive, styled UI for the
    knowledge base.

5. End-to-End Data Flow

  1. Employee uploads or updates a document → Google Drive detects
    the change.
  2. Workflow downloads and hashes the file → Identifies the content
    and detects modifications.
  3. Record Manager (Google Sheets) → Decides whether to skip,
    insert, or update the record.
  4. Document Processing → Splitting + Embedding + Storing into
    Pinecone.
  5. Knowledge Base Updated → The latest version of documents is
    indexed.
  6. Employee asks a question via the web form.
  7. AI Agent retrieves relevant context from Pinecone + uses
    gpt-4.1-mini → Generates a contextual answer.
  8. Answer displayed in styled HTML → Delivered back to the employee
    through the form interface.

6. Benefits

  • Always Up-to-Date → Automatically syncs documents when uploaded
    or changed.
  • No Duplicates → SHA256 hashing ensures only new or changed files
    are reprocessed.
  • Searchable Knowledge Base → Employees can query documents
    semantically, not just by keywords.
  • Enhanced Productivity → Answers are immediate, reducing time
    spent browsing manuals.
  • Scalable → New documents and users can be added without workflow
    redesign.

✅ In summary, IngestionDocs is a robust AI-driven document ingestion and
retrieval system that integrates Google Drive, Google Sheets, OpenAI, and
Pinecone within n8n. It continuously builds and maintains a knowledge base of
manuals while offering employees an intelligent, user-friendly Q&A assistant
for fast and accurate knowledge retrieval.