
Analyze Images, Videos, Documents & Audio with Gemini Tools and Qwen LLM Agent

by Mauricio Perera | Last update: 2 months ago | Source: n8n.io

Getting Started

📁 Analyze uploaded images, videos, audio, and documents with specialized tools — powered by a lightweight language-only agent.


🧭 What It Does

This workflow enables multimodal file analysis using Google Gemini tools connected to a text-only LLM agent. Users can upload images, videos, audio files, or documents via a chat interface. The workflow will:

  • Upload each file to Google Gemini and obtain an accessible URL.
  • Dynamically generate contextual prompts based on the file(s) and user message.
  • Allow the agent to invoke Gemini tools for specific media types as needed.
  • Return a concise, helpful response based on the analysis.

🚀 Use Cases

  • Customer support: Let users upload screenshots, documents, or recordings and get helpful insights or summaries.
  • Multimedia QA: Review visual, audio, or video content for correctness or compliance.
  • Educational agents: Interpret content from PDFs, diagrams, or audio recordings on the fly.
  • Low-cost multimodal assistants: Achieve multimodal functionality without relying on large vision-language models.

🎯 Why This Architecture Matters

Unlike end-to-end multimodal LLMs (like Gemini 1.5 or GPT-4o), this template:

  • Uses a text-only LLM (Qwen 32B via Groq) for reasoning.
  • Delegates media analysis to specialized Gemini tools.

✅ Advantages

| Feature | Benefit |
| --- | --- |
| 🧩 Modular | LLM and tools are decoupled, so each can be updated independently |
| 💸 Cost-Efficient | No need to pay for full multimodal models; tools are used only when needed |
| 🔧 Tool-based Reasoning | Agent invokes tools on demand, in the style of Toolformer or LLM function calling |
| ⚡ Fast | Groq-hosted LLMs offer fast responses with low latency |
| 📚 Memory | Includes a context buffer for multi-turn chats (15 messages) |

🧪 How It Works

🔹 Input via Chat

  • Users submit a message and (optionally) files via the chatTrigger.

🔹 File Handling

  • If no files: prompt is passed directly to the agent.

  • If files are included:

    • Files are split into individual items, and each is uploaded to Gemini, which returns a URL that the Gemini tools can reference.
    • Metadata (name, type, URL) is collected and embedded into the prompt.
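
The file-handling branch above can be sketched roughly as follows. This is a minimal Python illustration, not the workflow's actual node code: `upload_file` is a hypothetical stand-in for the Gemini upload step, and the input field names (`name`, `mimeType`) are assumptions made for the example.

```python
from typing import Callable, Dict, List


def collect_file_metadata(
    files: List[Dict[str, str]],
    upload_file: Callable[[Dict[str, str]], str],
) -> List[Dict[str, str]]:
    """Upload each file and collect its name, MIME type, and returned URL.

    `upload_file` is a hypothetical callable standing in for the Gemini
    upload step; it takes one file record and returns that file's URL.
    """
    metadata = []
    for f in files:  # one iteration per uploaded file
        url = upload_file(f)
        metadata.append({"name": f["name"], "type": f["mimeType"], "url": url})
    return metadata


def fake_upload(f: Dict[str, str]) -> str:
    # Placeholder URL for illustration only; a real upload returns its own URL.
    return "https://example.invalid/files/" + f["name"]


files = [{"name": "report.pdf", "mimeType": "application/pdf"}]
# With no files, the metadata step is skipped and the list stays empty.
media = collect_file_metadata(files, fake_upload) if files else []
print(media)
```

Passing the upload step in as a callable keeps the sketch testable without network access; in the workflow itself, the upload node plays that role.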

🔹 Prompt Construction

  • A new chatInput is dynamically generated:

    User message

    Media: [array of file data]
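
As a rough illustration of this prompt format, the sketch below joins the user message and the file metadata into a single string. The function name `build_chat_input` and the choice to serialize the array as JSON are assumptions; the workflow may render the array differently.

```python
import json
from typing import Dict, List


def build_chat_input(message: str, media: List[Dict[str, str]]) -> str:
    """Return the user message, followed by a Media line when files exist."""
    if not media:
        return message  # no files: the prompt is passed through unchanged
    return f"{message}\n\nMedia: {json.dumps(media, ensure_ascii=False)}"


prompt = build_chat_input(
    "What does this PDF say?",
    [{"name": "report.pdf", "type": "application/pdf",
      "url": "https://example.invalid/files/report.pdf"}],
)
print(prompt)
```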

🔹 Agent Reasoning

  • The LangChain Agent receives:

    • The enriched prompt

    • File URLs

    • Memory context (15 turns)

    • Access to 4 Gemini tools:

      • IMG: analyze image
      • VIDEO: analyze video
      • AUDIO: analyze audio
      • DOCUMENT: analyze document

The agent autonomously decides whether and how to use tools, then responds with concise output.
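
One way to picture the tool-calling step: the agent emits a tool name plus arguments, and a dispatcher runs the matching tool. The sketch below uses the four tool names from the list above, but the tool functions are hypothetical stubs (the real tools send the file URL to Gemini), and the shape of `tool_call` is an assumption.

```python
from typing import Callable, Dict


def make_stub(kind: str) -> Callable[[str, str], str]:
    """Build a stand-in tool; the real tools send the URL to Gemini."""
    def tool(url: str, question: str) -> str:
        return f"[{kind} analysis of {url} for: {question}]"
    return tool


# Tool names come from the workflow; the functions themselves are stubs.
TOOLS: Dict[str, Callable[[str, str], str]] = {
    name: make_stub(name) for name in ("IMG", "VIDEO", "AUDIO", "DOCUMENT")
}


def run_tool_call(tool_call: Dict[str, str]) -> str:
    """Execute one tool call of the assumed form {tool, url, question}."""
    name = tool_call["tool"]
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    return TOOLS[name](tool_call["url"], tool_call["question"])


result = run_tool_call({
    "tool": "DOCUMENT",
    "url": "https://example.invalid/files/report.pdf",
    "question": "Summarize this file.",
})
print(result)
```

The text-only LLM never sees the file itself; it only sees the URL in the prompt and the text that a tool returns.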


🧱 Nodes & Services

| Category | Node / Tool | Purpose |
| --- | --- | --- |
| Chat Input | chatTrigger | User interface with file support |
| File Processing | splitOut, splitInBatches | Process each uploaded file |
| Upload | googleGemini | Uploads each file to Gemini, gets URL |
| Metadata | set, aggregate | Builds structured file info |
| AI Agent | LangChain Agent | Receives context + file data |
| Tools | googleGeminiTool | Analyze media with Gemini |
| LLM | lmChatGroq (Qwen 32B) | Text reasoning, high-speed |
| Memory | memoryBufferWindow | Maintains session context |
MemorymemoryBufferWindowMaintains session context
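
To illustrate what a buffer-window memory does, here is a minimal sketch of a sliding window over chat messages. The class name `WindowMemory` is hypothetical, and the sketch counts individual messages; whether the n8n node counts messages or full turns is not confirmed here.

```python
from collections import deque
from typing import Deque, Dict, List


class WindowMemory:
    """Keep only the most recent `window` messages of a conversation."""

    def __init__(self, window: int = 15) -> None:
        # A deque with maxlen drops the oldest entry once it is full.
        self._messages: Deque[Dict[str, str]] = deque(maxlen=window)

    def add(self, role: str, content: str) -> None:
        self._messages.append({"role": role, "content": content})

    def context(self) -> List[Dict[str, str]]:
        return list(self._messages)


memory = WindowMemory(window=15)
for i in range(20):
    memory.add("user", f"message {i}")

print(len(memory.context()))            # 15
print(memory.context()[0]["content"])   # message 5
```

A fixed window keeps prompt size bounded, at the cost of forgetting anything older than the window.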

⚙️ Setup Instructions

1. 🔑 Required Credentials

  • Groq API key (for Qwen 32B model)
  • Google Gemini API key (PaLM / Gemini 1.5 tools)

2. 🧩 Nodes That Need Setup

  • Replace existing credentials on:

    • Upload a file
    • Each GeminiTool (IMG, VIDEO, AUDIO, DOCUMENT)
    • lmChatGroq

3. ⚠️ File Size & Format Considerations

  • Some Gemini tools have file size or format restrictions.
  • You may add validation nodes before uploading if needed.
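
A pre-upload validation step could look roughly like the sketch below. The size limit and the allowed MIME types are placeholder assumptions for illustration, not Gemini's actual restrictions, and the `mimeType` and `size` field names are likewise assumed.

```python
from typing import Dict, List, Union

# Assumed limits for illustration only; check the current Gemini
# documentation for the real size and format restrictions.
MAX_BYTES = 20 * 1024 * 1024
ALLOWED_PREFIXES = ("image/", "video/", "audio/")
ALLOWED_TYPES = {"application/pdf", "text/plain"}

FileRecord = Dict[str, Union[str, int]]


def validate_file(f: FileRecord) -> List[str]:
    """Return a list of problems; an empty list means the file passes."""
    problems = []
    mime = str(f.get("mimeType", ""))
    size = int(f.get("size", 0))
    if size <= 0:
        problems.append("file is empty or size is unknown")
    elif size > MAX_BYTES:
        problems.append(f"file exceeds {MAX_BYTES} bytes")
    if not (mime.startswith(ALLOWED_PREFIXES) or mime in ALLOWED_TYPES):
        problems.append(f"unsupported MIME type: {mime or 'missing'}")
    return problems


ok = validate_file({"name": "a.pdf", "mimeType": "application/pdf", "size": 1024})
bad = validate_file({"name": "a.zip", "mimeType": "application/zip", "size": 1024})
print(ok)
print(bad)
```

Returning a list of problems rather than raising lets the workflow report every issue to the user in one reply.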

🛠️ Optional Improvements

  • Add logging and error handling (e.g., for upload failures).
  • Add MIME-type filtering to choose the right tool explicitly.
  • Extend to include OCR or transcription services pre-analysis.
  • Integrate with Slack, Telegram, or WhatsApp for chat delivery.
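
The MIME-type filtering idea above could be sketched as a small lookup that picks a tool name before the agent runs. The function `pick_tool` is hypothetical, and sending PDFs and text files to the DOCUMENT tool is an assumption about which formats that tool accepts.

```python
from typing import Optional


def pick_tool(mime_type: str) -> Optional[str]:
    """Map a MIME type to one of the workflow's four tool names."""
    major = mime_type.split("/", 1)[0].lower()
    by_major = {"image": "IMG", "video": "VIDEO", "audio": "AUDIO"}
    if major in by_major:
        return by_major[major]
    # Assumption: PDFs and text-like files go to the DOCUMENT tool.
    if mime_type.lower() == "application/pdf" or major == "text":
        return "DOCUMENT"
    return None  # unknown type: let the agent decide or reject the file


for mime in ("image/png", "audio/mpeg", "application/pdf", "application/zip"):
    print(mime, "->", pick_tool(mime))
```

Explicit routing trades some of the agent's flexibility for predictability, which may help when the text-only LLM picks the wrong tool.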

🧪 Example Use Case

"Hola, ¿qué dice este PDF?"

The user asks in Spanish what the PDF says and uploads a document → the agent routes it to the Gemini DOCUMENT tool → the tool returns the extracted content → the LLM summarizes it in Spanish.


🧰 Tags

multimodal, agent, langchain, groq, gemini, image analysis, audio analysis, document parsing, video analysis, file uploader, chat assistant, LLM tools, memory, AI tools

📂 Files

  • This template is ready to use as-is in n8n.
  • No external webhooks or integrations required.