Confluence On-Premises Integration Guide

This document explains how to configure and use InspectRAG's integration with Confluence On-Premises installations.

Overview

The Confluence On-Premises integration allows InspectRAG to index your Confluence spaces, pages, blog posts, comments, and attachments and to keep that index synchronized. It uses Confluence's webhook system to receive real-time notifications when content changes and processes each notification in an asynchronous task.

[Figure: Confluence On-Premises Integration Architecture]

Prerequisites

Before configuring the Confluence On-Premises integration, ensure you have:

  • Administrative access to your Confluence On-Premises installation
  • A Personal Access Token (PAT) with appropriate permissions
  • A Redis server for temporary data storage
  • Network connectivity between InspectRAG and your Confluence server

Configuration

Environment Variables

The following environment variables must be configured:

ONPREM_CONFLUENCE_HOST=https://confluence.your-company.com
ONPREM_CONFLUENCE_PAT=your_personal_access_token
RAG_REDIS_URL=redis://localhost:6379/0
ONPREM_WEBHOOK_SECRET=YOUR_WEBHOOK_SECRET
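
As a minimal sketch of how a service could read and validate these settings at startup, the snippet below checks that all four variables are present. The `load_confluence_config` helper is hypothetical and is not part of InspectRAG; only the variable names come from this guide.

```python
import os
from typing import Mapping

# Variable names as listed in this guide.
REQUIRED_VARS = (
    "ONPREM_CONFLUENCE_HOST",
    "ONPREM_CONFLUENCE_PAT",
    "RAG_REDIS_URL",
    "ONPREM_WEBHOOK_SECRET",
)


def load_confluence_config(env: Mapping[str, str] = os.environ) -> dict:
    """Return the required settings, raising if any is missing or empty.

    Hypothetical helper for illustration only.
    """
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED_VARS}
    # Strip a trailing slash so REST paths can be appended safely.
    config["ONPREM_CONFLUENCE_HOST"] = config["ONPREM_CONFLUENCE_HOST"].rstrip("/")
    return config
```

Failing fast on a missing variable tends to be easier to diagnose than an authentication error that surfaces later inside a background task.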

Required Permissions

The Confluence PAT needs the following permissions:

  • READ permissions on all spaces you want to index
  • Permission to view attachments
  • Permission to view comments

Setting Up Webhooks in Confluence

  1. Log in to your Confluence On-Premises instance as an administrator
  2. Navigate to General Configuration > Webhooks
  3. Click Create WebHook
  4. Configure the webhook:

    • Name: InspectRAG Integration
    • URL: https://your-inspectrag-instance.com/api/webhook/confluence-onprem
    • Status: Enabled
    • Events: Select the following events:

      • Page: created
      • Page: updated
      • Page: deleted
      • Page: trashed
      • Page: restored
      • Blog: created
      • Blog: updated
      • Blog: deleted
      • Blog: trashed
      • Blog: restored
      • Comment: created
      • Comment: updated
      • Comment: removed
      • Attachment: created
      • Attachment: updated
      • Attachment: removed
  5. Click Create to save the webhook configuration
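
This guide defines ONPREM_WEBHOOK_SECRET but does not describe how the receiving endpoint uses it. The sketch below shows one common pattern, assuming the sender signs the raw request body with HMAC-SHA256 and passes the result in a header of the form `sha256=<hex digest>`. Both the header format and the `verify_signature` helper are assumptions; check what your Confluence version actually sends before relying on this.

```python
import hashlib
import hmac


def verify_signature(secret: str, body: bytes, header_value: str) -> bool:
    """Check an assumed 'sha256=<hex>' signature over the raw request body.

    Hypothetical helper; the signature format is an assumption.
    """
    mac = hmac.new(secret.encode("utf-8"), body, hashlib.sha256)
    expected = "sha256=" + mac.hexdigest()
    # compare_digest runs in constant time to avoid leaking the signature.
    return hmac.compare_digest(
        expected.encode("utf-8"), (header_value or "").encode("utf-8")
    )
```

The comparison must run against the raw bytes of the request, before any JSON parsing, because re-serializing the payload can change its byte representation.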

Data Processing

Content Types

The integration processes the following Confluence content types:

| Content Type | Description | Storage Format |
| --- | --- | --- |
| Pages | Wiki pages in spaces | Full page content with metadata |
| Blog Posts | Blog entries in spaces | Full blog content with metadata |
| Comments | Comments on pages and blogs | Individual comment documents |
| Attachments | Files attached to pages | Processed based on file type |

Event Types

The integration handles the following Confluence event types:

| Event Type | Description | Processing Logic |
| --- | --- | --- |
| Content Created | New page, blog, or attachment | Indexes the content immediately |
| Content Updated | Edited page, blog, or attachment | Updates the indexed content |
| Content Deleted/Trashed | Removed content | Removes from the index |
| Content Restored | Restored from trash | Re-indexes the content |
| Comment Added | New comment | Indexes the comment immediately |
| Comment Updated | Edited comment | Updates the indexed comment |
| Comment Removed | Deleted comment | Removes from the index |
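
The mapping from event to processing logic can be sketched as a small lookup. The event name format used here (`page_created`, `blog_trashed`, and so on) is an assumption about the webhook payload, and `action_for` is a hypothetical helper; the action labels are shorthand for the processing logic listed above.

```python
# Verb-to-action lookup following the event table in this guide.
ACTIONS = {
    "created": "index",
    "updated": "update",
    "deleted": "remove",
    "removed": "remove",
    "trashed": "remove",
    "restored": "reindex",
}

CONTENT_TYPES = {"page", "blog", "comment", "attachment"}


def action_for(event: str) -> str:
    """Map an assumed event name such as 'page_trashed' to an action label."""
    content_type, _, verb = event.partition("_")
    if content_type not in CONTENT_TYPES or verb not in ACTIONS:
        raise ValueError(f"Unsupported event: {event!r}")
    return ACTIONS[verb]
```

Rejecting unknown events explicitly makes it easier to notice when a webhook is subscribed to events the integration does not handle.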

Metadata Capture

| Metadata Key | Description | Example |
| --- | --- | --- |
| file_id | Unique document identifier combining content type and ID | 12345-description |
| confluence_id (pages) | Original Confluence content ID | 67890 |
| source_type | Document type: description, comment, or attachment | description |
| platform | Fixed as confluence | confluence |
| user_id | Origin of the document (task/process) | webhook |
| digest | MD5 digest of the content or filename | adcf1234abcd5678... |
| title (pages) | Title of the page | Home Page |
| page_updated (pages) | Last updated timestamp | 2023-05-01T12:00:00.000Z |
| page_author (pages) | Display name of the author | John Doe |
| page_key (pages/comments/attachments) | Page ID for grouping content | 12345 |
| comment_id (comments) | ID of the comment | 98765 |
| comment_author (comments) | Display name of comment author | Jane Smith |
| comment_updated (comments) | Timestamp of comment update | 2023-05-02T09:30:00.000Z |
| attachment_id (attachments) | ID of the attachment | 54321 |
| attachment_filename (attachments) | File name | diagram.png |
| creator (attachments) | Creator of attachment | John Doe |
| attachment_created (attachments) | Timestamp of attachment creation | 2023-04-20T08:45:00.000Z |
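
To make the page-level keys concrete, the sketch below assembles a metadata record for a page. It assumes that `file_id` has the form `<id>-<source_type>`, as the example value suggests, and that `confluence_id` and `page_key` both carry the page ID for a page document. `build_page_metadata` is a hypothetical helper, not the integration's actual code.

```python
import hashlib


def build_page_metadata(
    page_id: str, title: str, author: str, updated: str, body: str
) -> dict:
    """Assemble page-level metadata using the keys listed in this guide.

    Hypothetical helper; the file_id layout is inferred from the example.
    """
    source_type = "description"
    return {
        "file_id": f"{page_id}-{source_type}",
        "confluence_id": page_id,
        "source_type": source_type,
        "platform": "confluence",
        "user_id": "webhook",
        # MD5 is used here as a change fingerprint, not for security.
        "digest": hashlib.md5(body.encode("utf-8")).hexdigest(),
        "title": title,
        "page_updated": updated,
        "page_author": author,
        "page_key": page_id,
    }
```

Comment and attachment records would follow the same pattern with their own keys (`comment_id`, `attachment_filename`, and so on) and a `page_key` pointing at the parent page.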

Delta Processing

The integration uses delta processing so that only changed content is re-indexed:

  1. Each received webhook triggers a Celery task
  2. For content updates, the task compares the new state with the previous state stored in Redis
  3. Only the changed components are re-indexed
  4. Large documents are processed in chunks to optimize embedding
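
The comparison and chunking steps above can be sketched as follows. This is a minimal illustration, assuming each component (page body, comment, attachment) is fingerprinted with a digest and that the previous digests are kept in a key-value store. A plain dict stands in for Redis here, and the key layout, function names, and chunk size are all hypothetical.

```python
import hashlib
from typing import Dict, Iterator, List, MutableMapping


def digest(text: str) -> str:
    """Fingerprint a component's content."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def changed_components(
    page_id: str,
    components: Dict[str, str],
    state: MutableMapping[str, str],
) -> List[str]:
    """Return the file_ids whose digest differs from the stored one.

    `state` stands in for Redis; the stored digest is updated in place.
    """
    changed = []
    for file_id, text in components.items():
        key = f"confluence:{page_id}:{file_id}"  # hypothetical key layout
        new_digest = digest(text)
        if state.get(key) != new_digest:
            changed.append(file_id)
            state[key] = new_digest
    return changed


def chunks(text: str, size: int = 1000) -> Iterator[str]:
    """Yield fixed-size slices of a large document."""
    for start in range(0, len(text), size):
        yield text[start:start + size]
```

This sketch only detects new and modified components. Components that disappear between updates, such as a deleted comment, would need separate handling, which the webhook's removal events cover.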

Handling Special Content

Macros

Confluence pages often contain macros that render dynamic content. The integration handles common macros by:

  1. Extracting text content from macros where possible
  2. Ignoring purely visual macros
  3. Processing table macros to preserve tabular data
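
As an illustration of the first two steps, the sketch below extracts text from Confluence storage-format markup while skipping the contents of macros treated as purely visual. It assumes macros appear as `ac:structured-macro` elements with an `ac:name` attribute; the set of "visual" macro names is a hypothetical example, and the integration's actual parser may work differently.

```python
from html.parser import HTMLParser

# Hypothetical list of macros treated as purely visual.
VISUAL_MACROS = {"toc", "children", "gallery"}


class MacroTextExtractor(HTMLParser):
    """Collect text, ignoring everything inside visual macros."""

    def __init__(self) -> None:
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped macro

    def handle_starttag(self, tag, attrs):
        if tag == "ac:structured-macro":
            name = dict(attrs).get("ac:name", "")
            # Nested macros inside a skipped macro are skipped too.
            if self._skip_depth or name in VISUAL_MACROS:
                self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "ac:structured-macro" and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(storage_markup: str) -> str:
    parser = MacroTextExtractor()
    parser.feed(storage_markup)
    parser.close()
    return " ".join(parser.parts)
```

Table handling is left out of this sketch; preserving tabular data would require tracking row and cell boundaries rather than flattening the text.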

Rich Content

The integration preserves the semantic structure of:

  • Headings and sections
  • Lists (ordered and unordered)
  • Tables
  • Code blocks with syntax highlighting
  • Information panels and notes

Troubleshooting

Common Issues

| Issue | Possible Causes | Solution |
| --- | --- | --- |
| Webhook events not received | Network connectivity, incorrect URL | Verify network settings, check Confluence webhook logs |
| Authentication failures | Invalid or expired PAT | Generate a new PAT in Confluence |
| Missing content | Permission issues | Ensure PAT has access to all relevant spaces |
| Attachment processing failures | Unsupported file types | Check logs for file format errors |
| Content not found | Recently deleted/moved content | Verify content exists and permissions are correct |

Logging

The integration logs each step of webhook processing. To get more detail when troubleshooting:

  1. Set a more verbose logging level (for example, DEBUG) in your settings
  2. Monitor the Celery task logs for webhook processing information

Testing Connection

You can test the connection to your Confluence instance with:

curl -H "Authorization: Bearer YOUR_PAT" https://confluence.your-company.com/rest/api/user/current

This should return your user information if the connection and authentication are working correctly.
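
The same check can be scripted in Python with the standard library, which is convenient for a startup health check. This is a sketch only; `build_current_user_request` and `fetch_current_user` are hypothetical helper names, while the endpoint and the Bearer header mirror the curl command above.

```python
import json
import urllib.request


def build_current_user_request(host: str, pat: str) -> urllib.request.Request:
    """Build the same request as the curl command above."""
    url = host.rstrip("/") + "/rest/api/user/current"
    headers = {"Authorization": f"Bearer {pat}", "Accept": "application/json"}
    return urllib.request.Request(url, headers=headers)


def fetch_current_user(host: str, pat: str, timeout: float = 10.0) -> dict:
    """Return the user record; raises urllib.error.HTTPError on 401/403."""
    request = build_current_user_request(host, pat)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

An HTTP 401 response typically points to an invalid or expired PAT, whereas a connection error or timeout points to a network problem between InspectRAG and Confluence.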

Performance Considerations

  • The integration uses asynchronous processing to handle high volumes of updates
  • Large pages are processed in chunks to optimize memory usage
  • Redis is used as temporary storage to track content state between updates
  • File attachments are downloaded to a temporary directory and removed after processing
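
The temporary-directory pattern for attachments can be sketched as below. It assumes the attachment bytes have already been downloaded; `process_attachment` and its `handler` callback are hypothetical names used for illustration.

```python
import tempfile
from pathlib import Path
from typing import Callable


def process_attachment(
    filename: str, data: bytes, handler: Callable[[Path], str]
) -> str:
    """Write an attachment to a temporary directory, process it, clean up.

    Hypothetical helper; `handler` stands in for file-type-specific parsing.
    """
    with tempfile.TemporaryDirectory(prefix="confluence-att-") as tmp:
        # Keep only the base name so a crafted filename cannot escape tmp.
        safe_name = Path(filename).name or "attachment"
        path = Path(tmp) / safe_name
        path.write_bytes(data)
        return handler(path)
    # The directory and its contents are deleted when the block exits,
    # including when the handler raises.
```

Using a context manager means cleanup also happens on failure, so unsupported or corrupt files do not accumulate on disk.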

Security Considerations

  • All credentials are stored as environment variables, not in code
  • PATs should be configured with the minimum required permissions
  • Network traffic between InspectRAG and Confluence should be encrypted
  • Temporary files are removed after processing
  • Access to indexed content respects original Confluence permissions

Limitations

  • Some complex macros may not be fully processed
  • Very large attachments may take longer to process
  • Rate limiting may apply based on your Confluence instance configuration
  • Content in archived spaces may require manual indexing