Confluence On-Premises Integration Guide

This document explains how to configure and use InspectRAG's integration with Confluence On-Premises installations.

Overview

The Confluence On-Premises integration allows InspectRAG to index your Confluence spaces, pages, blog posts, comments, and attachments and to keep that index synchronized. It uses Confluence's webhook system to receive real-time notifications when content changes and processes each notification in an asynchronous task.

[Figure: Confluence On-Premises Integration Architecture]

Prerequisites

Before configuring the Confluence On-Premises integration, ensure you have:

Administrative access to your Confluence On-Premises installation
A Personal Access Token (PAT) with appropriate permissions
A Redis server for temporary data storage
Network connectivity between InspectRAG and your Confluence server

Configuration

Environment Variables

The following environment variables must be configured:

ONPREM_CONFLUENCE_HOST=https://confluence.your-company.com
ONPREM_CONFLUENCE_PAT=your_personal_access_token
RAG_REDIS_URL=redis://localhost:6379/0
ONPREM_WEBHOOK_SECRET=YOUR_WEBHOOK_SECRET
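
As a minimal sketch of how a service might read these settings at startup, the helper below checks that all four variables are present and fails fast otherwise. The function name `load_confluence_settings` is hypothetical and is not part of InspectRAG; only the variable names come from this guide.

```python
import os

REQUIRED_VARS = (
    "ONPREM_CONFLUENCE_HOST",
    "ONPREM_CONFLUENCE_PAT",
    "RAG_REDIS_URL",
    "ONPREM_WEBHOOK_SECRET",
)


def load_confluence_settings(env=None):
    """Return the required settings as a dict, or raise if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED_VARS}
    # Drop a trailing slash so REST paths can be appended consistently.
    settings["ONPREM_CONFLUENCE_HOST"] = settings["ONPREM_CONFLUENCE_HOST"].rstrip("/")
    return settings
```

Calling `load_confluence_settings()` with no argument reads from the process environment; passing a dict makes the check easy to exercise in tests.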

Required Permissions

The Confluence PAT needs the following permissions:

READ permissions on all spaces you want to index
Permission to view attachments
Permission to view comments

Setting Up Webhooks in Confluence

Log in to your Confluence On-Premises instance as an administrator
Navigate to General Configuration > Webhooks
Click Create WebHook
Configure the webhook:
- Name: InspectRAG Integration
- URL: https://your-inspectrag-instance.com/api/webhook/confluence-onprem
- Status: Enabled
- Events: Select the following events:
  - Page: created
  - Page: updated
  - Page: deleted
  - Page: trashed
  - Page: restored
  - Blog: created
  - Blog: updated
  - Blog: deleted
  - Blog: trashed
  - Blog: restored
  - Comment: created
  - Comment: updated
  - Comment: removed
  - Attachment: created
  - Attachment: updated
  - Attachment: removed
Click Create to save the webhook configuration
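
This guide does not describe how `ONPREM_WEBHOOK_SECRET` is checked on the receiving side. One common scheme, assumed here rather than confirmed, is that the sender signs the raw request body with HMAC-SHA256 and sends the result in a header such as `X-Hub-Signature` in the form `sha256=<hex digest>`. Under that assumption, a verification helper could look like the sketch below; confirm the header name and format against your Confluence version before relying on it.

```python
import hashlib
import hmac


def signature_is_valid(secret: str, body: bytes, header_value) -> bool:
    """Check a signature header of the assumed form 'sha256=<hex digest>'."""
    expected = "sha256=" + hmac.new(
        secret.encode("utf-8"), body, hashlib.sha256
    ).hexdigest()
    received = header_value or ""
    # compare_digest runs in constant time, so timing does not reveal
    # how much of the signature matched.
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))
```

The check must run on the exact bytes received, before any JSON parsing or re-serialization, because the signature covers the raw body.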

Data Processing

Content Types

The integration processes the following Confluence content types:

Content Type	Description	Storage Format
Pages	Wiki pages in spaces	Full page content with metadata
Blog Posts	Blog entries in spaces	Full blog content with metadata
Comments	Comments on pages and blogs	Individual comment documents
Attachments	Files attached to pages	Processed based on file type

Event Types

The integration handles the following Confluence event types:

Event Type	Description	Processing Logic
Content Created	New page, blog, or attachment	Indexes the content immediately
Content Updated	Edited page, blog, or attachment	Updates the indexed content
Content Deleted/Trashed	Removed content	Removes from the index
Content Restored	Restored from trash	Re-indexes the content
Comment Added	New comment	Indexes the comment immediately
Comment Updated	Edited comment	Updates the indexed comment
Comment Removed	Deleted comment	Removes from the index
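
The table above can be read as a mapping from an event to an indexing action. The sketch below illustrates that mapping, assuming event names arrive in the form `<content type>_<verb>` (for example `page_created`); the exact names depend on your Confluence version, and `route_event` is a hypothetical helper.

```python
# Verb-to-action mapping taken from the processing logic in the table.
ACTIONS = {
    "created": "index",
    "updated": "update",
    "deleted": "remove",
    "removed": "remove",
    "trashed": "remove",
    "restored": "reindex",
}
CONTENT_TYPES = {"page", "blog", "comment", "attachment"}


def route_event(event_name: str):
    """Split e.g. 'page_updated' into ('page', 'update'); return None if unhandled."""
    content_type, _, verb = event_name.partition("_")
    if content_type not in CONTENT_TYPES or verb not in ACTIONS:
        return None
    return content_type, ACTIONS[verb]
```

Returning `None` for unknown events lets the webhook endpoint acknowledge them without queuing any work.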

Metadata Capture

Metadata Key	Description	Example
`file_id`	Unique document identifier combining content type and ID	`12345-description`
`confluence_id`	(pages) Original Confluence content ID	`67890`
`source_type`	Document type: description, comment, or attachment	`description`
`platform`	Fixed as `confluence`	`confluence`
`user_id`	Origin of the document (task/process)	`webhook`
`digest`	MD5 digest of the content or filename	`adcf1234abcd5678...`
`title`	(pages) Title of the page	`Home Page`
`page_updated`	(pages) Last updated timestamp	`2023-05-01T12:00:00.000Z`
`page_author`	(pages) Display name of the author	`John Doe`
`page_key`	(pages/comments/attachments) Page ID for grouping content	`12345`
`comment_id`	(comments) ID of the comment	`98765`
`comment_author`	(comments) Display name of comment author	`Jane Smith`
`comment_updated`	(comments) Timestamp of comment update	`2023-05-02T09:30:00.000Z`
`attachment_id`	(attachments) ID of the attachment	`54321`
`attachment_filename`	(attachments) File name	`diagram.png`
`creator`	(attachments) Creator of attachment	`John Doe`
`attachment_created`	(attachments) Timestamp of attachment creation	`2023-04-20T08:45:00.000Z`
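
To make the page-level fields concrete, the sketch below assembles them for a single page. It assumes that `file_id` is built as `<page ID>-<source_type>` (matching the `12345-description` example) and that `page_key` is the page ID; the function name and its parameters are hypothetical.

```python
import hashlib


def build_page_metadata(page_id, title, author, updated, content):
    """Assemble the page-level metadata fields listed in the table."""
    source_type = "description"
    return {
        "file_id": f"{page_id}-{source_type}",  # assumed format
        "confluence_id": str(page_id),
        "source_type": source_type,
        "platform": "confluence",
        "user_id": "webhook",
        # MD5 is used here as a change-detection fingerprint, not for security.
        "digest": hashlib.md5(content.encode("utf-8")).hexdigest(),
        "title": title,
        "page_updated": updated,
        "page_author": author,
        "page_key": str(page_id),
    }
```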

Delta Processing

The integration re-indexes only what has changed:

Each received webhook triggers a Celery task
For content updates, the task compares the new state with the previous state stored in Redis
Only changed components are re-indexed
For large documents, content is processed in chunks to optimize embedding
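
The comparison step can be sketched as follows. A plain dict stands in for the state kept in Redis, the component keys (`description`, `comment:<id>`) are hypothetical, and the fixed-size character chunking is only a placeholder for whatever chunking strategy the integration actually uses.

```python
import hashlib


def digest(text: str) -> str:
    """Fingerprint used to detect whether a component changed."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def changed_components(previous: dict, current: dict):
    """Compare stored digests with fresh content.

    previous: component key -> digest saved after the last run
    current:  component key -> latest text
    Returns (keys to re-index, keys to remove, new digest state to store).
    """
    new_state = {key: digest(text) for key, text in current.items()}
    to_index = sorted(k for k, d in new_state.items() if previous.get(k) != d)
    to_remove = sorted(k for k in previous if k not in new_state)
    return to_index, to_remove, new_state


def chunk_text(text: str, size: int = 1000):
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]
```

In this sketch an unchanged comment keeps its digest and is skipped, while a new or edited component is queued for indexing and a component that disappeared is queued for removal.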

Handling Special Content

Macros

Confluence pages often contain macros that render dynamic content. The integration handles common macros by:

Extracting text content from macros where possible
Ignoring purely visual macros
Processing table macros to preserve tabular data
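
As an illustration of the first two points, the sketch below pulls text out of Confluence storage-format markup with Python's standard `html.parser`, dropping the content of macros treated as purely visual and skipping macro parameters. The `VISUAL_MACROS` set and the class name are assumptions made for this example, not the integration's actual list or implementation.

```python
from html.parser import HTMLParser

# Hypothetical set of macros treated as purely visual.
VISUAL_MACROS = {"toc", "children", "pagetree"}


class MacroAwareTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
        self.macro_skip = []  # one bool per open macro: is its content skipped?
        self.param_depth = 0  # > 0 while inside an <ac:parameter> element

    def handle_starttag(self, tag, attrs):
        if tag == "ac:structured-macro":
            name = dict(attrs).get("ac:name", "")
            parent_skipped = bool(self.macro_skip) and self.macro_skip[-1]
            self.macro_skip.append(parent_skipped or name in VISUAL_MACROS)
        elif tag == "ac:parameter":
            self.param_depth += 1

    def handle_endtag(self, tag):
        if tag == "ac:structured-macro" and self.macro_skip:
            self.macro_skip.pop()
        elif tag == "ac:parameter" and self.param_depth:
            self.param_depth -= 1

    def handle_data(self, data):
        skipped = (self.macro_skip and self.macro_skip[-1]) or self.param_depth
        text = data.strip()
        if text and not skipped:
            self.parts.append(text)


def extract_text(storage_xhtml: str) -> str:
    """Return the visible text, with visual macros and parameters removed."""
    parser = MacroAwareTextExtractor()
    parser.feed(storage_xhtml)
    parser.close()
    return " ".join(parser.parts)
```

A production extractor would also need to handle CDATA bodies (used by code macros) and table structure, which this sketch leaves out.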

Rich Content

The integration preserves the semantic structure of:

Headings and sections
Lists (ordered and unordered)
Tables
Code blocks with syntax highlighting
Information panels and notes

Troubleshooting

Common Issues

Issue	Possible Causes	Solution
Webhook events not received	Network connectivity, incorrect URL	Verify network settings, check Confluence webhook logs
Authentication failures	Invalid or expired PAT	Generate a new PAT in Confluence
Missing content	Permission issues	Ensure PAT has access to all relevant spaces
Attachment processing failures	Unsupported file types	Check logs for file format errors
Content not found	Recently deleted/moved content	Verify content exists and permissions are correct

Logging

The integration logs detailed information about each step of the process. To investigate a problem in more detail:

Adjust the logging level in your settings
Monitor the Celery task logs for webhook processing information
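
If your deployment configures logging through Python's standard `logging` module, raising the level for one logger might look like the sketch below. The logger name `inspectrag.confluence` is a guess used for illustration; substitute the name your installation actually uses.

```python
import logging


def enable_debug_logging(logger_name: str = "inspectrag.confluence") -> logging.Logger:
    """Raise one logger to DEBUG and make sure it has a console handler."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```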

Testing Connection

You can test the connection to your Confluence instance with:

curl -H "Authorization: Bearer YOUR_PAT" https://confluence.your-company.com/rest/api/user/current

This should return your user information if the connection and authentication are working correctly.
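
The same check can be scripted. The sketch below builds the equivalent request with Python's standard library; the helper names are hypothetical, and the host and token are the placeholders used in this guide.

```python
import json
import urllib.request


def build_current_user_request(host: str, pat: str) -> urllib.request.Request:
    """Build the same GET request as the curl command."""
    url = host.rstrip("/") + "/rest/api/user/current"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {pat}", "Accept": "application/json"},
    )


def fetch_current_user(host: str, pat: str, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON body.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses,
    for example when the PAT is invalid or expired.
    """
    request = build_current_user_request(host, pat)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `fetch_current_user("https://confluence.your-company.com", "YOUR_PAT")` returns the decoded user information when the connection and token are valid.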

Performance Considerations

The integration uses asynchronous processing to handle high volumes of updates
Large pages are processed in chunks to optimize memory usage
Redis is used as temporary storage to track content state between updates
File attachments are downloaded to a temporary directory and removed after processing
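
The temporary-directory pattern from the last point can be sketched with the standard library as follows. `process_attachment` and its `handler` callback are hypothetical names; the point is that cleanup happens even when processing fails.

```python
import tempfile
from pathlib import Path


def process_attachment(filename: str, data: bytes, handler):
    """Write `data` to a temporary directory, run `handler(path)`, then clean up."""
    # The directory and everything in it are deleted when the with-block
    # exits, including when handler raises an exception.
    with tempfile.TemporaryDirectory(prefix="confluence-attachment-") as tmp_dir:
        # Keep only the base name so a crafted filename cannot escape tmp_dir.
        safe_name = Path(filename).name or "attachment"
        path = Path(tmp_dir) / safe_name
        path.write_bytes(data)
        return handler(path)
```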

Security Considerations

All credentials are stored as environment variables, not in code
PATs should be configured with the minimum required permissions
Network traffic between InspectRAG and Confluence should be encrypted
Temporary files are removed after processing
Access to indexed content respects original Confluence permissions

Limitations

Some complex macros may not be fully processed
Very large attachments may take longer to process
Rate limiting may apply based on your Confluence instance configuration
Content in archived spaces may require manual indexing