Confluence On-Premises Integration Guide
This document explains how to configure and use InspectRAG's integration with Confluence On-Premises installations.
Overview
The Confluence On-Premises integration allows InspectRAG to index your Confluence spaces, pages, blog posts, comments, and attachments and keep them synchronized. It uses Confluence's webhook system to receive real-time notifications when content changes and processes them as asynchronous tasks.
Prerequisites
Before configuring the Confluence On-Premises integration, ensure you have:
- Administrative access to your Confluence On-Premises installation
- A Personal Access Token (PAT) with appropriate permissions
- A Redis server for temporary data storage
- Network connectivity between InspectRAG and your Confluence server
Configuration
Environment Variables
The following environment variables must be configured:
ONPREM_CONFLUENCE_HOST=https://confluence.your-company.com
ONPREM_CONFLUENCE_PAT=your_personal_access_token
RAG_REDIS_URL=redis://localhost:6379/0
ONPREM_WEBHOOK_SECRET=YOUR_WEBHOOK_SECRET
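A missing variable is easier to diagnose at startup than inside a failing task. The following is a minimal sketch of a startup check for the four variables listed above; `ConfluenceSettings` and `load_settings` are hypothetical names used for illustration, not part of InspectRAG.

```python
import os
from dataclasses import dataclass

# The four variables listed in this guide.
REQUIRED_VARS = (
    "ONPREM_CONFLUENCE_HOST",
    "ONPREM_CONFLUENCE_PAT",
    "RAG_REDIS_URL",
    "ONPREM_WEBHOOK_SECRET",
)


@dataclass(frozen=True)
class ConfluenceSettings:
    """Hypothetical settings container (an assumption, not InspectRAG's own class)."""
    host: str
    pat: str
    redis_url: str
    webhook_secret: str


def load_settings(env=None) -> ConfluenceSettings:
    """Read the required variables and fail early if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return ConfluenceSettings(
        host=env["ONPREM_CONFLUENCE_HOST"].rstrip("/"),  # avoid double slashes in URLs
        pat=env["ONPREM_CONFLUENCE_PAT"],
        redis_url=env["RAG_REDIS_URL"],
        webhook_secret=env["ONPREM_WEBHOOK_SECRET"],
    )
```

Calling `load_settings()` with no argument reads from the process environment; passing a plain dictionary makes the check easy to exercise in tests.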
Required Permissions
The Confluence PAT needs the following permissions:
- READ permissions on all spaces you want to index
- Permission to view attachments
- Permission to view comments
Setting Up Webhooks in Confluence
- Log in to your Confluence On-Premises instance as an administrator
- Navigate to General Configuration > Webhooks
- Click Create WebHook
- Configure the webhook:
- Name: InspectRAG Integration
- URL: https://your-inspectrag-instance.com/api/webhook/confluence-onprem
- Status: Enabled
- Events: Select the following events:
- Page: created
- Page: updated
- Page: deleted
- Page: trashed
- Page: restored
- Blog: created
- Blog: updated
- Blog: deleted
- Blog: trashed
- Blog: restored
- Comment: created
- Comment: updated
- Comment: removed
- Attachment: created
- Attachment: updated
- Attachment: removed
- Click Create to save the webhook configuration
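The `ONPREM_WEBHOOK_SECRET` variable suggests that incoming webhook requests are authenticated with a shared secret, although this guide does not state the exact scheme. One common approach is an HMAC-SHA256 signature over the raw request body, sent in a header value of the form `sha256=<hex digest>`. The sketch below assumes that format; the function name and the header layout are illustrative assumptions.

```python
import hashlib
import hmac


def verify_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Check an HMAC-SHA256 signature of the raw body.

    Assumes the header value looks like 'sha256=<hex digest>'; this format
    is an assumption and is not confirmed by this guide.
    """
    if not signature_header or "=" not in signature_header:
        return False
    algorithm, _, received = signature_header.partition("=")
    if algorithm.strip().lower() != "sha256":
        return False
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes to avoid leaking timing information.
    return hmac.compare_digest(
        expected.encode("ascii"),
        received.strip().lower().encode("utf-8"),
    )
```

The signature must be computed over the raw bytes of the request, before any JSON parsing, because re-serializing the payload can change its byte representation.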
Data Processing
Content Types
The integration processes the following Confluence content types:
Content Type | Description | Storage Format |
---|---|---|
Pages | Wiki pages in spaces | Full page content with metadata |
Blog Posts | Blog entries in spaces | Full blog content with metadata |
Comments | Comments on pages and blogs | Individual comment documents |
Attachments | Files attached to pages | Processed based on file type |
Event Types
The integration handles the following Confluence event types:
Event Type | Description | Processing Logic |
---|---|---|
Content Created | New page, blog, or attachment | Indexes the content immediately |
Content Updated | Edited page, blog, or attachment | Updates the indexed content |
Content Deleted/Trashed | Removed content | Removes from the index |
Content Restored | Restored from trash | Re-indexes the content |
Comment Added | New comment | Indexes the comment immediately |
Comment Updated | Edited comment | Updates the indexed comment |
Comment Removed | Deleted comment | Removes from the index |
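The table above amounts to a mapping from event type to indexing action. As a rough sketch, assuming events arrive with names of the form `page_created` or `comment_removed` (the naming and the action labels here are assumptions for illustration), the dispatch could look like this:

```python
from typing import Optional, Tuple

# Hypothetical action labels mirroring the "Processing Logic" column.
ACTIONS = {
    "created": "index",
    "updated": "update",
    "deleted": "remove",
    "removed": "remove",
    "trashed": "remove",
    "restored": "reindex",
}
CONTENT_KINDS = {"page", "blog", "comment", "attachment"}


def resolve_action(event_name: str) -> Optional[Tuple[str, str]]:
    """Map an event name such as 'page_trashed' to (content kind, action).

    Returns None for events the integration does not handle.
    """
    kind, _, verb = event_name.partition("_")
    if kind not in CONTENT_KINDS or verb not in ACTIONS:
        return None
    return kind, ACTIONS[verb]
```

Returning `None` for unknown events lets the webhook endpoint acknowledge them without queueing work.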
Metadata Capture
Metadata Key | Description | Example |
---|---|---|
file_id | Unique document identifier combining content type and ID | 12345-description |
confluence_id | (pages) Original Confluence content ID | 67890 |
source_type | Document type: description, comment, or attachment | description |
platform | Fixed as confluence | confluence |
user_id | Origin of the document (task/process) | webhook |
digest | MD5 digest of the content or filename | adcf1234abcd5678... |
title | (pages) Title of the page | Home Page |
page_updated | (pages) Last updated timestamp | 2023-05-01T12:00:00.000Z |
page_author | (pages) Display name of the author | John Doe |
page_key | (pages/comments/attachments) Page ID for grouping content | 12345 |
comment_id | (comments) ID of the comment | 98765 |
comment_author | (comments) Display name of comment author | Jane Smith |
comment_updated | (comments) Timestamp of comment update | 2023-05-02T09:30:00.000Z |
attachment_id | (attachments) ID of the attachment | 54321 |
attachment_filename | (attachments) File name | diagram.png |
creator | (attachments) Creator of attachment | John Doe |
attachment_created | (attachments) Timestamp of attachment creation | 2023-04-20T08:45:00.000Z |
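To make the table concrete, here is a sketch of how the metadata for a page document might be assembled. The function name is hypothetical, and the sketch assumes that `confluence_id` and `page_key` both carry the page ID, which the table does not confirm.

```python
import hashlib


def build_page_metadata(page_id: str, title: str, author: str,
                        updated: str, body: str) -> dict:
    """Assemble the page-level metadata keys described in the table.

    Assumption: confluence_id and page_key are both set to the page ID.
    """
    return {
        "file_id": f"{page_id}-description",  # ID plus document type
        "confluence_id": page_id,
        "source_type": "description",
        "platform": "confluence",
        "user_id": "webhook",
        "digest": hashlib.md5(body.encode("utf-8")).hexdigest(),
        "title": title,
        "page_updated": updated,
        "page_author": author,
        "page_key": page_id,
    }
```

Comment and attachment documents would carry their own keys (`comment_id`, `attachment_filename`, and so on) alongside the shared `page_key`.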
Delta Processing
The integration uses an efficient delta processing approach:
- Each received webhook triggers a Celery task
- For content updates, the task compares the new state with the previous state stored in Redis
- Only changed components are re-indexed
- For large documents, content is processed in chunks to optimize embedding
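The steps above can be sketched as a digest comparison followed by chunking. In this illustration a plain dictionary stands in for the state kept in Redis, and the function names, chunk size, and overlap are assumptions rather than InspectRAG's actual values.

```python
import hashlib


def digest(text: str) -> str:
    """MD5 digest of a component's content."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def changed_components(previous: dict, current: dict) -> dict:
    """Compare stored digests with the current content.

    previous: component ID -> digest (stand-in for the Redis state)
    current:  component ID -> text
    """
    new_state = {key: digest(text) for key, text in current.items()}
    to_index = sorted(k for k, d in new_state.items() if previous.get(k) != d)
    to_remove = sorted(k for k in previous if k not in new_state)
    return {"index": to_index, "remove": to_remove, "state": new_state}


def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list:
    """Split text into chunks of at most `size` characters with some overlap."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += size - overlap
    return chunks
```

Only the components listed under `index` need to be embedded again; the returned `state` would replace the previous state once processing succeeds.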
Handling Special Content
Macros
Confluence pages often contain macros that render dynamic content. The integration handles common macros by:
- Extracting text content from macros where possible
- Ignoring purely visual macros
- Processing table macros to preserve tabular data
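As an illustration of the first two points, the sketch below walks Confluence storage-format markup with Python's standard `html.parser`, keeps text found inside macro bodies, and skips macros treated as purely visual. The set of visual macro names and the class name are assumptions; the integration's real rules are not documented here.

```python
from html.parser import HTMLParser

# Hypothetical list of macros treated as purely visual.
VISUAL_MACROS = {"toc", "anchor"}


class MacroTextExtractor(HTMLParser):
    """Collect text from storage-format markup, skipping visual macros."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._macro_stack = []  # names of currently open macros
        self._skip_depth = 0    # > 0 inside a visual macro or a parameter

    def handle_starttag(self, tag, attrs):
        if tag == "ac:structured-macro":
            name = dict(attrs).get("ac:name") or ""
            self._macro_stack.append(name)
            if name in VISUAL_MACROS:
                self._skip_depth += 1
        elif tag == "ac:parameter":
            # Macro parameters are settings, not readable content.
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "ac:structured-macro" and self._macro_stack:
            if self._macro_stack.pop() in VISUAL_MACROS:
                self._skip_depth -= 1
        elif tag == "ac:parameter" and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def extract_text(storage_markup: str) -> str:
    parser = MacroTextExtractor()
    parser.feed(storage_markup)
    parser.close()
    return " ".join(parser.parts)
```

This sketch flattens everything to plain text; preserving tables or headings would need additional handling per tag.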
Rich Content
The integration preserves the semantic structure of:
- Headings and sections
- Lists (ordered and unordered)
- Tables
- Code blocks with syntax highlighting
- Information panels and notes
Troubleshooting
Common Issues
Issue | Possible Causes | Solution |
---|---|---|
Webhook events not received | Network connectivity, incorrect URL | Verify network settings, check Confluence webhook logs |
Authentication failures | Invalid or expired PAT | Generate a new PAT in Confluence |
Missing content | Permission issues | Ensure PAT has access to all relevant spaces |
Attachment processing failures | Unsupported file types | Check logs for file format errors |
Content not found | Recently deleted/moved content | Verify content exists and permissions are correct |
Logging
The integration logs detailed information about each step of the process. To increase logging verbosity:
- Adjust the logging level in your settings
- Monitor the Celery task logs for webhook processing information
Testing Connection
You can test the connection to your Confluence instance by sending an authenticated request with your PAT.
The request should return your user information if the connection and authentication are working correctly.
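One way to run such a check from Python is sketched below. It assumes the Confluence REST resource `/rest/api/user/current` and Bearer-token authentication with the PAT. The helper names are hypothetical, and the exact command originally intended by this guide is not known.

```python
import json
import urllib.request


def build_current_user_request(host: str, pat: str) -> urllib.request.Request:
    """Build a request for the current user, authenticated with the PAT.

    Assumes the /rest/api/user/current resource is available on the instance.
    """
    url = host.rstrip("/") + "/rest/api/user/current"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {pat}", "Accept": "application/json"},
    )


def fetch_current_user(host: str, pat: str, timeout: float = 10.0) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_current_user_request(host, pat)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_current_user(os.environ["ONPREM_CONFLUENCE_HOST"], os.environ["ONPREM_CONFLUENCE_PAT"])` raises `urllib.error.HTTPError` with status 401 when the token is rejected, and `urllib.error.URLError` when the server cannot be reached.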
Performance Considerations
- The integration uses asynchronous processing to handle high volumes of updates
- Large pages are processed in chunks to optimize memory usage
- Redis is used as temporary storage to track content state between updates
- File attachments are downloaded to a temporary directory and removed after processing
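The temporary-file handling in the last point can be sketched with the standard library's `tempfile.TemporaryDirectory`, which deletes the directory even if processing raises an exception. The function name and the `handler` callback are hypothetical.

```python
import tempfile
from pathlib import Path


def process_attachment(filename: str, data: bytes, handler):
    """Write an attachment to a temporary directory, process it, then clean up.

    `handler` is a hypothetical callable that receives the file path.
    """
    with tempfile.TemporaryDirectory(prefix="confluence-attachment-") as tmp:
        # Keep only the base name so a crafted file name cannot escape tmp.
        path = Path(tmp) / Path(filename).name
        path.write_bytes(data)
        return handler(path)
    # The directory and its contents are removed when the block exits.
```

Because cleanup is tied to the `with` block, the handler must finish reading the file before it returns.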
Security Considerations
- All credentials are stored as environment variables, not in code
- PATs should be configured with the minimum required permissions
- Network traffic between InspectRAG and Confluence should be encrypted
- Temporary files are removed after processing
- Access to indexed content respects original Confluence permissions
Limitations
- Some complex macros may not be fully processed
- Very large attachments may take longer to process
- Rate limiting may apply based on your Confluence instance configuration
- Content in archived spaces may require manual indexing