
InspectRAG S3 Bucket Integration Guide

Overview

This guide explains how to configure and use InspectRAG's integration with Amazon S3 buckets. The integration lets InspectRAG index your S3 objects and keep that index synchronized, automatically processing files when they are created, updated, or deleted.

Prerequisites

Before configuring the S3 bucket integration, ensure you have:

  • An AWS account with appropriate permissions
  • Access to create or modify S3 buckets
  • AWS credentials (Access Key ID and Secret Access Key)
  • InspectRAG application installed and configured

Step 1: Set Up AWS Credentials

  1. Create an IAM user with programmatic access in your AWS account
  2. Attach policies that allow S3 bucket access (e.g., AmazonS3ReadOnlyAccess for read-only integration, or more specific permissions as needed)
  3. Generate and securely store the Access Key ID and Secret Access Key
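
As an illustration of "more specific permissions", the following sketch builds a policy scoped to a single bucket. It assumes the integration only needs to list the bucket and read its objects; the exact set of actions InspectRAG requires is not stated in this guide, so treat the action list and the helper name `build_read_only_policy` as assumptions.

```python
import json


def build_read_only_policy(bucket_name: str) -> dict:
    """Return an IAM policy document granting read-only access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
            {
                # Reading is granted on the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
        ],
    }


# JSON text that can be pasted into the IAM policy editor.
policy_json = json.dumps(build_read_only_policy("company-documents"), indent=2)
```

Scoping the policy to one bucket is narrower than AmazonS3ReadOnlyAccess, which grants read access to every bucket in the account.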

Step 2: Configure Environment Variables

In your InspectRAG installation directory, locate the environment configuration file (.env) and add the following variables:

# AWS Credentials
AWS_ACCESS_KEY_ID=your_access_key_id
AWS_SECRET_ACCESS_KEY=your_secret_access_key
AWS_REGION=us-east-1

# S3 Configuration
S3_BUCKET_NAME=your-inspectrag-bucket
S3_DEFAULT_PERMISSIONS=public,authenticated
S3_DEFAULT_CREATOR=s3_webhook

Environment Variable Details

| Variable | Description | Example |
|---|---|---|
| AWS_ACCESS_KEY_ID | Your AWS access key | AKIAIOSFODNN7EXAMPLE |
| AWS_SECRET_ACCESS_KEY | Your AWS secret key | wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY |
| AWS_REGION | AWS region where your bucket is located | us-east-1 |
| S3_BUCKET_NAME | Name of the S3 bucket to integrate | company-documents |
| S3_DEFAULT_PERMISSIONS | Comma-separated list of default permissions | public,authenticated |
| S3_DEFAULT_CREATOR | Default creator identifier for S3 objects | s3_webhook |
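
To show how these variables might be consumed, here is a minimal sketch that reads them from the environment and splits the comma-separated permissions into a list. The `S3Settings` class, the `load_s3_settings` function, the choice of which variables are required, and the fallback values are all assumptions for illustration; InspectRAG's own loader may behave differently.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class S3Settings:
    access_key_id: str
    secret_access_key: str
    region: str
    bucket_name: str
    default_permissions: List[str]
    default_creator: str


def load_s3_settings(env: Mapping[str, str] = os.environ) -> S3Settings:
    """Read the S3 variables listed above; raise if a required one is missing."""
    # Assumption: credentials and bucket name are mandatory, the rest optional.
    required = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "S3_BUCKET_NAME"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError(f"Missing environment variables: {', '.join(missing)}")

    # Split the comma-separated list, dropping empty entries and stray spaces.
    raw_permissions = env.get("S3_DEFAULT_PERMISSIONS", "")
    permissions = [p.strip() for p in raw_permissions.split(",") if p.strip()]

    return S3Settings(
        access_key_id=env["AWS_ACCESS_KEY_ID"],
        secret_access_key=env["AWS_SECRET_ACCESS_KEY"],
        region=env.get("AWS_REGION", "us-east-1"),  # assumed fallback
        bucket_name=env["S3_BUCKET_NAME"],
        default_permissions=permissions,
        default_creator=env.get("S3_DEFAULT_CREATOR", "s3_webhook"),  # assumed fallback
    )
```

Failing early on a missing variable tends to produce a clearer error than an authentication failure later during indexing.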

Step 3: Set Up S3 Event Notifications

For real-time updates, you need to configure S3 event notifications to trigger the InspectRAG webhook:

  1. Go to the AWS Management Console and navigate to the S3 service
  2. Select your bucket and go to the "Properties" tab
  3. Scroll down to "Event notifications" and click "Create event notification"
  4. Configure the notification:
    • Name: InspectRAGIntegration
    • Event types: Select "All object create events" and "All object delete events"
    • Destination: Select "Lambda Function" or "SQS Queue" based on your setup
  5. If using Lambda, create a Lambda function that forwards events to your InspectRAG webhook endpoint
  6. If using SQS, configure your InspectRAG application to poll the queue

For robust S3 event handling, we recommend using AWS Lambda to forward S3 events to your InspectRAG webhook.
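
To make the forwarding idea concrete, here is a minimal sketch of what a Lambda forwarder of this kind might look like. It is not the lambda_function.py shipped with InspectRAG: the JSON payload shape, the `extract_records` helper, and the `WEBHOOK_URL` constant are assumptions, and the format your InspectRAG webhook expects may differ.

```python
import json
import urllib.parse
import urllib.request

# Placeholder URL; replace with your InspectRAG webhook endpoint.
WEBHOOK_URL = "https://your-rag-api-url/s3-webhook"


def extract_records(event: dict) -> list:
    """Reduce an S3 notification event to the fields a webhook is likely to need."""
    records = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        records.append(
            {
                "event_name": record["eventName"],
                "bucket_name": s3_info["bucket"]["name"],
                # S3 URL-encodes object keys in notifications (spaces arrive as "+").
                "object_key": urllib.parse.unquote_plus(s3_info["object"]["key"]),
            }
        )
    return records


def lambda_handler(event, context):
    """Forward the simplified records to the webhook as a JSON POST."""
    records = extract_records(event)
    body = json.dumps({"records": records}).encode("utf-8")
    request = urllib.request.Request(
        WEBHOOK_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses, so a failed
    # delivery shows up as an error in the function's CloudWatch logs.
    with urllib.request.urlopen(request, timeout=10) as response:
        return {"statusCode": response.status, "forwarded": len(records)}
```

Only the standard library is used, so a function like this can be pasted into the Lambda editor without packaging extra dependencies.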

3a.1: Create the Lambda Function

  1. Navigate to AWS Lambda Console:
    • Go to AWS Management Console → Lambda
    • Click "Create function"

  2. Function Configuration:
    • Choose "Author from scratch"
    • Function name: InspectRAG-S3-Webhook-Forwarder
    • Runtime: Python 3.9 or later
    • Architecture: x86_64
    • Click "Create function"

3a.2: Configure the Lambda Function Code

  1. Download the Lambda Function Code:
    • The lambda_function.py file is included with your InspectRAG release package
    • Open the file in any text editor

  2. Update the Webhook URL:
    • In the lambda_function.py file, locate line 14:

      webhook_url = 'https://your-rag-api-url/s3-webhook'

    • Replace https://your-rag-api-url with your actual InspectRAG API URL
    • Examples:

      # If your InspectRAG is hosted at inspectrag.company.com
      webhook_url = 'https://inspectrag.company.com/s3-webhook'

      # If using a specific port
      webhook_url = 'https://inspectrag.company.com:8000/s3-webhook'

      # If using an IP address
      webhook_url = 'https://203.0.113.10:8000/s3-webhook'

  3. Deploy the Function Code:
    • Copy the entire contents of the modified lambda_function.py file
    • In the AWS Lambda console, paste the code into the function editor
    • Click "Deploy" to save your changes

3a.3: Connect S3 Bucket to Lambda

  1. Return to S3 Bucket Configuration:
    • Go to your S3 bucket → Properties → Event notifications
    • Click "Create event notification"

  2. Event Notification Details:
    • Name: InspectRAG-Event-Notification
    • Event types: Select:
      • ✅ All object create events (s3:ObjectCreated:*)
      • ✅ All object delete events (s3:ObjectRemoved:*)
    • Destination: Lambda function
    • Lambda function: Select InspectRAG-S3-Webhook-Forwarder

  3. Save Configuration:
    • Click "Save changes"
    • When the notification is created in the console, AWS automatically adds the permission that lets S3 invoke the function
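
If you prefer to script this step instead of using the console, the same notification can be expressed as a configuration dictionary. The sketch below only builds that dictionary; the helper name `build_notification_config` and the example ARN are hypothetical. Applying it requires boto3 and valid credentials, as shown in the trailing comments.

```python
def build_notification_config(
    lambda_arn: str, config_id: str = "InspectRAG-Event-Notification"
) -> dict:
    """Build an S3 notification configuration for object create and remove events."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "Id": config_id,
                "LambdaFunctionArn": lambda_arn,
                "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
            }
        ]
    }


# Hypothetical ARN with a placeholder account ID.
config = build_notification_config(
    "arn:aws:lambda:us-east-1:123456789012:function:InspectRAG-S3-Webhook-Forwarder"
)

# Applying it with boto3 (not executed here):
#   import boto3
#   boto3.client("s3").put_bucket_notification_configuration(
#       Bucket="company-documents", NotificationConfiguration=config
#   )
# Unlike the console, the API does not grant S3 permission to invoke the
# function; that has to be added separately (for example with the Lambda
# client's add_permission call) before S3 accepts the configuration.
```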

3a.4: Test the Integration

  1. Upload a test file to your S3 bucket
  2. Check Lambda logs:
    • Go to Lambda → Functions → InspectRAG-S3-Webhook-Forwarder
    • Click "Monitor" → "View logs in CloudWatch"
    • Look for successful webhook forwarding messages
  3. Verify in InspectRAG:
    • Check that the file appears in your InspectRAG index

3a.5: Troubleshooting Lambda Setup

| Issue | Cause | Solution |
|---|---|---|
| Function not triggered | S3 event notification not configured | Verify event notification is properly set up |
| HTTP errors in logs | Incorrect webhook URL | Double-check the webhook URL in lambda_function.py |
| Timeout errors | InspectRAG not responding | Verify InspectRAG is accessible and running |
| Permission denied | Lambda execution role missing permissions | Check IAM role and policies |

Important Notes

  • URL Requirements: Your InspectRAG API must be accessible from the internet
  • HTTPS Recommended: Use HTTPS for secure communication
  • File Location: The lambda_function.py file is included in your InspectRAG release package
  • Only Edit the URL: Only modify the webhook_url variable - don't change other parts of the code

Step 4: Configure the Webhook Endpoint

If you're using a direct webhook connection:

  1. Set up a publicly accessible endpoint in your InspectRAG application (e.g., /s3-webhook)
  2. Ensure this endpoint is secure and validates incoming requests
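  3. Configure AWS to deliver notifications to this endpoint. S3 event notifications cannot call an HTTP endpoint directly, so route them through the Lambda forwarder from Step 3 or an SNS topic with an HTTPS subscription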
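
One common way to validate incoming requests is an HMAC signature over the request body. The sketch below assumes a shared secret known to both the sender and the endpoint, with the signature carried in a header of your choosing (for example a hypothetical `X-Signature`). This is an illustration of the technique, not a documented InspectRAG feature.

```python
import hashlib
import hmac


def sign_body(secret: bytes, body: bytes) -> str:
    """Compute the hex HMAC-SHA256 signature the sender attaches to a request."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_valid_signature(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Check a received signature against the one computed from the raw body."""
    expected = sign_body(secret, body)
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of the signature matched.
    return hmac.compare_digest(
        expected.encode("utf-8"), received_signature.encode("utf-8")
    )
```

The endpoint should verify the signature against the raw request bytes before parsing the JSON, since re-serializing the body can change its byte representation.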

Step 5: Test the Integration

  1. Upload a file to your S3 bucket
  2. Check the InspectRAG logs to ensure the file is processed
  3. Verify the file is indexed by searching for its content in the InspectRAG interface

Step 6: Initial Synchronization

To index existing files in your S3 bucket, you can trigger a full synchronization from the InspectRAG admin interface. This will process all existing objects in the bucket according to your configuration settings.
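
Conceptually, a full synchronization has to enumerate every object in the bucket. The sketch below shows how that enumeration might be done with a boto3 S3 client and its `list_objects_v2` paginator; the function name `iter_object_keys` is hypothetical, and this is not InspectRAG's actual implementation.

```python
from typing import Iterator


def iter_object_keys(s3_client, bucket_name: str, prefix: str = "") -> Iterator[str]:
    """Yield every object key in the bucket, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # "Contents" is absent from a page when no objects match.
        for obj in page.get("Contents", []):
            yield obj["Key"]


# Usage with boto3 (requires the boto3 package and valid credentials):
#   import boto3
#   for key in iter_object_keys(boto3.client("s3"), "company-documents"):
#       print(key)
```

Using the paginator matters because a single `list_objects_v2` call returns at most 1,000 keys.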

Step 7: Scheduled Consistency Checks

InspectRAG can perform periodic consistency checks between your S3 bucket and its index. This helps identify and fix any objects that may have been missed during normal operations. Configure these checks through the administrative interface.
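
At its core, a consistency check is a set comparison between the keys in the bucket and the keys in the index. The following sketch illustrates that idea; the names `SyncDiff` and `compare_bucket_to_index` are hypothetical, and InspectRAG's own check may compare more than keys.

```python
from typing import Iterable, NamedTuple, Set


class SyncDiff(NamedTuple):
    missing_from_index: Set[str]  # in the bucket but not indexed
    stale_in_index: Set[str]      # indexed but no longer in the bucket


def compare_bucket_to_index(
    bucket_keys: Iterable[str], indexed_keys: Iterable[str]
) -> SyncDiff:
    """Return the keys that need indexing and the keys that need removal."""
    bucket, indexed = set(bucket_keys), set(indexed_keys)
    return SyncDiff(
        missing_from_index=bucket - indexed,
        stale_in_index=indexed - bucket,
    )


diff = compare_bucket_to_index(
    ["folder/a.txt", "folder/b.pdf"], ["folder/b.pdf", "folder/old.docx"]
)
```

A key-only comparison does not detect objects whose content changed; catching updates would also require comparing something like the ETag or last-modified timestamp.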

Data Processing

Object Types

The integration processes various file types from your S3 bucket:

| File Type | Processing |
|---|---|
| PDFs | Text extraction with structure preservation |
| Documents (Word, Excel, etc.) | Content extraction |
| Text files | Direct processing |
| Images | OCR processing if configured |

Metadata Capture

| Metadata Key | Description | Example |
|---|---|---|
| file_id | Unique ID for the object, format: s3-{bucket_name}-{object_key} | s3-company-documents-folder/file.txt |
| platform | Source platform identifier | s3 |
| bucket_name | Name of the S3 bucket | company-documents |
| object_key | Object key (path) in the bucket | folder/file.txt |
| permissions | List of default permissions applied to the object | ["public", "authenticated"] |
| creator | Identifier of the creator (default from S3_DEFAULT_CREATOR) | s3_webhook |
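
The metadata record above can be sketched as a small helper. The field names and the file_id format come from the table; the function name `build_metadata` is hypothetical and only illustrates how the fields relate to each other.

```python
from typing import List


def build_metadata(
    bucket_name: str, object_key: str, permissions: List[str], creator: str
) -> dict:
    """Assemble the metadata captured for one S3 object."""
    return {
        # file_id follows the documented format s3-{bucket_name}-{object_key}.
        "file_id": f"s3-{bucket_name}-{object_key}",
        "platform": "s3",
        "bucket_name": bucket_name,
        "object_key": object_key,
        "permissions": list(permissions),
        "creator": creator,
    }


metadata = build_metadata(
    "company-documents", "folder/file.txt", ["public", "authenticated"], "s3_webhook"
)
```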

Troubleshooting

Common Issues

| Issue | Possible Causes | Solution |
|---|---|---|
| Files not being indexed | Incorrect AWS credentials | Verify AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY |
| Authentication errors | Insufficient permissions | Check IAM policies for the AWS user |
| Missing files in index | Event notifications not configured | Set up bucket event notifications correctly |
| Processing errors | Unsupported file types | Check logs for specific file processing errors |

Checking Indexed Files

You can verify if specific S3 objects are indexed using the InspectRAG admin interface. This provides a way to check the status of individual files and reprocess them if necessary.

Manually Re-indexing Files

If you need to re-index specific files, you can do so through the InspectRAG administrative interface. Simply locate the file in question and use the "reprocess" option.

Performance Considerations

  • Files are processed in batches to optimize performance
  • Large files may take longer to process
  • Consider using a prefix strategy for large buckets
  • Monitoring Redis and application logs helps identify bottlenecks

Security Considerations

  • Store AWS credentials securely as environment variables
  • Use IAM roles with the minimum required permissions
  • Set up appropriate bucket policies
  • Consider encrypting sensitive files in S3 (using SSE)
  • Implement appropriate access controls in InspectRAG

This guide provides comprehensive instructions for setting up and configuring InspectRAG's S3 bucket integration. Follow these steps carefully to ensure proper configuration and operation.