Archiving At Scale

08 May 2024 - sj, tags: archiving, insights, news, product

Large organizations with several thousand employees may need to archive hundreds of terabytes of data, or even more. In this post we’ll set up a distributed environment where the load is spread across several nodes.

Assumptions

Your company’s domain name is example.com, and your SMTP servers or your provider’s mail servers send a copy of each received email to archive@archive-gw.example.com using SMTP journaling.

You have 5 worker nodes to store the emails.

Preparation

You have an SMTP gateway for the archive (archive-gw.example.com) that forwards the emails it receives to archive@archive.example.com.

Create the following MX records for archive.example.com. You may tweak the TTL values as well as the MX preference numbers.

archive 3600 IN MX 10 worker0.example.com.
        3600 IN MX 10 worker1.example.com.
        3600 IN MX 10 worker2.example.com.
        3600 IN MX 10 worker3.example.com.
        3600 IN MX 10 worker4.example.com.
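
If the number of worker nodes changes, the record set can be generated rather than typed by hand. The following is a minimal sketch, assuming the workers follow the workerN.example.com naming used here. The function name gen_mx_records and its arguments are illustrative and not part of any tool. It prints the records in standard zone-file field order (owner, TTL, class, type, preference, mail exchanger).

```shell
#!/bin/sh
# Hypothetical helper: print one equal-preference MX record per worker node.
# Usage: gen_mx_records COUNT [TTL] [PREFERENCE]
gen_mx_records() {
    count=$1
    ttl=${2:-3600}
    pref=${3:-10}
    i=0
    while [ "$i" -lt "$count" ]; do
        # Field order: owner, TTL, class, type, preference, mail exchanger
        printf 'archive %s IN MX %s worker%s.example.com.\n' "$ttl" "$pref" "$i"
        i=$((i + 1))
    done
}

gen_mx_records 5
```

Once the zone has been reloaded, dig +short MX archive.example.com should list all five workers with the same preference.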

Set up the archive gateway

In this example we’ll use Postfix with the configuration files below.

/etc/postfix/main.cf:

smtpd_banner = $myhostname ESMTP
biff = no
compatibility_level = 3.6

smtpd_tls_cert_file=/etc/ssl/certs/ssl-cert-snakeoil.pem
smtpd_tls_key_file=/etc/ssl/private/ssl-cert-snakeoil.key
smtpd_tls_security_level=may
smtp_tls_CApath=/etc/ssl/certs
smtp_tls_security_level=may
smtp_tls_session_cache_database = btree:${data_directory}/smtp_scache

myhostname = archive-gw.example.com
alias_maps = hash:/etc/aliases
alias_database = hash:/etc/aliases
mynetworks = 127.0.0.0/8
inet_protocols = ipv4

smtpd_recipient_restrictions = check_recipient_access hash:/etc/postfix/domains, reject
virtual_mailbox_domains = archive.example.com
virtual_alias_maps = hash:/etc/postfix/virtual
virtual_mailbox_base = /var/mail
message_size_limit = 50000000

/etc/postfix/virtual:

archive@archive-gw.example.com archive@archive.example.com

/etc/postfix/domains:

archive-gw.example.com OK

Run postmap to create the db files:

postmap /etc/postfix/virtual /etc/postfix/domains
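
Before sending real traffic, it can help to confirm that the lookup tables return what the configuration expects. The following is a minimal sketch using postmap -q, which queries a table for a key. The function name check_gateway_maps is illustrative; the paths and addresses are the ones used in this post.

```shell
#!/bin/sh
# Hypothetical helper: query the two lookup tables built above and
# report whether they return the expected values.
check_gateway_maps() {
    rc=0

    # The journal address should be rewritten to the archive address.
    target=$(postmap -q "archive@archive-gw.example.com" hash:/etc/postfix/virtual) || target=""
    if [ "$target" = "archive@archive.example.com" ]; then
        echo "virtual: ok"
    else
        echo "virtual: unexpected result '$target'"
        rc=1
    fi

    # The gateway domain should be accepted by check_recipient_access.
    action=$(postmap -q "archive-gw.example.com" hash:/etc/postfix/domains) || action=""
    if [ "$action" = "OK" ]; then
        echo "domains: ok"
    else
        echo "domains: unexpected result '$action'"
        rc=1
    fi

    return "$rc"
}
```

Calling check_gateway_maps on the gateway after running postmap should print two ok lines. After that, postfix check and postfix reload can be used to validate and apply the configuration.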

Set up the worker nodes

The worker nodes share the same configuration; only the license file differs slightly. The licensed hostname is archive.example.com for all worker nodes, but each node has its own server_id parameter, e.g. server_id=0 for worker0.example.com, server_id=1 for worker1.example.com, etc.

For example:

customer=EXAMPLE,expiry=1715143432,server_id=0,max_release=20240531,multitenancy=0,hostname=archive.example.com,ip=0.0.0.0
customer=EXAMPLE,expiry=1715143432,server_id=1,max_release=20240531,multitenancy=0,hostname=archive.example.com,ip=0.0.0.0
customer=EXAMPLE,expiry=1715143432,server_id=2,max_release=20240531,multitenancy=0,hostname=archive.example.com,ip=0.0.0.0
...
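
Since the license lines differ only in server_id, a quick consistency check can catch copy-and-paste mistakes before the files are deployed. The following is a minimal sketch, assuming the license lines have been collected one per line in a single file. The function name check_licenses is illustrative, and the check only inspects the plain-text hostname and server_id fields shown above.

```shell
#!/bin/sh
# Hypothetical helper: verify that every license line names the same
# hostname and that no server_id is used twice.
# Usage: check_licenses FILE [EXPECTED_HOSTNAME]
check_licenses() {
    awk -F, -v want="${2:-archive.example.com}" '
        NF == 0 { next }    # skip blank lines
        {
            host = ""; id = ""
            for (i = 1; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == "hostname")  host = kv[2]
                if (kv[1] == "server_id") id = kv[2]
            }
            if (host != want) {
                printf "line %d: hostname is %s\n", NR, host; bad = 1
            }
            if (id == "") {
                printf "line %d: no server_id\n", NR; bad = 1
            } else if (id in seen) {
                printf "line %d: duplicate server_id %s\n", NR, id; bad = 1
            } else {
                seen[id] = 1
            }
        }
        END { exit bad + 0 }
    ' "$1"
}
```

With the lines saved in a file such as licenses.txt (a hypothetical name), check_licenses licenses.txt prints nothing and returns 0 when every line is consistent, and otherwise prints one message per problem.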

Conclusion

You now have a high-performance, fault-tolerant email archiving solution. If a worker node is unavailable, the archiving gateway can send the emails to the remaining nodes.
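
To see which workers the gateway can currently reach, each node's SMTP port can be probed. The following is a minimal sketch, assuming a netcat (nc) that supports the -z and -w options and workers that accept SMTP on port 25. The function name check_workers is illustrative.

```shell
#!/bin/sh
# Hypothetical helper: report which worker nodes accept connections on
# the SMTP port. Returns non-zero if no worker is reachable.
# Usage: check_workers HOST...
check_workers() {
    up=0
    for host in "$@"; do
        # -z: connect without sending data, -w 3: give up after 3 seconds
        if nc -z -w 3 "$host" 25 >/dev/null 2>&1; then
            echo "$host up"
            up=$((up + 1))
        else
            echo "$host DOWN"
        fi
    done
    [ "$up" -gt 0 ]
}
```

For example, check_workers worker0.example.com worker1.example.com prints one line per host. As long as at least one worker is reported up, the gateway still has a node to deliver to.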

Next steps

You may want to add a second archiving gateway so that the gateway itself is not a single point of failure.

Also, even though the setup as a whole is fault tolerant (i.e. it can keep archiving new emails when a worker node becomes unavailable), the individual worker nodes are not. You need to back up the worker nodes so that you can restore them if something goes wrong.

You may even consider setting up a DR site, i.e. an independent datacenter where you replicate the whole setup. In a nutshell, your SMTP servers or your provider’s SMTP servers also send the journaled emails to archive@dr-archive-gw.example.com, which distributes them among the dr-worker{0-4}.example.com nodes in the other datacenter.
