Docs Update: Large Scale LNMS Deployment Example (#16226)
* Update Distributed-Poller.md
* Update Distributed-Poller.md
* Rename librenms-distributed-diagram.png to librenms-distributed-diagram-openstack.png
* Update Distributed-Poller.md
* Add files via upload
* Update Distributed-Poller.md
* Update Distributed-Poller.md
* Update Distributed-Poller.md

@@ -89,15 +89,95 @@ CACHE_DRIVER=memcached
If you want to use memcached, you will also need to install an additional
Python 3 python-memcached package.

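If you go that route, a minimal sketch of installing the client library, assuming
a pip-managed Python environment for the dispatcher (use your distribution's
package instead if you prefer):

```bash
# Memcached client library used by the LibreNMS Python dispatcher
pip3 install python-memcached
```
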
## Example Setups

### OpenStack

Below is an example setup based on a real deployment which at the time
of writing covers over 2,500 devices and 50,000 ports. The setup is
running within an OpenStack environment with some commodity hardware
for remote pollers. Here's a diagram of how you can scale LibreNMS
out:



### ESXi

This is a distributed setup that I created for a regional hybrid ISP
(fixed wireless/fiber optic backhaul). It was created at around the
~4,000 device mark to transition from multiple separate instances to a
single, more central one. When I left the company, it was monitoring:

* 10,800 devices
* 307,700 ports
* 37,000 processors
* 17,000 wireless sensors
* ~480,000 other objects/sensors

As our goal was more to catch alerts and monitor overall trends, we went
with a 10-minute polling cycle. Polling the above would take roughly 8
minutes and 120 GHz worth of CPU across all VMs. CPUs were older Xeons
(E5). The diagram below shows the CPU and RAM utilization of each VM
during polling. Disk space utilization for SQL/RRD is also included.

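If you run a longer polling cycle like this, the RRD parameters (and the
dispatcher's polling frequency) have to match it. A hedged sketch, assuming the
usual `rrd.step`/`rrd.heartbeat` settings and that your version exposes the
dispatcher frequency as `service_poller_frequency`:

```bash
# 10-minute polling: RRD step of 600 s, heartbeat of 2x the step
lnms config:set rrd.step 600
lnms config:set rrd.heartbeat 1200
# Dispatcher polling frequency in seconds (setting name is an assumption)
lnms config:set service_poller_frequency 600
```

Existing RRD files keep the step they were created with, so changing this on an
established install generally means converting or re-creating them.
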
Device discovery was split off onto its own VM, as that process would
take multiple hours.


|
||||
|
||||
Workers were assigned in the following way (a rough configuration
sketch follows the list):

* Web/RRD Server:
    * alerting: 1
    * billing: 2
    * discovery: 0
    * ping: 1
    * poller: 10
    * services: 16
* Discovery Server:
    * alerting: 1
    * billing: 2
    * discovery: 60
    * ping: 1
    * poller: 5
    * services: 8
* Pollers:
    * alerting: 1
    * billing: 2
    * discovery: 0
    * ping: 1
    * poller: 40
    * services: 8

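A hedged sketch of how worker counts along these lines can be set from the CLI,
assuming the dispatcher settings are exposed under names such as
`service_poller_workers`; in practice they are usually adjusted per poller or
poller group under the poller settings in the web UI rather than globally:

```bash
# Example for one of the dedicated poller VMs above (illustrative values;
# use per-poller-group settings in the web UI for different counts per node)
lnms config:set service_poller_workers 40
lnms config:set service_discovery_workers 0
lnms config:set service_services_workers 8
```
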
Each poller consumed on average 19,500 of its 24,000 available worker
seconds per cycle (40 poller workers × a 600-second cycle).

RRDCached is incredibly important; thanks to its caching, this setup
ran fine on spinning disks.

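A hedged example of the two halves of that: pointing LibreNMS at rrdcached, and
letting rrdcached batch writes so spinning disks can keep up (the socket path
and timer values are assumptions, not tuned recommendations):

```bash
# LibreNMS side: where rrdcached listens (web node and pollers alike)
lnms config:set rrdcached "unix:/run/rrdcached/rrdcached.sock"

# rrdcached side (e.g. /etc/default/rrdcached; flags per rrdcached(1)):
#   -w 1800  write values older than 30 minutes to disk
#   -z 900   spread those writes out over up to 15 minutes
#   -f 3600  scan for stale values every hour
```
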
I very strongly recommend setting up recursive DNS on your discovery
and polling servers. While I used dnsmasq, there are many options.

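A minimal caching-resolver sketch for a poller node, assuming dnsmasq (upstream
addresses and cache size are placeholders):

```
# /etc/dnsmasq.conf
# Only answer queries from this poller itself
listen-address=127.0.0.1
# Ignore /etc/resolv.conf and use the upstreams listed below
no-resolv
# Upstream resolvers (placeholder addresses)
server=10.0.0.53
server=10.0.0.54
# Large cache so repeated discovery/poller lookups stay local
cache-size=10000
```

Point the node's own resolver configuration at 127.0.0.1 afterwards.
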
A SQL tuner (such as MySQLTuner) will help you quite a bit. You'll also
want to increase your maximum connections setting to support the
pollers; this setup had it at 500. Less important, but putting ~12 GB of
the database in RAM was reported to have helped web UI performance as
well as some DB-heavy Tableau reports. RAM was precious in this
environment or that figure would have been higher, but more wasn't
necessary either.

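A hedged my.cnf fragment reflecting those two points, assuming MariaDB/MySQL
with InnoDB (the sizes are this deployment's, not a general recommendation):

```
[mysqld]
# Headroom for all distributed pollers plus the web UI
max_connections         = 500
# Keep the hot portion of the database in memory (~12 GB here)
innodb_buffer_pool_size = 12G
```
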
Be careful with keeping the default value for 'Device Down Retry' as it
can eat up quite a lot of poller activity. I freed up over 20,000 worker
seconds by setting this so a down device is only retried once or twice
per 10-minute polling cycle. The impact of this will vary depending on
the percentage of down devices in your system. This example had it set
at 400 seconds.

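With a 600-second cycle, a 400-second retry means a down device is re-attempted
at most once or twice per cycle instead of on every poller pass. A sketch of
setting it from the CLI, assuming your version exposes it as
`service_poller_down_retry` (it is also editable in the poller settings of the
web UI):

```bash
# Seconds to wait before re-polling a device currently marked down
# (setting name is an assumption; check your version's poller settings)
lnms config:set service_poller_down_retry 400
```
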
Also be wary of keeping event log and syslog entries for too long, as it
can have a pretty negative effect on web UI performance.

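Retention is handled by the daily cleanup; a hedged example of tightening it,
assuming the standard `eventlog_purge`/`syslog_purge` options (values are in
days and picked purely for illustration):

```bash
# Keep 30 days of event log and 14 days of syslog entries
lnms config:set eventlog_purge 30
lnms config:set syslog_purge 14
```
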
To resolve an issue with large device groups, the PHP-FPM
max_input_vars value was increased to 20000.

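A hedged example of where that override can live, assuming a dedicated PHP-FPM
pool for LibreNMS (the pool file path and PHP version are assumptions):

```
; e.g. /etc/php/8.2/fpm/pool.d/librenms.conf
; Large forms, such as editing device groups with many members,
; exceed PHP's default max_input_vars of 1000
php_admin_value[max_input_vars] = 20000
```
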
All of these VMs were within the same physical data center, so latency
was minimal.

The choice of redis over the other locking methods was arbitrary, but
in over 2 years I never had to touch that VM aside from security
updates.

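For reference, a minimal sketch of the `.env` entries that point the web node
and pollers at that Redis VM, as in the distributed poller configuration (the
host value is a placeholder):

```
CACHE_DRIVER=redis
REDIS_HOST=redis.example.internal
REDIS_PORT=6379
REDIS_DB=0
```
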
This install used the dispatcher service instead of cron.

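A hedged reminder of what that switch looks like on each node, assuming the
`librenms.service` unit from the dispatcher service docs is already installed
(the cron file path may differ on your distribution):

```bash
# Disable the cron-based poller entries on this node first
# (comment out or remove the poller lines in /etc/cron.d/librenms),
# then run the dispatcher service
systemctl enable --now librenms.service
```
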
## Architecture

BIN doc/img/librenms-distributed-diagram-esxi.png (new file, binary not shown, 38 KiB)
BIN doc/img/librenms-distributed-diagram-openstack.png (renamed from librenms-distributed-diagram.png, 23 KiB before and after)