Docs Update: Large Scale LNMS Deployment Example (#16226)
* Update Distributed-Poller.md
* Update Distributed-Poller.md
* Rename librenms-distributed-diagram.png to librenms-distributed-diagram-openstack.png
* Update Distributed-Poller.md
* Add files via upload
* Update Distributed-Poller.md
* Update Distributed-Poller.md
* Update Distributed-Poller.md

@@ -89,15 +89,95 @@ CACHE_DRIVER=memcached
If you want to use memcached, you will also need to install an additional
Python 3 python-memcached package.

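If you go that route, a minimal sketch of installing the client library, assuming
a pip-managed Python environment for the dispatcher (use your distribution's
package instead if you prefer):

```bash
# Memcached client library used by the LibreNMS Python dispatcher
pip3 install python-memcached
```
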
## Example Setups

### OpenStack

Below is an example setup based on a real deployment which at the time
of writing covers over 2,500 devices and 50,000 ports. The setup is
running within an OpenStack environment with some commodity hardware
for remote pollers. Here's a diagram of how you can scale LibreNMS
out:



### ESXi

This is a distributed setup that I created for a regional hybrid ISP
(fixed wireless/fiber optic backhaul). It was created at around the
~4,000 device mark to transition from multiple separate instances to a
single, more central one. When I left the company, it was monitoring:

* 10,800 devices
* 307,700 ports
* 37,000 processors
* 17,000 wireless sensors
* ~480,000 other objects/sensors

As our goal was more to catch alerts and monitor overall trends, we went
with a 10-minute polling cycle. Polling the above would take roughly 8
minutes and 120 GHz worth of CPU across all VMs. CPUs were older Xeons
(E5). The diagram below shows the CPU and RAM utilization of each VM
during polling. Disk space utilization for SQL/RRD is also included.

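If you run a longer polling cycle like this, the RRD parameters (and the
dispatcher's polling frequency) have to match it. A hedged sketch, assuming the
usual `rrd.step`/`rrd.heartbeat` settings and that your version exposes the
dispatcher frequency as `service_poller_frequency`:

```bash
# 10-minute polling: RRD step of 600 s, heartbeat of 2x the step
lnms config:set rrd.step 600
lnms config:set rrd.heartbeat 1200
# Dispatcher polling frequency in seconds (setting name is an assumption)
lnms config:set service_poller_frequency 600
```

Existing RRD files keep the step they were created with, so changing this on an
established install generally means converting or re-creating them.
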
Device discovery was split off onto its own VM, as that process would
take multiple hours.


|
||||
|
||||
Workers were assigned in the following way (a rough configuration
sketch follows the list):

* Web/RRD Server:
    * alerting: 1
    * billing: 2
    * discovery: 0
    * ping: 1
    * poller: 10
    * services: 16
* Discovery Server:
    * alerting: 1
    * billing: 2
    * discovery: 60
    * ping: 1
    * poller: 5
    * services: 8
* Pollers:
    * alerting: 1
    * billing: 2
    * discovery: 0
    * ping: 1
    * poller: 40
    * services: 8

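A hedged sketch of how worker counts along these lines can be set from the CLI,
assuming the dispatcher settings are exposed under names such as
`service_poller_workers`; in practice they are usually adjusted per poller or
poller group under the poller settings in the web UI rather than globally:

```bash
# Example for one of the dedicated poller VMs above (illustrative values;
# use per-poller-group settings in the web UI for different counts per node)
lnms config:set service_poller_workers 40
lnms config:set service_discovery_workers 0
lnms config:set service_services_workers 8
```
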
Each poller consumed on average 19,500 of its 24,000 available worker
seconds per cycle (40 poller workers × a 600-second cycle).

RRDCached is incredibly important; thanks to its caching, this setup
ran fine on spinning disks.

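A hedged example of the two halves of that: pointing LibreNMS at rrdcached, and
letting rrdcached batch writes so spinning disks can keep up (the socket path
and timer values are assumptions, not tuned recommendations):

```bash
# LibreNMS side: where rrdcached listens (web node and pollers alike)
lnms config:set rrdcached "unix:/run/rrdcached/rrdcached.sock"

# rrdcached side (e.g. /etc/default/rrdcached; flags per rrdcached(1)):
#   -w 1800  write values older than 30 minutes to disk
#   -z 900   spread those writes out over up to 15 minutes
#   -f 3600  scan for stale values every hour
```
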
I very strongly recommend setting up recursive DNS on your discovery
and polling servers. While I used dnsmasq, there are many options.

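A minimal caching-resolver sketch for a poller node, assuming dnsmasq (upstream
addresses and cache size are placeholders):

```
# /etc/dnsmasq.conf
# Only answer queries from this poller itself
listen-address=127.0.0.1
# Ignore /etc/resolv.conf and use the upstreams listed below
no-resolv
# Upstream resolvers (placeholder addresses)
server=10.0.0.53
server=10.0.0.54
# Large cache so repeated discovery/poller lookups stay local
cache-size=10000
```

Point the node's own resolver configuration at 127.0.0.1 afterwards.
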
A SQL tuner (such as MySQLTuner) will help you quite a bit. You'll also
want to increase your maximum connections setting to support the
pollers; this setup had it at 500. Less important, but putting ~12 GB of
the database in RAM was reported to have helped web UI performance as
well as some DB-heavy Tableau reports. RAM was precious in this
environment or that figure would have been higher, but more wasn't
necessary either.

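A hedged my.cnf fragment reflecting those two points, assuming MariaDB/MySQL
with InnoDB (the sizes are this deployment's, not a general recommendation):

```
[mysqld]
# Headroom for all distributed pollers plus the web UI
max_connections         = 500
# Keep the hot portion of the database in memory (~12 GB here)
innodb_buffer_pool_size = 12G
```
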
Be careful with keeping the default value for 'Device Down Retry' as it
can eat up quite a lot of poller activity. I freed up over 20,000 worker
seconds by setting this so a down device is only retried once or twice
per 10-minute polling cycle. The impact of this will vary depending on
the percentage of down devices in your system. This example had it set
at 400 seconds.

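With a 600-second cycle, a 400-second retry means a down device is re-attempted
at most once or twice per cycle instead of on every poller pass. A sketch of
setting it from the CLI, assuming your version exposes it as
`service_poller_down_retry` (it is also editable in the poller settings of the
web UI):

```bash
# Seconds to wait before re-polling a device currently marked down
# (setting name is an assumption; check your version's poller settings)
lnms config:set service_poller_down_retry 400
```
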
Also be wary of keeping event log and syslog entries for too long, as it
can have a pretty negative effect on web UI performance.

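Retention is handled by the daily cleanup; a hedged example of tightening it,
assuming the standard `eventlog_purge`/`syslog_purge` options (values are in
days and picked purely for illustration):

```bash
# Keep 30 days of event log and 14 days of syslog entries
lnms config:set eventlog_purge 30
lnms config:set syslog_purge 14
```
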
To resolve an issue with large device groups, the PHP-FPM
max_input_vars value was increased to 20000.

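A hedged example of where that override can live, assuming a dedicated PHP-FPM
pool for LibreNMS (the pool file path and PHP version are assumptions):

```
; e.g. /etc/php/8.2/fpm/pool.d/librenms.conf
; Large forms, such as editing device groups with many members,
; exceed PHP's default max_input_vars of 1000
php_admin_value[max_input_vars] = 20000
```
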
All of these VMs were within the same physical data center, so latency
was minimal.

The choice of redis over the other locking methods was arbitrary, but
in over 2 years I never had to touch that VM aside from security
updates.

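For reference, a minimal sketch of the `.env` entries that point the web node
and pollers at that Redis VM, as in the distributed poller configuration (the
host value is a placeholder):

```
CACHE_DRIVER=redis
REDIS_HOST=redis.example.internal
REDIS_PORT=6379
REDIS_DB=0
```
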
This install used the dispatcher service instead of cron.

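A hedged reminder of what that switch looks like on each node, assuming the
`librenms.service` unit from the dispatcher service docs is already installed
(the cron file path may differ on your distribution):

```bash
# Disable the cron-based poller entries on this node first
# (comment out or remove the poller lines in /etc/cron.d/librenms),
# then run the dispatcher service
systemctl enable --now librenms.service
```
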
## Architecture

BIN doc/img/librenms-distributed-diagram-esxi.png (new file, binary not shown, 38 KiB)
BIN doc/img/librenms-distributed-diagram-openstack.png (renamed from librenms-distributed-diagram.png, 23 KiB before and after)