Process Management
SRE software processes are controlled by the supervisord daemon. The systemctl service name used to start, stop and request the status of the SRE software is sre.
Besides these standard service operations, it is possible to connect directly to the running supervisord instance by launching /opt/sre/bin/supervisorctl. Once connected to supervisord, several commands are available, as shown by the help command.
supervisor> help
default commands (type help <topic>):
=====================================
add exit open reload restart start tail
avail fg pid remove shutdown status update
clear maintail quit reread signal stop version
The current status of the processes can be obtained by using the status command.
supervisor> status
sre-REST STOPPED Not started
sre-call-processor:0 RUNNING pid 6671, uptime 0:48:58
sre-gui STOPPED Not started
sre-manager STOPPED Not started
The current status can also be obtained by looking at the SRE GUI Dashboard, as shown in the figure below.
On start, supervisord reads its configuration file (/opt/sre/etc/supervisord-program.conf) to select which programs must be started. It is possible to override this configuration by manually starting or stopping processes.
A single process can be restarted with the restart <program> command.
supervisor> restart sre-manager
sre-manager: stopped
sre-manager: started
A single process can be stopped with the stop <program> command.
supervisor> stop sre-manager
sre-manager: stopped
A single process can be started with the start <program> command.
supervisor> start sre-manager
sre-manager: started
The supervisord configuration can be reloaded with the reload command. This operation stops all processes and restarts them according to the supervisord configuration file. In particular, a process that was started manually but is not active in the configuration will not be restarted after the reload operation.
supervisor> reload
Really restart the remote supervisord process y/N? y
Restarted supervisord
It is possible to read what a process outputs on its standard output with the tail <program> command.
supervisor> tail sre-manager
Services management
SRE
The SRE service can be stopped/started/restarted using the following commands:
[root@sre-em1 ~]# systemctl stop sre
[root@sre-em1 ~]# systemctl start sre
[root@sre-em1 ~]# systemctl restart sre
PostgreSQL
The PostgreSQL service can be stopped/started/restarted using the following commands:
[root@sre-em1 ~]# systemctl stop postgresql-14
[root@sre-em1 ~]# systemctl start postgresql-14
[root@sre-em1 ~]# systemctl restart postgresql-14
InfluxDB
The InfluxDB service can be stopped/started/restarted using the following commands:
[root@sre-em ~]# systemctl stop influxd
[root@sre-em ~]# systemctl start influxd
[root@sre-em ~]# systemctl restart influxd
Kamailio
The Kamailio service can be stopped/started/restarted using the following commands:
[root@sre-cp1 ~]# systemctl stop kamailio
[root@sre-cp1 ~]# systemctl start kamailio
[root@sre-cp1 ~]# systemctl restart kamailio
Mongo
The Mongo service can be stopped/started/restarted using the following commands:
[root@sre-cp1 ~]# systemctl stop mongod
[root@sre-cp1 ~]# systemctl start mongod
[root@sre-cp1 ~]# systemctl restart mongod
Monitoring
This section describes several key indicators of the system health. These indicators should be monitored by external monitoring systems to trigger alarms in case of issues.
Note
Some of these monitoring commands rely on queries run against the PostgreSQL database with the psql CLI tool. If these tasks need to be scripted, the output format can be adapted to ease parsing of the results. In particular, the option -t does not print the headers, the option -A does not align the table output and the option -R defines the record separator. Other output format options can be obtained by running /usr/pgsql-<version>/bin/psql --help. The output samples in the following sections are provided with the full output, to better illustrate the output data. Alternatively, these queries can be run remotely over a PostgreSQL connection, provided that the access rights allow them.
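For instance, a minimal scripted variant of the database-size query shown later in this section could look as follows; this is only an illustration, and the output values are taken from that later example (-t and -A produce one easily parsable record per line):
[root@sre-em1 ~]# /usr/pgsql-14/bin/psql -U postgres -h 127.0.0.1 -t -A -c "select datname, pg_database_size(datname) from pg_database"
postgres|8979235
sre|48988963
...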
Filesystems Monitoring
These filesystems should be monitored through SRE Alarming and optionally through external scripts:
/: there must be enough space on the root filesystem to allow normal operation of system services, PostgreSQL, Kamailio and the SRE software. The alarm threshold for disk usage should be set at a maximum of 75%.
/var/lib/pgsql: (if present) there must be enough space (< 75% used) for PostgreSQL.
/var/log/: logs are rotated daily and their size should remain stable under standard log levels. The alarm threshold for disk usage should be set at a maximum of 90%.
/data/sre/db/backups: automated backups should not fill the filesystem. The alarm threshold for disk usage should be set at a maximum of 90%.
/data/sre/db/wals: archived write-ahead logs (WALs) are essential for backup recovery. The alarm threshold for disk usage should be set at a maximum of 90%.
/data/sre/db/provisioning: sufficient disk space must be retained to keep a history of provisioning and ensure that automatic NPACT synchronization does not block. The alarm threshold for disk usage should be set at a maximum of 80%.
/data/sre/accounting: on EM nodes, sufficient disk space must be retained to be able to collect events from CP nodes and consequently produce CDRs. The alarm threshold for disk usage should be set at a maximum of 80%.
/var/lib/mongo: on all nodes, sufficient disk space must be retained to be able to store call counters for CAC. The alarm threshold for disk usage should be set at a maximum of 80%.
The "df -k" command provide you file system usage information
[root@sre-em1 ~]# df -k
Filesystem 1K-blocks Used Available Use% Mounted on
devtmpfs 1928240 0 1928240 0% /dev
tmpfs 1940072 204 1939868 1% /dev/shm
tmpfs 1940072 178504 1761568 10% /run
tmpfs 1940072 0 1940072 0% /sys/fs/cgroup
/dev/mapper/cl-root 20503120 3866652 15571920 20% /
/dev/sda1 999320 151016 779492 17% /boot
/dev/mapper/data-sre 51470816 661184 48172016 2% /data/sre
/dev/mapper/cl-var 10190100 4282252 5367176 45% /var
tmpfs 388016 0 388016 0% /run/user/0
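For external scripts, the same information can be turned into a simple threshold check; the following sketch is only an illustration, and the monitored mount points and the single threshold value are assumptions to adapt to the actual deployment and to the per-filesystem thresholds listed above:
#!/bin/bash
# Disk-usage check sketch: warn when a monitored mount point exceeds THRESHOLD percent.
THRESHOLD=80
MOUNTS="/ /var/log /data/sre"
for m in $MOUNTS; do
    usage=$(df -kP "$m" | awk 'NR==2 {gsub("%", "", $5); print $5}')
    if [ "$usage" -ge "$THRESHOLD" ]; then
        echo "WARNING: $m is ${usage}% full"
    fi
done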
/var/log/sre cleaning
Very often, full disk usage is related to the /var/log/sre directory (logs filling the filesystem).
You can run the following commands to clean that directory:
[root@sre-em1 ~]# cd /var/log/sre
[root@sre-em1 sre]# rm -f *.1
[root@sre-em1 sre]# rm -f *.2
[root@sre-em1 sre]# rm -f *.3
[root@sre-em1 sre]# rm -f *.4
[root@sre-em1 sre]# rm -f *.5
[root@sre-em1 sre]# rm -f *.6
[root@sre-em1 sre]# rm -f *.7
[root@sre-em1 sre]# rm -f *.8
[root@sre-em1 sre]# rm -f *.9
[root@sre-em1 sre]# rm -f *.10
Be aware that doing this means that you will lose the log history.
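As a gentler alternative, only rotated logs older than a given age can be removed; the following one-liner is a sketch and the 7-day retention is an assumption:
[root@sre-em1 ~]# find /var/log/sre -type f \( -name '*.[0-9]' -o -name '*.10' \) -mtime +7 -delete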
Memory and CPU usage monitoring
Memory and CPU usage can be monitored using the top command.
Tasks: 247 total, 1 running, 246 sleeping, 0 stopped, 0 zombie
%Cpu(s): 50,0 us, 3,1 sy, 0,0 ni, 43,8 id, 0,0 wa, 0,0 hi, 0,0 si, 3,1 st
KiB Mem : 3880144 total, 253148 free, 1468636 used, 2158360 buff/cache
KiB Swap: 1048572 total, 760308 free, 288264 used. 1818724 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26429 influxdb 20 0 1560028 239292 74696 S 80,0 6,2 20982:59 influxd
3297 sre 20 0 1646556 156956 6652 S 13,3 4,0 4363:27 sre-health-moni
9 root 20 0 0 0 0 S 6,7 0,0 673:47.40 rcu_sched
28612 root 20 0 162244 2368 1548 R 6,7 0,1 0:00.01 top
1 root 20 0 125640 2940 1628 S 0,0 0,1 53:17.18 systemd
2 root 20 0 0 0 0 S 0,0 0,0 0:04.12 kthreadd
4 root 0 -20 0 0 0 S 0,0 0,0 0:00.00 kworker/0:0H
6 root 20 0 0 0 0 S 0,0 0,0 25:55.80 ksoftirqd/0
7 root rt 0 0 0 0 S 0,0 0,0 0:01.36 migration/0
8 root 20 0 0 0 0 S 0,0 0,0 0:00.00 rcu_bh
10 root 0 -20 0 0 0 S 0,0 0,0 0:00.00 lru-add-drain
11 root rt 0 0 0 0 S 0,0 0,0 2:24.84 watchdog/0
12 root rt 0 0 0 0 S 0,0 0,0 1:44.89 watchdog/1
13 root rt 0 0 0 0 S 0,0 0,0 0:06.07 migration/1
14 root 20 0 0 0 0 S 0,0 0,0 13:13.88 ksoftirqd/1
16 root 0 -20 0 0 0 S 0,0 0,0 0:00.00 kworker/1:0H
18 root 20 0 0 0 0 S 0,0 0,0 0:00.00 kdevtmpfs
19 root 0 -20 0 0 0 S 0,0 0,0 0:00.00 netns
20 root 20 0 0 0 0 S 0,0 0,0 0:19.20 khungtaskd
21 root 0 -20 0 0 0 S 0,0 0,0 0:00.06 writeback
22 root 0 -20 0 0 0 S 0,0 0,0 0:00.00 kintegrityd
23 root 0 -20 0 0 0 S 0,0 0,0 0:00.00 bioset
24 root 0 -20 0 0 0 S 0,0 0,0 0:00.00 bioset
25 root 0 -20 0 0 0 S 0,0 0,0 0:00.00 bioset
26 root 0 -20 0 0 0 S 0,0 0,0 0:00.00 kblockd
27 root 0 -20 0 0 0 S 0,0 0,0 0:00.00 md
28 root 0 -20 0 0 0 S 0,0 0,0 0:00.00 edac-poller
29 root 0 -20 0 0 0 S 0,0 0,0 0:00.00 watchdogd
35 root 20 0 0 0 0 S 0,0 0,0 7:41.33 kswapd0
36 root 25 5 0 0 0 S 0,0 0,0 0:00.00 ksmd
37 root 39 19 0 0 0 S 0,0 0,0 1:15.34 khugepaged
38 root 0 -20 0 0 0 S 0,0 0,0 0:00.00 crypto
46 root 0 -20 0 0 0 S 0,0 0,0 0:00.00 kthrotld
48 root 0 -20 0 0 0 S 0,0 0,0 0:00.00 kmpath_rdacd
49 root 0 -20 0 0 0 S 0,0 0,0 0:00.00 kaluad
51 root 0 -20 0 0 0 S 0,0 0,0 0:00.00 kpsmoused
53 root 0 -20 0 0 0 S 0,0 0,0 0:00.00 ipv6_addrconf
The output of the top command provides valuable information such as:
CPU in %
RAM memory usage
The list of processes ordered by their consumption
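For scripted collection, top can also be run in batch mode; a minimal sketch (one iteration, truncated to the first lines) is:
[root@sre-em1 ~]# top -b -n 1 | head -n 15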
SRE Software Monitoring
The SRE software can be monitored at several levels: processes and stats.
SRE Process Monitoring
The operational status of processes running on SRE can be monitored by using the ps or pgrep commands and filtering on the process name. All SRE processes start with the string /opt/sre/bin/python /opt/sre/bin/sre-.
[root@sre-em1 ~]# pgrep -a -f "/opt/sre/bin/python"
3283 /opt/sre/bin/python /opt/sre/bin/supervisord -n
3294 /opt/sre/bin/python /opt/sre/bin/sre-REST
3295 /opt/sre/bin/python /opt/sre/bin/sre-cdr-collector
3296 /opt/sre/bin/python /opt/sre/bin/sre-gui
3297 /opt/sre/bin/python /opt/sre/bin/sre-health-monitor
3298 /opt/sre/bin/python /opt/sre/bin/sre-http-processor
19487 /opt/sre/bin/python /opt/sre/bin/sre-manager
The administrative status of the processes can be monitored with the supervisorctl tool, as all SRE processes are managed by supervisord. You can also retrieve the status of the SRE processes by polling the sre service status:
[root@sre-em1 ~]# /opt/sre/bin/supervisorctl status
sre-REST RUNNING pid 3294, uptime 46 days, 17:00:24
sre-agents-monitor STOPPED Not started
sre-broker STOPPED Not started
sre-call-processor:0 STOPPED Not started
sre-cdr-collector RUNNING pid 3295, uptime 46 days, 17:00:24
sre-cdr-postprocessor STOPPED Not started
sre-cdr-sender STOPPED Not started
sre-dns-updater STOPPED Not started
sre-enum-processor STOPPED Not started
sre-gui RUNNING pid 3296, uptime 46 days, 17:00:24
sre-health-monitor RUNNING pid 3297, uptime 46 days, 17:00:24
sre-http-processor RUNNING pid 3298, uptime 46 days, 17:00:24
sre-manager RUNNING pid 19487, uptime 3 days, 5:56:38
On a typical deployment, the processes to check are the following (a check sketch is provided after the lists):
Element Manager:
sre-REST
sre-gui
sre-manager
sre-health-monitor
sre-cdr-collector
Call Processor:
sre-agents-monitor
sre-broker
sre-call-processor:[0-N] (the number of processes might be different depending on the supervisord configuration file)
sre-health-monitor
sre-cdr-sender
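A minimal sketch of such a check is given below; the list of expected processes is an assumption and must be adapted to the node type (EM or CP) and to the supervisord configuration:
#!/bin/bash
# Verify that the expected SRE processes are RUNNING under supervisord.
# EXPECTED is an assumption: adapt it to the node type (EM or CP).
EXPECTED="sre-REST sre-gui sre-manager sre-health-monitor sre-cdr-collector"
STATUS=$(/opt/sre/bin/supervisorctl status)
for p in $EXPECTED; do
    echo "$STATUS" | grep -E "^$p[[:space:]]+RUNNING" >/dev/null \
        || echo "WARNING: $p is not RUNNING"
done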
Stats Monitoring
Near real-time stats are kept in InfluxDB.
Counters of occurrences of events are also stored in the counters.csv file, in /var/log/sre. Each record is composed of the following fields:
hostname: node which generated the event
stat name: counter (event) identifier. It can represent system resource stats, nodes in the Service Logic(s), or the number of outcomes from SRE (relay/redirect/serviceLogicError/sipResponse/...)
timestamp (60-sec aligned) in human-readable format: timestamp of the minute for which the events occurred.
timestamp (60-sec aligned): Unix timestamp (seconds since the epoch) of the minute for which the events occurred (this value is always a multiple of 60).
values: the following 15 fields. Each field contains the total number of occurrences of this counter type during its window of 1 minute, ordered from the most recent minute to the least recent. For instance, the first value contains the number of occurrences at 14:41, the second one the number of occurrences at 14:40, and so on.
These values "shift to the right" every minute, as the file is refreshed with new stats every minute.
As an example, here is a possible content of counters.csv:
[root@sre-em1 ~]# more /var/log/sre/counters.csv
sre32-cp2-testbed,custom.fleg_relay,2022-05-10T16:31:00,1652193060,,,,,,,,,,,,,,,
sre32-cp2-testbed,profiling.cp.CAC test.503,2022-05-10T16:31:00,1652193060,,,,,,,,,,,,,,,
sre32-cp2-testbed,profiling.cp.CAC test.Extract Contacts,2022-05-10T16:31:00,1652193060,12,11,13,12,12,12,12,12,12,12,12,12,12,12,12
sre32-cp2-testbed,profiling.cp.CAC test.Extract mContacts,2022-05-10T16:31:00,1652193060,,,,,,,,,,,,,,,
sre32-cp2-testbed,profiling.cp.CAC test.Start,2022-05-10T16:31:00,1652193060,12,11,13,12,12,12,12,12,12,12,12,12,12,12,12
sre32-cp2-testbed,profiling.cp.CAC test.add counter0,2022-05-10T16:31:00,1652193060,12,11,13,12,12,12,12,12,12,12,12,12,12,12,12
sre32-cp2-testbed,profiling.cp.CAC test.check CAC,2022-05-10T16:31:00,1652193060,12,11,13,12,12,12,12,12,12,12,12,12,12,12,12
sre32-cp2-testbed,profiling.cp.CAC test.register CAC,2022-05-10T16:31:00,1652193060,12,11,13,12,12,12,12,12,12,12,12,12,12,12,12
sre32-cp2-testbed,profiling.cp.CAC test.relay msg,2022-05-10T16:31:00,1652193060,12,11,13,12,12,12,12,12,12,12,12,12,12,12,12
sre32-cp2-testbed,profiling.cp.CAC test.remove t,2022-05-10T16:31:00,1652193060,12,11,13,12,12,12,12,12,12,12,12,12,12,12,12
sre32-cp2-testbed,profiling.cp.CAC test.replace To,2022-05-10T16:31:00,1652193060,12,11,13,12,12,12,12,12,12,12,12,12,12,12,12
sre32-cp2-testbed,profiling.cp.CAC test.set a and b,2022-05-10T16:31:00,1652193060,12,11,13,12,12,12,12,12,12,12,12,12,12,12,12
sre32-cp2-testbed,request.INVITE,2022-05-10T16:31:00,1652193060,12,11,13,12,12,12,12,12,12,12,12,12,12,12,12
sre32-cp2-testbed,response.503,2022-05-10T16:31:00,1652193060,,,,,,,,,,,,,,,
sre32-cp2-testbed,response.loop,2022-05-10T16:31:00,1652193060,,,,,,,,,,,,,,,
sre32-cp2-testbed,response.relay,2022-05-10T16:31:00,1652193060,12,11,13,12,12,12,12,12,12,12,12,12,12,12,12
sre32-cp2-testbed,response.serviceLogicError,2022-05-10T16:31:00,1652193060,,,,,,,,,,,,,,,
Counters of interest are described in the following table.
Counter name | Description |
---|---|
request.INVITE | INVITE requests |
request.OPTIONS | OPTIONS requests |
response.redirect | Redirect responses (301/302) |
response.loop | Loop responses (482) |
response.serviceLogicError | Service Logic Error responses (604) |
response.serviceDown | Service Down responses (503) |
response.genericError | Generic Error responses (500) |
request.http.<method> | HTTP requests |
response.http.<code> | HTTP responses |
request.dns.NAPTR | NAPTR requests |
response.dns.NOERROR | Successful DNS responses |
response.dns.SERVFAIL | Failed DNS responses |
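When one of these counters has to be fed into an external monitoring system, its most recent per-minute value can be extracted directly from counters.csv; the counter name below is only an example (field 5 is the most recent value, as described above):
[root@sre-em1 ~]# awk -F, '$2 == "response.serviceLogicError" {print $1, $5+0}' /var/log/sre/counters.csv
sre32-cp2-testbed 0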
In addition, the samples.csv file provides in the value fields the average processing time for each event, computed as the sum of the durations of the events divided by the number of events (e.g. the processing time for INVITEs).
An example of samples.csv is provided here below:
[root@sre-em1 ~]# more /var/log/sre/samples.csv
sre32-cp2-testbed,accounting.openCalls,2022-05-10T16:32:01,1652193121,3800.000,3800.000,3800.000,3700.000,3566.667,3833.333,3833.333,3866.667,3666.667,3666.667,3900.000,3833.333,3933.333,3633.333,3866.667
sre32-cp2-testbed,profiling.cp.CAC test.503,2022-05-10T16:32:01,1652193121,,,,,,,,,,,,,,,
sre32-cp2-testbed,profiling.cp.CAC test.Extract Contacts,2022-05-10T16:32:01,1652193121,0.113,0.121,0.103,0.115,0.123,0.114,0.111,0.117,0.121,0.110,0.115,0.120,0.129,0.101,0.103
sre32-cp2-testbed,profiling.cp.CAC test.Extract mContacts,2022-05-10T16:32:01,1652193121,,,,,,,,,,,,,,,
sre32-cp2-testbed,profiling.cp.CAC test.Start,2022-05-10T16:32:01,1652193121,0.076,0.079,0.064,0.080,0.080,0.083,0.062,0.081,0.087,0.072,0.075,0.085,0.072,0.067,0.063
sre32-cp2-testbed,profiling.cp.CAC test.add counter0,2022-05-10T16:32:01,1652193121,0.064,0.063,0.066,0.059,0.057,0.067,0.052,0.060,0.070,0.064,0.076,0.050,0.073,0.057,0.057
sre32-cp2-testbed,profiling.cp.CAC test.check CAC,2022-05-10T16:32:01,1652193121,4.674,3.809,4.454,4.630,7.464,5.426,9.562,3.780,4.517,7.097,5.204,4.359,3.724,4.870,5.153
sre32-cp2-testbed,profiling.cp.CAC test.register CAC,2022-05-10T16:32:01,1652193121,0.050,0.049,0.053,0.049,0.050,0.060,0.048,0.050,0.049,0.049,0.052,0.047,0.057,0.047,0.052
sre32-cp2-testbed,profiling.cp.CAC test.relay msg,2022-05-10T16:32:01,1652193121,0.097,0.088,0.080,0.085,0.088,0.092,0.079,0.081,0.085,0.082,0.092,0.077,0.095,0.088,0.082
sre32-cp2-testbed,profiling.cp.CAC test.remove t,2022-05-10T16:32:01,1652193121,0.041,0.045,0.045,0.042,0.050,0.052,0.036,0.039,0.042,0.041,0.045,0.036,0.047,0.037,0.040
sre32-cp2-testbed,profiling.cp.CAC test.replace To,2022-05-10T16:32:01,1652193121,0.054,0.051,0.055,0.048,0.059,0.064,0.049,0.048,0.054,0.046,0.055,0.051,0.060,0.049,0.050
sre32-cp2-testbed,profiling.cp.CAC test.set a and b,2022-05-10T16:32:01,1652193121,0.081,0.085,0.071,0.087,0.088,0.080,0.078,0.084,0.081,0.064,0.083,0.072,0.083,0.063,0.069
sre32-cp2-testbed,profiling.cp.INVITE,2022-05-10T16:32:01,1652193121,13.214,10.280,13.943,13.042,19.519,13.832,19.339,10.739,15.154,16.718,12.929,13.994,10.781,13.135,13.157
sre32-cp2-testbed,profiling.cp.loop,2022-05-10T16:32:01,1652193121,0.836,0.078,0.722,0.668,2.033,1.148,0.823,0.086,1.377,1.157,0.881,1.325,0.096,0.773,1.013
sre32-cp1-testbed,system.cpu,2022-05-10T16:32:01,1652193121,11718.333,11660.000,11665.000,11911.667,12000.000,11545.000,11658.333,11815.000,11630.000,11716.667,11668.333,11620.000,11853.333,11685.000,11606.667
sre32-cp2-testbed,system.cpu,2022-05-10T16:32:01,1652193121,11463.333,11186.667,11150.000,11330.000,11451.667,11265.000,11330.000,11233.333,11185.000,11046.667,11090.000,11201.667,11268.333,11003.333,11178.333
sre32-em1-testbed,system.cpu,2022-05-10T16:32:01,1652193121,10006.122,9622.449,10550.000,9261.224,10508.163,9285.417,9935.417,9518.000,10800.000,9275.000,10174.000,9456.250,10647.917,9506.000,10518.750
sre32-em2-testbed,system.cpu,2022-05-10T16:32:01,1652193121,10224.074,10170.909,10077.778,10127.778,10260.000,10033.333,10049.091,10177.778,10135.185,10140.000,9944.444,10111.111,10094.545,10220.370,10124.074
sre32-cp1-testbed,system.disk./,2022-05-10T16:32:01,1652193121,67900.000,67900.000,67900.000,67900.000,67900.000,67900.000,67900.000,67900.000,67900.000,67900.000,67900.000,67900.000,67900.000,67900.000,67900.000
sre32-cp2-testbed,system.disk./,2022-05-10T16:32:01,1652193121,69600.000,69600.000,69600.000,69600.000,69600.000,69600.000,69600.000,69600.000,69600.000,69600.000,69600.000,69600.000,69600.000,69600.000,69600.000
sre32-em1-testbed,system.disk./,2022-05-10T16:32:01,1652193121,76400.000,76400.000,76400.000,76400.000,76400.000,76400.000,76400.000,76400.000,76400.000,76400.000,76400.000,76400.000,76400.000,76400.000,76400.000
sre32-em2-testbed,system.disk./,2022-05-10T16:32:01,1652193121,67500.000,67500.000,67500.000,67500.000,67500.000,67500.000,67500.000,67500.000,67500.000,67500.000,67500.000,67500.000,67500.000,67500.000,67500.000
sre32-cp1-testbed,system.disk./boot,2022-05-10T16:32:01,1652193121,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000
sre32-cp2-testbed,system.disk./boot,2022-05-10T16:32:01,1652193121,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000
sre32-em1-testbed,system.disk./boot,2022-05-10T16:32:01,1652193121,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000
sre32-em2-testbed,system.disk./boot,2022-05-10T16:32:01,1652193121,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000,66700.000
sre32-cp1-testbed,system.disk./boot/efi,2022-05-10T16:32:01,1652193121,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000
sre32-cp2-testbed,system.disk./boot/efi,2022-05-10T16:32:01,1652193121,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000
sre32-em1-testbed,system.disk./boot/efi,2022-05-10T16:32:01,1652193121,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000
sre32-em2-testbed,system.disk./boot/efi,2022-05-10T16:32:01,1652193121,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000
sre32-cp1-testbed,system.disk./data/sre/db/backups,2022-05-10T16:32:01,1652193121,17100.000,17100.000,17100.000,17100.000,17100.000,17100.000,17100.000,17100.000,17100.000,17100.000,17100.000,17100.000,17100.000,17100.000,17100.000
sre32-cp2-testbed,system.disk./data/sre/db/backups,2022-05-10T16:32:01,1652193121,17100.000,17100.000,17100.000,17100.000,17100.000,17100.000,17100.000,17100.000,17100.000,17100.000,17100.000,17100.000,17100.000,17100.000,17100.000
sre32-em1-testbed,system.disk./data/sre/db/backups,2022-05-10T16:32:01,1652193121,10400.000,10400.000,10400.000,10400.000,10400.000,10400.000,10400.000,10400.000,10400.000,10400.000,10400.000,10400.000,10400.000,10400.000,10400.000
sre32-em2-testbed,system.disk./data/sre/db/backups,2022-05-10T16:32:01,1652193121,17300.000,17300.000,17300.000,17300.000,17300.000,17300.000,17300.000,17300.000,17300.000,17300.000,17300.000,17300.000,17300.000,17300.000,17300.000
sre32-cp1-testbed,system.disk./data/sre/db/wals,2022-05-10T16:32:01,1652193121,200.000,200.000,200.000,200.000,200.000,200.000,200.000,200.000,200.000,200.000,200.000,200.000,200.000,200.000,200.000
sre32-cp2-testbed,system.disk./data/sre/db/wals,2022-05-10T16:32:01,1652193121,200.000,200.000,200.000,200.000,200.000,200.000,200.000,200.000,200.000,200.000,200.000,200.000,200.000,200.000,200.000
sre32-em1-testbed,system.disk./data/sre/db/wals,2022-05-10T16:32:01,1652193121,1900.000,1900.000,1900.000,1900.000,1900.000,1900.000,1900.000,1900.000,1900.000,1900.000,1900.000,1900.000,1900.000,1900.000,1900.000
sre32-em2-testbed,system.disk./data/sre/db/wals,2022-05-10T16:32:01,1652193121,200.000,200.000,200.000,200.000,200.000,200.000,200.000,200.000,200.000,200.000,200.000,200.000,200.000,200.000,200.000
sre32-cp1-testbed,system.disk./data/sre/provisioning,2022-05-10T16:32:01,1652193121,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000
sre32-cp2-testbed,system.disk./data/sre/provisioning,2022-05-10T16:32:01,1652193121,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000
sre32-em1-testbed,system.disk./data/sre/provisioning,2022-05-10T16:32:01,1652193121,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000
sre32-em2-testbed,system.disk./data/sre/provisioning,2022-05-10T16:32:01,1652193121,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000,400.000
sre32-cp1-testbed,system.disk./opt,2022-05-10T16:32:01,1652193121,35300.000,35300.000,35300.000,35300.000,35300.000,35300.000,35300.000,35300.000,35300.000,35300.000,35300.000,35300.000,35300.000,35300.000,35300.000
sre32-cp2-testbed,system.disk./opt,2022-05-10T16:32:01,1652193121,35300.000,35300.000,35300.000,35300.000,35300.000,35300.000,35300.000,35300.000,35300.000,35300.000,35300.000,35300.000,35300.000,35300.000,35300.000
sre32-em1-testbed,system.disk./opt,2022-05-10T16:32:01,1652193121,35000.000,35000.000,35000.000,35000.000,35000.000,35000.000,35000.000,35000.000,35000.000,35000.000,35000.000,35000.000,35000.000,35000.000,35000.000
sre32-em2-testbed,system.disk./opt,2022-05-10T16:32:01,1652193121,34900.000,34900.000,34900.000,34900.000,34900.000,34900.000,34900.000,34900.000,34900.000,34900.000,34900.000,34900.000,34900.000,34900.000,34900.000
sre32-cp1-testbed,system.disk./var/log,2022-05-10T16:32:01,1652193121,44426.667,44500.000,44500.000,44496.667,44400.000,44400.000,44400.000,44400.000,44400.000,44480.000,44500.000,44408.333,44400.000,44400.000,44400.000
sre32-cp2-testbed,system.disk./var/log,2022-05-10T16:32:01,1652193121,39906.667,39916.667,39916.667,39900.000,39910.000,39900.000,39905.000,39900.000,39900.000,39900.000,39900.000,39900.000,39900.000,39900.000,39900.000
sre32-em1-testbed,system.disk./var/log,2022-05-10T16:32:01,1652193121,40967.347,40900.000,40900.000,40900.000,40900.000,40831.250,40800.000,40800.000,40800.000,40800.000,40708.000,40700.000,40700.000,40700.000,40700.000
sre32-em2-testbed,system.disk./var/log,2022-05-10T16:32:01,1652193121,62600.000,62600.000,62600.000,62600.000,62600.000,62600.000,62600.000,62600.000,62600.000,62600.000,62501.852,62505.556,62536.364,62500.000,62500.000
sre32-cp1-testbed,system.mem.ram,2022-05-10T16:32:01,1652193121,51891.667,52018.333,52100.000,52100.000,52098.333,52100.000,52100.000,52098.333,52071.667,51886.667,51898.333,51890.000,51886.667,51881.667,51886.667
sre32-cp2-testbed,system.mem.ram,2022-05-10T16:32:01,1652193121,46798.333,46798.333,46733.333,46595.000,46593.333,46580.000,46588.333,46591.667,46593.333,46586.667,46571.667,46556.667,46528.333,46551.667,46705.000
sre32-em1-testbed,system.mem.ram,2022-05-10T16:32:01,1652193121,59100.000,59100.000,59100.000,59100.000,59097.959,59100.000,59039.583,59000.000,59002.083,59000.000,58960.000,58910.417,58910.417,58906.000,58900.000
sre32-em2-testbed,system.mem.ram,2022-05-10T16:32:01,1652193121,34618.519,34601.818,34609.259,34605.556,34605.455,34611.111,34607.273,34611.111,34603.704,34609.091,34601.852,34605.556,34605.455,34607.407,34601.852
sre32-cp1-testbed,system.mem.swap,2022-05-10T16:32:01,1652193121,3200.000,3200.000,3200.000,3200.000,3200.000,3200.000,3200.000,3200.000,3200.000,3200.000,3200.000,3200.000,3200.000,3200.000,3200.000
sre32-cp2-testbed,system.mem.swap,2022-05-10T16:32:01,1652193121,2500.000,2500.000,2500.000,2500.000,2500.000,2500.000,2500.000,2500.000,2500.000,2500.000,2500.000,2500.000,2500.000,2500.000,2500.000
sre32-em1-testbed,system.mem.swap,2022-05-10T16:32:01,1652193121,5600.000,5600.000,5600.000,5600.000,5600.000,5600.000,5600.000,5600.000,5600.000,5600.000,5600.000,5600.000,5600.000,5600.000,5600.000
sre32-em2-testbed,system.mem.swap,2022-05-10T16:32:01,1652193121,1200.000,1200.000,1200.000,1200.000,1200.000,1200.000,1200.000,1200.000,1200.000,1200.000,1200.000,1200.000,1200.000,1200.000,1200.000
Samples of interest are described in the following table.
Sample name | Description |
---|---|
profiling.cp.INVITE | Duration to process INVITE requests |
profiling.cp.OPTIONS | Duration to process OPTIONS requests |
profiling.cp.loop | Duration to perform loop detection |
profiling.enum.NAPTR | Duration to process NAPTR requests |
profiling.http.<method> | Duration to process HTTP requests |
accounting.openCalls | Calls opened in the last minute |
InfluxDB query
At regular intervals, the stats collected by the sre-manager are dumped to the internal InfluxDB database.
Counters of occurrences of events of the last minute can be shown with this command:
[root@sre-em1 ~]# influx query 'from(bucket: "counters")|> range(start: -1m)|> drop(columns: ["_start", "_stop", "_field"])' | grep request.OPTIONS
sip request.OPTIONS SRE-33-CP1 2023-10-18T12:18:44.000000000Z 1
sip request.OPTIONS SRE-33-CP2 2023-10-18T12:18:09.000000000Z 1
The cumulative duration of events and the number of such occurrences are stored in the samples bucket. Each record is composed of the following fields:
hostname: node which generated the event
_measurement: event name.
_time: event timestamp.
elapsed_time: sum of the durations of the single events.
occurrences: total number of occurrences of this event type during this window of 1 minute.
Dividing elapsed_time by occurrences for a record computes the average duration of such an event.
Samples of events of the last 10 seconds can be shown with this command:
[root@sre-em1 ~]# influx query 'from(bucket: "samples")|> range(start: -10s)|> drop(columns: ["_start", "_stop"])' | grep profiling.cp.INVITE
elapsed_time profiling.cp.INVITE SRE-33-CP1 2023-10-18T12:20:50.000000000Z 0.212
occurences profiling.cp.INVITE SRE-33-CP1 2023-10-18T12:20:50.000000000Z 2
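The average duration can also be computed directly in the query; the following Flux sketch is an assumption based on the field names shown above (note the occurences spelling, as it appears in the output):
[root@sre-em1 ~]# influx query 'from(bucket: "samples")
  |> range(start: -1m)
  |> filter(fn: (r) => r._measurement == "profiling.cp.INVITE")
  |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
  |> map(fn: (r) => ({ r with avg_duration: r.elapsed_time / float(v: r.occurences) }))'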
PostgreSQL Monitoring
Service monitoring
To verify the status of PostgreSQL, execute the command systemctl status postgresql-14.
[root@sre-em ~]# systemctl status postgresql-14
- postgresql-14.service - PostgreSQL 14 database server
Loaded: loaded (/usr/lib/systemd/system/postgresql-14.service; enabled; vendor preset: disabled)
Active: active (running) since ven 2023-04-28 10:11:14 CEST; 5 months 20 days ago
Docs: https://www.postgresql.org/docs/14/static/
Main PID: 21254 (postmaster)
CGroup: /system.slice/postgresql-14.service
3353 postgres: sre sre 127.0.0.1(38618) idle
3501 postgres: postgres postgres 127.0.0.1(39082) idle
5775 postgres: sre sre 127.0.0.1(50674) idle
5778 postgres: sre sre 127.0.0.1(50688) idle
5781 postgres: sre stirshaken_a 127.0.0.1(50704) idle
5782 postgres: sre stirshaken_b 127.0.0.1(50706) idle
5783 postgres: sre regression_test_08_12_22_a 127.0.0.1(50708) idle
5784 postgres: sre regression_test_08_12_22_b 127.0.0.1(50716) idle
Process Monitoring
The master process, postmaster, should be present on all nodes.
[root@sre-em ~]# ps -ef|grep postmaster|grep -v grep
postgres 21254 1 0 apr28 ? 04:53:27 /usr/pgsql-14/bin/postmaster -D /var/lib/pgsql/14/data/
On the master node, there should be a number of write-ahead log (WAL) senders equal to the number of nodes replicating from the master (e.g. standby EM and 4 CP nodes). Beware that the streaming position (that is, the current "snapshot" of the DB) should be the same on all nodes, unless the synchronization has been stopped on one or more nodes on purpose.
[root@sre-em ~]# ps -ef|grep "postgres: walsender"|grep -v grep
postgres 16401 21254 0 apr28 ? 00:19:45 postgres: walsender repmgr 10.0.161.181(44576) streaming 2/B8264AB0
postgres 21300 21254 0 apr28 ? 00:26:17 postgres: walsender repmgr 10.0.161.183(48710) streaming 2/B8264AB0
postgres 21306 21254 0 apr28 ? 00:20:02 postgres: walsender repmgr 10.0.161.182(57050) streaming 2/B8264AB0
On the standby nodes, there should be exactly one write-ahead log (WAL) receiver.
[root@sre-cp ~]# ps -ef|grep "postgres: walreceiver"|grep -v grep
postgres 2444 2436 0 apr28 ? 03:32:36 postgres: walreceiver streaming 2/B8264AB0
The number of open connections from the sre user should remain stable. If the number of connections increases over time, this might be an indication that the sessions are not correctly ended by the SRE software.
[root@sre-cp ~]# ps -ef|grep "postgres: sre"|grep -v grep
postgres 10216 2436 0 set01 ? 00:00:00 postgres: sre sre 127.0.0.1(37400) idle
postgres 10221 2436 0 set01 ? 00:00:00 postgres: sre sre 127.0.0.1(37426) idle
postgres 10224 2436 0 set01 ? 00:00:00 postgres: sre stirshaken_a 127.0.0.1(37432) idle
postgres 10225 2436 0 set01 ? 00:00:00 postgres: sre stirshaken_b 127.0.0.1(37434) idle
postgres 10226 2436 0 set01 ? 00:00:00 postgres: sre regression_test_08_12_22_a 127.0.0.1(37436) idle
postgres 10227 2436 0 set01 ? 00:00:00 postgres: sre regression_test_08_12_22_b 127.0.0.1(37438) idle
Processes handling active transactions can be counted by "grepping" on the string "in transaction". This number should be stable over time.
[root@sre-cp ~]# ps -ef|grep "postgres: sre"|grep -c "in transaction"
1
The remaining connections are the idle ones.
[root@sre-cp ~]# ps -ef|grep "postgres: sre"|grep -c idle
59
Replication
Replication status is shown in the GUI > Dashboard > Database.
The status can also be checked with the following queries. The result f (false) indicates that the node is not replicating, so it is the master. The result t (true) indicates that the node is replicating from a master node.
On the master EM:
[root@sre-em ~]# /usr/pgsql-14/bin/psql -U postgres -h 127.0.0.1 -c "select * from pg_is_in_recovery()"
pg_is_in_recovery
-------------------
f
(1 row)
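For scripted role monitoring, the same query can be run with the parsing-friendly options mentioned in the note above:
[root@sre-em ~]# /usr/pgsql-14/bin/psql -U postgres -h 127.0.0.1 -t -A -c "select pg_is_in_recovery()"
f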
The number of clients connected to replicate the databases can be retrieved by querying the pg_stat_replication view.
The state field can be used in the query to differentiate between streaming replication (normal mode of operation for all nodes) and backup replication (result of an ongoing backup activity). This view is only populated on the master PostgreSQL instance.
On the master EM:
[root@sre-em ~]# /usr/pgsql-14/bin/psql -U postgres -h 127.0.0.1 -c "select * from pg_stat_replication"
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn
| write_lag | flush_lag | replay_lag | sync_priority | sync_state | reply_time
-------+----------+---------+------------------+--------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+-----------
-+-----------------+-----------------+-----------------+---------------+------------+-------------------------------
21300 | 16389 | repmgr | cp2 | 10.0.161.183 | | 48710 | 2023-04-28 10:11:15.927884+02 | | streaming | 2/B827AF38 | 2/B827AF38 | 2/B827AF38 | 2/B827AF38
| 00:00:00.001141 | 00:00:00.002427 | 00:00:00.002952 | 0 | async | 2023-10-18 11:01:45.525076+02
21306 | 16389 | repmgr | cp1 | 10.0.161.182 | | 57050 | 2023-04-28 10:11:16.291374+02 | | streaming | 2/B827AF38 | 2/B827AF38 | 2/B827AF38 | 2/B827AF38
| 00:00:00.001203 | 00:00:00.002264 | 00:00:00.002315 | 0 | async | 2023-10-18 11:01:45.524272+02
16401 | 16389 | repmgr | em2 | 10.0.161.181 | | 44576 | 2023-04-28 11:59:13.512262+02 | | streaming | 2/B827AF38 | 2/B827AF38 | 2/B827AF38 | 2/B827AF38
| 00:00:00.001277 | 00:00:00.002492 | 00:00:00.002493 | 0 | async | 2023-10-18 11:01:45.524472+02
(3 rows)
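A scripted check on the number of streaming clients can be built on the same view; the expected count of 3 is an assumption matching the example above:
[root@sre-em ~]# /usr/pgsql-14/bin/psql -U postgres -h 127.0.0.1 -t -A -c "select count(*) from pg_stat_replication where state='streaming'"
3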
WAL Files
WAL files are transferred from the master to the standby nodes to replicate the data stored in each table.
Checking the WAL files on the master and standby nodes provides an indication of the replication status. The content of the directory /var/lib/pgsql/14/data/pg_wal/ on the standby nodes must be the same as on the active nodes. You should also see the latest WAL file being updated. This can be checked using the following command:
[root@SRE-33-EM1 ~]# ls -ltr /var/lib/pgsql/14/data/pg_wal/
totale 573468
-rw------- 1 postgres postgres 343 20 gen 2023 00000009.history
-rw------- 1 postgres postgres 386 20 gen 2023 0000000A.history
-rw------- 1 postgres postgres 430 20 gen 2023 0000000B.history
-rw------- 1 postgres postgres 474 20 gen 2023 0000000C.history
-rw------- 1 postgres postgres 518 20 gen 2023 0000000D.history
-rw------- 1 postgres postgres 16777216 20 ago 03.17 0000000D00000002000000B9
-rw------- 1 postgres postgres 16777216 22 ago 22.56 0000000D00000002000000BA
-rw------- 1 postgres postgres 16777216 25 ago 23.36 0000000D00000002000000BB
-rw------- 1 postgres postgres 16777216 27 ago 03.15 0000000D0000000200000099
-rw------- 1 postgres postgres 16777216 27 ago 03.16 0000000D000000020000009A
-rw------- 1 postgres postgres 16777216 30 ago 10.41 0000000D000000020000009B
-rw------- 1 postgres postgres 16777216 2 set 16.11 0000000D000000020000009C
-rw------- 1 postgres postgres 16777216 3 set 03.48 0000000D000000020000009D
-rw------- 1 postgres postgres 16777216 3 set 03.49 0000000D000000020000009E
-rw------- 1 postgres postgres 16777216 6 set 16.08 0000000D000000020000009F
-rw------- 1 postgres postgres 16777216 10 set 02.20 0000000D00000002000000A0
-rw------- 1 postgres postgres 16777216 10 set 03.15 0000000D00000002000000A1
-rw------- 1 postgres postgres 16777216 10 set 03.17 0000000D00000002000000A2
-rw------- 1 postgres postgres 16777216 13 set 03.06 0000000D00000002000000A3
-rw------- 1 postgres postgres 16777216 15 set 08.31 0000000D00000002000000A4
-rw------- 1 postgres postgres 16777216 17 set 03.15 0000000D00000002000000A5
-rw------- 1 postgres postgres 16777216 17 set 03.17 0000000D00000002000000A6
-rw------- 1 postgres postgres 16777216 20 set 04.21 0000000D00000002000000A7
-rw------- 1 postgres postgres 16777216 23 set 02.08 0000000D00000002000000A8
-rw------- 1 postgres postgres 16777216 24 set 03.15 0000000D00000002000000A9
-rw------- 1 postgres postgres 16777216 24 set 03.17 0000000D00000002000000AA
-rw------- 1 postgres postgres 16777216 27 set 14.57 0000000D00000002000000AB
-rw------- 1 postgres postgres 16777216 30 set 00.36 0000000D00000002000000AC
-rw------- 1 postgres postgres 16777216 1 ott 03.15 0000000D00000002000000AD
-rw------- 1 postgres postgres 16777216 1 ott 03.17 0000000D00000002000000AE
-rw------- 1 postgres postgres 16777216 3 ott 20.01 0000000D00000002000000AF
-rw------- 1 postgres postgres 16777216 6 ott 17.22 0000000D00000002000000B0
-rw------- 1 postgres postgres 16777216 8 ott 03.15 0000000D00000002000000B1
-rw------- 1 postgres postgres 16777216 8 ott 03.18 0000000D00000002000000B2
-rw------- 1 postgres postgres 16777216 10 ott 16.41 0000000D00000002000000B3
-rw------- 1 postgres postgres 16777216 13 ott 04.37 0000000D00000002000000B4
-rw------- 1 postgres postgres 16777216 15 ott 03.15 0000000D00000002000000B5
-rw------- 1 postgres postgres 16777216 15 ott 03.17 0000000D00000002000000B6
-rw------- 1 postgres postgres 345 15 ott 03.17 0000000D00000002000000B6.00000028.backup
-rw------- 1 postgres postgres 16777216 17 ott 23.44 0000000D00000002000000B7
drwx------. 2 postgres postgres 4096 17 ott 23.46 archive_status
-rw------- 1 postgres postgres 16777216 18 ott 11.01 0000000D00000002000000B8
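A simple scripted freshness check is to verify that at least one WAL file has been modified recently; the following sketch is only an illustration and the 60-minute window is an assumption:
[root@SRE-33-EM1 ~]# find /var/lib/pgsql/14/data/pg_wal -maxdepth 1 -type f -mmin -60 | wc -l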
DB Disk Usage
Database sizes (in bytes) can be retrieved in the GUI Dashboard > Databases, or alternatively with the following query:
[root@localhost ~]# /usr/pgsql-14/bin/psql -U postgres -h 127.0.0.1 -c "select datname, pg_database_size(datname) from pg_database"
datname | pg_database_size
-----------------------------------------+------------------
postgres | 8979235
template1 | 8823299
template0 | 8823299
repmgr | 9257763
sre | 48988963
temp_test_delete_me_a | 9151267
temp_test_delete_me_b | 9118499
temp_test_delete_me_please_a | 9102115
temp_test_delete_me_please_b | 9077539
test_default_value_a | 9044771
test_default_value_b | 9044771
test_a | 9044771
test_b | 9044771
test2_a | 9044771
test2_b | 9044771
sss_a | 9093923
sss_b | 9044771
stirshaken_a | 9167651
stirshaken_b | 9044771
test_versioning_a | 9069347
test_versioning_b | 9044771
test_versioning_2_a | 9069347
test_versioning_2_b | 9069347
regression_test_08_12_22_a | 9224995
regression_test_08_12_22_b | 9143075
m247_lab_a | 9585443
m247_lab_b | 9044771
fuse2_voice_a | 9044771
fuse2_voice_b | 9044771
dm_validation_1_a | 9298723
dm_validation_1_b | 9044771
demo_a | 9110307
demo_b | 9044771
demo_doc_versioning_position_a | 9093923
demo_doc_versioning_position_b | 9044771
demo_doc_export_dm_diagram_a | 9044771
demo_doc_export_dm_diagram_b | 9069347
inventory_a | 9216803
inventory_b | 9044771
...
(61 rows)
Tablespace sizes (in bytes) can be retrieved with this query:
[root@localhost ~]# /usr/pgsql-14/bin/psql -U postgres -h 127.0.0.1 -c "select spcname, pg_tablespace_size(spcname) from pg_tablespace"
spcname | pg_tablespace_size
------------+--------------------
pg_default | 594550039
pg_global | 622368
(2 rows)
Database automatic switchover monitoring
SSH access is essential for performing a manual or automated cluster switchover. To check that all nodes are reachable via ssh in both directions, run the following command as the postgres user:
-bash-4.2$ /usr/pgsql-14/bin/repmgr cluster crosscheck
Name | ID | 1 | 2 | 3 | 4
------------+----+---+---+---+---
sre-em1 | 1 | * | * | * | *
sre-em2 | 2 | * | * | * | *
sre-cp1 | 3 | * | * | * | *
sre-cp2 | 4 | * | * | * | *
All cells should contain a *, meaning that a successful connection can be established between the servers.
To show the status of the repmgrd daemons and whether the automatic switchover is disabled (paused), run the following command as the postgres user:
-bash-4.2$ /usr/pgsql-14/bin/repmgr service status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+------------+---------+-----------+-----------+---------+---------+---------+--------------------
1 | sre-em1 | primary | * running | | running | 107904 | no | n/a
2 | sre-em2 | standby | running | sre-em1 | running | 1343 | no | 0 second(s) ago
3 | sre-cp1 | standby | running | sre-em1 | running | 6087 | no | 1 second(s) ago
4 | sre-cp2 | standby | running | sre-em1 | running | 3826938 | no | 1 second(s) ago
The same information is displayed in the Databases tab of the GUI Dashboard.
If the repmgrd daemon is not currently running on a node, connect to that node and execute the command:
[root@sre-cp ~]# systemctl start repmgr-14
Kamailio Monitoring
Process Monitoring (on CP nodes only)
Kamailio processes can be listed with the ps command. Their number should remain stable and they should not be continuously restarted (check the PIDs).
On each CP node:
[root@sre-cp ~]# ps -ef|grep kamailio|grep -v grep
kamailio 7992 1 0 lug20 ? 00:00:32 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8006 7992 0 lug20 ? 00:00:00 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8007 7992 0 lug20 ? 00:00:00 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8008 7992 0 lug20 ? 00:00:00 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8009 7992 0 lug20 ? 00:00:00 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8010 7992 0 lug20 ? 00:00:00 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8011 7992 0 lug20 ? 00:00:00 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8012 7992 0 lug20 ? 00:00:00 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8013 7992 0 lug20 ? 00:00:00 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8014 7992 0 lug20 ? 00:05:55 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8015 7992 0 lug20 ? 00:05:41 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8016 7992 0 lug20 ? 00:05:39 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8017 7992 0 lug20 ? 00:05:40 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8018 7992 0 lug20 ? 00:05:27 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8019 7992 0 lug20 ? 00:05:32 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8020 7992 0 lug20 ? 00:05:25 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8021 7992 0 lug20 ? 00:05:53 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8022 7992 0 lug20 ? 00:14:58 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8023 7992 0 lug20 ? 01:17:49 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8024 7992 0 lug20 ? 00:05:52 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8025 7992 0 lug20 ? 00:13:57 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8026 7992 0 lug20 ? 00:00:00 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8027 7992 0 lug20 ? 00:11:12 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8028 7992 0 lug20 ? 00:00:37 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8029 7992 0 lug20 ? 00:03:42 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8030 7992 0 lug20 ? 00:03:44 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8031 7992 0 lug20 ? 00:03:41 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8032 7992 0 lug20 ? 00:03:41 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8033 7992 0 lug20 ? 00:03:40 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8034 7992 0 lug20 ? 00:03:39 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8035 7992 0 lug20 ? 00:03:42 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8036 7992 0 lug20 ? 00:03:42 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
kamailio 8037 7992 0 lug20 ? 00:02:14 /usr/sbin/kamailio -DD -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg -m 64 -M 8
Detailed information about the role of each process can be obtained with the kamctl ps command.
[root@sre-cp ~]# kamctl ps
{
"jsonrpc": "2.0",
"result": [
{
"IDX": 0,
"PID": 7992,
"DSC": "main process - attendant"
}, {
"IDX": 1,
"PID": 8006,
"DSC": "udp receiver child=0 sock=127.0.0.1:5060"
}, {
"IDX": 2,
"PID": 8007,
"DSC": "udp receiver child=1 sock=127.0.0.1:5060"
}, {
"IDX": 3,
"PID": 8008,
"DSC": "udp receiver child=2 sock=127.0.0.1:5060"
}, {
"IDX": 4,
"PID": 8009,
"DSC": "udp receiver child=3 sock=127.0.0.1:5060"
}, {
"IDX": 5,
"PID": 8010,
"DSC": "udp receiver child=4 sock=127.0.0.1:5060"
}, {
"IDX": 6,
"PID": 8011,
"DSC": "udp receiver child=5 sock=127.0.0.1:5060"
}, {
"IDX": 7,
"PID": 8012,
"DSC": "udp receiver child=6 sock=127.0.0.1:5060"
}, {
"IDX": 8,
"PID": 8013,
"DSC": "udp receiver child=7 sock=127.0.0.1:5060"
}, {
"IDX": 9,
"PID": 8014,
"DSC": "udp receiver child=0 sock=10.0.161.182:5060"
}, {
"IDX": 10,
"PID": 8015,
"DSC": "udp receiver child=1 sock=10.0.161.182:5060"
}, {
"IDX": 11,
"PID": 8016,
"DSC": "udp receiver child=2 sock=10.0.161.182:5060"
}, {
"IDX": 12,
"PID": 8017,
"DSC": "udp receiver child=3 sock=10.0.161.182:5060"
}, {
"IDX": 13,
"PID": 8018,
"DSC": "udp receiver child=4 sock=10.0.161.182:5060"
}, {
"IDX": 14,
"PID": 8019,
"DSC": "udp receiver child=5 sock=10.0.161.182:5060"
}, {
"IDX": 15,
"PID": 8020,
"DSC": "udp receiver child=6 sock=10.0.161.182:5060"
}, {
"IDX": 16,
"PID": 8021,
"DSC": "udp receiver child=7 sock=10.0.161.182:5060"
}, {
"IDX": 17,
"PID": 8022,
"DSC": "slow timer"
}, {
"IDX": 18,
"PID": 8023,
"DSC": "timer"
}, {
"IDX": 19,
"PID": 8024,
"DSC": "secondary timer"
}, {
"IDX": 20,
"PID": 8025,
"DSC": "JSONRPCS FIFO"
}, {
"IDX": 21,
"PID": 8026,
"DSC": "JSONRPCS DATAGRAM"
}, {
"IDX": 22,
"PID": 8027,
"DSC": "ctl handler"
}, {
"IDX": 23,
"PID": 8028,
"DSC": "Dialog Clean Timer"
}, {
"IDX": 24,
"PID": 8029,
"DSC": "tcp receiver (generic) child=0"
}, {
"IDX": 25,
"PID": 8030,
"DSC": "tcp receiver (generic) child=1"
}, {
"IDX": 26,
"PID": 8031,
"DSC": "tcp receiver (generic) child=2"
}, {
"IDX": 27,
"PID": 8032,
"DSC": "tcp receiver (generic) child=3"
}, {
"IDX": 28,
"PID": 8033,
"DSC": "tcp receiver (generic) child=4"
}, {
"IDX": 29,
"PID": 8034,
"DSC": "tcp receiver (generic) child=5"
}, {
"IDX": 30,
"PID": 8035,
"DSC": "tcp receiver (generic) child=6"
}, {
"IDX": 31,
"PID": 8036,
"DSC": "tcp receiver (generic) child=7"
}, {
"IDX": 32,
"PID": 8037,
"DSC": "tcp main process"
}
],
"id": 7588
}
Kamailio Stats Monitoring
Stats about Kamailio internals can be displayed with the kamctl stats command. By default, it displays stats for all groups. Individual group stats can be retrieved with the kamctl stats <group> command (e.g. kamctl stats sl). Under normal operation, these counters should be increasing proportionally.
[root@sre-cp ~]# kamctl stats
{
"jsonrpc": "2.0",
"result": [
"app_python3:active_dialogs = 0",
"app_python3:early_dialogs = 0",
"app_python3:expired_dialogs = 118",
"app_python3:failed_dialogs = 6",
"app_python3:processed_dialogs = 76014",
"core:bad_URIs_rcvd = 0",
"core:bad_msg_hdr = 0",
"core:drop_replies = 0",
"core:drop_requests = 3",
"core:err_replies = 0",
"core:err_requests = 0",
"core:fwd_replies = 132759",
"core:fwd_requests = 1711587",
"core:rcv_replies = 419914",
"core:rcv_replies_18x = 65248",
"core:rcv_replies_1xx = 108873",
"core:rcv_replies_1xx_bye = 0",
"core:rcv_replies_1xx_cancel = 0",
"core:rcv_replies_1xx_invite = 108873",
"core:rcv_replies_1xx_message = 0",
"core:rcv_replies_1xx_prack = 0",
"core:rcv_replies_1xx_refer = 0",
"core:rcv_replies_1xx_reg = 0",
"core:rcv_replies_1xx_update = 0",
"core:rcv_replies_2xx = 310992",
"core:rcv_replies_2xx_bye = 106582",
"core:rcv_replies_2xx_cancel = 0",
"core:rcv_replies_2xx_invite = 83786",
"core:rcv_replies_2xx_message = 0",
"core:rcv_replies_2xx_prack = 0",
"core:rcv_replies_2xx_refer = 0",
"core:rcv_replies_2xx_reg = 0",
"core:rcv_replies_2xx_update = 0",
"core:rcv_replies_3xx = 0",
"core:rcv_replies_3xx_bye = 0",
"core:rcv_replies_3xx_cancel = 0",
"core:rcv_replies_3xx_invite = 0",
"core:rcv_replies_3xx_message = 0",
"core:rcv_replies_3xx_prack = 0",
"core:rcv_replies_3xx_refer = 0",
"core:rcv_replies_3xx_reg = 0",
"core:rcv_replies_3xx_update = 0",
"core:rcv_replies_401 = 0",
"core:rcv_replies_404 = 0",
"core:rcv_replies_407 = 0",
"core:rcv_replies_480 = 0",
"core:rcv_replies_486 = 0",
"core:rcv_replies_4xx = 48",
"core:rcv_replies_4xx_bye = 48",
"core:rcv_replies_4xx_cancel = 0",
"core:rcv_replies_4xx_invite = 0",
"core:rcv_replies_4xx_message = 0",
"core:rcv_replies_4xx_prack = 0",
"core:rcv_replies_4xx_refer = 0",
"core:rcv_replies_4xx_reg = 0",
"core:rcv_replies_4xx_update = 0",
"core:rcv_replies_5xx = 1",
"core:rcv_replies_5xx_bye = 0",
"core:rcv_replies_5xx_cancel = 0",
"core:rcv_replies_5xx_invite = 0",
"core:rcv_replies_5xx_message = 0",
"core:rcv_replies_5xx_prack = 0",
"core:rcv_replies_5xx_refer = 0",
"core:rcv_replies_5xx_reg = 0",
"core:rcv_replies_5xx_update = 0",
"core:rcv_replies_6xx = 0",
"core:rcv_replies_6xx_bye = 0",
"core:rcv_replies_6xx_cancel = 0",
"core:rcv_replies_6xx_invite = 0",
"core:rcv_replies_6xx_message = 0",
"core:rcv_replies_6xx_prack = 0",
"core:rcv_replies_6xx_refer = 0",
"core:rcv_replies_6xx_reg = 0",
"core:rcv_replies_6xx_update = 0",
"core:rcv_requests = 2039960",
"core:rcv_requests_ack = 83991",
"core:rcv_requests_bye = 112644",
"core:rcv_requests_cancel = 5",
"core:rcv_requests_info = 0",
"core:rcv_requests_invite = 76022",
"core:rcv_requests_message = 0",
"core:rcv_requests_notify = 0",
"core:rcv_requests_options = 1767298",
"core:rcv_requests_prack = 0",
"core:rcv_requests_publish = 0",
"core:rcv_requests_refer = 0",
"core:rcv_requests_register = 0",
"core:rcv_requests_subscribe = 0",
"core:rcv_requests_update = 0",
"core:unsupported_methods = 0",
"dns:failed_dns_request = 0",
"dns:slow_dns_request = 0",
"registrar:accepted_regs = 0",
"registrar:default_expire = 3600",
"registrar:default_expires_range = 0",
"registrar:expires_range = 0",
"registrar:max_contacts = 1",
"registrar:max_expires = 3600",
"registrar:rejected_regs = 0",
"shmem:fragments = 6",
"shmem:free_size = 64144248",
"shmem:max_used_size = 7896000",
"shmem:real_used_size = 2964616",
"shmem:total_size = 67108864",
"shmem:used_size = 2718232",
"sl:1xx_replies = 0",
"sl:200_replies = 0",
"sl:202_replies = 0",
"sl:2xx_replies = 0",
"sl:300_replies = 0",
"sl:301_replies = 0",
"sl:302_replies = 0",
"sl:3xx_replies = 0",
"sl:400_replies = 0",
"sl:401_replies = 0",
"sl:403_replies = 0",
"sl:404_replies = 0",
"sl:407_replies = 0",
"sl:408_replies = 0",
"sl:483_replies = 0",
"sl:4xx_replies = 0",
"sl:500_replies = 0",
"sl:5xx_replies = 4",
"sl:6xx_replies = 0",
"sl:failures = 0",
"sl:received_ACKs = 3",
"sl:sent_err_replies = 0",
"sl:sent_replies = 76018",
"sl:xxx_replies = 76014",
"tcp:con_reset = 0",
"tcp:con_timeout = 0",
"tcp:connect_failed = 0",
"tcp:connect_success = 0",
"tcp:current_opened_connections = 0",
"tcp:current_write_queue_size = 0",
"tcp:established = 0",
"tcp:local_reject = 0",
"tcp:passive_open = 0",
"tcp:send_timeout = 0",
"tcp:sendq_full = 0",
"tmx:2xx_transactions = 307515",
"tmx:3xx_transactions = 0",
"tmx:4xx_transactions = 12551",
"tmx:5xx_transactions = 0",
"tmx:6xx_transactions = 386",
"tmx:UAC_transactions = 0",
"tmx:UAS_transactions = 314069",
"tmx:active_transactions = 0",
"tmx:inuse_transactions = 0",
"tmx:rpl_absorbed = 43649",
"tmx:rpl_generated = 142170",
"tmx:rpl_received = 287155",
"tmx:rpl_relayed = 243506",
"tmx:rpl_sent = 385676",
"usrloc:location_contacts = 0",
"usrloc:location_expires = 0",
"usrloc:location_users = 0",
"usrloc:registered_users = 0"
],
"id": 8599
}
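To feed a specific counter into an external monitoring system, a single group can be queried and filtered; the counter below is only an example:
[root@sre-cp ~]# kamctl stats core | grep rcv_requests_invite
"core:rcv_requests_invite = 76022",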
Mongo Monitoring
Service monitoring
The Mongo service status can be retrieved with the following command:
[root@sre-cp ~]# systemctl status mongod
- mongod.service - MongoDB Database Server
Loaded: loaded (/usr/lib/systemd/system/mongod.service; enabled; vendor preset: disabled)
Active: active (running) since gio 2023-01-12 15:12:45 CET; 9 months 4 days ago
Docs: https://docs.mongodb.org/manual
Main PID: 1258 (mongod)
CGroup: /system.slice/mongod.service
1258 /usr/bin/mongod -f /etc/mongod.conf
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
Replica set status
The following commands provide the following information (amongst many others):
Whether the node belongs to a replica set
The other nodes belonging to that replica set
The node which is acting as primary
First enter the mongo CLI interface with:
[root@sre-cp ~]# mongo
MongoDB shell version v5.0.13
connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("31a2a49c-e4ad-4a5d-9764-898450fec607") }
MongoDB server version: 5.0.13
================
Warning: the "mongo" shell has been superseded by "mongosh",
which delivers improved usability and compatibility.The "mongo" shell has been deprecated and will be removed in
an upcoming release.
For installation instructions, see
https://docs.mongodb.com/mongodb-shell/install/
================
---
The server generated these startup warnings when booting:
2023-01-12T15:12:41.813+01:00: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine. See http://dochub.mongodb.org/core/prodnotes-filesystem
2023-01-12T15:12:44.786+01:00: Access control is not enabled for the database. Read and write access to data and configuration is unrestricted
2023-01-12T15:12:44.786+01:00: /sys/kernel/mm/transparent_hugepage/enabled is 'always'. We suggest setting it to 'never'
2023-01-12T15:12:44.786+01:00: /sys/kernel/mm/transparent_hugepage/defrag is 'always'. We suggest setting it to 'never'
---
---
Enable MongoDB's free cloud-based monitoring service, which will then receive and display
metrics about your deployment (disk utilization, CPU, operation statistics, etc).
The monitoring data will be available on a MongoDB website with a unique URL accessible to you
and anyone you share the URL with. MongoDB may use this information to make product
improvements and to suggest MongoDB products and deployment options to you.
To enable free monitoring, run the following command: db.enableFreeMonitoring()
To permanently disable this reminder, run the following command: db.disableFreeMonitoring()
---
sre_location:SECONDARY>
Then use the command rs.status() in the CLI:
sre_location:SECONDARY> rs.status()
{
"set" : "sre_location",
"date" : ISODate("2023-10-18T09:11:22.901Z"),
"myState" : 2,
"term" : NumberLong(9),
"syncSourceHost" : "10.0.161.183:27017",
"syncSourceId" : 2,
"heartbeatIntervalMillis" : NumberLong(2000),
"majorityVoteCount" : 3,
"writeMajorityCount" : 3,
"votingMembersCount" : 4,
"writableVotingMembersCount" : 4,
"optimes" : {
"lastCommittedOpTime" : {
"ts" : Timestamp(1697620274, 1),
"t" : NumberLong(9)
},
"lastCommittedWallTime" : ISODate("2023-10-18T09:11:14.341Z"),
"readConcernMajorityOpTime" : {
"ts" : Timestamp(1697620274, 1),
"t" : NumberLong(9)
},
"appliedOpTime" : {
"ts" : Timestamp(1697620274, 1),
"t" : NumberLong(9)
},
"durableOpTime" : {
"ts" : Timestamp(1697620274, 1),
"t" : NumberLong(9)
},
"lastAppliedWallTime" : ISODate("2023-10-18T09:11:14.341Z"),
"lastDurableWallTime" : ISODate("2023-10-18T09:11:14.341Z")
},
"lastStableRecoveryTimestamp" : Timestamp(1697620254, 1),
"electionParticipantMetrics" : {
"votedForCandidate" : true,
"electionTerm" : NumberLong(9),
"lastVoteDate" : ISODate("2023-08-04T14:02:09.274Z"),
"electionCandidateMemberId" : 2,
"voteReason" : "",
"lastAppliedOpTimeAtElection" : {
"ts" : Timestamp(1691157640, 1),
"t" : NumberLong(8)
},
"maxAppliedOpTimeInSet" : {
"ts" : Timestamp(1691157640, 1),
"t" : NumberLong(8)
},
"priorityAtElection" : 1,
"newTermStartDate" : ISODate("2023-08-04T14:02:13.753Z"),
"newTermAppliedDate" : ISODate("2023-08-04T14:02:25.595Z")
},
"members" : [
{
"_id" : 0,
"name" : "10.0.161.180:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 4606205,
"optime" : {
"ts" : Timestamp(1697620274, 1),
"t" : NumberLong(9)
},
"optimeDurable" : {
"ts" : Timestamp(1697620274, 1),
"t" : NumberLong(9)
},
"optimeDate" : ISODate("2023-10-18T09:11:14Z"),
"optimeDurableDate" : ISODate("2023-10-18T09:11:14Z"),
"lastAppliedWallTime" : ISODate("2023-10-18T09:11:14.341Z"),
"lastDurableWallTime" : ISODate("2023-10-18T09:11:14.341Z"),
"lastHeartbeat" : ISODate("2023-10-18T09:11:22.273Z"),
"lastHeartbeatRecv" : ISODate("2023-10-18T09:11:20.959Z"),
"pingMs" : NumberLong(0),
"lastHeartbeatMessage" : "",
"syncSourceHost" : "10.0.161.183:27017",
"syncSourceId" : 2,
"infoMessage" : "",
"configVersion" : 1,
"configTerm" : 9
},
{
"_id" : 1,
"name" : "10.0.161.182:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 24087522,
"optime" : {
"ts" : Timestamp(1697620274, 1),
"t" : NumberLong(9)
},
"optimeDate" : ISODate("2023-10-18T09:11:14Z"),
"lastAppliedWallTime" : ISODate("2023-10-18T09:11:14.341Z"),
"lastDurableWallTime" : ISODate("2023-10-18T09:11:14.341Z"),
"syncSourceHost" : "10.0.161.183:27017",
"syncSourceId" : 2,
"infoMessage" : "",
"configVersion" : 1,
"configTerm" : 9,
"self" : true,
"lastHeartbeatMessage" : ""
},
{
"_id" : 2,
"name" : "10.0.161.183:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 1588046,
"optime" : {
"ts" : Timestamp(1697620274, 1),
"t" : NumberLong(9)
},
"optimeDurable" : {
"ts" : Timestamp(1697620274, 1),
"t" : NumberLong(9)
},
"optimeDate" : ISODate("2023-10-18T09:11:14Z"),
"optimeDurableDate" : ISODate("2023-10-18T09:11:14Z"),
"lastAppliedWallTime" : ISODate("2023-10-18T09:11:14.341Z"),
"lastDurableWallTime" : ISODate("2023-10-18T09:11:14.341Z"),
"lastHeartbeat" : ISODate("2023-10-18T09:11:21.792Z"),
"lastHeartbeatRecv" : ISODate("2023-10-18T09:11:21.017Z"),
"pingMs" : NumberLong(0),
"lastHeartbeatMessage" : "",
"syncSourceHost" : "",
"syncSourceId" : -1,
"infoMessage" : "",
"electionTime" : Timestamp(1691157730, 1),
"electionDate" : ISODate("2023-08-04T14:02:10Z"),
"configVersion" : 1,
"configTerm" : 9
}
],
"ok" : 1,
"$clusterTime" : {
"clusterTime" : Timestamp(1697620274, 1),
"signature" : {
"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
"keyId" : NumberLong(0)
}
},
"operationTime" : Timestamp(1697620274, 1)
}
One cluster member should be in the PRIMARY state, while all the others should be in the SECONDARY state.
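If this check needs to be scripted by an external monitoring system, a minimal sketch (assuming the mongo shell is run locally and, as in the session above, access control is not enabled) is to evaluate rs.status() non-interactively and print each member's state:
[root@sre-cp ~]# mongo --quiet --eval 'rs.status().members.forEach(function(m) { print(m.name + " " + m.stateStr); })'
An alarm can then be raised if the output does not contain exactly one PRIMARY.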
Pacemaker Monitoring (if implemented)
An SRE implementation may include a clustering layer for the CP (kamailio) resources, obtained by means of a Pacemaker configuration.
When the CP Cluster is used, kamailio instances are not started directly through the kamailio service; instead, they are controlled by pcs. It is therefore important not to start kamailio instances with service commands, but to manage them through pcs commands. The pcs configuration provides paired resources:
VIP (Virtual IP used by one of the CP in the cluster)
SIP resource associated to a VIP
A Cluster can host multiple VIP+SIP resources, as long as each VIP and its associated SIP resource runs on the same node.
The kamailio resource agent uses SIPSAK to send SIP OPTIONS messages to the SRE Call Processing instances; internal SIP OPTIONS messages are therefore expected in the SRE, as they are used to poll resource availability.
To check the pcs configuration, run the following command:
[root@sre-cp ~]# pcs config show
To check the status of the configuration, as well as latest failure/timeout actions, run either:
[root@sre-cp ~]# pcs status
[root@sre-cp ~]# pcs cluster status
[root@sre-cp ~]# pcs status resources
Sample output:
[root@sre-cp1 ~]# pcs status
Cluster name: hacluster
Stack: corosync
Current DC: sre-cp2 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Wed Oct 18 11:16:08 2023
Last change: Thu Jul 20 16:21:13 2023 by root via cibadmin on sre-cp1
2 nodes configured
0 resource instances configured
Online: [ sre-cp1 sre-cp2 ]
Full list of resources:
Resource Group: Group1
ClusterIP1 (ocf::heartbeat:IPaddr2): Started sre-cp2
Kamailio1 (ocf::heartbeat:kamailio): Started sre-cp2
Resource Group: Group2
ClusterIP2 (ocf::heartbeat:IPaddr2): Started sre-cp1
Kamailio2 (ocf::heartbeat:kamailio): Started sre-cp1
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
Putting a CP node in standby
The following command puts the specified node into standby mode. The specified node is no longer able to host resources, and any resources currently active on the node will be moved to another node.
[root@sre-cp1 ~]# pcs cluster standby <node>
The following command removes the specified node from the standby mode.
[root@sre-cp1 ~]# pcs cluster unstandby <node>
The following command removes all nodes from the standby mode.
[root@sre-cp1 ~]# pcs cluster unstandby --all
Disabling a resource
The following command disables a resource. This command may be useful if we want to disable a virtual IP address.
[root@sre-cp1 ~]# pcs resource disable <resource_id>
The following command enables a resource. This command may be useful to bring a virtual IP address back into service.
[root@sre-cp1 ~]# pcs resource enable <resource_id>
Restarting a resource
The following command restarts a resource. This command may be useful if we want to restart kamailio (after modifying kamailio.cfg for example).
[root@sre-cp1 ~]# pcs resource restart <resource_id>
Moving a resource
The following command moves a resource. This command may be useful if we want to move a virtual IP address.
[root@sre-cp1 ~]# pcs resource move <resource_id> <node>
Moving a resource adds a location constraint to the configuration in the background. To remove this constraint, you need to specify the constraint ID (displayed through "pcs config show").
[root@sre-cp1 ~]# pcs constraint location remove <constraint ID>
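For example, using the resource and node names from the sample status output above (purely illustrative), moving the virtual IP ClusterIP1 to sre-cp1 and then removing the resulting constraint would look like this; the actual constraint ID must be taken from the pcs config show output:
[root@sre-cp1 ~]# pcs resource move ClusterIP1 sre-cp1
[root@sre-cp1 ~]# pcs config show
[root@sre-cp1 ~]# pcs constraint location remove <constraint ID>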
Cluster resources cleanup
If a resource has failed or a move action didn't succeed, a failure message appears when you display the cluster status. You can then clear that failure status with the pcs resource cleanup command. This command resets the resource status and failcount, telling the cluster to forget the operation history of a resource and re-detect its current state. The following command cleans up the resource specified by resource_id.
[root@sre-cp1 ~]# pcs resource cleanup <resource_id>
Note
If you do not specify a resource_id, this command resets the resource status and failcount for all resources, which results in a restart of all kamailio instances in the Cluster.
Deleting a resource
[root@sre-cp1 ~]# pcs resource delete <resource_id>
Removing a node
[root@sre-cp1 ~]# pcs cluster node remove <node>
Adding a node
[root@sre-cp1 ~]# pcs cluster node add <node>
Pcs backup and restore
Useful if a node is completely lost.
From a running node:
[root@sre-cp1 ~]# pcs cluster stop --all
[root@sre-cp1 ~]# pcs config backup backupfile
[root@sre-cp1 ~]# pcs config restore backupfile.pc.tar.bz2
[root@sre-cp1 ~]# pcs cluster start --all
[root@sre-cp1 ~]# pcs cluster enable --all
Log files
You can check log files at
/var/log/cluster/corosync.log
Troubleshooting
In case of a service outage, first go to the SRE GUI on any of the Element Manager nodes and check the statistics for each Call Processor node in the "Stats: Counters" tab of the Dashboard.
If you notice that the INVITE counter is equal to 0.00/sec on a Call Processor node where you expect traffic, connect via SSH to the Call Processor node and:
check the status of SRE process as described in section SRE Process Monitoring
check the status of the kamailio processes as described in section Kamailio Monitoring.
If you notice a high response.genericError counter on a Call Processor, check the status of the PostgreSQL process on the Call Processor node, as described in section PostgreSQL Monitoring.
If all processes are running correctly while the INVITE counter remains at zero, there is possibly no SIP traffic arriving on the Call Processor node. If you expect traffic to hit the CP, check for any incoming SIP traffic: you can display all SIP messages arriving on the interface eth0 with the following command (the list of available interfaces can be retrieved with tshark -D).
[root@sre-cp ~]# tshark -i eth0 -R sip
Running as user \"root\" and group \"root\". This could be dangerous.
Capturing on eth0
0.769872938 10.211.1.1 -> 10.210.1.5 SIP 386 Request: OPTIONS
sip:10.210.1.5:5060
0.778437740 10.210.1.5 -> 10.211.1.1 SIP 430 Status: 200 OK
0.994916328 10.211.1.1 -> 10.210.1.3 SIP 386 Request: OPTIONS
sip:10.210.1.3:5060
1.000359472 10.210.1.3 -> 10.211.1.1 SIP 430 Status: 200 OK
...
If sngrep is installed on the CP nodes, it is a valid graphical alternative to tshark for SIP traffic. Sngrep is a CLI-based tool for tracing SIP messages, which can be very useful when troubleshooting issues such as agent monitoring: if agent monitoring identifies an agent as down, you can check whether SIP OPTIONS messages are sent to and answered by that agent.
[root@sre-cp ~]# sngrep
If some processes are not running correctly on a Call Processor node, try to restart them.
First, check the status of the PostgreSQL cluster and restart it if it is stopped, using the command service postgresql-14 {start|stop|status|restart}. Logging information can be found in the file /var/lib/pgsql/14/data/pg_log/postgresql-<Day>.log.
Then check the status of the SRE using the command service sre status. On the Call Processor node, check that the sre-call-processor process is RUNNING. Restart the SRE if needed, using the command service sre {start|stop|status|restart}. Logging information can be found in /var/log/sre/.
Then check the status of Kamailio. If it is stopped, restart it using the command service kamailio {start|stop|status|restart}. Logging information can be found in /var/log/messages.
Kamailio should listen on UDP port 5060 (or a different port if configured in the kamailio.cfg files) for SIP requests, so make sure that the CP is listening on the expected address:port
[root@sre-cp ~]# netstat -anu
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
udp 0 0 10.0.161.182:42235 0.0.0.0:*
udp 0 0 10.0.161.182:5405 0.0.0.0:*
udp 0 0 127.0.0.1:48531 0.0.0.0:*
udp 0 0 0.0.0.0:53 0.0.0.0:*
udp 0 0 10.0.161.182:55379 0.0.0.0:*
udp 0 0 127.0.0.1:323 0.0.0.0:*
udp 0 0 10.0.161.182:5060 0.0.0.0:*
udp 0 0 127.0.0.1:5060 0.0.0.0:*
udp6 0 0 ::1:323 :::*
udp6 0 0 ::1:59940 ::1:59940 ESTABLISHED
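To quickly narrow the listing down to the SIP port, the same output can be filtered; with the example above this yields:
[root@sre-cp ~]# netstat -anu | grep ":5060"
udp 0 0 10.0.161.182:5060 0.0.0.0:*
udp 0 0 127.0.0.1:5060 0.0.0.0:*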
For the whole chain to be ready to handle calls, the sre-broker should be listening on TCP port 5555 for the requests originated by Kamailio that trigger the interface to the SRE, and the PostgreSQL cluster should be listening on TCP port 5432
[root@sre-cp ~]# netstat -ant
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 0.0.0.0:5432 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:8090 0.0.0.0:* LISTEN
tcp 0 0 10.0.161.182:5060 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:5060 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:9001 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:27017 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:6666 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:6000 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:5555 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:10004 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:53 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
tcp 0 0 10.0.161.182:48764 10.0.161.182:27017 ESTABLISHED
tcp 0 0 127.0.0.1:50198 127.0.0.1:9001 TIME_WAIT
tcp 0 0 127.0.0.1:49848 127.0.0.1:9001 TIME_WAIT
tcp 0 0 127.0.0.1:48462 127.0.0.1:6666 ESTABLISHED
tcp 0 0 10.0.161.182:39652 10.0.161.181:10000 ESTABLISHED
tcp 0 0 10.0.161.182:39386 10.0.161.183:27017 ESTABLISHED
tcp 0 0 127.0.0.1:48452 127.0.0.1:6666 ESTABLISHED
tcp 0 0 10.0.161.182:45534 10.0.161.180:10000 ESTABLISHED
tcp 0 0 127.0.0.1:5555 127.0.0.1:33690 ESTABLISHED
...
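Here again, the listing can be filtered on the two ports of interest; with the example above this yields:
[root@sre-cp ~]# netstat -ant | grep -E ":5555|:5432" | grep LISTEN
tcp 0 0 0.0.0.0:5432 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:5555 0.0.0.0:* LISTEN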
SRE service issues
At the application level, you might encounter issues related to the expected execution of a Service Logic. Such issues might be caused, for example, by misconfiguration of one or more nodes in a Service Logic, or by missing data in the user's data (Data Administration) used by the logic.
The Service Logic stats (Dashboard and Counters) will allow you to understand the size of the issue, namely how many times the response is an SRE-generated error or how often the logic traverses a node. At some point, you will need to either trace a call where the issue appears or reproduce it through the SIP Simulation. Both methods are suitable for understanding the exact part of the logic (node or group of nodes) that must be modified in order to obtain the desired behavior.
In order to activate tracing-flow traces, two conditions must be met:
the log-level of "Call tracing service logic flow" must be set to DEBUG
the Tracing criteria (calling and called ranges) must match the ones of the call
Note
While this is not a problem in lab environments, in production networks the tracing capability will reduce CP performance. It is therefore recommended not to activate it in high-traffic conditions, and to limit the Tracing criteria to match exactly the calling/called ranges of interest.
When a Trace is produced, the tracing flow logs are available:
in the GUI, on the active Service Logic (and its sub-service logics), under the Trace tab
in the CLI, in the log file /var/log/sre/sre.log: you can grep on the string "tracing.flow"
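For example, the most recent tracing-flow entries can be extracted from the log file on the EM with a simple grep (shown here limited to the last 20 matches):
[root@sre-em ~]# grep "tracing.flow" /var/log/sre/sre.log | tail -n 20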
As an example of an issue that can be noticed at the application level, 604 responses from the SRE are the result of service logic exceptions in the SRE (e.g. a query node failing due to unexpected inputs/outputs). To isolate these errors, activate tracing for the calls that end up in a 604 message and check /var/log/sre/sre.log, which indicates the exception and the latest node traversed (where the exception occurs). Further information might be obtained from /var/log/sre/sre-call-processor.out.log on the CP.
SRE Logs
SRE provides application logs per-channel, that is, per functionality, which are available in the EM and CP nodes, at /var/log/sre:
Log type | File (in /var/log/sre) | Node type |
---|---|---|
Generic logs (including tracing logs collected from the CP) | sre.log | EM |
Accounting | accounting.log | EM |
CDR Sender | sre-cdr-sender.out.log | CP |
CDR Collector | sre-cdr-collector.out.log | EM |
CDR Post-processing | accounting-post-processing.log | EM |
Audit | audit.log | EM |
Service Logic Execution | service-logic-execution.log | EM |
GUI logs | sre-gui.out.log | EM |
Health monitor | sre-health-monitor.out.log | EM and CP |
Manager | sre-manager.out.log | EM |
REST API | sre-REST.out.log | EM |
Supervisord | supervisord.log | EM and CP |
ENUM processor | sre-enum-processor.out.log | CP |
HTTP processor | sre-http-processor.out.log | CP |
Interface (between Kamailio and SRE core) | interface.log | CP |
SIP Agents monitor | sre-agents-monitor.out.log | CP |
Broker | sre-broker.out.log | CP |
Logs are rotated daily and kept in a 7-day circular buffer.
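For example, to follow the broker log on a CP node in real time:
[root@sre-cp ~]# tail -f /var/log/sre/sre-broker.out.log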
Nodes Operational Status
Putting a CP in out of service
The page *System -> Nodes Operational Status* allows the operator to modify the operational state of a CP node. For each node, the operator can put the node in service (default) or out of service.
If, for any reason, the GUI is not available or the setting cannot be saved (e.g. no master PostgreSQL instance, ...), this setting can be overridden by creating an empty file named /tmp/cp.oos on the CP node to disable (e.g. touch /tmp/cp.oos). The presence of this file always takes priority over the configuration setting.
Once a node is put out of service, it answers both INVITE requests and OPTIONS requests with a SIP response 503 Service Unavailable.
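As a quick reference, forcing a CP out of service with the file override and later returning it to its configured state can be done as follows (removing the file is assumed to be the inverse operation):
[root@sre-cp ~]# touch /tmp/cp.oos
[root@sre-cp ~]# rm -f /tmp/cp.oos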
Backup & Restore Procedure
The purpose of this section is to describe the manual backup procedure. We also explain how to restore the master PostgreSQL cluster from the backup.
It is important to note that the backup and the restore procedures can only be applied on the master PostgreSQL cluster.
Several backup & restore procedures are available:
Full database backup: This method creates a DB backup including all data and configuration (e.g. system configuration, privileges, ...). Upon restore, the other servers must be re-synchronized from the master server.
Database dumps: This method creates one SQL dump of the data per database (including the SRE database and the services databases). Upon restore, only the specified database is restored. There is no need to resynchronize from the master server.
Node re-synchronization: In the event that a standby server must be restored and that the master server is available, data is re-synchronized from the master server.
Choosing the Best Backup Strategy
Each backup method features different advantages, as described in the table below.
Method | Granularity | Backup Speed | Restore Speed |
---|---|---|---|
Full database backup | (full system) | ++ | ++ |
Database dumps | ++ (database-based) | + | + |
Node re-synchronisation | - (current re-synchronisation master) | N/A | ++ |
In case of server issue, as long as the master DB server is available, the server DB can be restored from the master server by performing a node re-synchronisation.
In case of human error, where data has been affected on all servers through replication, the database dumps offer a way to restore the specific DB (be it system configuration or services data) where the error occurred. The time-accuracy of the restore depends on how often these dumps are performed.
In case of loss of all servers or if restore speed is a concern, then the full database backup may offer the best option to restore the DB. The time-accuracy of the restore depends on how often these backups are performed.
Full Database Backup
The master PostgreSQL cluster is periodically and automatically backed up.
Nevertheless, on some occasions, the Operator may want to execute a manual backup. This operation can be safely executed while the database is running.
Backup Procedure
Note
It is not required to back up the standby PostgreSQL nodes, as they can be recovered at any time from a master PostgreSQL node by cloning them with the repmgr tool.
To do so, connect as the postgres user on the master PostgreSQL cluster and use the pg_basebackup command to create a backup. The backup is saved as a tar.gz file containing the contents of the /var/lib/pgsql/14/data directory. The option -D specifies the directory receiving the base backup. In the example below, the tar gzip files are stored in the directory backup-20230212.
[root@sre-em ~]# su - postgres
-bash-4.1$ pg_basebackup -h <master EM ip address> -U repmgr -F t -z -X f -D backup-20230212/
-bash-4.1$ ls backup-20230212/
base.tar.gz
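As a quick sanity check (not part of the official procedure), you can verify that the resulting archive is readable by listing its first entries:
-bash-4.1$ tar -tzf backup-20230212/base.tar.gz | head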
Restore Procedure
A backup can be used to restore the content of the PostgreSQL database.
Note
The restore procedure should only be used when the complete cluster must be recovered. If a single node must be recovered and a master PostgreSQL node is available, this node can be more easily recovered by using the repmgr tool to clone its database content from the current master PostgreSQL instance.
Before proceeding with the restore operation, stop the PostgreSQL server.
[root@sre-em ~]# systemctl stop sre
[root@sre-em ~]# systemctl stop postgresql-14
As postgres user, delete the data in the PostgreSQL root directory /var/lib/pgsql/14/data/.
[root@sre-em ~]# rm -rf /var/lib/pgsql/14/data/*
Then, as user postgres, extract the backup archive in the data directory.
[root@sre-em ~]# su - postgres
-bash-4.1$ cd /var/lib/pgsql/14/data/
-bash-4.1$ tar -zxvf /var/lib/pgsql/backup-20230212/base.tar.gz
Restart the PostgreSQL cluster and the sre software.
[root@sre-em ~]# systemctl start postgresql-14
[root@sre-em ~]# systemctl start sre
Database Dumps
Database dumps can be performed on a running system. They can also be restored on a live system and the replication will pick up the modifications and stream them to the standby servers.
These dumps are performed automatically but can also be performed manually.
Backup Procedure
The backup can be performed by executing the pg_dump command and indicating the DB name to dump. This DB can either be the system DB (sre) holding the system configuration or a service DB (<service-name> suffixed with _a or _b, depending on the version).
In this example, a backup of the DB mix_a (i.e. service mix, version A) is performed and stored in the file /data/sre/backup/em1/db/dump/manual_backup:
[root@sre-em ~]# pg_dump -h <master EM ip address> -U repmgr -c mix_a > /data/sre/backup/em1/db/dump/manual_backup
The produced file contains the list of SQL statements to remove the current schema, create a new one and insert the data.
Restore Procedure
The restore of a single DB can be performed by launching the psql command so that it executes the SQL statements from the backup file previously created. This procedure can be executed on a live system.
In this example, the manual backup file previously created is used to rebuild the schema and data for the DB mix_a:
[root@sre-em ~]# psql -h <master EM ip address> -U repmgr mix_a < /data/sre/backup/em1/db/dump/manual_backup
Full system backup and restore procedure
A full system backup is only useful in specific circumstances, namely when you need to rebuild a VM from an empty system and a snapshot of the virtual environment is not available.
Creating a full backup file
Note 1: Once the backup is created, download and store it somewhere external to the server!
Note 2: if the tar command fails because a file/directory is not present, check which file is missing and adjust the command (or remove that part if it is not needed).
Note 3: make sure you create the backup file in a partition that has enough disk space (1-2 GB).
Before launching the command, replace the placeholder parts (such as <backup_path_filename> and <crontabfile>) with the actual directories/files relevant to your case. A specific deployment might not include all the components referenced in this document (for instance Mongo or rsyslog). In that case, for the command to work properly, remove the parts of the backup command related to the missing components.
- The command to create a backup on an EM node:
[root@sre-em ~]# tar zcf <backup_path_filename>.tar.gz /opt/sre/ \
/data/sre/accounting /etc/mongod.conf \
/etc/cron.d/<crontabfile> /var/log/sre/ \
/etc/repmgr/14/repmgr.conf /var/lib/pgsql/14/data/pg_hba.conf \
/var/lib/pgsql/14/data/postgresql.conf \
/data/sre/db/backups/<node>*/ \
/etc/sysconfig/network-scripts/ifcfg-eth*
- The command to create a backup on a CP node:
[root@sre-cp ~]# tar zcf <backup_path_filename>.tar.gz /opt/sre/ \
/data/sre/accounting /etc/mongod.conf \
/etc/cron.d/<crontabfile> /var/log/sre/ \
/etc/repmgr/14/repmgr.conf /etc/sysconfig/network-scripts/ifcfg-eth* \
/etc/kamailio/kamailio.cfg
Comments:
- /data/sre/db/backups/<node>*/: backup folder (can be different in some deployments; check the crontab in /etc/cron.d/<sre crontab file>)
- /opt/sre/: SRE sw and configuration files
- /data/sre/accounting: CDRs
- /etc/mongod.conf: MongoDB config (if applicable)
- /etc/kamailio/kamailio.cfg: Kamailio config
- /etc/cron.d/*: crontab files that were configured for this host
- /var/log/sre/: SRE logs
- /etc/repmgr/14/repmgr.conf: Replication Manager config
- /var/lib/pgsql/14/data/pg_hba.conf: PostgreSQL access config
- /etc/rsyslog.conf: Rsyslog config (if applicable)
Fully restoring a server
Note
The restore procedure from a full backup should only be used when the complete cluster must be recovered. If a single node must be recovered and a master PostgreSQL node is available, this node should be recovered by using the repmgr tool to clone its database content from the current master PostgreSQL instance.
If the master node is down and cannot be recovered in an acceptable timeframe, the suggestion is to proceed with a Master Switchover (see here) and re-synchronize the failed node once it is available again.
Make sure the host has CentOS/RedHat running and meets all the requirements to run the SRE.
Adjust IP, DNS and NTP settings (check that the date is the same as on the other nodes). Add the following directory needed by MongoDB (if applicable to your deployment):
# mkdir /data/sre/location
# chown mongod.mongod /data/sre/location
# systemctl restart mongod
Verify that the postgres databases for versions A and B of the service are there; if not, create them (e.g. <service>_a and <service>_b):
# su - postgres
> psql
postgres=# create database sre owner sre;
postgres=# create database <service>_a owner sre;
postgres=# create database <service>_b owner sre;
postgres=# \q
You need to be root and positioned in / to launch the restore:
# cd /
This will restore all files in the original directories:
# tar -zxvf /data/<backup_filename>
Then either:
a. if you are restoring a postgres standby node, you need to re-synch the node using the Node Re-Synchronisation procedure.
b. if it is the postgres master node that you are attempting to restore, follow the Restore Procedure from the Monitor SRE.
In that case, the steps mirror the Restore Procedure above: stop the services, delete the data in the PostgreSQL root directory /var/lib/pgsql/14/data/, extract the database backup as user postgres, and restart the PostgreSQL cluster and the sre software.
[root@sre-em ~]# systemctl stop sre
[root@sre-em ~]# systemctl stop postgresql-14
Delete the content of the data directory before re-synching, then extract the backup as user postgres:
[root@sre-em ~]# rm -rf /var/lib/pgsql/14/data/*
[root@sre-em ~]# su - postgres
-bash-4.1$ cd /var/lib/pgsql/14/data/
-bash-4.1$ tar -zxvf /var/lib/pgsql/backup-20230415/base.tar.gz \...
Start postgresql and SRE:
[root@sre-em ~]# systemctl start postgresql-14
[root@sre-em ~]# systemctl start sre
For MongoDB (in case you use it on that node for CAC or Registrar), restoring /etc/mongod.conf should be sufficient for the platform to re-synch the restored node from the primary node, assuming a majority (half + 1) of the nodes are available after the restore.
At the end of the re-synch, the restored node's mongo instance will appear as SECONDARY, since one of the other previously available nodes has been promoted to PRIMARY.
Node Re-Synchronisation
Node re-synchronization is the preferred way to recover a standby node. As the node will be cloned from a master node, it ensures that the data is up-to-date and that the standby node immediately starts replicating from the master.
Backup Procedure
As the failed server is recovered from the master server, there is no specific backup operation to perform in advance. The failed machine can be recovered or re-installed from scratch using the SRE Installation Guide, and then restored by following the restore procedure below.
Restore Procedure (via DB Clone)
Prerequisite: the operating system is installed, the base packages needed by the SRE are installed (e.g. postgres, repmgr, mongo, ...), and the SRE software is installed.
Stop the SRE service, then PostgreSQL.
[root@sre-cp ~]# systemctl stop sre
[root@sre-cp ~]# systemctl stop postgresql-14
Delete all content of the main PostgreSQL directory.
[root@sre-cp ~]# rm -rf /var/lib/pgsql/14/data/*
Clone the data from the master node, as user postgres. The parameter -h indicates the IP address of the master node.
[root@sre-cp ~]# su - postgres
-bash-4.1$ /usr/pgsql-14/bin/repmgr -h 10.0.10.45 -F -U repmgr -d repmgr -f /etc/repmgr/14/repmgr.conf standby clone
[2023-04-16 17:43:56] [NOTICE] Redirecting logging output to '/var/log/repmgr/repmgr-14.log'
-bash-4.1$ cat /var/log/repmgr/repmgr-14.log
[2023-04-14 17:52:05] [NOTICE] setting data directory to: /var/lib/pgsql/14/data
[2023-04-14 17:52:05] [HINT] use -D/--data-dir to explicitly specify a data directory
[2023-04-14 17:52:05] [NOTICE] starting backup (using pg_basebackup)...
[2023-04-14 17:52:05] [HINT] this may take some time; consider using the -c/--fast-checkpoint option
[2023-04-14 17:52:22] [NOTICE] standby clone (using pg_basebackup) complete
[2023-04-14 17:52:22] [NOTICE] you can now start your PostgreSQL server
[2023-04-14 17:52:22] [HINT] for example : /etc/init.d/postgresql start
Restart PostgreSQL, then SRE.
[root@sre-cp ~]# systemctl start postgresql-14
[root@sre-cp ~]# systemctl start sre
Next, you need to force the node registration with the following command; the parameter -h indicates the IP address of the master node.
[root@sre-cp ~]# su - postgres
-bash-4.1$ /usr/pgsql-14/bin/repmgr -f /etc/repmgr/14/repmgr.conf -h 10.0.10.45 -U repmgr -d repmgr standby register --force
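After the registration, you can verify that the node appears as a running standby by using the same cluster show command used elsewhere in this document, still as user postgres:
-bash-4.1$ /usr/pgsql-14/bin/repmgr -f /etc/repmgr/14/repmgr.conf cluster show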
PostgreSQL Cluster Switchover
Automatic switchover
Automatic switchover is enabled by default. To check whether automatic switchover is enabled, run the following command as user postgres:
[postgres@sre4-em1 ~]$ /usr/pgsql-14/bin/repmgr service status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+------+---------+-----------+----------+---------+-------+---------+--------------------
1 | em1 | primary | * running | | running | 24710 | no | n/a
2 | em2 | standby | running | em1 | running | 2365 | no | 0 second(s) ago
3 | cp1 | standby | running | em1 | running | 1936 | no | 1 second(s) ago
4 | cp2 | standby | running | em1 | running | 2373 | no | 0 second(s) ago
A value of no in the Paused? column indicates that automatic switchover is active.
Disable automatic switchover
In order to disable automatic switchover, run the following command as user postgres:
/usr/pgsql-14/bin/repmgr service pause
Enable automatic switchover
In order to re-enable automatic switchover, run the following command as user postgres:
/usr/pgsql-14/bin/repmgr service unpause
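In both cases, the change can be verified by re-running the status command shown above; the Paused? column reads yes while paused and no once automatic switchover is active again:
[postgres@sre4-em1 ~]$ /usr/pgsql-14/bin/repmgr service status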
Performing a manual switchover
Connect to the standby EM node and run the following command to check that all the preconditions are met:
/usr/pgsql-14/bin/repmgr standby switchover --siblings-follow --dry-run
If no errors are shown, you can run the same command without the --dry-run flag.
/usr/pgsql-14/bin/repmgr standby switchover --siblings-follow
Rebooting the master postgres node
If a reboot of the master postgres node is needed for maintenance and automatic switchover is enabled, first pause it by following this procedure and then reboot the server.
If automatic switchover is disabled, a reboot can be performed without requiring any additional commands.
Rebooting the standby postgres node
A reboot can be performed without requiring any additional commands in any case.
Failure of the master postgres node
With automatic switchover
If automatic switchover is enabled, nothing needs to be done on the new master.
When the old master is recovered, its database is no longer synchronized with the new master.
The repmgr cluster show command will display output similar to the following (in this case em1 is the failed master and em2 is the new master):
-bash-4.2$ /usr/pgsql-14/bin/repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+------+---------+----------------------+----------+----------+----------+----------+---------------------------------------------
1 | em1 | primary | * running | | default | 100 | 11 | host=10.0.161.180 dbname=repmgr user=repmgr
2 | em2 | standby | ! running as primary | | default | 100 | 12 | host=10.0.161.181 dbname=repmgr user=repmgr
3 | cp1 | standby | running | ! em2 | default | 0 | 12 | host=10.0.161.182 dbname=repmgr user=repmgr
4 | cp2 | standby | running | ! em2 | default | 0 | 12 | host=10.0.161.183 dbname=repmgr user=repmgr
To rejoin the failed master (em1) run the following commands on that node:
[root@sre-em1 ~]$ systemctl stop postgresql-14
[root@sre-em1 ~]$ su - postgres
-bash-4.2$ /usr/pgsql-14/bin/repmgr node rejoin -d "host=<address of master> dbname=repmgr user=repmgr" --config-files=postgresql.local.conf,postgresql.conf --verbose --force-rewind
-bash-4.2$ exit
[root@sre-em1 ~]$ systemctl start postgresql-14
Without automatic switchover
If automatic switchover is disabled and the current master PostgreSQL instance is no longer available and cannot be restored in a sensible time, a standby PostgreSQL instance (usually the standby EM) can be promoted to master. This does not affect the application service and restores the possibility to provision the system.
Note
In order to promote a standby node to master and instruct the other nodes to follow the new master, it is critical to ensure that the node that was previously master stays down until all operations have been carried out. It is also important that the former master is not restored while carrying out this procedure, since at no time can there be more than one master node from which the standby nodes replicate.
If the platform is prepared with ssh keys exchanged between EMs and CPs for the postgres user (so that an EM can connect to a CP using ssh keys), then the promote command can at the same time instruct all CP nodes to follow the new EM master. This is the suggested procedure and is described here.
Alternatively, without ssh keys it is still possible for CP nodes to follow the new master, although this requires an explicit action on each CP node (see here).
Procedure with CP nodes automatically following
Connect to the standby to-become-master node (usually the standby EM) as the postgres user. You can first do a dry run to ensure that all the prerequisites are in place.
[root@sre-em ~]# su - postgres
bash-4.2$ /usr/pgsql-14/bin/repmgr -f /etc/repmgr/14/repmgr.conf standby promote --siblings-follow --dry-run
INFO: node is a standby
INFO: no active primary server found in this replication cluster
INFO: all sibling nodes are reachable via SSH
INFO: 2 walsenders required, 10 available
INFO: node will be promoted using the "pg_promote()" function
INFO: prerequisites for executing STANDBY PROMOTE are met
If the test is OK and no errors are shown, you can proceed with the real standby promotion by running the same command without the --dry-run flag.
[root@sre-em ~]# su - postgres
bash-4.1$ /usr/pgsql-14/bin/repmgr -f /etc/repmgr/14/repmgr.conf standby promote --siblings-follow
NOTICE: promoting standby to primary
DETAIL: promoting server "sre-em" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "sre-em" (ID: 2) was successfully promoted to primary
NOTICE: executing STANDBY FOLLOW on 2 of 2 siblings
INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes
After promoting the server to master role, you can check that all CP nodes are following the new master by performing:
[root@em2 ~]# su - postgres
-bash-4.2$ /usr/pgsql-14/bin/repmgr -f /etc/repmgr/14/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+------+---------+-----------+----------+----------+----------+----------+---------------------------------------------
1 | em1 | primary | failed | ? | default | 100 | | host=10.0.161.180 dbname=repmgr user=repmgr
2 | em2 | standby | * running | | default | 100 | 13 | host=10.0.161.181 dbname=repmgr user=repmgr
3 | cp1 | standby | running | em2 | default | 0 | 13 | host=10.0.161.182 dbname=repmgr user=repmgr
4 | cp2 | standby | running | em2 | default | 0 | 13 | host=10.0.161.183 dbname=repmgr user=repmgr
WARNING: following issues were detected
- unable to connect to node "em1" (ID: 1)
HINT: execute with --verbose option to see connection error messages
After promoting the server to the master role, two masters remain registered in the repmgr metadata; the old master is marked as inactive/failed (node em1, ID 1), as shown in the cluster view above.
It is recommended to also restart the SRE software on CPs so that the database connection pool is re-initialized.
Proceed with restart of SRE on the new master with:
[root@sre-em2 ~]# systemctl restart sre
When the inactive node em1 becomes available again, follow the Node Re-Synchronisation procedure in order to restore it in the DB cluster.
Procedure with manual CP follow
Note
This procedure is not recommended for larger deployments, since the follow command must be launched on every standby node.
Connect on the standby node (usually the standby EM) as postgres user and promote it as master.
[root@sre-em2 ~]# su - postgres
-bash-4.1$ /usr/pgsql-14/bin/repmgr -f /etc/repmgr/14/repmgr.conf standby promote
Proceed with restart of SRE on the new master with:
[root@sre-em2 ~]# service sre restart
At this point, all standby nodes must be instructed to follow the new master.
On the CP nodes:
[root@sre-cp ~]# su - postgres
-bash-4.1$ /usr/pgsql-14/bin/repmgr -f /etc/repmgr/14/repmgr.conf standby follow
The command instructs PostgreSQL to restart and follow the new master node. It is recommended to also restart the SRE software so that the database connection pool is re-initialized. Interruption of service can be minimized by isolating the CP nodes one at a time.
When the inactive node em1 becomes available again, follow the Node Re-Synchronisation procedure in order to restore it in the DB cluster.
Node and Site isolation
To isolate a node, or an entire site, from database updates, the admin needs to reconfigure the permissions linked to the PostgreSQL database replication, so that the affected node / site no longer receives updates.
Also, if needed, the services (SIP / ENUM / HTTP) can be stopped so that the CP does not reply to such requests.
For the database isolation, the admin must edit the file /var/lib/pgsql/14/data/pg_hba.conf on the master EM node and set the lines applicable to replication and repmgr for the affected nodes to "reject".
For example, to isolate the CP node with IP = 10.0.161.64, the pg_hba.conf should contain these lines:
host all sre 10.0.161.64/32 reject
host replication repmgr 10.0.161.64/32 reject
host repmgr repmgr 10.0.161.64/32 reject
The same applies to entire subnets, in order to isolate a full site.
The operation requires a restart of postgres on the master EM node:
[root@localhost ~]# systemctl restart postgresql-14
From the restart, any change done on the master EM node is not replicated on the isolated node / site.
To lift the node/site isolation and restore normal system operations, the admin must set the replication and repmgr lines back to "trust" on the master EM and restart postgres.
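For example, assuming the node isolated above (10.0.161.64) and that the original entries used the trust method, the restored pg_hba.conf lines would look like this, followed by the postgres restart on the master EM:
host all sre 10.0.161.64/32 trust
host replication repmgr 10.0.161.64/32 trust
host repmgr repmgr 10.0.161.64/32 trust
[root@localhost ~]# systemctl restart postgresql-14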
Data Version Management
All SRE user data databases, created in the form of Data Models, are stored in two versions: A and B. This versioning system allows the operator to select the active version of the data used for call processing, or to provision the data version not in service without affecting the service to subscribers. By default, version A is the active version.
Version Selection
All data versioning is managed from the GUI in the System -> Data Versioning page.
In order to change the data version used for call processing, GUI or provisioning, the Data Lock tab must be used: locking the data is required before the current active data version for call processing, GUI or provisioning can be changed.
If data is locked, the *Data Version Selection* page allows the operator to select the active data version, which is used for call processing. The GUI and provisioning can either be directed to the active version of the data or to the standby version. This selection is performed per individual service.
The version selection logic (per service) is:
If data is unlocked:
If active version is A:
Call processing uses version A of the data
If GUI version is Active:
- GUI changes are directed to version A and data modification is allowed
If GUI version is Standby:
- GUI changes are directed to version B and data modification is allowed
if provisioning version is Active:
- batch provisioning is directed to version A and data modification is allowed
if provisioning version is Standby:
- batch provisioning is directed to version B and data modification is allowed
if REST API version is Active
- REST API to <DB>/active/<table> is directed to version A and data modification is allowed
if REST API version is Standby
- REST API to <DB>/standby/<table> is directed to version B and data modification is allowed
If active version is B:
Call processing uses version B of the data
If GUI version is Active:
- GUI changes are directed to version B and data modification is allowed
If GUI version is Standby:
- GUI changes are directed to version A and data modification is allowed
if provisioning version is Active:
- batch provisioning is directed to version B and data modification is allowed
if provisioning version is Standby:
- batch provisioning is directed to version A and data modification is allowed
if REST API version is Active
- REST API to <DB>/active/<table> is directed to version B and data modification is allowed
if REST API version is Standby
- REST API to <DB>/standby/<table> is directed to version A and data modification is allowed
If data is locked:
If active version is A:
Call processing uses version A of the data
If GUI version is Active:
- GUI changes are directed to version A and data modification is allowed
If GUI version is Standby:
- GUI changes are directed to version B and data modification is allowed
if provisioning version is Active:
- batch provisioning is blocked
if provisioning version is Standby:
- batch provisioning is directed to version B and data modification is allowed
if REST API version is Active
- REST API to <DB>/active/<table> directed to version A is blocked
if REST API version is Standby
- REST API to <DB>/standby/<table> is directed to version B and data modification is allowed
If active version is B:
Call processing uses version B of the data
If GUI version is Active:
- GUI changes are directed to version B and data modification is allowed
If GUI version is Standby:
- GUI changes are directed to version A and data modification is allowed
if provisioning version is Active:
- batch provisioning is blocked
if provisioning version is Standby:
- batch provisioning is directed to version A and data modification is allowed
if REST API version is Active
- REST API to <DB>/active/<table> directed to version B is blocked
if REST API version is Standby
- REST API to <DB>/standby/<table> is directed to version A and data modification is allowed
The Data Version Node Override tab allows the operator to temporarily switch the version in use for a particular service on a particular call processor node to test it.
The version selection is presented as a matrix of CP nodes vs. service.
The option Default instructs the CP node to use the global version selected under the tab Data Version Selection.
The option Override: version A forces this CP node to use the version A of data, no matter the version selected under the tab Data Version Selection.
The option Override: version B forces this CP node to use the version B of data, no matter the version selected under the tab Data Version Selection.
Note
Once a CP node has been configured to use a version of the data different from what the other CP nodes use, the stats on the dashboard can be used to ensure that the CP node behaves correctly with this specific version of the data. Once this is confirmed, all the other CP nodes can be switched to the same version of the data under the tab Data Version Selection.
The tab Versions Comparison gives an overview of the records counts for the versions A and B, per service. The version highlighted in green is the version currently active for call processing.
Version Cloning
Under specific circumstances outside the normal maintenance operations, the operator might want to copy one version of a database onto the other. This can be achieved by using the pg_dump tool in the CLI to dump a particular service version and pipe it into psql connected to the other version of the database.
Note
The option -a must be used to dump the data only. Without this option, the source database schema would be dumped too.
The tables in the destination database should be empty to allow the copy.
Note
These commands operate at superuser level without any safeguards against human mistake such as mistyping or wrong versions. The operator must pay particular attention to the source version and destination version, to ensure that the destination version is not in use for call processing, GUI or provisioning. If this is the case, this could lead to catastrophic consequences as data is immediately replicated to all nodes.
On the master EM run:
[root@sre-em ~]# su - postgres
-bash-4.1$ pg_dump -a <service>_a | psql <service>_b
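As a precaution before the copy, you may want to confirm that the destination version is empty; a minimal sketch, run as user postgres, where <table> is a placeholder for one of the service tables:
-bash-4.1$ psql -c "SELECT count(*) FROM <table>;" <service>_b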