Summary:
This article describes how to upgrade an SRX cluster with minimal downtime.
Problem or Goal:
At no time can the nodes of a cluster run mismatched code versions, because this can result in network instability and unpredictable behavior. To properly upgrade a cluster without ISSU (which is not supported on SRX Branch devices), you must therefore ensure that both nodes are rebooted and never attempt to connect to each other while running different Junos versions.
Zero downtime is not currently possible on SRX clusters. The goal of this article is to provide a means to upgrade an SRX cluster with the minimum amount of downtime possible. The following events can be expected during this process:
- All sessions that use network address translation (NAT) will be lost.
- All sessions that use an ALG (such as FTP, SIP, and so on) will be lost.
- Dynamic routing protocol adjacencies will need to be re-established upon failover between the devices.
- All other existing sessions will be able to fail over between devices.
- Depending on the network configuration, traffic will fail over between devices with minimal packet loss.
Cause:
Solution:
There are several caveats to be aware of before implementing this procedure:
- Synchronizing the routing engines must always be performed with a reboot of one or both units. If the two units are connected on the control plane (Control Port) without one unit being rebooted, it is possible for one routing engine to overwrite the other RE's configuration, which causes service outages.
- Synchronizing the data plane must always be performed with a reboot of one or both units. If the two units are connected on the data plane (Fab Port) without one unit being rebooted, it is possible for SPUs to enter a negative state. The risk is lower than with the control plane, but it should also be avoided.
- At no time should two devices running different Junos versions communicate across the control or fabric links. Doing so can cause failures, including Routing Engine configuration loss, SPC reboots, IOC reboots, and loss of the ability to pass traffic. If two devices of different versions communicate and strange behavior occurs, a simultaneous reboot of both devices must be performed to reset them. Before the reboot, first upgrade or downgrade the software on one unit so that both devices are on the same version.
- If the fabric interfaces (fab0 or fab1) or their associated physical interfaces are disabled, the device becomes inoperable. To restore the device, enable the fabric ports and reboot. A reboot is the only way to make the device eligible to enter the cluster again.
- Traffic can be failed over between devices during the upgrade with little traffic loss, using a method similar to that of many stateful firewalls. All non-network address translation (NAT) and non-application layer gateway (ALG) (FTP only) traffic will fail over to the other device, and new sessions will be created on the backup unit. This works because the backup unit does not check whether a connection is new (TCP SYN and sequence checking are disabled), which may be considered less secure. If security is a concern, these session checking features can be re-enabled after the upgrade procedure is completed. It is also possible to perform the upgrade without disabling TCP SYN and sequence checking, although this will cause all sessions to end, and they will need to be restarted by the client applications.
- The Dynamic routing state is not synchronized between the two cluster members. Upon failover between the devices, new neighbor relationships for all protocols will need to be re-established.
- During testing performed by Juniper Networks, failover times of a few seconds were achieved. All non-NAT and non-ALG sessions were transferred between devices.
- During the upgrade, there may be configuration discrepancies that prevent a successful commit. At the points where this is critical, a commit check is suggested to ensure that the simultaneous commit will succeed.
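The rule that two nodes on different Junos versions must never be reconnected can be expressed as a simple guard. The following Python sketch is illustrative only: the function name `safe_to_reconnect` and the version strings are assumptions, and the strings are expected to be copied by hand from the `show version` output of each node.

```python
def safe_to_reconnect(node0_version: str, node1_version: str) -> bool:
    """Return True only when both nodes report the same Junos version.

    The version strings are assumed to be taken manually from
    'show version' on each node; surrounding whitespace is ignored.
    """
    return node0_version.strip() == node1_version.strip()


# Hypothetical version strings, for illustration only.
if safe_to_reconnect("12.1X44-D40.2", "12.1X44-D40.2"):
    print("Versions match: control and fabric links may be restored.")
else:
    print("Version mismatch: keep the control and fabric links isolated.")
```

A check of this kind is only a reminder of the rule; it does not replace verifying the version on each node before the final synchronization.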
Upgrade Procedure Overview:
- Disable network interfaces on the backup device. This isolates the unit from the network so that it will not impact traffic while the upgrade procedure is in progress.
- Disable SYN bit checking and TCP sequence number checking. This allows the secondary firewall to take over stateful, non-NAT, and non-ALG traffic without requiring a 3-way TCP handshake.
- Disable or disconnect the control and fabric links between the two devices. This ensures that the nodes, which are running different Junos versions, will not communicate with each other.
- Upgrade the software on the backup firewall first. When upgrading, use the no-validate option to ignore the errors that will occur for configuration statements related to the other cluster member. Once the upgrade is complete, reboot the backup device.
- Verify that the backup firewall is up and available to take over traffic. Depending on the platform, it can take several minutes to complete the boot process.
- Correct the control port and fabric port configuration, if necessary, only on the backup device. This prepares the device to synchronize later in the process.
- The backup firewall is now ready to take over for the primary. This is one of the crucial steps in the procedure. Traffic is switched between the two devices by disabling the physical interfaces on the primary and enabling them on the secondary device at the same time. Traffic will immediately begin to flow on the secondary device.
- Ensure that the secondary device is handling the traffic by looking at the session table and checking that traffic is flowing through the device and that new sessions are being created.
- Upgrade the software on the now isolated primary firewall. When performing the upgrade, use the no-validate option to ignore the errors that will occur for configuration statements related to the other cluster member. Once the upgrade is complete, reboot the primary device.
- Verify that the primary firewall is up and available to take over traffic. Depending on the platform, it can take several minutes to complete the boot process.
- At this point, the primary firewall is ready to take over for the backup. This is the second crucial step in the procedure. Traffic is switched between the two devices by disabling the physical interfaces on the backup and enabling them on the primary device at the same time. Traffic will immediately begin to flow on the primary device.
- Ensure that the primary device is handling the traffic by looking at the session table and checking that traffic is flowing through the device and that new sessions are being created.
- Synchronize the cluster. First reboot the backup device. While it is rebooting, set the correct sync ports on the primary device. When the backup device comes back up, it will synchronize with the primary device.
- Once the backup firewall is up and ready to process traffic, enable its physical interfaces. It will not process traffic, but it will be ready to do so in the event of a failure.
- If SYN checking and sequence checking were disabled before starting the activity, re-enable them if required.
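The two traffic switchover steps in the overview both reduce to the same configuration change: disable the physical interfaces on one node and remove the disable statement on the other, then run a commit check before the real commit. The Python sketch below only builds that list of commands as text. The function name `failover_commands` is hypothetical, and the interface names are the example names used elsewhere in this article; substitute the interfaces of your own cluster.

```python
def failover_commands(from_ifaces, to_ifaces):
    """Build the configuration-mode commands that move traffic away from
    the node owning from_ifaces and onto the node owning to_ifaces."""
    # Disable every physical interface on the node that is giving up traffic.
    cmds = [f"set interfaces {name} disable" for name in from_ifaces]
    # Re-enable every physical interface on the node that is taking over.
    cmds += [f"delete interfaces {name} disable" for name in to_ifaces]
    # Validate first; the actual commit is issued afterwards on both nodes.
    cmds.append("commit check")
    return cmds


# Example interface names (assumed): node0 uses ge-9/x/0, node1 uses ge-21/x/0.
node0_ifaces = ["ge-9/0/0", "ge-9/1/0"]
node1_ifaces = ["ge-21/0/0", "ge-21/1/0"]

for line in failover_commands(node0_ifaces, node1_ifaces):
    print(line)
```

Swapping the two arguments yields the commands for failing traffic back to node0 later in the procedure.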
Detailed upgrade procedure:
Assume that node0 is the primary device and node1 is the backup device.
- Upload the new Junos package to each node.
- From the node0 RE, disable all physical interfaces on the secondary chassis. For example:
set interfaces ge-21/0/0 disable
set interfaces ge-21/1/0 disable
- Disable SYN bit and sequence checking:
set security flow tcp-session no-syn-check
set security flow tcp-session no-sequence-check
- Commit the configuration.
- For SRX5000, change the control and fabric ports to erroneous numbers. They can be set to any FPC number (existing or not) on the chassis, except the correct ones. Commit this separately on both nodes.
Note: Assume that fpc 7 and fpc 19 are erroneous FPCs for HA links. If configured for dual control links, you would also need to include the configuration change for the second control link. For SRX1400, SRX3400, and SRX3600, the control link(s) will need to be physically disconnected.
delete chassis cluster control-ports
set chassis cluster control-ports fpc 7 port 0
set chassis cluster control-ports fpc 19 port 0
delete interfaces fab0
delete interfaces fab1
set interfaces fab0 fabric-options member-interfaces ge-7/0/0
set interfaces fab1 fabric-options member-interfaces ge-19/0/0
- The commit must be applied to both nodes independently. Once the commit is applied, the routing engines are disconnected from each other, which is why the configuration needs to be applied separately.
- Upgrade the software on the node1 unit. Do not validate the configuration, as errors may be generated due to the broken cluster. Once the upgrade is complete, reboot the node1 unit.
request system software add <location-of-package>/<junos-filename> no-validate no-copy
request system reboot
- When node1 comes up, verify that the new software version is running, that the device is in the primary state for redundancy groups 0 and 1, and, on high-end SRX devices, that all the SPUs are online. Typically, it takes 3-4 minutes for all SPUs to come online from the moment you receive the login prompt.
show version
show chassis cluster status
show chassis fpc pic-status
- Before failing over between the two devices, it is best to verify that the configuration change will commit successfully. To do so, disable the interfaces on node0 and enable the interfaces on node1 by entering the following configuration on both nodes. Then verify the configuration with a commit check on both devices. This validates that the configuration is ready to be applied. If there are any conflicts, they must be resolved so that the commit succeeds. For example:
set interfaces ge-9/0/0 disable
set interfaces ge-9/1/0 disable
delete interfaces ge-21/0/0 disable
delete interfaces ge-21/1/0 disable
commit check
- At this point, traffic is failed over between the two chassis so that the upgrade can continue. The configuration was verified in the previous step; in this step, it is committed to the devices and takes effect immediately. Commit the configuration simultaneously on both cluster members. This causes all of the traffic to fail over to the node1 firewall.
commit
- Verify that the failover was successful. On the node1 unit, verify that sessions were created and that traffic is passing on node1.
show security flow session summary
monitor interface traffic
- Upgrade the software on the node0 unit. Do not validate the configuration, as errors may be generated due to the broken cluster. Reboot the device when the software is upgraded.
request system software add <location-of-package>/<junos-filename> no-validate no-copy
request system reboot
- When node0 comes up, verify that the software was upgraded, that the device is in the primary state for redundancy groups 0 and 1, and that all the SPUs are online. Typically, it takes 3-4 minutes for all SPUs to come online from the moment you receive the login prompt.
show version
show chassis cluster status
show chassis fpc pic-status
- Before failing back over between the two devices, it is best to verify that the configuration change will commit successfully. To do so, disable the interfaces on node1 and enable the interfaces on node0 by entering the following configuration on both nodes. Then verify the configuration with a commit check on both devices. This validates that the configuration is ready to be applied. If there are any conflicts, they must be resolved so that the commit succeeds. For example:
delete interfaces ge-9/0/0 disable
delete interfaces ge-9/1/0 disable
set interfaces ge-21/0/0 disable
set interfaces ge-21/1/0 disable
commit check
- At this point, traffic is failed over between the two chassis again so that the upgrade can continue. The configuration was verified in the previous step; in this step, it is committed to the devices and takes effect immediately. Commit the configuration simultaneously on both cluster members. This causes all of the traffic to fail over to the node0 firewall.
commit
- For SRX5000, re-configure the correct control and fabric ports on node1 only in this step. This prepares the device for the final synchronization of the cluster.
Note: For SRX3000 and SRX Branch, you would need to reconnect the control link ports instead.
delete chassis cluster control-ports
set chassis cluster control-ports fpc 1 port 0
set chassis cluster control-ports fpc 13 port 0
delete interfaces fab0
delete interfaces fab1
set interfaces fab0 fabric-options member-interfaces ge-0/0/0
set interfaces fab1 fabric-options member-interfaces ge-12/0/0
commit
- After node1 has been re-configured or reconnected with the correct control and fabric ports in the previous step, reboot node1. While node1 is rebooting, configure the correct control and fabric ports on node0 and ensure that node1’s physical interfaces are disabled in the configuration. When node1 comes up, it will use the configuration of node0. Make the following configuration changes on node0 while node1 is rebooting:
delete chassis cluster control-ports
set chassis cluster control-ports fpc 1 port 0
set chassis cluster control-ports fpc 13 port 0
delete interfaces fab0
delete interfaces fab1
set interfaces fab0 fabric-options member-interfaces ge-0/0/0
set interfaces fab1 fabric-options member-interfaces ge-12/0/0
commit
- When node1 returns to the up state, verify that it has synchronized with node0. Then enable node1's physical interfaces and re-enable TCP SYN bit and sequence checking. Commit the configuration.
delete interfaces ge-21/0/0 disable
delete interfaces ge-21/1/0 disable
delete security flow tcp-session no-syn-check
delete security flow tcp-session no-sequence-check
commit
- Verify that the redundancy group (RG) states are back online with the correct priorities:
show chassis cluster status
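The steps that isolate the nodes and later restore the HA links use the same set of commands with different FPC numbers and fabric member interfaces. As a rough aid for preparing both command sets in advance, the Python sketch below generates them as text. The function name `ha_link_commands` is hypothetical, control port 0 is assumed as in the examples above, and the FPC and interface values are the example values from this article, not values that apply to every chassis.

```python
def ha_link_commands(control_fpcs, fab_members):
    """Build the configuration-mode commands that redefine the control
    ports and fabric member interfaces of a chassis cluster.

    control_fpcs: FPC numbers for the control ports (port 0 is assumed).
    fab_members:  mapping of fabric interface name to its member interface.
    """
    cmds = ["delete chassis cluster control-ports"]
    cmds += [f"set chassis cluster control-ports fpc {fpc} port 0"
             for fpc in control_fpcs]
    # Remove the existing fabric definitions before recreating them.
    for fab in sorted(fab_members):
        cmds.append(f"delete interfaces {fab}")
    for fab in sorted(fab_members):
        cmds.append(
            f"set interfaces {fab} fabric-options member-interfaces {fab_members[fab]}"
        )
    return cmds


# Erroneous values used to isolate the nodes (example values from this article).
isolate = ha_link_commands([7, 19], {"fab0": "ge-7/0/0", "fab1": "ge-19/0/0"})
# Correct values used for the final synchronization (example values).
restore = ha_link_commands([1, 13], {"fab0": "ge-0/0/0", "fab1": "ge-12/0/0"})

print("\n".join(isolate))
print("\n".join(restore))
```

Preparing both sets beforehand shortens the window in which node0 must be reconfigured while node1 is rebooting. A commit still has to be issued manually on the appropriate node after the commands are entered.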
Purpose:
Configuration
Implementation
Installation