Network-wide Transactions#

Many will find EDA's Atomic Transactions feature one of the most exciting. And for good reason!

How many times have you been staring at a long-running configuration pipeline that goes and applies changes box by box, or pool by pool, only to have it fail at one device because of a minor misconfiguration, incompatibility or a transient resource issue?
How many hours have been invested in finding the root cause of such failures, coming up with the clean-up steps and rolling back the partially deployed changes?
How many hours have been invested to create an idempotent configuration pipeline that can perform a guaranteed rollback in case of a failure in a complex service deployment?
And how much easier would it be if these challenges were handled automatically by the system?

Network-wide Transactions in EDA were designed to solve exactly these challenges faced by every operator on a regular basis.

Reconciliation vs Transactions#

To appreciate the value of EDA's Network-wide Transactions, it is important to understand the difference between reconciliation-based and transaction-based approaches to network automation.
If you have been following the recent trends in network automation, you have probably noticed the increasing popularity of reconciliation-based approaches boosted by the popularity of Kubernetes, the declarative paradigm, and GitOps.

The idea behind the reconciliation pattern is simple: instead of writing imperative steps to configure a service on a set of devices that run from some automation server in a one-off fashion, you declare the desired state of the same service and let the "controller" reconcile the actual state of the network with the desired state.
The process of identifying the delta between the actual and desired states and applying the necessary changes to converge the two states is called the reconciliation loop.
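
To make the idea concrete, here is a minimal sketch of a single reconciliation pass. It is not how Kubernetes or EDA implement it; the function names `compute_delta` and `reconcile`, and the dictionary-based state, are assumptions made purely for illustration.

```python
def compute_delta(desired: dict, actual: dict) -> dict:
    """Compare two states and return what must change to reach `desired`."""
    return {
        # Present in desired, absent from actual.
        "create": {k: v for k, v in desired.items() if k not in actual},
        # Present in both, but with a different value.
        "update": {k: v for k, v in desired.items() if k in actual and actual[k] != v},
        # Present in actual, no longer desired.
        "delete": sorted(k for k in actual if k not in desired),
    }


def reconcile(desired: dict, actual: dict) -> dict:
    """Run one reconciliation pass and return the new actual state."""
    delta = compute_delta(desired, actual)
    new_actual = dict(actual)
    new_actual.update(delta["create"])
    new_actual.update(delta["update"])
    for name in delta["delete"]:
        del new_actual[name]
    return new_actual


# Hypothetical resources: names and versions are made up for the example.
desired = {"web-1": "v2", "web-2": "v2"}
actual = {"web-1": "v1", "legacy": "v1"}
delta = compute_delta(desired, actual)
converged = reconcile(desired, actual)
```

In this toy model every change succeeds, so one pass is enough to converge. Real controllers repeat the pass because individual changes can fail or take time.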

A popular implementation of the reconciliation loop is the Operator pattern popularized by Kubernetes. To illustrate the reconciliation loop, consider the following diagram where a user submits the desired state of a resource to the system, and the Operator reconciles (applies necessary changes to) the actual state of the resource to match the desired state.

Imagine that a user wants to deploy a web server with three replicas across the worker nodes in a Kubernetes cluster. The user would submit a manifest declaring this intent and the Operator in Kubernetes would create the necessary components to realize this intent. Eventually, the actual state of the cluster would converge to the desired state.
The emphasis on "eventually" is there for a reason. The reconciliation loop may take time to converge the states, and sometimes it may never converge to the desired state if there are persistent issues preventing the deployment.

This case is illustrated above with the worker1 node not being able to start the web server pod due to, for example, insufficient resources. This would result in the actual state never matching the desired state of three replicas and only two web servers being available.
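
The stuck state can be reproduced with a toy loop. This is a deliberately simplified sketch: it assumes one pod per node and a hypothetical `healthy_nodes` set, which is not how the Kubernetes scheduler actually places pods.

```python
def try_start_pod(node: str, healthy_nodes: set) -> bool:
    """Pretend to start a pod; it only succeeds on a healthy node."""
    return node in healthy_nodes


def reconcile_replicas(desired_replicas: int, nodes: list, healthy_nodes: set,
                       max_passes: int = 5):
    """Retry for a bounded number of passes; report whether the states converged."""
    running = set()
    for _ in range(max_passes):
        for node in nodes:
            if len(running) >= desired_replicas:
                break
            if node not in running and try_start_pod(node, healthy_nodes):
                running.add(node)
        if len(running) == desired_replicas:
            return running, True
    # The loop gave up: the actual state is a partial deployment.
    return running, False


running, converged = reconcile_replicas(
    desired_replicas=3,
    nodes=["worker1", "worker2", "worker3"],
    healthy_nodes={"worker2", "worker3"},  # worker1 lacks resources
)
```

No matter how many passes are allowed, `running` stays at two nodes while worker1 is unhealthy, which is exactly the partial state described above.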

Running two web servers instead of three may sound acceptable in some scenarios, if there is ample time to fix the underlying issue and let the reconciliation loop converge the states. However, every network operator knows they can't afford partially applied changes.
In the network world, partially applied changes may lead to service outages, security vulnerabilities, and compliance violations. Therefore, network operators need stronger guarantees that either all changes are applied successfully, or none of them are applied at all. This is where EDA's Network-wide Transactions come into play.

In EDA, the declarative network management principles meet the strong consistency guarantees of atomic transactions. Whenever a user submits a resource manifest to EDA, the platform creates a transaction with the calculated changes required to converge the actual state of the network with the desired state declared in the manifest. The transaction is then executed across all affected devices in the network in an atomic, all-or-nothing fashion.
No guesswork and no unnecessary tokens thrown at a problem that can be solved with some good old engineering to guarantee the safety and reliability of operations.

In EDA, every action that results in a node config change benefits from the atomic transaction guarantees provided by the platform.

Let's work through an example of how Network-wide Transactions work in EDA. Imagine that an operator needs to add a new customer in a data center network spanning multiple switches, along with the necessary ACL entries to enforce security policies. The operator would add a set of resources reflecting the desired state of the network to the transaction basket in EDA and submit the transaction for execution.

Before the first network request is made, EDA will perform a set of validations.

It ensures that the provided input intents match the schema and no fat fingering happened to the input data.
It checks dependencies between the resources, e.g. the referenced IP pool is not depleted, the selected target node exists, etc.
Then it will calculate the necessary node-level changes to be applied on each target device, and these changes will undergo semantic and syntactic validation based on the targeted platforms and their YANG models.

If all these validations pass, the transaction will transition to the deploy phase.
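
The staged nature of these checks can be sketched as a small pipeline. Everything here is hypothetical: the field names, the `inventory` structure and the function names are invented for illustration, and the third stage (validating node-level changes against YANG models) is left out.

```python
REQUIRED_FIELDS = ("name", "node")  # hypothetical schema, for illustration only


def check_schema(intent: dict, inventory: dict) -> list[str]:
    """Stage 1: the intent must carry every required field."""
    return [f"{intent.get('name', '?')}: missing field '{field}'"
            for field in REQUIRED_FIELDS if field not in intent]


def check_dependencies(intent: dict, inventory: dict) -> list[str]:
    """Stage 2: referenced nodes must exist and pools must have free addresses."""
    errors = []
    if intent["node"] not in inventory["nodes"]:
        errors.append(f"{intent['name']}: unknown node '{intent['node']}'")
    pool = intent.get("ip_pool")
    if pool is not None and inventory["free_ips"].get(pool, 0) == 0:
        errors.append(f"{intent['name']}: IP pool '{pool}' is depleted")
    return errors


def validate(intents: list[dict], inventory: dict) -> list[str]:
    """Run the stages in order; a failing stage stops the pipeline."""
    for stage in (check_schema, check_dependencies):
        errors = [e for intent in intents for e in stage(intent, inventory)]
        if errors:
            return errors
    return []
```

An empty list from `validate` is the sketch's equivalent of "all validations pass"; anything else means no network request is ever made.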

In the deploy phase, the transaction will target all affected devices in the network. EDA will concurrently apply the changes across all devices in the transaction, ensuring that the changes will be attempted on all devices at the same time. Even though the extensive set of validations performed before the deploy phase significantly reduces the chances of failure during the configuration push, failures may still happen due to the nature of model-based checks and hardware resource constraints, transient network issues, or self-inflicted communication path failures.

This is where EDA's Network-wide Transactions shine. When EDA applies the changes on each device, it uses the gNMI Commit Confirmed extension so that each successful commit requires an explicit confirmation from EDA within a specified confirmation timeout window. Continuing with our example, let's imagine that the first device successfully applied the changes and transitioned to the "waiting for confirmation" state. The third device hasn't yet processed its commit set, while the second device failed to apply the changes due to exhausted TCAM resources:

Because one of the devices failed to apply the transaction change set, EDA is immediately notified of the failure and initiates the rollback phase of the transaction. The rollback in this case is as simple as sending the gNMI Commit Cancel message to all devices that either successfully applied the changes or are still in the process of applying them. Once the devices receive the Commit Cancel message, they will automatically revert to the previous configuration state, ensuring that no partial changes are left in the network.

The network remains in a consistent state, with no partial changes applied, and the operator can investigate the root cause of the failure without worrying about cleaning up after a partially applied configuration.
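
The all-or-nothing flow can be simulated in a few lines. This sketch makes no gNMI calls and is not EDA's implementation: the `Device` class, its `commit_with_confirm`, `confirm` and `cancel` methods, and `run_transaction` are all hypothetical stand-ins for the commit, confirm and cancel steps described above, and confirmation timeouts are not modeled.

```python
from concurrent.futures import ThreadPoolExecutor


class Device:
    """In-memory stand-in for a network device (no real gNMI involved)."""

    def __init__(self, name: str, config: dict, fail: bool = False):
        self.name = name
        self.config = dict(config)
        self.fail = fail          # simulate e.g. exhausted TCAM
        self._previous = None     # saved config while awaiting confirmation

    def commit_with_confirm(self, changes: dict) -> None:
        if self.fail:
            raise RuntimeError(f"{self.name}: commit rejected")
        self._previous = dict(self.config)
        self.config.update(changes)

    def confirm(self) -> None:
        self._previous = None     # changes become permanent

    def cancel(self) -> None:
        if self._previous is not None:
            self.config = self._previous
            self._previous = None


def run_transaction(devices: list, changes: dict) -> bool:
    """Push changes to all devices concurrently; cancel everywhere on any failure."""
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        futures = [pool.submit(d.commit_with_confirm, changes[d.name]) for d in devices]
        failures = [f.exception() for f in futures if f.exception() is not None]
    if failures:
        for d in devices:
            d.cancel()            # harmless on devices that never committed
        return False
    for d in devices:
        d.confirm()
    return True
```

With one failing device in the list, `run_transaction` returns `False` and every device keeps its original configuration; with none failing, all devices end up confirmed with the new values.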

Did you know that the Nokia EDA team contributed¹ the gNMI Commit Confirmed extension to OpenConfig?

In summary, each operation in EDA is put through a rigorous validation and check process with multiple safety barriers to ensure that malicious or accidental misconfigurations are minimized to the greatest extent possible. As all transactions share fate, no matter how big or small the change set is, or how big or small the target set of devices is, EDA will ensure the atomicity of the changes and record the successful transaction in its persistent Git repository for future audits and rollbacks.

Working With Transactions#

With the good theoretical primer behind us, it is time to see the transactions in action. We will work through a simple task of enabling the ethernet-1/20 interface on both leaf1 and leaf2 switches in our data center fabric.
If you recall from the Nodes chapter, both leaf1 and leaf2 switches have ports from 1 to 12 enabled and acting either as uplink or server-facing ports, while the rest of the ports are not even configured².

Interface ethernet-1/20 has not been configured

--{ + running }--[  ]--
A:admin@leaf1# info /interface ethernet-1/20

--{ + running }--[  ]--
A:admin@leaf1#

Transaction basket#

To add an interface, go to the Interfaces resource page in the left side menu and click on "Create" to fill in the details for the new interface resource for leaf1:

Creating a new interface

In the editing mode, we will fill in the necessary information to define the new interface resource. To keep things simple, we will paste the following YAML snippet in the editor:

apiVersion: interfaces.eda.nokia.com/v1alpha1
kind: Interface
metadata:
  namespace: eda
  name: leaf1-ethernet-1-20 #(3)!
spec:
  enabled: true
  type: interface
  encapType: dot1q
  lldp: true
  members:
    - enabled: true
      lacpPortPriority: 32768
      interface: ethernet-1-20 #(1)!
      node: leaf1 #(2)!

(1) We set the interface field to ethernet-1-20 which is the normalized, vendor-agnostic interface name that is transformed to the vendor-specific interface name by EDA during the configuration push.
(2) We set the node field to leaf1 to target the first leaf switch in our fabric.
(3) We set the resource name to leaf1-ethernet-1-20 to uniquely identify this interface resource in EDA.
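
To illustrate the name normalization, the sketch below maps the normalized form used in the manifest (ethernet-1-20) to the form seen on the node CLI in this chapter (ethernet-1/20). The function `to_node_interface_name` is hypothetical and only covers this simple two-number pattern; EDA's actual translation logic is not shown here and may handle many more cases.

```python
import re


def to_node_interface_name(normalized: str) -> str:
    """Turn 'ethernet-<a>-<b>' into 'ethernet-<a>/<b>' (illustration only)."""
    match = re.fullmatch(r"ethernet-(\d+)-(\d+)", normalized)
    if match is None:
        raise ValueError(f"unsupported interface name: {normalized!r}")
    return f"ethernet-{match.group(1)}/{match.group(2)}"


node_name = to_node_interface_name("ethernet-1-20")
```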

When editing resources in EDA, the resource definition is matched against its API schema. This is the first validation step that ensures the resource definition adheres to its schema before it is added to the transaction basket.

After pasting the YAML snippet or filling the schema form, click on the "Add" button to add the new interface resource to the transaction basket.

By adding a resource to the transaction basket you are staging the resource for inclusion in the next transaction. No configuration changes are being pushed yet. Continue with adding the second interface resource for the leaf2 switch by repeating the same steps as above, but changing the name and node fields accordingly.
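
Since the two resources differ only in the name and node fields, they can be described by one small builder. This is a sketch for readers who prefer to generate manifests programmatically; `interface_manifest` is a hypothetical helper, not part of any EDA tooling, and the field values are copied from the leaf1 snippet above.

```python
def interface_manifest(node: str, port: str = "ethernet-1-20") -> dict:
    """Build an Interface resource as a plain dict for the given node."""
    return {
        "apiVersion": "interfaces.eda.nokia.com/v1alpha1",
        "kind": "Interface",
        "metadata": {"namespace": "eda", "name": f"{node}-{port}"},
        "spec": {
            "enabled": True,
            "type": "interface",
            "encapType": "dot1q",
            "lldp": True,
            "members": [
                {
                    "enabled": True,
                    "lacpPortPriority": 32768,
                    "interface": port,
                    "node": node,
                }
            ],
        },
    }


# One manifest per leaf; only metadata.name and members[0].node differ.
manifests = [interface_manifest(node) for node in ("leaf1", "leaf2")]
```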

After adding both interfaces to the transaction basket, you should see the counter next to the basket icon in the top right corner of the EDA UI indicating that there are two staged resources waiting to be committed. Clicking on the basket icon will open the transaction basket popup where you can review the staged resources, perform operations on them (edit, delete), or proceed with transaction operations.

Transaction basket with two staged resources

The EDA REST API and EDA Ansible collections have full support for managing transactions.

Dry run#

When you have the resources in your transaction basket workspace, you could commit them straight away and let the platform do its magic. But magic and surprises are exactly what operators don't want when provisioning their networks.

Acknowledging this fact, EDA provides a "Dry run" feature that allows operators to preview the changes that would be applied to the network without actually applying them. This is a great way to validate the intended changes before committing them to the network.

To initiate a dry run, click on the dropdown selector next to the "Commit" button and choose the "Dry run" option:

EDA will start processing the resources in the basket's workspace and run another validation as part of the dry run. This validation includes schema validation using the YANG models of the target nodes that EDA is aware of. Things like data types, ranges, mandatory fields and other constraints defined in these YANG models are checked during this validation step.
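
The kind of checks listed above can be pictured with a tiny constraint table. This is only a sketch: the `CONSTRAINTS` dictionary is a made-up stand-in for a YANG model, and the leaf names, allowed values and the MTU range in it are assumptions, not values taken from any real device model.

```python
# Hypothetical constraints standing in for a device's YANG model.
CONSTRAINTS = {
    "admin-state": {"type": str, "enum": {"enable", "disable"}, "mandatory": True},
    "mtu": {"type": int, "range": (1500, 9500), "mandatory": False},
}


def validate_leaves(config: dict) -> list[str]:
    """Check data types, allowed values, ranges and mandatory leaves."""
    errors = []
    for leaf, rule in CONSTRAINTS.items():
        if leaf not in config:
            if rule["mandatory"]:
                errors.append(f"{leaf}: mandatory leaf is missing")
            continue
        value = config[leaf]
        if not isinstance(value, rule["type"]):
            errors.append(f"{leaf}: expected {rule['type'].__name__}")
            continue
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{leaf}: {value!r} is not an allowed value")
        if "range" in rule:
            low, high = rule["range"]
            if not low <= value <= high:
                errors.append(f"{leaf}: {value} is outside {low}..{high}")
    # Leaves the model does not know about are rejected as well.
    errors.extend(f"{leaf}: unknown leaf" for leaf in config if leaf not in CONSTRAINTS)
    return errors
```

A check like this can only catch what the model expresses; anything the model does not constrain will pass the dry run and surface only when the device processes the commit.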

In a few moments, EDA will present the dry run results in the same popup window, from where a user can either proceed with committing the transaction upon a successful dry run, or look at the transaction details and diffs to understand the scope of the changes in the basket workspace.

Let's start with the diff view, and leave transaction details for when we submit the actual transaction.

Diff#

Clicking on the "Diffs" icon will pop up a new window with the calculated diffs that would look like this:

The diff view is split into two panes: on the left side we have the list of resources affected or created by the transaction, and on the right side we have the familiar text diff view showing the exact changes in the "before/after" format for each resource.

The "Node Configuration" diffs are always at the top of the list as they represent the actual configuration changes that will be applied to the target devices. In our case, we have two new interfaces being created on leaf1 and leaf2 nodes respectively, and the diff shows the exact configuration snippets that will be added (in this case) to the running configuration of each device.

Being able to see the detailed diff as a result of a dry run removes another layer of uncertainty from the network operations process, showing the proposed changes in a familiar format.
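
If you have never looked at how such a "before/after" view is produced, Python's standard `difflib` module gives a rough idea. The sketch below is unrelated to EDA's own diff engine; the `config_diff` helper is hypothetical and the configuration text is an approximation of what a new interface could look like on the node.

```python
import difflib


def config_diff(before: str, after: str, node: str) -> str:
    """Return a unified diff between two configuration snapshots."""
    lines = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile=f"{node} (before)",
        tofile=f"{node} (after)",
        lineterm="",
    )
    return "\n".join(lines)


before = ""  # the interface did not exist before the transaction
after = "interface ethernet-1/20 {\n    admin-state enable\n    vlan-tagging true\n}"
diff = config_diff(before, after, "leaf1")
```

Because the interface is new, every configuration line in `diff` carries a `+` prefix, just like the additions highlighted in the dry run view.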

Besides the Node Configuration diffs, we also have diffs for the resources being created or modified in this transaction. For example, the Interface resources that we added to the transaction can be found in the list of resources in the left pane.

Committing the transaction#

Based on the dry run diff review, an operator may want to edit the resources in the basket workspace to adjust the parameters and re-run the dry run until satisfied with the proposed changes. Once ready, the operator can proceed with committing the transaction by clicking on the "Commit" button in the transaction basket popup.

Pulling up the transaction basket will now show the "Commit" button without the dropdown selector, as we have already performed the dry run and haven't changed anything in the basket workspace since then.

The commit process will look similar to the dry run, however, this time EDA will proceed with pushing the changes to the target devices using the Network-wide Transactions mechanism described earlier.
After clicking on the "Commit" button, EDA will show the transaction progress in the same popup window and display the final transaction result once the transaction is confirmed by all target devices.

After the transaction is successfully committed, you can click "Done" to close the transaction basket popup and return to the main EDA UI.

If you were to go and check the configuration on both leaf1 and leaf2 switches, you would see that the ethernet-1/20 interfaces are now present and enabled:

--{ + running }--[  ]--
A:admin@leaf1# info /interface ethernet-1/20
    admin-state enable
    vlan-tagging true

Transaction list#

The Git repository that backs EDA transactions deserves its own chapter, which we will leave for later. For now, all that is important to know is that every committed transaction is persisted in EDA's replicated Git repository.

You will find the transaction list right at the top section of the left side menu in the EDA UI. Clicking on it will pull up the transaction list³:

Attentive readers will notice that the transaction list contains both the transactions that were run in the dry run mode and the committed ones. Both successful and failed transactions are also listed here for audit and troubleshooting purposes.
With every transaction having an incremental identifier, it is easy to track the sequence of changes applied to the network over time.

Double-clicking on a transaction entry will pull up the transaction details view where all the information about the transaction can be found:

The detailed transaction view covers a lot of ground. First, on the panel with the big checkmark status you see the transaction ID and its commit status. The panel to the right shows transaction KPIs, such as:

1 Number of input resources provided to the transaction. The transaction we are looking at was recorded as a result of us adding two Interface resources a moment ago, hence the value of 2 here.

2 Number of EDA application runs that were involved in processing this transaction.

3 Number of emitted resources that were created as a result of EDA applications working on the input resources.

4 Number of resources that have been changed as part of this transaction. This includes both created and modified resources.

5 How many topology nodes were affected by this transaction. Read it as "on how many devices changes were applied".

6 In the Transaction Details tab you will find the actual resources for which the KPIs were calculated.

7 With the top bar toggle you can change the view from transaction details to diffs or transaction topology graph.

8 The "advanced" toggle will show additional internal resources participating in this transaction if available.

It is always a good idea to start with the transaction list when experiencing unexpected issues in the platform, since some transactions may be triggered automatically by the system and not directly by the user.

Restores and reverts#

We mentioned earlier that EDA persists every committed transaction in its Git repository. And many of you will immediately guess where we are going with this.

Git is perfect for maintaining the change history and being able to revert to any previous state when needed. EDA leverages this capability of Git to provide operators with the ability to roll back or revert transactions as needed.
No matter how far back in time you need to go, how many resources were created since, or how many nodes were added - EDA can go back to any point in time and bring the network to the state recorded at that time.

EDA exposes Revert and Restore actions for each committed transaction:

Revert sets all the input resources from a specific transaction back to the previous commit⁴
Restore sets all EDA resources, apps, and allocations to exactly as they were at the specified commit

Both actions are executed as a new transaction and committed with a new commit hash, i.e., the commit history always moves forward even if the transaction is a roll-back of changes.
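
The difference between the two actions, and the forward-only history, can be modeled with a short in-memory sketch. It does not use Git and is not EDA's implementation: the `History` class and its methods are hypothetical, and resources are reduced to simple name/value pairs. The sketch also models the documented behavior that a revert fails when a more recent transaction changed the same input resources.

```python
class History:
    """Append-only list of commits; each commit stores a full state snapshot."""

    def __init__(self):
        self.commits = [{"id": 0, "changed": set(), "state": {}}]

    @property
    def head(self) -> dict:
        return self.commits[-1]

    def commit(self, changes: dict) -> int:
        """Apply changes (value None means delete) as a new commit."""
        state = dict(self.head["state"])
        for name, value in changes.items():
            if value is None:
                state.pop(name, None)
            else:
                state[name] = value
        new_id = self.head["id"] + 1
        self.commits.append({"id": new_id, "changed": set(changes), "state": state})
        return new_id

    def revert(self, commit_id: int) -> int:
        """Set the inputs of one commit back to their previous values."""
        if not 1 <= commit_id < len(self.commits):
            raise ValueError("unknown commit")
        target = self.commits[commit_id]
        for later in self.commits[commit_id + 1:]:
            if later["changed"] & target["changed"]:
                raise ValueError("a more recent transaction changed these resources")
        previous = self.commits[commit_id - 1]["state"]
        return self.commit({name: previous.get(name) for name in target["changed"]})

    def restore(self, commit_id: int) -> int:
        """Bring the whole state back to the snapshot of the given commit."""
        if not 0 <= commit_id < len(self.commits):
            raise ValueError("unknown commit")
        snapshot = self.commits[commit_id]["state"]
        changes = {name: None for name in self.head["state"] if name not in snapshot}
        changes.update(snapshot)
        return self.commit(changes)
```

Note that both `revert` and `restore` end with a call to `commit`, so the commit identifier only ever grows, mirroring the "history always moves forward" rule.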

The easiest way to access these actions is from the transaction details view where the Revert and Restore actions are available in the context menu of each committed transaction:

In the screenshot above, the selected transaction is the one where we created the two interfaces on leaf1 and leaf2 switches. Let's say we realized that we made a mistake and these interfaces are not needed after all. We could have gone to the interface list and deleted them, but if your transaction involves multiple resources or changes some existing resources, reverting the transaction is a much easier and safer option.

If we were to click on "Revert", we would see the dialog asking us to choose between a dry run or an actual revert operation.

Even rollback and revert operations can be first tried out to add that extra safety layer.

Click on "Revert" and EDA will create a new transaction that "reverts" the changes. You can then check the config on the nodes to ensure that the ethernet-1/20 interfaces are no longer present.

As explained above, the Revert operation reverts a single transaction, while Restore brings the entire EDA platform back to the state recorded at the specified transaction.

Network-wide Transaction Example#

What about our promise to manage transactions in the all-or-nothing fashion with strong consistency guarantees and automatic rollback in case of failures? To see this in action we would need to simulate a sudden communication loss from EDA to one of the target devices during the transaction commit phase, or fabricate a wrong configuration input that the node would reject during the commit. The latter is easier to simulate, so let's do that.

Let's populate our transaction basket workspace with three Interface resources this time. If you rolled back the ethernet-1/20 interfaces on leaf1 and leaf2 switches, you can re-add them to the basket, and add a third Interface resource for spine1 that would look like this:

apiVersion: interfaces.eda.nokia.com/v1alpha1
kind: Interface
metadata:
  namespace: eda
  name: spine1-ethernet-1-99
spec:
  enabled: true
  type: interface
  encapType: dot1q
  lldp: true
  members:
    - enabled: true
      lacpPortPriority: 32768
      interface: ethernet-1-99
      node: spine1

Do you see the induced mistake? There is no ethernet-1/99 interface on the 7220 IXR D5 platform that the spine1 switch is running, and this "error" won't be intercepted by the dry run or schema validation, since the port range constraints are not part of the YANG model of the device.

With three interfaces in our basket, let's first run a dry run to see the diffs:

The dry run completes successfully, and the diffs show that the node configuration change targeting the spine1 switch will attempt to add the ethernet-1/99 interface. Let's see what happens if we were to proceed with the change:

Committing transaction with an invalid interface

As you can see, the whole transaction is marked as failed: all the resources in the transaction share the same fate, so a failure on one target caused the entire transaction to fail. The failed transaction also did not let any partial changes be applied to the network, which can be validated by checking the list of interfaces in EDA as well as on the nodes themselves.

Where to next?

The Tour of EDA is far from over, but we are busy working on the next chapters. In the meantime, feel free to explore other parts of the documentation, connect with the EDA PLM team and our lovely community on Discord, or check out YouTube for video tutorials and webinars.

¹ https://github.com/openconfig/reference/blob/master/rpc/gnmi/gnmi-commit-confirmed.md
² You can ensure this by checking the node configuration from the EDA UI.
³ Check the Transactions documentation for more details on which transactions are listed and the details shown there.
⁴ If a more recent transaction made changes to any of the input resources, revert will fail. This prevents 'undoing' changes that are not part of the selected transaction.