Network Automation for ISPs

Tomasz Schwiertz
16 min read · Dec 16, 2020


Network Automation is the new “Next Thing” in the industry. I’ve seen many “Next Things” come and go, but this time we might have a serious contender for our attention. As with everything new, it has to be learned. Therefore I built a small lab with Juniper vMX Series routers, where a Python script executed from a Linux host (off-box) updates BGP prefix-list filters. This is a proof of concept (POC), developed to aid Network Engineers in their daily work of operating an ISP network. In this article I want to share the automated solution, along with the thoughts and findings I came across.

Let’s automate!

The whole industry is seeking engineers with automation skills. Every business is looking for someone who can add to the automation capabilities of an ISP. Honing these skills will make Network Engineers a highly sought-after asset in every company. As a seasoned Network Engineer, I see this not merely as a trend, but as something that will stay and will revolutionize the way we offer services to customers.

Let us practice these skills. Let us take a manual chore and automate it to the extent that we remove at least the repeatable tasks and replace them with an automated “thing”. The problem we want to solve by automating part of the day-to-day duties of an Operations Network Engineer is the repetitiveness of the manual chores an Engineer has to complete while onboarding new ISP customers. Here is what is mostly done by hand in most companies:

  • an ISP advertises customer prefixes outbound to its Transit Providers (through outbound filters),
  • once a new customer joins the ISP, their prefixes are added to the outbound filters,
  • this task is repeated on all ISP Edge Routers facing Transit Providers.

Automating this part would eliminate the repetitive manual updates of the Edge Routers. That chore will be replaced with a centralized Automation Station that updates the outbound filters by running a config update script.

Now that we have diagnosed the pain point, we can start drafting a solution that will save Engineers the manual tasks and still achieve the same result. With the project requirements laid down, let us list the components we need:

  • a test network to get this solution tested and de-bugged — EVE-NG,
  • a Version Control System to store the prefixes and track adds and deletions of customer prefixes — Git,
  • an Automation Server that will poll data from Git and push the config to PE routers — achieved through Python script,
  • a secure communication channel and a mechanism to change the configuration on PE routers — NETCONF.

Let’s draw it:

Protocol data flow — conceptual drawing

1. The customer shares their prefixes that we (the ISP) need to accept.

2. The ISP saves the received prefixes in GitHub.

3. The Automation Server polls the list of prefixes from GitHub.

4. The Automation Server (via NETCONF) updates the ISP PE routers’ config (outbound prefix filters).

Now that we have highlighted the elements we want to use in our automation attempt, we can place them in our ISP test network and go over the big picture, discussing where the solution will be applied. We will review the onboarding of an IP Transit (IPT) customer and the connectivity mechanisms of the solution we are automating. We will start with understanding the why before we get into the how.

The “why”

Each ISP has its own IPT customer onboarding process. It’s a protocol that lays down the steps needed to successfully connect to the customer network and to advertise their numeric resources further up to the upstream transit providers. In this scenario, the prefixes owned by the customer undergo validation. In this article we simplify the handling of customer prefixes and rely solely on inbound/outbound prefix filters (BGP communities are not in use). Once the prefixes to be accepted are agreed on, they are added in two places:

  • Inbound prefix filter on the BGP session facing customer network
  • Outbound prefix filter on the BGP session facing a transit provider network
ACME Telekom ISP

After we have identified the places where prefixes are added, we can distinguish the tasks based on whether they are unique (configuration is changed on one router only) or repetitive (the same configuration change, repeated on different routers). Here we can see the value added by automation. We will add the prefixes to one list (a GitHub repository), and the Automation Station will change the configuration on all Transit Provider facing PE routers for us.

Let’s look at the configuration of the BGP session facing the IPT Customer. Although the customer is configured under a BGP group hierarchy level, it’s still considered a unique configuration (added manually, as each customer advertises a different set of prefixes and autonomous system numbers).

show protocols bgp

group IPT-CST-FULL-TABLE {
    type external;
    neighbor 192.0.2.21 {
        description IPT-AS500-CUST-A;
        import IPT-AS500-IN;
        peer-as 500;
    }
}


show policy-options policy-statement IPT-AS500-IN

term ACCEPT {
    from {
        policy CST-IN;
        prefix-list-filter IPT-CST-AS500 orlonger;
    }
    then accept;
}
term DENY {
    then reject;
}


show policy-options

prefix-list IPT-CST-AS500 {
    10.128.16.0/22;
    10.250.64.0/22;
}

Each time a customer gets onboarded, the ACME Telecom IP Engineer configures a similar set of commands on the PE router the customer connects to.

Let’s now take a look at the BGP configuration of the peering sessions towards the Transit Providers. Each Transit Provider has its own directives (required communities and supported functionalities) and PNI (Private Network Interconnect) configuration, but the general idea is similar across all of them: they accept valid prefixes from ACME Telecom. To avoid advertising an invalid prefix by mistake, ACME Telecom filters outbound prefixes. The ACME Transit connectivity (BGP config between the ISP and Transit/Customers) looks as follows:

ACME Telekom IP addressing

Each of the BGP sessions uses the same logic and the same building blocks to achieve the same effect: to advertise allowed prefixes. The repetitive router config referenced in all policy-statements is the prefix-list-filter PXL-CUSTOMERS, which holds ALL prefixes from ALL customers. And it’s the same across ALL ACME PEs facing Transit Providers.

ACME PE1 router configuration

show protocols bgp

group TRANSIT {
    neighbor 203.0.113.10 {
        description TRANSIT-LEVEL4;
        export LEVEL4-OUT;
        peer-as 200;
    }
}


show policy-options policy-statement LEVEL4-OUT

term BADLENGTH {
    from {
        route-filter 0.0.0.0/0 prefix-length-range /25-/32;
    }
    then reject;
}
term ALLOWED-PXL {
    from {
        prefix-list-filter PXL-CUSTOMERS orlonger;
    }
    then accept;
}
term CATCHALL {
    then reject;
}

ACME PE2 router configuration

show protocols bgp

group TRANSIT {
    neighbor 203.0.113.12 {
        description "TRANSIT-ATLANTIS TELEKOM";
        export ATLANTIS-TELEKOM-OUT;
        peer-as 300;
    }
}


show policy-options policy-statement ATLANTIS-TELEKOM-OUT

term BADLENGTH {
    from {
        route-filter 0.0.0.0/0 prefix-length-range /25-/32;
    }
    then reject;
}
term ALLOWED-PXL {
    from {
        prefix-list-filter PXL-CUSTOMERS orlonger;
    }
    then accept;
}
term CATCHALL {
    then reject;
}

ACME PE3 router configuration

show protocols bgp

group TRANSIT {
    neighbor 203.0.113.14 {
        description TRANSIT-BANANA;
        export BANANA-OUT;
        peer-as 400;
    }
}


show policy-options policy-statement BANANA-OUT

term BADLENGTH {
    from {
        route-filter 0.0.0.0/0 prefix-length-range /25-/32;
    }
    then reject;
}
term ALLOWED-PXL {
    from {
        prefix-list-filter PXL-CUSTOMERS orlonger;
    }
    then accept;
}
term CATCHALL {
    then reject;
}

The common denominator in all of the policy-statements is the prefix-list-filter PXL-CUSTOMERS, which is precisely the same across all ACME Transit facing PEs.

set policy-options prefix-list PXL-CUSTOMERS 10.250.16.0/22
set policy-options prefix-list PXL-CUSTOMERS 10.250.64.0/22
set policy-options prefix-list PXL-CUSTOMERS 192.0.2.0/24

And this is the core concept that this series of articles focuses on. This prefix-list is the part of the configuration that is updated by the Python script. Precisely the same updates are applied on all ACME PE Transit routers. The Python script that will be discussed in Part 3 of the series will talk through the code used to update the prefix-list PXL-CUSTOMERS config lines. Without the automation aspect, an IP Engineer needed to log in to each PE router and apply precisely the same configuration to each and every router individually. This change in the process shortens the time an Engineer spends on repetitive configuration, freeing them to address other tasks.

The “how”

Let’s talk now about the “how” that gets our elements to communicate with each other. We will take a closer look at how the code is delivered to the server, and how the server communicates with the edge routers.

Automation Station connectivity

The script is developed on the IDE station, from where it’s pushed to the Computation Resource (Operations Support System or OSS). We develop our code on our local desktop, but we send the code over SSH to the Computation Resource, where it’s run when required/scheduled. I decided to lay down this naming on purpose at the beginning of the article so as not to confuse it with the naming used in the NETCONF protocol architecture, where the server running the Python (NETCONF) script is the client, and the router is the NETCONF server 😉

NETCONF Client-Server data flow

The OSS requires Internet connectivity plus DNS resolution to pull data from Git repositories. After this is set up, a few additional packages need to be installed on the server: python3, ncclient and junos-eznc. We install Python3 to interpret the script code, the ncclient library to use the NETCONF protocol and junos-eznc (Juniper PyEZ) to work with the Junos Operating System.

Enabling NETCONF on routers

A NETCONF session is seen by the router similarly to an incoming user connection. Once the connection is authenticated and authorized successfully, the external entity can query the Junos OS device operational status and influence the operating state of the device by applying configuration changes. In contrast to other management models like SNMP, NETCONF doesn’t manipulate individual data atoms to effect change. Instead, Junos OS makes use of the concept of a candidate configuration, which is applied to the various software daemons when the candidate config is committed. In this respect, NETCONF and the traditional user-based CLI are consistent. Let’s configure the credentials for the Junos OS User Account (authentication). For simplicity, the NETCONF username will have super-user permissions (authorization).

set system login user netconfuser uid 2100
set system login user netconfuser class super-user
set system login user netconfuser authentication password netconf123

We will enable access to the NETCONF SSH subsystem by using the default NETCONF-over-SSH capability. To enable NETCONF daemon to listen on the well-known port tcp/830 — we configure the following:

set system services netconf ssh port 830

The NETCONF protocol is another way of getting management access into your devices. If you have management-plane protection configured, you have to whitelist the source address (the Automation Server) and allow tcp/830 in addition to your tcp/22 (SSH) connections.

You can verify connectivity by manually accessing NETCONF: initiate an SSH session to port 830, requesting the netconf subsystem.

t.schwiertz@automationstation:~$ ssh -p 830 netconfuser@192.0.2.1 -s netconf
Password:

<!-- No zombies were killed during the creation of this user interface -->
<!-- user priv15, class j-super-user -->
<hello xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
<capabilities>
<capability>urn:ietf:params:netconf:base:1.0</capability>
<capability>urn:ietf:params:netconf:capability:candidate:1.0</capability>
<capability>urn:ietf:params:netconf:capability:confirmed-commit:1.0</capability>
<capability>urn:ietf:params:netconf:capability:validate:1.0</capability>
<capability>urn:ietf:params:netconf:capability:url:1.0?scheme=http,ftp,file</capability>
<capability>urn:ietf:params:xml:ns:netconf:base:1.0</capability>
<capability>urn:ietf:params:xml:ns:netconf:capability:candidate:1.0</capability>
<capability>urn:ietf:params:xml:ns:netconf:capability:confirmed-commit:1.0</capability>
<capability>urn:ietf:params:xml:ns:netconf:capability:validate:1.0</capability>
<capability>urn:ietf:params:xml:ns:netconf:capability:url:1.0?scheme=http,ftp,file</capability>
<capability>urn:ietf:params:xml:ns:yang:ietf-netconf-monitoring</capability>
<capability>http://xml.juniper.net/netconf/junos/1.0</capability>
<capability>http://xml.juniper.net/dmi/system/1.0</capability>
</capabilities>
<session-id>12262</session-id>
</hello>

]]>]]>

Our router (the NETCONF server) is listening for incoming NETCONF connections.

PRO TIP: if you are setting this up in a LAB, don’t forget to enable the SSH service on your routers. If you don’t start an SSH server on your routers, you won’t be able to connect to the SSH NETCONF subsystem.

Connecting to GitHub

The GitHub repository was chosen as the Version Control System to store the list of customer prefixes. Each time a new customer gets onboarded, the Engineer adds the new prefixes into the GitHub repository.

The Python script running on the Automation Station requires internet connectivity and DNS resolution. The code refers to a URL to poll information (customer prefixes) from the GitHub repository. The prefixes (text at this point) are copied from the repository and processed by the script, to be sent later to the PE routers as “set policy-options prefix-list PXL-CUSTOMERS” commands.

SSH connection

The SSH connection between the IDE station and the Automation Station allows management access for admins and is used to push code. The Automation Station accepts code from our IDE station (it allows for PyCharm Remote Development via the SSH protocol). We develop our code on our local desktop, but we use the Linux machine as the computation resource. The script is set up to be executed manually, by logging into the Linux machine and running the Python script.

The Python3 script

Let’s now assess the Python3 code that automates the prefix-list update. In the “why” section, we discussed the BGP configuration element and identified a prefix-list named precisely the same way on all Transit facing routers. This unified BGP config is the fundamental assumption on which the whole Python code was written. Each time a new customer joins ACME Telekom ISP, the following prefix-list (on all PE Transit facing routers) gets updated with the new customer prefixes:

set policy-options prefix-list PXL-CUSTOMERS 10.250.16.0/22
set policy-options prefix-list PXL-CUSTOMERS 10.250.64.0/22
set policy-options prefix-list PXL-CUSTOMERS 192.0.2.0/24

And this is now done with the Python3 code below, which adds the additional prefixes to the prefix-list automatically:

#!/usr/bin/env python3

import requests
from jnpr.junos import Device
from jnpr.junos.utils.config import Config

# Poll the list of customer prefixes from the GitHub repository
url = "https://raw.githubusercontent.com/tomaszschwiertz/lines/master/prefix"
read_data = requests.get(url).content.decode('utf8')

# Build a block of "set" commands, one line per prefix
prefixes = read_data.split()
set_cmd = ""
for prefix in prefixes:
    set_cmd += "set policy-options prefix-list PXL-CUSTOMERS " + prefix + "\n"

U = "netconfuser"
P = "netconf123"
ROUTERS = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]

for device in ROUTERS:
    dev = Device(device, port=830, user=U, password=P)
    dev.open()
    conf = Config(dev)
    conf.lock()
    conf.load(set_cmd, format='set')
    print('Updating ACME-' + dev.facts['hostname'] + ' | ' + device)
    conf.pdiff()

    # Commit only if the candidate configuration passes validation
    if conf.commit_check():
        conf.commit()
    else:
        conf.rollback()

    conf.unlock()
    dev.close()

We start by looking at the libraries. The requests library holds the methods to obtain content from a URL; the customer prefixes are stored in the GitHub repository as a prefix list. The imported objects Device and Config enable working with the Junos OS system and its configuration, respectively.

The prefixes (stored as text) are polled as raw data from GitHub and converted into a list of strings. Each string (a prefix) is appended to the Juniper command “set policy-options prefix-list PXL-CUSTOMERS”, which gives us a block of set commands in the variable “set_cmd”. We will use this later when we upload the configuration to the PE routers.
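To make that transformation concrete, here is a minimal, self-contained sketch of the step, using a hypothetical three-prefix repository payload in place of the live HTTP response:

```python
# Hypothetical raw repository content (what the HTTP poll would return),
# one prefix per line
read_data = "10.250.16.0/22\n10.250.64.0/22\n192.0.2.0/24\n"

# Split on whitespace and wrap each prefix in a Junos "set" command
prefixes = read_data.split()
set_cmd = ""
for prefix in prefixes:
    set_cmd += "set policy-options prefix-list PXL-CUSTOMERS " + prefix + "\n"

print(set_cmd)
```

The resulting block of set commands is exactly what gets loaded onto each PE router as a candidate configuration.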

Finally, we see the for loop that cycles through the list of devices (PE lo0.0 addresses). The script opens a NETCONF session to each router using the given credentials and port number. The code acts similarly to a super-user connection. The script locks the configuration on the router (an equivalent of configure exclusive mode) and loads the set commands stored in the variable “set_cmd” (load merge). At this point, we let the script generate some logs for visibility: which router is being updated and the candidate configuration to be added (show | compare).

Next, we have a sanity check mechanism in place: we check the correctness of the syntax. If the check result is positive, the change is committed; otherwise it is rejected (rollback 0). During the script operation the device is locked; therefore, other users are not able to change and submit their own candidate configuration to be committed. If the automated configuration update fails, all changes are rolled back. Junos OS operates under a batch configuration model, which allows committing all config lines at once or none at all. This is an essential fragment of the code, as it follows the best practices of Change Management and separates human configuration input from an automated action. The last two lines take the lock off the device and close the NETCONF session, releasing router resources.
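For reference, the PyEZ calls in the loop map onto the familiar manual CLI workflow roughly like this (a conceptual mapping, not output from the lab):

```
conf.lock()                    ->  configure exclusive
conf.load(set_cmd, 'set')      ->  load set terminal
conf.pdiff()                   ->  show | compare
conf.commit_check()            ->  commit check
conf.commit()                  ->  commit
conf.rollback()                ->  rollback 0
conf.unlock(); dev.close()     ->  exit configuration-mode, log out
```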

The Case Study

Let’s now look at a case study to see the effect of running the script. ACME Telekom has only one IPT customer, whose prefixes are advertised outbound to the Transit Providers. The routing table below is the same on all ACME Transit facing PEs:

PE3# run show route advertising-protocol bgp 203.0.113.14

inet.0: 24 destinations, 26 routes (24 active, 0 holddown, 0 hidden)
  Prefix                  Nexthop              MED     Lclpref    AS path
* 10.128.16.0/22          Self                                    500 I
* 10.250.64.0/22          Self                                    500 I
* 192.0.2.0/24            Self                                    I

The 192.0.2.0/24 IP space is dedicated to ACME infrastructure, while prefixes 10.128.16.0/22 and 10.250.64.0/22 belong to customer CUST A. Connectivity is in place and routing works as expected.

Now, a new customer (CUST B) joins ACME Telekom. The customer’s ASN is 600, and its address space is 172.20.0.0/21 and 172.27.128.0/17. From the ACME ISP point of view, the topology changes as follows:

ACME Telekom’s new IPT Customer

On PE4, the following BGP config is added to establish a peering session towards CUST B:

show protocols bgp

group IPT-CST-FULL-TABLE {
    type external;
    neighbor 192.0.2.23 {
        description IPT-AS600-CUST-B;
        import IPT-AS600-IN;
        peer-as 600;
    }
}


show policy-options policy-statement IPT-AS600-IN

term ACCEPT {
    from {
        policy CST-IN;
        prefix-list-filter IPT-CST-AS600 orlonger;
    }
    then accept;
}
term DENY {
    then reject;
}


show policy-options

prefix-list IPT-CST-AS600 {
    172.20.0.0/21;
    172.27.128.0/17;
}

Now — let’s look at the GitHub repository. Before CUST B joined — the entries in the repository looked as follows:

prefix-list repository — before CUST B gets onboarded

The GitHub repository holds entries for the IP space dedicated to ACME infrastructure and the CUST A prefixes (10.128.16.0/22 and 10.250.64.0/22). The ACME Engineer adds the new prefixes from CUST B to GitHub and runs the script.

prefix-list repository — after CUST B gets onboarded

And this was the final manual task the Engineer had to perform to successfully provision a new IPT service for a new customer. Next, we will examine the logs that the script generates while updating the router config.

t.schwiertz@automationstation:/home/development$ python3 demo.py
Updating ACME-PE1 | 192.0.2.1

[edit policy-options prefix-list PXL-CUSTOMERS]
+ 172.20.0.0/21;
+ 172.27.128.0/17;

Updating ACME-PE2 | 192.0.2.2

[edit policy-options prefix-list PXL-CUSTOMERS]
+ 172.20.0.0/21;
+ 172.27.128.0/17;

Updating ACME-PE3 | 192.0.2.3

[edit policy-options prefix-list PXL-CUSTOMERS]
+ 172.20.0.0/21;
+ 172.27.128.0/17;

The script logged into each router and changed the prefix-list PXL-CUSTOMERS configuration element. For each router, the hostname is gathered from the “facts” data collection and displayed for easier tracking of the changes being introduced. The script successfully cycled through the whole list of IP addresses while committing the new config lines. The prefix-list is now updated and looks as follows:

set policy-options prefix-list PXL-CUSTOMERS 10.128.16.0/22
set policy-options prefix-list PXL-CUSTOMERS 10.250.64.0/22
set policy-options prefix-list PXL-CUSTOMERS 172.20.0.0/21
set policy-options prefix-list PXL-CUSTOMERS 172.27.128.0/17
set policy-options prefix-list PXL-CUSTOMERS 192.0.2.0/24

A new set of BGP prefixes is now advertised to the Transit Providers, and it includes the prefixes of CUST B.

PE3# run show route advertising-protocol bgp 203.0.113.14

inet.0: 24 destinations, 26 routes (24 active, 0 holddown, 0 hidden)
  Prefix                  Nexthop              MED     Lclpref    AS path
* 10.128.16.0/22          Self                                    500 I
* 10.250.64.0/22          Self                                    500 I
* 172.20.0.0/21           Self                                    600 I
* 172.27.128.0/17         Self                                    600 I
* 192.0.2.0/24            Self                                    I

A final test to confirm reachability: a traceroute from the CUST B lo0 address to a simulated DNS address, 8.8.8.8. This test concludes a successful IPT customer onboarding.

CUST-B# run traceroute 8.8.8.8 source 172.20.0.1 | no-more

traceroute to 8.8.8.8 (8.8.8.8) from 172.20.0.1, 30 hops max
1 192.0.2.22 (192.0.2.22) 2.319 ms 2.301 ms 1.752 ms
2 192.0.2.16 (192.0.2.16) 3.348 ms 3.098 ms 2.766 ms
MPLS Label=299936 CoS=0 TTL=1 S=1
3 192.0.2.10 (192.0.2.10) 3.431 ms 2.681 ms 3.442 ms
4 203.0.113.10 (203.0.113.10) 4.694 ms 4.233 ms 4.550 ms
5 8.8.8.8 (8.8.8.8) 5.932 ms 6.066 ms 5.681 ms

The Summary

And that’s a wrap on the Network Automation for ISPs article. In this document we examined the code itself and discussed a case study where a new customer joins the ISP. We checked the BGP advertisements and prefix-list filters before and after running the automated prefix filter update. The idea described in the article can be expanded by further developing the Python code and the process itself. To see the code, config, diagrams and more, visit my repository on GitHub: Tom’s Network Automation Project. While writing this article, I came up with a few ideas that could enrich the process and take the automation aspect even further. Three key concepts can be developed further:

  • Python Exception Handling: the PyEZ library provides an exception module which contains exception classes unique to Junos. Without processing exceptions, the program crashes and leaves part of the routers updated and part unchanged. In a production environment we want to catch the exception and process it, leaving a clear message on what went wrong, or retry the config change. Exception handling recognizes events like connection authentication errors, the maximum user limit being reached, configuration DB lock errors, timeouts and commit errors.
  • Customer Facing Configuration: how customers are connected to the ACME ISP. In this case study we had a Single Homed customer. Customers that seek High Availability decide on Dual Homed connectivity. The latter option assumes having two BGP sessions on two separate PE routers. Here we can adopt the same principle of provisioning the same configuration on two different routers that will serve the same client. The repetitive part is the customer ASN and prefixes.
  • Scheduled Script Execution: the script in its present form is triggered by an Engineer right after editing the GitHub repository. What if multiple customers are onboarded during a day? The script could be run once a day by the cron tool, say at midnight. This would update the prefixes after the whole day of work. Additionally, an SMTP module could be added to the script to send e-mails containing the logs generated by the Python script. Such a confirmation would be a great aid in making sure that the customer was onboarded successfully.
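As a sketch of the first idea, the per-router loop could wrap its work in a try/except so that one failing device does not abort the whole run. The update_router helper below is hypothetical and simulates a failure on one router; in the real script its body would hold the PyEZ open/lock/load/commit sequence and catch PyEZ’s Junos-specific exception classes instead of a plain ConnectionError:

```python
# Hypothetical helper standing in for the PyEZ open/lock/load/commit sequence;
# it simulates an authentication failure on one router
def update_router(ip):
    if ip == "192.0.2.2":
        raise ConnectionError("authentication failed")
    return "committed"

ROUTERS = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
results = {}
for ip in ROUTERS:
    try:
        results[ip] = update_router(ip)
    except ConnectionError as err:
        # Log the failure and continue, so the remaining routers still get updated
        results[ip] = "FAILED: " + str(err)

for ip, status in results.items():
    print(ip, status)
```

A summary like this could also feed the e-mail report mentioned in the third idea.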
Network Automation for ISPs

The demand for Network Automation grows day by day. There will be even more demand for these kinds of skills after the vendors release newer hardware, making it easier to work with APIs and on-box/off-box automation. The early adopters of the automation mindset will be rewarded for taking this step early on their journey towards Network Automation. I enjoy working with automation, as well as writing about it. Like, share and let me know in the comments what is of interest to you, so I can start preparing the next article 😉


Tomasz Schwiertz

ISP Network Engineer, Architect, CCIE Candidate London based CISCO Trained Professional | follow me on IG: @tomaszschwiertz https://taplink.cc/tomaszschwiertz