Considerations in Physical or Cloud-based Environments

by Michael V. and Kate R.

Designing a Network Operations Center (NOC) to support various customer missions will inevitably involve customer-specific considerations. However, ActioNet’s experience and expertise in this area have produced some best practices for establishing and operating a NOC that, aside from the physical space design, you can consider if you are supporting your customers in standing up a NOC. This example shows how ActioNet provides a state-of-the-art Network Operations Center for a high-profile federal customer we support. ActioNet’s NOC implements a suite of vendor-agnostic and proprietary tools to monitor and support the network. The tools are selected based on their capabilities for monitoring data center and cloud networks.

Team Structure

The team is structured with multiple tiers of escalation support to provide rapid responses to network incidents and events. Tickets are escalated based on predefined metrics to meet mean-time-to-repair objectives and exceed Service Level Agreement (SLA) expectations. We are able to exceed the customer’s expectations by developing nimble NOC workflow processes that can be engaged in more than one way: the customer can open a ticket through the custom-built ServiceNow support portal, or the NOC can proactively respond to anomalous events flagged by our monitoring tools. Ultimately, the NOC sits in the middle of all network activities and provides valuable insight into the health of the network.
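The tiered escalation described above can be sketched in a few lines of Python. This is only an illustration, not ActioNet's actual workflow: the severity labels, the `Ticket` fields, and the minute thresholds in `ESCALATION_MINUTES` are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical thresholds: cumulative minutes open at which a ticket moves
# from Tier 1 to Tier 2, and from Tier 2 to Tier 3, for each severity.
ESCALATION_MINUTES = {
    "critical": (15, 45),
    "major": (60, 180),
    "minor": (240, 720),
}


@dataclass
class Ticket:
    ticket_id: str
    severity: str      # one of the keys in ESCALATION_MINUTES
    minutes_open: int  # time elapsed since the ticket was opened


def current_tier(ticket: Ticket) -> int:
    """Return the support tier (1, 2, or 3) that should own the ticket now."""
    thresholds = ESCALATION_MINUTES[ticket.severity]
    # Each threshold the ticket has reached pushes it up one tier.
    return 1 + sum(1 for limit in thresholds if ticket.minutes_open >= limit)
```

With these assumed values, a critical ticket that has been open for 20 minutes would belong to Tier 2, while a minor ticket of the same age would still sit with Tier 1.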

The NOC is overseen by the NOC Watch Officer, whose primary role is to monitor the health of the NOC itself and address any obstacles the NOC may encounter.

Tools

The NOC uses a mix of traditional and modern tools to monitor and support the network. Where legacy nodes have limited monitoring capabilities, the NOC relies on traditional monitoring protocols. Where deeper telemetry is required, the NOC implements modern API-driven monitoring tools.

Furthermore, the NOC’s approach is to use a suite of multi-vendor monitoring tools rather than tools from a single vendor, in order to avoid the vendor security and supply chain failures the industry has experienced in recent years.

Log aggregation is a common function implemented across all services in an organization. The documentation should include enough information about the collection agent, the log destination, and the application log formats.

Direct links to the log aggregation web interface should be provided whenever possible, including links to commonly used saved searches. Any commonly run queries should be documented here, along with a brief description of how and when they can be used. Anything that makes it easier for Operations to identify issues or narrow their investigation will save time during an outage.

Most applications will implement some sort of authentication and access control to ensure that only valid users have access to information that is appropriate for their role. At a minimum, this section should describe how the application is configured to perform access control. For example, it might provide the LDAP connection information, location of the configuration, and any special roles or permissions required for administration of the application.

The objective of this section is to make it quick and easy for Operations to identify what could have gone wrong with the system if someone complains that they are not able to authenticate or do not have access to the necessary resources. It should also identify what group of administrative users can be contacted if special permissions are needed to investigate an issue.

Applications which receive or produce data often have automated cleanup processes that remove obsolete data to ensure that the system continues to perform well over time. For example, a time series database might have a process which deletes data older than 30 days, or a binary repository might purge artifacts that conform to a specific set of rules. This section should describe those automated processes and the rules that determine what they delete.
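A retention rule such as "delete data older than 30 days" can be illustrated with a short sketch. The function names, the file-based storage layout, and the use of file modification time as the age criterion are assumptions made for the example; a real time series database or binary repository would apply its own rules.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def find_expired(directory, max_age_days=30, now=None):
    """Return files under `directory` last modified before the cutoff."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return sorted(
        p for p in Path(directory).rglob("*")
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def purge(directory, max_age_days=30, dry_run=True):
    """List expired files and delete them unless dry_run is True."""
    expired = find_expired(directory, max_age_days)
    if not dry_run:
        for path in expired:
            path.unlink()
    return expired
```

Defaulting to `dry_run=True` is a deliberate choice here: an operator can see what the rule would delete before allowing it to remove anything.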

When a disk alert is received from your monitoring system, this section should provide instructions about what actions can be taken to provide immediate short-term relief. If the filesystem is 100% full it may be necessary to take immediate action to cleanly shut down the application, increase the storage, and bring the application back on-line. In other cases, it may be possible to clear caches or execute cleanup scripts to bring disk, memory, or CPU usage back under control. Documenting how and when these cleanup activities should be executed will save critical time when responding to system alerts.
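As a rough illustration of how a disk alert could map to the responses above, the sketch below checks filesystem usage with the standard library's `shutil.disk_usage`. The 80% and 95% thresholds and the action names are hypothetical and would need to be set per application.

```python
import shutil


def disk_action(used_fraction, warn=0.80, critical=0.95):
    """Map a disk usage fraction (0.0 to 1.0) to a suggested response."""
    if used_fraction >= critical:
        # Nearly full: stop the application cleanly and add storage.
        return "shutdown-and-expand"
    if used_fraction >= warn:
        # Filling up: clear caches or run cleanup scripts.
        return "run-cleanup"
    return "ok"


def check_path(path="."):
    """Measure usage of the filesystem that holds `path` and suggest an action."""
    usage = shutil.disk_usage(path)
    return disk_action(usage.used / usage.total)
```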

Application tuning can take many forms. In the Java world, it is typically a set of Java Virtual Machine (JVM) arguments that define the memory limits or the garbage collection strategy. In the database world it may be a set of configuration parameters that define the number of concurrent network connections, long running query restrictions, or other characteristics. This section should provide enough information for the reader to understand where and how those parameters can be changed, as well as any rules of thumb for how they can be tuned for this application to resolve common issues. For example, if the application owners have developed guidelines for how to optimize the memory allocation based on the number of users, concurrent requests, or other observable data, that calculation can be provided here to provide the Operations team with some guidelines for what is or is not appropriate.
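A memory sizing guideline of this kind might look like the sketch below. The formula (a fixed base plus a per-user allowance, capped at a fraction of host memory) and every default value are hypothetical placeholders, not recommendations; only the `-Xms` and `-Xmx` flags are standard JVM options for the initial and maximum heap size.

```python
def jvm_heap_args(concurrent_users, host_memory_mb,
                  base_mb=512, per_user_mb=2, max_host_fraction=0.75):
    """Return JVM heap flags sized from an assumed per-user rule of thumb."""
    wanted = base_mb + concurrent_users * per_user_mb
    # Never ask for more than the assumed safe share of host memory.
    ceiling = int(host_memory_mb * max_host_fraction)
    heap = min(wanted, ceiling)
    # Initial and maximum heap are set to the same value in this sketch.
    return [f"-Xms{heap}m", f"-Xmx{heap}m"]
```

For instance, under these assumed defaults, 1,000 concurrent users on a host with 8,192 MB of memory would yield a 2,512 MB heap.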

Automation

Automation is used to respond to and act on critical events, specifically where an instant response is required. ActioNet’s DevSecOps practice team extends its operations principles into the NOC services. DevSecOps allows the NOC to deliver custom-built services and work around limitations typically seen in monitoring tools.

For example, CloudWatch Logs is used to track various log activities. The log dashboard was divided into three sections:

Expected API calls to monitor expected traffic
Error logs to identify issues that need to be fixed
Potential attacks, which can be immediately addressed to prevent break-ins and intrusions

Alerts are associated with each section and notify the appropriate teams in real time when something needs action.
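The three-way split of the log dashboard, with an alert route per section, can be sketched as follows. This is plain Python rather than CloudWatch configuration, and the log entry fields, the attack patterns, and the team names in `ROUTES` are all hypothetical.

```python
import re

# Illustrative patterns only; a real deployment would use a maintained rule set.
ATTACK_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"union\s+select", r"\.\./", r"<script")
]

# Hypothetical mapping from dashboard section to the team that gets alerted.
ROUTES = {
    "expected_api_call": None,          # normal traffic, no alert
    "error": "application-team",
    "potential_attack": "security-team",
}


def classify(entry):
    """Place a log entry (a dict) into one of the three dashboard sections."""
    text = f"{entry.get('path', '')} {entry.get('message', '')}"
    if any(p.search(text) for p in ATTACK_PATTERNS):
        return "potential_attack"
    if entry.get("status", 200) >= 500 or entry.get("level") == "ERROR":
        return "error"
    return "expected_api_call"


def route(entry):
    """Return the team to notify for this entry, or None if no alert is needed."""
    return ROUTES[classify(entry)]
```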

Another example is utilization monitoring. Here ActioNet used CloudWatch to monitor critical metrics for EC2 instances, setting alerts at specific levels to trigger actions that prevent service degradation or outages. These metrics also support planning for growth or, alternatively, identifying resources that can be reduced to gain efficiency and save the customer money.
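The alerting and right-sizing logic above can be expressed as a small evaluation function. It is a simplified stand-in written in plain Python, not the CloudWatch alarm API; the thresholds, the three-period rule, and the returned state names are assumptions for the example.

```python
def evaluate(datapoints, high=80.0, low=10.0, periods=3):
    """Classify a series of utilization percentages, oldest first.

    Returns "alarm" when the last `periods` datapoints all exceed `high`,
    "downsize_candidate" when every datapoint stays below `low`,
    "insufficient_data" when fewer than `periods` datapoints exist,
    and "ok" otherwise.
    """
    recent = datapoints[-periods:]
    if len(recent) < periods:
        return "insufficient_data"
    if all(value > high for value in recent):
        return "alarm"
    if max(datapoints) < low:
        return "downsize_candidate"
    return "ok"
```

Requiring several consecutive breaches before alarming is one common way to avoid paging on a brief spike; the right number of periods depends on the metric.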

ActioNet implemented Nagios to provide a single-pane-of-glass view of the up/down state of client environments. Alerting is also tied to the Nagios dashboards and gives NOC personnel a means of rapidly identifying potential failures.

ActioNet also planned for the future by selecting tools that integrate well with other monitoring and security tools and which allow for expansion and flexibility.

Finally, ActioNet is delivering the ServiceNow IT Operations Management (ITOM) capability to the client. Using AIOps and API connections from the client’s current and future monitoring tools into ServiceNow, it is intended to predict issues, reduce user impact, and automate resolutions.

Proactive Monitoring

The mission of ActioNet’s NOC is to provide a proactive monitoring and response service. The NOC strives to monitor the environment and respond before something breaks. One of our objectives is to be ready to match a customer-reported issue with one that we have already identified, so that when the customer reports a problem we can quickly correlate it and provide a factual response rather than learn about the issue for the first time.

For example, suppose the NOC identifies that an interface has started dropping packets, which may slightly degrade application performance. The issue would not cause an immediately noticeable impact and could go unnoticed for months. However, as more traffic traverses that interface over time, the degradation becomes more noticeable. A properly configured monitoring tool can identify this anomaly so that it is fixed proactively. Without such monitoring, this type of issue is usually escalated and only identified and resolved by higher-tier engineers after weeks of troubleshooting.
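To make the packet drop scenario concrete, the sketch below computes a drop rate from successive readings of cumulative interface counters and flags an interface whose rate stays above a threshold. The sample format, the 0.1% threshold, and the three-interval rule are assumptions for illustration; how the counters are collected is left out.

```python
def drop_rates(samples):
    """Compute the per-interval drop rate from cumulative counter readings.

    `samples` is a list of (total_packets, dropped_packets) tuples, oldest first.
    """
    rates = []
    for (p0, d0), (p1, d1) in zip(samples, samples[1:]):
        packets = p1 - p0
        drops = d1 - d0
        # No traffic or a counter reset: treat the interval as drop-free.
        rates.append(drops / packets if packets > 0 else 0.0)
    return rates


def is_dropping(samples, threshold=0.001, sustained=3):
    """Return True if the last `sustained` intervals all exceed the threshold."""
    recent = drop_rates(samples)[-sustained:]
    return len(recent) == sustained and all(r > threshold for r in recent)
```

A low threshold matters here because, as the example above notes, the early drop rate is small enough to escape notice by users.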

Performance Baselining

The ActioNet NOC monitors network performance by baselining the normal state of the network. The NOC takes on this role to support capacity planning, correlating the current baseline performance with future projects and changes. For example, if a firewall’s CPU utilization averages 50% and there is a plan to add AWS VPCs, the NOC has the visibility to flag the risk ahead of time and raise it with leadership and application owners.
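The firewall example can be turned into a simple projection. The sketch assumes that each added VPC increases CPU utilization by a fixed, separately estimated amount and that 80% is the acceptable ceiling; both figures are hypothetical, and real load rarely scales this linearly.

```python
from statistics import mean


def projected_utilization(samples, added_units, per_unit_increase):
    """Project utilization as the baseline average plus a linear increase."""
    return mean(samples) + added_units * per_unit_increase


def capacity_risk(samples, added_units, per_unit_increase, ceiling=80.0):
    """Return True if the projection would exceed the assumed ceiling."""
    return projected_utilization(samples, added_units, per_unit_increase) > ceiling
```

With a baseline averaging 50% and an assumed 9 percentage points per VPC, adding four VPCs projects to 86% and would be flagged, while adding two projects to 68% and would not.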