From Sai Sindhur Malleni
One of my responsibilities as a performance engineer working on OpenStack is to make sure that OpenStack scales, no matter what the use case is or which backend technologies are used. The flexibility to use multiple open source and proprietary backends to support the various OpenStack services, such as Neutron, Cinder and Glance, is what makes OpenStack a force to be reckoned with in the cloud ecosystem.
As OpenStack adoption increases across verticals, the SDN controller that manages all of the virtual networking is taking center stage. OpenDaylight, with its flexible and extensible architecture, support for multivendor networking devices, network programmability for controlling both the underlay and the overlay, and tight integration with OpenStack, is becoming the de facto choice of Neutron backend in NFV deployments.
Over the last six weeks, several colleagues and I were involved in a massive effort, spanning multiple teams and time zones, to test and improve the scale and performance of OpenDaylight. The scope of this work is in line with the objectives of the S3P WorkGroup, and we are quite happy with the progress made thus far. This blog post goes over all the hardening that went into OpenDaylight in the scale and performance realms for the Nitrogen release.
Our lab inventory consisted of 13 Dell R630 nodes, each with Intel Haswell processors (28 cores, 56 threads), 128 GB of memory and an Intel X710 quad-port NIC. We used custom-built Carbon SR-2 RPMs (built to test some patches before they were merged) and deployed OpenDaylight in both clustered and standalone configurations:
- 3 OpenStack controllers, 3 ODLs clustered, 1 OpenStack undercloud and the rest of the nodes as compute nodes
- 1 OpenStack controller, 1 ODL, 1 OpenStack undercloud and the rest of the nodes as compute nodes
Browbeat was used to orchestrate the tests with Rally, monitor the environment using the Collectd/Graphite/Grafana stack, and store test results in Elasticsearch. Browbeat takes a simple YAML-based configuration file describing the control plane tests you want to run and orchestrates them on the OpenStack cloud. Some of the Rally scenarios that were run included creating networks, subnets, Neutron ports, routers and security group rules, and booting VMs on subnets, with each done 500 “times” at varying concurrencies of 8, 16 and 32. While “times” denotes the total number of resources of each type to create, the concurrency denotes how many of them are created in parallel, so these two knobs effectively let you vary the load placed on OpenDaylight.
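To make those two knobs concrete, here is a rough sketch of what a Browbeat-style workload entry could look like; the exact keys and scenario names are illustrative assumptions rather than a copy of our actual configuration.

```python
# Illustrative sketch of a Browbeat-style workload definition (keys and
# scenario names are assumptions, not our exact configuration).
import yaml  # PyYAML

workload = {
    "browbeat": {"rerun": 1},
    "rally": {
        "enabled": True,
        "benchmarks": [
            {
                "name": "neutron",
                "enabled": True,
                "scenarios": [
                    # "times" = total resources created, "concurrency" = how
                    # many are created in parallel; together they set the load
                    # placed on Neutron and, through it, on OpenDaylight.
                    {"name": "create-list-network", "times": 500, "concurrency": 8},
                    {"name": "create-list-network", "times": 500, "concurrency": 16},
                    {"name": "create-list-network", "times": 500, "concurrency": 32},
                ],
            }
        ],
    },
}

print(yaml.safe_dump(workload, default_flow_style=False))
```

Each scenario entry pairs a total resource count with a concurrency level, which is exactly how we stepped the load from light to heavy.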
One of the biggest issues we focused on during this round of testing was out-of-memory errors, and the subsequent death of the OpenDaylight process, when creating Neutron resources at scale. Using our automation tooling (Ansible, Collectd, Graphite and Grafana), we were able to actively monitor heap memory usage and use Eclipse MAT to analyze the hprof files dumped on OOM. Ansible was used to install collectd (a lightweight daemon that monitors system resource usage) on the nodes, Graphite was the data store for the resulting time-series data, and Grafana was used to visualize the data as graphs. This is a great example of using several open source technologies to make another open source technology better. The memory leaks we observed were traced back to unclosed transactions in the openflowplugin. Our test team took this information to the ODL upstream teams, who promptly addressed the issue with fixes included in the Nitrogen release. It was inspirational to see that level of collaboration in the community.
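As an illustration of how that monitoring can be consumed programmatically, a small script along these lines can pull heap-usage samples out of Graphite's render API; the Graphite host and metric path below are hypothetical and depend entirely on how collectd is configured.

```python
# Hypothetical example: pull JVM heap-usage samples for an ODL node from
# Graphite's render API (host and metric path are placeholders).
import requests

GRAPHITE = "http://graphite.example.com"
TARGET = "collectd.odl-controller-0.GenericJMX-memory-heap.memory-used"  # assumed path

resp = requests.get(
    f"{GRAPHITE}/render",
    params={"target": TARGET, "from": "-1h", "format": "json"},
    timeout=30,
)
resp.raise_for_status()

for series in resp.json():
    points = [value for value, _ts in series["datapoints"] if value is not None]
    if points:
        print(f"{series['target']}: max heap used ~{max(points) / 1024**2:.0f} MiB")
```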
We also looked into the networking-odl v2 driver, which is the glue that connects OpenStack Neutron to OpenDaylight. The driver uses a journal table to keep a queue of the operations that occurred in Neutron and need to be mirrored to the OpenDaylight controller. We identified several optimization opportunities in the way journaling was done, and we also profiled the Galera database cluster that houses this table. With a few optimizations, we were able to achieve a 20x reduction in the CPU consumption of the mysqld process.
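Conceptually, the journal works like an ordered queue in the database: each Neutron operation is recorded as a pending row, and a worker replays rows against OpenDaylight and marks them completed. The sketch below illustrates only that pattern; the real networking-odl schema, locking and retry logic are more involved.

```python
# Minimal sketch of the journaling pattern (illustrative only; the actual
# networking-odl journal table and worker are more sophisticated).
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """CREATE TABLE journal (
           seqnum INTEGER PRIMARY KEY AUTOINCREMENT,
           object_type TEXT,      -- e.g. 'network', 'port', 'router'
           operation TEXT,        -- 'create', 'update', 'delete'
           data TEXT,             -- serialized resource
           state TEXT DEFAULT 'pending')"""
)

# Neutron API workers append entries as operations happen.
db.execute("INSERT INTO journal (object_type, operation, data) VALUES (?, ?, ?)",
           ("network", "create", '{"name": "net-1"}'))

# A journal thread picks up the oldest pending row, mirrors it to ODL's
# northbound REST API, and marks it completed (the HTTP call is omitted here).
row = db.execute(
    "SELECT seqnum, object_type, operation, data FROM journal "
    "WHERE state = 'pending' ORDER BY seqnum LIMIT 1").fetchone()
if row:
    seqnum, obj, op, data = row
    print(f"replaying {op} {obj} -> OpenDaylight: {data}")
    db.execute("UPDATE journal SET state = 'completed' WHERE seqnum = ?", (seqnum,))
db.commit()
```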
Clustering was another area of focus. We identified an issue where OpenStack VMs wouldn’t boot when a clustered OpenDaylight configuration was being used. To give a bit of context, OpenStack considers a VM active only after the actual plumbing of the OVS interfaces and flows happens on the hypervisor, so OpenStack’s compute service, Nova, waits for the Neutron port to be set to active before the VM is considered “active”. When using OpenDaylight, a websocket is used to communicate port status information from the OpenDaylight controller to networking-odl, which in turn sets the port status in the Neutron database. In a clustered OpenDaylight setup there are 3 ODLs and 3 instances of networking-odl, one on each OpenStack controller. HAProxy assigns a Virtual IP (VIP) to the ODL cluster; this VIP could be on any one of the OpenDaylight nodes (not necessarily the ODL cluster leader). However, the Neutron events that trigger flow creation and activation of the operational port occur only on the leader, so when the VIP is not on the leader, the websocket notifications aren’t established against the leader, causing a failure to communicate port status from OpenDaylight to Neutron via networking-odl. The issue was fixed by having each networking-odl instance on an OpenStack controller establish a websocket with the local OpenDaylight member. This was a major win in hardening the integration points between OpenDaylight and OpenStack. A good amount of failover testing was also done to exercise ODL’s clustering feature.
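In spirit, the fix means each networking-odl instance listens for port-status notifications from the OpenDaylight member on the same controller instead of going through the VIP. A rough sketch of such a local subscription might look like the following; the websocket URL and stream path are placeholders, not the exact ones networking-odl uses.

```python
# Rough sketch: listen for port-status change notifications from the local
# ODL member rather than the cluster VIP (the stream URL is an assumed
# placeholder, not the exact path used by networking-odl).
import websocket  # pip install websocket-client

LOCAL_ODL = "ws://127.0.0.1:8185/data-change-event-subscription/neutron-ports"

ws = websocket.create_connection(LOCAL_ODL)
try:
    while True:
        message = ws.recv()  # blocks until ODL pushes a notification
        # On a port going active, networking-odl would parse the payload and
        # update the port status in the Neutron database so Nova can mark the
        # VM as active.
        print("port status notification:", message)
finally:
    ws.close()
```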
This testing also led to overall improvement of OpenStack itself, as we were able to tune certain kernel parameters to help OpenStack scale to hundreds of networks and instances. Overall, we are extremely confident that all the work that went into improving the scalability, stability and performance of OpenDaylight means a release that is better than ever before. In my colleague Daniel Farrell’s words, “It seems to me that this was one of the more important stability improvements in ODL’s history. The combination of expert performance testers, dedicated hardware, close support from experts with direct access to the testing environment and tight connections to the relevant upstream projects had never happened so well.”
I have to say that I have had a very positive experience working with the upstream OpenDaylight community. It is really heartening to see the developers, packaging team and performance engineers come together and achieve more than any individual ever could. As a performance engineer, I was thrilled to see the OpenDaylight community treat performance and scale as first-class citizens and focus on them before the release rather than after.