** There is a disclaimer at the bottom of this blog. You may want to read that first. **
Earlier in June (2016) Chuck Robbins announced the Cisco Tetration Analytics Appliance. This is a major push into a much-needed space: Data Center Analytics. The appliance itself is a big data appliance, purpose-built from the ground up to provide enhanced real-time visibility into the transactions occurring in a data center. Before we get into what that actually means, let me set the stage.
IT organizations have a lot on their hands with the rapid change in applications and the fast-paced new demands for IT services. Some of these include:
- Disaster Recovery Planning
- Public and hybrid cloud migration
- Enhancing security
A first step in successfully executing on any of these is having an understanding of the applications and their dependencies. Most IT shops don't know exactly how many applications run in their data center, much less what all of their interdependencies are. What is required is known as an Application Dependency Mapping (ADM).
Quick aside on ‘application dependency’:
Applications are not typically a single container, VM, or server. They are complex ecosystems with outside dependencies. Let's take a simple example of an FTP server. This may be a single-VM application, yet it still has various external dependencies. Think of things like DNS, DHCP, IP storage, AD, etc. (if you don't know the acronyms you should still see the point.) If you were to migrate that single VM running FTP to a DR site, cloud, etc. that did not have access to those dependencies, your 'app' would be broken.
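To make the idea concrete, here's a minimal Python sketch of what an application-plus-dependencies view might look like. Everything in it (the app name, the service list, the target site) is made up for illustration; it is not how Tetration represents anything internally.

```python
# Hypothetical representation of a single "application" and its external
# dependencies. Even a one-VM FTP server depends on shared services that
# must also be reachable at the DR site or cloud target.
ftp_app = {
    "name": "ftp-01",
    "vms": ["ftp-vm-01"],
    "depends_on": ["dns", "dhcp", "active-directory", "ip-storage"],
}

# Services available at a hypothetical migration target.
target_site_services = {"dns", "ip-storage"}

# Any dependency missing at the target is a migration risk: the 'app'
# is broken even though the VM itself moved cleanly.
missing = [d for d in ftp_app["depends_on"] if d not in target_site_services]
if missing:
    print(f"{ftp_app['name']} cannot be safely migrated; missing: {missing}")
```

The point is simply that the VM isn't the app; the app is the VM plus everything it reaches out to.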
The reason these dependency mappings are so vital for DR and cloud migration is that you need to know what apps you have, and how they're interdependent, before you can safely move, replicate, or change them.
From a security perspective, the value is also immense. First, with full visibility into what apps are running, you can make intelligent decisions on which apps you may be able to decommission. Removing unneeded apps reduces your overall attack surface. This may be something as simple as shutting down an FTP server someone spun up for a single file move, but never spun down (much less patched, etc.) The second security advantage is that once you have visibility into everything that is working, and needs to be, you can more easily create security rules that block the rest.
Traditionally, getting an Application Dependency Mapping is a painful, slow, and expensive process. It can be done with internal resources documenting the applications manually. More commonly, it's done by a third-party consultant using both manual and tooled processes. Aside from cost and difficulty, the real problem with these traditional approaches is that they're so slow and static that the results become useless. If it takes six months to document your applications in a static form, the end result has little to no value, because the data center changes so rapidly.
The original Tetration project set out to solve this problem first, automatically and in real time. I'll discuss both what this means and the enhanced use-cases Tetration's method makes possible.
Data Collection: Pervasive Sensors
First, let's discuss where we collect data to begin the process of ADM. Tetration uses several sources:
Network Sensors:
One place with great visibility is the network itself. Each and every device (server, VM, container), hereafter known as an 'end-point', must be connected through the network. That means the network sees everything. In the past, the best tool available for collecting application data was Netflow. Netflow was designed as a troubleshooting tool and provides great visibility into 'flow header' info from the switch. While Netflow is quite useful, it has limitations when used for things like security or analytics.
The first limitation is collection rates. Netflow is not a hardware (ASIC-level) implementation. That means that in order to do its job on a switch, it must push processing to the switch CPU. To quickly simplify: Netflow requires heavy lifting, and data center switches can't handle full-rate Netflow. Because of this, Netflow is set to sample data. For troubleshooting, this is no problem, but when trying to build an ADM, you'd ideally look at everything, not a sampling.
The next problem with Netflow for our purposes is what it doesn't look at. Remember, Netflow was really designed for 'after-action' troubleshooting. Because of that intent, it was never designed to collect things like time-stamps from the switch. When looking at ADM and application performance, time-stamping becomes very important, so having that, along with other detailed information Netflow can't provide, became very relevant.
Knowing this, the team chose not to rely on Netflow for our purposes. They needed something more robust, specifically for the data center space. Instead, they designed next-generation switching silicon that can provide what I lovingly call 'Netflow on steroids.' This is 'line-rate' data about all of the transactions traversing a switch, along with things like time-stamping and more.
That becomes our network 'sensor.' Using those sensors gives us an amazing view, but it's not everything. What those sensors are really doing is not ADM; they're simply telling us 'who's talking to whom, about what.' For network engineers, this is source/destination IP combined with TCP/UDP port, plus some other info. Think of this as a connectivity mapping. To turn this into an application mapping, more data was needed.
As of the time of this writing, these 'network sensors' are built into the hardware of the Cisco Nexus 9200 and 9300 EX series switches.
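To illustrate the difference between raw flow telemetry and a connectivity map, here's a small Python sketch. The record fields are simplified stand-ins of my own invention, not the actual format the switch hardware exports.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FlowRecord:
    """Simplified stand-in for a hardware-exported flow record."""
    src_ip: str
    dst_ip: str
    dst_port: int
    protocol: str          # "tcp" / "udp"
    start_ns: int          # switch timestamp in nanoseconds (the kind of
                           # detail classic sampled Netflow doesn't carry)

def connectivity_map(records):
    """Collapse raw flow records into 'who talks to whom, on what port'."""
    edges = set()
    for r in records:
        edges.add((r.src_ip, r.dst_ip, r.protocol, r.dst_port))
    return edges

records = [
    FlowRecord("10.0.1.5", "10.0.2.10", 21, "tcp", 1_700_000_000_000),
    FlowRecord("10.0.1.5", "10.0.0.53", 53, "udp", 1_700_000_000_500),
    FlowRecord("10.0.1.5", "10.0.2.10", 21, "tcp", 1_700_000_001_000),  # repeat flow
]
print(connectivity_map(records))   # three records, two unique edges
```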
Server Sensors:
To really understand an application, you want to be close to the app itself. This means sitting in the operating system. The team needed data that simply isn't visible at the switch port. Therefore, they built an agent that resides within the OS and provides additional information the switch can't see: things like the service name, who ran the service, what privileges the service was run with, etc. These server sensors provide an additional layer of application information. The agents are built to be highly secure, with very low overhead.
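For flavor, here's a hypothetical shape for a host-side observation. The field names are invented for the example and are not the agent's real schema.

```python
from dataclasses import dataclass

@dataclass
class HostSocketRecord:
    """Hypothetical host-agent observation: the details a switch can't see."""
    host: str
    local_ip: str
    local_port: int
    process_name: str      # e.g. "vsftpd"
    user: str              # who launched the service
    privileged: bool       # running with elevated privileges?

record = HostSocketRecord(
    host="ftp-vm-01",
    local_ip="10.0.2.10",
    local_port=21,
    process_name="vsftpd",
    user="svc-ftp",
    privileged=False,
)
print(record)
```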
Additional Data:
So far we've got a connectivity map that can be compared against service/daemon names and user privileges. That's still not quite enough. We don't think of applications as the 'service name' in the OS; we think of applications like 'expense system', 'definethecloud.net', etc. To turn the sensor data into real application mappings, the team needed to cross-reference additional information. They built integrations with systems like AD, DNS, DHCP, and existing CMDBs to get it. This allows the connectivity map and OS data to be cross-referenced back to business-level application descriptions.
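Here's a rough sketch of that cross-referencing step, assuming connectivity edges and host records like the ones above. The DNS and CMDB lookups are faked with plain dictionaries, which is obviously not how the real integrations work.

```python
# Faked lookup tables standing in for DNS / CMDB / AD integrations.
dns_names = {
    "10.0.2.10": "ftp01.corp.example.com",
    "10.0.0.53": "dns01.corp.example.com",
}
cmdb_app = {
    "ftp01.corp.example.com": "Partner File Exchange",
    "dns01.corp.example.com": "Core Infrastructure / DNS",
}

def label_endpoint(ip):
    """Turn a bare IP into a business-level application name, if known."""
    fqdn = dns_names.get(ip, ip)
    return cmdb_app.get(fqdn, f"unknown ({fqdn})")

# One connectivity edge from the network sensors: a client talking to DNS.
edge = ("10.0.1.5", "10.0.0.53", "udp", 53)
src, dst, proto, port = edge
print(f"{label_endpoint(src)} -> {label_endpoint(dst)} on {proto}/{port}")
```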
Which sensors do I need?:
Obviously, the more pervasively deployed your sensors are, the better the data available. That being said, neither sensor type has to be everywhere. Let's go through three scenarios:
Ideal:
Ideally, you would use a combination of both network and server sensors. This means running the OS agent in any supported operating system (VM or physical) and using supported hardware for the network sensors. Not every switch in the data center needs this capability for you to be in 'ideal mode': as long as every flow/transaction is seen at least once, you are there. This means you could rely solely on leaf or access switches with this capability.
Server only mode:
This mode relies solely on agents in the servers. This is the mode most of Tetration's early field-trial customers ran in. It can be used as you transition to the ideal mode over the course of network refreshes, or as a permanent solution.
Network only mode:
In instances where there is no desire to run a server agent, Tetration can still be used. In this operational mode the system relies solely on data from switches with the built-in Tetration capability.
Note: The less pervasive your sensor network, the more manual input or data manipulation is required. The goal should always be to move towards the ideal mode described above over time.
So that sounds like a sh** load of data:
The next step is solving the big data problem all of these sensors created. This is a lot of data, coming in very fast, and it has to be turned into something usable very, very quickly. If the data has to sit while it's being processed, it becomes stale and useless. Tetration needed to ingest this data at line rate and process it in real time.
To solve this, the engineers built a very specialized big data appliance. The appliance runs on 'bleeding-edge IT buzzword soup': Hadoop, Spark, Kafka, Zookeeper, etc. It also contains quite a bit of custom software developed internally at Cisco for this specific task. On top of this underlying analytics engine, there is an easy-to-use interface that doesn't require a degree in data science.
The Tetration appliance isn't intended to be a generalist big data system where you can throw any dataset at it and ask any question you want. Instead, it's fine-tuned for data center analytics. The major advantage here is that you don't need a handful of big data experts and data scientists to use the appliance.
Now what does this number crunching monster do?:
The appliance provides the following five supported use-cases. I’ll go into some detail on each.
- Application Dependency Mapping (ADM)
- Automated white-list creation
- Auditing and compliance tools
- Policy impact simulation
- Historical forensics
That’s a lot, and some of it won’t be familiar, so let’s get into each.
ADM:
First and foremost, the Tetration Appliance provides an ADM. This baseline mapping of your applications and their dependencies is core to everything else Tetration does. The ADM on its own is extremely useful, as mentioned in the opening of this blog. Once you have visibility into your apps, you can start building that DR site, migrating apps (or parts of apps) to the cloud, and assessing which apps may be prime for decommissioning.
Automated white-list creation:
If you're looking to implement a 'micro-segmentation' strategy, there are several products, like Cisco's Application Centric Infrastructure, that can do that for you. These are enforcement tools that can implement ever-tighter security segmentation, down to the server NIC or vNIC. The problem is figuring out what rules to put into these micro-segmentation tools. Prior to Tetration, nobody had a good answer to this. The issue is that without a current ADM, it's hard to figure out what you can block, because you don't know what you need open. In steps Tetration.
Once Tetration builds the initial ADM, you have the option to automatically generate a white-list. Think of the ADM as the positive image (what needs to talk); the white-list is its negative: allow exactly what the ADM shows and block everything else. Since Tetration knows everything that needs to be open for your production apps, it can convert this into rules to enforce with your existing tools, or with new-fangled fancy micro-segmentation tools.
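As a toy illustration of the idea, the sketch below emits an explicit permit rule for every observed edge and a catch-all deny at the end. The rule syntax is invented and far simpler than anything a real enforcement tool would consume.

```python
def whitelist_from_edges(edges):
    """Observed edges become explicit 'permit' rules; everything else is denied."""
    rules = []
    for src, dst, proto, port in sorted(edges):
        rules.append(f"permit {proto} {src} -> {dst}:{port}")
    rules.append("deny ip any -> any")   # the 'negative' of the ADM
    return rules

edges = {
    ("10.0.1.5", "10.0.2.10", "tcp", 21),
    ("10.0.1.5", "10.0.0.53", "udp", 53),
}
for rule in whitelist_from_edges(edges):
    print(rule)
```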
Auditing and Compliance:
Auditing and compliance regulations are always a challenge in time, money, and frustration, but they're necessary and required. There are two issues with traditional audit methodologies that Tetration helps with. Auditing is typically done by pulling configurations from multiple devices (network, security, etc.) and then verifying that the security rules in those devices meet the compliance requirements.
The two ways Tetration helps are centralization (a single source of truth) and real-time accuracy. Because Tetration sees all transactions on the network, it can be the tool you audit against. This alleviates the need to pull information from multiple devices in the data center and streamlines the audit process significantly, from both a collection and a correlation perspective.
What I find more interesting is that using Tetration as the auditing tool lets you audit reality rather than theory. Let me explain: in a traditional audit, you're looking at the configuration of rules in security devices and assuming those rules and devices are doing their job and that nobody has gotten around them. On the other hand, when you do your audit using Tetration, you're auditing against the real-time traffic flows in your data center: 'the reality.'
Policy impact simulation:
One of the things the Tetration appliance does as it collects data is extremely relevant to the final two use-cases. As the appliance ingests data, it may receive multiple copies of the same transaction. Think of server A talking to server B across switch Z, with all three reporting that transaction. As this occurs, the appliance de-duplicates the data and stores one master copy of every transaction in the cluster file system. This means the appliance keeps a historical record of every transaction in your data center. Don't start worrying about space yet; remember this is all metadata (data about the data), not payload, so it's very lightweight.
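Here's a minimal sketch of the de-duplication idea: keep one copy per flow 'key', no matter how many sensors reported it. The key fields and the time bucketing are assumptions I've made for the example, not the appliance's actual logic.

```python
def deduplicate(reports, bucket_ns=1_000_000):
    """Keep one record per (flow key, coarse time bucket), regardless of how
    many sensors (source host, destination host, switch) reported it."""
    seen = {}
    for r in reports:
        key = (r["src_ip"], r["dst_ip"], r["proto"], r["dst_port"],
               r["start_ns"] // bucket_ns)
        seen.setdefault(key, r)          # first report wins; duplicates dropped
    return list(seen.values())

reports = [
    {"src_ip": "10.0.1.5", "dst_ip": "10.0.2.10", "proto": "tcp",
     "dst_port": 21, "start_ns": 100, "reported_by": "server-A"},
    {"src_ip": "10.0.1.5", "dst_ip": "10.0.2.10", "proto": "tcp",
     "dst_port": 21, "start_ns": 120, "reported_by": "switch-Z"},
]
print(len(deduplicate(reports)))   # 1 (the same transaction, seen twice)
```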
The first thing the appliance can do with that historical record is policy simulation. Picture yourself on the security team, wanting to implement a new security rule. You always have one challenge: the possibility that the rule has an adverse effect. How do you ensure you won't break something in production if you don't have full visibility into real traffic? The answer is, you don't.
With Tetration, you do. Tetration's policy impact simulation allows you to model a security change (FW rule, ACL, etc.) and then have the system do an impact analysis. The system assesses your proposed change against the historical transaction records and lets you know the real-world implications of making that change. I call this a 'parachute' for new security policies. Rather than waiting for a change window, hoping the rule works, and rolling it back if it breaks something, you can simply test it against real traffic records first.
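A toy version of the 'parachute' idea: replay the historical records against a proposed rule and report what it would have hit. The rule shape here is made up for the sketch.

```python
def impact_of_block_rule(history, dst_port, proto="tcp"):
    """Return the historical flows a proposed 'block this port' rule would hit."""
    return [f for f in history if f["proto"] == proto and f["dst_port"] == dst_port]

history = [
    {"src_ip": "10.0.1.5", "dst_ip": "10.0.2.10", "proto": "tcp", "dst_port": 21},
    {"src_ip": "10.0.3.7", "dst_ip": "10.0.2.10", "proto": "tcp", "dst_port": 21},
    {"src_ip": "10.0.1.5", "dst_ip": "10.0.0.53", "proto": "udp", "dst_port": 53},
]
hits = impact_of_block_rule(history, dst_port=21)
print(f"Proposed rule would have affected {len(hits)} recorded flows")
for f in hits:
    print(f"  {f['src_ip']} -> {f['dst_ip']}:{f['dst_port']}")
```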
Historical Forensics:
As stated above, Tetration maintains a de-duplicated copy of every transaction that occurs in the data center. On top of that unique source of truth, they've built both an advanced, granular search capability and a data center 'DVR'. What this means is that you can go back and search anything that happened in the data center after Tetration was installed, or even play back all transaction records for a given period of time. This is an extremely powerful tool for security forensics.
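And a tiny sketch of what 'search and play back a time window' could look like against a store of timestamped records; the field names and values are purely illustrative.

```python
def playback(history, start_ns, end_ns, src_ip=None):
    """Replay, in time order, every stored transaction in a window,
    optionally filtered to a single suspect source address."""
    window = [f for f in history
              if start_ns <= f["start_ns"] < end_ns
              and (src_ip is None or f["src_ip"] == src_ip)]
    return sorted(window, key=lambda f: f["start_ns"])

history = [
    {"start_ns": 100, "src_ip": "10.0.1.5", "dst_ip": "10.0.2.10", "dst_port": 21},
    {"start_ns": 250, "src_ip": "10.0.9.9", "dst_ip": "10.0.2.10", "dst_port": 22},
]
for f in playback(history, 0, 300, src_ip="10.0.9.9"):
    print(f)
```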
Summary:
Tetration is a unique product with a wide range of features. It's a purpose-built data center analytics appliance providing visibility and granular control not previously possible. If you have more time than sense, feel free to learn more by watching these whiteboard videos I've done on Tetration:
Overview: https://youtu.be/bw-w3T7JN-0
App Migration: https://youtu.be/KkehzpCXL70
Micro-segmentation: https://youtu.be/fIQhOFc5h2o
Real-time Analytics: https://youtu.be/iTB6CZZxyY0
** Disclaimer **
I work for Cisco, with direct ties to this product, therefore you may feel free to consider this post biased and useless.
This post is not endorsed by Cisco, nor a representation of anyone’s thoughts or views other than my own.
** Disclaimer **