I recently did a short webcast about Oracle Linux and containers, with some suggestions around best practices and some security considerations.
The webcast had just a few slides, and some of the feedback I received was that more text to go along with the talk would have helped, so I promised I would write up a few things that came up during the webcast. Here it is:
We have been providing Oracle Linux along with great support for nearly 12 years. During those years, we have added many features and enhancements. Through upstream contributions that are picked up by the various open source projects distributed as part of Oracle Linux (in particular UEK), and through additional features and services such as Oracle Ksplice or DTrace (released under GPL), we have been helping make Linux better.
In terms of virtualization, we’ve been contributing to Xen since 2005. Xen is the hypervisor used in Oracle VM. More recently, we have also been heavily focused on kvm and qemu in Linux. And of course, we have Oracle VM VirtualBox. So a lot of virtualization work has been going on for a very long time, and that will continue to be the case for a very long time. We have many developers working on this full time (and upstream).
Container work:
We were early adopters of lxc and were one of the first, if not the first, to certify lxc with enterprise applications such as our database or applications. This was before Docker existed.
Lxc was the initial push toward mainstream container support in Linux. It helped push a lot of projects in the Linux kernel around resource management, namespace support, all the cgroups work... lots of isolation support really got a big start around this time. Many developers contributed to it, and certainly a bunch of openvz concepts were proposed for merging into the mainline kernel.
A few years after lxc, Docker came to the forefront and really made containers popular - talk about mainstream… and again, we ended up providing Docker from the very beginning and saw a lot of potential in the concept of lightweight small images on Linux for our product set.
Today - everyone talks about Kubernetes, microservices, and Docker or Docker alternatives such as Rkt. We provide Oracle Container Services for use with Kubernetes and Oracle Container Runtime for Docker to our customers as part of Oracle Linux subscriptions. Oracle also has various Oracle Cloud services that provide Kubernetes and Docker orchestration and automation. And, of course, we do a lot of testing and support many Oracle products running in these isolation environments.
The word isolation is very important.
For many years I have been using the word isolation when it comes to containers, not virtualization. There is a big distinction.
Running containers in a Linux environment is very different from running Solaris Zones, or running VMs with kvm or Xen. Kvm or Xen, that’s "real" virtualization. You create a virtual compute environment and boot an entire operating system inside (it has a virtual bios, boots a kernel from a virtual disk, etc). Sure - there are some optimizations and tricks around paravirtualization, but for the most part it’s a Virtual Machine on a real machine.
Solaris Zones, as implemented, are also not virtualization, since you share the same host kernel amongst all zones. But the Solaris Zones implementation is done as a full-fledged feature. It’s a full-on isolation layer inside Oracle Solaris, top to bottom. You create a zone and the kernel does it all for you right then and there: it creates a completely separate OS container for you, with all the isolation provided across the board. It’s great. It has been around for a very long time, is used widely by almost every Oracle Solaris user, and it works great. It provides a very good level of isolation for a complete operating system environment, just like a VM provides a full virtual hardware platform for a complete operating system environment.
Linux containers, on the other hand, are implemented very differently. A container is created by using a number of different Linux kernel features, and you can provide isolation at different layers. So you can create a Linux container that acts very, very similar to a Solaris zone, but you can also create a Linux container that has a tremendous amount of sharing with other containers or just other processes. The Linux resource manager and various namespace implementations let you pick and choose. You can share what you want, and you can isolate what you want. You have a PID namespace, IPC namespace, User namespace, Net namespace,... each of these can be used in different ways or combined in different ways. So there’s no CONTAINER config option in Linux, no container feature, but there are tools, libraries and programs that use these namespaces and cgroups to create something that looks like a complete isolated environment akin to zones.
Tools like Docker and lxc do all the "dirty work" for you, so to speak. They also provide you with options to turn that isolation level up and down.
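To make the pick-and-choose point a bit more concrete, here is a minimal Python sketch that lists which namespaces a process belongs to by reading the symlinks under /proc/<pid>/ns. Two processes that show the same identifier for a namespace type are sharing that namespace. The function names and the proc_root parameter are my own, purely for illustration.

```python
import os


def list_namespaces(pid="self", proc_root="/proc"):
    """Return {namespace_type: identifier} for a process.

    On Linux each entry in /proc/<pid>/ns is a symlink whose target looks
    like 'pid:[4026531836]'. proc_root is a parameter only so the function
    can be pointed at a test directory; it is not part of any kernel API.
    """
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    result = {}
    if not os.path.isdir(ns_dir):
        return result  # not Linux, or the process is gone
    for name in sorted(os.listdir(ns_dir)):
        try:
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            # Entry is not a symlink or is not readable; skip it.
            continue
    return result


def shared_namespaces(ns_a, ns_b):
    """Namespace types for which two processes have the same identifier."""
    return sorted(k for k in ns_a if k in ns_b and ns_a[k] == ns_b[k])


if __name__ == "__main__":
    for ns_type, ident in list_namespaces().items():
        print(f"{ns_type:20} {ident}")
```

Comparing the output for a process on the host with one inside a container shows which layers that particular container isolates and which it shares.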
Heck, you can create a container environment using bash! Just echo some values to a bunch of cgroups files and off you go. It’s incredibly flexible.
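Here is roughly what "echo some values to a bunch of cgroups files" looks like, written in Python instead of bash. This is a sketch under assumptions: a cgroup v2 (unified) hierarchy mounted at /sys/fs/cgroup, root privileges, and the memory and pids controllers enabled for child groups. The function names and the cgroup_root parameter are hypothetical. It only covers the resource-limit half; namespace isolation is a separate step.

```python
import os


def create_cgroup(name, memory_max_bytes, pids_max, cgroup_root="/sys/fs/cgroup"):
    """Create a cgroup v2 group and set two limits by writing plain files.

    Equivalent in spirit to:
        mkdir /sys/fs/cgroup/<name>
        echo <bytes> > /sys/fs/cgroup/<name>/memory.max
        echo <count> > /sys/fs/cgroup/<name>/pids.max
    """
    path = os.path.join(cgroup_root, name)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "memory.max"), "w") as f:
        f.write(str(memory_max_bytes))
    with open(os.path.join(path, "pids.max"), "w") as f:
        f.write(str(pids_max))
    return path


def add_process(cgroup_path, pid):
    """Move a process into the group: echo <pid> > cgroup.procs."""
    with open(os.path.join(cgroup_path, "cgroup.procs"), "w") as f:
        f.write(str(pid))
```

Something like create_cgroup("demo", 256 * 1024 * 1024, 64) followed by add_process(path, os.getpid()) would cap the current process at 256 MB and 64 tasks, assuming the caveats above hold on your system.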
Having this flexibility is great as it allows for things like Docker (just isolating a process, not a whole operating environment). You don’t have to start with /bin/init or /bin/systemd and bring up all the services. You can literally just start httpd and it sees nothing but itself in its process namespace. Or… sure… you can start /bin/init and you get a whole environment, like what you get by default with lxc.
I think Docker (and things like Docker - Rkt, etc.) is the best user of all these namespace enhancements in the Linux kernel. I also think that, because the Linux kernel developers implemented resource and namespace management the way they did, it allowed for a project like Docker to take shape. Otherwise, this would have been very difficult to conceive. It allowed us to really enter a new world of… just start an app, just distribute the app with the libraries it needs, isolate an app from everything else, package things as small as possible as a complete standalone unit…
This, in turn, really helped the microservices concept because it makes micro really... micro... Docker-like images give a lot more flexibility to application developers because now you can have different applications running on the same host that have different library needs, or different versions of the same application, without having to mess with PATH settings and carving out directories and seeing one big mess of things… Sure, you can do that with VMs… but the drawback of a VM is (typically) that you bring in an entire OS (kernel, operating environment) to then start an app. This can cause a lot of overhead. Process isolation along with small portable images gives you an incredible amount of flexibility and... sharing...
With that flexibility also comes responsibility - whereas one would have in the order of 10-20 VMs on a given server, you can run maybe 30-40-50 containerized OS environments (using lxc) but you could run literally 1000s of application containers using Docker. They are, after all, just a bunch of OS processes with some namespaces and isolation. And if all they run is the application itself, without the surrounding OS supported services, you have much less overhead per app than traditional containers.
If you run very big applications that need 100% performance and power and the best ‘isolation’... you run a single app on a single physical server.
If you have a lot of smaller apps, and you’re not worried about isolation you can just run those apps on a single physical server. Best performance, harder to manage.
If you have a lot of smaller environments that you need to host with different OSs or different OS levels, you typically just run tons of VMs on a physical server. Each VM boots its own kernel, has its own virtual disk, memory etc. and you can scale... 4-16 typical.
If you want to have the best performance where you don’t need that high isolation of separate kernels and independent OS releases down to the kernel version (or even something like Windows and Linux or Oracle Linux and Ubuntu etc)... then you can consider containers. Super lightweight, super scalable and portable.
The image can range from an OS image (all binaries installed, all libraries, like a VM or physical OS install) or… just an app binary, or an app binary + the libraries it needs. If you create a binary that is statically linked, you can have a container that's exactly 1 file. Isn't that awesome?
Working on Operating Systems at a company that is also a major cloud provider is really great. It gives us direct access to scale. Very, very large scale... and also a direct requirement around security. As a cloud provider we have to work very, very hard towards ensuring security in a multi-tenant environment. Protect customers' data from one another. Deploying systems in isolation in an enterprise can be at a reasonable scale, and of course security is very important (or should be), but the single tenancy aspect reduces the complexity to a certain extent.
Oracle Linux is used throughout Oracle Cloud as the host for running VMs, as the host for running container services or other services, in our PaaS, SaaS stacks, etc. We work very closely with the cloud development teams to provide the fastest, most scalable solutions without compromising security. We want VMs to run as fast as possible, we want to provide container services, but we also make sure that a container running for tenant A doesn’t, in any way, expose any data to a container running for tenant B.
So let’s talk a little bit about security around all this. Security breaches are up. A significant increase of data breaches every month, hacking attempts… just start a server or a VM with a public IP on the internet and watch your log files - within a few minutes you see login attempts and probes. It’s really frightening.
Enterprises used to have 100s, maybe 1000s, of servers - you have to keep the OS and applications current with security fixes. While reasonably large, still manageable… then add in virtualization and you increase the number of instances by a factor (10000+)… so you drastically increase your exposure… and then you go another factor or couple of factors up to microservices and containers - deployed across huge numbers of servers… security becomes increasingly more important and more difficult. 100000+... Do you even know where they run, what they run, who owns them?
On top of all that - in the last 8 or so months: Spectre and Meltdown. Removing years of assumptions and optimizations everyone has relied upon. We suddenly couldn't trust that VMs on the same host were isolated well enough, or that processes couldn't snoop on other processes, without applying code changes on the OS side or even in some cases in the applications to prevent exposure.
Patches get introduced. Performance drops. And it’s not always clear to everyone what the potential exposure is and where you have to really worry and where you might not have to worry too much.
When it comes to container security, there are different layers:
Getting images / content from external (or even internal) sites
There are various places where developers can download 3rd party container images. Whereas in the past one would download source code for some project or download a specific application… these container images (let’s call them docker images) are now somewhat magical black boxes: you download a filesystem layer, or a set of layers. There are tons of files inside but you don’t typically look around, you pull an image and start it… not quite knowing what’s inside… these things get downloaded onto a laptop… executed… and… do you know what’s inside? Do you know what it’s doing? Have these been validated? Scanned?
Never trust what you just download from random sites. Make sure you download things that are signed, or have been checksummed, and come from reputable places. Good companies will run vulnerability scanners such as Clair or Qualys as part of the process, and make sure developers have good security coding practices in place. When you download an image published on Oracle Container Registry, it contains code that we built, compiled, tested, scanned, put together. When you download something from a random site, that might not be the case.
One problem: it is very easy to get things from the outside world. docker pull, by default, goes to Docker Hub. Companies can’t easily put development environments in place that prevent you from doing that. One thing we are working on with Oracle Container Runtime for Docker is adding support for access control to Docker image repos. You can lock down which repos are accessible and which aren’t. For instance: your Docker repo list can be an internal site only, not Docker Hub.
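Until something like that is available in the runtime itself, the same idea can be approximated as a pre-check in a build or deploy script. The sketch below is not the access control feature described above; it is just a small Python illustration, with hypothetical function names and a made-up internal registry hostname, of deciding which registry an image reference points at and comparing it to an allow-list.

```python
DEFAULT_REGISTRY = "docker.io"


def registry_of(image_ref):
    """Return the registry host an image reference would be pulled from.

    Follows the convention Docker uses: the part before the first '/' is a
    registry only if it contains '.' or ':' or is 'localhost'. Otherwise the
    reference is resolved against Docker Hub.
    """
    first, sep, _rest = image_ref.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return DEFAULT_REGISTRY


def is_allowed(image_ref, allowed_registries):
    """True if the image would come from a registry on the allow-list."""
    return registry_of(image_ref) in allowed_registries


if __name__ == "__main__":
    allowed = {"registry.internal.example.com"}  # hypothetical internal site
    for ref in ["oraclelinux:7-slim",
                "registry.internal.example.com/os/oraclelinux:7-slim"]:
        print(ref, "->", "ok" if is_allowed(ref, allowed) else "blocked")
```

A check like this is only advisory, since anyone can still run docker pull by hand; it is the kind of thing you put in CI so that a stray reference to an outside registry fails the build.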
When building container images you should always run some form of image scanner.
We are experimenting with Notary - use Notary to digitally sign content so that you can verify images that are pulled down. We are looking at providing a Notary service and the tools for you to build your own.
Building images
Aside from using Clair or Qualys in your own CI/CD environment, you also have to make sure that you update the various layers (OS, library layer, application layer(s)) with the latest patches. Security errata are released on a regular basis. With normal OS’s, whether bare metal or VMs, sysadmins run management software that easily updates packages on a regular basis and keeps things up to date. It’s relatively easy to do so and it is easy to see what is installed on a given server. There might be an availability impact when it comes to kernel updates but for the most part it is a known problem... Updating containers is, you can argue, technically easy… just rebuild your images… but it does mean that you have to go to all servers running these containers and bring them down and back up. You can’t just update a running image. The ability to do anything at runtime is much more limited than when you run an OS instance with an application. From a security point of view, you have to consider that. Before you start deploying containers at scale, you have to decide on your patch strategy. How often do you update your images, how do you distribute these images, how do you know all the containers that are running and which versions they run, which layers they are running, etc. Sorting this out after a critical vulnerability hits will introduce delays, have a negative impact and potentially create large exposure.
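As a small illustration of the "do you know what is running and which version" question, here is a Python sketch that takes an inventory of running containers and reports, per host, the ones that are not on the latest patched image. The record layout (host, container, image, digest) is a hypothetical format; in practice it would come from whatever orchestration or inventory tooling you use.

```python
from collections import defaultdict


def find_stale_containers(running, current_digests):
    """Group running containers whose image digest is not the current one.

    running: iterable of dicts with keys 'host', 'container', 'image',
             'digest' (hypothetical inventory records).
    current_digests: {image_name: digest of the latest patched build}.
    Returns {host: [container, ...]} for containers that need a refresh.
    """
    stale = defaultdict(list)
    for rec in running:
        wanted = current_digests.get(rec["image"])
        if wanted is None or rec["digest"] != wanted:
            # Unknown images are flagged too: nobody is tracking them.
            stale[rec["host"]].append(rec["container"])
    return dict(stale)


if __name__ == "__main__":
    inventory = [
        {"host": "srv1", "container": "web-1", "image": "web", "digest": "sha256:aaa"},
        {"host": "srv1", "container": "web-2", "image": "web", "digest": "sha256:bbb"},
        {"host": "srv2", "container": "tmp-9", "image": "scratchpad", "digest": "sha256:ccc"},
    ]
    print(find_stale_containers(inventory, {"web": "sha256:bbb"}))
```

The point is less the code than the data: if you cannot produce an inventory like this on demand, you will not be able to answer the question quickly when a critical fix lands.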
So - have a strategy in place to update your OS and application layers with security fixes, have a strategy in place on how to distribute these new image updates and refresh your container farm.
Lock down
If you are a sophisticated user/developer, you have the ability to add very fine-grained controls. With Docker you have options like privileged containers: giving extra access to devices and resources. Always verify that anything that is started privileged has been reviewed by a few people. Docker also provides control over Linux capabilities such as mknod, setgid, chroot or nice. Look at the default capabilities that are defined and, where possible, remove any and all that are not absolutely needed.
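One way to make "remove everything that is not needed" the default is to build the docker run command line in one place. The Python sketch below drops all capabilities and adds back only the ones explicitly listed, using Docker's --cap-drop, --cap-add and --user options. The helper names, the default user ID and the image name are illustrative assumptions, not a recommended policy.

```python
def locked_down_run_args(image, keep_caps=(), user="1000:1000"):
    """Build a 'docker run' argument list that starts from zero capabilities.

    keep_caps: capability names to add back, e.g. ["NET_BIND_SERVICE"].
    user: run as a non-root uid:gid (the default here is just an example).
    """
    args = ["docker", "run", "--rm", "--cap-drop=ALL", "--user", user]
    for cap in keep_caps:
        args.append(f"--cap-add={cap}")
    args.append(image)
    return args


def needs_review(args):
    """Flag command lines that ask for a privileged container."""
    return "--privileged" in args


if __name__ == "__main__":
    cmd = locked_down_run_args("myapp:1.0", keep_caps=["NET_BIND_SERVICE"])
    print(" ".join(cmd))
    print("review required:", needs_review(cmd))
```

The resulting list can be handed to subprocess.run as-is. Which capabilities an application truly needs has to be found by testing; starting from ALL dropped and adding back is usually easier to reason about than trimming the default set.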
Look into the use of SELinux policies. While SELinux operates at the host level only, it provides you with an additional security blanket. Create policies to restrict access to files or operations.
There is no SELinux namespace support yet. This is an important project to work on, and we have started investigating it, so that you can use SELinux within a container in its own namespace, with its own local container policies.
Something we use a lot as well inside Oracle: seccomp. Seccomp lets you filter syscalls (white list). Now, when you really lock down your syscalls and have a large list, there can be a bit of a performance penalty… We’re doing development work to help improve seccomp’s filter handling in the kernel. This will show up in future versions of upstream Linux and also in our UEK kernel.
What’s nice with seccomp is that if you have an app and you know exactly which few syscalls are required, you can enforce that it will only ever be allowed to access / execute those system calls, and nothing else will get through in case a rogue library would magically get loaded and try to do something.
So if you are really in need of the highest level of lockdown, a combination of these 3 is ideal. Use seccomp to restrict the system calls exposed to your container, use SELinux policies to control access to processes that are running and what they can do with labels, and use capabilities alongside / on top of seccomp to prevent privileged commands from running, and run everything non-privileged.
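For reference, a seccomp white list for Docker is just a JSON file: a default action that denies, plus a list of syscall names that are allowed. Here is a minimal Python sketch that generates such a profile. The JSON keys and action names follow Docker's seccomp profile format; the syscall list in the example is purely illustrative and far too short for a real application.

```python
import json


def build_seccomp_profile(allowed_syscalls):
    """Return a Docker-style seccomp profile that allows only the given syscalls.

    Anything not listed fails with an error (SCMP_ACT_ERRNO) instead of
    being executed.
    """
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {
                "names": sorted(set(allowed_syscalls)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


if __name__ == "__main__":
    # Illustrative only: real programs need many more syscalls than this.
    profile = build_seccomp_profile(["read", "write", "exit_group", "futex"])
    print(json.dumps(profile, indent=2))
```

Saved to a file, a profile like this is passed to Docker with --security-opt seccomp=<file>. Expect to iterate: the first attempt at a tight list almost always misses something the runtime or libc needs at startup.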
The third major part is the host OS.
You can lock down your container images and such, but remember that these instances all run (typically) on a Linux server. This server runs an OS kernel, OS libraries (glibc)... and security vulnerability fixes need to be applied. Always ensure that you apply errata on the host OS… I would always recommend that customers use Oracle Ksplice with Oracle Linux.
Oracle Ksplice is a service that provides the ability for users to apply critical fixes (whether bugs or vulnerabilities) while the system is up and running with no impact to the applications (or containers).
While not every update can be provided as an online patch, we’ve had a very, very high success rate. Even very complex code changes have been fixed or changed using Ksplice.
We have two areas that we can address: the kernel, which is the original functionality since 2009, and, for a number of years now, a handful of userspace libraries. We are in particular focused on those libraries that are in the critical path - glibc being the most obvious one, along with openssl.
While some aspects of security are the ability to lock down systems and reduce the attack surface, implement best practices, protect the source of truth, prevent unauthorized access as much as possible, etc… if applying security fixes is difficult and has a high impact on availability, most companies / admins will take their time to apply them, potentially waiting weeks or months or even longer to schedule downtime. Keep in mind that with Ksplice we provide the ability to ensure your host OS (whether using kvm or just containers) can be patched while all your VMs and/or containers continue to run without any impact whatsoever. We have a unique ability to significantly reduce the service impact of staying current with security fixes.
Some people will be quick to say that live migration can help with upgrading VM hosts, by migrating VM guests off to another server and rebooting the host that was freed up - while that’s definitely a possibility, it’s not always possible to offer live migration capabilities at scale. It’s certainly difficult in a huge cloud infrastructure.
In the world of containers, where we are talking about 10-100 times or even more instances running per server, this is even more critical. Also, there is no live migration yet for containers. There’s some experimental work, but nothing of production quality, to migrate a container / Docker instance / Kubernetes pod from one server to another.
As we look more into the future with Ksplice: we are looking at more userspace library patching and at how we can make that scale on a container level - the ability to apply, for instance, glibc fixes within container instances directly without downtime. This is a very difficult problem to solve because there can be 100’s of different versions of glibc running, and we also have to ensure images are updated on the fly so that a new instance will be ‘patched’ at startup. This is a very dynamic environment.
Project Kata is a hybrid model of deploying applications with the flexibility and ease of use (small, low overhead) of containers and with the security level of VMs. The scalability of Kata containers is somewhere in between VMs and native containers. Order of low 1000s, not high 1000s. Startup time is incredibly fast. Starting a VM typically takes 20-30 seconds, starting Docker instances takes in the order of a few milliseconds. Starting a Kata container takes between half a second and 3 seconds depending on the task you run. A Kata container effectively creates a hardware virtualization context (like kvm uses) and boots a very, very optimized Linux kernel, one that can start up in a fraction of a second, with a tiny ramdisk image that can execute the binaries in your container image. It provides enough sharing on the host to scale, but it also provides a nice clean virtualization context that helps isolation between processes.
Most, if not all, cloud vendors run container services inside VMs for a given tenant. So the containers are isolated from other tenants through a VM context. But that adds a bit more overhead than is ideal. We would like to be able to provide containers that run as native and with as low overhead as possible. We are looking into providing a preview for developers and users to play with this on Oracle Linux with UEKR5. We have a Kata container kernel built that boots in a fraction of a second, and we created a tiny package that executes a Docker instance on an Oracle Linux host. It’s experimental, and we are evaluating the advantages and disadvantages (how secure is the kernel memory sharing, how good is performance at scale, how transparent is it to run normal docker images in these kata containers, are they totally compatible, etc).
Lots of exciting technology work happening.