BACKGROUND
Being a Cisco Unified Communications (aka Collaboration) engineer, I have a SOHO lab where I stage various Cisco application servers and practice scenarios that I deploy for my clients. I started delving into virtualization back in 2009 – several years before Cisco decided to virtualize Unified Communications applications. Back then, Cisco Unified Communications applications ran directly on bare metal, but with a few tweaks, it was possible to install them as virtual machines for lab purposes. Initially, I used VMware Server and VMware Workstation as my hypervisors and ran the VMs on my laptop, but in 2010 I bought my first lab server to be used as an ESXi host.
The server I chose was a Dell PowerEdge T410 with a Xeon 5600 series CPU and a PCI-based hardware RAID controller that was on the VMware ESXi HCL (Hardware Compatibility List). The server was enormous, extremely loud, and power hungry. I had to dedicate a spare room in my house to my lab because of the amount of noise generated by my lab equipment. I was able to run up to four VMs concurrently on the Dell T410 equipped with 12 GB of RAM, one quad-core Xeon 5600 series CPU, and two SATA drives configured in RAID1. If I tried to power up a fifth VM, some of my VMs would start crashing. Being new to virtualization, I did not know whether this was to be expected, but it surely did not feel right.
After using the Dell PowerEdge T410 server for 2 1/2 years, I replaced it with a Late 2012 quad-core i7 2.6 GHz Mac Mini, which I upgraded with 16 GB of 1600 MHz RAM. The Mac Mini was about 1/50 the size of the Dell PowerEdge T410, it was completely silent, and it could run four VMs concurrently, the same number I was able to run on the Dell T410. I was impressed that the little computer was as powerful as a huge server, but I wanted to break the limit of four concurrently running VMs.
I experimented quite a bit with the Mac Mini, trying to understand what was causing the VMs to slow down and crash when I attempted to run more than four VMs concurrently. Additionally, I wanted my lab to be hosted on RAID-protected storage so that a hard-drive failure would not destroy the hundreds of hours I had invested in configuring my VMs. Even though the Mac Mini could accept a second hard drive, and OS X (macOS) supported software RAID, VMware ESXi had no support for software RAID. Therefore, the only way to host VM datastores on redundant storage was to use a NAS.
Fast forward to 2013, when I purchased a QNAP TS-569L, which was certified as an iSCSI SAN with VMware ESXi 5. Once I configured the QNAP TS-569L with an iSCSI target and provisioned iSCSI LUNs on it, I moved my VM datastores to the iSCSI LUNs hosted by the QNAP TS-569L. Even though this solution provided data protection by hosting the VM datastores on redundant storage (RAID5), my new limit of concurrently running VMs increased only marginally (from 5 to 6 VMs). Powering up a 7th VM would cause every VM in the lab to crash. Suspecting that the bottleneck was the Mac Mini and not the storage, I added a second Mac Mini to my lab and migrated half of the VMs to it. However, the total number of VMs running concurrently without crashing remained at 6. It was at that point that I decided to finally get serious and figure out what was going on.
After much research and experimentation, I realized that the problem with my lab was the low IOPS yield of the storage devices I had been using. The Mac Mini's internal drive spun at 5,400 RPM and had a relatively low IOPS yield (probably in the range of 60-80 IOPS). The RAID5-based storage in my NAS had a marginally better IOPS yield, but because its drives were also low-RPM HDDs, and because every logical write to RAID5 costs roughly four back-end I/Os (read old data, read old parity, write new data, write new parity), the array yielded no more than 150 IOPS. This was enough to support I/O from six concurrently running VMs, but as I tried to add more VMs, the IOPS load became too much for the storage array to handle.
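To put rough numbers on this, below is a back-of-the-envelope Python sketch of what a small RAID5 array of low-RPM HDDs can deliver. The per-drive IOPS figure, the 70/30 read/write mix, and the per-VM demand are my assumptions for illustration, and the factor of 4 is the commonly cited RAID5 write penalty; treat the output as an estimate, not a measurement.

# Back-of-the-envelope IOPS estimate for a small RAID5 array of low-RPM HDDs.
# Per-drive IOPS, the 70/30 read/write mix, and per-VM demand are assumptions.

def raid5_effective_iops(drives, iops_per_drive, read_fraction, write_penalty=4):
    """Estimate usable IOPS of a RAID5 array for a mixed read/write workload.

    Each logical write costs `write_penalty` back-end I/Os (read old data,
    read old parity, write new data, write new parity)."""
    raw = drives * iops_per_drive
    write_fraction = 1.0 - read_fraction
    # Divide raw IOPS by the weighted back-end cost of one logical I/O.
    return raw / (read_fraction + write_fraction * write_penalty)

hdd_iops = 75   # assumed yield of a single 5,400 RPM SATA drive
array_iops = raid5_effective_iops(drives=3, iops_per_drive=hdd_iops, read_fraction=0.7)
print(f"Estimated RAID5 yield: ~{array_iops:.0f} IOPS")

per_vm_demand = 20  # assumed steady-state IOPS per UC VM
print(f"VMs supportable at ~{per_vm_demand} IOPS each: {int(array_iops // per_vm_demand)}")

With these assumptions, the estimate lands in the same neighborhood as what I saw in practice: a handful of VMs before the array runs out of IOPS.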
TYPES OF STORAGE FOR VM DATASTORES
There are three schools of thought on the type of storage for hosting VM datastores in an ESXi lab:
- Local Storage: The most straightforward (and most expensive) way to proceed is to get local storage for each ESXi host. At this point (in 2016), there should be little argument that the local storage should be SSD (not HDD), because even consumer-grade SSDs can yield IOPS that are two orders of magnitude higher than the fastest enterprise-class HDDs. However, a significant shortcoming of SSD-based storage for VM datastores is the high price of SSDs compared to HDDs.
- SAN-based Storage: In a SOHO VMware ESXi lab, the two SAN technologies used most frequently are iSCSI and NFS (which is not exactly a SAN protocol). There are numerous sites and blogs on the web that discuss the benefits and shortcomings of each protocol, but, in my opinion, either one is suitable for connecting an ESXi host to a NAS or SAN.
- vSAN: vSAN is a relatively new technology supported by VMware. With vSAN, it is possible to create a virtual SAN consisting of discrete SSDs installed in several ESXi hosts.
Note: Whereas vSAN can run over 1 Gbps network links interconnecting ESXi hosts, the network quickly becomes a bottleneck. Creating a vSAN over a 1 Gbps network is more of a proof-of-concept ESXi lab exercise for VMware engineers than a viable storage option for hosting the datastores of multiple VMs running on several ESXi hosts. In my opinion, for vSAN to become a viable storage solution for a SOHO ESXi lab, a 10 Gbps network infrastructure must be in place, and in 2016, a 10 Gbps network is still cost-prohibitive for most SOHO ESXi labs.
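As a quick sanity check on the 1 Gbps claim, the Python sketch below computes the raw IOPS ceiling of a single 1 Gbps link at a couple of assumed block sizes. It deliberately ignores protocol overhead and vSAN's replication traffic, both of which make the real-world numbers worse.

# Raw IOPS ceiling of a single 1 Gbps link, ignoring protocol overhead and
# vSAN replication traffic (both make the real-world numbers worse).
link_mb_per_s = 1000 / 8            # 1 Gbps is roughly 125 MB/s of line rate

for block_kb in (4, 64):            # assumed I/O sizes: small random vs. larger I/O
    iops_ceiling = link_mb_per_s * 1000 / block_kb
    print(f"{block_kb:>2} KB blocks: at most ~{iops_ceiling:,.0f} IOPS over the wire")

Even a single consumer SSD can deliver tens of thousands of small-block IOPS, so one 1 Gbps link caps the whole cluster at roughly what a single SSD could serve, and every vSAN write must also be mirrored to another host over that same network.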
PROS AND CONS OF VM DATASTORES ON SSD-BASED STORAGE
If VM datastores are hosted on SSD-based storage, the issue of the IOPS yield being a bottleneck immediately goes away, as even consumer-grade SSDs can yield tens of thousands of IOPS. However, using SSD-based storage for VM datastores when terabytes of redundant storage may be required is cost-prohibitive for a SOHO ESXi lab.
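The sketch below is a rough, assumption-laden cost comparison for about 4 TB of usable redundant capacity: an all-SSD mirror versus a three-drive HDD RAID5 fronted by a small SSD cache (the configuration described later in this article). The per-TB prices are illustrative 2016-era guesses, not quotes.

# Rough cost comparison for ~4 TB of usable, redundant datastore capacity.
# All prices are assumed, 2016-era figures for illustration only.
usable_tb = 4
ssd_price_per_tb = 400      # assumed consumer SSD price per TB
hdd_price_per_tb = 40       # assumed NAS-class HDD price per TB

# Option 1: all-SSD RAID1 mirror (buy twice the usable capacity).
all_ssd = 2 * usable_tb * ssd_price_per_tb

# Option 2: three HDDs in RAID5 (one drive's capacity lost to parity)
# plus two 250 GB SSDs in RAID1 as a read-write cache.
hdd_raid5 = 3 * (usable_tb / 2) * hdd_price_per_tb
ssd_cache = 2 * 0.25 * ssd_price_per_tb

print(f"All-SSD mirror:        ~${all_ssd:,.0f}")
print(f"HDD RAID5 + SSD cache: ~${hdd_raid5 + ssd_cache:,.0f}")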
BENEFITS OF VM DATASTORES ON HDDs BEHIND SSD CACHE
Fortunately, there exists a solution called SSD caching, whereby one or more SSDs can be used for write and/or read caching of data stored on HDDs. For SSD caching, one or two SSDs of relatively small capacity are used to cache frequently accessed data for I/O operations between the ESXi host and the storage that hosts the VM datastores. For write IOPS, the yield improves dramatically because all data is first written to the SSDs, and the NAS then copies it from the SSDs to the HDDs as a background operation. For read IOPS, the NAS keeps the most commonly used data on the SSDs, so when a VM addresses its storage, most of the data is loaded from the SSD cache instead of from the HDDs where the VM datastore is hosted.
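To illustrate why a cache holding only a fraction of the data can absorb most reads, here is a toy Python model: a skewed access pattern served through a simple LRU cache sized at 10% of the data set. This is not QNAP's actual caching algorithm (which is proprietary); it only demonstrates the principle that "hot" data tends to be a small portion of total storage.

# Toy model of read caching: a skewed access pattern over HDD blocks served
# through a simple LRU cache holding 10% of the data. This is NOT QNAP's
# algorithm; it only shows why a small cache can absorb most reads.
import random
from collections import OrderedDict

random.seed(1)
total_blocks = 100_000
cache_capacity = total_blocks // 10     # cache sized at 10% of the data
cache = OrderedDict()                   # OrderedDict used as a minimal LRU
hits = 0
requests = 200_000

for _ in range(requests):
    # paretovariate() skews accesses toward a small set of "hot" blocks.
    block = int(random.paretovariate(1.2)) % total_blocks
    if block in cache:
        hits += 1
        cache.move_to_end(block)        # refresh the block's LRU position
    else:
        cache[block] = True
        if len(cache) > cache_capacity:
            cache.popitem(last=False)   # evict the least recently used block

print(f"Read hit rate with a 10% cache: {hits / requests:.0%}")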
Even though placing VM datastores on SSD-based redundant storage is the most robust solution, due to its high cost, the second-best solution is placing VM datastores on HDD-based redundant storage behind a read-write SSD cache.
SOHO-CLASS iSCSI STORAGE WITH READ-WRITE SSD CACHE
QNAP was not the first NAS manufacturer to enable SSD caching in their devices (Synology was first to market with this feature), but in 2015 QNAP became the first manufacturer to enable both read and write SSD caching in its mid-range NAS devices. Being convinced that the limit of six VMs running concurrently in my lab was due to the low IOPS of the storage solution I was using, I decided to upgrade my NAS to a model that supported read-write SSD caching. In early 2016, I purchased a mid-range QNAP that supports this feature: the QNAP TS-563. This model has a quad-core AMD CPU and can be upgraded to 16 GB of RAM. With the upgraded RAM, the NAS can run VMs natively in Linux KVM on its own hardware, using QNAP's front end to KVM called Virtualization Station. So, by getting this NAS, I acquired not only an iSCSI SAN with SSD caching but also another host on which I could run a few VMs.
I transferred three of my WD RED (NASware) drives from the QNAP TS-569L to the TS-563 and purchased two 250 GB Samsung 850 Pro drives to be used for read-write SSD caching. QNAP's SSD caching solution requires two SSDs in RAID1 for read-write caching; with only one SSD installed, only read caching is supported.
Once I installed the two Samsung 850 Pro drives in my TS-563 and enabled SSD caching for the iSCSI LUNs, I powered up my VMs one by one. For read SSD caching to be effective, the SSD cache must first “warm up”: as data is read from the HDDs, the caching algorithm fills the SSD cache with the data that passes through it. As I powered up my VMs, I watched the SSD cache warm up before my eyes in the QNAP storage utility. When I restarted my VMs, I was blown away by how fast they started up and loaded their services. What used to take 15 minutes for the Cisco Unified Communications Manager VM to power up and load all of its services now took no more than 4-5 minutes. I was able to log in to that VM via SSH less than a minute after powering it up, whereas without SSD caching it used to take at least 5 minutes before the VM would accept SSH connections. Additionally, I noticed a significant improvement in how responsive the VMs became: navigating among web GUI configuration pages was lightning fast compared to the sluggish admin web GUI I had gotten so used to.
Because I had two 2012 quad-core i7 Mac Minis with 16 GB of DDR3 1600 MHz RAM in each, I increased the number of VMs running on each Mac Mini to six for a total of 12 VMs, whose datastores were hosted on the QNAP TS-563 serving as an iSCSI SAN. With 12 VMs running concurrently, the iSCSI-based storage provided sufficient IOPS for all 12 VMs to exhibit good responsiveness in their operation as well as in their web GUI administration pages.
BEST SOLUTION FOR SOHO LAB: iSCSI SAN WITH SSD READ-WRITE CACHE
After years of experimentation with HDD storage local to ESXi hosts and with iSCSI-based SAN storage, I have come to the conclusion that the most cost-effective solution is a mid-range NAS certified as an iSCSI SAN with VMware ESXi and capable of read-write SSD caching. The QNAP TS-563 upgraded to 16 GB of RAM (not necessary if used only as an iSCSI SAN) and equipped with three WD RED (NASware) drives in a RAID5 volume as well as with two Samsung 850 Pro SSD drives (for read-write SSD cache) cost me about $1,200. This NAS also provides other services (FTP, sFTP, TFTP, LDAP, file storage via SMB and AFP, etc.) and is capable of running VMs in Linux KVM (rebranded by QNAP as Virtualization Station). With the SSD-based RAID1 used as the read-write cache for the RAID5-based HDD storage, this type of iSCSI-based SAN can yield enough IOPS to host datastores for a few dozen VMs running on multiple ESXi hosts.
The iSCSI-based, SSD-cache-capable SAN storage solution for VM datastores starts to show its benefits once the storage capacity required for VM datastores exceeds 1 TB; at that point, it becomes much more cost-effective to increase redundant storage capacity by adding HDDs while keeping the SSD cache at just a fraction of the total HDD capacity. For example, my SAN's iSCSI LUN storage capacity is currently 4 TB, whereas the read-write cache consists of two 250 GB Samsung 850 Pro drives in RAID1. It is probably wise to keep the RAID1 SSD volume used as the read-write cache at about 10% of the total storage behind the cache, so I should probably have purchased two 500 GB Samsung 850 Pro drives, but I wanted to save some money and bought the 250 GB drives instead.
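For anyone sizing their own cache, here is a trivial Python helper that applies the ~10% rule of thumb mentioned above. The 10% ratio is my own guideline rather than a vendor requirement, and the function name is purely illustrative.

# Apply the ~10% rule of thumb: size the RAID1 SSD read-write cache at roughly
# a tenth of the HDD capacity sitting behind it. The 10% ratio is my own
# guideline, not a vendor requirement.
def recommended_cache_gb(lun_capacity_tb, ratio=0.10):
    """Per-SSD capacity (GB) for a two-drive RAID1 cache in front of the LUNs."""
    # RAID1 mirrors the cache, so each of the two SSDs holds the full amount.
    return lun_capacity_tb * 1000 * ratio

for lun_tb in (1, 2, 4):
    print(f"{lun_tb} TB of iSCSI LUNs -> ~{recommended_cache_gb(lun_tb):.0f} GB per SSD (x2 in RAID1)")

With the 4 TB of LUNs in my lab, the helper suggests roughly 400 GB per SSD, which is why a pair of 500 GB drives would have been the more comfortable choice.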
When choosing SSDs to be used as a read-write cache, it is important to pay attention to the metric called write endurance. In consumer-grade SSDs, write endurance is often no more than a few dozen terabytes written, which is enough for consumer applications but nowhere near enough for an SSD cache, where a constant flow of data traverses the SSDs in both the read and write directions. On the other end of the write-endurance scale are datacenter-class SSDs with write endurance of several petabytes (1 PB = 1000 TB). While one should stay away from consumer-grade SSDs when choosing drives for a write cache, buying enterprise-grade SSDs (e.g., Intel DC S3700) for use in a SOHO-class iSCSI SAN is cost-prohibitive. In my opinion, the Samsung 850 Pro is a good compromise between high-enough write endurance and low-enough price per GB for SOHO labs.
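As a closing illustration, the sketch below estimates how long an SSD cache would last given a drive's rated endurance (TBW) and a daily write volume. Both TBW figures and the daily write volume are assumptions for the sake of the example; check the vendor's published rating and measure your own cache traffic before drawing conclusions.

# Rough SSD-cache lifetime estimate from a drive's rated write endurance (TBW).
# The TBW ratings and the daily write volume below are assumptions; check the
# vendor's data sheet and measure your own cache traffic.
def cache_lifetime_years(rated_tbw, writes_gb_per_day, write_amplification=1.5):
    """Years until the rated endurance is consumed at a steady daily write rate."""
    daily_tb_written = writes_gb_per_day * write_amplification / 1000
    return rated_tbw / daily_tb_written / 365

daily_writes_gb = 100   # assumed data written through the cache per day
for label, tbw in (("typical consumer SSD (assumed ~36 TBW)", 36),
                   ("Samsung 850 Pro class (assumed ~150 TBW)", 150)):
    print(f"{label}: ~{cache_lifetime_years(tbw, daily_writes_gb):.1f} years of cache duty")

Under these assumptions, a low-endurance consumer drive would be worn out in well under a year of cache duty, while a higher-endurance drive lasts several times longer, which is the trade-off behind my choice of the 850 Pro.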