The Rise of MacGyver
How we built our deep learning computer to work efficiently from home during the COVID-19 crisis.
For most of us, the COVID-19 crisis means staying at home and thus working from home. Doing so, we not only start to miss our colleagues and the conversations at the coffee machine, but we also no longer have access to some crucial office resources. In our case, this could partly be solved: we can access the custom-built high-performance PC that we named “MacGyver” after the well-known 1980s American TV series. According to Wikipedia, MacGyver:
“…possess a genius-level intellect, proficiency in multiple languages, superb engineering skills, excellent knowledge of applied physics, military training in bomb disposal techniques, and a preference for non-lethal resolutions to conflicts” – MacGyver Wikipedia Page
We thought this would make an excellent name for our high-performance computer: a machine we need to continue our deep learning work. With it, we can train deep learning models that regular laptops would never be capable of training. In this blog, we describe how we assembled MacGyver and provide tips and information on how to build your own deep learning PC and set it up for working remotely.
Local Machine vs Cloud Solution
One might ask: "Why would you put effort into building your own machine if you can also train your models in the cloud?" Besides the fact that we enjoy building our own PC and learning about the inner workings of hardware components, a dedicated machine is more cost-friendly in the long run. For example, an Amazon Web Services (AWS) EC2 g4dn.4xlarge instance (1 GPU), which has almost the same specs as "MacGyver", costs $1.50 per hour (~€1.37 per hour). "MacGyver" cost €2717.52 in total. Assume we use the machine only during working hours: 8 hours per day, 40 hours per week, so roughly 160 hours per month and 1,920 hours per year. A quick calculation (€2717.52 / €1.37 ≈ 1,984 hours) shows we break even within about 12 months. This may seem quite long, but be aware that we often keep our models training after working hours, so actual usage exceeds 8 hours per day.
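The break-even estimate can be reproduced with a quick back-of-the-envelope script (all figures are the ones quoted in this post; adjust them for current prices and exchange rates):

```python
# Back-of-the-envelope break-even calculation for "MacGyver" vs. AWS.
# All figures are taken from this post; update them for current prices.

machine_cost_eur = 2717.52     # total build cost of "MacGyver"
aws_rate_eur_per_hour = 1.37   # g4dn.4xlarge on-demand price in EUR
hours_per_month = 160          # 8 h/day, 5 days/week

break_even_hours = machine_cost_eur / aws_rate_eur_per_hour
break_even_months = break_even_hours / hours_per_month

print(f"Break-even after ~{break_even_hours:.0f} GPU-hours "
      f"(~{break_even_months:.1f} months at office-hours usage)")
```

Any training done outside office hours pushes the break-even date forward accordingly.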
Moreover, such a PC will typically last at least three years before components need to be replaced. Over three years of near-continuous use, the PC saves up to roughly 12.5 times its cost compared to AWS. However, a drawback of a local machine is that components can break or require an upgrade, but the money you save easily covers such replacements.
Our "MacGyver" contains the following components:
| Component | Part | Price |
| --- | --- | --- |
| CPU | AMD Threadripper 1920X | € 239.05 |
| CPU cooler | Fractal Celsius S24 | € 109.95 |
| Motherboard | MSI X399 SLI PLUS | € 323.95 |
| Memory/RAM | 4 × 16 GB G.Skill Aegis F4-3000C16S-16GISB | € 315.40 |
| Storage | 1 TB Samsung EVO SSD M.2 | € 145.00 |
| GPU | Nvidia GeForce RTX 2080 Ti | € 1130.00 |
| Case | Lian Li PC-O11 Air | € 137.95 |
| Power supply | EVGA SuperNOVA P2 1200 | € 316.22 |
| **Total** | | **€ 2717.52** |
Since we use the GPU for training deep learning models, the CPU is much less important than on a regular laptop. Its primary job is data preprocessing, such as batch scheduling, so a mid-range CPU should be sufficient. We chose the AMD Threadripper 1920X because it has many cores and threads compared to other CPUs; these let us run multiple scripts and programs in parallel without any problems.
A good CPU cooler is crucial in any PC build. It keeps your CPU from overheating and extends the lifetime of your components by keeping them at a safe temperature. We chose a liquid cooling system, one of the most effective ways to cool a CPU, because water transfers heat much more efficiently than air.
When choosing a motherboard, make sure to check its compatibility with all other components. PCPartPicker is a useful resource for comparing part prices and checking compatibility between components. We chose the MSI X399 SLI PLUS because it supports our AMD Threadripper CPU and has four PCIe slots, making it possible to install up to four GPUs if needed.
RAM (memory) is one of the most important aspects of a computer, because a big part of a PC's performance and speed depends on the amount and speed of its RAM. The more RAM available, the more data can be kept in memory instead of being read from much slower disk storage. That is why we selected 64 GB of DDR4 with a 3000 MHz clock speed, which works well for us because we can keep a lot of data in memory.
For storage, we recommend an SSD over an HDD: SSDs are easily 15 times faster than HDDs and a lot quieter. We chose a 1 TB SSD, as datasets nowadays can easily run to many gigabytes. Moreover, 1 TB is sufficient for us because training and validation data are stored only temporarily and removed once a model is fully trained and tested; all data are stored permanently on our Network Attached Storage (NAS). Accessing data directly from an SSD inside the machine is always faster than fetching it over the network from the NAS. We recommend researching how much storage you need, although storage can always be upgraded later if it turns out to be insufficient.
GPUs are the crucial element for deep learning tasks; in the end, the choice of GPU determines the performance when training deep learning models. Training such models involves many matrix operations, and GPUs contain thousands of small computation units (cores) that perform these operations in parallel, much faster than a CPU can. Moreover, the high memory bandwidth of a GPU makes it possible to train on large batches of data. This parallelism benefits gaming and complex simulations as well as deep learning, so apart from a training machine, we also have an incredible gaming machine.
Lambda Labs made a nice performance comparison between different GPUs on state-of-the-art (SOTA) deep learning models. When choosing a GPU, it is important to select one with enough VRAM (video RAM): some models simply cannot be trained if they do not fit in memory. Also, make sure to choose an Nvidia GPU, as most deep learning packages require Nvidia CUDA to run on the GPU. We went with the GeForce RTX 2080 Ti, which has 11 GB of VRAM, 4352 CUDA cores and 544 Tensor Cores that accelerate large matrix operations; it can train most SOTA models with decent performance.
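To get a feeling for how much VRAM a model needs, a rough rule of thumb is to count the weights, gradients, and optimizer state. The sketch below assumes FP32 training with the Adam optimizer; actual usage is higher, since activations, batch size, and framework overhead also consume memory:

```python
# Rough lower-bound VRAM estimate for FP32 training with the Adam optimizer:
# weights + gradients + two Adam moment buffers, 4 bytes per value each.
# Activations, batch size, and framework overhead add to this in practice.

def training_vram_gb(num_params: float) -> float:
    bytes_per_param = 4 * (1 + 1 + 2)  # weights, gradients, Adam m and v
    return num_params * bytes_per_param / 1024**3

# Example: a BERT-large-sized model with roughly 340 million parameters
print(f"~{training_vram_gb(340e6):.1f} GB before activations")
```

For a model of BERT-large's size this already claims around 5 GB, which shows why the 11 GB on the RTX 2080 Ti matters once activations and larger batches are added on top.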
When it comes to choosing a case, any case will do as long as it fits all your components. Since we wanted the option of up to four GPUs in the future, we needed at least a mid-tower case. Based on reviews and price, we ended up with the Lian Li case; other popular brands are Cooler Master and Corsair.
The machine's total power draw can easily be estimated: add up the power consumption in watts of every component you currently have plus any parts you plan to add in the future. As a formula:

total power = P_motherboard + n × P_GPU + P_CPU + P_CPU_cooler + m × P_storage
Here n is the maximum number of GPUs you plan to include, and m is the maximum number of drives. Insert these numbers into the equation and you have the minimum power-supply rating. However, make sure to add an extra 10% as a safety margin; this ensures the PC always has enough power. For our build, we concluded that a 1200 W supply would be sufficient. It is recommended to choose a power supply that can deliver more power than is currently used. This proves useful when more GPUs are added later or when components are replaced with more power-hungry ones.
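Plugging numbers into the formula might look like this. The wattages below are rough illustrative estimates for a single-GPU configuration, not measured values; check each component's spec sheet (TDP / max draw) for your own build:

```python
# Power-supply sizing following the formula in the text.
# Wattages are rough illustrative estimates for a single-GPU build;
# consult each component's spec sheet (TDP / max draw) for real values.

p_motherboard = 80
p_gpu = 250          # RTX 2080 Ti is rated around 250 W
p_cpu = 180          # Threadripper 1920X TDP
p_cpu_cooler = 10
p_storage = 10       # per M.2 SSD

n_gpus = 1           # GPUs currently installed
m_drives = 1         # drives currently installed

total = (p_motherboard + n_gpus * p_gpu + p_cpu
         + p_cpu_cooler + m_drives * p_storage)
with_margin = total * 1.10   # extra 10% safety margin

print(f"Estimated draw: {total} W; with margin: {with_margin:.0f} W")
```

With a single GPU this lands well below a 1200 W supply; that headroom is exactly what makes adding more GPUs later possible.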
Operating System (OS)
The operating system we went with is Ubuntu 18.04. One of the reasons we chose it is that most data-science-related code is written for Linux-based operating systems. Ubuntu is also highly flexible and requires far fewer resources than Windows; it uses less RAM, for example. Furthermore, Docker runs natively on Ubuntu.
Remote Working and Security
To make "MacGyver" accessible in the office, we gave it a static IP address, which means its address in the office network won't change. We also enabled Secure Shell (SSH) connections to "MacGyver". SSH is a protocol that lets us securely log into "MacGyver" from a different device within the same network, so the machine is always reachable at the same IP address over an SSH connection. Moreover, we set up a VPN to access the office network from anywhere. This is essential, especially during the corona pandemic, when everyone is working from home. The VPN is protected by credentials, so it would be hard for anyone to access our office network. And even if someone did get into our office network, an SSH key is still needed to get into "MacGyver".
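As a sketch, the static IP and SSH side of this setup looks roughly like the following on Ubuntu 18.04. The interface name, addresses, and username are placeholder examples, not our actual configuration:

```shell
# 1. Static IP via netplan.
#    Example /etc/netplan/01-static.yaml (placeholder values):
#
#      network:
#        version: 2
#        ethernets:
#          enp5s0:
#            addresses: [192.168.1.50/24]
#            gateway4: 192.168.1.1
#            nameservers:
#              addresses: [192.168.1.1]
#
#    Apply with: sudo netplan apply

# 2. On your laptop, generate an SSH key pair and copy the public key over:
ssh-keygen -t ed25519 -C "laptop-to-macgyver"
ssh-copy-id user@192.168.1.50

# 3. On the machine, require key-based logins by setting in
#    /etc/ssh/sshd_config:
#      PasswordAuthentication no
#    and restarting the service: sudo systemctl restart sshd
```

Disabling password authentication is what makes the SSH key mandatory, so a leaked VPN credential alone is not enough to reach the machine.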
In this blog, we provided an overview of building our deep learning PC, discussing the choice of specs and things to be aware of when building a PC. Moreover, we demonstrated how we set up the PC for remote working and showed the advantages of using a local machine over a cloud solution. We hope this blog helps you on the journey of assembling your own deep learning PC. If you are interested in the (deep learning) projects that we have worked on at Gyver, you can read all about them over here.