You, too, should run a few machines

For Science!

Jun 29, 2021

The giant multi-story lens array of a lighthouse.

Many many many years ago, back when I was in grad school in the early/mid 2000’s, I started learning to use Unix systems — some Linux and a lot more FreeBSD. I had extremely little idea what I was doing, but I had finally managed to get a spare computer salvaged from a computer lab recycle pile so I could risk breaking one system and having a backup device to look up how to fix it. (This was in the age before smartphones…) That machine wound up hosting SVN out of my room and served as Yet Another Backup point for my thesis. Ah, memories.

Over the years, that small fitful start slowly snowballed into a critical body of knowledge I rely on at work. Even if I’m not spinning up systems for production, it’s useful when I interact with engineers, debug telemetry, or just try to understand how stuff is used in my current job at a big cloud infrastructure provider. Every so often, it gets me the job title “Data Engineer”.

But I come from a fairly nerdy path (how many social science grad students run their own server?), so I honestly don’t know what the population at large is doing. Has everyone switched to doing everything on their laptops? Is everyone using the cloud or something else? So I asked people on Twitter… and I got a ton of responses from people from all around, it was great!

Randy Au 🙃 @Randy_Au

Yo, data science twitter. I'm noodling on a post. How many folk out there have ever stood up and run their own small linux server for personal use/experimentation/learning, at home or on cloud, just to do small stuff? Is this common or am I out of touch?

It appears, from the massively self-selected and biased sample of “people who saw my tweet and decided to reply”, that standing up a server to poke around with is a very common occurrence amongst people who self identify as data scientists on Twitter.

Most data scientists are expected to have the tech skills to write code, build data pipelines, etc., so it makes some sense. While our duties almost never extend into running large highly-available data center deployments, the basic skills are more than enough to put together some small servers for personal use.

But what if you’re sitting here reading this and you’ve never done it before? Maybe you’re coming from another field that never required you to touch server infrastructure, or just haven’t gotten to the point of learning this stuff yet. Should you give it a try?

Yes!

Standing up your own server from scratch (on the software side, unless you feel like playing with hardware too!) forces you to touch a huuuuuge amount of little tiny details that get glossed over.

You’ll also develop more empathy for people whose job it is to build and run the large complicated data systems that we use in our daily work. You’ll also get a better understanding of why people are willing to pay a premium to have various systems like databases run for them in a managed cloud service.

But since this topic is WAY too broad to cover in a single post, and I’m not very good at doing multi-week series, I’m going to attack this problem at a higher level. Say you have no idea where to start.

What are the most important basic decisions you need to understand and make to get started, no matter what you’re doing?

Getting started - Pick something you want to do

Projects work out best when you have an end goal in mind. It brings focus and helps you filter out all the noise you don’t want. It’ll also dictate what kind of equipment you’ll need.

Maybe you want to build a big computer to do heavy duty machine learning stuff. That’s going to require significantly more hardware, in the form of expensive GPUs, than just hosting a web site, scraping some data, or running a database. If you want to do Internet of Things type work, you’re going to want lots of tiny cheap computers instead.

If you have no idea what your use case is right now other than “I’m going to get my feet wet and try some stuff, maybe host a web site”, then you just want any cheap old computer you can get. An 8-year-old laptop is perfect. A $35 Raspberry Pi is actually more than enough. You can upgrade to a new machine, or the cloud, at any point. You can even migrate most of your work!

Other interesting ideas: hosting your own git repository on the server instead of relying on a service like GitHub, using the machine to collect telemetry from a weather station or IoT devices, or taking automatic backups of your hard drive. Other people use theirs for software development, or for figuring out how to use new software.

Copy/paste will be your friend throughout this

We’re going to be doing this stuff for educational purposes. We’re also going to be covering a HUGE amount of technical ground. Ground that entire careers and fields can be built upon. We don’t have the time to understand every minute detail of everything we do. So we’re going to speedrun through things by doing a ton of copy/paste from guides found on the internet until we get things working.

Breadth over depth! Quantity over quality! We’ll fill in the gaps later!

While this sounds completely the opposite of “learning”, it isn’t actually as mindless as it initially sounds. We’re learning by mimicry first, which is how children learn and how many of us actually learn skills. You don’t read a bunch of books and suddenly know how to paint a museum piece or draw a portrait. You mimic, trace, learn the motions without fully understanding, until the pieces click. The path you take doesn’t matter.

What’s actually going to happen is you’re going to find that the guides you copy are outdated, designed for a slightly different setup, or use settings you don’t want, or you’ll simply make a mistake and accidentally skip a step. Things WILL break and not work, even if you’re copy-pasting stuff. The actual learning will come when something doesn’t work right and you have to figure out how to fix it.

Okay, dig up a sacrificial lamb (machine)

If available, pick something you can use for free: something that’s lying around that you simply don’t use anymore, like an old laptop. If you must buy something, go cheap unless you know exactly the specs you need. Used equipment is definitely an option here. Just about anything from the past 10 years will run a Unix command line without much fuss.

If you don’t have hardware lying around, check out your nearest cloud provider. There are usually tiny instances available for free or at very low cost. You can always scale up or down as you see fit. Just remember to turn off your instances when you’re done so you don’t get billed for them.

If the Big Cloud providers (AWS, Azure, GCP) are a bit pricey for your tastes, smaller hosting companies rent out virtual private servers for just a couple of dollars a month. Those providers don’t offer all the infrastructure of the big clouds, but they’ll get you virtual machines.

For now, don’t worry too much about the differences between x86 (Intel/AMD) and ARM (most other stuff these days, including the Raspberry Pi) architectures. ARM typically has less software ported to it compared to x86, so it’s occasionally a problem, but one that shrinks every day as more people adopt ARM-based systems. You’ll want to avoid microcontroller boards like Arduinos, since they don’t run general purpose operating systems and are used for other things.
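
If you’re curious which camp a machine falls into, here is a minimal Python sketch that asks the standard library’s platform.machine() and sorts the answer into a rough bucket. The bucket lists and the classify_architecture name are made up for illustration, not an official mapping, and the exact strings reported vary by operating system.

```python
import platform

# Rough grouping of strings that platform.machine() commonly reports.
# This grouping is an assumption of the sketch, not an official list.
X86_NAMES = {"x86_64", "amd64", "i386", "i686", "x86"}
ARM_PREFIXES = ("arm", "aarch64")


def classify_architecture(machine: str) -> str:
    """Map a raw machine string to 'x86', 'ARM', or 'other'."""
    name = machine.lower()
    if name in X86_NAMES:
        return "x86"
    if name.startswith(ARM_PREFIXES):
        return "ARM"
    return "other"


machine = platform.machine()
print(f"{machine!r} looks like: {classify_architecture(machine)}")
```

Knowing the bucket mostly matters when a download page offers several builds of the same software and you need to pick the right one.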

The important thing is to have the ability to hard reset or reinstall the thing when you accidentally mess up beyond your ability to repair.

Next, pick an OS

There are tons of free open source operating systems out on the market, available to download. In this particular space, you’re going to want to use the most popular distributions around for your hardware, for two reasons:
1 - Popular distributions will have the most users, meaning the most posts on the internet about issues. You don’t want to run some niche thing only 100 people use and be unable to search for help for any problems you encounter.
2 - You’re most likely only going to find a handful of major distributions at work: Red Hat Enterprise Linux/CentOS, Debian, Ubuntu, etc. It’s also the same list that major cloud providers host by default. Pick any one of those, because they’re super popular.

(If you’re running an ARM-based system, the distributions might be slightly different. For example, Raspberry Pis often run Raspbian, now called Raspberry Pi OS, though you can run other things like Ubuntu.)

Do NOT overthink the OS choice!!!

Picking an OS to use seems like an important decision, but for our purposes it’s almost inconsequential. Everything out there is roughly trying to imitate the proprietary Unix systems of back in the day, and while each has taken on its own flavor over the years, lots of things are similar.

The choice of distribution primarily affects parts of the user interface, the tools used to manage and install software, and the fiddly details about where certain files and configurations are kept. Most of your time will be spent interacting with software that you install yourself (Postgres, nginx, docker, whatever) so your end experience will be very similar no matter what you pick.
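
To see which distribution a machine is running, and therefore which package tool the guides you copy from should be using, here is a small sketch that reads /etc/os-release, a file most modern Linux distributions ship. The parser is deliberately simplified, and the PACKAGE_TOOLS table is a short illustrative list, not a complete one.

```python
from pathlib import Path

# Illustrative mapping from the os-release ID field to the usual package tool.
PACKAGE_TOOLS = {
    "debian": "apt",
    "ubuntu": "apt",
    "raspbian": "apt",
    "fedora": "dnf",
    "centos": "yum or dnf",
    "rhel": "yum or dnf",
}


def parse_os_release(text: str) -> dict:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in quotes; strip them for display.
        info[key] = value.strip().strip('"').strip("'")
    return info


path = Path("/etc/os-release")
if path.exists():
    info = parse_os_release(path.read_text(encoding="utf-8"))
    distro = info.get("ID", "unknown")
    print(info.get("PRETTY_NAME", "unknown distribution"))
    print("Package tool:", PACKAGE_TOOLS.get(distro, "check the distro docs"))
else:
    print("No /etc/os-release here (not Linux, or an unusual distribution).")
```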

Personally, I’m an oddball in that I actually prefer to use FreeBSD, which isn’t even Linux, on my personal machines. Their documentation is very thorough, and they keep a good separation between the base system (the kernel plus the core tools the OS itself needs to run) and third-party software (all the stuff users install and run on top). It makes it harder for clumsy folk like myself to accidentally brick the whole computer in a botched system update (which I’ve done to Linux systems before).

For many standard server operations like hosting, routing, etc., BSDs are perfectly fine to use. The problem is that modern enterprises have poured tons of investment into the Linux ecosystem, and some things like Docker just don’t work on BSD as of this writing. So feel free to play with them if you’d like, but don’t go with BSD if you want bleeding edge tech because it might not be well supported yet.

For Linux, I generally use either Debian or Ubuntu. 99% of the time, the differences between the two don’t matter to me as a casual user.

Install the OS

This varies a ton. If you’re using a cloud, you pretty much click a few buttons to select a distribution and it’ll auto-install. They’ll also give you all the settings needed to connect to the machine. Just follow the instructions.

For actual hardware, you’re going to have to do some reading. Maybe it’ll involve saving an image in a specific way to a memory card. Maybe you’re going to have to burn a DVD (wait, do people do that still?), or set up a USB stick. Installation instructions are usually quite good because it’s the most used feature of any OS.
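
One step many install guides include is checking the downloaded image against a checksum the distribution publishes, so you don’t write a corrupted download to your memory card or USB stick. Here is a hedged sketch of that check in Python, assuming the published checksum is SHA-256 (check what your distribution actually uses). The file name and contents in the demo are stand-ins, not a real image.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large images don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: Path, published_hex: str) -> bool:
    """Compare a file's hash to a published hex string, ignoring case."""
    return sha256_of_file(path) == published_hex.strip().lower()


# Demo with a tiny stand-in file; a real check would point at your download
# and use the checksum string from the distribution's download page.
with tempfile.TemporaryDirectory() as tmp:
    fake_image = Path(tmp) / "installer.img"
    fake_image.write_bytes(b"pretend this is an OS image")
    expected = hashlib.sha256(b"pretend this is an OS image").hexdigest()
    demo_ok = matches_published(fake_image, expected.upper())
    demo_bad = matches_published(fake_image, "0" * 64)
print("good checksum matches:", demo_ok, "| wrong checksum matches:", demo_bad)
```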

Then, figure out how to connect to the thing you’ve installed

Normally that means either using SSH and logging in with a user account that was set up during installation, or logging in as root via the console. The exception is if you’ve set up a Windows machine, in which case you’ll probably need some form of remote desktop.

You’ll need an IP address that you point your SSH client to and connect. Until you figure that out (or are given it by your cloud provider), you’ll have to keep the monitor and keyboard you used to install the OS connected.
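
Once you think you have the right address, a quick way to tell “wrong IP” apart from “wrong password” is to check whether anything is listening on the SSH port at all. The sketch below does that with a plain TCP connection. It assumes SSH is on its default port 22, and can_connect is a hypothetical helper name. The demo talks to a throwaway listener on the local machine so it runs without a real server.

```python
import socket


def can_connect(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Demo: open a temporary listener, probe it, close it, probe again.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]
reachable_while_listening = can_connect("127.0.0.1", demo_port, timeout=1.0)
listener.close()
reachable_after_close = can_connect("127.0.0.1", demo_port, timeout=1.0)
print(reachable_while_listening, reachable_after_close)
```

Against a real machine you would call something like can_connect("192.168.1.50"), where the address is whatever your router or cloud provider reports (the one shown here is made up).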

Without going into the extremely big field that is networking, there’s one important concept that I almost never find mentioned in networking tutorials: every single network interface (think ethernet jack or wifi card) is given its own address so that data can be sent to that specific interface.

Most networking discussions talk about “hosts” like they all have one address. It makes sense in certain contexts, but that doesn’t match up with modern computers that often carry multiple interfaces. If you keep this in mind, a lot of common networking issues make a bit more sense.
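
To make the “many interfaces, many addresses” idea a bit more concrete, here is a rough sketch that labels an address by how far away it can be reached from, then lists the interface names the OS knows about. The labels and the describe_address name are this sketch’s own simplification. The standard library doesn’t easily show which address belongs to which interface, so for that you would run a system tool such as ip addr on Linux or ifconfig on FreeBSD.

```python
import ipaddress
import socket


def describe_address(addr: str) -> str:
    """Label an IP address with a rough reachability category."""
    ip = ipaddress.ip_address(addr)
    if ip.is_loopback:
        return "loopback"    # only reachable from the machine itself
    if ip.is_link_local:
        return "link-local"  # self-assigned, often a sign DHCP didn't answer
    if ip.is_private:
        return "private"     # reachable on the local network only
    return "public"          # catch-all; for ordinary addresses, internet-routable


# Example addresses chosen for illustration.
for example in ("127.0.0.1", "192.168.1.50", "169.254.10.20", "8.8.8.8"):
    print(example, "->", describe_address(example))

# Interface names known to the OS (names differ between systems).
try:
    for index, name in socket.if_nameindex():
        print("interface", index, name)
except (AttributeError, OSError):
    print("Could not list interfaces on this platform.")
```

If your server only has a private address, that explains why it answers from your couch but not from the coffee shop.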

Next, plan out your software installs by working backwards

Let’s say you’re like me and can’t code up a web site more complex than “Hello World!” in plain HTML, with no CSS or JavaScript. How can we proceed without learning a whole bunch of programming?

We’ll just have to use someone else’s work. There are lots of software packages that will give you a wiki or a blog if you just install them onto your server. Just find some software you’d like to run: for example, use MediaWiki to create your personal Wikipedia clone, or use WordPress to make a blog.

Then you look up what it takes to install this software. Luckily they usually list their requirements and/or the OS’s package manager will install the dependencies you need. You’ll need a web server, a database, PHP, etc. Read a bunch of posts written by others debating whether WebServer Alpha is better than WebServer Beta. Apache? Nginx? Postgres? MySQL? If you again stick to stuff that lots of other people seem to be using, you should be able to slowly piece together your system one piece at a time.
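
While you work through a requirements list, it helps to check which pieces are already installed. This is a minimal sketch using shutil.which from the Python standard library. The command names in the wanted list are examples for a PHP blog setup and are assumptions, not requirements taken from any particular project. Some server programs also install outside a regular user’s PATH, so “not found” here is a hint rather than proof.

```python
import shutil


def missing_tools(names):
    """Return the command names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


# Example command names: a web server, the PHP CLI, and the Postgres client.
wanted = ["nginx", "php", "psql"]
missing = missing_tools(wanted)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("Everything on the list is installed.")
```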

Can’t tell what’s popular? That’s okay! Thanks to how search engines work, the top couple of results for just about any software search will be the most popular choices. Look them over and choose whatever strikes your fancy.

If you persevere, you’ll eventually figure things out and achieve your intended goal while learning a bunch of operational details you’ve never thought of before.

Reflect on what you’ve done, make adjustments

In your mad dash to install software and get it configured just barely enough to function, you’ll have probably copy-pasted a ton of code and commands from all sorts of places. There will be configuration flags that were a mystery to you. You’ll want to keep records of what those commands are. I often dump important-seeming ones that I might want to reuse later into a text file.
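
A text editor is plenty for this, but if you like the idea of timestamps, here is one possible sketch of a tiny helper that appends a command and an optional note to a plain-text file. The log_command name, the file name, and the entry format are all made up for illustration. The demo writes to a throwaway directory; point log_path at a real file to keep your notes.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def log_command(log_path: Path, command: str, note: str = "") -> None:
    """Append one timestamped entry to a plain-text notes file."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(f"[{stamp}] {command}\n")
        if note:
            f.write(f"    # {note}\n")


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    notes = Path(tmp) / "server-notes.txt"
    log_command(notes, "sudo apt install nginx", "web server for the blog")
    log_command(notes, "sudo systemctl restart nginx")
    demo_text = notes.read_text(encoding="utf-8")
print(demo_text)
```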

Eventually you’re going to want to look back on those decisions and make changes away from the defaults. Do you want SSL and https for your blog? Do you want to swap to another database? Do you want to point a domain name at your new shiny blog?

Every little change will point you to new things to learn. Adding a domain name means you’ll suddenly be messing with DNS records. Adding SSL will throw you into the deep end of setting up and using security certificates.
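
When you do start messing with DNS records, the first question is usually “what does this name point at right now?” Here is a small sketch that asks the system resolver via socket.getaddrinfo. The resolve name is hypothetical, and the demo uses localhost so it works without a network. Swap in your own domain once you have one.

```python
import socket


def resolve(hostname: str) -> list:
    """Return the sorted, de-duplicated IP addresses a hostname resolves to."""
    try:
        results = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # The name could not be resolved at all.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # the address string is the first item of sockaddr.
    return sorted({entry[4][0] for entry in results})


print("localhost ->", resolve("localhost"))
```

Keep in mind that DNS changes can take a while to show up, so an old answer right after an edit doesn’t necessarily mean you made a mistake.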

Fear not! Things will break!

The great thing about sacrificial systems is that you build them pretty much assuming they’re going to break. I lost count of the “interesting” ways I accidentally broke my systems. Whether it was botching up a newly compiled kernel for an upgrade, updating the system incorrectly and causing a dependency loop that couldn’t be resolved, misconfiguring important settings, locking myself out with bad firewalls or logins…

And every time you break it all, you’ll get to learn. Maybe you’ll learn how to actually fix the problem. Sometimes you’ll learn when certain situations are hopeless and when it’s worth taking a backup and just reinstalling the system. You’ll also learn how to recover your files from such situations.
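
Taking that backup before you reinstall can be as simple as bundling a directory into a compressed archive. Below is a minimal sketch using Python’s tarfile module. The backup_directory name, the directory names, and the timestamped file name pattern are assumptions made for illustration, and the demo runs entirely inside a temporary directory.

```python
import tarfile
import tempfile
from datetime import datetime
from pathlib import Path


def backup_directory(source: Path, dest_dir: Path) -> Path:
    """Write source into a timestamped .tar.gz under dest_dir; return its path."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    archive = dest_dir / f"{source.name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the folder name.
        tar.add(source, arcname=source.name)
    return archive


# Demo with pretend config files in a throwaway location.
with tempfile.TemporaryDirectory() as tmp:
    site = Path(tmp) / "blog-config"
    site.mkdir()
    (site / "nginx.conf").write_text("# pretend config\n", encoding="utf-8")
    out = Path(tmp) / "backups"
    out.mkdir()
    archive = backup_directory(site, out)
    with tarfile.open(archive) as tar:
        archived_names = sorted(tar.getnames())
print(archive.name, archived_names)
```

A backup that lives on the same disk as the thing that broke won’t save you, so copy the archive to another machine once it’s made.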

Have fun with it.

About this newsletter

I’m Randy Au, currently a Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. The Counting Stuff newsletter is a weekly data/tech blog about the less-than-sexy aspects of data science, UX research and tech, with occasional excursions into other fun topics.

All photos/drawings used are taken/created by Randy unless otherwise noted.
