Special thanks to the person on twitter who messaged me with this question and is letting me use it as a starting point of a post. Poking at real scenarios is real fun, and I can always take a bit of creative liberty in anonymizing details.
Here’s their (paraphrased) problem statement.
I recently became a data analyst at a company. It looks like I need to do a lot of organization database creation work first. A lot of data is in Excel files in different systems. I want to gather everything, organize it, make it queryable and visualizable for users. Do you have any advice for a DB, tools, ways to design?
So there are two parts to this question: 1) the overt question: the things I should do, if I'm going to embark on such an epic quest, to increase my chances of success, and 2) the implied question: is this the best thing to be doing right now?
Problems like this generally resemble a big RPG game: you're dropped into a brand new world, everything is shiny and important-seeming, and, lacking any stronger storyline quests thrust upon you by the gods, you are sent to fetch objects from all over the world in exchange for unspecified rewards. This is how I wind up spending a hundred hours doing side quests and level grinding, which may be good as a leisure activity, but is perhaps not ideal for your career.
I think people with some project management experience behind them are likely to have alarm bells going off in their heads: “Warning! Unbounded scope!”
So let’s go into it, most important question first:
Should I be doing this to begin with?
This is probably the key question for this endeavor. There are ways to be effective and impactful without embarking on a giant mission with no true end.
Putting everything into “one centralized place that everyone shares and is more powerful” is a seemingly natural goal to have. “Everything is so hard right now because it’s scattered around. It takes work to put things together to even begin doing analysis work, this is why everything is horrible!”
Such projects are also, in my personal experience, where projects go to die. I firmly believe that the second you attach the term “data warehouse" to a project, your chances of failure go up a ton. And I say this as someone who has designed and built/prototyped three of the damned things in my career. I hope there isn’t a fourth.
The reason DW projects are hard has little to do with technical difficulty. They all fit well-known patterns. The tech part is relatively easy.
Data is routed to a centralized system designed for analytics.
The centralized system is just a big database-like service.
There are a lot of annoying ETL jobs to be written for data ingestion (see the sketch after this list).
There’s interface(s) for pulling data back out and doing analysis.
Finally, there’s lots of maintenance and upkeep.
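To make the “relatively easy” claim concrete, here’s roughly what the ingestion piece can look like. This is a minimal sketch, assuming the data arrives as an Excel export and lands in a local SQLite file; the file, sheet, column, and table names are all made up for illustration.

```python
# A toy ETL job: pull one Excel export into a central SQLite database.
# File, sheet, column, and table names below are placeholders.
import sqlite3

import pandas as pd


def ingest_sales_export(xlsx_path: str, db_path: str) -> int:
    # Extract: read the raw worksheet (pandas needs openpyxl for .xlsx files).
    df = pd.read_excel(xlsx_path, sheet_name="Sheet1")

    # Transform: normalize column names and parse dates.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    df["order_date"] = pd.to_datetime(df["order_date"])

    # Load: replace the table in the central store.
    with sqlite3.connect(db_path) as conn:
        df.to_sql("sales_orders", conn, if_exists="replace", index=False)
    return len(df)


if __name__ == "__main__":
    rows = ingest_sales_export("sales_export.xlsx", "warehouse.db")
    print(f"loaded {rows} rows into warehouse.db")
```

Multiply that by every source system and every weird edge case in the data, and you have the technical side of the job.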
It’s hard to draw the line between the long lineage of older data stores and analytics systems and what would resemble a modern data warehouse, but by the 1990s all the pieces and the scale were ready. That’s at least 30 years of technological refinement. Modern systems just take the core idea and do it with better scale, cost, and ease than before.
Humans created the problem, humans will be needed to solve it
The main reason these projects are hard is because the assumptions ignore the very human part of why things came to be as “messy” as they are in the first place. The desire for a grand “let’s sweep all the ugliness away and get everyone on the same page!” vision partly stems from not understanding that disparate systems were created by different people to fulfill a specific set of needs. Those systems then evolved until they worked reasonably well for their limited scopes. But because they were never coordinated at the beginning, getting them to work together now comes with a huge human cost.
Every input dataset has these things:
an owner with unique needs and a different level of willingness to cooperate
new or slightly different definitions of terms and measurements that need to be understood and reconciled
different access permissions that need to be overcome
different human processes surrounding the data (collection, entry, consumption)
It takes a lot of skill, patience, and resources to work with all these people and get the desired result. Many of those skills have nothing to do with software development or data analysis. So when you’re thinking about gathering literally everything into one system, this is what you’re signing up for.
You’ll be working with Alice from engineering to get access to the analytics side of production systems, and you’ll have to debug all the data issues there as you use the system. You’ll then talk to Bob from customer support to learn about the 3rd-party support ticketing system with custom in-house tooling, whose maintaining engineer left last month. Then Eve from finance is totally swamped working on the Series B preparation and doesn’t have time to explain how any of the financial data works this quarter.
Once you overcome the hurdles and understand and merge all these data sources into one place, you’re then trapped maintaining the system, because you’re the only one who knows how the thing works and no one else is invested in keeping it up to date.
No good deed goes unpunished.
Shrink the scope, work in pieces
Nothing says a data warehouse must spring forth, fully formed from your forehead, ready to provide wisdom to the world. As I went into above, to do so is generally fatal. You might call it head-splitting. Instead, by invoking the time-honored project management spell of “limiting scope,” you can find ways to survive.
Identify some core business questions, aim for those first
While any organization will have many functions and processes within it, an endless sea of analytic possibilities, there should be only a small handful of core functions that everything else supports. Whether it’s production of widgets or sales of subscriptions, some things are inherently more important than others. If it significantly threatens the life of the business when it goes bad, it’s probably worthy of attention.
Identify those systems, and think of some good analyses that could help people understand them better. The goal is to build out something that supports just those core business questions first. It could be understanding costs, sales, repeat sales, getting new customers, keeping current customers, etc. It is most likely not how many coffee filters are used every month in the office (though I’m sure the office manager has data on that somewhere).
More often than not, the data surrounding those core business questions is located in only a small handful of systems. It might even be in a single system, because this is the core of the business and everything grew up around it. Ideally you only want to work with a single system at a time, and never more than two at the absolute worst.
Do some analysis before you start coding
Now that your scope is limited down to a handful of questions targeting a handful of systems, the next thing you need to do is some basic analysis.
Pick a question, whatever makes you feel excited. You’re going to need that motivation for the next step. You’re going to make a one-off analysis to examine that question. Maybe it’s about customer retention, or time-to-sale, or cost of customer acquisition. Go through the actual mechanics of getting the data, understanding how it works, and most importantly, using the data to create an artifact that someone else finds useful.
This is your prototype.
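For a flavor of how small the prototype can be, here’s a sketch of a one-off script answering a hypothetical repeat-customer question from a single exported file. The column names and the question itself are stand-ins for whatever your business actually cares about.

```python
# One-off prototype: what share of customers ever place a second order,
# and how many orders do we see per month? Column names are invented.
import pandas as pd

orders = pd.read_excel("orders_export.xlsx")
orders["order_date"] = pd.to_datetime(orders["order_date"])

# Repeat-customer rate: customers with more than one distinct order.
orders_per_customer = orders.groupby("customer_id")["order_id"].nunique()
repeat_rate = (orders_per_customer > 1).mean()

# Orders per month, as a small table someone else can look at.
monthly = (
    orders.groupby(orders["order_date"].dt.to_period("M"))["order_id"]
    .nunique()
    .rename("orders")
)

print(f"repeat-customer rate: {repeat_rate:.1%}")
monthly.to_csv("orders_per_month.csv")  # the artifact you actually show around
```

The real thing will be messier than this, and that messiness is exactly what you want to discover now.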
You’re going to learn so much about how those systems work just by doing this that it will make the rest of the process easier. You’ll most likely have to talk to people, ask them what various bits of data mean, where things are unreliable, and where data collection can be improved. You’ll find weird bugs in the data that will break your analysis pipelines. It will take longer than you expect, even for the hackiest, jankiest throwaway analysis. You will likely be fixing bugs for weeks along the way.
But do not despair, this is all work you would have had to do to build the data warehouse to begin with, but now you are doing it for a concrete purpose instead of a vague abstract system in the future. There is an end in sight with a single deliverable.
Finally, once you create your analysis result, you can show people. Hopefully they’re excited about what you created because it helped them understand something better, or offered insight. They might even have follow-up questions that you’ll have to go analyze further. But guess what you just did? You’ve built excitement for your project. You’ve shown them how you can help them, and they’re going to be a lot more willing to help you in the future if you need them to change their data processes for your data warehouse.
Automate your analysis
Now that you’ve actually done an analysis end-to-end, you should be familiar enough with the system to have a decent idea of how to automate it. So go ahead and do it.
You’re going to find new issues and hiccups when you do. The most important one being “wait, how do I automate this stuff, and on what system?” If you’re the first person doing this, there may not be any infrastructure for it yet. Time to put on your data engineer hat and figure stuff out (while working together with the operations folks).
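As a sketch of what “automated” can mean at this stage: the prototype repackaged as a rerunnable job with logging, reading from whatever central store you landed the data in. The database path and table names here are assumptions carried over from the earlier sketches.

```python
# Sketch of the prototype repackaged as a rerunnable job.
# Database path and table names are assumptions.
import logging
import sqlite3

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("monthly_orders_job")


def refresh_monthly_orders(db_path: str = "warehouse.db") -> None:
    with sqlite3.connect(db_path) as conn:
        orders = pd.read_sql("SELECT order_id, order_date FROM sales_orders", conn)
        orders["order_date"] = pd.to_datetime(orders["order_date"])

        monthly = (
            orders.groupby(orders["order_date"].dt.to_period("M").astype(str))["order_id"]
            .nunique()
            .rename("orders")
            .reset_index()
            .rename(columns={"order_date": "month"})
        )
        # Rewriting the whole output table keeps reruns safe (idempotent).
        monthly.to_sql("monthly_orders", conn, if_exists="replace", index=False)
    log.info("refreshed monthly_orders with %d rows", len(monthly))


if __name__ == "__main__":
    refresh_monthly_orders()
```

Run it on whatever scheduler your shop already has; a crontab line like `0 6 * * * python monthly_orders_job.py` is plenty to start. The specifics matter less than the job running without you babysitting it.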
This is when you can start evaluating your technical needs. Do you use a simple relational database? (Answer: yes in 99% of cases). What languages do you want to use? Who’s responsible for keeping the systems up?
There’s a huge amount of details and potential choices to be made here. The only thing I can stress is, consult your engineering partners! You want to avoid being the odd one out.
If you’re the only person who knows R and everyone else is using Java, please don’t use R without a super, important, pressing reason. Doing so means volunteering to either be the only person to maintain the system forever, or you have to teach others a language they don’t know. This applies to your other infrastructure too. If you’re a SQL Server shop, don’t spin up a PostgreSQL instance without a good reason.
Luckily, data warehouse solutions are plentiful. Every cloud service provider offers entire solutions that can cost as much money as you’re willing to spend. You can spin up your own using open source tech, including just a simple PostgreSQL database. Pick what works for your environment and budget.
Once you do all this stuff and you finally finish automating, congratulations, you’ve created a data pipeline! You even have users who are waiting to consume the output.
Repeat until you’re tired of all this
Now that you have one system working with an analysis, you can think about expanding. Find a new analysis, a new outcome you want, test it, build it, automate it, and integrate it into your system. Maybe this will involve integrating a new data source into your system, or you might just do a new analysis with the existing data.
Repeat this over and over. It’s sorta exhausting, but again, thanks to having limited scope there’s always an end in sight. You also will have to do less data engineering the second time around.
What about those big data warehouse analytics databases like AWS Redshift, GCP BigQuery?
You could use those, if they fit your needs. They’re designed for large volumes of data. They’re often nice in that you can throw CSV files at them and they’ll ingest the data and present you with a very zippy SQL interface. Since CSV is the lingua franca of data, it can make a lot of your ETL jobs easier. I used to have to write raw MapReduce jobs to do the things Redshift and BQ do in SQL. From that background, those new tools are amazing.
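For a sense of how little ceremony that CSV ingestion involves, here’s roughly what a load looks like with BigQuery’s Python client (the bucket, dataset, and table names are placeholders; Redshift’s COPY command plays a similar role):

```python
# Roughly: point BigQuery at a CSV in cloud storage and let it infer the schema.
# Bucket, dataset, and table names here are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row
    autodetect=True,      # infer column names and types from the file
)
load_job = client.load_table_from_uri(
    "gs://example-bucket/orders.csv",
    "example_dataset.orders",
    job_config=job_config,
)
load_job.result()  # wait for the load; the table is then queryable with plain SQL
```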
At the same time, you’re going to be paying for those amazing features, both in terms of data storage and the cost of running queries. Those bills can be nontrivial. They’re marketed as the foundational tech of data warehousing solutions, but they’re complete overkill if your data sources are primarily Excel. (Excel only supports ~1 million rows per worksheet; that easily fits in your laptop’s RAM.)
My recommendation is to stick with much cheaper, small, local relational databases to start; you can migrate and scale up later with a medium amount of discomfort.
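One way to keep that later migration cheap: if your loads go through SQLAlchemy, moving from a local database to a server-backed one is mostly a connection-string change. A sketch, with example URLs that are not recommendations:

```python
# If loads go through SQLAlchemy, "migrate later" is mostly a connection-string
# change. Both URLs below are examples, not recommendations.
import pandas as pd
from sqlalchemy import create_engine

# Start small and local...
engine = create_engine("sqlite:///warehouse.db")

# ...and later point the same code at a server instead (needs a driver like psycopg2):
# engine = create_engine("postgresql+psycopg2://analyst:secret@db-host/warehouse")

df = pd.read_excel("orders_export.xlsx")
df.to_sql("sales_orders", engine, if_exists="replace", index=False)
```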
Any other tips?
Plan Ahead
You can plan ahead a bit in your programming and system architecture. You can avoid writing too much throwaway code because you know your data pipelines and systems are going to be reused for other systems.
Design in a certain amount of flexibility from the start. A few extra layers of abstraction here and there will save you some refactoring in the future. Balancing the level of abstraction is a bit of an art form (you can certainly go too far), but it only takes a few minutes to split things out into functions and methods instead of one giant mess of spaghetti. Avoid having your present self inflict pain on your future self.
A decent rule of thumb I use is to be fairly UNIX-y in the separation of concerns, with an eye towards making modules of functionality that will represent different data systems in the future. The modules can then be swapped around or iterated on as needed.
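Here’s one way that separation can look in code. This is only a sketch: the Source protocol and ExcelSource class are invented names, and the point is just that swapping a data source shouldn’t require rewriting the transform and load logic.

```python
# Sketch of a UNIX-y layout: each stage does one thing, and sources are
# swappable objects instead of logic baked into one giant script.
# The Source protocol and ExcelSource class are invented for illustration.
import sqlite3
from typing import Protocol

import pandas as pd


class Source(Protocol):
    def extract(self) -> pd.DataFrame: ...


class ExcelSource:
    def __init__(self, path: str, sheet: str = "Sheet1") -> None:
        self.path, self.sheet = path, sheet

    def extract(self) -> pd.DataFrame:
        return pd.read_excel(self.path, sheet_name=self.sheet)


def transform(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out.columns = [c.strip().lower().replace(" ", "_") for c in out.columns]
    return out


def load(df: pd.DataFrame, table: str, conn: sqlite3.Connection) -> None:
    df.to_sql(table, conn, if_exists="replace", index=False)


def run(source: Source, table: str, conn: sqlite3.Connection) -> None:
    # Swapping ExcelSource for an API- or database-backed source later only
    # touches the extract step; transform and load stay the same.
    load(transform(source.extract()), table, conn)
```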
Work with people
People will make or break the process. If what you’re building doesn’t interface with their world, they won’t use it. You’re going to depend on them to give you data, so don’t disrupt their processes without their input. When your pipelines are complete, you’re going to depend on them to warn you if changes are incoming.
If people are invested in your new system because they’re directly getting value out of it, then they’ll be interested in helping you keep all the moving parts running smoothly. Managing this relationship is critical. If people stop caring, things will start breaking.
It will take time
This is not a one-week project. Make sure everyone involved has realistic expectations. Make sure they know things can derail due to bugs and issues.
Then it’s just a matter of continuing on. Good luck!