In the first part of this series, focused on data pipelines, we touched on how running a business means you are generating tons of data.
As we discussed, every business activity results in data being created from a multitude of sources - and in order to extract and move all that information from a variety of sources to a single place, you can use a data pipeline.
But a data pipeline is just the first step in data-driven business intelligence. After configuring a data pipeline, you need to figure out how to store the data that your business is creating. To do this, you use a data warehouse.
Here, we’ll explore what a data warehouse is, data warehouses vs. databases, data lakes and other methods of data storage.
What is a data warehouse?
A data warehouse is a system to help aggregate and store your company’s data and information. And the term “data warehouse” is actually a great descriptor for what the system is.
For example, imagine an Amazon warehouse (or more specifically, the “Fulfillment By Amazon” network) - there are tons of shops, merchants, and vendors sending their products to an Amazon warehouse to ultimately get sold. Amazon needs to receive all these products, sort them, and add them to their inventory. This sorting process is essentially what a data pipeline does (but for data).
Once Amazon has received and sorted these products, they need to make sure they are organized and stored safely in order to be able to complete the rest of the ecommerce lifecycle. This organization and storage is how a data warehouse is structured - but instead of a warehouse of physical products, a data warehouse is a repository of data and information.
Like a physical warehouse, a data warehouse has a number of common elements:
- A relational database: You can think of this like shelves in an Amazon warehouse - a relational database stores data and makes it straightforward and simple to access. Just like how a physical warehouse doesn’t throw all their products and packages into a massive pile and hope for the best but rather, incorporates a system of bins, shelves, and rows, a relational database has data organized in sets of data tables (think of a data table like an organized Google or Excel spreadsheet) with rows and columns with pre-defined relationships between them. That means when you have to retrieve your data, you can do so quickly and efficiently.
- An ETL or ELT framework: ETL stands for Extraction, Loading, and Transformation or (Extraction, Transformation and Loading, depending on the order). You can think of this like a conveyor belt in an Amazon warehouse. You need to be able to clean, organize, and move your data efficiently in order to analyze it down the road - an ETL process does just that.
- While not all data warehouses have all the same features and processes, many also contain various capabilities for statistical analysis, data visualization, and machine learning as well.
Data warehouses vs. databases vs. data lakes
Before digging deeper into the specifics of data warehouses and what they’re used for, it’s worthwhile to differentiate between data warehouses, databases, and data lakes. While these three data management systems are closely related and often referred to interchangeably, there are a few key differences.
At a high level, data flows into a data warehouse from a myriad of sources. In order for businesses to extract actionable and valuable insights from their data and monitor business processes, businesses can connect various business intelligence tools to their data warehouse and utilize data science and analytics methodologies to create charts, graphs, reports, and dashboards.
“Data science and analytics” is a major component that differentiates a data warehouse from a database. With a data warehouse, you can store historical data from many sources and use that data to find relationships and trends, and to better understand your customers and how your business is doing. A data warehouse exists as a sort of layer above a database.
A database, on the other hand, is used with real-time, single-sourced (usually shorter-term) data, often in a transactional format. For example, if you’re an ecommerce merchant, you might use a database to help you create and track your online orders. You might use a data warehouse for things like storing all your marketing, customer, and order data so you can try to better understand your customers and where they are coming from, and whether or not there are pain points in your checkout flow.
Now - what about a data lake? With a data warehouse, there are strict requirements in how the data is stored - but if you relax these requirements, you can get something called a data lake. In a data lake, you can have both structured and unstructured data stored together.
By allowing unfiltered and raw data, a data lake can be more flexible - however, this can make relatively routine data work more difficult for engineers, developers, data scientists and data analysts. In most cases, if you’re an ecommerce merchant or a brand (or have a similarly structured business) you are most likely going to only be concerned about databases and data warehouses, not necessarily data lakes.
What are data warehouses used for?
Let’s go back to our first chapter on building a data pipeline and discuss the same ecommerce company, for example. To recap: if you are an e-commerce merchant, you might have customer and purchase-level data from your Shopify store, email marketing campaign data in Mailchimp, and advertising performance data in Facebook, Instagram and Google Ads (not to mention data on your website traffic and behavior, returns and shipping details, customer success, and other information from other sources).
Ultimately, your goal might be to optimize your email and advertising efforts. You’ve built a robust data pipeline to extract, reformat, and wrangle your data, so the next step for you is storage - and this is where the necessity for a data warehouse comes into play.
Instead of having your company’s data littered across multiple sources, a data warehouse acts like a central repository for your data and information. This is vitally important when you want to analyze your business’s data. In conjunction with a robust data pipeline, a data warehouse that is structured and organized gives you the power to quickly query, analyze, and generate reports, dashboards, and insights using data from across multiple connected sources.
Contained in your data warehouse you might have a customer-level data table, where each row in that table is a customer and each column is an attribute associated with that customer - like their name, email, phone number, location, unique customer ID, etc.
Another table might have your purchase-level (or order-level) data, where each row is a purchase and each column has attributes about that purchase. For instance, you might have columns for the purchase date, the customer ID who made that purchase, a product ID for what was purchased, an order amount, a tracking ID for where that customer came from to make that purchase (like email campaign, social media campaign, etc.) and many other components.
Organizing your data into this format makes it simple to gauge marketing campaign success, segment purchase data by customer type, look at which products are selling well and more - and this analysis can be done fairly easily by writing only a few basic queries when you have your data stored and organized in a data warehouse.
How does a data warehouse connect/relate to a data pipeline?
Understanding the relationship between a data pipeline and a data warehouse can get a bit tricky. Sometimes, your data pipeline and data warehouse are integrated together in the same platform. Other times, your data pipeline and data warehouse are different platforms relying on a connection. It can be helpful to think about the entire business intelligence lifecycle and how a data warehouse fits into that process. You have numerous sources of data - so step one is to connect all your data sources and integrate your data (this is your data pipeline). Step two is to effectively and efficiently store and maintain your data (this is your data warehouse). Step three is to use and analyze your data to help your business make better decisions (this is with reports, dashboards, and other types of data science and analytics techniques).
Simply, you have data coming in (via a data pipeline) and data going out (via data science and analytics techniques and methodologies). Your data warehouse functions like a middle ground or bedrock for this process: receiving, holding, organizing and distributing this data so you can use it to your advantage.
Data warehouse options and other things to consider
If you’re looking for a data warehousing solution, there is a lot to consider.
Without getting too much in the weeds, your data warehousing solution depends significantly on what you need and the type of business that you have. Unless you are generating massive amounts of data or planning to run computationally complex machine learning algorithms (if you’re an ecommerce merchant or similar business, you most likely won’t be), there are a few major considerations when choosing your solution:
- First, maintenance: Are you looking to design, set up, and maintain your data warehouse? Or would you prefer a pre-built solution that requires less technical overhead? Do you already have people on your team qualified to manage the warehouse?
- Second, connections and flexibility: Some data warehousing solutions come with simple and easy-to-set-up connections for data sources and business intelligence tools. However, this can come at the expense of flexibility. For example, an ecommerce merchant or a brand may prefer the ease of a data warehousing solution that easily connects to their data sources and offers a pre-built data science and analytics component. However, a company that generates a large amount of data or a lot of intricate data from a variety of sources may need more flexibility than can come from an out of the box solution.
- Lastly, security and access: Storing data is important, but storing data securely is even more critical. No matter what data you have, you need to make sure to take steps to store your data securely. However, depending on what type of data you have, you might need to take extra precautions. For example, letting your customer’s birthdays, email addresses, or passwords fall into the wrong hands could be very damaging to them and your business.
Why are data warehouses important to businesses?
A well-organized data warehouse can provide immense value to your business. Making data-driven decisions is incredibly important - but your insights are only as good as the data you have. An ordered and well-equipped data warehouse ensures that you can access and explore your data when you need it and how you need it.
A well-structured data warehouse means that you can quickly identify relationships between variables in your business. For example, it allows you to easily get your purchase level data and social media insights side-by-side in one visualization. You might notice a seasonality to your business that you haven’t explored before. A discovery like this can help you zero in on effective timing for advertising campaigns and even help in developing new and seasonal products or features for your customers.
What makes a data warehouse a powerful tool for your business is how it can store your company’s data cheaply, securely, and efficiently. You can easily join data and run complex analyses instantaneously, view historical records, perform complicated queries to dig deeper into your data, and understand how your business is doing and how you can improve - without shuffling through multiple sources, juggling spreadsheets, or (potentially worst of all) creating errors within your datasets that lead to wrong conclusions. For most businesses in this day and age, a data warehousing solution is no longer a luxury but rather, a cost of doing business.
Plus, further reading about other elements of the business intelligence process: