Feb. 26, 2021
Staying ahead of the competition in the startup world means that processes need to be optimized and scaled effectively. In an effort to be more data-driven, companies are hiring data professionals at alarming rates. Data engineering job interviews have grown by 40 percent in the past year, according to Interview Query. While this number signifies an increased need for experienced data engineers, there are some myths about the role that need to be busted.
"AI" and "Big Data" are hot buzzwords in the industry right now. However, most companies don't even have access to the volume of data being described as "Big Data," and I would argue most companies don't need that amount of data; acquiring more data doesn't automatically translate into better performing data models or better business decisions. The questions your team should be asking is what data do you need to answer the incoming business questions, and can you satisfy those requirements with data already accessible by your team. Unless a robust team is in place that can wrangle messy disparate sources and produce sophisticated data models most organizations can unlock incremental value without dipping into the elusive world of "Big Data."
From an outsider's perspective, it may seem like a data engineering team's job is primarily to move data from source to target. While this is an accurate description of what the job entails, there is more nuance to this process than meets the eye. In addition to handling technical concerns around latency, storage, and cost, data engineers ensure an efficient data model and have to weigh the costs and tradeoffs of all potential solutions. Taking the job at face value ignores that a data warehouse is a product in itself and needs to be maintained the same way as the larger product it supports. If infrastructure tech debt is pushed off for too long, the house of cards can come crumbling down and a business could lose insight into time-sensitive performance.
Database and schema design are a portion of a full-stack software engineer's tool kit, but the gap between a product database and downstream analysis in a visualization tool is huge. Software engineers are focused on building new features and aren't primarily concerned with efficient reporting layer data models that analysts can use to tell a story to senior leadership. Data engineers construct the bridge between how a product is performing and the decisions to be made as the result of that performance. I explore the differences between a data engineer and software engineer in a previous blog post.
Despite what software salespeople will tell you, data cannot "just flow in" 99 percent of the time. Each tool has its own configuration and preferred data format, and odds are your company's data model does not inherently support the specific requirements of that tool. This is independent of the data security checkpoints you will need to put in place to ensure policy compliance. Data security and privacy are becoming increasingly important as more businesses are hacked and sensitive customer information is compromised. When the curtain is drawn back on the backend of a flashy UI, the implementation may be more complex than what was sold. It would be nice if we could just snap our fingers and have our data magically populate some fancy chart that was shown in the demo!
Data is a living and breathing entity. It can grow, be modified, and even be deleted retroactively. This intrinsic fact about data is commonly misunderstood and makes historical reporting difficult. However, there are methods described in Kimball's book The Data Warehouse Toolkit that can help provide clarity and structure to a constantly changing data source. There is no "right" way to capture data that is constantly changing and evolving, and the lack of a single source of truth can be frustrating, but the data engineer and the stakeholder have to agree on the definition and tradeoffs of the final solution.
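To make the idea concrete, here is a minimal sketch of one such method, the Type 2 slowly changing dimension, in which a change closes out the old row and adds a new one rather than overwriting history. The choice of this particular technique is my assumption about what is meant above, and the `CustomerVersion` record, its fields, and the helper functions are all hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CustomerVersion:
    # Hypothetical dimension row; the field names are illustrative only.
    customer_id: int
    plan: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_change(history: List[CustomerVersion], customer_id: int,
                 new_plan: str, changed_on: date) -> None:
    """Record a change by closing the current row and appending a new one."""
    current = next(
        (row for row in history
         if row.customer_id == customer_id and row.valid_to is None),
        None,
    )
    if current is not None:
        if current.plan == new_plan:
            return  # nothing changed, so keep the existing version open
        current.valid_to = changed_on  # close the old version, do not overwrite it
    history.append(CustomerVersion(customer_id, new_plan, changed_on))


def plan_as_of(history: List[CustomerVersion], customer_id: int,
               as_of: date) -> Optional[str]:
    """Return the plan that was in effect on a given date, if any."""
    for row in history:
        if row.customer_id != customer_id:
            continue
        still_open = row.valid_to is None or as_of < row.valid_to
        if row.valid_from <= as_of and still_open:
            return row.plan
    return None


history: List[CustomerVersion] = []
apply_change(history, 1, "free", date(2020, 1, 1))
apply_change(history, 1, "pro", date(2020, 6, 1))
```

With this layout, a report for March 2020 can ask `plan_as_of(history, 1, date(2020, 3, 1))` and still see the customer on the earlier plan. The tradeoff is more rows and more complex queries, which is exactly the kind of decision the engineer and stakeholder need to agree on.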
One may assume that because data engineers bring in multiple data sources from across an organization, they are experts on every single table in the warehouse. Fleshing out a robust data dictionary for the warehouse is extremely useful, but a project of this type is not always the highest priority in a high-growth, agile company. Therefore, data engineers don't always understand what data is being populated in the warehouse without further business context. Unless they personally helped generate a model or have worked with that type of data before, data engineers know as much about a random table as the stakeholder does.
Every job has misconceptions about it, depending on the organization and the function of the team, and data engineering is no different. Are there any other data misconceptions that you'd like me to debunk? Contact me.
James Roselle is a data engineer based in Boston.