Data engineering grew out of the need to use big data to move businesses towards data-driven decision making, and it is evolving at a rapid pace. “Data engineering” and “data engineer” are relatively new terms that have become extremely popular over the last decade.
Tracing the History of Data
Before we dive deeper into data engineering and how it is impacting centuries-old businesses and startups alike, let's look at a brief history of events to see how it evolved over time. There’s a fascinating timeline by the World Economic Forum, and I am picking some critical moments from that list:
1958: Seed of Business Intelligence
IBM researcher Hans Peter Luhn defines Business Intelligence as “the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal.”
1965: Birth of First Data Center
The US Government plans the world’s first data center to store 742 million tax returns and 175 million sets of fingerprints on magnetic tapes.
1970: The Birth of RDBMS
IBM mathematician Edgar F. Codd [father of the relational model] presents his framework for storing and retrieving data, called the “relational database”.
1976: The Birth of ERP
Material Requirements Planning (MRP) systems become more commonly used across the business world and later evolve into the Enterprise Resource Planning (ERP) systems we know today, such as SAP and Oracle Financials.
1991: Birth of the World Wide Web
Tim Berners-Lee posted a summary of the World Wide Web project on several internet newsgroups, including alt.hypertext, which was for hypertext enthusiasts. The move marked the debut of the web as a publicly available service on the internet.
In 1997, Michael Lesk publishes his paper "How Much Information Is There in the World?", estimating the total at 12,000 petabytes and growing tenfold per year. He observes that much of this data is merely collected and never accessed by anyone, so no insight is derived from it.
Foundations of Big Data
In 1999, the Association for Computing Machinery publishes the article “Visually Exploring Gigabyte Datasets in Real Time”, one of the first published uses of the term “big data”. It notes that “the purpose of computing is insight, not numbers.”
In 2000, Peter Lyman and Hal Varian try to quantify the world's digital information and its growth. They conclude that “The world’s total yearly production of print, film, optical and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on Earth.”
From 2005 onward, Web 2.0 plays a key role in increasing the quantity of data collected on a daily basis.
The world’s servers process 9.57 zettabytes (9.57 trillion gigabytes) of information, equivalent to 12 gigabytes of information per person per day, according to the "How Much Information? 2010" report.
Working with Big Data
With that history in mind, it is clear that big data is a future businesses cannot ignore, and it is fundamentally changing how they operate. It's time to accept that “data is the new fuel” that runs businesses. But this also means they have to answer some common challenges around big data:
How to Store Big Data?
Storing big data with traditional RDBMS methods is not practical. An RDBMS may cope with the volume and velocity of the data, but it falls short on variety, because it is designed as a transactional store (OLTP) and expects data to conform to a particular schema.
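To make the variety problem concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for an RDBMS. The event records, table names and field names are all made up for illustration. A fixed schema rejects the record that doesn't fit and silently drops the extra field of another, while storing each record as a JSON document (roughly the idea behind document-style stores) keeps everything as it arrived.

```python
import json
import sqlite3

# Hypothetical incoming events; field names are illustrative only.
events = [
    {"user_id": 1, "action": "login"},
    {"user_id": 2, "action": "purchase", "amount": 19.99},  # extra field
    {"device": "sensor-7", "temperature_c": 21.5},           # different shape
]


def load_into_fixed_schema(conn, events):
    """Insert events into a rigid two-column table; return the rejected ones."""
    conn.execute(
        "CREATE TABLE events (user_id INTEGER NOT NULL, action TEXT NOT NULL)"
    )
    rejected = []
    for event in events:
        try:
            # Fields outside the schema (like "amount") are simply lost here.
            conn.execute(
                "INSERT INTO events (user_id, action) VALUES (?, ?)",
                (event.get("user_id"), event.get("action")),
            )
        except sqlite3.IntegrityError:
            # Missing columns become NULL and violate the NOT NULL constraints.
            rejected.append(event)
    return rejected


def load_as_documents(conn, events):
    """Store every event as a JSON document; return the number of rows stored."""
    conn.execute("CREATE TABLE raw_events (body TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO raw_events (body) VALUES (?)",
        [(json.dumps(event),) for event in events],
    )
    return conn.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]


conn = sqlite3.connect(":memory:")
rejected = load_into_fixed_schema(conn, events)
stored = load_as_documents(conn, events)
print(f"fixed schema rejected {len(rejected)} event(s); document table stored {stored}")
```

The trade-off is that the document table defers the schema question to whoever reads the data later, which is exactly where a data engineer's work begins.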
How to Process Big Data?
Traditionally, data warehouses were designed to mine large amounts of data for insights, and they work on the principle of ETL (Extract, Transform & Load). They too expect data to follow a particular schema, so they also fall short when it comes to variety.
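For readers who haven't seen ETL in practice, the three steps can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the CSV content, the column names and the `orders` table are invented, and an in-memory SQLite database plays the role of the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system.
RAW_CSV = """order_id,country,amount
1,us,10.50
2,DE,7.25
3,us,not-a-number
"""


def extract(text):
    """Extract: read raw rows from the source as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: cast types, normalise country codes, skip unparseable rows."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # row does not fit the target schema, so it is dropped
        clean.append((int(row["order_id"]), row["country"].upper(), amount))
    return clean


def load(conn, rows):
    """Load: write the cleaned rows into a table with a fixed schema."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, country TEXT NOT NULL, amount REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
totals = dict(conn.execute("SELECT country, SUM(amount) FROM orders GROUP BY country"))
print(totals)
```

Notice that the transform step has to decide up front what a valid row looks like. Anything that doesn't match is discarded before it ever reaches the warehouse, which is the schema limitation described above.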
How to Collect Real-Time Big Data?
Real-time data is crucial if businesses want to shift from instinct-driven to predictive and proactive decision making.
The need for businesses to leverage big data gave rise to NoSQL stores (document, key-value and graph-based), the cloud, data lakes and more, all of which were unheard of a decade earlier. It also created a need for professionals such as data scientists, data engineers and cloud architects.
Understanding the Data Engineer
Data engineering is the practice of creating a structure for how big data is collected, maintained and shared with different stakeholders, so they can derive business value from it.
A data engineer is the person responsible for building these structures.
Data Engineer Vs Data Scientist
Data Scientist is a role that people often confuse with Data Engineer.
The two roles do share some technologies, but they are very different roles serving different purposes.
Basic differences between Data Engineer & Scientist
1. A data engineer provides the structure and process for how data is collected, processed and shared with stakeholders. A data scientist, on the other hand, is the stakeholder who uses that data to provide insights to the business by leveraging statistical models.
2. A data engineer's core competencies include distributed computing, data pipelines and advanced programming. A data scientist's core competencies include machine learning, artificial intelligence, statistics and mathematics.
The data scientist role is much sought after these days. But a data scientist is only as good as their data, and that is taken care of by the data engineer.
With some of the basic definitions and differences out of the way, the next part of this blog post will discuss “How a Data Engineer can use Cloud to create Data Lakes”, which is at the core of building a big data practice at any enterprise.