Introduction to Data Engineering on Google Cloud Platform

Raw data is replicated and processed inside a data lake. To make that data usable, you run extract, transform, load (ETL) pipelines and store the results in a data warehouse. When choosing between data warehouse options, it helps to ask what the key considerations are. Whichever option you pick, the data warehouse will act as a sink for your pipelines (Google Cloud LLC, 2020).

BigQuery offers mechanisms for automated data transfer and relies on technologies such as SQL that teams already know and use to power business applications. This way, everybody has access to data insights. Using familiar tools such as Google Sheets, Looker, Tableau, Qlik, or Google Data Studio, you can create read-only shared data sources that both internal and external users can view and query.

BigQuery can also directly query data held in Cloud SQL, which hosts relational databases such as PostgreSQL, MySQL, and SQL Server. As long as your files are in a supported format such as CSV or Apache Parquet, you can also use BigQuery to directly query data in Cloud Storage. The real power comes when you can leave the data in place and still join it against other data in the data warehouse.
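As a rough illustration of querying Cloud SQL from BigQuery, the sketch below builds a statement around BigQuery's EXTERNAL_QUERY table function. It only assembles the SQL text and does not run it. The connection ID, table name, and the helper build_external_query are hypothetical names chosen for this example, and the sketch assumes a Cloud SQL connection has already been set up in BigQuery.

```python
def build_external_query(connection_id: str, inner_sql: str) -> str:
    """Wrap a Cloud SQL statement in BigQuery's EXTERNAL_QUERY table function."""
    # Escape backslashes and single quotes so the inner statement
    # survives as a single-quoted BigQuery string literal.
    escaped = inner_sql.replace("\\", "\\\\").replace("'", "\\'")
    return f"SELECT *\nFROM EXTERNAL_QUERY('{connection_id}', '{escaped}')"


# Hypothetical connection ID and table, used only for illustration.
federated_sql = build_external_query(
    "my-project.us.my-cloudsql-connection",
    "SELECT customer_id, created_at FROM customers",
)
print(federated_sql)
```

The inner statement runs in the source database, and BigQuery treats its result as a temporary table that can be joined with warehouse tables in the outer query.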

Federated Queries with BigQuery

An external data source (also known as a federated data source) is a data source that you can query directly even though the data is not stored in BigQuery (Google Cloud LLC, 2020). Rather than loading or streaming the data, you create a table that references the external data source.
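One way to create such a referencing table is BigQuery's CREATE EXTERNAL TABLE statement. The following is a minimal sketch that only generates the DDL text. The dataset, table, and bucket names are placeholders invented for this example, and the helper external_table_ddl is hypothetical rather than part of any Google library.

```python
from __future__ import annotations


def external_table_ddl(table: str, uris: list[str], fmt: str = "CSV",
                       skip_leading_rows: int = 1) -> str:
    """Build a CREATE EXTERNAL TABLE statement over files in Cloud Storage."""
    uri_list = ", ".join(f"'{u}'" for u in uris)
    options = [f"format = '{fmt}'", f"uris = [{uri_list}]"]
    if fmt == "CSV":
        # Header rows only make sense for CSV files.
        options.append(f"skip_leading_rows = {skip_leading_rows}")
    body = ",\n  ".join(options)
    return f"CREATE EXTERNAL TABLE `{table}`\nOPTIONS (\n  {body}\n)"


# Placeholder names, not real resources.
ddl = external_table_ddl("my_dataset.sales_external",
                         ["gs://my-bucket/sales/*.csv"])
print(ddl)
```

Because no column list is given, this sketch relies on BigQuery detecting the schema from the files. The data itself stays in Cloud Storage, and only the table definition lives in BigQuery.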

Transactional Databases vs Data Warehouses

Databases store transactional data effectively, making it accessible to end-users and other systems. Data warehouses collect data from databases and other sources to create a single repository which can serve as the basis for advanced reporting and analytics.

Figure 1. Relational Database Management System (source: Google Cloud)
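The difference between the two access patterns can be illustrated with a small, self-contained sketch. Here an in-memory SQLite database stands in for both systems purely to show the shape of the queries. It says nothing about how BigQuery or Cloud SQL execute them, and the orders table and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, region TEXT, amount REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders (region, amount, status) VALUES (?, ?, ?)",
    [("EU", 20.0, "paid"), ("EU", 35.0, "paid"), ("US", 50.0, "pending")],
)

# Transactional pattern: touch a single row, located by its key.
conn.execute("UPDATE orders SET status = 'paid' WHERE id = ?", (3,))
order = conn.execute(
    "SELECT id, status FROM orders WHERE id = ?", (3,)
).fetchone()

# Analytical pattern: scan many rows and aggregate them.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(order)   # (3, 'paid')
print(totals)  # [('EU', 55.0), ('US', 50.0)]
```

Transactional databases are tuned for the first kind of statement, while data warehouses are built for the second kind over much larger volumes of data.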

Governance policy

Data governance is an important part of managing your cloud infrastructure, particularly if you use multiple cloud providers. In certain sectors, regulatory compliance requires you to show where data is stored and how it is used. Additionally, access controls and other data governance mechanisms help ensure that only those authorized to use the data can do so.

BigQuery, Google’s data warehouse, integrates Google’s data governance and security primitives. For example, column-level fine-grained access control lets you manage data by assigning a policy tag based on the nature of the data itself (for example, personally identifiable information, or PII) and tracking it across several data containers (tables, datasets, and more). Moreover, with BigQuery table-level ACLs, permissions can be granted on an individual table.
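To make both controls more concrete, here is a hedged sketch. The first helper builds a column definition in the JSON schema layout BigQuery accepts, attaching a policy tag to a PII column. The second builds a GRANT statement for table-level access. The project, taxonomy, and policy tag IDs, the table name, and the user are all hypothetical placeholders, and both helpers are written for this example only.

```python
import json
from typing import Optional


def schema_field(name: str, field_type: str,
                 policy_tag: Optional[str] = None) -> dict:
    """Describe one column, optionally protected by a policy tag."""
    field = {"name": name, "type": field_type, "mode": "NULLABLE"}
    if policy_tag:
        field["policyTags"] = {"names": [policy_tag]}
    return field


def grant_table_viewer(table: str, member: str) -> str:
    """Build a statement granting read access on a single table."""
    return (
        "GRANT `roles/bigquery.dataViewer`\n"
        f"ON TABLE `{table}`\n"
        f"TO '{member}'"
    )


# Hypothetical policy tag resource name for PII columns.
PII_TAG = ("projects/my-project/locations/us/"
           "taxonomies/1234567890/policyTags/987654321")

schema = [
    schema_field("customer_id", "INTEGER"),
    schema_field("email", "STRING", policy_tag=PII_TAG),
]
print(json.dumps(schema, indent=2))
print(grant_table_viewer("my-project.sales.customers",
                         "user:analyst@example.com"))
```

In this sketch only the email column carries the tag, so access to it depends on permissions on the policy tag, while the GRANT statement controls who can read the table at all.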

Build production-ready pipelines

Ensure pipeline health and data cleanliness

Google Cloud has a fully managed Apache Airflow version known as Cloud Composer.

Apache Airflow is an open-source framework for managing workflows. It began at Airbnb in October 2014 as a way to handle the company’s increasingly complex workflows. Building Airflow enabled Airbnb to programmatically author and schedule its workflows, and to monitor them through the built-in Airflow user interface (Apache, 2020).

Cloud Composer lets the data engineering team orchestrate the pieces of the data engineering puzzle we’ve explored so far, and some that we haven’t yet come across. For example, when a new CSV file is dropped into Cloud Storage, you can have that event automatically kick off a data processing workflow that moves the data from the CSV file into your data warehouse.
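The event-driven trigger described above could be sketched roughly as follows. The function takes a Cloud Storage event (which carries the bucket and object name) and, for CSV files only, builds the request that would start a DAG run. It assumes the Airflow 2 stable REST API path for creating DAG runs. The DAG ID, bucket, and file names are hypothetical, and actually sending the request, including authentication against the Composer environment, is deliberately left out.

```python
from typing import Optional


def dag_run_request(event: dict, dag_id: str) -> Optional[dict]:
    """Turn a Cloud Storage event into a DAG-run request, or None to skip."""
    name = event.get("name", "")
    if not name.lower().endswith(".csv"):
        # Ignore objects that are not CSV files.
        return None
    return {
        "method": "POST",
        # Assumed: Airflow 2 stable REST API endpoint for new DAG runs.
        "path": f"/api/v1/dags/{dag_id}/dagRuns",
        # The DAG can read these values from its run configuration.
        "json": {"conf": {"bucket": event["bucket"], "object": name}},
    }


# Hypothetical event and DAG ID, for illustration only.
request = dag_run_request(
    {"bucket": "my-landing-bucket", "name": "sales/2020-06-01.csv"},
    dag_id="load_csv_to_warehouse",
)
print(request)
```

A small function like this would typically run in an event handler attached to the bucket, with the DAG itself doing the load into the data warehouse.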

References

Apache.org, 2020. Apache Beam Overview. [Online]
Available at: https://beam.apache.org/get-started/beam-overview/
[Accessed 2020].

Apache, 2020. Apache Airflow. [Online]
Available at: https://airflow.apache.org/
[Accessed 2020].

daCosta, F. & Santa, C., 2014. Rethinking the Internet of Things. s.l.:s.n.

Google Cloud LLC, 2020. Google Cloud. [Online]
Available at: https://cloud.google.com
[Accessed 2020].

Google LLC, 2014. Better data centers through machine learning. [Online]
Available at: https://googleblog.blogspot.com/2014/05/better-data-centers-through-machine.html
[Accessed 2020].

Statista, 2020. Internet of Things (IoT) active device connections installed base worldwide from 2015 to 2025*. [Online]
Available at: https://www.statista.com/statistics/1101442/iot-number-of-connected-devices-worldwide/
[Accessed 2020].
