Official Google
Cloud Certified

Professional Data Engineer

Study Guide

Dan Sullivan

To Katherine

Acknowledgments

I have been fortunate to work again with professionals from Waterside Productions, Wiley, and Google to create this Study Guide.

Carole Jelen, vice president of Waterside Productions, and Jim Minatel, associate publisher at John Wiley & Sons, continue to lead the effort to create Google Cloud certification guides. It was a pleasure to work with Gary Schwartz, project editor, who managed the process that got us from outline to a finished manuscript. Thanks to Christine O’Connor, senior production editor, for making the last stages of book development go as smoothly as they did.

I was also fortunate to work with Valerie Parham-Thompson again. Valerie’s technical review improved the clarity and accuracy of this book tremendously.

Thank you to the Google Cloud subject-matter experts who reviewed and contributed to the material in this book:

Damon A. Runion: Technical Curriculum Developer, Data Engineering
Julianne Cuneo: Data Analytics Specialist, Google Cloud
Geoff McGill: Customer Engineer, Data Analytics
Susan Pierce: Solutions Manager, Smart Analytics and AI
Rachel Levy: Cloud Data Specialist Lead
Dustin Williams: Data Analytics Specialist, Google Cloud
Gbenga Awodokun: Customer Engineer, Data and Marketing Analytics
Dilraj Kaur: Big Data Specialist
Rebecca Ballough: Data Analytics Manager, Google Cloud
Robert Saxby: Staff Solutions Architect
Niel Markwick: Cloud Solutions Architect
Sharon Dashet: Big Data Product Specialist
Barry Searle: Solution Specialist - Cloud Data Management
Jignesh Mehta: Customer Engineer, Cloud Data Platform and Advanced Analytics

My sons James and Nicholas were my first readers, and they helped me to get the manuscript across the finish line.

This book is dedicated to Katherine, my wife and partner in so many adventures.

About the Author

Dan Sullivan is a principal engineer and software architect. He specializes in data science, machine learning, and cloud computing. Dan is the author of the Official Google Cloud Certified Professional Architect Study Guide (Sybex, 2019), Official Google Cloud Certified Associate Cloud Engineer Study Guide (Sybex, 2019), NoSQL for Mere Mortals (Addison-Wesley Professional, 2015), and several LinkedIn Learning courses on databases, data science, and machine learning. Dan has certifications from Google and AWS, along with a Ph.D. in genetics and computational biology from Virginia Tech.

About the Technical Editor

Valerie Parham-Thompson has experience with a variety of open source data storage technologies, including MySQL, MongoDB, and Cassandra, as well as a foundation in web development in software-as-a-service (SaaS) environments. Her work in both development and operations in startups and traditional enterprises has led to solid expertise in web-scale data storage and data delivery.

Valerie has spoken at technical conferences on topics such as database security, performance tuning, and container management. She also often speaks at local meetups and volunteer events.

Valerie holds a bachelor’s degree from the Kenan-Flagler Business School at UNC-Chapel Hill, has certifications in MySQL and MongoDB, and is a Google Certified Professional Cloud Architect. She currently works in the Open Source Database Cluster at Pythian, headquartered in Ottawa, Ontario.

Follow Valerie’s contributions to technical blogs on Twitter at @dataindataout.

Introduction

The Google Cloud Certified Professional Data Engineer exam tests your ability to design, deploy, monitor, and adapt services and infrastructure for data-driven decision-making. The four primary areas of focus in this exam are as follows:

Designing data processing systems involves selecting storage technologies, including relational, analytical, document, and wide-column databases, such as Cloud SQL, BigQuery, Cloud Firestore, and Cloud Bigtable, respectively. You will also be tested on designing pipelines using services such as Cloud Dataflow, Cloud Dataproc, Cloud Pub/Sub, and Cloud Composer. The exam will test your ability to design distributed systems that may include hybrid clouds, message brokers, middleware, and serverless functions. Expect to see questions on migrating data warehouses from on-premises infrastructure to the cloud.

The building and operationalizing data processing systems portion of the exam will test your ability to support storage systems, pipelines, and infrastructure in a production environment. This will include using managed services for storage as well as batch and stream processing. It will also cover common operations such as data ingestion, data cleansing, transformation, and integrating data with other sources. As a data engineer, you are expected to understand how to provision resources, monitor pipelines, and test distributed systems.

Machine learning is an increasingly important topic. This exam will test your knowledge of prebuilt machine learning models available in GCP as well as the ability to deploy machine learning pipelines with custom-built models. You can expect to see questions about machine learning service APIs and data ingestion, as well as training and evaluating models. The exam uses machine learning terminology, so it is important to understand the nomenclature, especially terms such as model, supervised and unsupervised learning, regression, classification, and evaluation metrics.

The fourth domain of knowledge covered in the exam is ensuring solution quality, which includes security, scalability, efficiency, and reliability. Expect questions on ensuring privacy with data loss prevention techniques, encryption, and identity and access management, as well as questions about compliance with major regulations. The exam also tests a data engineer’s ability to monitor pipelines with Stackdriver, improve data models, and scale resources as needed. You may also encounter questions that assess your ability to design portable solutions and plan for future business requirements.

In your day-to-day experience with GCP, you may spend more time working on some data engineering tasks than others. This is expected. It does, however, mean that you should be aware of the exam topics with which you may be less familiar. Machine learning questions can be especially challenging to data engineers who work primarily on ingestion and storage systems. Similarly, those who spend a majority of their time developing machine learning models may need to invest more time studying schema modeling for NoSQL databases and designing fault-tolerant distributed systems.

What Does This Book Cover?

This book covers the topics outlined in the Google Cloud Professional Data Engineer exam guide available here:

cloud.google.com/certification/guides/data-engineer

Chapter 1: Selecting Appropriate Storage Technologies  This chapter covers selecting appropriate storage technologies, including mapping business requirements to storage systems; understanding the distinction between structured, semi-structured, and unstructured data models; and designing schemas for relational and NoSQL databases. By the end of the chapter, you should understand the various criteria that data engineers consider when choosing a storage technology.

Chapter 2: Building and Operationalizing Storage Systems  This chapter discusses how to deploy storage systems and perform data management operations, such as importing and exporting data, configuring access controls, and doing performance tuning. The services included in this chapter are as follows: Cloud SQL, Cloud Spanner, Cloud Bigtable, Cloud Firestore, BigQuery, Cloud Memorystore, and Cloud Storage. The chapter also includes a discussion of working with unmanaged databases, understanding storage costs and performance, and performing data lifecycle management.

Chapter 3: Designing Data Pipelines  This chapter describes high-level design patterns, along with some variations on those patterns, for data pipelines. It also reviews how GCP services such as Cloud Dataflow, Cloud Dataproc, Cloud Pub/Sub, and Cloud Composer are used to implement data pipelines, and it covers migrating data pipelines from an on-premises Hadoop cluster to GCP.

Chapter 4: Designing a Data Processing Solution  In this chapter, you learn about designing infrastructure for data engineering and machine learning, including how to do several tasks, such as choosing an appropriate compute service for your use case; designing for scalability, reliability, availability, and maintainability; using hybrid and edge computing architecture patterns and processing models; and migrating a data warehouse from on-premises data centers to GCP.

Chapter 5: Building and Operationalizing Processing Infrastructure  This chapter discusses managed processing resources, including those offered by App Engine, Cloud Functions, and Cloud Dataflow. The chapter also includes a discussion of how to use Stackdriver Metrics, Stackdriver Logging, and Stackdriver Trace to monitor processing infrastructure.

Chapter 6: Designing for Security and Compliance  This chapter introduces several key topics of security and compliance, including identity and access management, data security, encryption and key management, data loss prevention, and compliance.

Chapter 7: Designing Databases for Reliability, Scalability, and Availability  This chapter provides information on designing for reliability, scalability, and availability of three GCP databases: Cloud Bigtable, Cloud Spanner, and BigQuery. It also covers how to apply best practices for designing schemas, querying data, and taking advantage of the physical design properties of each database.

Chapter 8: Understanding Data Operations for Flexibility and Portability  This chapter describes how to use the Data Catalog, a metadata management service supporting the discovery and management of data in Google Cloud. It also introduces Cloud Dataprep, a preprocessing tool for transforming and enriching data, as well as Data Studio for visualizing data and Cloud Datalab for interactive exploration and scripting.

Chapter 9: Deploying Machine Learning Pipelines  Machine learning pipelines include several stages that begin with data ingestion and preparation and then perform data segregation followed by model training and evaluation. GCP provides multiple ways to implement machine learning pipelines. This chapter describes how to deploy ML pipelines using general-purpose computing resources, such as Compute Engine and Kubernetes Engine. Managed services, such as Cloud Dataflow and Cloud Dataproc, are also available, as well as specialized machine learning services, such as AI Platform, formerly known as Cloud ML.

Chapter 10: Choosing Training and Serving Infrastructure  This chapter focuses on choosing the appropriate training and serving infrastructure for your needs when serverless or specialized AI services are not a good fit for your requirements. It discusses distributed and single-machine infrastructure, the use of edge computing for serving machine learning models, and the use of hardware accelerators.

Chapter 11: Measuring, Monitoring, and Troubleshooting Machine Learning Models  This chapter focuses on key concepts in machine learning, including machine learning terminology and core concepts and common sources of error in machine learning. Machine learning is a broad discipline with many areas of specialization. This chapter provides you with a high-level overview to help you pass the Professional Data Engineer exam, but it is not a substitute for learning machine learning from resources designed for that purpose.

Chapter 12: Leveraging Prebuilt ML Models as a Service  This chapter describes Google Cloud Platform options for using pretrained machine learning models to help developers build and deploy intelligent services quickly. The services are broadly grouped into sight, conversation, language, and structured data. These services are available through APIs or through Cloud AutoML services.

Interactive Online Learning Environment and TestBank

Learning the material in the Official Google Cloud Certified Professional Data Engineer Study Guide is an important part of preparing for the Professional Data Engineer certification exam, but we also provide additional tools to help you prepare. The online TestBank will help you understand the types of questions that will appear on the certification exam.

The sample tests in the TestBank include all the questions in each chapter as well as the questions from the assessment test. In addition, there are two practice exams with 50 questions each. You can use these tests to evaluate your understanding and identify areas that may require additional study.

The flashcards in the TestBank will push the limits of what you should know for the certification exam. Over 100 questions are provided in digital format. Each flashcard has one question and one correct answer.

The online glossary is a searchable list of key terms introduced in this Study Guide that you should know for the Professional Data Engineer certification exam.

To start using these to study for the Google Cloud Certified Professional Data Engineer exam, go to www.wiley.com/go/sybextestprep and register your book to receive your unique PIN. Once you have the PIN, return to www.wiley.com/go/sybextestprep, find your book, and click Register, or log in and follow the link to register a new account or add this book to an existing account.

Additional Resources

People learn in different ways. For some, a book is an ideal way to study, whereas other learners may find video and audio resources a more efficient way to study. A combination of resources may be the best option for many of us. In addition to this Study Guide, here are some other resources that can help you prepare for the Google Cloud Professional Data Engineer exam:

The best way to prepare for the exam is to perform the tasks of a data engineer and work with the Google Cloud Platform.

 Exam objectives are subject to change at any time without prior notice and at Google’s sole discretion. Please visit the Google Cloud Professional Data Engineer website (https://cloud.google.com/certification/data-engineer) for the most current listing of exam objectives.

Objective Map

Objective Chapter
Section 1: Designing data processing systems
1.1 Selecting the appropriate storage technologies 1
1.2 Designing data pipelines 2, 3
1.3 Designing a data processing solution 4
1.4 Migrating data warehousing and data processing 4
Section 2: Building and operationalizing data processing systems
2.1 Building and operationalizing storage systems 2
2.2 Building and operationalizing pipelines 3
2.3 Building and operationalizing infrastructure 5
Section 3: Operationalizing machine learning models
3.1 Leveraging prebuilt ML models as a service 12
3.2 Deploying an ML pipeline 9
3.3 Choosing the appropriate training and serving infrastructure 10
3.4 Measuring, monitoring, and troubleshooting machine learning models 11
Section 4: Ensuring solution quality
4.1 Designing for security and compliance 6
4.2 Ensuring scalability and efficiency 7
4.3 Ensuring reliability and fidelity 8
4.4 Ensuring flexibility and portability 8

Assessment Test

  1. You are migrating your machine learning operations to GCP and want to take advantage of managed services. You have been managing a Spark cluster because you use the MLlib library extensively. Which GCP managed service would you use?

    1. Cloud Dataprep
    2. Cloud Dataproc
    3. Cloud Dataflow
    4. Cloud Pub/Sub
  2. Your team is designing a database to store product catalog information. They have determined that you need to use a database that supports flexible schemas and transactions. What service would you expect to use?

    1. Cloud SQL
    2. BigQuery
    3. Cloud Firestore
    4. Cloud Storage
  3. Your company has been losing market share because competitors are attracting your customers with a more personalized experience on their e-commerce platforms, including providing recommendations for products that might be of interest to them. The CEO has stated that your company will provide equivalent services within 90 days. What GCP service would you use to help meet this objective?

    1. Cloud Bigtable
    2. Cloud Storage
    3. AI Platform
    4. Cloud Datastore
  4. The finance department at your company has been archiving data on premises. They no longer want to maintain a costly dedicated storage system. They would like to store up to 300 TB of data for 10 years. The data will likely not be accessed at all. They also want to minimize cost. What storage service would you recommend?

    1. Cloud Storage multi-regional storage
    2. Cloud Storage Nearline storage
    3. Cloud Storage Coldline storage
    4. Cloud Bigtable
  5. You will be developing machine learning models using sensitive data. Your company has several policies regarding protecting sensitive data, including requiring enhanced security on virtual machines (VMs) processing sensitive data. Which GCP service would you look to for meeting those requirements?

    1. Identity and access management (IAM)
    2. Cloud Key Management Service
    3. Cloud Identity
    4. Shielded VMs
  6. You have developed a machine learning algorithm for identifying objects in images. Your company has a mobile app that allows users to upload images and get back a list of identified objects. You need to implement the mechanism to detect when a new image is uploaded to Cloud Storage and invoke the model to perform the analysis. Which GCP service would you use for that?

    1. Cloud Functions
    2. Cloud Storage Nearline
    3. Cloud Dataflow
    4. Cloud Dataproc
  7. An IoT system streams data to a Cloud Pub/Sub topic for ingestion, and the data is processed in a Cloud Dataflow pipeline before being written to Cloud Bigtable. Latency is increasing as more data is added, even though nodes are not at maximum utilization. What would you look for first as a possible cause of this problem?

    1. Too many nodes in the cluster
    2. A poorly designed row key
    3. Too many column families
    4. Too many indexes being updated during write operations
  8. A health and wellness startup in Canada has been more successful than expected. Investors are pushing the founders to expand into new regions outside of North America. The CEO and CTO are discussing the possibility of expanding into Europe. The app offered by the startup collects personal information, storing some locally on the user’s device and some in the cloud. What regulation will the startup need to plan for before expanding into the European market?

    1. HIPAA
    2. PCI-DSS
    3. GDPR
    4. SOX
  9. Your company has been collecting vehicle performance data for the past year and now has 500 TB of data. Analysts at the company want to analyze the data to understand performance differences better across classes of vehicles. The analysts are advanced SQL users, but not all have programming experience. They want to minimize administrative overhead by using a managed service, if possible. What service might you recommend for conducting preliminary analysis of the data?

    1. Compute Engine
    2. Kubernetes Engine
    3. BigQuery
    4. Cloud Functions
  10. An airline is moving its luggage-tracking applications to Google Cloud. There are many requirements, including support for SQL and strong consistency. The database will be accessed by users in the United States, Europe, and Asia. The database will store approximately 50 TB in the first year and grow at approximately 10 percent a year after that. What managed database service would you recommend?

    1. Cloud SQL
    2. BigQuery
    3. Cloud Spanner
    4. Cloud Dataflow
  11. You are using Cloud Firestore to store data about online game players’ state while in a game. The state information includes health score, a set of possessions, and a list of team members collaborating with the player. You have noticed that the size of the raw data in the database is approximately 2 TB, but the amount of space used by Cloud Firestore is almost 5 TB. What could be causing the need for so much more space?

    1. The data model has been denormalized.
    2. There are multiple indexes.
    3. Nodes in the database cluster are misconfigured.
    4. There are too many column families in use.
  12. You have a BigQuery table with data about customer purchases, including the date of purchase, the type of product purchased, the product name, and several other descriptive attributes. There is approximately three years of data. You tend to query data by month and then by customer. You would like to minimize the amount of data scanned. How would you organize the table?

    1. Partition by purchase date and cluster by customer
    2. Partition by purchase date and cluster by product
    3. Partition by customer and cluster by product
    4. Partition by customer and cluster by purchase date
  13. You are currently using Java to implement an ELT pipeline in Hadoop. You’d like to replace your Java programs with a managed service in GCP. Which would you use?

    1. Data Studio
    2. Cloud Dataflow
    3. Cloud Bigtable
    4. BigQuery
  14. A group of attorneys has hired you to help them categorize over a million documents in an intellectual property case. The attorneys need to isolate documents that are relevant to a patent that the plaintiffs argue has been infringed. The attorneys have 50,000 labeled examples of documents, and when the model is evaluated on training data, it performs quite well. However, when evaluated on test data, it performs quite poorly. What would you try to improve the performance?

    1. Perform feature engineering
    2. Perform validation testing
    3. Add more data
    4. Regularization
  15. Your company is migrating from an on-premises pipeline that uses Apache Kafka for ingesting data and MongoDB for storage. What two managed services would you recommend as replacements for these?

    1. Cloud Dataflow and Cloud Bigtable
    2. Cloud Dataprep and Cloud Pub/Sub
    3. Cloud Pub/Sub and Cloud Firestore
    4. Cloud Pub/Sub and BigQuery
  16. A group of data scientists is using Hadoop to store and analyze IoT data. They have decided to use GCP because they are spending too much time managing the Hadoop cluster. They are particularly interested in using services that would allow them to port their models and machine learning workflows to other clouds. What service would you use as a replacement for their existing platform?

    1. BigQuery
    2. Cloud Storage
    3. Cloud Dataproc
    4. Cloud Spanner
  17. You are analyzing several datasets and will likely use them to build regression models. You will receive additional datasets, so you’d like to have a workflow to transform the raw data into a form suitable for analysis. You’d also like to work with the data in an interactive manner using Python. What services would you use in GCP?

    1. Cloud Dataflow and Data Studio
    2. Cloud Dataflow and Cloud Datalab
    3. Cloud Dataprep and Data Studio
    4. Cloud Datalab and Data Studio
  18. You have a large number of files that you would like to store for several years. The files will be accessed frequently by users around the world. You decide to store the data in multi-regional Cloud Storage. You want users to be able to view files and their metadata in a Cloud Storage bucket. What role would you assign to those users? (Assume you are practicing the principle of least privilege.)

    1. roles/storage.objectCreator
    2. roles/storage.objectViewer
    3. roles/storage.admin
    4. roles/storage.bucketList
  19. You have built a deep learning neural network to perform multiclass classification. You find that the model is overfitting. Which of the following would not be used to reduce overfitting?

    1. Dropout
    2. L2 Regularization
    3. L1 Regularization
    4. Logistic regression
  20. Your company would like to start experimenting with machine learning, but no one in the company is experienced with ML. Analysts in the marketing department have identified some data in their relational database that they think may be useful for training a model. What would you recommend that they try first to build proof-of-concept models?

    1. AutoML Tables
    2. Kubeflow
    3. Cloud Firestore
    4. Spark MLlib
  21. You have several large deep learning networks that you have built using TensorFlow. The models use only standard TensorFlow components. You have been running the models on an n1-highcpu-64 VM, but the models are taking longer to train than you would like. What would you try first to accelerate the model training?

    1. GPUs
    2. TPUs
    3. Shielded VMs
    4. Preemptible VMs
  22. Your company wants to build a data lake to store data in its raw form for extended periods of time. The data lake should provide access controls, virtually unlimited storage, and the lowest cost possible. Which GCP service would you suggest?

    1. Cloud Bigtable
    2. BigQuery
    3. Cloud Storage
    4. Cloud Spanner
  23. Auditors have determined that your company’s processes for storing, processing, and transmitting sensitive data are insufficient. They believe that additional measures must be taken to ensure that sensitive information, such as personally identifiable government-issued numbers, is not disclosed. They suggest masking or removing sensitive data before it is transmitted outside the company. What GCP service would you recommend?

    1. Data loss prevention API
    2. In-transit encryption
    3. Storing sensitive information in Cloud Key Management
    4. Cloud Dataflow
  24. You are using Cloud Functions to start the processing of images as they are uploaded into Cloud Storage. In the past, there have been spikes in the number of images uploaded, and many instances of the Cloud Function were created at those times. What can you do to prevent too many instances from starting?

    1. Use the --max-limit parameter when deploying the function.
    2. Use the --max-instances parameter when deploying the function.
    3. Configure the --max-instance parameter in the resource hierarchy.
    4. Nothing. There is no option to limit the number of instances.
  25. You have several analysis programs running in production. Sometimes they are failing, but there is no apparent pattern to the failures. You’d like to use a GCP service to record custom information from the programs so that you can better understand what is happening. Which service would you use?

    1. Stackdriver Debugger
    2. Stackdriver Logging
    3. Stackdriver Monitoring
    4. Stackdriver Trace
  26. The CTO of your company is concerned about the rising costs of maintaining your company’s enterprise data warehouse. The current data warehouse runs in a PostgreSQL instance. You would like to migrate to GCP and use a managed service that reduces operational overhead and one that will scale to meet future needs of up to 3 PB. What service would you recommend?

    1. Cloud SQL using PostgreSQL
    2. BigQuery
    3. Cloud Bigtable
    4. Cloud Spanner