[UPDATED 2024] Read Databricks-Certified-Professional-Data-Engineer Study Guide Cover to Cover as Literally [Q12-Q34]

100% Real & Accurate Databricks-Certified-Professional-Data-Engineer Questions and Answers with Free and Fast Updates

The Databricks-Certified-Professional-Data-Engineer exam consists of multiple-choice questions and hands-on tasks that test the candidate's practical knowledge of Databricks. It covers a wide range of topics, such as data engineering, data processing, ETL, data modeling, data warehousing, data governance, and data security, and is designed to evaluate the candidate's ability to design and implement scalable data pipelines using Databricks.

NEW QUESTION # 12
Create a schema called bronze using location '/mnt/delta/bronze', and check if the schema exists before creating.

A. if IS_SCHEMA('bronze'): CREATE SCHEMA bronze LOCATION '/mnt/delta/bronze'
B. Schema creation is not available in metastore, it can only be done in Unity catalog UI
C. CREATE SCHEMA IF NOT EXISTS bronze LOCATION '/mnt/delta/bronze'
D. Cannot create schema without a database
E. CREATE SCHEMA bronze IF NOT EXISTS LOCATION '/mnt/delta/bronze'

Answer: C

Explanation:
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-schema.html
CREATE SCHEMA [ IF NOT EXISTS ] schema_name [ LOCATION schema_directory ]
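
As a rough sketch of how the correct option might be issued from a notebook, the helper below only builds the statement string. The function name create_schema_sql is hypothetical, and the commented-out spark.sql call assumes an active SparkSession named spark.

```python
def create_schema_sql(schema_name: str, location: str) -> str:
    """Build an idempotent CREATE SCHEMA statement (hypothetical helper)."""
    # IF NOT EXISTS goes right after CREATE SCHEMA, before the schema name.
    return f"CREATE SCHEMA IF NOT EXISTS {schema_name} LOCATION '{location}'"


statement = create_schema_sql("bronze", "/mnt/delta/bronze")
print(statement)

# In a Databricks notebook (assumes an active SparkSession named `spark`):
# spark.sql(statement)
```

Because of the IF NOT EXISTS clause, running the statement a second time should be a no-op rather than an error.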

NEW QUESTION # 13
A member of the data engineering team has submitted a short notebook that they wish to schedule as part of a larger data pipeline. Assume that the commands provided below produce the logically correct results when run as presented.

Which command should be removed from the notebook before scheduling it as a job?

A. Cmd 4
B. Cmd 3
C. Cmd 6
D. Cmd 5
E. Cmd 2

Answer: C

Explanation:
Cmd 6 is the command that should be removed from the notebook before scheduling it as a job. This command is selecting all the columns from the finalDF dataframe and displaying them in the notebook. This is not necessary for the job, as the finalDF dataframe is already written to a table in Cmd 7. Displaying the dataframe in the notebook will only consume resources and time, and it will not affect the output of the job.
Therefore, Cmd 6 is redundant and should be removed.
The other commands are essential for the job, as they perform the following tasks:
Cmd 1: Reads the raw_data table into a Spark dataframe called rawDF.
Cmd 2: Prints the schema of the rawDF dataframe, which is useful for debugging and understanding the data structure.
Cmd 3: Selects all the columns from the rawDF dataframe, as well as the nested columns from the values struct column, and creates a new dataframe called flattenedDF.
Cmd 4: Drops the values column from the flattenedDF dataframe, as it is no longer needed after flattening, and creates a new dataframe called finalDF.
Cmd 5: Explains the physical plan of the finalDF dataframe, which is useful for optimizing and tuning the performance of the job.
Cmd 7: Writes the finalDF dataframe to a table called flat_data, using the append mode to add new data to the existing table.

NEW QUESTION # 14
A data engineer needs to create a database called customer360 at the location /customer/customer360. The data engineer is unsure if one of their colleagues has already created the database.
Which of the following commands should the data engineer run to complete this task?

A. CREATE DATABASE IF NOT EXISTS customer360 DELTA LOCATION '/customer/customer360';
B. CREATE DATABASE customer360 DELTA LOCATION '/customer/customer360';
C. CREATE DATABASE customer360 LOCATION '/customer/customer360';
D. CREATE DATABASE IF NOT EXISTS customer360 LOCATION '/customer/customer360';
E. CREATE DATABASE IF NOT EXISTS customer360;

Answer: D

NEW QUESTION # 15
Kevin is the owner of the schema sales. Steve wants to create a new table called regional_sales in the sales schema, so Kevin grants the create table permission to Steve. Steve creates the new table regional_sales in the sales schema. Who is the owner of the table regional_sales?

A. Kevin is the owner of sales schema, all the tables in the schema will be owned by Kevin
B. By default ownership is assigned DBO
C. Kevin and Steve both are owners of table
D. By default ownership is assigned to DEFAULT_OWNER
E. Steve is the owner of the table

Answer: E

Explanation:
The user who creates an object becomes its owner, regardless of who owns the parent object.

NEW QUESTION # 16
Which one of the following is not a Databricks lakehouse object?

A. Views
B. Database/Schemas
C. Functions
D. Catalog
E. Stored Procedures
F. Tables

Answer: E

Explanation:
The answer is Stored Procedures.
The Databricks lakehouse does not support stored procedures.

NEW QUESTION # 17
Which statement regarding stream-static joins and static Delta tables is correct?

A. Stream-static joins cannot use static Delta tables because of consistency issues.
B. The checkpoint directory will be used to track state information for the unique keys present in the join.
C. Each microbatch of a stream-static join will use the most recent version of the static Delta table as of the job's initialization.
D. The checkpoint directory will be used to track updates to the static Delta table.
E. Each microbatch of a stream-static join will use the most recent version of the static Delta table as of each microbatch.

Answer: E

Explanation:
This is the correct answer because Structured Streaming supports stream-static joins in which the static side is a Delta table. The table is "static" only in the sense that it is read as a batch source rather than as a stream; other writers can still update it. Each microbatch of a stream-static join uses the most recent version of the static Delta table as of that microbatch, which means it reflects any changes committed to the static Delta table before the start of each microbatch. Verified References: [Databricks Certified Data Engineer Professional], under "Structured Streaming" section; Databricks Documentation, under "Stream and static joins" section.
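
The per-microbatch behaviour can be illustrated with a toy simulation in plain Python. This is not Spark code: VersionedTable and join_microbatch are hypothetical stand-ins that only mimic the idea of re-reading the latest table version for each microbatch.

```python
class VersionedTable:
    """Toy stand-in for a Delta table: every commit creates a new version."""

    def __init__(self, rows):
        self.versions = [dict(rows)]

    def commit(self, rows):
        # A new version is the previous snapshot plus the updated rows.
        self.versions.append({**self.versions[-1], **rows})

    def latest(self):
        return self.versions[-1]


def join_microbatch(batch, static_table):
    """Join one microbatch against the static table's latest version."""
    snapshot = static_table.latest()  # re-resolved for every microbatch
    return [(key, value, snapshot.get(key)) for key, value in batch]


customers = VersionedTable({1: "bronze-tier"})
first = join_microbatch([(1, "order-a")], customers)
customers.commit({1: "gold-tier"})  # table updated between microbatches
second = join_microbatch([(1, "order-b")], customers)
print(first)
print(second)
```

The second microbatch sees the updated row even though the stream itself was never restarted, which is the behaviour option E describes.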

NEW QUESTION # 18
Which of the following developer operations in CI/CD can only be implemented through a Git provider when using Databricks Repos?

A. Commit and push code
B. Pull request and review process
C. Create a new branch
D. Trigger Databricks Repos pull API to update the latest version
E. Create and edit code

Answer: B

Explanation:
The answer is the pull request and review process. Note that the question asks for steps that are implemented in the Git provider, not in Databricks Repos.
When building a CI/CD workflow, Databricks Repos and the Git provider play different roles. Creating a branch, creating and editing code, committing and pushing, and pulling the latest version can all be done in Databricks Repos, while the pull request and review process is done in a Git provider such as GitHub or Azure DevOps.

NEW QUESTION # 19
How do you access or use tables in the unity catalog?

A. catalog_name.database_name.schema_name.table_name
B. catalog_name.table_name
C. catalog_name.schema_name.table_name
D. schema_name.catalog_name.table_name
E. schema_name.table_name

Answer: C

Explanation:
The answer is catalog_name.schema_name.table_name.

Note: Database and schema are analogous; the terms are used interchangeably in Unity Catalog.
FYI, a catalog is registered under a metastore. By default, every workspace has a legacy Hive metastore, which appears as the hive_metastore catalog; with Unity Catalog you can create metastores and share them across multiple workspaces.
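
A minimal sketch of composing the three-level name in Python follows. The helper qualified_name and the catalog name main are assumptions made for illustration; the schema and table names are borrowed from question 15.

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Return a three-level Unity Catalog name (hypothetical helper)."""
    # Backticks keep names with special characters valid in Spark SQL.
    return ".".join(f"`{part}`" for part in (catalog, schema, table))


name = qualified_name("main", "sales", "regional_sales")
query = f"SELECT * FROM {name}"
print(query)
```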

NEW QUESTION # 20
You use AUTO LOADER to process millions of files a day and notice that the load process is slow, so you scale up the Databricks cluster, but the performance of Auto Loader still does not improve. What is the best way to resolve this?

A. AUTO LOADER is not suitable to process millions of files a day
B. Copy the data from cloud storage to local disk on the cluster for faster access
C. Increase the maxFilesPerTrigger option to a sufficiently high number
D. Merge files to one large file
E. Setup a second AUTO LOADER process to process the data

Answer: C

Explanation:
The default value of maxFilesPerTrigger is 1000. It can be increased to a much higher number, but processing the larger batches will require much larger compute.

https://docs.databricks.com/ingestion/auto-loader/options.html
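
As a hedged sketch, the option could be passed to an Auto Loader stream as shown below. The helper name auto_loader_options, the value 10000, and the landing path are illustrative assumptions; the stream itself is left commented out because it needs a Databricks runtime and a SparkSession named spark.

```python
def auto_loader_options(file_format: str, max_files_per_trigger: int) -> dict:
    """Collect Auto Loader reader options (hypothetical helper)."""
    return {
        "cloudFiles.format": file_format,
        # Upper bound on the number of new files picked up per microbatch.
        "cloudFiles.maxFilesPerTrigger": str(max_files_per_trigger),
    }


options = auto_loader_options("json", 10_000)
print(options)

# On Databricks (path is a placeholder; a schema or a
# cloudFiles.schemaLocation option is typically needed as well):
# stream = (spark.readStream.format("cloudFiles")
#           .options(**options)
#           .load("/path/to/landing"))
```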

NEW QUESTION # 21
Projecting a multi-dimensional dataset onto which vector has the greatest variance?

A. first eigenvector
B. second principal component
C. second eigenvector
D. first principal component
E. not enough information given to answer

Answer: D

Explanation:
The method based on principal component analysis (PCA) evaluates the features according to the projection of the largest eigenvector of the correlation matrix on the initial dimensions; the method based on Fisher's linear discriminant analysis evaluates them according to the magnitude of the components of the discriminant vector.
The first principal component corresponds to the greatest variance in the data, by definition. If we project the data onto the first principal component line, the data is more spread out (higher variance) than if projected onto any other line, including other principal components.
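
The claim can be checked numerically on a small made-up data set. The sketch below assumes 2-D points and uses plain power iteration on the covariance matrix instead of a library PCA routine; the data values and function names are illustrative only.

```python
import math


def variance_along(points, direction):
    """Variance of 2-D points projected onto the given direction."""
    norm = math.hypot(direction[0], direction[1])
    ux, uy = direction[0] / norm, direction[1] / norm
    projections = [x * ux + y * uy for x, y in points]
    mean = sum(projections) / len(projections)
    return sum((p - mean) ** 2 for p in projections) / len(projections)


def first_principal_component(points, iterations=200):
    """Top eigenvector of the 2x2 covariance matrix via power iteration."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    v = (1.0, 0.5)  # arbitrary starting vector
    for _ in range(iterations):
        wx = sxx * v[0] + sxy * v[1]
        wy = sxy * v[0] + syy * v[1]
        norm = math.hypot(wx, wy)
        v = (wx / norm, wy / norm)
    return v


points = [(1, 1), (2, 2.1), (3, 2.9), (4, 4.2), (5, 4.8)]
pc1 = first_principal_component(points)
pc2 = (-pc1[1], pc1[0])  # in 2-D the second component is orthogonal to the first
print(variance_along(points, pc1), variance_along(points, pc2))
```

The variance along pc1 should come out larger than along pc2 or along either coordinate axis.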

NEW QUESTION # 22
The default retention threshold of VACUUM is 7 days. The internal audit team has asked that certain tables retain at least 365 days as part of a compliance requirement. Which of the settings below is needed to implement this?

A. ALTER TABLE table_name set TBLPROPERTIES (delta.deletedFileRetentionDuration= 'interval 365 days')
B. ALTER TABLE table_name set EXENDED TBLPROPERTIES (delta.vaccum.duration= 'interval 365 days')
C. MODIFY TABLE table_name set TBLPROPERTY (delta.maxRetentionDays = 'interval 365 days')
D. ALTER TABLE table_name set EXENDED TBLPROPERTIES (delta.deletedFileRetentionDuration= 'interval 365 days')

Answer: A

Explanation:
ALTER TABLE table_name SET TBLPROPERTIES ( property_key [ = ] property_val [, ...] )
TBLPROPERTIES allow you to set key-value pairs.
Reference: Table properties and table options (Databricks SQL) | Databricks on AWS
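
A small sketch of generating the statement for a given retention period follows. The helper set_retention_sql and the table name audit_events are hypothetical; running the statement would need a SparkSession, so that call is left as a comment.

```python
def set_retention_sql(table_name: str, days: int) -> str:
    """Build the ALTER TABLE statement that extends VACUUM retention."""
    return (
        f"ALTER TABLE {table_name} SET TBLPROPERTIES "
        f"(delta.deletedFileRetentionDuration = 'interval {days} days')"
    )


statement = set_retention_sql("audit_events", 365)
print(statement)

# On Databricks (assumes an active SparkSession named `spark`):
# spark.sql(statement)
```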

NEW QUESTION # 23
A nightly job ingests data into a Delta Lake table using the following code:

The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline.
Which code snippet completes this function definition?
def new_records():

A. return spark.read.option("readChangeFeed", "true").table ("bronze")
B.
C. return spark.readStream.load("bronze")
D. return spark.readStream.table("bronze")

Answer: B

Explanation:
This is the correct answer because it completes the function definition so that it returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline. The object returned by this function is a DataFrame that contains change events from a Delta Lake table that has change data feed enabled. The readChangeFeed option is set to true to indicate that the DataFrame should read changes from the table, and the table argument specifies the name of the table to read changes from. In addition to the table's own columns, the DataFrame includes the metadata columns _change_type, _commit_version, and _commit_timestamp. The _change_type column indicates the type of change event: insert, update_preimage, update_postimage, or delete. The _commit_version and _commit_timestamp columns identify the table version and the time at which the change was committed. Verified References: [Databricks Certified Data Engineer Professional], under "Delta Lake" section; Databricks Documentation, under "Read changes in batch queries" section.

NEW QUESTION # 24
A production workload incrementally applies updates from an external Change Data Capture feed to a Delta Lake table as an always-on Structured Stream job. When data was initially migrated for this table, OPTIMIZE was executed and most data files were resized to 1 GB. Auto Optimize and Auto Compaction were both turned on for the streaming production job. Recent review of data files shows that most data files are under 64 MB, although each partition in the table contains at least 1 GB of data and the total table size is over 10 TB.
Which of the following likely explains these smaller file sizes?

A. Databricks has autotuned to a smaller target file size to reduce duration of MERGE operations
B. Databricks has autotuned to a smaller target file size based on the amount of data in each partition
C. Databricks has autotuned to a smaller target file size based on the overall size of data in the table
D. Z-order indices calculated on the table are preventing file compaction
E. Bloom filter indices calculated on the table are preventing file compaction

Answer: A

Explanation:
This is the correct answer because Databricks has a feature called Auto Optimize, which automatically improves the layout of Delta Lake tables by writing better-sized files and coalescing small files into larger ones. However, Auto Optimize also considers the trade-off between file size and merge performance, and may choose a smaller target file size to reduce the duration of MERGE operations, especially for streaming workloads that frequently update existing records. Therefore, it is likely that Databricks has autotuned to a smaller target file size based on the characteristics of the streaming production job. Verified References: [Databricks Certified Data Engineer Professional], under "Delta Lake" section; Databricks Documentation, under "Auto Optimize" section.

NEW QUESTION # 25
Which of the following is true, when building a Databricks SQL dashboard?

A. Only one visualization can be developed with one query result
B. More than one visualization can be developed using a single query result
C. A dashboard can only use results from one query
D. A dashboard can only connect to one schema/Database
E. A dashboard can only have one refresh schedule

Answer: B

Explanation:
The answer is: more than one visualization can be developed using a single query result.
In the query editor pane, the + Add visualization tab can be used to create many visualizations from a single query result.

NEW QUESTION # 26
Which of the following locations hosts the driver and worker nodes of a Databricks-managed cluster?

A. Databricks Filesystem
B. Control plane
C. Databricks web application
D. JDBC data source
E. Data plane

Answer: E

Explanation:
The answer is Data plane, which is where compute (all-purpose clusters, job clusters, DLT) runs; this is generally the customer's cloud account. There is one exception: SQL warehouses. Currently three types of SQL warehouse compute are available (classic, pro, serverless). For classic and pro, the compute is located in the customer's cloud account, but serverless compute is located in the Databricks cloud account.

NEW QUESTION # 27
The Databricks workspace administrator has configured interactive clusters for each of the data engineering groups. To control costs, clusters are set to terminate after 30 minutes of inactivity. Each user should be able to execute workloads against their assigned clusters at any time of the day.
Assuming users have been added to a workspace but not granted any permissions, which of the following describes the minimal permissions a user would need to start and attach to an already configured cluster.

A. "Can Restart" privileges on the required cluster
B. "Can Manage" privileges on the required cluster
C. Cluster creation allowed. "Can Restart" privileges on the required cluster
D. Cluster creation allowed. "Can Attach To" privileges on the required cluster
E. Workspace Admin privileges, cluster creation allowed. "Can Attach To" privileges on the required cluster

Answer: A

Explanation:
https://learn.microsoft.com/en-us/azure/databricks/security/auth-authz/access-control/cluster-acl
https://docs.databricks.com/en/security/auth-authz/access-control/cluster-acl.html

NEW QUESTION # 28
Which of the following Python statements can be used to replace the schema name and table name in the query statement?

A. table_name = "sales"
schema_name = "bronze"
query = f"select * from + schema_name +"."+table_name"
B. table_name = "sales"
schema_name = "bronze"
query = f"select * from schema_name.table_name"
C. table_name = "sales"
schema_name = "bronze"
query = f"select * from {schema_name}.{table_name}"
D. table_name = "sales"
schema_name = "bronze"
query = "select * from {schema_name}.{table_name}"

Answer: C

Explanation:
The answer is:
table_name = "sales"
schema_name = "bronze"
query = f"select * from {schema_name}.{table_name}"
f-strings can be used to format a string, for example f"This is a string {python_variable}".
https://realpython.com/python-f-strings/
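
To see why the f prefix matters, the short example below compares the correct option with the same string written without the prefix. The variable names formatted and plain are introduced here only for the comparison.

```python
table_name = "sales"
schema_name = "bronze"

# With the f prefix, the names in braces are replaced by their values.
formatted = f"select * from {schema_name}.{table_name}"

# Without the prefix, the braces are kept as literal text.
plain = "select * from {schema_name}.{table_name}"

print(formatted)  # select * from bronze.sales
print(plain)      # select * from {schema_name}.{table_name}
```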

NEW QUESTION # 29
Which of the following data workloads will utilize a Bronze table as its source?

A. A job that queries aggregated data to publish key insights into a dashboard
B. A job that ingests raw data from a streaming source into the Lakehouse
C. A job that develops a feature set for a machine learning application
D. A job that enriches data by parsing its timestamps into a human-readable format
E. A job that aggregates cleaned data to create standard summary statistics

Answer: D

NEW QUESTION # 30
What is the main difference between AUTO LOADER and COPY INTO?

A. AUTO LOADER supports reading data from Apache Kafka
B. AUTO LOADER supports schema evolution.
C. COPY INTO supports file notification when performing incremental loads.
D. AUTO LOADER Supports file notification when performing incremental loads.
E. COPY INTO supports schema evolution.

Answer: D

Explanation:
Auto Loader supports both directory listing and file notification, but COPY INTO only supports directory listing.
Auto Loader file notification will automatically set up a notification service and queue service that subscribe to file events from the input directory in cloud object storage like Azure Blob Storage or S3. File notification mode is more performant and scalable for large input directories or a high volume of files.

Auto Loader and Cloud Storage Integration
Auto Loader supports two ways to ingest data incrementally:
1. Directory listing - lists the directory and maintains the state in RocksDB; supports incremental file listing.
2. File notification - uses a trigger and a queue to store the file notifications, which can later be used to retrieve the files. Unlike directory listing, file notification can scale up to millions of files per day.
[OPTIONAL]
Auto Loader vs COPY INTO?
Auto Loader
Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage without any additional setup. Auto Loader provides a new Structured Streaming source called cloudFiles. Given an input directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of also processing existing files in that directory.
When to use Auto Loader instead of the COPY INTO?
* You want to load data from a file location that contains files in the order of millions or higher. Auto Loader can discover files more efficiently than the COPY INTO SQL command and can split file processing into multiple batches.
* You do not plan to load subsets of previously uploaded files. With Auto Loader, it can be more difficult to reprocess subsets of files. However, you can use the COPY INTO SQL command to reload subsets of files while an Auto Loader stream is simultaneously running.
Here are some additional notes on when to use COPY INTO vs Auto Loader:
When to use COPY INTO: https://docs.databricks.com/delta/delta-ingest.html#copy-into-sql-command
When to use Auto Loader: https://docs.databricks.com/delta/delta-ingest.html#auto-loader
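
The two ingestion paths can be contrasted with a rough sketch. Both helpers, the target table name, and the source path are hypothetical; the sketch only builds the COPY INTO statement and the Auto Loader options that switch on file notification mode, without running either.

```python
def copy_into_sql(target: str, source_path: str, file_format: str) -> str:
    """Build a basic COPY INTO statement (directory listing only)."""
    return f"COPY INTO {target} FROM '{source_path}' FILEFORMAT = {file_format}"


def notification_options(file_format: str) -> dict:
    """Auto Loader options that request file notification mode."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.useNotifications": "true",
    }


statement = copy_into_sql("bronze.events", "/path/to/landing", "JSON")
options = notification_options("json")
print(statement)
print(options)
```

Leaving cloudFiles.useNotifications unset keeps Auto Loader in directory listing mode, which is the only mode COPY INTO offers.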

NEW QUESTION # 31
What are the different ways you can schedule a job in Databricks workspace?

A. Cron, On Demand runs
B. Cron, File notification from Cloud object storage
C. Continuous, Incremental
D. Once, Continuous
E. On-Demand runs, File notification from Cloud object storage

Answer: A

Explanation:
The answer is Cron, On-Demand runs.
A job can be run immediately (on demand) or scheduled using cron syntax.
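
A minimal sketch of a schedule definition follows, assuming the schedule block of a job definition takes a Quartz cron expression, a time zone, and a pause status. The helper name cron_schedule and the 02:00 run time are illustrative; an on-demand run needs no schedule block at all.

```python
def cron_schedule(quartz_expression: str, timezone_id: str = "UTC") -> dict:
    """Build a job schedule block (hypothetical helper)."""
    return {
        # Quartz fields: second minute hour day-of-month month day-of-week
        "quartz_cron_expression": quartz_expression,
        "timezone_id": timezone_id,
        "pause_status": "UNPAUSED",
    }


nightly = cron_schedule("0 0 2 * * ?")  # 02:00 every day
print(nightly)
```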

NEW QUESTION # 32
A data engineer has written the following query:
SELECT *
FROM json.`/path/to/json/file.json`;
The data engineer asks a colleague for help to convert this query for use in a Delta Live Tables (DLT) pipeline. The query should create the first table in the DLT pipeline.
Which of the following describes the change the colleague needs to make to the query?

A. They need to add a CREATE LIVE TABLE table_name AS line at the beginning of the query
B. They need to add the cloud_files(...) wrapper to the JSON file path
C. They need to add a live. prefix prior to json. in the FROM line
D. They need to add a CREATE DELTA LIVE TABLE table_name AS line at the beginning of the query
E. They need to add a COMMENT line at the beginning of the query

Answer: A
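
The change described in option A amounts to prefixing the original query with one line. The sketch below shows that transformation as plain string handling; the helper to_dlt_sql and the table name raw_events are hypothetical, and the exact keywords accepted may differ between DLT releases.

```python
def to_dlt_sql(table_name: str, select_query: str) -> str:
    """Prefix a query with the CREATE LIVE TABLE ... AS line."""
    return f"CREATE LIVE TABLE {table_name} AS\n{select_query}"


original = "SELECT *\nFROM json.`/path/to/json/file.json`;"
dlt_sql = to_dlt_sql("raw_events", original)
print(dlt_sql)
```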

NEW QUESTION # 33
The data analyst team has put together queries that identify items that are out of stock based on orders and replenishment, but when the queries are run together to produce the final output, the team notices that it takes a really long time. You are asked to find out why the queries are running slowly and to identify steps to improve performance. On inspection, you notice that all the queries are running sequentially and using a SQL endpoint cluster. Which of the following steps can be taken to resolve the issue?
Here is the example query:
-- get order summary
create or replace table orders_summary
as
select product_id, sum(order_count) order_count
from
 (
 select product_id, order_count from orders_instore
 union all
 select product_id, order_count from orders_online
 )
group by product_id;

-- get supply summary
create or replace table supply_summary
as
select product_id, sum(supply_count) supply_count
from supply
group by product_id;

-- get on hand based on orders summary and supply summary
with stock_cte
as (
select nvl(s.product_id, o.product_id) as product_id,
 nvl(supply_count, 0) - nvl(order_count, 0) as on_hand
from supply_summary s
full outer join orders_summary o
 on s.product_id = o.product_id
)
select *
from stock_cte
where on_hand = 0;

A. Increase the maximum bound of the SQL endpoint's scaling range.
B. Turn on the Auto Stop feature for the SQL endpoint.
C. Turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy to "Reliability Optimized."
D. Increase the cluster size of the SQL endpoint.
E. Turn on the Serverless feature for the SQL endpoint.

Answer: D

Explanation:
The answer is to increase the cluster size of the SQL endpoint. The queries here run sequentially, and since a single query cannot span more than one cluster, adding more clusters won't improve performance; increasing the cluster size will, because each query can then use the additional compute in the warehouse.
In the exam, additional context will not be given; instead, look for cue words that indicate whether the queries are running sequentially or concurrently. If the queries are running sequentially, scale up (more nodes); if the queries are running concurrently (more users), scale out (more clusters).
Increasing the cluster size adds more worker nodes.

A SQL endpoint scales horizontally (scale-out) and vertically (scale-up); you have to understand when to use which.
Scale-up -> Increase the size of the cluster from X-Small to Small, to Medium, X-Large, ...
If you are trying to improve the performance of a single query, having additional memory, nodes, and CPU in the cluster will improve the performance.
Scale-out -> Add more clusters, change the max number of clusters
If you are trying to improve the throughput, that is, to run as many queries as possible, then having additional cluster(s) will improve the performance.
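
The rule of thumb above can be written down as a tiny lookup. The function name scaling_advice and the two workload labels are hypothetical; the sketch only encodes the decision, it does not call any Databricks API.

```python
def scaling_advice(workload: str) -> str:
    """Map a workload pattern to the scaling direction described above."""
    advice = {
        # One slow query at a time: give each query a bigger cluster.
        "sequential": "scale up: increase the cluster size",
        # Many users at once: let the endpoint add clusters.
        "concurrent": "scale out: raise the maximum number of clusters",
    }
    return advice[workload]


print(scaling_advice("sequential"))
print(scaling_advice("concurrent"))
```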

NEW QUESTION # 34
......

The Databricks Certified Professional Data Engineer Exam certification verifies that the candidate has significant experience in implementing big data solutions that operate on the Databricks Delta Architecture. To earn the certification, a candidate must pass a 90-minute exam consisting of up to 50 multiple-choice and multiple-select questions. The Databricks-Certified-Professional-Data-Engineer certification is valid for two years from the date of passing the exam.

Databricks Certified Professional Data Engineer certification is designed for data engineers who are responsible for building and maintaining data pipelines and data lakes on the Databricks platform. Databricks Certified Professional Data Engineer Exam certification exam covers a wide range of topics, including data engineering concepts, data modeling, data ingestion, data transformation, data processing, and data warehousing. Databricks-Certified-Professional-Data-Engineer exam is designed to assess a candidate's ability to design, build, and maintain scalable and reliable data pipelines on the Databricks platform.

Reliable Study Materials for Databricks-Certified-Professional-Data-Engineer Exam Success For Sure: https://www.actual4test.com/Databricks-Certified-Professional-Data-Engineer_examcollection.html
