AWS Certified Machine Learning - Specialty (MLS-C01) Practice Quiz

This quiz randomly generates 30 questions (to be answered in 60 minutes) as asked in the AWS Certified Machine Learning - Specialty (MLS-C01) exam. The real MLS-C01 exam has 65 questions and a total time of 180 minutes; 15 of those questions are unscored pretest items, so only 50 questions count toward the score. This practice test draws 30 questions at random from our question bank. For best results, practice multiple times until you achieve 100% accuracy.

1 / 30
When submitting Amazon SageMaker training jobs using one of the built-in algorithms, which common parameters MUST be specified? (Choose three.)
- The training channel identifying the location of training data on an Amazon S3 bucket.
- The validation channel identifying the location of validation data on an Amazon S3 bucket.
- The IAM role that Amazon SageMaker can assume to perform tasks on behalf of the users.
- Hyperparameters in a JSON array as documented for the algorithm used.
- The Amazon EC2 instance class specifying whether training will be run using CPU or GPU.
- The output path specifying where on an Amazon S3 bucket the trained model will persist.

2 / 30
A manufacturing company has a large set of labeled historical sales data. The manufacturer would like to predict how many units of a particular part should be produced each quarter. Which machine learning approach should be used to solve this problem?
- Logistic regression
- Random Cut Forest (RCF)
- Principal component analysis (PCA)
- Linear regression

3 / 30
A Machine Learning Specialist at a security-sensitive company is preparing a dataset for model training. The dataset is stored in Amazon S3 and contains Personally Identifiable Information (PII). The dataset:
- Must be accessible from a VPC only.
- Must not traverse the public internet.
How can these requirements be satisfied?
- Create a VPC endpoint and apply a bucket access policy that restricts access to the given VPC endpoint and the VPC.
- Create a VPC endpoint and apply a bucket access policy that allows access from the given VPC endpoint and an Amazon EC2 instance.
- Create a VPC endpoint and use Network Access Control Lists (NACLs) to allow traffic between only the given VPC endpoint and an Amazon EC2 instance.
- Create a VPC endpoint and use security groups to restrict access to the given VPC endpoint and an Amazon EC2 instance.

4 / 30
A Machine Learning Specialist has created a deep learning neural network model that performs well on the training data but performs poorly on the test data. Which of the following methods should the Specialist consider using to correct this? (Choose three.) A sketch illustrating two of these remedies follows this question.
- Decrease regularization.
- Increase regularization.
- Increase dropout.
- Decrease dropout.
- Increase feature combinations.
- Decrease feature combinations.
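To make question 4's remedies concrete, here is a minimal Keras sketch (assuming TensorFlow is installed) that increases L2 regularization and dropout in a small binary classifier. The layer sizes, input width, and rates are illustrative choices, not prescriptions.

```python
import tensorflow as tf

# A small binary classifier that counters overfitting in two of the ways
# named in question 4: stronger L2 regularization and a higher dropout rate.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),  # 100 input features (illustrative)
    tf.keras.layers.Dense(
        128,
        activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(1e-3),  # increase regularization
    ),
    tf.keras.layers.Dropout(0.5),  # increase dropout to reduce co-adaptation
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```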
5 / 30
A Data Scientist needs to create a serverless ingestion and analytics solution for high-velocity, real-time streaming data. The ingestion process must buffer and convert incoming records from JSON to a query-optimized, columnar format without data loss. The output datastore must be highly available, and Analysts must be able to run SQL queries against the data and connect to existing business intelligence dashboards. Which solution should the Data Scientist build to satisfy the requirements?
- Create a schema in the AWS Glue Data Catalog for the incoming data format. Use an Amazon Kinesis Data Firehose delivery stream to stream the data and transform it to Apache Parquet or ORC format using the AWS Glue Data Catalog before delivering it to Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena, and connect to BI tools using the Athena Java Database Connectivity (JDBC) connector.
- Write each JSON record to a staging location in Amazon S3. Use the S3 Put event to trigger an AWS Lambda function that transforms the data into Apache Parquet or ORC format and writes the data to a processed data location in Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena, and connect to BI tools using the Athena JDBC connector.
- Write each JSON record to a staging location in Amazon S3. Use the S3 Put event to trigger an AWS Lambda function that transforms the data into Apache Parquet or ORC format and inserts it into an Amazon RDS PostgreSQL database. Have the Analysts query and run dashboards from the RDS database.
- Use Amazon Kinesis Data Analytics to ingest the streaming data and perform real-time SQL queries to convert the records to Apache Parquet before delivering to Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena and connect to BI tools using the Athena JDBC connector.

6 / 30
A Mobile Network Operator is building an analytics platform to analyze and optimize a company's operations using Amazon Athena and Amazon S3. The source systems send data in .CSV format in real time. The Data Engineering team wants to transform the data to the Apache Parquet format before storing it on Amazon S3. Which solution takes the LEAST effort to implement?
- Ingest .CSV data using Apache Kafka Streams on Amazon EC2 instances and use Kafka Connect S3 to serialize data as Parquet.
- Ingest .CSV data from Amazon Kinesis Data Streams and use AWS Glue to convert the data into Parquet.
- Ingest .CSV data using Apache Spark Structured Streaming in an Amazon EMR cluster and use Apache Spark to convert the data into Parquet.
- Ingest .CSV data from Amazon Kinesis Data Streams and use Amazon Kinesis Data Firehose to convert the data into Parquet.

7 / 30
A Machine Learning Specialist is building a logistic regression model that will predict whether or not a person will order a pizza. The Specialist is trying to build the optimal model with an ideal classification threshold. What model evaluation technique should the Specialist use to understand how different classification thresholds will impact the model's performance? (A sketch follows question 8.)
- Receiver operating characteristic (ROC) curve
- Misclassification rate
- Root Mean Square Error (RMSE)
- L1 norm

8 / 30
During mini-batch training of a neural network for a classification problem, a Data Scientist notices that training accuracy oscillates. What is the MOST likely cause of this issue?
- The class distribution in the dataset is imbalanced.
- Dataset shuffling is disabled.
- The batch size is too big.
- The learning rate is very high.
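For question 7, a short scikit-learn sketch of how a ROC curve exposes the true-positive/false-positive trade-off at every candidate threshold. The data is synthetic, and Youden's J statistic is just one common way to pick a cutoff.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# Synthetic binary-classification data standing in for the pizza-order problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# roc_curve evaluates the model at every threshold implied by the scores.
scores = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
fpr, tpr, thresholds = roc_curve(y, scores)

best = np.argmax(tpr - fpr)  # Youden's J: maximize TPR - FPR
print(f"threshold={thresholds[best]:.2f} tpr={tpr[best]:.2f} fpr={fpr[best]:.2f}")
```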
9 / 30
A manufacturing company has structured and unstructured data stored in an Amazon S3 bucket. A Machine Learning Specialist wants to use SQL to run queries on this data. Which solution requires the LEAST effort to be able to query this data?
- Use AWS Data Pipeline to transform the data and Amazon RDS to run queries.
- Use AWS Glue to catalog the data and Amazon Athena to run queries.
- Use AWS Batch to run ETL on the data and Amazon Aurora to run the queries.
- Use AWS Lambda to transform the data and Amazon Kinesis Data Analytics to run queries.

10 / 30
A company wants to classify user behavior as either fraudulent or normal. Based on internal research, a Machine Learning Specialist would like to build a binary classifier based on two features: age of account and transaction month. The class distribution for these features is illustrated in the figure provided. Based on this information, which model would have the HIGHEST accuracy?
- Long short-term memory (LSTM) model with scaled exponential linear unit (SELU)
- Logistic regression
- Support vector machine (SVM) with non-linear kernel
- Single perceptron with tanh activation function

11 / 30
A Machine Learning Specialist is assigned a TensorFlow project using Amazon SageMaker for training, and needs to continue working for an extended period with no Wi-Fi access. Which approach should the Specialist use to continue working?
- Install Python 3 and boto3 on their laptop and continue the code development using that environment.
- Download the TensorFlow Docker container used in Amazon SageMaker from GitHub to their local environment, and use the Amazon SageMaker Python SDK to test the code.
- Download TensorFlow from tensorflow.org to emulate the TensorFlow kernel in the SageMaker environment.
- Download the SageMaker notebook to their local environment, then install Jupyter Notebooks on their laptop and continue the development in a local notebook.

12 / 30
A Machine Learning Specialist is required to build a supervised image-recognition model to identify a cat. The ML Specialist performs some tests and records the following results for a neural network-based image classifier:
- Total number of images available = 1,000
- Test set images = 100 (constant test set)
The ML Specialist notices that, in over 75% of the misclassified images, the cats were held upside down by their owners. Which technique can the ML Specialist use to improve this specific test error?
- Increase the training data by adding variation in rotation for training images.
- Increase the number of epochs for model training.
- Increase the number of layers for the neural network.
- Increase the dropout rate for the second-to-last layer.

13 / 30
A large consumer goods manufacturer has the following products on sale:
1. 34 different toothpaste variants
2. 48 different toothbrush variants
3. 43 different mouthwash variants
The entire sales history of all these products is available in Amazon S3. Currently, the company is using custom-built autoregressive integrated moving average (ARIMA) models to forecast demand for these products. The company wants to predict the demand for a new product that will soon be launched. Which solution should a Machine Learning Specialist apply?
- Train a custom ARIMA model to forecast demand for the new product.
- Train an Amazon SageMaker DeepAR algorithm to forecast demand for the new product.
- Train an Amazon SageMaker k-means clustering algorithm to forecast demand for the new product.
- Train a custom XGBoost model to forecast demand for the new product.

Explanation: The Amazon SageMaker DeepAR forecasting algorithm is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNNs). Classical forecasting methods, such as ARIMA or exponential smoothing (ETS), fit a single model to each individual time series and then use that model to extrapolate the time series into the future. DeepAR instead trains one model across all the related time series, which is why it can forecast a new product that has little or no sales history of its own. A minimal training sketch follows.
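The following is a minimal sketch of training the built-in DeepAR algorithm with the SageMaker Python SDK (v2). The IAM role ARN and S3 paths are placeholders, and the hyperparameter values are illustrative. It also makes question 1's required parameters concrete: the IAM role, the training channel, the instance configuration, and the output path.

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
# Resolve the regional container image for the built-in DeepAR algorithm.
image = sagemaker.image_uris.retrieve(
    "forecasting-deepar", session.boto_region_name, version="1"
)

estimator = Estimator(
    image_uri=image,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    output_path="s3://example-bucket/deepar-output/",     # placeholder bucket
    sagemaker_session=session,
)
# One model is trained across all product time series (illustrative values).
estimator.set_hyperparameters(
    time_freq="W", context_length=12, prediction_length=12, epochs=100
)
estimator.fit({"train": "s3://example-bucket/deepar-train/"})  # training channel
```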
14 / 30
A Machine Learning Specialist trained a regression model, but the first iteration needs optimizing. The Specialist needs to understand whether the model is more frequently overestimating or underestimating the target. What option can the Specialist use to determine whether it is overestimating or underestimating the target value?
- Root Mean Square Error (RMSE)
- Residual plots
- Area under the curve
- Confusion matrix

15 / 30
A Data Scientist wants to gain real-time insights into a data stream of GZIP files. Which solution would allow the use of SQL to query the stream with the LEAST latency?
- Amazon Kinesis Data Analytics with an AWS Lambda function to transform the data.
- AWS Glue with a custom ETL script to transform the data.
- An Amazon Kinesis Client Library application to transform the data and save it to an Amazon ES cluster.
- Amazon Kinesis Data Firehose to transform the data and put it into an Amazon S3 bucket.

16 / 30
A company is observing low accuracy while training on the default built-in image classification algorithm in Amazon SageMaker. The Data Science team wants to use an Inception neural network architecture instead of a ResNet architecture. Which of the following will accomplish this? (Choose two.)
- Customize the built-in image classification algorithm to use Inception and use this for model training.
- Create a support case with the SageMaker team to change the default image classification algorithm to Inception.
- Bundle a Docker container with a TensorFlow Estimator loaded with an Inception network and use this for model training.
- Use custom code in Amazon SageMaker with a TensorFlow Estimator to load the model with an Inception network, and use this for model training.
- Download and apt-get install the Inception network code onto an Amazon EC2 instance and use this instance as a Jupyter notebook in Amazon SageMaker.

17 / 30
A company is using Amazon Polly to convert plaintext documents to speech for automated company announcements. However, company acronyms are being mispronounced in the current documents. How should a Machine Learning Specialist address this issue for future documents? (A sketch of the lexicon approach follows question 18.)
- Convert current documents to SSML with pronunciation tags.
- Create an appropriate pronunciation lexicon.
- Output speech marks to guide in pronunciation.
- Use Amazon Lex to preprocess the text files for pronunciation.

18 / 30
A Machine Learning Specialist built an image classification deep learning model. However, the Specialist ran into an overfitting problem in which the training and testing accuracies were 99% and 75%, respectively. How should the Specialist address this issue, and what is the reason behind it?
- The learning rate should be increased because the optimization process was trapped at a local minimum.
- The dropout rate at the flatten layer should be increased because the model is not generalized enough.
- The dimensionality of the dense layer next to the flatten layer should be increased because the model is not complex enough.
- The epoch number should be increased because the optimization process was terminated before it reached the global minimum.
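For question 17, a hedged boto3 sketch of the pronunciation-lexicon approach: put_lexicon and the LexiconNames parameter are real Amazon Polly APIs, but the lexicon name, the example alias, and the voice are illustrative.

```python
import boto3

polly = boto3.client("polly")

# A W3C Pronunciation Lexicon Specification (PLS) document that expands an
# example acronym. The grapheme/alias pair below is purely illustrative.
lexicon = """<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>S3</grapheme>
    <alias>ess three</alias>
  </lexeme>
</lexicon>"""

# Register the lexicon once; future documents then reference it at synthesis time.
polly.put_lexicon(Name="acronyms", Content=lexicon)
response = polly.synthesize_speech(
    Text="Data is stored in S3.",
    OutputFormat="mp3",
    VoiceId="Joanna",
    LexiconNames=["acronyms"],  # apply the registered lexicon
)
```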
19 / 30
A large mobile network operating company is building a machine learning model to predict customers who are likely to unsubscribe from the service. The company plans to offer an incentive to these customers, as the cost of churn is far greater than the cost of the incentive. The model produces the following confusion matrix after evaluating on a test dataset of 100 customers. Based on the model evaluation results, why is this a viable model for production?
- The model is 86% accurate and the cost incurred by the company as a result of false negatives is less than the false positives.
- The precision of the model is 86%, which is less than the accuracy of the model.
- The model is 86% accurate and the cost incurred by the company as a result of false positives is less than the false negatives.
- The precision of the model is 86%, which is greater than the accuracy of the model.

20 / 30
A monitoring service generates 1 TB of scale metrics record data every minute. A Research team performs queries on this data using Amazon Athena. The queries run slowly due to the large volume of data, and the team requires better performance. How should the records be stored in Amazon S3 to improve query performance?
- CSV files
- Parquet files
- Compressed JSON
- RecordIO

21 / 30
An online reseller has a large, multi-column dataset with one column missing 30% of its data. A Machine Learning Specialist believes that certain columns in the dataset could be used to reconstruct the missing data. Which reconstruction approach should the Specialist use to preserve the integrity of the dataset?
- Listwise deletion
- Last observation carried forward
- Multiple imputation
- Mean substitution

22 / 30
A gaming company has launched an online game where people can start playing for free, but they need to pay if they choose to use certain features. The company needs to build an automated system to predict whether or not a new user will become a paid user within 1 year. The company has gathered a labeled dataset from 1 million users. The training dataset consists of 1,000 positive samples (from users who ended up paying within 1 year) and 999,000 negative samples (from users who did not use any paid features). Each data sample consists of 200 features including user age, device, location, and play patterns. Using this dataset for training, the Data Science team trained a random forest model that converged with over 99% accuracy on the training set. However, the prediction results on a test dataset were not satisfactory. Which of the following approaches should the Data Science team take to mitigate this issue? (Choose two.) A sketch of the oversampling approach follows this question.
- Add more deep trees to the random forest to enable the model to learn more features.
- Include a copy of the samples in the test dataset in the training dataset.
- Generate more positive samples by duplicating the positive samples and adding a small amount of noise to the duplicated data.
- Change the cost function so that false negatives have a higher impact on the cost value than false positives.
- Change the cost function so that false positives have a higher impact on the cost value than false negatives.
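A minimal NumPy sketch of question 22's oversampling idea: duplicate minority-class rows and jitter the copies with small Gaussian noise (a simplified cousin of SMOTE). The dataset below is a synthetic, scaled-down stand-in for the skewed one in the question, and the noise scale would need tuning on real features.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic imbalanced dataset: ~1% positives here, standing in for the
# question's 1,000 positives vs. 999,000 negatives (scaled down).
X_train = rng.normal(size=(10_000, 5))
y_train = (rng.random(10_000) < 0.01).astype(int)

# Duplicate positive samples with replacement and add a small amount of noise.
X_pos = X_train[y_train == 1]
idx = rng.integers(0, len(X_pos), size=5_000)
X_dup = X_pos[idx] + rng.normal(scale=0.01, size=(5_000, X_train.shape[1]))

# Rebalanced training set: originals plus jittered positive duplicates.
X_bal = np.vstack([X_train, X_dup])
y_bal = np.concatenate([y_train, np.ones(5_000, dtype=int)])
print(f"positives before: {y_train.mean():.3f}, after: {y_bal.mean():.3f}")
```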
23 / 30
A Machine Learning Specialist must build out a process to query a dataset on Amazon S3 using Amazon Athena. The dataset contains more than 800,000 records stored as plaintext CSV files. Each record contains 200 columns and is approximately 1.5 MB in size. Most queries will span 5 to 10 columns only. How should the Machine Learning Specialist transform the dataset to minimize query runtime?
- Convert the records to Apache Parquet format.
- Convert the records to JSON format.
- Convert the records to GZIP CSV format.
- Convert the records to XML format.

Explanation: Using compression reduces the amount of data scanned by Amazon Athena and also reduces S3 storage costs, a win-win for your AWS bill. Supported compression formats include GZIP, LZO, SNAPPY (Parquet), and ZLIB. A columnar format such as Parquet additionally lets Athena read only the 5 to 10 queried columns instead of all 200, which is why it minimizes query runtime here.

24 / 30
An agency collects census information within a country to determine healthcare and social program needs by province and city. The census form collects responses for approximately 500 questions from each citizen. Which combination of algorithms would provide the appropriate insights? (Select TWO.)
- The factorization machines (FM) algorithm
- The Latent Dirichlet Allocation (LDA) algorithm
- The principal component analysis (PCA) algorithm
- The k-means algorithm
- The Random Cut Forest (RCF) algorithm

Explanation: PCA reduces the roughly 500 response dimensions to a smaller set of components, and k-means then clusters citizens into groups with similar needs, making the two algorithms a natural combination for this census data.

25 / 30
A Machine Learning Specialist working for an online fashion company wants to build a data ingestion solution for the company's Amazon S3-based data lake. The Specialist wants to create a set of ingestion mechanisms that will enable future capabilities comprised of:
- Real-time analytics
- Interactive analytics of historical data
- Clickstream analytics
- Product recommendations
Which services should the Specialist use?
- AWS Glue as the data catalog; Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics for real-time data insights; Amazon Kinesis Data Firehose for delivery to Amazon ES for clickstream analytics; Amazon EMR to generate personalized product recommendations
- Amazon Athena as the data catalog; Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics for near-real-time data insights; Amazon Kinesis Data Firehose for clickstream analytics; AWS Glue to generate personalized product recommendations
- AWS Glue as the data catalog; Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics for historical data insights; Amazon Kinesis Data Firehose for delivery to Amazon ES for clickstream analytics; Amazon EMR to generate personalized product recommendations
- Amazon Athena as the data catalog; Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics for historical data insights; Amazon DynamoDB streams for clickstream analytics; AWS Glue to generate personalized product recommendations

26 / 30
Which of the following metrics should a Machine Learning Specialist generally use to compare and evaluate machine learning classification models against each other?
- Recall
- Misclassification rate
- Mean absolute percentage error (MAPE)
- Area Under the ROC Curve (AUC)

27 / 30
A company wants to classify user behavior as either fraudulent or normal. Based on internal research, a Machine Learning Specialist would like to build a binary classifier based on two features: age of account and transaction month. The class distribution for these features is illustrated in the figure provided. Based on this information, which model would have the HIGHEST recall with respect to the fraudulent class?
- Decision tree
- Linear support vector machine (SVM)
- Naive Bayesian classifier
- Single perceptron with sigmoidal activation function

28 / 30
A Machine Learning Specialist is working with a media company to perform classification on popular articles from the company's website. The company is using random forests to classify how popular an article will be before it is published. A sample of the data being used is below. Given the dataset, the Specialist wants to convert the Day_Of_Week column to binary values. What technique should be used to convert this column to binary values? (A sketch follows this question.)
- Binarization
- One-hot encoding
- Tokenization
- Normalization transformation
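For question 28, one-hot encoding with pandas turns the categorical Day_Of_Week column into one binary column per day. The sample values here are made up, since the question's data sample is not reproduced.

```python
import pandas as pd

# Illustrative stand-in for the Day_Of_Week column from the question.
df = pd.DataFrame({"Day_Of_Week": ["Mon", "Tue", "Mon", "Sun"]})

# One-hot encoding: one 0/1 column per distinct day value.
encoded = pd.get_dummies(df, columns=["Day_Of_Week"], dtype=int)
print(encoded)
```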
29 / 30
An interactive online dictionary wants to add a widget that displays words used in similar contexts. A Machine Learning Specialist is asked to provide word features for the downstream nearest neighbor model powering the widget. What should the Specialist do to meet these requirements?
- Create one-hot word encoding vectors.
- Produce a set of synonyms for every word using Amazon Mechanical Turk.
- Create word embedding vectors that store edit distance with every other word.
- Download word embeddings pre-trained on a large corpus.

30 / 30
A retail company intends to use machine learning to categorize new products. A labeled dataset of current products was provided to the Data Science team. The dataset includes 1,200 products. The labeled dataset has 15 features for each product, such as title, dimensions, weight, and price. Each product is labeled as belonging to one of six categories, such as books, games, electronics, and movies. Which model should be used for categorizing new products using the provided dataset for training? (A sketch follows this question.)
- An XGBoost model where the objective parameter is set to multi:softmax
- A deep convolutional neural network (CNN) with a softmax activation function for the last layer
- A regression forest where the number of trees is set equal to the number of product categories
- A DeepAR forecasting model based on a recurrent neural network (RNN)
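For question 30, a minimal XGBoost sketch with the objective parameter set to multi:softmax over six categories. The 1,200 x 15 feature matrix is synthetic stand-in data matching the question's shape, and the hyperparameters are illustrative.

```python
import numpy as np
import xgboost as xgb

# Synthetic stand-in: 1,200 products with 15 numeric features and 6 categories.
rng = np.random.default_rng(0)
X = rng.normal(size=(1200, 15))
y = rng.integers(0, 6, size=1200)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "multi:softmax",  # predict the class index directly
    "num_class": 6,                # six product categories
    "max_depth": 6,
}
booster = xgb.train(params, dtrain, num_boost_round=50)
preds = booster.predict(dtrain)  # one predicted category index per product
```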