✅ Question 1: Data Collection and Aggregation
Prompt:
You are given access to the data containing information about taxi drivers and their rides, created by April 15th, 2023. When calculating any time features, consider April 15th, 2023 as today. The data is distributed across 6 different files:
drivers.csv
:
driver_id
(type:int
) – Unique identifier of the drivercar_id
(type:int
)age
(type:int
)started_driving_year
(type:int
)second_language
(type:str
) – If a driver doesn't have a second language, the value is"no"
rating
(type:float
)net_worth_of_tips
(type:float
)driver_class
(type:str
) – One of the following:["A class", "B class"]

rides_{i}.csv
(split into 4 files):
ride_id
(type:int
)driver_id
(type:int
)passenger_id
(type:int
)date
(type:str
)status
(type:str
) – One of the following:["Rejected by the driver", "Cancelled by the passenger", "Success"]
car_clearness_upvote_given
(type:bool
)politeness_upvote_given
(type:bool
)communication_upvote_given
(type:bool
)punctuality_upvote_given
(type:bool
)complaint_given
(type:bool
)
cars.csv
:
car_id
(type:int
)model
(type:str
)manufacture_year
(type:int
)last_inspection_date
(type:str
)
Your task:
Retrieve the needed information from the data about each driver and store it in the collected.csv
file.
Your goal is to obtain a table with the following columns. You may order rows and columns in any way you find comfortable to work with, tests are designed to be order-agnostic:
driver_id
(int
)car_model
(str
)car_manufacture_year
(int
)days_since_inspection
(int
)age
(int
)experience
(int
)second_language
(str
)rating
(float
)net_worth_of_tips
(float
)number_of_upvotes
(int
)driver_class
(str
)

✅ Question 2: Basic Analysis and CSV Output
Prompt:
Your tasks are as follows:
Calculate the average driver rating.
- Compute the average of the
rating
column fromdrivers.csv
- Store the result as:
insight_type
:"average_driver_rating"
value
: The calculated average (float)
Calculate the percentage of drivers with a second language.
- Determine the percentage of drivers who have a second language (i.e., where
second_language
is not"no"
) - Store the result as:
insight_type
:"percentage_drivers_with_second_language"
value
: The calculated percentage
Calculate the ride success rate.
- Combine the ride data from all
rides_{i}.csv
files into a single dataset - Calculate the percentage of rides that were successful (i.e., where
status
is"Success"
) - Store the result as:
insight_type
:"ride_success_rate"
value
: The calculated percentage
Output Requirements:
- Save the results in a CSV file named
analysis_results.csv
- The CSV file should have two columns:
insight_type
value
✅ Question 3: Preprocessing for Classification
Prompt:
The dataset includes various attributes for each driver.
Your tasks:
a. Encode the car_model
column using numerical encoding.
Encoding should start from 0
and encode values with consecutive integers.
Example of correct mapping:
"Nissan Altima" -> 3
"Ford Fusion" -> 1
"Honda Accord" -> 0
"Hyundai Sonata"-> 2
The set of codes is (0, 1, 2, 3)
; it is a consecutive sequence from zero.
Example of incorrect mapping:
"Nissan Altima" -> 5
"Ford Fusion" -> 4
"Honda Accord" -> 2
"Hyundai Sonata"-> 1
The set of codes is (1, 2, 4, 5)
; it is not a consecutive sequence starting from zero.
b. Normalize the net_worth_of_tips
column using Standard Scaling.
c. Convert the driver_class
column into numerical values:
- Replace
"A class"
with0
- Replace
"B class"
with1
✅ Question 4: Classification Model and Predictions
Prompt:
You are given the dataset that is a result of the previous question. To prevent being able to use the dataset for the answer to the previous question, random adjustments were applied, but the format and structure remain the same.
Using the cleaned dataset from the prior question, your goal is to train a classifier that is able to predict whether the driver is of "A class"
(0) or of a "B class"
(1).
This is a free-form task, so feel free to use any machine-learning model you want. You can also use any Python libraries you want.
The testing set from the previous task was split into validation and test sets. The test set does not contain the classes, and it is used to evaluate model performance on unseen data when questions are submitted. The validation set contains the classes and can be used to evaluate model performance during training.
Your task:
Predict classes of the drivers from test.csv
with the lowest possible error.
As an error metric, use precision
and recall
, considering B class to be the positive class.
Your goal is to maximize recall
, while keeping the precision
on a relatively high level.
Once you are satisfied with the model’s performance on the validation set val.csv
, submit the predicted classes of the drivers in the test set to the platform.
The predicted classes should be saved in a csv file with the name predictions.csv
. The file should have the following format:
driver_class
0
1
1
0
我们长期稳定承接各大科技公司如Capital One、TikTok、Google、Amazon等的OA笔试代写服务,确保满分通过。如有需求,请随时联系我们。
We consistently provide professional online assessment services for major tech companies like TikTok, Google, and Amazon, guaranteeing perfect scores. Feel free to contact us if you're interested.
