DSM030 Statistics and Statistical Data Mining Assignment Brief 2026 | UOL

University: University of London (UOL)
Subject: DSM030 Statistics and Statistical Data Mining

DSM030 Assignment Brief

MSc Data Science

Module: Statistics and Statistical Data Mining

Task Name: Data Preprocessing and Engineering using Python 3

Assignment Date: Monday, 09 March 2026

  • Please Note: You are permitted to upload your Coursework in the final submission area as many times as you like before the deadline. You will receive a similarity/originality score, which represents the work that the Turnitin system identifies as similar to other sources. The originality score can take over 24 hours to generate, especially at busy times, e.g. around the submission deadline.
  • If you upload the wrong version of your Coursework, you are able to upload the correct version of your Coursework via the same submission area. You simply need to click on the ‘submit paper’ button again and submit your new version before the deadline.

Doing so will delete the previous version you submitted and replace it with your new, updated version. Your Turnitin similarity score should therefore not be affected; if it does change, it will be due to changes you have made to your Coursework.

  • Please note that when the due date is reached, the version you submitted last will be considered your final submission, and it is the version that will be marked.
  • Once the due date has passed, it will not be possible for you to upload a different version of your assessment. You must therefore ensure that the version you wish to be marked has been submitted by the due date.

You are asked to submit a Jupyter notebook that contains your solution (weighted at 50% of the final mark for the module). You will be given a Jupyter notebook that you can use as a skeleton/guide. Please make sure you use Python 3 and not Python 2; Python 2 code will not be marked and will be treated as a non-submission.

Coursework Description

Task Name: Data Engineering and Pre-processing

Data pre-processing and engineering is a very important step in statistical data mining. This step might look straightforward, but it can easily become a nightmare due to any number of difficulties, including: 1) the nature of the problem, 2) the number of variables and their types (i.e. numerical, categorical, etc.), and 3) selecting the correct transformation, if one is required.

In this task, you will implement several data pre-processing and engineering steps that are common in data science and machine learning. These steps involve several key topics in statistics.

You are expected to learn some simple techniques that are required to finish this task (this is if you do not already know them). There will be a video explaining the task further in order to assist you.

Data description: the dataset you will use for this task contains data about house sale prices. The file ‘data_description.txt’ contains a detailed description of all the variables: what they represent, their possible values, and so on. The target variable is ‘SalePrice’, the house’s sale price in US dollars.

Here is a description of the steps you are asked to implement and their corresponding marks:

1. Import the required libraries.

2. Load the data using pandas and plot a histogram of the SalePrice column. This code is provided for you; do not change it.

3. The SalePrice column is not normally distributed (i.e. not Gaussian); prove this by running a statistical test and by obtaining and interpreting the p-value. [5 marks]
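A minimal sketch of such a test is shown below; the DataFrame name `data` is an assumption (adapt it to the variable used in the skeleton notebook), and the Shapiro-Wilk test is only one of several acceptable normality tests.

```python
# Minimal sketch of a normality test on SalePrice (the DataFrame name `data`
# is an assumption; adapt it to the variable used in the skeleton notebook).
from scipy import stats

stat, p_value = stats.shapiro(data['SalePrice'])  # Shapiro-Wilk test
print(f"statistic = {stat:.4f}, p-value = {p_value:.2e}")

# H0: the sample comes from a normal distribution.
# A p-value below the usual 0.05 significance level means we reject H0,
# i.e. SalePrice is not normally distributed.
if p_value < 0.05:
    print("Reject H0: SalePrice is not normally distributed.")
else:
    print("Fail to reject H0: no evidence against normality.")
```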

4. Split the data into train and test sets, making sure the test set is 30% of the original data and the remaining 70% is used for training. This code is provided for you; do not change it.

5. Create a list of all categorical variables (by checking their type in the original dataset). [2 marks]

6. Using the training set (X_train), create a list of all categorical variables that contain missing data and print the percentage of missing values per variable in X_train. [3 marks]
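A sketch covering steps 5 and 6 is shown below; the names `data`, `X_train`, `cat_vars` and `cat_vars_na` are assumptions, and the missing-value percentages are computed on the training set only.

```python
# Step 5: categorical variables are those stored with the pandas 'object' dtype.
cat_vars = [col for col in data.columns if data[col].dtype == 'O']
print(f"{len(cat_vars)} categorical variables")

# Step 6: keep only those with missing values in X_train and report the
# percentage of missing entries per variable.
cat_vars_na = [col for col in cat_vars if X_train[col].isnull().any()]
missing_pct = X_train[cat_vars_na].isnull().mean() * 100
print(missing_pct.sort_values(ascending=False))
```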

7. Using the result of the previous step: for categorical variables with more than 10% of data missing, replace the missing data with the word ‘Missing’; for the other variables, replace the missing data with the most frequent category in the training set. (Apply the replacement to both X_train and X_test, and make sure it is based on the results you obtained from the training set.) [5 marks]
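One possible implementation is sketched below; the 10% threshold comes from the brief, the statistics are computed on the training set before being applied to both sets, and the variable names carry over from the previous sketch (so they remain assumptions).

```python
for col in cat_vars_na:
    if X_train[col].isnull().mean() > 0.10:
        # more than 10% missing: flag the absence explicitly
        X_train[col] = X_train[col].fillna('Missing')
        X_test[col] = X_test[col].fillna('Missing')
    else:
        # otherwise impute with the most frequent category seen in training
        most_frequent = X_train[col].mode()[0]
        X_train[col] = X_train[col].fillna(most_frequent)
        X_test[col] = X_test[col].fillna(most_frequent)
```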

8. Create a list of all numerical variables (do not include SalePrice). [2 marks]

9. Create a list of all numerical variables that contain missing data and print out the percentage of missing values per variable (use the training data).  [3 marks]
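A sketch covering steps 8 and 9, under the same naming assumptions as before, mirrors the categorical case:

```python
# Step 8: numerical variables are everything that is not object-typed,
# excluding the target SalePrice.
num_vars = [col for col in data.columns
            if data[col].dtype != 'O' and col != 'SalePrice']

# Step 9: of those, keep the ones with missing values in the training set
# and print the percentage of missing entries per variable.
num_vars_na = [col for col in num_vars if X_train[col].isnull().any()]
print((X_train[num_vars_na].isnull().mean() * 100).sort_values(ascending=False))
```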

10. Using the result of the previous step: for numerical variables with less than 15% of data missing, replace the missing data with the mean of the variable; for the other variables, replace the missing data with the median of the variable in the training set. (Apply the replacement to both X_train and X_test, and make sure it is based on the results you obtained from the training set.) [5 marks]
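A possible sketch of this step, with the 15% threshold from the brief and all replacement values computed on X_train only (list and variable names remain assumptions):

```python
for col in num_vars_na:
    pct_missing = X_train[col].isnull().mean()
    # mean for variables with less than 15% missing, median otherwise,
    # always computed from the training set
    fill_value = X_train[col].mean() if pct_missing < 0.15 else X_train[col].median()
    X_train[col] = X_train[col].fillna(fill_value)
    X_test[col] = X_test[col].fillna(fill_value)
```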

11. In the train and test sets, replace the values of the variables ‘YearBuilt’, ‘YearRemodAdd’ and ‘GarageYrBlt’ with the time elapsed between them and the year in which the house was sold (‘YrSold’). After that, drop the ‘YrSold’ column. [5 marks]
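This transformation could be sketched as follows (the column names come from the dataset description; the DataFrame names are assumptions):

```python
for df in (X_train, X_test):
    for col in ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']:
        # elapsed time between the year variable and the year of sale
        df[col] = df['YrSold'] - df[col]
    df.drop('YrSold', axis=1, inplace=True)
```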

12. Apply mappings to the categorical variables that have an order (in total there should be 14 of them). Some of the categorical variables have values with an assigned order related to quality (for more information, check the data description file). This means you can replace the categories with numbers that encode quality. For example, values in ‘BsmtExposure’ can be mapped as follows: ‘No’ to 1, ‘Mn’ to 2, ‘Av’ to 3 and ‘Gd’ to 4.

One way of doing this is to manually create mappings similar to the example given. Each mapping can be saved as a Python dictionary and used to perform the actual mapping to transform the described variables from categorical to numerical.

To make it easier for you, here are groups of variables that share the same mapping (hint: you can map both categories ‘Missing’ and ‘NA’ to 0):

  • [‘ExterQual’, ‘ExterCond’, ‘BsmtQual’, ‘BsmtCond’, ‘HeatingQC’, ‘KitchenQual’, ‘FireplaceQu’, ‘GarageQual’, ‘GarageCond’]

  • [‘BsmtFinType1’, ‘BsmtFinType2’]

Each of the following variables has its own mapping: ‘BsmtExposure’,  ‘GarageFinish’, ‘Fence’.  [5 marks]
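A partial sketch of the mapping pattern is shown below; only the quality group and ‘BsmtExposure’ are spelled out, and the exact category labels used here (e.g. ‘Po’, ‘Fa’, ‘TA’, ‘Gd’, ‘Ex’) are assumptions that should be checked against data_description.txt before use.

```python
# Quality-style mapping shared by the first group of variables
# (verify the category labels against data_description.txt).
qual_map = {'Missing': 0, 'NA': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
qual_vars = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'HeatingQC',
             'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond']

# Mapping for 'BsmtExposure', following the example given in the brief.
exposure_map = {'Missing': 0, 'NA': 0, 'No': 1, 'Mn': 2, 'Av': 3, 'Gd': 4}

for df in (X_train, X_test):
    for col in qual_vars:
        df[col] = df[col].map(qual_map)
    df['BsmtExposure'] = df['BsmtExposure'].map(exposure_map)

# 'BsmtFinType1'/'BsmtFinType2', 'GarageFinish' and 'Fence' follow the same
# pattern, each with its own dictionary.
```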

13. Replace rare labels with ‘Rare’. For the remaining five categorical variables (the ones you did not apply value mappings to), group the categories that are present in less than 1% of the observations in the training set: every value of a categorical variable that is shared by less than 1% of the houses in the training set must be replaced by the string “Rare” in both the training and test sets. Hint: if you take the unique values of a categorical variable in the training set and count how many times each appears, you can compute the percentage of each value by dividing its count by the total number of observations. Remember to carry out the computations on the training set and apply the replacement to both the training and test sets. [5 marks]
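One way of doing this is sketched below; `remaining_cat_vars` stands for the five unmapped categorical variables (the name is an assumption), and the frequencies are computed on the training set only.

```python
for col in remaining_cat_vars:
    # share of each category in the training set
    freqs = X_train[col].value_counts(normalize=True)
    frequent = freqs[freqs >= 0.01].index  # labels seen in at least 1% of houses

    # keep frequent labels, replace everything else with 'Rare' in both sets
    X_train[col] = X_train[col].where(X_train[col].isin(frequent), 'Rare')
    X_test[col] = X_test[col].where(X_test[col].isin(frequent), 'Rare')
```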

14. Perform one-hot encoding to transform the previous five categorical variables into binary variables. Make sure you do it correctly for both the training and test sets. After the encoding, remember to drop the original five (string-valued) categorical variables from both the training and test sets. [5 marks]
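A sketch using pandas get_dummies is given below; re-aligning the test columns to the training columns guards against categories that appear in only one of the two splits (variable names remain assumptions).

```python
import pandas as pd  # normally already imported in step 1

train_dummies = pd.get_dummies(X_train[remaining_cat_vars])
test_dummies = pd.get_dummies(X_test[remaining_cat_vars])

# make sure the test set has exactly the same dummy columns as the training set
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

# drop the original string-valued columns and append the binary ones
X_train = pd.concat([X_train.drop(columns=remaining_cat_vars), train_dummies], axis=1)
X_test = pd.concat([X_test.drop(columns=remaining_cat_vars), test_dummies], axis=1)
```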

15. Feature scaling. Now that all variables in our two datasets (i.e. the training and test sets) are numerical, the final step in this exercise is to apply scaling so that the minimum value of each variable is 0 and the maximum value is 1. For this step, you can use MinMaxScaler() from scikit-learn. Make sure you apply it correctly by transforming the test set based on the training set. [5 marks]

After applying all the previous steps, the overall mean value of all entries in the training set was approximately 0.249 and in the test set was approximately 0.247.
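A sketch of the scaling step and of a final sanity check appears below; the scaler is fitted on the training set only, and the quoted means are only a rough check, since the exact values depend on the earlier preprocessing choices and on the train/test split.

```python
from sklearn.preprocessing import MinMaxScaler
import pandas as pd  # normally already imported in step 1

scaler = MinMaxScaler()
scaler.fit(X_train)  # learn the min and max of each variable from the training set only

X_train = pd.DataFrame(scaler.transform(X_train),
                       columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test),
                      columns=X_test.columns, index=X_test.index)

# rough sanity check against the values quoted in the brief (~0.249 / ~0.247)
print(X_train.values.mean(), X_test.values.mean())
```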

Please refer to Appendix C of the Programme Regulations for detailed Assessment Criteria.

Plagiarism:

This is cheating. Do not be tempted and certainly do not succumb to temptation. Plagiarised copies are invariably rooted out and severe penalties apply. All assignment submissions are electronically tested for plagiarism.

Struggling with DSM030 Statistics & Data Mining Assignment at UOL?

MSc Data Science students often struggle with DSM030 Statistics and Statistical Data Mining assignments because the coursework can be complex and time-consuming. There is no need to worry, as Students Assignment Help provides expert data mining assignment help aligned with University of London marking criteria. For reassurance, students can review our expert-written data science assignment samples. Order your DSM030 assignment today through our statistics assignment helper and receive a 100% original, well-documented Jupyter notebook solution written only for you.
