DSM030 Statistics and Statistical Data Mining Assignment Brief 2026 | UOL
| University | University of London (UOL) |
| Subject | DSM030 Statistics and Statistical Data Mining |
DSM030 Assignment Brief
MSc Data Science
Module: Statistics and Statistical Data Mining
Task Name: Data Preprocessing and Engineering using Python 3
Assignment Date: Monday, 09 March 2026
- Please note: you are permitted to upload your Coursework to the final submission area as many times as you like before the deadline. You will receive a similarity/originality score, which represents what the Turnitin system identifies as work similar to another source. The originality score can take over 24 hours to generate, especially at busy times, e.g. around the submission deadline.
- If you upload the wrong version of your Coursework, you can upload the correct version via the same submission area: simply click the ‘submit paper’ button again and submit the new version before the deadline.
Doing so deletes the previous version and replaces it with your new one, so your Turnitin similarity score should not be affected. Any change in the similarity score will be due to changes you have made to the Coursework itself.
- Please note that when the due date is reached, the version you submitted last will be considered your final submission and will be the version that is marked.
- Once the due date has passed, it will not be possible to upload a different version of your assessment, so you must make sure that the version you wish to have marked has been submitted by the due date.
You are asked to submit a Jupyter notebook containing your solution (weighted at 50% of the final mark for the module). You will be given a Jupyter notebook to use as a skeleton/guide. Please make sure you use Python 3 and not Python 2: Python 2 code will not be marked and will be treated as a non-submission.
Coursework Description
Task Name: Data Engineering and Pre-processing
Data pre-processing and engineering is a very important step in statistical data mining. It might look straightforward, but it can easily become a nightmare, for any number of reasons, including: 1) the nature of the problem, 2) the number of variables and their types (i.e. numerical, categorical, etc.), and 3) selecting the correct transformation where one is required.
In this task, you will implement several data pre-processing and engineering steps that are common in data science and machine learning. These steps involve several key topics in statistics.
You are expected to learn the simple techniques required to finish this task (if you do not already know them). A video explaining the task further will be provided to assist you.
Data description: the dataset you will use for this task contains data about house sale prices. The file ‘data_description.txt’ contains a detailed description of all the variables: what they represent, their values and so on. The target variable is ‘SalePrice’, the house’s sale price in US dollars.
Here is a description of the steps you are asked to implement and their corresponding marks:
1. Import the required libraries.
2. Load the data using pandas and plot a histogram of the SalePrice column. This code is provided for you; do not change it.
3. The SalePrice column is not normally distributed (i.e. not Gaussian). Prove this by running a statistical test, then obtaining and interpreting the p-value (an illustrative test is sketched after this list). [5 marks]
4. Split the data into train and test sets, making sure the test set is 30% of the original data and the remaining 70% is used for training. This code is provided for you; do not change it.
5. Create a list of all categorical variables (by checking their type in the original dataset). [2 marks]
6. Using the training set (X_train), create a list of all categorical variables that contain missing data and print the percentage of missing values per variable in X_train. [3 marks]
7. Using the result of the previous step: for categorical variables with more than 10% of data missing, replace the missing data with the word ‘Missing’; for the other variables, replace the missing data with the most frequent category in the training set. Apply the replacement to both X_train and X_test, making sure it is based on the results obtained from the training set. [5 marks]
8. Create a list of all numerical variables (do not include SalePrice). [2 marks]
9. Create a list of all numerical variables that contain missing data and print out the percentage of missing values per variable (use the training data). [3 marks]
10. Using the result of the previous step: for numerical variables with less than 15% of data missing, replace the missing data with the mean of the variable; for the other variables, replace the missing data with the median of the variable in the training set. Apply the replacement to both X_train and X_test, making sure it is based on the results obtained from the training set (a sketch covering steps 5-10 appears after this list). [5 marks]
11. In the train and test sets, replace the values of the variables ‘YearBuilt’, ‘YearRemodAdd’ and ‘GarageYrBlt’ with the time elapsed between them and the year in which the house was sold (‘YrSold’). After that, drop the ‘YrSold’ column. [5 marks]
12. Apply mappings to the categorical variables that have an order (in total there should be 14 of them). Some of the categorical variables have values with an assigned order, related to quality (for more information, check the data description file). This means you can replace the categories with numbers that reflect quality. For example, values in ‘BsmtExposure’ can be mapped as follows: ‘No’ to 1, ‘Mn’ to 2, ‘Av’ to 3 and ‘Gd’ to 4 (a sketch covering steps 11 and 12 appears after this list).
One way of doing this is to manually create mappings similar to the example given. Each mapping can be saved as a Python dictionary and used to perform the actual mapping to transform the described variables from categorical to numerical.
To make it easier for you, here are groups of variables that have the same mappings (hint: you can map both the categories ‘Missing’ and ‘NA’ to 0):
- [‘ExterQual’, ‘ExterCond’, ‘BsmtQual’, ‘BsmtCond’, ‘HeatingQC’, ‘KitchenQual’, ‘FireplaceQu’, ‘GarageQual’, ‘GarageCond’]
- [‘BsmtFinType1’, ‘BsmtFinType2’]
Each of the following variables has its own mapping: ‘BsmtExposure’, ‘GarageFinish’, ‘Fence’. [5 marks]
13. Replace rare labels with ‘Rare’. For the remaining five categorical variables (those to which you did not apply value mappings), group the categories that are present in less than 1% of the observations in the training set. That is, all values of these categorical variables that are shared by fewer than 1% of houses in the training set should be replaced by the string ‘Rare’ in both the training and test sets. In other words, you need to find the rare labels in the remaining categorical variables and replace them with the category ‘Rare’. Remember: rare labels are those categories that appear in only a small percentage of the observations (in our case < 1%). Hint: if you look at the unique values of a categorical variable in the training set and count how many times each one appears, you can compute the percentage of each unique value by dividing its count by the total number of observations. Remember to make the computations using the training set and to apply the replacement in both the training and test sets. [5 marks]
14. Perform one-hot encoding to transform the previous five categorical variables into binary variables. Make sure you do it correctly for both the training and test sets. After this, remember to drop the original five categorical variables (the ones containing strings) from the training and test sets after the encoding. [5 marks]
15. Feature scaling. Now that all variables in our two datasets (i.e. the training and test sets) are numerical, the final step in this exercise is to apply scaling so that the minimum value of each variable is 0 and the maximum value is 1. For this step, you can use MinMaxScaler() from scikit-learn. Make sure you apply it correctly by transforming the test set based on the training set (a sketch covering steps 13-15 follows below). [5 marks]
After applying all the previous steps, the overall mean value of all entries in the training set was approximately 0.249 and in the test set was approximately 0.247.
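The sketches below are not part of the official brief; they illustrate one possible way of approaching some of the steps. For step 3, the brief does not prescribe a particular test, so this example uses D’Agostino’s K² test from SciPy (a Shapiro-Wilk test would work in much the same way) and assumes the data has already been loaded into a pandas DataFrame named df.

```python
# Illustrative only: a normality check for SalePrice (step 3).
# Assumes the data is already loaded into a pandas DataFrame named df.
from scipy import stats

# D'Agostino's K^2 test: the null hypothesis is that the sample
# was drawn from a normal distribution.
stat, p_value = stats.normaltest(df['SalePrice'])
print(f"statistic = {stat:.3f}, p-value = {p_value:.3g}")

# A p-value below the usual 0.05 threshold is evidence against normality,
# i.e. SalePrice is unlikely to be Gaussian.
if p_value < 0.05:
    print("Reject the null hypothesis: SalePrice does not look normally distributed.")
else:
    print("No evidence against normality at the 5% level.")
```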
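Steps 5-10 could be handled roughly as follows, assuming the provided split has produced DataFrames named X_train and X_test. The list names (cat_vars, cat_missing, num_vars, num_missing) are placeholders, not identifiers required by the brief.

```python
# Illustrative only: missing-data handling for steps 5-10.
# Step 5: categorical variables are those stored with an object dtype.
cat_vars = [col for col in X_train.columns if X_train[col].dtype == 'O']

# Step 6: categorical variables with missing data and their missing
# percentage (computed on the training set only).
cat_missing = [col for col in cat_vars if X_train[col].isnull().any()]
print(X_train[cat_missing].isnull().mean() * 100)

# Step 7: impute using statistics computed on X_train, apply to both sets.
for col in cat_missing:
    if X_train[col].isnull().mean() > 0.10:
        fill_value = 'Missing'
    else:
        fill_value = X_train[col].mode()[0]   # most frequent category in X_train
    X_train[col] = X_train[col].fillna(fill_value)
    X_test[col] = X_test[col].fillna(fill_value)

# Steps 8-9: numerical variables (excluding the target, in case it is still
# present) and the percentage of missing values per variable.
num_vars = [col for col in X_train.columns
            if X_train[col].dtype != 'O' and col != 'SalePrice']
num_missing = [col for col in num_vars if X_train[col].isnull().any()]
print(X_train[num_missing].isnull().mean() * 100)

# Step 10: mean imputation below 15% missing, median otherwise,
# always based on the training set.
for col in num_missing:
    if X_train[col].isnull().mean() < 0.15:
        fill_value = X_train[col].mean()
    else:
        fill_value = X_train[col].median()
    X_train[col] = X_train[col].fillna(fill_value)
    X_test[col] = X_test[col].fillna(fill_value)
```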
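Steps 11 and 12 might look like the sketch below. The dictionary values are illustrative: the exact category labels and the full set of 14 ordered variables should be checked against data_description.txt, and the remaining ordered variables (BsmtFinType1, BsmtFinType2, GarageFinish, Fence) need their own dictionaries built in the same way.

```python
# Illustrative only: elapsed-year features (step 11) and ordinal mappings (step 12).
for df_ in (X_train, X_test):
    # Step 11: replace the year columns with the time elapsed until the sale,
    # then drop YrSold.
    for col in ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']:
        df_[col] = df_['YrSold'] - df_[col]
    df_.drop(columns='YrSold', inplace=True)

# Step 12: example mappings; both 'Missing' and 'NA' map to 0, as per the hint.
# The quality labels (Po/Fa/TA/Gd/Ex) should be verified in data_description.txt.
qual_map = {'Missing': 0, 'NA': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
qual_vars = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'HeatingQC',
             'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond']
exposure_map = {'Missing': 0, 'NA': 0, 'No': 1, 'Mn': 2, 'Av': 3, 'Gd': 4}

for df_ in (X_train, X_test):
    for col in qual_vars:
        df_[col] = df_[col].map(qual_map)      # unmapped labels become NaN
    df_['BsmtExposure'] = df_['BsmtExposure'].map(exposure_map)
    # BsmtFinType1/2, GarageFinish and Fence need analogous dictionaries.
```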
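Steps 13-15 could then be completed along these lines. remaining_cat_vars is a placeholder for the five categorical variables that were not given ordinal mappings; pandas’ get_dummies drops the original string columns automatically, and the reindex call keeps the train and test columns aligned.

```python
# Illustrative only: rare labels (step 13), one-hot encoding (step 14),
# and min-max scaling (step 15).
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

for col in remaining_cat_vars:
    # Step 13: label frequencies computed on the training set only.
    freqs = X_train[col].value_counts(normalize=True)
    rare_labels = freqs[freqs < 0.01].index
    X_train[col] = X_train[col].where(~X_train[col].isin(rare_labels), 'Rare')
    X_test[col] = X_test[col].where(~X_test[col].isin(rare_labels), 'Rare')

# Step 14: one-hot encode, then align the test columns with the training
# columns so both sets contain the same dummy variables.
X_train = pd.get_dummies(X_train, columns=remaining_cat_vars)
X_test = pd.get_dummies(X_test, columns=remaining_cat_vars)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

# Step 15: fit the scaler on the training set, then transform both sets.
scaler = MinMaxScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train),
                       columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test),
                      columns=X_test.columns, index=X_test.index)

# Quick sanity check against the approximate mean values quoted above.
print(X_train.values.mean(), X_test.values.mean())
```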
Please refer to Appendix C of the Programme Regulations for detailed Assessment Criteria.
Plagiarism:
This is cheating. Do not be tempted and certainly do not succumb to temptation. Plagiarised copies are invariably rooted out and severe penalties apply. All assignment submissions are electronically tested for plagiarism.
Struggling with DSM030 Statistics & Data Mining Assignment at UOL?
MSc Data Science students often struggle with DSM030 Statistics and Statistical Data Mining assignments because the work can be complex and time-consuming. There is no need to worry, as Students Assignment Help provides expert data mining assignment help aligned with University of London marking criteria. To judge the quality for yourself, you can review our expert-written data science assignment samples. Order your DSM030 Assignment today through our statistics assignment helper and receive a 100% original, well-documented Jupyter notebook solution written only for you.



