Business Intelligence – Coursework 1 (2021/22)
Unit Coursework 1
Weighting: 50%
Qualifying mark 30%
Description Show evidence of understanding of various Business Intelligence concepts through the implementation of clustering and forecasting algorithms on real datasets. Implementation is carried out in the R environment, and students are expected to evaluate their results critically.
Learning Outcomes Covered in this Assignment: This assignment contributes towards the following Learning Outcomes (LOs):
• LO3 review the recent business intelligence tools to carry out critical evaluation on methodologies and technologies available for information retrieval, pattern recognition and knowledge discovery;
• LO4 apply contemporary business intelligence technologies in order to enable users to view data patterns by deploying various tools;
Handed Out: 11/10/2021
Due date 17/11/21
Instructions for this coursework
During the marking period, all coursework assessments will be compared in order to detect possible cases of plagiarism/collusion. For each question, show all the steps of your work (code/results/discussion). In addition, please note that although clarifications on coursework questions can be provided during tutorials, the coursework itself has to be carried out outside tutorial sessions.
Coursework Description
Clustering Part
In this assignment, we consider a set of observations on a number of white wine varieties, covering their chemical properties and their ranking by tasters. The wine industry has seen a recent growth spurt as social drinking is on the rise. The price of wine depends, to some extent, on a rather abstract and volatile factor: wine appreciation by wine tasters. Another key factor in wine certification and quality assessment is physicochemical testing, which is laboratory-based and takes into account factors such as acidity, pH level, presence of sugar and other chemical properties. For the wine market, it would be of interest if the quality assigned by human tasters could be related to the chemical properties of wine, so that the certification, quality assessment and assurance process is more controlled.
One dataset (whitewine_v1.xls) is available; it covers white wine and contains 4873 varieties. All wines are produced in a particular area of Portugal. Data are collected on 12 different properties of the wines. One of these is Quality (i.e. the last column), which is based on sensory data; the rest are chemical properties of the wines, including density, acidity, alcohol content etc. All chemical properties of the wines are continuous variables. Quality is an ordinal variable with possible rankings from 1 (worst) to 10 (best). Each variety of wine is tasted by three independent tasters and the final rank assigned is the median rank given by the tasters.
Description of attributes:
1. fixed acidity: most acids involved with wine are fixed or non-volatile (do not evaporate readily)
2. volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste
3. citric acid: found in small quantities, citric acid can add ‘freshness’ and flavour to wines
4. residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/litre and wines with greater than 45 grams/litre are considered sweet
5. chlorides: the amount of salt in the wine
6. free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7. total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8. density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
9. pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10. sulphates: a wine additive which can contribute to sulphur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11. alcohol: the percent alcohol content of the wine
12. Output variable (based on sensory data): quality (score between 0 and 10)
For this clustering part you need to use the first 11 attributes for your clustering-based calculations. Do not attempt/apply any dimensionality reduction techniques.
1st Objective (partitioning clustering)
You need to conduct a k-means clustering analysis of this white wine dataset. As this is a typical problem that is multi-dimensional in terms of features, first provide a brief discussion of the methodologies used to reduce the dimensionality of this type of problem and the rationale for using them. (Suggestion: consult the related literature and add some relevant references.) In this specific clustering part, however, the analysis will be performed with all the initial features, as the main aim is to assess different clustering results under the initial conditions. Before conducting k-means, perform the following pre-processing tasks: scaling and outlier removal, and briefly justify your choices. (Suggestion: the order of scaling and outlier removal is important. Outlier removal is not covered in tutorials, so you need to explore it yourself.) As the provided dataset is not balanced (the number of samples per quality class, i.e. the 12th column, varies), you may also consider, before the scaling/outlier tasks, merging adjacent classes that have few samples, for example quality classes 7 and 8. Initially, the dataset contains 5 classes. If you perform such a “merging” task, please provide all details in your report; the final number of classes cannot be less than 3. Define the number of cluster centres (via manual and automated tools) and perform a k-means analysis for each attempt (i.e. each different k). For each of these k-means attempts, check the produced clustering against the information in the 12th column and provide the related results and discussion (evidence of a “confusion” matrix and calculation of the accuracy/recall/precision indices from it). Choose the best “winner” clustering case (justify your response) and briefly explain the meaning of the accuracy/recall/precision indices. Finally, for the “winner” case, provide the coordinates of the centre of each cluster.
Write code in RStudio to address all of the above issues (code, results and discussion need to be included in your report). At the end of your report, also provide, as an Appendix, the full code you developed. The usage of the kmeans R function is compulsory.
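The two pre-processing tasks named above can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are synthetic (not the wine dataset), the column names and the helper is_outlier are hypothetical, the 1.5 x IQR rule is just one common outlier criterion, and the order shown (removal, then scaling) is one possible choice that still needs to be justified in the report.

```r
# Synthetic stand-in for the chemical attributes (NOT the wine data).
set.seed(42)
features <- as.data.frame(matrix(rnorm(200 * 3), ncol = 3))
names(features) <- c("attr_a", "attr_b", "attr_c")  # hypothetical column names
features$attr_a[1:3] <- c(15, -12, 20)              # plant three obvious outliers

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for a single column.
is_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Keep only the rows in which no attribute is flagged.
keep <- rowSums(sapply(features, is_outlier)) == 0
cleaned <- features[keep, ]

# Scale after removal, so the planted outliers do not distort mean and sd.
scaled <- scale(cleaned)

cat("rows before:", nrow(features), " rows after:", nrow(cleaned), "\n")
```

Scaling first and filtering afterwards would give different column means and standard deviations, which is why the order matters.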
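The evaluation step (k-means output checked against a known class column) might look like the sketch below. It uses synthetic, well-separated data rather than the wine dataset, and the majority-vote relabelling of clusters is an assumption made here for illustration; other ways of matching clusters to classes are possible.

```r
set.seed(1)
# Synthetic data with a known 3-level label (NOT the wine data).
lvls <- c("low", "mid", "high")
truth <- rep(lvls, each = 50)
centres <- c(low = -5, mid = 0, high = 5)
x <- cbind(rnorm(150, mean = centres[truth]), rnorm(150, mean = centres[truth]))

fit <- kmeans(scale(x), centers = 3, nstart = 25)

# Cluster ids are arbitrary, so relabel each cluster with its majority class.
raw <- table(cluster = fit$cluster, actual = truth)
majority <- colnames(raw)[apply(raw, 1, which.max)]
predicted <- factor(majority[fit$cluster], levels = lvls)
actual <- factor(truth, levels = lvls)

# Confusion matrix: rows are predicted classes, columns are actual classes.
cm <- table(predicted = predicted, actual = actual)

accuracy  <- sum(diag(cm)) / sum(cm)
precision <- diag(cm) / rowSums(cm)  # per class: TP / (TP + FP)
recall    <- diag(cm) / colSums(cm)  # per class: TP / (TP + FN)

print(cm)
print(round(accuracy, 3))
print(fit$centers)                   # cluster centre coordinates (scaled units)
```

On real data two clusters may share the same majority class, in which case some rows of the confusion matrix are empty and the corresponding precision is undefined.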
(Marks 50)
Forecasting Part
Time series analysis can be used in a multitude of business applications for forecasting a quantity into the future and explaining its historical patterns. An exchange rate is the currency rate of one country expressed in terms of the currency of another country. In the modern world, the exchange rates of the most successful countries tend to be floating. Under this system, the rate is set by the foreign exchange market through supply and demand for that particular currency relative to other currencies. Exchange rate prediction is one of the challenging applications of modern time series forecasting and is very important for the success of many businesses and financial institutions. The rates are inherently noisy, non-stationary and deterministically chaotic. One general assumption made in such cases is that the historical data incorporate all of this behaviour. As a result, the historical data are the major input to the prediction process. Forecasting exchange rates poses many challenges, as the rates are influenced by many economic factors. Like other economic time series, exchange rates exhibit trend, cycle and irregularity. Classical time series analysis does not perform well on finance-related time series. Hence, the idea of applying Neural Networks (NN) to forecast exchange rates has been considered as an alternative solution. NNs try to emulate human learning capabilities, creating models that represent the neurons in the human brain.
In this forecasting part you need to use an MLP-NN to predict the next-step-ahead exchange rate of EUR/USD. Daily data (exchangeEUR20152016.xlsx) have been collected from February 2015 until September 2016 (400 data points). The first 300 of them have to be used as training data, and the remaining ones as the testing set. Use only the 3rd column from the .xlsx file, which corresponds to the exchange rates.
2nd Objective (MLP)
You need to construct an MLP neural network for this forecasting problem. The definition of the input vector for NNs is a very important component for time-series analysis. Therefore, initially you need to provide a brief discussion of the various schemes/methods used to define this input vector. (Suggestion: consult related literature and add some relevant references). In this specific forecasting part, however, we are going to utilise only the “autoregressive” (AR) approach, i.e. time-delayed exchange rates as input variables. As the order of this AR approach is not known, you need to experiment with various input vectors and for each one of these cases you need to construct an input/output matrix for the MLP (using “time-delayed” rates).
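As an illustration of the “time-delayed” input/output matrix mentioned above, the base R function embed can build the lagged columns. The helper name make_io_matrix, the column names and the toy values are assumptions made for this sketch; they are not taken from the exchange rate file.

```r
# Build an input/output matrix with `order` time-delayed inputs.
# `rates` is any numeric vector; `make_io_matrix` is a hypothetical helper name.
make_io_matrix <- function(rates, order) {
  m <- embed(rates, order + 1)           # each row: x[t], x[t-1], ..., x[t-order]
  m <- m[, (order + 1):1, drop = FALSE]  # reorder to x[t-order], ..., x[t-1], x[t]
  colnames(m) <- c(paste0("lag", order:1), "target")
  as.data.frame(m)
}

toy <- c(1.10, 1.12, 1.11, 1.13, 1.15, 1.14)  # made-up values, not real rates
io <- make_io_matrix(toy, order = 2)
print(io)
```

A series of n values yields n - order rows, so each choice of AR order gives a matrix of a slightly different size.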
Each one of these matrices needs to be normalised, as this is a standard procedure for an MLP NN. You need to explain briefly why the normalisation procedure is necessary for this specific type of NN. For the training phase, you need to experiment with various MLPs, utilising these input vectors and various internal network structures (such as hidden layers, nodes, learning rate, activation function, etc.). For each case, the testing performance (i.e. evaluation) of the network will be calculated using the standard statistical indices (RMSE, MAE and MAPE). Create a comparison table of their testing performances (using these specific statistical indices). Briefly explain the meaning of these three statistical indices. From this comparison table, check the “efficiency” of your best one-hidden-layer and two-hidden-layer networks by checking the total number of weight parameters per network. Briefly discuss which approach is preferable to you and why. Finally, for your best MLP network, provide the related results both graphically (your prediction output vs. desired output) and via the statistical indices. Write code in RStudio to address all these requirements. Show all your working steps (code and results, including comparison results from models with different input vectors and internal structures). As everyone will have different forecasting results, emphasis in the marking scheme will be given to the adopted methodology and to the explanation/justification of the various decisions you have taken in order to provide a solution that is acceptable in terms of performance. Full details of your results/code/discussion are needed in your report. At the end of your report, also provide, as an Appendix, the full code you developed. The usage of the neuralnet R function for MLP modelling is compulsory.
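The normalisation step, the three statistical indices and the weight count can be written as small base R helpers. This is a sketch under assumptions: min-max normalisation is only one possible scheme, all function names are hypothetical, the values are made up, and the weight count assumes a fully connected network with one bias term per non-input node.

```r
# Min-max normalisation to [0, 1] and its inverse; lo/hi are assumed to come
# from the training set so that the test set is scaled consistently.
normalise   <- function(x, lo, hi) (x - lo) / (hi - lo)
denormalise <- function(z, lo, hi) z * (hi - lo) + lo

# Error indices, computed on the original (de-normalised) scale.
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
mae  <- function(actual, pred) mean(abs(actual - pred))
mape <- function(actual, pred) mean(abs((actual - pred) / actual)) * 100

# Weights (including biases) of a fully connected MLP,
# given the layer sizes from input to output, e.g. c(3, 5, 1).
count_weights <- function(layers) {
  sum((head(layers, -1) + 1) * tail(layers, -1))
}

actual <- c(1.10, 1.12, 1.11, 1.13)  # made-up values for illustration
pred   <- c(1.11, 1.11, 1.12, 1.12)

z <- normalise(actual, min(actual), max(actual))
back <- denormalise(z, min(actual), max(actual))

print(c(RMSE = rmse(actual, pred), MAE = mae(actual, pred),
        MAPE = mape(actual, pred)))
print(count_weights(c(3, 5, 1)))     # 3 inputs, one hidden layer of 5 nodes
print(count_weights(c(3, 4, 3, 1)))  # 3 inputs, hidden layers of 4 and 3 nodes
```

Comparing the weight counts of two networks with similar test errors gives a simple basis for the “efficiency” discussion.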
(Marks 50)
Coursework Marking scheme
The Coursework will be marked based on the following marking criteria:
1st Objective (partitioning clustering)
• Brief discussion of methodologies used for reducing the input dimensionality 5
• Pre-processing tasks (3 marks for scaling and 7 marks for outliers removal) 10
• Define the number of cluster centres by showing all necessary steps/methods (via manual & automated tools) 7
• K-means analysis for each attempt (show all kmeans R-template outputs) 6
• Evaluation of the produced outputs against 12th column 9
• Define the final “winner” cluster case and provide brief explanation of evaluation indices (2 marks for winner and 6 marks for indices) 8
• Illustrate the coordinates of each centre for each clustering group 5
2nd Objective (MLP)
• Brief discussion of the various methods used for defining the input vector in time-series problems (provide relevant references in the text) 5
• Evidence of various adopted input vectors and the related input/output matrices 5
• Evidence of correct normalisation (5 marks) and brief discussion of its necessity (3 marks) 8
• Implement a number of MLPs, using various structures (layers/nodes) / input parameters / network parameters and show in a table their performances comparison (based on testing data) through the provided stat. indices (4 marks for structures with different input vectors, 8 marks for different internal NN structures/parameters and 4 marks for the comparison table) 16
• Discussion of the meaning of these stat. indices 6
• Discuss the issue of “efficiency” with your two best NN structures 4
• Provide your best results both graphically (your prediction output vs. desired output) and via performance indices (3 marks for the graphical display and 3 marks for showing the requested statistical indices) 6