Process a collection of email messages and create an R data frame of “derived” variables that give various measures of each message, e.g. the number of recipients to whom the mail was sent, the percentage of capital words in the body of the text, and whether the message is a reply to another message. See below for a list of all the variables, and also consider other variables you think might help classify a message as SPAM versus HAM. The messages are in 5 different directories/folders; the name of each directory indicates whether the messages it contains are HAM or SPAM. There are 6,541 messages in total, so this is a large amount of data.
Category: R
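A minimal R sketch of how such derived variables might be computed, assuming each message is a plain-text file whose header is separated from the body by a blank line; the folder layout, column names, and the helper name derive_vars are illustrative assumptions, not part of the assignment:
# Read one message and return a one-row data frame of three example derived variables.
derive_vars <- function(path) {
  lines  <- readLines(path, warn = FALSE)
  blank  <- which(lines == "")[1]                    # header ends at the first blank line
  header <- if (is.na(blank)) lines else lines[seq_len(blank - 1)]
  body   <- if (is.na(blank)) character(0) else lines[-seq_len(blank)]
  to_line <- grep("^To:", header, value = TRUE)[1]   # recipients listed on the To: line
  n_recip <- if (is.na(to_line)) NA else length(strsplit(to_line, ",")[[1]])
  words    <- unlist(strsplit(paste(body, collapse = " "), "\\s+"))
  words    <- words[grepl("[A-Za-z]", words)]        # keep only tokens containing letters
  pct_caps <- if (length(words)) 100 * mean(words == toupper(words)) else 0
  is_reply <- any(grepl("^(In-Reply-To|References):", header))   # reply headers present?
  data.frame(numRecipients = n_recip, pctCapitalWords = pct_caps, isReply = is_reply)
}
# Apply to every file under the message folders and stack into one data frame, e.g.:
# files   <- list.files("messages", recursive = TRUE, full.names = TRUE)
# emailDF <- do.call(rbind, lapply(files, derive_vars))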
This project is about over-fitting and is based on Chapter 6, Statistical Machine Learning, from ‘Practical Statistics for Data Scientists’.
Files needed:
P4p4F1.pdf
P4p4F2.pdf
P4p4F3.pdf
Cover in the project the following:
Explain the data from figure P4p4F1.pdf.
Explain the differences between parts (a) and (b) in figure P4p4F2.pdf.
Try to recreate with R or Octave, as closely as possible, the data from figure P4p4F1.pdf. Functions needed: runif (R) or rand (Octave) for the uniform distribution, and rnorm (R) or randn (Octave) for the normal distribution. Explain how you can recreate P4p4F1.pdf (one possible simulation is sketched after this list).
Compare and discuss my P4p4F3.pdf with the figure you created.
Based on P4p4F3.pdf, or on the data you created, explain how you would build a decision tree to classify ‘+’ and ‘o’ similarly to the way it was done in the left tree in P4p4F2.pdf.
In your opinion, why is it practical or useful to simulate data for classification?
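A minimal R sketch of one way such data could be simulated; since the exact structure of P4p4F1.pdf is not reproduced here, the predictors, class boundary, and noise level below are assumptions chosen only to illustrate the use of runif and rnorm:
set.seed(1)
n  <- 200
x1 <- runif(n, 0, 1)                       # uniform predictor
x2 <- runif(n, 0, 1)                       # uniform predictor
noise <- rnorm(n, 0, 0.1)                  # normal jitter so the two classes overlap
label <- ifelse(x2 + noise > x1, "+", "o") # class depends on which side of the boundary a point falls
plot(x1, x2, pch = ifelse(label == "+", 3, 1),
     xlab = "x1", ylab = "x2", main = "Simulated two-class data")
The analogous Octave calls are rand(n, 1) for the uniform draws and 0.1 * randn(n, 1) for the normal jitter.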
Continuing with the theme of hypothesis testing, this week we turn our attention to conducting tests for one sample, two paired samples, and two independent samples. To further develop our understanding of these tests, this assignment will focus on the application of these statistical techniques. You will select a dataset, conduct the appropriate tests, and share your findings.
Assignment Requirements:
Dataset Selection: Choose a dataset that allows for one-sample, paired two-sample, and independent two-sample tests. Briefly explain why you have chosen this dataset.
Hypothesis Formulation: Formulate hypotheses appropriate for one-sample, paired two-sample, and independent two-sample tests. Describe the hypotheses for each test clearly.
Execution of Tests: Perform the tests using Python or R, and document the steps you have taken. Be sure to include your code in your initial post (a minimal sketch of the three test calls follows this list).
Results Interpretation: Interpret the results of your tests. What do the results tell you about your dataset and the hypotheses you formulated?
Conclusions and Applications: Summarize your findings and discuss potential real-world applications of your conclusions.
Submission Format: Your submission should be a maximum of 500-600 words (excluding Python/R code). Submit your assignment in APA format as a Word document or a PDF file. Include your written analysis and any tables or visualizations that support your findings. If you used any software for your calculations (like R, Python, Excel), please include your code or formulas as well. Include an APA-formatted reference list for any external resources used.
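A minimal R sketch of the three test calls, using the built-in sleep and mtcars datasets purely as placeholders; your own dataset, variables, and hypotheses will differ:
# One-sample test: is the mean mpg in mtcars different from 20?
t.test(mtcars$mpg, mu = 20)
# Paired two-sample test: extra sleep under drug 1 vs. drug 2 (same ten subjects)
with(sleep, t.test(extra[group == 1], extra[group == 2], paired = TRUE))
# Independent two-sample test: mpg for automatic vs. manual transmission
t.test(mpg ~ am, data = mtcars)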
Provide, in plain text, the R commands that find/solve the following (a minimal sketch of the commands follows below):
The student directory for a large university has 400 pages with 130 names per page, a total of 52,000 names. Using software, show how to select a simple random sample of 10 names.
From the Murder data file, use the variable murder, which is the murder rate (per 100,000 population) for each state in the U.S. in 2017 according to the FBI Uniform Crime Reports. At first, do not use the observation for D.C. (DC). Using software:
Find the mean and standard deviation and interpret their values.
Find the five-number summary, and construct the corresponding boxplot.
Now include the observation for D.C. What is affected more by this outlier: the mean or the median?
The Houses data file lists the selling price (thousands of dollars), size (square feet), tax bill (dollars), number of bathrooms, number of bedrooms, and whether the house is new (1 = yes, 0 = no) for 100 home sales in Gainesville, Florida. Let’s analyze the selling prices.
Construct a frequency distribution and a histogram.
Find the percentage of observations that fall within one standard deviation of the mean.
Construct a boxplot.
Datasets needed: Murder data, https://stat4ds.rwth-aachen.de/data/Murder.dat and https://stat4ds.rwth-aachen.de/data/Murder2.dat
Houses data: https://stat4ds.rwth-aachen.de/data/Houses.dat
Useful functions in R to solve problems in this assignment: sample, read.table, mean, sd, summary, boxplot, hist, table, cbind, length, case, tapply
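A minimal sketch of the R commands involved; the column names used below (murder, price) and which of the two Murder files includes the D.C. row are assumptions taken from the problem text, so check names() and the data after loading:
sample(1:52000, 10)                                   # simple random sample of 10 names out of 52,000
Murder <- read.table("https://stat4ds.rwth-aachen.de/data/Murder.dat", header = TRUE)
mean(Murder$murder); sd(Murder$murder)                # mean and standard deviation
summary(Murder$murder)                                # five-number summary
boxplot(Murder$murder)
Murder2 <- read.table("https://stat4ds.rwth-aachen.de/data/Murder2.dat", header = TRUE)
mean(Murder2$murder); median(Murder2$murder)          # with D.C. included: compare mean vs. median
Houses <- read.table("https://stat4ds.rwth-aachen.de/data/Houses.dat", header = TRUE)
table(cut(Houses$price, breaks = 10))                 # frequency distribution of selling prices
hist(Houses$price)                                    # histogram
100 * mean(abs(Houses$price - mean(Houses$price)) <= sd(Houses$price))  # % within one SD of the mean
boxplot(Houses$price)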
Assignment Instructions:
Dataset Selection: Select a suitable dataset for performing a cluster analysis. Explain why you have chosen this specific dataset and what you hope to discover from this analysis.
Cluster Analysis: Perform a cluster analysis on your selected dataset. Document the steps you took and include the code you used for your analysis (a minimal sketch follows the submission format below).
Hierarchical and Non-Hierarchical Agglomeration Schedules: Discuss how you applied hierarchical and non-hierarchical agglomeration schedules in your cluster analysis. Explain the differences between these schedules and their impacts on the results of your analysis.
Results Interpretation: Interpret the results of your cluster analysis. Discuss the insights gained from this analysis and explain how the agglomeration schedules impacted your results.
Real-world Applications: Discuss how the insights from your cluster analysis could be applied in a real-world context. Explain the relevance and potential impact of these insights.
Submission Format: Your submission should be a maximum of 500-600 words (excluding Python/R code). Submit your assignment in APA format as a Word document or a PDF file. Include your written analysis and any tables or visualizations that support your findings. If you used any software for your calculations (like R, Python, Excel), please include your code or formulas as well. Include an APA-formatted reference list for any external resources used.
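A minimal R sketch of one hierarchical and one non-hierarchical (k-means) run, using the built-in USArrests data purely as a placeholder for your own dataset:
df <- scale(USArrests)                      # standardize so no single variable dominates the distances
d  <- dist(df)                              # Euclidean distance matrix
hc <- hclust(d, method = "ward.D2")         # hierarchical agglomeration
plot(hc)                                    # dendrogram showing the agglomeration schedule
hc_groups <- cutree(hc, k = 4)              # cut the tree into 4 clusters
set.seed(1)
km <- kmeans(df, centers = 4, nstart = 25)  # non-hierarchical (k-means) with 4 centers
table(hc_groups, km$cluster)                # cross-tabulate the two solutions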
Project 2: Decision making based on historical data
Attached Files:
I_1.jpeg (49.983 KB)
I_2.jpeg (48.819 KB)
dataG2.csv (149.28 KB)
This project reflects the basics of data distribution. The project topics relate to the definitions of variance and skewness.
Files needed for the project are attached.
Cover in the project the following:
Explain variance and skewness. Show a simple example of how to calculate variance and then explain its meaning.
Show a simple example of how to calculate skewness and then explain its meaning (a minimal sketch of both calculations follows at the end of this list).
After loading dataG2.csv into R or Octave, explain the meaning of each column, i.e. what the attributes describe. The columns are skewness, median, mean, standard deviation, and the last price (each row describes, with these numbers, the distribution of a stock's prices). Draw your own conclusions based on what you learned under 1. and 2.
Explain the meaning of the variables ‘I_1’ and ‘I_2’ after you execute the following (once dataG2.csv is loaded in R or Octave):
imported_data <- read.csv("dataG2.csv")
S = imported_data[,5] - imported_data[,3]
I_1 = which.min(S)   # use figure I_1 (see attached)
I_2 = which.max(S)   # use figure I_2 (see attached)
Based on the results in a., which row (stock) would you buy or sell, and why (if you believe history repeats)? Explain how you would use the skewness (first-column attribute) to decide about buying or selling a stock. If you want to decide, based on the historical data, which row (stock) to buy or sell, would you base your decision on the skewness attribute (1st column) or on the differences between the last prices and the mean (differences between the 5th and 3rd attributes)? Explain.
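A minimal R sketch of the hand calculations in items 1 and 2; the data vector is arbitrary, and the moment formula for skewness is one common definition (packages such as e1071 or moments provide equivalent functions):
x <- c(2, 4, 4, 4, 5, 5, 7, 9)                                   # arbitrary example data
v <- sum((x - mean(x))^2) / (length(x) - 1)                      # sample variance by hand
var(x)                                                           # R's built-in gives the same value
skew <- mean((x - mean(x))^3) / (mean((x - mean(x))^2))^(3/2)    # moment coefficient of skewness
skew                                                             # positive here: a longer right tail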
Data scientists conduct continual experiments. This process starts with a hypothesis. An experiment is designed to test the hypothesis, in such a way that it will hopefully deliver conclusive results. Data from a population are collected and analyzed, and then a conclusion is drawn. From your own experiences and reading:
Explain the two major problems with collecting samples. Is it possible to fix the problems you mentioned? If not, explain why that is so. If it is, explain how you would do it. To participate in the discussion, respond to the discussion prompt by Thursday at 11:59 PM EST. Then, read a selection of your colleagues’ postings. Finally, respond to at least two classmates by Sunday at 11:59 PM EST. I will post two classmates’ work later, and you will respond to both of them.
## Exercise
set.seed(54)
myts <- ts(c(rnorm(50, 34, 10), rnorm(67, 7, 1), runif(23, 3, 14)))
# 5. Plot the data and explain the statistical character of the data.
# 6. Use 80% of the data as the training set and the rest as the testing set. This ensures the forecast models
#    do not carry any information from the testing set (the remaining 20% of the data) reserved for accuracy analysis.
# 7. Set up three forecasting models using the training set.
# 8. Produce a plot with the three forecast models and add a legend. Which method looks more promising?
# 9. Perform an accuracy analysis to get the error measures and compare them; do the results match the
#    visual impression (plot the residuals)? If not, why?
# 10. Check the relevant statistical traits of the residuals: mean of zero, equal variance,
#     and (approximately) normal distribution.
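A minimal sketch of items 6-10, assuming Hyndman's forecast package is installed; the mean, naive, and simple-exponential-smoothing benchmarks are assumptions standing in for whatever three models you choose:
library(forecast)
set.seed(54)
myts  <- ts(c(rnorm(50, 34, 10), rnorm(67, 7, 1), runif(23, 3, 14)))
n     <- length(myts)                               # 140 observations
train <- window(myts, end = floor(0.8 * n))         # first 80% for fitting
test  <- window(myts, start = floor(0.8 * n) + 1)   # last 20% held out for accuracy analysis
h     <- length(test)
f1 <- meanf(train, h = h)                           # model 1: mean forecast
f2 <- naive(train, h = h)                           # model 2: naive (last-value) forecast
f3 <- ses(train, h = h)                             # model 3: simple exponential smoothing
autoplot(myts) +
  autolayer(f1, series = "Mean", PI = FALSE) +
  autolayer(f2, series = "Naive", PI = FALSE) +
  autolayer(f3, series = "SES", PI = FALSE)         # plot with legend
accuracy(f1, test); accuracy(f2, test); accuracy(f3, test)   # error measures on the test set
checkresiduals(f2)                                  # residual mean, variance, and normality diagnostics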
A store sells two types of toys, A and B. The store owner pays $8 for each unit of toy A and $14 for each unit of toy B. One unit of toy A yields a profit of $2, while a unit of toy B yields a profit of $3. The store owner estimates that no more than 2,000 toys will be sold every month, and he does not plan to invest more than $20,000 in inventory of these toys. How many units of each type of toy should be stocked in order to maximize his profit?
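A minimal R sketch, assuming the lpSolve package is available; the two constraints are total units (A + B <= 2000) and inventory dollars (8A + 14B <= 20000), with profit 2A + 3B to maximize. Add all.int = TRUE to lp() if only whole toys may be stocked.
library(lpSolve)
obj  <- c(2, 3)                  # profit per unit of toy A and toy B
cons <- rbind(c(1, 1),           # A + B    <= 2000 units sold per month
              c(8, 14))          # 8A + 14B <= 20000 dollars of inventory
dirs <- c("<=", "<=")
rhs  <- c(2000, 20000)
sol  <- lp("max", obj, cons, dirs, rhs)
sol$solution                     # optimal numbers of toy A and toy B
sol$objval                       # maximum monthly profit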
Exercise 2: transportation problem.