1. Suppose there is a population of 1000 people and 500 of them have already adopted a new behavior. In the next time period, how many will begin the behavior if there is a constant hazard of .5? What about if the hazard is .5 * current adoption base? Show your work.
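A sketch of the arithmetic in R, assuming the hazard applies to the 500 people who have not yet adopted and that "current adoption base" means the fraction of the population that has already adopted (both readings are assumptions about the question's intent):

at_risk <- 1000 - 500              # people who have not yet adopted

# Constant hazard of 0.5: expected new adopters this period
0.5 * at_risk                      # 250

# Hazard proportional to the current adoption base: 0.5 * (500/1000) = 0.25
(0.5 * 500 / 1000) * at_risk       # 125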
2. Draw a CDF (cumulative distribution over time) graph for internal influence and another for external influence. (No numbers, just the general shape). Label which of the graphs reflects a constant hazard.
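A sketch of the two shapes in R, using an exponential CDF for external influence (constant hazard) and a logistic curve for internal influence; the rate parameters are arbitrary and only the shapes matter:

t <- seq(0, 10, by = 0.1)
ext <- 1 - exp(-0.5 * t)                 # constant hazard: concave, no inflection point
int <- 1 / (1 + exp(-1.5 * (t - 5)))     # internal influence: S-shaped curve

plot(t, ext, type = "l", ylim = c(0, 1),
     xlab = "Time", ylab = "Cumulative proportion adopted")
lines(t, int, lty = 2)
legend("bottomright", lty = c(1, 2),
       legend = c("External (constant hazard)", "Internal"))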
3. (Two points) Open the NetLogo model “epiDEM Basic,” which simulates an S-I-R (Susceptible-Infected-Recovered) diffusion model.
https://www.netlogoweb.org/launch#https://www.netl…
Set it to 400 people, 20% infection-chance, 30% recovery-chance, and average recovery time of 100. Let it run for about 100 simulated hours (this will only take a few seconds of real time). Now examine the “Cumulative Infected and Recovered” and the “Infection and Recovery Rates” data. Note that NetLogo draws these graphs too flat to read, so you will probably want to click the three horizontal lines and either “download CSV” or “view full screen.” Include a copy of the graph. (Note the “download PNG” button just gives you the smushed graph so you might want to screenshot). What does the shape of the “% infected” line on the “cumulative infected and recovered” graph suggest about internal influence vs external influence?
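This is not a substitute for running the NetLogo model, but a minimal discrete-time S-I-R sketch in R that shows the kind of S-shaped cumulative-infection curve the question points at; the transmission and recovery rates here are illustrative, not the epiDEM settings:

N <- 400; S <- 399; I <- 1; R <- 0
beta <- 0.4; gamma <- 0.05                  # illustrative rates
cum_infected <- numeric(100)
for (t in 1:100) {
  new_inf <- beta * S * I / N               # new infections require contact with the infected
  new_rec <- gamma * I
  S <- S - new_inf
  I <- I + new_inf - new_rec
  R <- R + new_rec
  cum_infected[t] <- (N - S) / N
}
plot(1:100, cum_infected, type = "l",
     xlab = "Hour", ylab = "Cumulative proportion ever infected")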
4. What is the R0? Read the “model info” tab to learn what “R0” means and then explain it in your own words. (If you use a source besides the “model info” tab please cite it). Given the parameters in question #3, after 20 hours R0 is probably around 5.5. Play around with the “infection-chance,” “recovery-chance,” and “average-recovery-time” sliders. Include a screenshot if you can find a combination of parameters that gets R0 below 5 after 50 hours.
5. Consider Granovetter’s threshold model of collective behavior and decide whether each of the following assumptions about a population of 500 would be consistent with frequent riots:
* a uniform distribution of rioting thresholds from 0 to 100
* a normal distribution of rioting thresholds with a mean of 10 and a standard deviation of 2.
* a normal distribution of rioting thresholds with a mean of 12 and a standard deviation of 4.
* a Poisson distribution with a mean of 10
Explain the model and use it to justify your answer.
(If your stats knowledge is too rusty to visualize what these distributions look like, see the attached PDF; a simulation sketch also follows below.)
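A minimal cascade sketch, assuming each person's threshold is the whole number of people already rioting that they need to see before joining; the draws are rounded to integers and negative draws are treated as zero, all purely for illustration:

set.seed(1)

cascade_size <- function(thresholds) {
  thresholds <- pmax(0, round(thresholds))   # treat thresholds as whole numbers of prior rioters
  rioting <- 0
  repeat {
    joined <- sum(thresholds <= rioting)     # everyone whose threshold is currently met
    if (joined == rioting) break
    rioting <- joined
  }
  rioting
}

n <- 500
cascade_size(runif(n, min = 0, max = 100))   # uniform thresholds from 0 to 100
cascade_size(rnorm(n, mean = 10, sd = 2))    # normal, mean 10, sd 2
cascade_size(rnorm(n, mean = 12, sd = 4))    # normal, mean 12, sd 4
cascade_size(rpois(n, lambda = 10))          # Poisson, mean 10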
6. According to Rossman and Fisher’s simulation, under what conditions does it matter if an innovation starts with the most central person in a network?
7. In Centola’s model, would a “simple contagion” spread faster in a pure ring lattice or in a Watts-Strogatz network with 2% rewiring? Why? How about a “complex contagion”? Why?
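A sketch of the structural contrast behind the answer, using igraph to compare a pure ring lattice with a 2%-rewired Watts-Strogatz network; the network size and degree are illustrative choices:

library(igraph)
set.seed(1)

n <- 1000; nei <- 5   # 1000 nodes, each tied to its 5 neighbors on either side

lattice    <- sample_smallworld(dim = 1, size = n, nei = nei, p = 0)     # pure ring lattice
smallworld <- sample_smallworld(dim = 1, size = n, nei = nei, p = 0.02)  # 2% rewiring

# Rewired shortcuts collapse the average path length, which speeds a simple contagion
mean_distance(lattice)
mean_distance(smallworld)

# But rewiring erodes the clustered, redundant ties a complex contagion needs for reinforcement
transitivity(lattice)
transitivity(smallworld)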

Posted in R

Instructions
1. Provide the code that parallelizes the following:

library(MKinfer) # Load package used for permutation t-test

# Create a function for running the simulation:
simulate_type_I <- function(n1, n2, distr, level = 0.05, B = 999,
                            alternative = "two.sided", ...) {
  # Create a data frame to store the results in:
  p_values <- data.frame(p_t_test = rep(NA, B),
                         p_perm_t_test = rep(NA, B),
                         p_wilcoxon = rep(NA, B))

  for (i in 1:B) {
    # Generate data:
    x <- distr(n1, ...)
    y <- distr(n2, ...)

    # Compute p-values:
    p_values[i, 1] <- t.test(x, y, alternative = alternative)$p.value
    p_values[i, 2] <- perm.t.test(x, y, alternative = alternative,
                                  R = 999)$perm.p.value
    p_values[i, 3] <- wilcox.test(x, y, alternative = alternative)$p.value
  }

  # Return the type I error rates:
  return(colMeans(p_values < level))
}

2. Provide the code that runs the following code in parallel with 4 workers (with mclapply):

lapply(airquality, function(x) {
  (x - mean(x)) / sd(x)
})
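A minimal sketch of one way to answer both, assuming a Unix-alike machine where parallel::mclapply can fork worker processes (on Windows, a PSOCK cluster with parLapply would be needed instead):

library(parallel)
library(MKinfer)

# 1. Parallelize the B simulation replicates across 4 workers:
simulate_type_I_par <- function(n1, n2, distr, level = 0.05, B = 999,
                                alternative = "two.sided", cores = 4, ...) {
  extra_args <- list(...)   # extra arguments to pass on to distr()

  one_rep <- function(i) {
    x <- do.call(distr, c(list(n1), extra_args))
    y <- do.call(distr, c(list(n2), extra_args))
    c(p_t_test      = t.test(x, y, alternative = alternative)$p.value,
      p_perm_t_test = perm.t.test(x, y, alternative = alternative,
                                  R = 999)$perm.p.value,
      p_wilcoxon    = wilcox.test(x, y, alternative = alternative)$p.value)
  }

  p_values <- do.call(rbind, mclapply(seq_len(B), one_rep, mc.cores = cores))
  colMeans(p_values < level)
}

# Example call (arguments are illustrative):
# simulate_type_I_par(n1 = 20, n2 = 20, distr = rnorm, cores = 4)

# 2. Standardize the airquality columns in parallel with 4 workers:
mclapply(airquality, function(x) {
  (x - mean(x)) / sd(x)
}, mc.cores = 4)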

Posted in R

Download the dataset here for this question.
The data set contains information on sales of 1oz gold coins on eBay. Further details will be available in the key after the exam ends. The file contains the following variables:
DATE: date of the sale
SALE: final selling price of the coin
GOLDPRICE: price of gold, one ounce, at the end of trading on the date of the sale, or, if the sale is on a weekend or holiday, the end of the previous day of trading.
BIDS: the number of bids submitted for the auction (these eBay sales were in an auction format)
TYPE: E for Eagle (a US coin), KR for Krugerrand (a South African coin), and ML for Maple Leaf (a Canadian coin).
SHIPPING: cost of shipping; this is an additional fee the buyer must pay so that SALE+SHIPPING is the total cost to the buyer.
SLABBER: P for PCGS, N for NGC or U for not slabbed; a slabbed coin is a coin inside a tamper proof holder that also indicates the coin’s condition or grade.
GRADE: the grade of slabbed coins. If SLABBER=’U’ then this is 0.
other: additional characteristics of slabbed coins are noted here; for example, FD means the “slab” or coin holder notes that the coin was minted on the first day of minting, and FDIFLAG means that it is labeled as first day of issue and the holder has an image of a flag on it.
a) What is the average for SALE?
Choices: 1915, 1651, 1654, 1930.
b) What is the maximum for BIDS?
Choices: 55, 60, 57, 62.
c) Create a boxplot of SALE. You should see that there are 3 (three) outliers. Look at those three observations and choose the correct statement. (i) the observations either have only 1 bid or other=”BURNISHED”, (ii) the observations all have SLABBER=”P”, (iii) the observation(s) with low value(s) for SALE has/have only 1 or 2 bids while the observation(s) with high value(s) for SALE has/have numbers of bids near the maximum, say within 5 of the maximum, (iv) the observations either have other=”ME” or “LD”.
Choices: (ii), (iii), (i), (iv).
d) In R type the following command, table(yourdataset$other), where yourdataset is the name you gave to the dataset with the ebay coin sales. This will produce a table showing the value for “other” and the number of observations which have that value. For example, it will show the value “0” and under that the number 20, meaning that there are 20 observations where other is 0 and then it will show BURNISHED and under that a 2, meaning there are 2 coins where other is BURNISHED. How many observations are there where other is “LD”?
Choices: 4, 6, 1, 2.
e) Run a regression where SALE is the dependent variable and GOLDPRICE, BIDS, and SHIPPING are the explanatory variables. Consider the following statements and select which ones are correct (1) although there is little explanatory power the model is basically a good model (2) the model has minimal explanatory power (3) none of the independent variables have statistically significant coefficients at standard levels of significance (4) at least 1 of the estimated coefficients has the wrong sign, (5) some combination of items (2), (3) and (4) suggest this is not a good model.
Choices: (1), (2) and (3); (1) and (3); (2) and (4); (1) and (2); (2) and (3); (2), (3), (4) and (5).
f) Run a regression where SALE is the dependent variable and GOLDPRICE, BIDS, SHIPPING and a set of dummy variables for the values of other are the explanatory variables. NOTE: remove the observation where other=”ME” since there is only one such observation. Because there is only one observation with “ME”, it will have a residual of 0 since the “ME” dummy will perfectly explain why it is different from all other observations. This means that your regression is run with only 46 observations and you should see the df for the F statistic being 10 and 35.
What is the R2 value for this regression?
Choices: 0.6268, 0.6801, 0.431, 0.5527, 0.3496.
g) Using this model what is the expected value for SALE for an auction with a gold price of $1650, 5 bids, free shipping (SHIPPING=0), and other= FDIFLAG?
Choices: $1988, $1945, $1956, $1919, $1972.
h) Is the coefficient on FDIFLAG statistically significant at the 0.05 level?
Choices: NO, YES.
i) Examine the residual plots. Find the observation with the largest absolute residual and the observation with the largest Cook’s Distance. Identify the correct statement. (i) the observation with the largest absolute residual is an outlier in the residual space and this is due to an extremely low sale price which might relate to only receiving one bid (ii) the observation with the largest Cook’s Distance is influential and has high leverage which might be because it has an unusual grade, GRADE, for a slabbed coin (iii) the observation with the largest absolute residual is an outlier in the residual space and this is due to an extremely high sale price which might relate to the unusually high price for gold at the time of the sale (iv) the value for the largest absolute residual is not an outlier and the largest value for Cook’s Distance does not qualify as being influential.
Choices: (i), (iii), (ii), (iv).
j) Remove the observations or observation from part (i) that had the largest absolute residual and the largest Cook’s Distance. If those are the same observation then remove only one observation. If they are different then remove them both, i.e., two observations. With this smaller dataset (which also has other=”ME” removed from before) regress SALE on GOLDPRICE, BIDS, SHIPPING and a set of dummy variables for the values of other. The estimated coefficient on GOLDPRICE is
Choices: 2.5083, 1.5076, 2.1763, 1.763, 2.0756.
k) Using the most recent model, from part (j), test the hypothesis that the coefficient on GOLDPRICE is 1. The t test statistic for this test is
Choices: 1.232, 1.733, 0.833, 1.497.
l) Examine the model results, from part (j). Based on these results, if you were auctioning off a gold coin to maximize your revenue, would you rather offer free shipping or would you rather charge $7.5 for shipping?
Choices: Offer free shipping; It doesn’t appear to matter; Charge $7.5 for shipping.
m) Again, using the model from part (j), test whether the errors have constant variance using the test covered in the lectures. What is the p-value?
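A minimal R workflow sketch for these tasks, assuming the file has been saved as coins.csv; the file name is an assumption, the answers are left to the reader, and the constant-variance test is assumed to be the Breusch-Pagan test from lmtest:

coins <- read.csv("coins.csv")                 # assumed file name

mean(coins$SALE)                               # (a) average sale price
max(coins$BIDS)                                # (b) maximum number of bids
boxplot(coins$SALE)                            # (c) boxplot to spot the outliers
table(coins$other)                             # (d) counts for each value of "other"

# (e) regression of SALE on GOLDPRICE, BIDS and SHIPPING
m1 <- lm(SALE ~ GOLDPRICE + BIDS + SHIPPING, data = coins)
summary(m1)

# (f) drop the single other = "ME" sale and add dummies for the values of other
coins2 <- subset(coins, other != "ME")
coins2$other <- droplevels(factor(coins2$other))
m2 <- lm(SALE ~ GOLDPRICE + BIDS + SHIPPING + other, data = coins2)
summary(m2)                                    # R-squared for (f); FDIFLAG p-value for (h)

# (g) expected SALE for GOLDPRICE = 1650, 5 bids, free shipping, other = FDIFLAG
predict(m2, newdata = data.frame(GOLDPRICE = 1650, BIDS = 5,
                                 SHIPPING = 0, other = "FDIFLAG"))

# (i) largest absolute residual and largest Cook's distance
plot(m2)                                       # standard residual diagnostic plots
which.max(abs(resid(m2)))
which.max(cooks.distance(m2))

# (j) refit without the flagged observation(s)
# (positions match rows of coins2 as long as no rows were dropped for missing values)
drop_rows <- unique(c(which.max(abs(resid(m2))), which.max(cooks.distance(m2))))
coins3 <- coins2[-drop_rows, ]
m3 <- lm(SALE ~ GOLDPRICE + BIDS + SHIPPING + other, data = coins3)
coef(m3)["GOLDPRICE"]

# (k) t statistic for H0: the coefficient on GOLDPRICE equals 1
est <- coef(summary(m3))["GOLDPRICE", ]
(est["Estimate"] - 1) / est["Std. Error"]

# (m) Breusch-Pagan test for constant error variance
library(lmtest)
bptest(m3)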

Posted in R

Review The Power of Good Design and select three of the ten principles noted for good design. Next, in R, apply these three principles to a problem that you will solve. First note the problem to solve, the dataset (where the information was pulled from), and the methods you will use to solve the problem. Ensure the problem is simple enough to complete within a two-page document. For example: I need to purchase a house and want to know what my options are, given a certain budget and location, based on a sample of Zillow data for each location.
Ensure there is data visualization in the homework and note how it relates to the three principles selected.
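A minimal sketch of the kind of visualization that could anchor the write-up; the data frame and its values are invented purely for illustration:

library(ggplot2)

# Hypothetical sample of house listings (invented values)
listings <- data.frame(
  location = rep(c("Austin", "Denver", "Raleigh"), each = 4),
  price    = c(450, 520, 610, 700, 480, 560, 640, 720, 320, 380, 450, 510) * 1000
)

# Asking prices by location, with a dashed line at a $500k budget
ggplot(listings, aes(x = location, y = price)) +
  geom_boxplot() +
  geom_hline(yintercept = 500000, linetype = "dashed") +
  labs(title = "Asking prices by location against a $500k budget",
       x = "Location", y = "Asking price (USD)")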

Posted in R

Simple Linear Regression and Predictive Modeling
The data for Assignment #10 is the Nutrition Study data. It is a 16-variable dataset with n = 315 records that you have seen and worked with previously. The data was obtained from medical record information and observational self-report of adults. The dataset consists of categorical, continuous, and composite scores of different types. A data dictionary is not available for this dataset, but the qualities measured can easily be inferred from the variable and category names for most of the variables. As such, higher scores for the composite variables translate into having more of that quality. The QUETELET variable is essentially a body mass index. It can be googled for more detailed information. It is the ratio of body weight (in lbs) divided by height (in inches) squared, and the ratio is then multiplied by an adjustment factor (703, in the conventional BMI formula) so that the numbers become meaningful. Specifically, a QUETELET above 25 is considered overweight, while a QUETELET above 30 is considered obese. There is no other information available about this data.
1) Download the Nutrition Data and read it into R-Studio. We will work with the entire data set for this assignment.
2) There are 11 variables that are clearly continuous variables. For this assignment, you should consider the Quetelet variable to be the dependent response variable (Y). All other continuous variables should be considered independent or explanatory variables. Make a scatter plot of each continuous variable (X) with Y. You should have 10 different scatterplots. Obtain Pearson product-moment correlations for each X variable with Y. You can do this in table form or individually; it does not matter. Still, combine the scatterplot with the correlation information and discuss the appropriateness of simple linear regression for each scatterplot. Which variable seems most predictive of Quetelet (Y)?
3) Oftentimes, the explanatory variables are correlated amongst themselves. Obtain a standard correlation matrix for all of the explanatory variables. Then, obtain a heat map of the correlations (see the correlation classroom for an example of this). Are there groups, or subsets, of explanatory variables that seem to clump together in that they are highly correlated amongst themselves?
4) Use the explanatory variable that is most highly correlated with Y and fit a simple linear regression model. Call this Model 1. Report the prediction equation for Model 1, interpret the coefficients, and report the R-squared statistic as a measure of goodness of fit. Set up and report the results of the hypothesis test for the slope parameter (beta1).
5) Pick one of the remaining explanatory variables. Add that variable into the regression Model 1 from task 4). Re-fit the linear regression model (note, it is now a multiple regression model – why?). Call this Model 2. Report the prediction equation for Model 2, interpret the coefficients, report and interpret the R-squared statistic. How much has R-squared changed from Model 1 to Model 2? What is this change in R-squared uniquely attributable to? Does this change seem to have a practical meaning or value? Discuss.
6) For the remainder of the explanatory variables, add them into Model 2 one at a time so that the model becomes 1 variable larger at each step. Note the R-squared value and the change in R-squared between each subsequent model. Which explanatory variables seem to contribute a lot (or a practical amount) to predicting Y, and which explanatory variables contribute little or nothing?
7) Re-fit a multiple regression model using only those explanatory variables from task 6 that seem to contribute a lot or a practical amount to predicting Y. Call this the Final Model. Report the prediction equation for the final model, interpret the coefficients, and report the R-squared statistic. Does this model seem to be meaningful, in a larger medical scope of things, for predicting Quetelet? Remember, a regression model is also information about the relationships between variables, so it should have meaning and be part of the data's story. Discuss. Is this modeling done? Or is there something else you would want to do to model this data? Write up your synthesis description of what this data set seems to be saying (up to this point) and where we should go from here. (A minimal R sketch of the core modeling steps appears below.)
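A minimal sketch of the core steps, assuming the file is saved as nutrition.csv and the columns are named Quetelet, Fat, Calories, and so on; the file and column names are placeholders to adapt to the actual data:

nut <- read.csv("nutrition.csv")                 # assumed file name

# Correlate each continuous X with Quetelet (Y)
num_cols <- setdiff(names(Filter(is.numeric, nut)), "Quetelet")
cors <- sapply(num_cols, function(v) cor(nut[[v]], nut$Quetelet))
sort(cors, decreasing = TRUE)

# Scatterplot of one X against Y (repeat for each continuous X)
plot(nut$Calories, nut$Quetelet, xlab = "Calories", ylab = "Quetelet")

# Correlation matrix and a heat map of the explanatory variables
cm <- cor(nut[, num_cols])
heatmap(cm, symm = TRUE)

# Model 1: simple regression on the X most correlated with Y (Fat is a placeholder)
model1 <- lm(Quetelet ~ Fat, data = nut)
summary(model1)

# Model 2: add one more explanatory variable and note the change in R-squared
model2 <- lm(Quetelet ~ Fat + Calories, data = nut)
summary(model2)$r.squared - summary(model1)$r.squared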

Posted in R

For this assignment, you will be using the Framingham Heart Study Data. The Framingham Heart Study is a long-term prospective study of the etiology of cardiovascular disease among a population of subjects in the community of Framingham, Massachusetts. The Framingham Heart Study was a landmark study in epidemiology in that it was the first prospective study of cardiovascular disease and identified the concept of risk factors and their joint effects. We will be using this original data.
As you look over the Framingham Heart Study data and data dictionary to familiarize yourself with this data, you will notice that the study had a longitudinal design. This means that there were multiple observations on the same individuals at different points in time. You will notice variables with the same name, but with 1, 2 or 3 at the end of the name. These numbers indicate the data collection time points. For this assignment, we will only be using the primary variables and the variables at time point 1. Because of this, we can create an analysis file by retaining only the variables we want and removing the variables we do not need. This will make the data file easier to work with.
To reduce the dataset to a more manageable size, open the Framingham Heart Study data in EXCEL. Remove all variables that have a name that ends in a ‘2’ or ‘3’. Variables like sex2, sex3, age2, age3, etc. should all be removed. In EXCEL, you can simply highlight the variables you do not want and delete them. Next, remove all variables whose names start with “TIME”, such as TIMEAP, TIMEMI, etc.
Save your reduced data file to your computer using a different filename. Call this reduced dataset something like FHS_assign7.xlsx.
Check the records to see if there are missing values. Delete records with missing values. Re-save your dataset.
Read your new analysis file into R. You are good to go.
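If you prefer to do the reduction in R rather than EXCEL, a minimal sketch, assuming the full file has already been read into a data frame called fhs:

# Drop variables whose names end in 2 or 3, or begin with TIME
keep <- !grepl("[23]$", names(fhs)) & !grepl("^TIME", names(fhs))
fhs_small <- fhs[, keep]

# Drop records with any missing values, then save the reduced file
fhs_small <- na.omit(fhs_small)
write.csv(fhs_small, "FHS_assign7.csv", row.names = FALSE)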
ASSIGNMENT TASKS
Part A – Mechanics (25 points)
For this analysis, the variable “stroke” should be considered the response variable (Y) and the “diabetes1” variable should be considered the explanatory variable (X). Complete the following:
1) Construct a side-by-side bar graph to compare these two categorical variables. Describe what you see in this graph. Be sure to label the axes and give titles to the graph.
2) Construct a contingency table, complete with marginal row and column totals, for these two variables, then answer the following:
a) What is the conditional probability of having a stroke given diabetes is present at time 1? What is the conditional probability of having a stroke given diabetes is NOT present at time 1?
b) What are the odds of having a stroke if diabetes is present at time 1? What are the odds of having a stroke if diabetes is not present at time 1?
c) Calculate the odds ratio of having a stroke when diabetes is present relative to when it is not. Interpret this result.
d) Specify the null and alternative hypotheses, and then conduct a hypothesis test to see if diabetes is related to having a stroke. Interpret the results.
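A minimal sketch of Part A, assuming the reduced data frame is called fhs_small (as in the earlier sketch) and that STROKE and DIABETES1 are 0/1 indicators; the exact variable names and coding are assumptions:

# Side-by-side bar graph
counts <- table(fhs_small$DIABETES1, fhs_small$STROKE)
barplot(counts, beside = TRUE,
        legend.text = c("No diabetes", "Diabetes"),
        xlab = "Stroke (0 = no, 1 = yes)", ylab = "Count",
        main = "Stroke by diabetes status at time 1")

# Contingency table with marginal totals
addmargins(counts)

# (a) conditional probabilities of stroke within each diabetes group
prop.table(counts, margin = 1)

# (b) odds of stroke within each diabetes group, and (c) the odds ratio
p_stroke <- prop.table(counts, margin = 1)[, "1"]
odds <- p_stroke / (1 - p_stroke)
odds
odds["1"] / odds["0"]

# (d) chi-square test of association between diabetes and stroke
chisq.test(counts)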
Part B – Open Ended Analysis (75 points)
3) In professional practice, when you have an observational dataset like the Framingham Heart Study data, one is typically looking for risk factors; in other words, explanatory variables that are related to specific response variables of interest. For this last task, you will identify and work with only categorical explanatory variables. The response variables of interest are ANYCHD, STROKE, and DEATH. What categorical explanatory variables seem to indicate elevated risk of Coronary Heart Disease, Stroke, or Death? Conduct an analysis. Report and interpret your results.
4) Which of the continuous explanatory variables do you think is most likely indicative of elevated risk of Coronary Heart Disease (ANYCHD), Stroke, or Death? Pick one such variable. Create a new variable that maps the continuous variable’s values into a categorical variable with at least 3 levels. Conduct contingency table analyses relating this newly created categorical variable to ANYCHD, STROKE and DEATH. These analyses should be done separately. In other words, you will have at least 3 separate contingency tables. Do NOT attempt to have multiple-dimension contingency tables! Report on the results of your analysis and discuss the results.
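A minimal sketch of one way to bin a continuous variable, assuming systolic blood pressure at time 1 is named SYSBP1; the variable name and cut points are assumptions:

fhs_small$bp_cat <- cut(fhs_small$SYSBP1,
                        breaks = c(0, 120, 140, Inf),
                        labels = c("normal", "elevated", "high"))

# One contingency table per response variable, analyzed separately
table(fhs_small$bp_cat, fhs_small$ANYCHD)
chisq.test(table(fhs_small$bp_cat, fhs_small$ANYCHD))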
5) Reflect on your experiences here. What are your recommendations for future analysis? Congratulations! You have completed Assignment 8. Please save your R code, because you can re-use or cannibalize this code in future assignments. Your write-up should address each task.

Posted in R

Comparison of Multiple Groups via ANOVA
1) Download the Nutrition study data and read it into R-Studio. We will work with the entire data set for this assignment. Use the IFELSE( ) function to create 2 new categorical variables. The variables should be defined as:
Age_Cat =
1 if Age <= 19
2 if 20 <= Age <= 29
3 if 30 <= Age <= 39
4 if 40 <= Age <= 49
5 if 50 <= Age <= 59
6 if 60 <= Age <= 69
7 if Age >= 70
and,
Alcohol_Cat =
0 if Alcohol = 0
1 if 0 < Alcohol < 10
2 if Alcohol >= 10
If you have trouble using the IFELSE( ) function in R, you could create these new categorical variables in EXCEL, and then just read them into R with the dataset. It works either way.
Report the counts for each value of these 2 new categorical variables.
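A minimal sketch of one way to create the two variables and report the counts, assuming the data frame is called nut and the columns are named Age and Alcohol (adjust to the actual names in the file):

nut$Age_Cat <- ifelse(nut$Age <= 19, 1,
               ifelse(nut$Age <= 29, 2,
               ifelse(nut$Age <= 39, 3,
               ifelse(nut$Age <= 49, 4,
               ifelse(nut$Age <= 59, 5,
               ifelse(nut$Age <= 69, 6, 7))))))

nut$Alcohol_Cat <- ifelse(nut$Alcohol == 0, 0,
                   ifelse(nut$Alcohol < 10, 1, 2))

# Counts for each level of the new variables
table(nut$Age_Cat)
table(nut$Alcohol_Cat)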
2) Using the variable Quetelet as the dependent response variable (Y), specify the null and alternative hypotheses and conduct a one-way ANOVA F-test to check for mean differences across the levels of the Age_Cat variable, and a separate ANOVA for the Alcohol_Cat variable. Interpret the two hypothesis tests. What do you conclude? If you have a statistically significant result at the alpha = 0.05 level, then you must follow up the significant ANOVA with a post hoc analysis. At this point, use 95% confidence intervals for each group to determine if there are group mean differences and where they occur. Discuss your findings.
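A minimal sketch of the ANOVA and the follow-up group confidence intervals, assuming the data frame from task 1 is called nut and the response column is named Quetelet:

# One-way ANOVA of Quetelet across the Age_Cat groups
fit_age <- aov(Quetelet ~ factor(Age_Cat), data = nut)
summary(fit_age)

# Separate one-way ANOVA for the Alcohol_Cat groups
fit_alc <- aov(Quetelet ~ factor(Alcohol_Cat), data = nut)
summary(fit_alc)

# Post hoc look: 95% confidence interval for the mean of each Age_Cat group
# (groups with very few observations will give wide or undefined intervals)
by(nut$Quetelet, nut$Age_Cat, function(y) t.test(y)$conf.int)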
3) Now, using the Calories variable as the dependent response variable (Y), conduct similar ANOVA hypothesis tests and obtain confidence intervals for each group to determine if there are group mean differences relative to Age_Cat and Alcohol_Cat. You will need to clearly set up the null and alternative hypotheses, conduct the test with appropriate statistics, and interpret the individual group confidence intervals.
4) For the FAT, FIBER, and CHOLESTEROL variables, use a 95% confidence interval approach to compare groups, on average, for Age_Cat and Alcohol_Cat. Interpret the confidence intervals. Use whatever outside information you can obtain to help interpret the results.
5) With the results from this additional analysis, how has the story description from Modeling Assignment #3 changed? You are welcome to bring in information from your prior knowledge and experience to embellish this story. Is the analysis sufficient so far for your story, or is something missing? What should be done next? Write up your synthesis description of what this data set seems to be saying (up to this point) and where we should go from here.

Posted in R

Ticket sales patterns (15 points)
i. Create a model to predict ticket revenues or ticket revenue groups
for 2014 using the previous five years of data.
ii. Test your model on 2015 data. Comment.
iii. Make predictions for ticket purchases in 2016 (Like the Moneyball
example, the data of 2016 is missing. Assume that the coefficients
and the intercept values for the model created in point (ii) will be
the same for predicting 2016).
iv. Based on your model, who should be the top 10 ticket purchasers
for 2016?
I need the answers to the questions above: the top 10 in an Excel file and the rest as screenshots of the work done in R.
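A minimal sketch of the kind of model this asks for, assuming one row per purchaser with hypothetical columns purchaser_id and rev_2009 through rev_2015 for annual ticket revenue; every name here is invented and must be adapted to the actual dataset:

tickets <- read.csv("tickets.csv")     # hypothetical file and column names

# (i) model 2014 revenue from the five previous years
fit <- lm(rev_2014 ~ rev_2009 + rev_2010 + rev_2011 + rev_2012 + rev_2013,
          data = tickets)
summary(fit)

# (ii) test on 2015: shift the five-year window forward and reuse the fitted coefficients
lag_names <- c("rev_2009", "rev_2010", "rev_2011", "rev_2012", "rev_2013")
newdata_2015 <- setNames(tickets[, c("rev_2010", "rev_2011", "rev_2012",
                                     "rev_2013", "rev_2014")], lag_names)
pred_2015 <- predict(fit, newdata = newdata_2015)
sqrt(mean((tickets$rev_2015 - pred_2015)^2))     # RMSE on the 2015 data

# (iii) predict 2016 from 2011-2015 with the same coefficients and intercept
newdata_2016 <- setNames(tickets[, c("rev_2011", "rev_2012", "rev_2013",
                                     "rev_2014", "rev_2015")], lag_names)
pred_2016 <- predict(fit, newdata = newdata_2016)

# (iv) top 10 predicted purchasers for 2016
head(tickets[order(-pred_2016), "purchaser_id"], 10)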

Posted in R

I was given this answer, BUT I need the answers to the questions with visualizations, not just the code written out.
Ticket sales patterns (15 points)
i. Create a model to predict ticket revenues or ticket revenue groups
for 2014 using the previous five years of data.
ii. Test your model on 2015 data. Comment.
iii. Make predictions for ticket purchases in 2016 (Like the Moneyball
example, the data of 2016 is missing. Assume that the coefficients
and the intercept values for the model created in point (ii) will be
the same for predicting 2016).
iv. Based on your model, who should be the top 10 ticket purchasers
for 2016?

Posted in R