# Introduction Genetic algorithms provide probabilistic solutions to optimization

Introduction
Genetic algorithms provide probabilistic solutions to optimization problems. These algorithms can be thought of as an advanced “guess and check” technique that eventually arrives at an output that is close to the actual solution without having to know how to compute the solution directly.
Resources:
https://en.wikipedia.org/wiki/Genetic_algorithm
https://www.tutorialspoint.com/genetic_algorithms/index.htm
In application objective 1, you will write a generic genetic algorithm that can find approximate solutions to optimization problems. This algorithm will be written in such a way that it can be reused for any application suitable for a genetic algorithm. For each problem, you will need to provide your genetic algorithm with a cost function to determine how well a potential solution performs, and an incubator function that creates a potential solution from a list of doubles (Referred to as genes). In the lecture tasks, you will set up several problems to be solved by the genetic algorithm.
Immutable
You will design and implement a genetic algorithm without using any mutable variables or state.
Specifically, the following are banned in your submissions:
Variables (var)
The value at any memory address on the stack or heap cannot change throughout the execution of your program
You can use values (val) to store values
Any way of directly simulating mutability that is against the spirit of this assignment. (Ex. Importing a class that has a mutable state variable)
If your submission, including testing, violates this immutability restriction it will not be graded.
Project Structure
Repository: https://github.com/jessehartloff/GeneticAlgorithmHandout
Clone the starter code from the repository as a new IntelliJ project
Run maven and mark src as the root sources directory
Learning Objective – Write
Lecture Task 1: Average and Top K
Our goal in this project will be to use our genetic algorithm to generate playlists of songs based on a variety of factors. To this end, we’ll need methods to compute several statistical values. When writing these methods, we’ll write them generically so we never have to implement them again (ie. Whenever you need the top K values from any data structure, you can call your topK method from this LT.
Testing: In the tests package, complete the test suite named LectureTask1 that tests the following functionality.
In the statistics.Statistics object, a method named “average” with:
A type Parameter T
Has parameters of type List of T and a function of type T to Double
You can assume the input list is not empty (Do not write a test with an empty List as an input since this behavior is undefined)
Returns the average of the elements in the list after the input function is applied to them
Ie. Call the input function on every element of the List, then find the average of all the values returned by the function
By taking a type parameter and a function, this average method is completely generic and can be used to find the average of any aspect of a data set. For example, if you want to find the average of the absolute values of a list of integers you would call this method with your list of integers and the absolute value function
e music.Song class to make this functionality more clear. Each example takes a list of SongRatings and a function that either accesses the “rating” or the “energy” of the song’s ratings. These ratings will be on a 1-5 scale and the average of the ratings will be returnedSong Examples: Two example uses of this method are provided in the music.Song class to make this functionality more clear. Each example takes a list of SongRatings and a function that either accesses the “rating” or the “energy” of the song’s ratings. These ratings will be on a 1-5 scale and the average of the ratings will be returned
In the statistics.Statistics object, a method named “topK” with:
A type Parameter T
Has parameters of type List of T, a function of type T to Double, an Int “k”
Returns the k elements from the list with the highest outputs of function in decreasing order
Example: If k is 10 you would return a top 10 list
If the list has fewer than k elements, return all elements of the list
The returned list should be of type List[T]
If there are duplicates in the list, all duplicates should be in the returned list
Ex. The top 3 ints in the list (2,1,3,4,3) are (4,3,3)
Ex. The top 2 ints in the list (2,1,3,4,3) are (4,3)
Playlist Examples: Two example uses of this method are provided in the music.SampleUsage object. One call finds the top 10 songs based on their bayesian average rating (LT3) and the other finds the top 10 most controversial songs based on the standard deviation (LT2) of their ratings. You’ll also have to complete LT4 before running these example so you can read the file of ratings
Functionality: Implement the methods that you’ve tested.
We’ll also need a way to compute the standard deviation of any value of a list of objects. We’ll use population standard deviation as opposed to sample standard deviation. For a brief refresher on standard deviation, it is the square root of the sum of the differences from the mean squared for each value divided by the number of elements. That is:
Find the mean (average) of the data
Subtract the mean from each element and find the square of this difference
Sum up (add together) all these squared values
Divide the sum by the number of values (length of the input list)
Take the square root of the value calculated in step 4
Testing: In the tests package, complete the test suite named LectureTask2 that tests the following functionality.
In the statistics.Statistics object, implement the following method:
In the statistics.Statistics object, a method named “standardDeviation” with:
A type Parameter T
Has parameters of type List of T and a function of type T to Double
You can assume the input list is not empty
Returns the standard deviation of the elements in the list after the input function is applied to them
Note: The structure of this method is the same as the “average” method except you’ll compute the standard deviation instead of the average of the outputs of the function
Functionality: Implement the standardDeviation method.
Simple averages do not work well with user reviews since all of our top K lists will be dominated by songs with a single rating of 5. For example, a song with 100 ratings and an average rating of 4.9 should be ranked higher than a song with 1 rating of 5 even though the later song has a higher average rating.
To address this issue we will use a Bayesian average. To compute a Bayesian average we will add “fake” ratings to every song before computing the average. For example if we add 2 fake ratings of 3 to every song, a song with a single rating of 5 will have a bayesian average rating of 3.67 (The average of [5, 3, 3]) while the song with 100 ratings will barely be affected.
References:
https://fulmicoton.com/posts/bayesian_rating
https://en.wikipedia.org/wiki/Bayesian_average
Testing: In the tests package, complete the test suite named LectureTask3 that tests the following functionality.
In the statistics.Statistics object, a method named “bayesianAverage” with:
A type Parameter T
Has parameters of type List of T, a function of type T to Double, an Int representing the number of extra “fake” ratings, and an Int representing the value of the extra ratings
You can assume the input list is not empty
Returns the bayesian average of the elements in the list after the input function is applied to them
Functionality: Implement the bayesianAverage method.
Note: The intent of this task is to give you practice writing a recursive method. There is a way to write this method without recursion, but it is difficult to attempt LT4 without any practice with recursion. If you have questions in office hours about LT4 and you did not use recursion on LT3, the response to your question should be to rewrite LT3 using recursion.
We have some tools to work with data, but we don’t have data. In this task you’ll read song ratings from a file and return the list of songs, with all their ratings, from the file. This will be a CSV file in the format “youtubeId,artist,title,rating,energyRating” and you can assume that there are no commas in the data itself so you can safely split on commas (Note: In general, CSV data can contain commas if the data is surrounded by quotes). The rating and energyRating will be integer values ranging from 1 to 5.
After parsing the file you should have a list of songs containing all the ratings from the file. If a song has been rated multiple times there should only be one song in the returned list containing all the ratings for that song. Two songs are considered the same if they have the same youtubeId even if they have a different artist/title (eg. If someone mistyped the artist/title do not consider that as a different song).
Testing: There is no testing requirement for this objective. This does not mean that you shouldn’t test your code, only that your testing will not be checked in AutoLab. If you have questions during office hours about LT4 and you don’t have any tests, you know what we’ll say in response to your question “write tests!”
Functionality: In the Song class implement the following method:
A method named “addRating” that takes a SongRating and returns a new Song that is a copy of the song with the new rating added
In the Song object complete the following method:
A method named “readSongsFromFile” that takes a String representing a filename containing song ratings in the format above and returns a list of Songs containing all the songs and rating from the file
Note: After completing LT1-4 you can run the code in music.SampleUsage to generate playlists from a file of ratings and your statistics methods. Enjoy the music!
Lecture Task 5: Song Cost Function
To interact with the genetic algorithm we need a cost function. In this task you’ll write a method that computes the cost of a song given a specific user’s ratings. We will combine the overall rating of the song with the user’s rating for the song, if the user rated this song. This cost will be a measure of how much this user would like/dislike having this song in a playlist. The best song for this user is the song with the lowest cost.
Testing: In the tests package, complete the test suite named LectureTask5 that tests the following functionality.
In the Song object, a method named “costFunction” that:
Takes a Map of Strings to Ints as a parameter representing the rating of a specific user. The keys of this map will be youtubeIds and the values are the ratings from this user
Returns a function that takes a Song as a parameter and returns a Double. This function will compute the cost of a Song for this specific user based on their ratings. The function will return:
1000.0 if the song has been rated with a 1 or 2 by this user. This is a very high cost relative to songs rated 3+ and will almost certainly prevent these songs from appearing in any playlist for this user
If the song has not been rated by this user, treat it as though the user gave it a rating of 3
If the song has been rated 3, 4, 5, or has not been rated and is being treated as a 3, return a cost of 1 / (bayesianRating * userRating)
For bayesianRating add 2 extra rating with value 3
Example: If a song has ratings (2,4,5) and the user rated this song with a 4 (from the input Map) the song will have a bayesian average rating of 3.4 and a cost of 0.0735294
Functionality: Implement the Song.costFunction method.
Lecture Task 6: Playlist Cost Function
Just as we did with Songs, you’ll write a cost function for entire playlists. This will allow your genetic algorithm to generate playlists for a user that will minimize this score.
Testing: In the tests package, complete the test suite named LectureTask6 that tests the following functionality.
In the Playlist object, a method named “costFunction” that:
Takes a Map of Strings to Ints as a parameter representing the rating of a specific user. The keys of this map will be youtubeIds and the values are the ratings from this user
Returns a function that takes a Playlist as a parameter and returns a Double. This function will compute the cost of a Playlist for this specific user based on their ratings. The function will compute the cost as follows:
Compute a “raw cost” as the sum of the cost of all the songs. Remember that you can call your Song.costFunction method to get a cost function for songs and call this function to compute these costs. Sum up all the costs and that’s the raw cost of the playlist
Compute a cost multiplier that is:
1000.0 if the playlist contains a duplicate song (Check if the playlist contains any youtubeIds multiple times). This large multiplier will all but certainly prevent any playlist from containing duplicate songs
Compute the standard deviation of the averageEnergyRating of all songs (i.e. The data set is the average energy rating of each song in the playlist. Compute the standard deviation of this data set). This standard deviation will give us a measure of the variety of songs in the playlist based on energy level. We will favor playlists with higher energy variety
If the standard deviation is < 0.5, set the cost multiplier to 10.0 If the standard deviation is >= 0.5, set the cost multiplier to the inverse of the standard deviation (eg. 1.0 divided by the standard deviation)
Returns the raw cost times the cost multiplier
Functionality: Implement the Playlist.costFunction method.