Introduction Genetic algorithms provide probabilistic solutions to optimization

Introduction
Genetic algorithms provide probabilistic solutions to optimization problems. These algorithms can be thought of as an advanced “guess and check” technique that eventually arrives at an output that is close to the actual solution without having to know how to compute the solution directly.
Resources:
https://en.wikipedia.org/wiki/Genetic_algorithm
https://www.tutorialspoint.com/genetic_algorithms/index.htm
In application objective 1, you will write a generic genetic algorithm that can find approximate solutions to optimization problems. This algorithm will be written in such a way that it can be reused for any application suitable for a genetic algorithm. For each problem, you will need to provide your genetic algorithm with a cost function to determine how well a potential solution performs, and an incubator function that creates a potential solution from a list of doubles (Referred to as genes). In the lecture tasks, you will set up several problems to be solved by the genetic algorithm.
Immutable
You will design and implement a genetic algorithm without using any mutable variables or state.
Specifically, the following are banned in your submissions:
Variables (var)
The value at any memory address on the stack or heap cannot change throughout the execution of your program
You can use values (val) to store values
Any way of directly simulating mutability that is against the spirit of this assignment. (Ex. Importing a class that has a mutable state variable)
If your submission, including testing, violates this immutability restriction it will not be graded.
Project Structure
Repository: https://github.com/jessehartloff/GeneticAlgorithmHandout
Clone the starter code from the repository as a new IntelliJ project
Run maven and mark src as the root sources directory
Learning Objective – Write
Lecture Task 1: Average and Top K
Our goal in this project will be to use our genetic algorithm to generate playlists of songs based on a variety of factors. To this end, we’ll need methods to compute several statistical values. When writing these methods, we’ll write them generically so we never have to implement them again (ie. Whenever you need the top K values from any data structure, you can call your topK method from this LT.
Testing: In the tests package, complete the test suite named LectureTask1 that tests the following functionality.
In the statistics.Statistics object, a method named “average” with:
A type Parameter T
Has parameters of type List of T and a function of type T to Double
You can assume the input list is not empty (Do not write a test with an empty List as an input since this behavior is undefined)
Returns the average of the elements in the list after the input function is applied to them
Ie. Call the input function on every element of the List, then find the average of all the values returned by the function
By taking a type parameter and a function, this average method is completely generic and can be used to find the average of any aspect of a data set. For example, if you want to find the average of the absolute values of a list of integers you would call this method with your list of integers and the absolute value function
e music.Song class to make this functionality more clear. Each example takes a list of SongRatings and a function that either accesses the “rating” or the “energy” of the song’s ratings. These ratings will be on a 1-5 scale and the average of the ratings will be returnedSong Examples: Two example uses of this method are provided in the music.Song class to make this functionality more clear. Each example takes a list of SongRatings and a function that either accesses the “rating” or the “energy” of the song’s ratings. These ratings will be on a 1-5 scale and the average of the ratings will be returned
In the statistics.Statistics object, a method named “topK” with:
A type Parameter T
Has parameters of type List of T, a function of type T to Double, an Int “k”
Returns the k elements from the list with the highest outputs of function in decreasing order
Example: If k is 10 you would return a top 10 list
If the list has fewer than k elements, return all elements of the list
The returned list should be of type List[T]
If there are duplicates in the list, all duplicates should be in the returned list
Ex. The top 3 ints in the list (2,1,3,4,3) are (4,3,3)
Ex. The top 2 ints in the list (2,1,3,4,3) are (4,3)
Playlist Examples: Two example uses of this method are provided in the music.SampleUsage object. One call finds the top 10 songs based on their bayesian average rating (LT3) and the other finds the top 10 most controversial songs based on the standard deviation (LT2) of their ratings. You’ll also have to complete LT4 before running these example so you can read the file of ratings
Functionality: Implement the methods that you’ve tested.
We’ll also need a way to compute the standard deviation of any value of a list of objects. We’ll use population standard deviation as opposed to sample standard deviation. For a brief refresher on standard deviation, it is the square root of the sum of the differences from the mean squared for each value divided by the number of elements. That is:
Find the mean (average) of the data
Subtract the mean from each element and find the square of this difference
Sum up (add together) all these squared values
Divide the sum by the number of values (length of the input list)
Take the square root of the value calculated in step 4
Testing: In the tests package, complete the test suite named LectureTask2 that tests the following functionality.
In the statistics.Statistics object, implement the following method:
In the statistics.Statistics object, a method named “standardDeviation” with:
A type Parameter T
Has parameters of type List of T and a function of type T to Double
You can assume the input list is not empty
Returns the standard deviation of the elements in the list after the input function is applied to them
Note: The structure of this method is the same as the “average” method except you’ll compute the standard deviation instead of the average of the outputs of the function
Functionality: Implement the standardDeviation method.
Simple averages do not work well with user reviews since all of our top K lists will be dominated by songs with a single rating of 5. For example, a song with 100 ratings and an average rating of 4.9 should be ranked higher than a song with 1 rating of 5 even though the later song has a higher average rating.
To address this issue we will use a Bayesian average. To compute a Bayesian average we will add “fake” ratings to every song before computing the average. For example if we add 2 fake ratings of 3 to every song, a song with a single rating of 5 will have a bayesian average rating of 3.67 (The average of [5, 3, 3]) while the song with 100 ratings will barely be affected.
References:
https://fulmicoton.com/posts/bayesian_rating
https://en.wikipedia.org/wiki/Bayesian_average
Testing: In the tests package, complete the test suite named LectureTask3 that tests the following functionality.
In the statistics.Statistics object, a method named “bayesianAverage” with:
A type Parameter T
Has parameters of type List of T, a function of type T to Double, an Int representing the number of extra “fake” ratings, and an Int representing the value of the extra ratings
You can assume the input list is not empty
Returns the bayesian average of the elements in the list after the input function is applied to them
Functionality: Implement the bayesianAverage method.
Note: The intent of this task is to give you practice writing a recursive method. There is a way to write this method without recursion, but it is difficult to attempt LT4 without any practice with recursion. If you have questions in office hours about LT4 and you did not use recursion on LT3, the response to your question should be to rewrite LT3 using recursion.
We have some tools to work with data, but we don’t have data. In this task you’ll read song ratings from a file and return the list of songs, with all their ratings, from the file. This will be a CSV file in the format “youtubeId,artist,title,rating,energyRating” and you can assume that there are no commas in the data itself so you can safely split on commas (Note: In general, CSV data can contain commas if the data is surrounded by quotes). The rating and energyRating will be integer values ranging from 1 to 5.
After parsing the file you should have a list of songs containing all the ratings from the file. If a song has been rated multiple times there should only be one song in the returned list containing all the ratings for that song. Two songs are considered the same if they have the same youtubeId even if they have a different artist/title (eg. If someone mistyped the artist/title do not consider that as a different song).
Testing: There is no testing requirement for this objective. This does not mean that you shouldn’t test your code, only that your testing will not be checked in AutoLab. If you have questions during office hours about LT4 and you don’t have any tests, you know what we’ll say in response to your question “write tests!”
Functionality: In the Song class implement the following method:
A method named “addRating” that takes a SongRating and returns a new Song that is a copy of the song with the new rating added
In the Song object complete the following method:
A method named “readSongsFromFile” that takes a String representing a filename containing song ratings in the format above and returns a list of Songs containing all the songs and rating from the file
Note: After completing LT1-4 you can run the code in music.SampleUsage to generate playlists from a file of ratings and your statistics methods. Enjoy the music!
Lecture Task 5: Song Cost Function
To interact with the genetic algorithm we need a cost function. In this task you’ll write a method that computes the cost of a song given a specific user’s ratings. We will combine the overall rating of the song with the user’s rating for the song, if the user rated this song. This cost will be a measure of how much this user would like/dislike having this song in a playlist. The best song for this user is the song with the lowest cost.
Testing: In the tests package, complete the test suite named LectureTask5 that tests the following functionality.
In the Song object, a method named “costFunction” that:
Takes a Map of Strings to Ints as a parameter representing the rating of a specific user. The keys of this map will be youtubeIds and the values are the ratings from this user
Returns a function that takes a Song as a parameter and returns a Double. This function will compute the cost of a Song for this specific user based on their ratings. The function will return:
1000.0 if the song has been rated with a 1 or 2 by this user. This is a very high cost relative to songs rated 3+ and will almost certainly prevent these songs from appearing in any playlist for this user
If the song has not been rated by this user, treat it as though the user gave it a rating of 3
If the song has been rated 3, 4, 5, or has not been rated and is being treated as a 3, return a cost of 1 / (bayesianRating * userRating)
For bayesianRating add 2 extra rating with value 3
Example: If a song has ratings (2,4,5) and the user rated this song with a 4 (from the input Map) the song will have a bayesian average rating of 3.4 and a cost of 0.0735294
Functionality: Implement the Song.costFunction method.
Lecture Task 6: Playlist Cost Function
Just as we did with Songs, you’ll write a cost function for entire playlists. This will allow your genetic algorithm to generate playlists for a user that will minimize this score.
Testing: In the tests package, complete the test suite named LectureTask6 that tests the following functionality.
In the Playlist object, a method named “costFunction” that:
Takes a Map of Strings to Ints as a parameter representing the rating of a specific user. The keys of this map will be youtubeIds and the values are the ratings from this user
Returns a function that takes a Playlist as a parameter and returns a Double. This function will compute the cost of a Playlist for this specific user based on their ratings. The function will compute the cost as follows:
Compute a “raw cost” as the sum of the cost of all the songs. Remember that you can call your Song.costFunction method to get a cost function for songs and call this function to compute these costs. Sum up all the costs and that’s the raw cost of the playlist
Compute a cost multiplier that is:
1000.0 if the playlist contains a duplicate song (Check if the playlist contains any youtubeIds multiple times). This large multiplier will all but certainly prevent any playlist from containing duplicate songs
Compute the standard deviation of the averageEnergyRating of all songs (i.e. The data set is the average energy rating of each song in the playlist. Compute the standard deviation of this data set). This standard deviation will give us a measure of the variety of songs in the playlist based on energy level. We will favor playlists with higher energy variety
If the standard deviation is < 0.5, set the cost multiplier to 10.0 If the standard deviation is >= 0.5, set the cost multiplier to the inverse of the standard deviation (eg. 1.0 divided by the standard deviation)
Returns the raw cost times the cost multiplier
Functionality: Implement the Playlist.costFunction method.
Application Objectives
Testing Note
Your tests will not be checked by AutoLab for these application objectives
If you ask a question in office hours or on Piazza about these objectives without having any testing, don’t expect a satisfying answer (Unless your question is about testing)
Application Objective 1: Genetic Algorithm
Implement a generic genetic algorithm.
In the GeneticAlgorithm object, write a method named geneticAlgorithm with the following features:
Takes a type parameter T
This is the type of the solution space that you are trying to find. For example, when finding the best playlist this type will be Playlist
As the first parameter, takes an “incubator” function that takes a List[Double] and returns an object of type T
This function will provide a way to convert a list of genes into an object in the solution space. Your algorithm will only work with lists of doubles to guess new solutions and call this incubator to convert the genes into possible solutions
As the second parameter, takes a cost function of type T to Double
This function takes a potential solution and computes its cost. The goal of your algorithm is to find a solution with minimal cost. A solution with a cost of 0.0 is considered perfect, though a perfect solution will not always exist.
The third, and final, parameter is an Int representing the number of genes in the List[Double] for the incubator function
When creating/mutating/crossing lists for the incubator, make sure all the lists have this length
Returns an object of type T which is the most optimal solution found by the algorithm
Your algorithm will be tested with an unknown application to ensure you implemented the algorithm generically. Your algorithm will be called 20 times during testing to test the efficiency and consistency of your implementation. Be sure your algorithm is efficient enough to run this many times. Grading will timeout after 5 minutes.
Algorithm Outline: You have a significant amount of freedom in designing your genetic algorithm. Your only goal is to return a solution that is close to optimal. If you achieve this, you can complete this assignment. For a little guidance on designing an algorithm, or if you don’t want to design your own from scratch, here are a few steps you can follow.
Start by generating a fixed number of random “animals”
Each animal is defined by its genes (List of doubles) so this involves generating random lists of doubles
Depending on the rest of your algorithm and applications, the range of these random values may be important. Ensure that the range is wide enough to allow any potential solution. (If your genes cover the range -100 to 100 you will be able to pass the application objective)
Find the best animals
Use the incubator function to create a potential solution for each set of genes
Use the cost function to compute the cost of each potential solution
Sort the animals based on their cost to find the best animals
Survival of the fittest
Only keep animals with the least costs and say goodbye to the rest
Mutations
Keep the best animal with no mutations
Also create mutations of the best animals by creating new sets of genes that are similar to the best ones, but with some random variation
Randomness is your friend in this step. You are starting with a pretty good solution and creating random solutions that are similar to the good one to find an even better one. Creating several variations of the best solutions with random changes will help your algorithm to find the best solution
Crossover
Pair up the top animals and create “offspring” by combining their genes
There are several methods of this, for example averaging their genes or randomly choosing one parent for each gene
More random animals
Add more random animals to get back to the original number of animals
The next generation
Go back to step 2 [with recursion]
You can recurse for a fixed number of generations, or until the improvement from the previous generation is very small. Do not run until the cost is close to 0.0 since the most optimal solution may have a high cost.
If you follow this outline, there are many parameters that are up to you to choose. You should test and adjust your parameters until you have a robust genetic algorithm. These parameters include:
How many animals in each generation
How many generations
How many animals from each generation you’ll keep
How to mutate genes
How many mutations to generate and which animals get mutated
How to crossover two (or more) animals
Testing: Your testing will not be graded in AutoLab as you are expected to know how to write useful tests without being forced to write them at this point in the course. To write your tests you should use the cost function and incubator from the Song and Playlist objects as your inputs. Start with something simple like a few songs and an empty map of ratings and ensure your algorithm returns the song with the highest bayesian average rating. Eventually you’ll build up to generating playlists with an increasing number of songs (number of genes is the length of a playlist). Accurately finding a playlist of length 3 would be a significant accomplishment and should be enough to pass my tests on AutoLab.
Application Objective 2: Neural Networks
Use your genetic algorithm to train a neural network.
Complete the NeuralNetwork object and class to complete this objective. Your software will be checked for an accuracy >= 80% on the iris dataset to check if you’ve completed this objective (ie. Don’t strive for 100% accuracy. I believe it’s impossible on this dataset using a Neural Network).
Neural Network: A neural network is a machine learning technique inspired by the functionality of neurons. The goal of a neural network is to take a vector of doubles and return the category to which the vector belongs. In other words, given several features of something represented as doubles, the network will attempt to correctly determine what is represented by those values.
A neural network accomplishes this by simulating biological neurons. A simulated neuron is connected to several other neurons which will send signals to it (represented as doubles) and the neuron will combine these signals to determine its output signal. The network has 3 types of nodes:
Input neurons: These nodes are set to the input vector values. These nodes output the feature vector without modification. The number of input nodes will match the size of the feature vector for each application. In the iris examples below, there will be 4 input neurons in the network.
Hidden neurons: Each hidden neuron is connected to the output of all input neurons and combines all the input features into a single value to determine its output signal strength. This is accomplished by assigning a weight to each input and combining them linearly with an additional constant as a bias term. The output of this linear combination is then the input to a function (commonly the sigmoid function) to determine whether or not the neuron “fires.” This makes the output signal tend towards either 1.0 or 0.0.
Output neurons: The output neurons are connected to all of the hidden neurons in the same way that the hidden neurons are connected to the inputs. Each output neuron is associated with one of the possible categories for the application. The output node with the strongest signal is the most likely category for that input and is the category returned by the network.
For more details about neural network functionality, this video contains an excellent overview: https://www.youtube.com/watch?v=aircAruvnKk
Our goal is to train a neural network by determining the values for all the weights and biases for each neuron. This includes a weight for each input node for each hidden node, plus a bias term for each hidden node and a weight for each hidden node for each output node, plus a bias term for each output node. This total number of weights and biases should be the number of dimensions of the network which is the size of your genes in the genetic algorithm. This gene size will depend on your choice for the number of hidden nodes and will be determined by calling NeuralNetwork.numberOfDimensions(numberOfInputNodes, numberOfCategories) which you should call in your testing.
Example: For testing, you can use the publicly available iris dataset (linked below). This data set contains data about different types of irises represented by four measurements of the flowers (the feature vector) along with their categories (type of iris).
https://gist.github.com/curran/a08a1080b88344b0c8a7#file-iris-csv
The goal of your neural network is to correctly categorize the type of iris given its feature vector. More specifically, create a neural network such that the findCategory method will return the type of iris given it’s feature vector as a list of doubles.
Training: To train your neural network, and training AI algorithms in general, you should split your data into a training set and a testing set. It is important not to test with the same data the algorithm was trained with to avoid overfitting to the data. The handout code contains the full dataset, and the dataset split into testing and training sets as an example.
Summary: Your neural network will be created by providing the number of inputs (feature vector size) and the output categories. For the iris example a neural network will be created by calling
new NeuralNetwork(4, List(“setosa”, “versicolor”, “virginica”), List(…)))
And the created network should have 4 input neurons, 3 output neurons associated with the three categories, and a number of hidden nodes determined by your design.
The numberOfDimensions method in the NeuralNetwork object will be used to determine how many dimensions should be used by the genetic algorithm. This should return the number of genes you expect for a given number of inputs and output. It is up to you to determine how to set all the weights and biases based on these genes. You can decide any order you’d like for these weights and biases.
The cost function should determine how well a neural network performed on a training data set which you can create using the iris data. This training data is prelabeled with the correct category for a number of feature vectors. The cost should be some measure of how many input vectors were misclassified by the neural network.
Run your genetic algorithm on part/half of the iris data as the training set, then check the output neural network on the remaining data to see how well the genetic algorithm trained a neural network.
If you’ve made it this far, congratulations! You’ve just trained a machine to identify irises without telling it anything about irises! This is one way that algorithms can “learn” just by providing them with training data and why your data is so valuable to so many companies. If you are interested in this, you should consider taking machine learning courses and study more efficient methods of training classifiers.

Introduction Genetic algorithms provide probabilistic solutions to optimization

Introduction
Genetic algorithms provide probabilistic solutions to optimization problems. These algorithms can be thought of as an advanced “guess and check” technique that eventually arrives at an output that is close to the actual solution without having to know how to compute the solution directly.
Resources:
https://en.wikipedia.org/wiki/Genetic_algorithm
https://www.tutorialspoint.com/genetic_algorithms/index.htm
In application objective 1, you will write a generic genetic algorithm that can find approximate solutions to optimization problems. This algorithm will be written in such a way that it can be reused for any application suitable for a genetic algorithm. For each problem, you will need to provide your genetic algorithm with a cost function to determine how well a potential solution performs, and an incubator function that creates a potential solution from a list of doubles (Referred to as genes). In the lecture tasks, you will set up several problems to be solved by the genetic algorithm.
Immutable
You will design and implement a genetic algorithm without using any mutable variables or state.
Specifically, the following are banned in your submissions:
Variables (var)
The value at any memory address on the stack or heap cannot change throughout the execution of your program
You can use values (val) to store values
Any way of directly simulating mutability that is against the spirit of this assignment. (Ex. Importing a class that has a mutable state variable)
If your submission, including testing, violates this immutability restriction it will not be graded.
Project Structure
Repository: https://github.com/jessehartloff/GeneticAlgorithmHandout
Clone the starter code from the repository as a new IntelliJ project
Run maven and mark src as the root sources directory
Learning Objective – Write
Lecture Task 1: Average and Top K
Our goal in this project will be to use our genetic algorithm to generate playlists of songs based on a variety of factors. To this end, we’ll need methods to compute several statistical values. When writing these methods, we’ll write them generically so we never have to implement them again (ie. Whenever you need the top K values from any data structure, you can call your topK method from this LT.
Testing: In the tests package, complete the test suite named LectureTask1 that tests the following functionality.
In the statistics.Statistics object, a method named “average” with:
A type Parameter T
Has parameters of type List of T and a function of type T to Double
You can assume the input list is not empty (Do not write a test with an empty List as an input since this behavior is undefined)
Returns the average of the elements in the list after the input function is applied to them
Ie. Call the input function on every element of the List, then find the average of all the values returned by the function
By taking a type parameter and a function, this average method is completely generic and can be used to find the average of any aspect of a data set. For example, if you want to find the average of the absolute values of a list of integers you would call this method with your list of integers and the absolute value function
e music.Song class to make this functionality more clear. Each example takes a list of SongRatings and a function that either accesses the “rating” or the “energy” of the song’s ratings. These ratings will be on a 1-5 scale and the average of the ratings will be returnedSong Examples: Two example uses of this method are provided in the music.Song class to make this functionality more clear. Each example takes a list of SongRatings and a function that either accesses the “rating” or the “energy” of the song’s ratings. These ratings will be on a 1-5 scale and the average of the ratings will be returned
In the statistics.Statistics object, a method named “topK” with:
A type Parameter T
Has parameters of type List of T, a function of type T to Double, an Int “k”
Returns the k elements from the list with the highest outputs of function in decreasing order
Example: If k is 10 you would return a top 10 list
If the list has fewer than k elements, return all elements of the list
The returned list should be of type List[T]
If there are duplicates in the list, all duplicates should be in the returned list
Ex. The top 3 ints in the list (2,1,3,4,3) are (4,3,3)
Ex. The top 2 ints in the list (2,1,3,4,3) are (4,3)
Playlist Examples: Two example uses of this method are provided in the music.SampleUsage object. One call finds the top 10 songs based on their bayesian average rating (LT3) and the other finds the top 10 most controversial songs based on the standard deviation (LT2) of their ratings. You’ll also have to complete LT4 before running these example so you can read the file of ratings
Functionality: Implement the methods that you’ve tested.
We’ll also need a way to compute the standard deviation of any value of a list of objects. We’ll use population standard deviation as opposed to sample standard deviation. For a brief refresher on standard deviation, it is the square root of the sum of the differences from the mean squared for each value divided by the number of elements. That is:
Find the mean (average) of the data
Subtract the mean from each element and find the square of this difference
Sum up (add together) all these squared values
Divide the sum by the number of values (length of the input list)
Take the square root of the value calculated in step 4
Testing: In the tests package, complete the test suite named LectureTask2 that tests the following functionality.
In the statistics.Statistics object, implement the following method:
In the statistics.Statistics object, a method named “standardDeviation” with:
A type Parameter T
Has parameters of type List of T and a function of type T to Double
You can assume the input list is not empty
Returns the standard deviation of the elements in the list after the input function is applied to them
Note: The structure of this method is the same as the “average” method except you’ll compute the standard deviation instead of the average of the outputs of the function
Functionality: Implement the standardDeviation method.
Simple averages do not work well with user reviews since all of our top K lists will be dominated by songs with a single rating of 5. For example, a song with 100 ratings and an average rating of 4.9 should be ranked higher than a song with 1 rating of 5 even though the later song has a higher average rating.
To address this issue we will use a Bayesian average. To compute a Bayesian average we will add “fake” ratings to every song before computing the average. For example if we add 2 fake ratings of 3 to every song, a song with a single rating of 5 will have a bayesian average rating of 3.67 (The average of [5, 3, 3]) while the song with 100 ratings will barely be affected.
References:
https://fulmicoton.com/posts/bayesian_rating
https://en.wikipedia.org/wiki/Bayesian_average
Testing: In the tests package, complete the test suite named LectureTask3 that tests the following functionality.
In the statistics.Statistics object, a method named “bayesianAverage” with:
A type Parameter T
Has parameters of type List of T, a function of type T to Double, an Int representing the number of extra “fake” ratings, and an Int representing the value of the extra ratings
You can assume the input list is not empty
Returns the bayesian average of the elements in the list after the input function is applied to them
Functionality: Implement the bayesianAverage method.
Note: The intent of this task is to give you practice writing a recursive method. There is a way to write this method without recursion, but it is difficult to attempt LT4 without any practice with recursion. If you have questions in office hours about LT4 and you did not use recursion on LT3, the response to your question should be to rewrite LT3 using recursion.
We have some tools to work with data, but we don’t have data. In this task you’ll read song ratings from a file and return the list of songs, with all their ratings, from the file. This will be a CSV file in the format “youtubeId,artist,title,rating,energyRating” and you can assume that there are no commas in the data itself so you can safely split on commas (Note: In general, CSV data can contain commas if the data is surrounded by quotes). The rating and energyRating will be integer values ranging from 1 to 5.
After parsing the file you should have a list of songs containing all the ratings from the file. If a song has been rated multiple times there should only be one song in the returned list containing all the ratings for that song. Two songs are considered the same if they have the same youtubeId even if they have a different artist/title (eg. If someone mistyped the artist/title do not consider that as a different song).
Testing: There is no testing requirement for this objective. This does not mean that you shouldn’t test your code, only that your testing will not be checked in AutoLab. If you have questions during office hours about LT4 and you don’t have any tests, you know what we’ll say in response to your question “write tests!”
Functionality: In the Song class implement the following method:
A method named “addRating” that takes a SongRating and returns a new Song that is a copy of the song with the new rating added
In the Song object complete the following method:
A method named “readSongsFromFile” that takes a String representing a filename containing song ratings in the format above and returns a list of Songs containing all the songs and rating from the file
Note: After completing LT1-4 you can run the code in music.SampleUsage to generate playlists from a file of ratings and your statistics methods. Enjoy the music!
Lecture Task 5: Song Cost Function
To interact with the genetic algorithm we need a cost function. In this task you’ll write a method that computes the cost of a song given a specific user’s ratings. We will combine the overall rating of the song with the user’s rating for the song, if the user rated this song. This cost will be a measure of how much this user would like/dislike having this song in a playlist. The best song for this user is the song with the lowest cost.
Testing: In the tests package, complete the test suite named LectureTask5 that tests the following functionality.
In the Song object, a method named “costFunction” that:
Takes a Map of Strings to Ints as a parameter representing the rating of a specific user. The keys of this map will be youtubeIds and the values are the ratings from this user
Returns a function that takes a Song as a parameter and returns a Double. This function will compute the cost of a Song for this specific user based on their ratings. The function will return:
1000.0 if the song has been rated with a 1 or 2 by this user. This is a very high cost relative to songs rated 3+ and will almost certainly prevent these songs from appearing in any playlist for this user
If the song has not been rated by this user, treat it as though the user gave it a rating of 3
If the song has been rated 3, 4, 5, or has not been rated and is being treated as a 3, return a cost of 1 / (bayesianRating * userRating)
For bayesianRating add 2 extra rating with value 3
Example: If a song has ratings (2,4,5) and the user rated this song with a 4 (from the input Map) the song will have a bayesian average rating of 3.4 and a cost of 0.0735294
Functionality: Implement the Song.costFunction method.
Lecture Task 6: Playlist Cost Function
Just as we did with Songs, you’ll write a cost function for entire playlists. This will allow your genetic algorithm to generate playlists for a user that will minimize this score.
Testing: In the tests package, complete the test suite named LectureTask6 that tests the following functionality.
In the Playlist object, a method named “costFunction” that:
Takes a Map of Strings to Ints as a parameter representing the rating of a specific user. The keys of this map will be youtubeIds and the values are the ratings from this user
Returns a function that takes a Playlist as a parameter and returns a Double. This function will compute the cost of a Playlist for this specific user based on their ratings. The function will compute the cost as follows:
Compute a “raw cost” as the sum of the cost of all the songs. Remember that you can call your Song.costFunction method to get a cost function for songs and call this function to compute these costs. Sum up all the costs and that’s the raw cost of the playlist
Compute a cost multiplier that is:
1000.0 if the playlist contains a duplicate song (Check if the playlist contains any youtubeIds multiple times). This large multiplier will all but certainly prevent any playlist from containing duplicate songs
Compute the standard deviation of the averageEnergyRating of all songs (i.e. The data set is the average energy rating of each song in the playlist. Compute the standard deviation of this data set). This standard deviation will give us a measure of the variety of songs in the playlist based on energy level. We will favor playlists with higher energy variety
If the standard deviation is < 0.5, set the cost multiplier to 10.0 If the standard deviation is >= 0.5, set the cost multiplier to the inverse of the standard deviation (eg. 1.0 divided by the standard deviation)
Returns the raw cost times the cost multiplier
Functionality: Implement the Playlist.costFunction method.

Introduction Genetic algorithms provide probabilistic solutions to optimization

Introduction
Genetic algorithms provide probabilistic solutions to optimization problems. These algorithms can be thought of as an advanced “guess and check” technique that eventually arrives at an output that is close to the actual solution without having to know how to compute the solution directly.
Resources:
https://en.wikipedia.org/wiki/Genetic_algorithm
https://www.tutorialspoint.com/genetic_algorithms/index.htm
In application objective 1, you will write a generic genetic algorithm that can find approximate solutions to optimization problems. This algorithm will be written in such a way that it can be reused for any application suitable for a genetic algorithm. For each problem, you will need to provide your genetic algorithm with a cost function to determine how well a potential solution performs, and an incubator function that creates a potential solution from a list of doubles (Referred to as genes). In the lecture tasks, you will set up several problems to be solved by the genetic algorithm.
Immutable
You will design and implement a genetic algorithm without using any mutable variables or state.
Specifically, the following are banned in your submissions:
Variables (var)
The value at any memory address on the stack or heap cannot change throughout the execution of your program
You can use values (val) to store values
Any way of directly simulating mutability that is against the spirit of this assignment. (Ex. Importing a class that has a mutable state variable)
If your submission, including testing, violates this immutability restriction it will not be graded.
Project Structure
Repository: https://github.com/jessehartloff/GeneticAlgorithmHandout
Clone the starter code from the repository as a new IntelliJ project
Run maven and mark src as the root sources directory
Learning Objective – Write
Lecture Task 1: Average and Top K
Our goal in this project will be to use our genetic algorithm to generate playlists of songs based on a variety of factors. To this end, we’ll need methods to compute several statistical values. When writing these methods, we’ll write them generically so we never have to implement them again (ie. Whenever you need the top K values from any data structure, you can call your topK method from this LT.
Testing: In the tests package, complete the test suite named LectureTask1 that tests the following functionality.
In the statistics.Statistics object, a method named “average” with:
A type Parameter T
Has parameters of type List of T and a function of type T to Double
You can assume the input list is not empty (Do not write a test with an empty List as an input since this behavior is undefined)
Returns the average of the elements in the list after the input function is applied to them
Ie. Call the input function on every element of the List, then find the average of all the values returned by the function
By taking a type parameter and a function, this average method is completely generic and can be used to find the average of any aspect of a data set. For example, if you want to find the average of the absolute values of a list of integers you would call this method with your list of integers and the absolute value function
e music.Song class to make this functionality more clear. Each example takes a list of SongRatings and a function that either accesses the “rating” or the “energy” of the song’s ratings. These ratings will be on a 1-5 scale and the average of the ratings will be returnedSong Examples: Two example uses of this method are provided in the music.Song class to make this functionality more clear. Each example takes a list of SongRatings and a function that either accesses the “rating” or the “energy” of the song’s ratings. These ratings will be on a 1-5 scale and the average of the ratings will be returned
In the statistics.Statistics object, a method named “topK” with:
A type Parameter T
Has parameters of type List of T, a function of type T to Double, an Int “k”
Returns the k elements from the list with the highest outputs of function in decreasing order
Example: If k is 10 you would return a top 10 list
If the list has fewer than k elements, return all elements of the list
The returned list should be of type List[T]
If there are duplicates in the list, all duplicates should be in the returned list
Ex. The top 3 ints in the list (2,1,3,4,3) are (4,3,3)
Ex. The top 2 ints in the list (2,1,3,4,3) are (4,3)
Playlist Examples: Two example uses of this method are provided in the music.SampleUsage object. One call finds the top 10 songs based on their bayesian average rating (LT3) and the other finds the top 10 most controversial songs based on the standard deviation (LT2) of their ratings. You’ll also have to complete LT4 before running these example so you can read the file of ratings
Functionality: Implement the methods that you’ve tested.
We’ll also need a way to compute the standard deviation of any value of a list of objects. We’ll use population standard deviation as opposed to sample standard deviation. For a brief refresher on standard deviation, it is the square root of the sum of the differences from the mean squared for each value divided by the number of elements. That is:
Find the mean (average) of the data
Subtract the mean from each element and find the square of this difference
Sum up (add together) all these squared values
Divide the sum by the number of values (length of the input list)
Take the square root of the value calculated in step 4
Testing: In the tests package, complete the test suite named LectureTask2 that tests the following functionality.
In the statistics.Statistics object, implement the following method:
In the statistics.Statistics object, a method named “standardDeviation” with:
A type Parameter T
Has parameters of type List of T and a function of type T to Double
You can assume the input list is not empty
Returns the standard deviation of the elements in the list after the input function is applied to them
Note: The structure of this method is the same as the “average” method except you’ll compute the standard deviation instead of the average of the outputs of the function
Functionality: Implement the standardDeviation method.
Simple averages do not work well with user reviews since all of our top K lists will be dominated by songs with a single rating of 5. For example, a song with 100 ratings and an average rating of 4.9 should be ranked higher than a song with 1 rating of 5 even though the later song has a higher average rating.
To address this issue we will use a Bayesian average. To compute a Bayesian average we will add “fake” ratings to every song before computing the average. For example if we add 2 fake ratings of 3 to every song, a song with a single rating of 5 will have a bayesian average rating of 3.67 (The average of [5, 3, 3]) while the song with 100 ratings will barely be affected.
References:
https://fulmicoton.com/posts/bayesian_rating
https://en.wikipedia.org/wiki/Bayesian_average
Testing: In the tests package, complete the test suite named LectureTask3 that tests the following functionality.
In the statistics.Statistics object, a method named “bayesianAverage” with:
A type Parameter T
Has parameters of type List of T, a function of type T to Double, an Int representing the number of extra “fake” ratings, and an Int representing the value of the extra ratings
You can assume the input list is not empty
Returns the bayesian average of the elements in the list after the input function is applied to them
Functionality: Implement the bayesianAverage method.
Note: The intent of this task is to give you practice writing a recursive method. There is a way to write this method without recursion, but it is difficult to attempt LT4 without any practice with recursion. If you have questions in office hours about LT4 and you did not use recursion on LT3, the response to your question should be to rewrite LT3 using recursion.
We have some tools to work with data, but we don’t have data. In this task you’ll read song ratings from a file and return the list of songs, with all their ratings, from the file. This will be a CSV file in the format “youtubeId,artist,title,rating,energyRating” and you can assume that there are no commas in the data itself so you can safely split on commas (Note: In general, CSV data can contain commas if the data is surrounded by quotes). The rating and energyRating will be integer values ranging from 1 to 5.
After parsing the file you should have a list of songs containing all the ratings from the file. If a song has been rated multiple times there should only be one song in the returned list containing all the ratings for that song. Two songs are considered the same if they have the same youtubeId even if they have a different artist/title (eg. If someone mistyped the artist/title do not consider that as a different song).
Testing: There is no testing requirement for this objective. This does not mean that you shouldn’t test your code, only that your testing will not be checked in AutoLab. If you have questions during office hours about LT4 and you don’t have any tests, you know what we’ll say in response to your question “write tests!”
Functionality: In the Song class implement the following method:
A method named “addRating” that takes a SongRating and returns a new Song that is a copy of the song with the new rating added
In the Song object complete the following method:
A method named “readSongsFromFile” that takes a String representing a filename containing song ratings in the format above and returns a list of Songs containing all the songs and rating from the file
Note: After completing LT1-4 you can run the code in music.SampleUsage to generate playlists from a file of ratings and your statistics methods. Enjoy the music!
Lecture Task 5: Song Cost Function
To interact with the genetic algorithm we need a cost function. In this task you’ll write a method that computes the cost of a song given a specific user’s ratings. We will combine the overall rating of the song with the user’s rating for the song, if the user rated this song. This cost will be a measure of how much this user would like/dislike having this song in a playlist. The best song for this user is the song with the lowest cost.
Testing: In the tests package, complete the test suite named LectureTask5 that tests the following functionality.
In the Song object, a method named “costFunction” that:
Takes a Map of Strings to Ints as a parameter representing the rating of a specific user. The keys of this map will be youtubeIds and the values are the ratings from this user
Returns a function that takes a Song as a parameter and returns a Double. This function will compute the cost of a Song for this specific user based on their ratings. The function will return:
1000.0 if the song has been rated with a 1 or 2 by this user. This is a very high cost relative to songs rated 3+ and will almost certainly prevent these songs from appearing in any playlist for this user
If the song has not been rated by this user, treat it as though the user gave it a rating of 3
If the song has been rated 3, 4, 5, or has not been rated and is being treated as a 3, return a cost of 1 / (bayesianRating * userRating)
For bayesianRating add 2 extra rating with value 3
Example: If a song has ratings (2,4,5) and the user rated this song with a 4 (from the input Map) the song will have a bayesian average rating of 3.4 and a cost of 0.0735294
Functionality: Implement the Song.costFunction method.
Lecture Task 6: Playlist Cost Function
Just as we did with Songs, you’ll write a cost function for entire playlists. This will allow your genetic algorithm to generate playlists for a user that will minimize this score.
Testing: In the tests package, complete the test suite named LectureTask6 that tests the following functionality.
In the Playlist object, a method named “costFunction” that:
Takes a Map of Strings to Ints as a parameter representing the rating of a specific user. The keys of this map will be youtubeIds and the values are the ratings from this user
Returns a function that takes a Playlist as a parameter and returns a Double. This function will compute the cost of a Playlist for this specific user based on their ratings. The function will compute the cost as follows:
Compute a “raw cost” as the sum of the cost of all the songs. Remember that you can call your Song.costFunction method to get a cost function for songs and call this function to compute these costs. Sum up all the costs and that’s the raw cost of the playlist
Compute a cost multiplier that is:
1000.0 if the playlist contains a duplicate song (Check if the playlist contains any youtubeIds multiple times). This large multiplier will all but certainly prevent any playlist from containing duplicate songs
Compute the standard deviation of the averageEnergyRating of all songs (i.e. The data set is the average energy rating of each song in the playlist. Compute the standard deviation of this data set). This standard deviation will give us a measure of the variety of songs in the playlist based on energy level. We will favor playlists with higher energy variety
If the standard deviation is < 0.5, set the cost multiplier to 10.0 If the standard deviation is >= 0.5, set the cost multiplier to the inverse of the standard deviation (eg. 1.0 divided by the standard deviation)
Returns the raw cost times the cost multiplier
Functionality: Implement the Playlist.costFunction method.
Application Objectives
Testing Note
Your tests will not be checked by AutoLab for these application objectives
If you ask a question in office hours or on Piazza about these objectives without having any testing, don’t expect a satisfying answer (Unless your question is about testing)
Application Objective 1: Genetic Algorithm
Implement a generic genetic algorithm.
In the GeneticAlgorithm object, write a method named geneticAlgorithm with the following features:
Takes a type parameter T
This is the type of the solution space that you are trying to find. For example, when finding the best playlist this type will be Playlist
As the first parameter, takes an “incubator” function that takes a List[Double] and returns an object of type T
This function will provide a way to convert a list of genes into an object in the solution space. Your algorithm will only work with lists of doubles to guess new solutions and call this incubator to convert the genes into possible solutions
As the second parameter, takes a cost function of type T to Double
This function takes a potential solution and computes its cost. The goal of your algorithm is to find a solution with minimal cost. A solution with a cost of 0.0 is considered perfect, though a perfect solution will not always exist.
The third, and final, parameter is an Int representing the number of genes in the List[Double] for the incubator function
When creating/mutating/crossing lists for the incubator, make sure all the lists have this length
Returns an object of type T which is the most optimal solution found by the algorithm
Your algorithm will be tested with an unknown application to ensure you implemented the algorithm generically. Your algorithm will be called 20 times during testing to test the efficiency and consistency of your implementation. Be sure your algorithm is efficient enough to run this many times. Grading will timeout after 5 minutes.
Algorithm Outline: You have a significant amount of freedom in designing your genetic algorithm. Your only goal is to return a solution that is close to optimal. If you achieve this, you can complete this assignment. For a little guidance on designing an algorithm, or if you don’t want to design your own from scratch, here are a few steps you can follow.
Start by generating a fixed number of random “animals”
Each animal is defined by its genes (List of doubles) so this involves generating random lists of doubles
Depending on the rest of your algorithm and applications, the range of these random values may be important. Ensure that the range is wide enough to allow any potential solution. (If your genes cover the range -100 to 100 you will be able to pass the application objective)
Find the best animals
Use the incubator function to create a potential solution for each set of genes
Use the cost function to compute the cost of each potential solution
Sort the animals based on their cost to find the best animals
Survival of the fittest
Only keep animals with the least costs and say goodbye to the rest
Mutations
Keep the best animal with no mutations
Also create mutations of the best animals by creating new sets of genes that are similar to the best ones, but with some random variation
Randomness is your friend in this step. You are starting with a pretty good solution and creating random solutions that are similar to the good one to find an even better one. Creating several variations of the best solutions with random changes will help your algorithm to find the best solution
Crossover
Pair up the top animals and create “offspring” by combining their genes
There are several methods of this, for example averaging their genes or randomly choosing one parent for each gene
More random animals
Add more random animals to get back to the original number of animals
The next generation
Go back to step 2 [with recursion]
You can recurse for a fixed number of generations, or until the improvement from the previous generation is very small. Do not run until the cost is close to 0.0 since the most optimal solution may have a high cost.
If you follow this outline, there are many parameters that are up to you to choose. You should test and adjust your parameters until you have a robust genetic algorithm. These parameters include:
How many animals in each generation
How many generations
How many animals from each generation you’ll keep
How to mutate genes
How many mutations to generate and which animals get mutated
How to crossover two (or more) animals
Testing: Your testing will not be graded in AutoLab as you are expected to know how to write useful tests without being forced to write them at this point in the course. To write your tests you should use the cost function and incubator from the Song and Playlist objects as your inputs. Start with something simple like a few songs and an empty map of ratings and ensure your algorithm returns the song with the highest bayesian average rating. Eventually you’ll build up to generating playlists with an increasing number of songs (number of genes is the length of a playlist). Accurately finding a playlist of length 3 would be a significant accomplishment and should be enough to pass my tests on AutoLab.
Application Objective 2: Neural Networks
Use your genetic algorithm to train a neural network.
Complete the NeuralNetwork object and class to complete this objective. Your software will be checked for an accuracy >= 80% on the iris dataset to check if you’ve completed this objective (ie. Don’t strive for 100% accuracy. I believe it’s impossible on this dataset using a Neural Network).
Neural Network: A neural network is a machine learning technique inspired by the functionality of neurons. The goal of a neural network is to take a vector of doubles and return the category to which the vector belongs. In other words, given several features of something represented as doubles, the network will attempt to correctly determine what is represented by those values.
A neural network accomplishes this by simulating biological neurons. A simulated neuron is connected to several other neurons which will send signals to it (represented as doubles) and the neuron will combine these signals to determine its output signal. The network has 3 types of nodes:
Input neurons: These nodes are set to the input vector values. These nodes output the feature vector without modification. The number of input nodes will match the size of the feature vector for each application. In the iris examples below, there will be 4 input neurons in the network.
Hidden neurons: Each hidden neuron is connected to the output of all input neurons and combines all the input features into a single value to determine its output signal strength. This is accomplished by assigning a weight to each input and combining them linearly with an additional constant as a bias term. The output of this linear combination is then the input to a function (commonly the sigmoid function) to determine whether or not the neuron “fires.” This makes the output signal tend towards either 1.0 or 0.0.
Output neurons: The output neurons are connected to all of the hidden neurons in the same way that the hidden neurons are connected to the inputs. Each output neuron is associated with one of the possible categories for the application. The output node with the strongest signal is the most likely category for that input and is the category returned by the network.
For more details about neural network functionality, this video contains an excellent overview: https://www.youtube.com/watch?v=aircAruvnKk
Our goal is to train a neural network by determining the values for all the weights and biases for each neuron. This includes a weight for each input node for each hidden node, plus a bias term for each hidden node and a weight for each hidden node for each output node, plus a bias term for each output node. This total number of weights and biases should be the number of dimensions of the network which is the size of your genes in the genetic algorithm. This gene size will depend on your choice for the number of hidden nodes and will be determined by calling NeuralNetwork.numberOfDimensions(numberOfInputNodes, numberOfCategories) which you should call in your testing.
Example: For testing, you can use the publicly available iris dataset (linked below). This data set contains data about different types of irises represented by four measurements of the flowers (the feature vector) along with their categories (type of iris).
https://gist.github.com/curran/a08a1080b88344b0c8a7#file-iris-csv
The goal of your neural network is to correctly categorize the type of iris given its feature vector. More specifically, create a neural network such that the findCategory method will return the type of iris given it’s feature vector as a list of doubles.
Training: To train your neural network, and training AI algorithms in general, you should split your data into a training set and a testing set. It is important not to test with the same data the algorithm was trained with to avoid overfitting to the data. The handout code contains the full dataset, and the dataset split into testing and training sets as an example.
Summary: Your neural network will be created by providing the number of inputs (feature vector size) and the output categories. For the iris example a neural network will be created by calling
new NeuralNetwork(4, List(“setosa”, “versicolor”, “virginica”), List(…)))
And the created network should have 4 input neurons, 3 output neurons associated with the three categories, and a number of hidden nodes determined by your design.
The numberOfDimensions method in the NeuralNetwork object will be used to determine how many dimensions should be used by the genetic algorithm. This should return the number of genes you expect for a given number of inputs and output. It is up to you to determine how to set all the weights and biases based on these genes. You can decide any order you’d like for these weights and biases.
The cost function should determine how well a neural network performed on a training data set which you can create using the iris data. This training data is prelabeled with the correct category for a number of feature vectors. The cost should be some measure of how many input vectors were misclassified by the neural network.
Run your genetic algorithm on part/half of the iris data as the training set, then check the output neural network on the remaining data to see how well the genetic algorithm trained a neural network.
If you’ve made it this far, congratulations! You’ve just trained a machine to identify irises without telling it anything about irises! This is one way that algorithms can “learn” just by providing them with training data and why your data is so valuable to so many companies. If you are interested in this, you should consider taking machine learning courses and study more efficient methods of training classifiers.