Assignment: Halloween Candy
Name:
Overview
In this assignment, your job will be to clean and wrangle data from a survey of Halloween candy to prepare it for a machine learning project.
Once you have completed all the tasks in the notebook, save your notebook as halloween_candy, and verify that all the tests pass in CodeGrade.
For those of you who are more comfortable using Pandas, I have also provided optional extra data analysis questions at the end of the assignment for more practice.
Data Set
The data set that we will be using is the 2017 Halloween Candy Hierarchy data set as discussed in this boingboing article. You can also read more about the data in the Science Creative Quarterly.
The following are the rating instructions from the survey:
Basically, consider that feeling you get when you receive this item in your Halloween haul. Does it make you really happy (JOY)? Or is it something that you automatically place in the junk pile (DESPAIR)? MEH for indifference, and you can leave blank if you have no idea what the item is.
Note that the original data set has been slightly altered from its original state, and if you wanted to perform any analysis for future projects, you would need to download the data directly from the links above.
This data is a great example of a messy data set, especially since respondents were allowed to enter free text for a number of the fields. Also, note that some of the comments in the file might be considered inappropriate by some readers, but cleaning this type of data is normal in a lot of data science projects.
Note
Show Work
Remember that you must show your work. Student submissions are spot checked manually throughout the term to verify that answers are not hard coded from looking only at the file or at CodeGrade's expected output. If this is seen, the student's answer will be manually marked wrong and their grade will be changed to reflect this.
For example, if the answer to Q1, the mean of a specific column, is 22:
# correct way
Q1 = df['column_name'].mean()

# incorrect way
Q1 = 22
Our End Goal
Our end goal for this project is to clean the data so that we could then create a machine learning model. We want to see if we are able to predict a person's gender based purely on their candy preferences. You will not be creating a model for this assignment, only cleaning the data; the results of the models that I used after cleaning the data are provided at the end of this notebook.
Initial Import & Exploration
In [18]:
# initial imports
import pandas as pd
import numpy as np

# Do not change this option; this allows the CodeGrade auto grading to function correctly
pd.set_option('display.max_columns', 20)
Let's start by importing our data and creating a DataFrame called candy. We need to include encoding='iso-8859-1' during the import because there are special characters in the data that Pandas doesn't recognize. This happens a lot when importing data where the public is able to type in answers, especially if foreign language characters are included. The default encoding for Pandas is utf-8, so changing the encoding allows Pandas to recognize those special characters.
Run the following code, with the encoding argument, and it should import correctly.
In [19]:
# read_csv with iso-8859-1 encoding; using latin-1 would also work here
candy_full = pd.read_csv('candy.csv', encoding='iso-8859-1')

# copy to a new DataFrame so that we have a copy of the original import if needed
candy = candy_full.copy()
Let's take a brief look at the data by using head().
In [20]:
# first five rows
candy.head()
Out[20]:
[wide table output truncated: first five rows of the 120-column DataFrame]
5 rows × 120 columns
Next, run the following code to see information about the DataFrame.
In [21]:
# check info about the DataFramecandy.info()
RangeIndex: 2479 entries, 0 to 2478
Columns: 120 entries, Internal ID to Click Coordinates (x, y)
dtypes: float64(4), int64(1), object(115)
memory usage: 2.3+ MB
Notice that this did not print the columns as you might be used to seeing. According to the Pandas documentation: “If the DataFrame has more than max_cols columns, the truncated output is used. By default, the setting in pandas.options.display.max_info_columns is used.”
We can make the columns display by setting the max_cols argument equal to the number of columns in the data set.
In [22]:
# check info, set max_cols
candy.info(max_cols=120)
RangeIndex: 2479 entries, 0 to 2478
Data columns (total 120 columns):
# Column Non-Null Count Dtype
---  ------                                   --------------  -----
0 Internal ID 2479 non-null int64
1 Q1: GOING OUT? 2368 non-null object
2 Q2: GENDER 2437 non-null object
3 Q3: AGE 2394 non-null object
4 Q4: COUNTRY 2414 non-null object
5 Q5: STATE, PROVINCE, COUNTY, ETC 2377 non-null object
6 Q6 | 100 Grand Bar 1728 non-null object
7 Q6 | Anonymous brown globs that come in black and orange wrappers (a.k.a. Mary Janes) 1741 non-null object
8 Q6 | Any full-sized candy bar 1803 non-null object
9 Q6 | Black Jacks 1517 non-null object
10 Q6 | Bonkers (the candy) 1482 non-null object
11 Q6 | Bonkers (the board game) 1469 non-null object
12 Q6 | Bottle Caps 1709 non-null object
13 Q6 | Box’o’Raisins 1787 non-null object
14 Q6 | Broken glow stick 1767 non-null object
15 Q6 | Butterfinger 1793 non-null object
16 Q6 | Cadbury Creme Eggs 1792 non-null object
17 Q6 | Candy Corn 1796 non-null object
18 Q6 | Candy that is clearly just the stuff given out for free at restaurants 1784 non-null object
19 Q6 | Caramellos 1723 non-null object
20 Q6 | Cash, or other forms of legal tender 1795 non-null object
21 Q6 | Chardonnay 1732 non-null object
22 Q6 | Chick-o-Sticks (we donÕt know what that is) 1528 non-null object
23 Q6 | Chiclets 1764 non-null object
24 Q6 | Coffee Crisp 1621 non-null object
25 Q6 | Creepy Religious comics/Chick Tracts 1771 non-null object
26 Q6 | Dental paraphenalia 1783 non-null object
27 Q6 | Dots 1746 non-null object
28 Q6 | Dove Bars 1773 non-null object
29 Q6 | Fuzzy Peaches 1652 non-null object
30 Q6 | Generic Brand Acetaminophen 1744 non-null object
31 Q6 | Glow sticks 1778 non-null object
32 Q6 | Goo Goo Clusters 1595 non-null object
33 Q6 | Good N’ Plenty 1741 non-null object
34 Q6 | Gum from baseball cards 1759 non-null object
35 Q6 | Gummy Bears straight up 1778 non-null object
36 Q6 | Hard Candy 1780 non-null object
37 Q6 | Healthy Fruit 1781 non-null object
38 Q6 | Heath Bar 1763 non-null object
39 Q6 | Hershey’s Dark Chocolate 1802 non-null object
40 Q6 | HersheyÕs Milk Chocolate 1803 non-null object
41 Q6 | Hershey’s Kisses 1797 non-null object
42 Q6 | Hugs (actual physical hugs) 1762 non-null object
43 Q6 | Jolly Rancher (bad flavor) 1781 non-null object
44 Q6 | Jolly Ranchers (good flavor) 1780 non-null object
45 Q6 | JoyJoy (Mit Iodine!) 1449 non-null object
46 Q6 | Junior Mints 1777 non-null object
47 Q6 | Senior Mints 1532 non-null object
48 Q6 | Kale smoothie 1731 non-null object
49 Q6 | Kinder Happy Hippo 1528 non-null object
50 Q6 | Kit Kat 1801 non-null object
51 Q6 | LaffyTaffy 1739 non-null object
52 Q6 | LemonHeads 1745 non-null object
53 Q6 | Licorice (not black) 1789 non-null object
54 Q6 | Licorice (yes black) 1790 non-null object
55 Q6 | Lindt Truffle 1757 non-null object
56 Q6 | Lollipops 1784 non-null object
57 Q6 | Mars 1750 non-null object
58 Q6 | Maynards 1450 non-null object
59 Q6 | Mike and Ike 1746 non-null object
60 Q6 | Milk Duds 1782 non-null object
61 Q6 | Milky Way 1787 non-null object
62 Q6 | Regular M&Ms 1800 non-null object
63 Q6 | Peanut M&MÕs 1804 non-null object
64 Q6 | Blue M&M’s 1748 non-null object
65 Q6 | Red M&M’s 1746 non-null object
66 Q6 | Green Party M&M’s 1711 non-null object
67 Q6 | Independent M&M’s 1662 non-null object
68 Q6 | Abstained from M&M’ing. 1532 non-null object
69 Q6 | Minibags of chips 1751 non-null object
70 Q6 | Mint Kisses 1699 non-null object
71 Q6 | Mint Juleps 1664 non-null object
72 Q6 | Mr. Goodbar 1735 non-null object
73 Q6 | Necco Wafers 1731 non-null object
74 Q6 | Nerds 1752 non-null object
75 Q6 | Nestle Crunch 1777 non-null object
76 Q6 | Now’n’Laters 1657 non-null object
77 Q6 | Peeps 1765 non-null object
78 Q6 | Pencils 1766 non-null object
79 Q6 | Pixy Stix 1753 non-null object
80 Q6 | Real Housewives of Orange County Season 9 Blue-Ray 1722 non-null object
81 Q6 | ReeseÕs Peanut Butter Cups 1796 non-null object
82 Q6 | Reese’s Pieces 1784 non-null object
83 Q6 | Reggie Jackson Bar 1460 non-null object
84 Q6 | Rolos 1761 non-null object
85 Q6 | Sandwich-sized bags filled with BooBerry Crunch 1699 non-null object
86 Q6 | Skittles 1769 non-null object
87 Q6 | Smarties (American) 1750 non-null object
88 Q6 | Smarties (Commonwealth) 1573 non-null object
89 Q6 | Snickers 1785 non-null object
90 Q6 | Sourpatch Kids (i.e. abominations of nature) 1737 non-null object
91 Q6 | Spotted Dick 1593 non-null object
92 Q6 | Starburst 1782 non-null object
93 Q6 | Sweet Tarts 1767 non-null object
94 Q6 | Swedish Fish 1760 non-null object
95 Q6 | Sweetums (a friend to diabetes) 1472 non-null object
96 Q6 | Take 5 1557 non-null object
97 Q6 | Tic Tacs 1761 non-null object
98 Q6 | Those odd marshmallow circus peanut things 1739 non-null object
99 Q6 | Three Musketeers 1767 non-null object
100 Q6 | Tolberone something or other 1769 non-null object
101 Q6 | Trail Mix 1767 non-null object
102 Q6 | Twix 1785 non-null object
103 Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein 1683 non-null object
104 Q6 | Vicodin 1686 non-null object
105 Q6 | Whatchamacallit Bars 1652 non-null object
106 Q6 | White Bread 1718 non-null object
107 Q6 | Whole Wheat anything 1728 non-null object
108 Q6 | York Peppermint Patties 1770 non-null object
109 Q7: JOY OTHER 917 non-null object
110 Q8: DESPAIR OTHER 723 non-null object
111 Q9: OTHER COMMENTS 389 non-null object
112 Q10: DRESS 1730 non-null object
113 Unnamed: 113 9 non-null object
114 Q11: DAY 1750 non-null object
115 Q12: MEDIA [Daily Dish] 85 non-null float64
116 Q12: MEDIA [Science] 1376 non-null float64
117 Q12: MEDIA [ESPN] 99 non-null float64
118 Q12: MEDIA [Yahoo] 67 non-null float64
119 Click Coordinates (x, y) 1619 non-null object
dtypes: float64(4), int64(1), object(115)
memory usage: 2.3+ MB
Of course, if you are just looking for the column names, you can just use a simple for loop.
In [23]:
# print a list of column names
for col in candy.columns:
    print(col)
Internal ID
Q1: GOING OUT?
Q2: GENDER
Q3: AGE
Q4: COUNTRY
Q5: STATE, PROVINCE, COUNTY, ETC
Q6 | 100 Grand Bar
Q6 | Anonymous brown globs that come in black and orange wrappers (a.k.a. Mary Janes)
Q6 | Any full-sized candy bar
Q6 | Black Jacks
Q6 | Bonkers (the candy)
Q6 | Bonkers (the board game)
Q6 | Bottle Caps
Q6 | Box’o’Raisins
Q6 | Broken glow stick
Q6 | Butterfinger
Q6 | Cadbury Creme Eggs
Q6 | Candy Corn
Q6 | Candy that is clearly just the stuff given out for free at restaurants
Q6 | Caramellos
Q6 | Cash, or other forms of legal tender
Q6 | Chardonnay
Q6 | Chick-o-Sticks (we donÕt know what that is)
Q6 | Chiclets
Q6 | Coffee Crisp
Q6 | Creepy Religious comics/Chick Tracts
Q6 | Dental paraphenalia
Q6 | Dots
Q6 | Dove Bars
Q6 | Fuzzy Peaches
Q6 | Generic Brand Acetaminophen
Q6 | Glow sticks
Q6 | Goo Goo Clusters
Q6 | Good N’ Plenty
Q6 | Gum from baseball cards
Q6 | Gummy Bears straight up
Q6 | Hard Candy
Q6 | Healthy Fruit
Q6 | Heath Bar
Q6 | Hershey’s Dark Chocolate
Q6 | HersheyÕs Milk Chocolate
Q6 | Hershey’s Kisses
Q6 | Hugs (actual physical hugs)
Q6 | Jolly Rancher (bad flavor)
Q6 | Jolly Ranchers (good flavor)
Q6 | JoyJoy (Mit Iodine!)
Q6 | Junior Mints
Q6 | Senior Mints
Q6 | Kale smoothie
Q6 | Kinder Happy Hippo
Q6 | Kit Kat
Q6 | LaffyTaffy
Q6 | LemonHeads
Q6 | Licorice (not black)
Q6 | Licorice (yes black)
Q6 | Lindt Truffle
Q6 | Lollipops
Q6 | Mars
Q6 | Maynards
Q6 | Mike and Ike
Q6 | Milk Duds
Q6 | Milky Way
Q6 | Regular M&Ms
Q6 | Peanut M&MÕs
Q6 | Blue M&M’s
Q6 | Red M&M’s
Q6 | Green Party M&M’s
Q6 | Independent M&M’s
Q6 | Abstained from M&M’ing.
Q6 | Minibags of chips
Q6 | Mint Kisses
Q6 | Mint Juleps
Q6 | Mr. Goodbar
Q6 | Necco Wafers
Q6 | Nerds
Q6 | Nestle Crunch
Q6 | Now’n’Laters
Q6 | Peeps
Q6 | Pencils
Q6 | Pixy Stix
Q6 | Real Housewives of Orange County Season 9 Blue-Ray
Q6 | ReeseÕs Peanut Butter Cups
Q6 | Reese’s Pieces
Q6 | Reggie Jackson Bar
Q6 | Rolos
Q6 | Sandwich-sized bags filled with BooBerry Crunch
Q6 | Skittles
Q6 | Smarties (American)
Q6 | Smarties (Commonwealth)
Q6 | Snickers
Q6 | Sourpatch Kids (i.e. abominations of nature)
Q6 | Spotted Dick
Q6 | Starburst
Q6 | Sweet Tarts
Q6 | Swedish Fish
Q6 | Sweetums (a friend to diabetes)
Q6 | Take 5
Q6 | Tic Tacs
Q6 | Those odd marshmallow circus peanut things
Q6 | Three Musketeers
Q6 | Tolberone something or other
Q6 | Trail Mix
Q6 | Twix
Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein
Q6 | Vicodin
Q6 | Whatchamacallit Bars
Q6 | White Bread
Q6 | Whole Wheat anything
Q6 | York Peppermint Patties
Q7: JOY OTHER
Q8: DESPAIR OTHER
Q9: OTHER COMMENTS
Q10: DRESS
Unnamed: 113
Q11: DAY
Q12: MEDIA [Daily Dish]
Q12: MEDIA [Science]
Q12: MEDIA [ESPN]
Q12: MEDIA [Yahoo]
Click Coordinates (x, y)
This data set is pretty messy. Your goal is now to perform the following actions to get it to the point where it can be passed to a machine learning model.
Note: Unless the instructions ask you to do something different, please always update the original candy DataFrame for the exercises below. The automatic grading in CodeGrade will check your final DataFrame and ensure that you have performed all required data manipulations. Also, feel free to add additional cells as needed.
Data Cleaning
Exercise1: Taking a look at the column names, you may notice that some include the character Õ. This should instead be an apostrophe (') mark. Rename the columns whose names include the Õ character, replacing it with an apostrophe.
Remember that you should be updating the candy DataFrame for the tasks listed, unless told otherwise.
In [24]:
# str.replace on the column index renames the columns; replace() on the DataFrame would only change cell values
candy.columns = candy.columns.str.replace('Õ', "'")
candy.columns
Q1: How many duplicated rows are there in the file? Assume that a duplicate is any row that is exactly the same as another one. Save this number as Q1.
In [25]:
Q1 = candy.duplicated(keep=False).sum()
Q1
Out[25]:
34
Q2: How many duplicated rows are there in the file if we assume that a duplicate is any row with the same Internal ID number as another? In other words, even if the other values are different, a row counts as a duplicate if it has the same Internal ID as another. Save this number as Q2.
In [26]:
Q2 = candy['Internal ID'].duplicated(keep=False).sum()
Q2
Out[26]:
38
Exercise2: Drop any duplicates from the candy DataFrame. Duplicates are defined as any row with the same Internal ID as another. Use the default setting that keeps the first record from the duplicates.
In [27]:
candy = candy.drop_duplicates(subset='Internal ID')
candy
Out[27]:
[wide table output truncated: first and last five rows of the deduplicated DataFrame]
2460 rows × 120 columns
Exercise3: Your next task is to remove the following columns from the candy DataFrame as we will not use these columns for this project. You are welcome to do further analysis on these columns but do not save your analysis in this notebook.
Remove the following columns: Internal ID, Q5: STATE, PROVINCE, COUNTY, ETC, Q7: JOY OTHER, Q8: DESPAIR OTHER, Q9: OTHER COMMENTS, Unnamed: 113, Click Coordinates (x, y).
In [28]:
candy_columns_remove = ['Internal ID', 'Q5: STATE, PROVINCE, COUNTY, ETC', 'Q7: JOY OTHER',
                        'Q8: DESPAIR OTHER', 'Q9: OTHER COMMENTS', 'Unnamed: 113',
                        'Click Coordinates (x, y)']
candy.drop(columns=candy_columns_remove, inplace=True)
Code Check: As a check for the above exercises, the shape of your data should now be: (2460, 113)
In [30]:
candy.shape
Out[30]:
(2460, 113)
Exercise4: Let's now take a look at the Q2: GENDER column since this will be what we are trying to predict. Take a look at the value counts for this column.
In [31]:
# look at the value counts; do not overwrite the candy DataFrame with this result
candy['Q2: GENDER'].value_counts()
Out[31]:
Q2: GENDER
Male 1466
Female 839
I’d rather not say 83
Other 30
Name: count, dtype: int64
Q3: How many missing values are in the Q2: GENDER column? Save this as Q3.
In [37]:
Q3 = candy['Q2: GENDER'].isnull().sum()
Q3
Exercise5: Using the candy DataFrame, remove all rows with a missing value in the Q2: GENDER column. (This should overwrite and be saved as candy like you have been doing for the previous exercises.)
In [ ]:
### ENTER CODE HERE ###
Exercise6: For this project, we want to use binary classification, which predicts one of two classes. We want to predict between Male and Female. Because of this, select only the rows that contain either Male or Female in the Q2: GENDER column.
In [ ]:
### ENTER CODE HERE ###
Code Check: As a check for the above exercises, the shape of your data should now be: (2305, 113)
In [ ]:
### ENTER CODE HERE ###Now, let’s work on filling some of the missing data. There are easier ways to do this with the sklearn library which you will learn about more in the machine learning classes, but for now, let’s try to practice our Pandas skills.
Q4: How many missing values are in the Q1: GOING OUT? column? Save this number as Q4.
In [ ]:
### ENTER CODE HERE ###
Exercise7: For a future analysis question, we are interested in those that we know will definitely go out for Halloween. Because of this, fill all missing values in the Q1: GOING OUT? column with a No value.
In [ ]:
### ENTER CODE HERE ###
Code Check: Double check your above work and look at the value counts for the Q1: GOING OUT? column. Make sure that you only have "Yes" and "No" values and that they add up to 2305, which is the number of rows you should have at this step in the assignment.
In [ ]:
### ENTER CODE HERE ###
Exercise8: To get ready for the next step, let's practice selecting all the columns going from Q6 | 100 Grand Bar to Q11: DAY. Save this slice as candy_slice.
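If you need a reminder of how label-based column slicing works, here is a minimal sketch on a hypothetical DataFrame (toy and its column names are made up); note that .loc includes the end label:
import pandas as pd

# hypothetical DataFrame with a few labeled columns
toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6], 'd': [7, 8]})

# .loc slices columns by label, and unlike positional slicing the end label is included
toy_slice = toy.loc[:, 'b':'d']
print(toy_slice.columns.tolist())  # ['b', 'c', 'd']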
In [ ]:
### ENTER CODE HERE ###
Exercise9: Now that you know how to slice the data, fill any missing values in the candy DataFrame for those columns (going from Q6 | 100 Grand Bar to Q11: DAY) with the string NO_ANSWER. Make sure you are working with the candy DataFrame and not the candy_slice DataFrame.
In [ ]:
### ENTER CODE HERE ###
Exercise10: For all four Q12: MEDIA columns in the candy DataFrame, fill the missing values with 0.0.
In [ ]:
### ENTER CODE HERE ###
Code Check: As a check for the above code, make sure that there are no missing values left for the Q6 to Q12 columns.
In [ ]:
### ENTER CODE HERE ###Now, let’s look at the very messy Q4: COUNTRY column and see what we can do about it. First, run the code below to look at the different unique values in the data.
In [ ]:
# check unique values
candy['Q4: COUNTRY'].unique()
Code Check: As a check for the Country column, check how many unique values are in the data. You should have 115 different unique values for the Q4: COUNTRY column. If you have fewer or more than this number, double check your work above.
In [ ]:
# check the Q4: COUNTRY number of unique values
candy['Q4: COUNTRY'].nunique()
We want to clean up this data to only include four areas: USA, Canada, Europe (the continent, not necessarily the European Union), and Other.
There are different ways to do this, but I would suggest that you look at the way we handled the property_type column in the vienna data set and the code in the amenities_to_columns() function in the module notebook. This column might be a little harder than those examples, but they should give you a good baseline approach.
You could use replace() for this step, and it is fine if you ultimately decide to do this, but I would suggest that you come up with a solution similar to what was shown in the vienna data cleaning notebook. This method would be much more robust if you had many more values in your data.
I suggest the following order for this section to make it easier (a generic sketch of the pattern follows this list):
Fill in all missing values with Other
Code Australia as Other (handling Australia before the USA step helps if you use string methods to match 'us', since 'Australia' contains 'us')
Combine all USA entries together as USA
Combine Canadian entries as CA
Combine European entries as EU
Everything else gets coded as Other
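The sketch below shows the general normalize-then-mask pattern on a made-up Series of country answers; the value lists and variable names are hypothetical and deliberately much shorter than the real column:
import pandas as pd

# hypothetical free-text country answers
country = pd.Series(['USA!', 'usa ', 'Canada`', 'england', None])

# 1. fill missing values with a default label
country = country.fillna('Other')

# 2. normalize the text so that variants line up: lower case, strip spaces, drop stray punctuation
cleaned = country.str.lower().str.strip().str.replace(r'[^a-z ]', '', regex=True)

# 3. build a boolean mask from a small lookup list and overwrite the matching rows with the final label
usa_mask = cleaned.isin(['usa', 'us', 'united states'])
country[usa_mask] = 'USA'
print(country.tolist())  # ['USA', 'USA', 'Canada`', 'england', 'Other']
Repeating step 3 with a list per region (and a final catch-all for everything not yet matched) gives the same result as a long chain of replace() calls, but it stays readable as the number of messy values grows.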
Exercise11: Fill the missing values in the Q4: COUNTRY column with Other.
In [ ]:
### ENTER CODE HERE ###
Code Check: Double check that there are no missing values in the Q4: COUNTRY column. Also, double check the unique values to make sure that "Other" was added. This should mean that you now have 116 unique values for this column.
In [ ]:
# check missing Q4 values
### ENTER CODE HERE ###
In [ ]:
# check unique values
### ENTER CODE HERE ###
Exercise12: Combine all Australia entries into Other. Watch out for capitalization issues. You should have 114 unique values after this step.
In [ ]:
### ENTER CODE HERE ###
In [ ]:
# check number of unique values
### ENTER CODE HERE ###
Exercise13: Combine all United States entries together into USA. These would include the following:
'USA ', 'USA', 'us', 'usa', 'Us', 'US', 'Murica', 'United States', 'united states', 'Usa', 'United States ', 'United staes', 'United States of America', 'United states', 'u.s.a.', 'United States of America ', 'america', 'U.S.A.', 'unhinged states', 'united states of america', 'US of A', 'The United States', 'North Carolina ', 'Unied States', 'U S', 'u.s.', 'The United States of America', 'unite states','U.S.', 'USA? Hard to tell anymore..', "'merica", 'United State', 'United Sates', 'California', 'Unites States', 'USa', 'I pretend to be from Canada, but I am really from the United States.', 'Usa ', 'United Stated', 'New Jersey', 'United ststes', 'America', 'United Statss', 'murrika', 'USA! USA! USA!', 'USAA', 'united States ', 'N. America', 'USSA', 'U.S. ', 'u s a', 'United Statea', 'united ststes', 'USA USA USA!!!!'
In [ ]:
### ENTER CODE HERE ###
Code Check: You should be merging the above values together into 1 (USA) and be left with 61 unique values after this step (including the USA value).
In [ ]:
# check unique values
### ENTER CODE HERE ###
Exercise14: Combine the Canadian entries (both upper and lower case) and label them as CA. Be careful as there are extra spaces, characters, and misspellings (Can, Canae).
These values include:
'canada', 'Canada', 'canada ', 'Canada ', 'Can', 'Canae', 'Canada`', 'CANADA'
In [ ]:
### ENTER CODE HERE ###
Code Check: You should be merging 8 values together into 1 (CA) and be left with 54 unique values after this step (including the CA value).
In [ ]:
# check unique values
### ENTER CODE HERE ###
Exercise15: Combine the European entries and label them as EU. Again, we are looking at the continent of Europe and not necessarily the countries that are a part of the European Union.
These values include:
'uk', 'United Kingdom', 'England', 'UK', 'france', 'finland', 'Netherlands', 'germany', 'Europe', 'U.K. ', 'Greece', 'France', 'Ireland', 'Uk', 'Germany', 'Scotland', 'UK ', 'Denmark', 'France ', 'Switzerland', 'Scotland ', 'The Netherlands', 'Ireland ', 'spain', 'Sweden', 'United kingdom'
In [ ]:
### ENTER CODE HERE ###
Code Check: You should be merging 26 entries together and be left with 29 unique values after this step (including the EU value).
In [ ]:
# check unique values
### ENTER CODE HERE ###
Exercise16: Finally, combine the other entries and label them as Other.
In [ ]:
### ENTER CODE HERE ###
Code Check: Double check that you only have four unique values in the Q4: COUNTRY column: USA, Other, CA, and EU.
In [ ]:
# check values
### ENTER CODE HERE ###
Q5: To double check that everything was coded correctly, save the value counts of the Q4: COUNTRY column as Q5. You can check this once you run your CodeGrade check.
In [ ]:
### ENTER CODE HERE ###
We now want to look at the Q3: AGE column. Let's look at all the unique values.
In [ ]:
# check unique age values
### ENTER CODE HERE ###
Again, this is a pretty messy column of data. This is a good example of why those who create online surveys shouldn't allow the individual to just put any value into the field. But it is now our job to clean this up.
Exercise17: Your task is to put these values into the following categorical bins: unknown, 17 and under, 18-25, 26-35, 36-45, 46-55, and 56+.
The category labels should exactly match the above.
Missing values should be replaced with the unknown category
To make things easier and avoid ambiguity, let’s say that any value with text, even if we could determine the age, will be binned with the unknown category. For example: sixty-nine should be coded as unknown, 45-55 should be coded as unknown, 59 on the day after Halloween should be coded as unknown, etc.
Ensure that the category labels are unordered but reorder the categories so that 'unknown' is listed in the first position. This is not really needed but will help us grade your assignment. The categories should be listed as follows: Index(['unknown', '17 and under', '18-25', '26-35', '36-45', '46-55', '56+'], dtype='object')
First, we will replace any non-numeric value (those with text as mentioned above) with a missing value. This will allow you to turn the other values into floats so that you can bin them. Just don’t forget to code the missing values as unknown when you are done. To replace the non-numeric values, run the following code:
In [ ]:
# create True/False index
age_index = candy['Q3: AGE'].str.isnumeric()

# for the index, fill missing values with False
age_index = age_index.fillna(False)

# select the Age column for only those False values from the index and code them as missing
candy.loc[~age_index, 'Q3: AGE'] = np.nan
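The binning itself can be done with pd.cut. Here is a minimal sketch on made-up ages, assuming the float conversion has already happened; the exact bin edges are your choice as long as the resulting labels match the ones listed above:
import pandas as pd
import numpy as np

# hypothetical ages, already converted to floats, with one missing value
ages = pd.Series([12, 19, 30, 40, 50, 70, np.nan], dtype='float')

# bin into labeled, unordered categories
binned = pd.cut(
    ages,
    bins=[0, 17, 25, 35, 45, 55, np.inf],
    labels=['17 and under', '18-25', '26-35', '36-45', '46-55', '56+'],
    ordered=False,
)

# add an 'unknown' category, use it for the missing values, and move it to the front
binned = binned.cat.add_categories('unknown').fillna('unknown')
binned = binned.cat.reorder_categories(
    ['unknown', '17 and under', '18-25', '26-35', '36-45', '46-55', '56+']
)
print(binned.cat.categories)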
In [ ]:
### ENTER REST OF CODE HERE ###
Exercise18: Double check yourself by checking the categories for the Q3: AGE column. It should output: Index(['unknown', '17 and under', '18-25', '26-35', '36-45', '46-55', '56+'], dtype='object')
In [ ]:
# double check categories
### ENTER CODE HERE ###
Code Check: To double check that your above binning worked correctly, your value counts (sorted by the index) should be as follows:
unknown: 60
17 and under: 49
18-25: 85
26-35: 520
36-45: 768
46-55: 525
56+: 298
In [ ]:
### ENTER CODE HERE ###
You can also double check some of your work up to this point by making sure that there are no missing values in the data set anymore.
Code Check: Check to see if there are any missing values in the data set. Your output should show 0.
In [ ]:
### ENTER CODE HERE ###
Exercise19: Before you move on to the next section, reset the index for candy, ensuring that it goes from 0 to n-1.
In [ ]:
### ENTER CODE HERE ###
Feature Engineering
Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models. In this section, we will create a new column called “net_feelies” (calculated by the authors as the total joy count minus the total despair count).
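To make the net_feelies idea concrete before the exercises, here is a tiny sketch on a made-up two-candy DataFrame (the names and ratings are hypothetical); the hint in Exercise21 suggests filtering and using count(), and the boolean comparison shown here is just one equivalent way to get the same per-column counts:
import pandas as pd

# hypothetical ratings for two candies
toy = pd.DataFrame({
    'candy_a': ['JOY', 'JOY', 'DESPAIR', 'MEH'],
    'candy_b': ['DESPAIR', 'DESPAIR', 'JOY', None],
})

# per-column counts of each rating
joy = (toy == 'JOY').sum()
despair = (toy == 'DESPAIR').sum()

# net feeling for each candy: joy minus despair
net = joy - despair
print(net)  # candy_a: 1, candy_b: -1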
First, let’s narrow down our data to make working with it easier.
Exercise20: Select only the Q6 candy columns (Q6 | 100 Grand Bar through Q6 | York Peppermint Patties) in the data set and save this as a new DataFrame called candy_reduced.
In [ ]:
### ENTER CODE HERE ###
Next, we will create two Series, one with JOY counts and one with DESPAIR counts, to add to our candy_reduced data.
Exercise21: Create a Series called joy_count that lists total counts for JOY for each column, making sure to keep it in the same order as the columns in the candy_reduced DataFrame. Hint: A simple way to do this is to filter the entire DataFrame for any JOY values and then use count(). See this stackoverflow question and answers.
In [ ]:
### ENTER CODE HERE ###
Exercise22: Same as above, except you will create a Series called despair_count that lists the total counts for DESPAIR for each column.
In [ ]:
### ENTER CODE HERE ###
Exercise23: Take the transpose of the candy_reduced DataFrame and save this transposed data as candy_reduced_transpose.
In [ ]:
### ENTER CODE HERE ###
Exercise24: Add a new column called "joy_count" using the joy_count Series above and a new column called "despair_count" using the despair_count Series above to the candy_reduced_transpose DataFrame.
In [ ]:
### ENTER CODE HERE ###
Exercise25: Add a new column to the candy_reduced_transpose DataFrame called "net_feelies" that takes the joy_count column and subtracts the despair_count column.
In [ ]:
### ENTER CODE HERE ###
Exercise26: Select only the joy_count, despair_count, and net_feelies columns from the candy_reduced_transpose DataFrame. Sort this DataFrame in descending order by net_feelies and save this as candy_net_sorted.
In [ ]:
### ENTER CODE HERE ###
Encoding
We now want to get the candy DataFrame ready to run a machine learning algorithm to determine if we can predict a person's gender based on what candy they prefer.
You will learn more about this in the machine learning classes, but some algorithms work exclusively with numeric values. We will now turn all of our values into numeric values. There are easier ways to do this with sklearn, which you will study in later courses, but we will use Pandas to perform these exercises for further practice.
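As a small illustration of what turning labels into numbers looks like with plain Pandas (a sketch on a hypothetical Series, not the graded answer):
import pandas as pd

# hypothetical gender labels
gender = pd.Series(['Male', 'Female', 'Female', 'Male'])

# map the two classes to 1/0 so a binary classifier can use them
encoded = gender.replace({'Female': 0, 'Male': 1})
print(encoded.tolist())  # [1, 0, 0, 1]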
Exercise27: For grading purposes, we want to leave the candy DataFrame as is. Make a copy of the candy DataFrame and save this new DataFrame as candy_encode.
In [ ]:
### ENTER CODE HERE ###
Exercise28: For the candy_encode DataFrame, replace any Female values with 0 and any Male values with 1.
In [ ]:
### ENTER CODE HERE ###
Exercise29: Again, you will learn more about this later, but we need to separate the column that we want to predict (called the response) from the columns that we will use to make the predictions (called the features). For both of the items below, make sure that the index is reset and goes from 0 to n-1.
Select only the Q2: GENDER column from candy_encode and save this as candy_response. Note: This should be a Series.
Drop the following column