Design a simple relational database for Amazon with at least three tables (refer to slides 17, 18, and 20).
Answer the following questions for your design:
a. What might a data set look like given your database after denormalization? (refer to slide 20, the pet example. Include at least one observation/row of data. You may use makeup/fake data.)
b. What questions might be answered with this (denormalized) data? You may assume the data set is big.
c. Are there any privacy/security issues related to your (denormalized) design? How would you protect against these issues?
Pick and define 5 key terms from the given list below. Use at least 30 words for each definition.
data mining, database, data warehouse, operational data, organization data, OLTP, OLAP, denormalization, CRISP-DM, out-of-sync data
Struggling with where to start this assignment? Follow this guide to tackle your assignment easily!
Step 1: Design the Database
Let’s design a simple relational database for Amazon with at least three tables. We can use the following tables to represent different aspects of the Amazon system:
Tables:
- Customers Table
- CustomerID (Primary Key)
- Name
- Address
- PhoneNumber
- Products Table
- ProductID (Primary Key)
- ProductName
- Category
- Price
- StockQuantity
- Orders Table
- OrderID (Primary Key)
- CustomerID (Foreign Key)
- ProductID (Foreign Key)
- Quantity
- OrderDate
- OrderStatus
Step 2: Denormalization Example
a. What might a data set look like given your database after denormalization?
After denormalization, we combine relevant data from the three normalized tables into one large table. This will include repeating data (e.g., customer details) for each order, leading to faster querying at the cost of storage efficiency.
Example of Denormalized Data:
OrderID | CustomerID | Name | ProductID | ProductName | Category | Price | Quantity | OrderDate | OrderStatus | |
---|---|---|---|---|---|---|---|---|---|---|
101 | 1 | John Doe | john@email.com | 1001 | Laptop | Electronics | 1000 | 1 | 2025-02-01 | Shipped |
102 | 2 | Jane Smith | jane@email.com | 1002 | Headphones | Electronics | 200 | 2 | 2025-02-02 | Processing |
103 | 1 | John Doe | john@email.com | 1003 | Book | Books | 15 | 3 | 2025-02-03 | Delivered |
b. What questions might be answered with this (denormalized) data?
With this denormalized data, you can easily answer questions related to customer orders without needing to join multiple tables. Some examples of questions include:
- What are the most popular products bought by customers?
- How many orders did John Doe place in February?
- What is the total amount spent by each customer?
- What is the average order quantity per product?
- Which customers have the highest number of orders?
Since this is a large dataset, it would be efficient for querying answers to these questions without needing to join the Customers
, Products
, and Orders
tables repeatedly.
c. Are there any privacy/security issues related to your (denormalized) design? How would you protect against these issues?
Yes, denormalized data can expose sensitive information. For instance, in the above example, customer names, emails, and addresses are repeated with each order. This could lead to:
- Privacy Risks: Sensitive customer data, such as their email addresses and shipping addresses, are visible for every order. If someone gains unauthorized access to this dataset, they could misuse the information.
- Data Redundancy: Repeating the same customer information could lead to greater chances of errors or inconsistent data if updates are not applied consistently.
To protect against these issues:
- Encryption: Encrypt sensitive data, such as customer emails and addresses, both in transit and at rest.
- Access Controls: Limit access to sensitive information based on user roles. Only authorized personnel should have access to customer personal information.
- Data Masking: Mask sensitive parts of data in reports or query results.
- Audit Logs: Maintain detailed logs of who accessed the data and when, so suspicious activity can be tracked and investigated.
Step 3: Key Terms Definitions
- Data Mining: Data mining is the process of discovering patterns, trends, and relationships in large datasets using statistical methods, machine learning, and artificial intelligence. The goal of data mining is to extract useful information and generate insights that can guide business decisions. For example, in Amazon, data mining can be used to analyze purchase patterns and recommend products to customers based on their browsing history.
- Database: A database is an organized collection of data, typically stored and accessed electronically from a computer system. It allows for the efficient storage, retrieval, and management of data. Databases are typically managed by database management systems (DBMS), which control how data is stored, manipulated, and queried. A database could be used by Amazon to store customer, product, and transaction data, allowing quick searches and updates.
- OLTP (Online Transaction Processing): OLTP refers to systems that manage transaction-based applications, typically involving short online transactions. These systems handle a large number of transactional queries (e.g., sales, orders, and payments) and focus on maintaining data integrity and speed. For example, Amazon’s order system uses OLTP to manage customer purchases and payment processing.
- Denormalization: Denormalization is the process of combining multiple related tables into one table in a relational database. It’s often used to improve query performance by reducing the need for complex joins between tables. In denormalized databases, redundancy is introduced, which can sometimes lead to inconsistency but is beneficial for read-heavy workloads, such as reporting or data analysis. For Amazon, denormalization could be applied in the sales reporting system.
- OLAP (Online Analytical Processing): OLAP refers to systems designed to analyze large amounts of data quickly and efficiently. These systems allow for the multidimensional analysis of business data, often for decision-making purposes. OLAP databases are used for complex queries and data mining. For example, Amazon might use OLAP to analyze customer buying trends and product performance across various dimensions like region, time, and product category.
By following this guide, you’ll be able to design a relational database for Amazon, analyze its denormalized data, understand the potential questions you could answer, and address any privacy and security concerns effectively.