Book Sales Analysis
Comprehensive analysis of bestselling books using SQL and Python to uncover market trends and reader preferences.
%pip install pandas
%pip install nbconvert
Import dependencies and connect to db¶
import sqlite3
import pandas as pd
# Path to your SQLite database file
db_path = '/Users/sararodrigues/Documents/Data Analyst/book_sales' # or .db for example
# Connect to the database
conn = sqlite3.connect(db_path)
# Example: Query data into a pandas DataFrame
query = "SELECT * FROM book_sales LIMIT 10;"
df = pd.read_sql_query(query, conn)
# Show the data
print(df.head())
# Close the connection when done
conn.close()
Name \ 0 10-Day Green Smoothie Cleanse 1 11/22/63: A Novel 2 12 Rules for Life: An Antidote to Chaos 3 1984 (Signet Classics) 4 5,000 Awesome Facts (About Everything!) (Natio... Author User Rating Reviews Price Year Genre 0 JJ Smith 4.7 17350 8 2016 Non Fiction 1 Stephen King 4.6 2052 22 2011 Fiction 2 Jordan B. Peterson 4.7 18979 15 2018 Non Fiction 3 George Orwell 4.7 21424 6 2017 Fiction 4 National Geographic Kids 4.8 7665 12 2019 Non Fiction
Analysing Trends and Insights in Bestselling Books Using SQL¶
Project Overview¶
This project presents an in-depth analysis of a bestselling books dataset, which includes attributes such as book title, author, user rating, number of reviews, price, year of bestseller status, and genre. The dataset spans multiple years and features both fiction and non-friction titles.
The analysis is designed to uncover key market trends and reader engagement insights, answering questions such as which authors dominate bestseller lists, how book popularity and pricing evolve over time, and what factors contribute to enduring bestseller success.
Areas of focus include:
- Data preparation and cleaning for accuracy and consistency.
- Exploratory analysis to identify standout books, author performance, and overall market patterns.
- Time-series and trend analysis across years and genres.
- Correlation and value assessment of price, popularity, and ratings.
- Advanced SQL methodologies such as window functions, ranking, and Common Table Expressions (CTEs).
- Clear presentation of actionable results for authors, publishers, and retailers.
By integrating advanced SQL techniques with business-focused reporting, this project demonstrates how data analysts can support strategic, decision-making in the publishing industry.
1. Dataset Preparation and Cleaning¶
Accurate and reliable analysis starts with thorough data preparation. This section details the key cleaning steps applied to the bestselling books dataset to ensure consistency and reliability of insights.
1. 1 Remove Exact Duplicates¶
Duplicate records can distort summary statistics and rankings. Using SQL window functions, exact duplicates-rows that have identical values across all columns were identified and removed, keeping only one representative record for each duplicate set.
SQL Example:
# Connect to the database
conn = sqlite3.connect(db_path)
# Example: Query data into a pandas DataFrame
query = """
WITH Deduped AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY Name, Author, [User Rating], Reviews, Price, Year, Genre
ORDER BY Name
) AS rn
FROM Book_Sales bs
)
SELECT *
FROM Deduped
WHERE rn = 1;
"""
df = pd.read_sql_query(query, conn)
# Show the data
df
Name | Author | User Rating | Reviews | Price | Year | Genre | rn | |
---|---|---|---|---|---|---|---|---|
0 | 10-Day Green Smoothie Cleanse | JJ Smith | 4.7 | 17350 | 8 | 2016 | Non Fiction | 1 |
1 | 11/22/63: A Novel | Stephen King | 4.6 | 2052 | 22 | 2011 | Fiction | 1 |
2 | 12 Rules for Life: An Antidote to Chaos | Jordan B. Peterson | 4.7 | 18979 | 15 | 2018 | Non Fiction | 1 |
3 | 1984 (Signet Classics) | George Orwell | 4.7 | 21424 | 6 | 2017 | Fiction | 1 |
4 | 5,000 Awesome Facts (About Everything!) (Natio... | National Geographic Kids | 4.8 | 7665 | 12 | 2019 | Non Fiction | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
545 | Wrecking Ball (Diary of a Wimpy Kid Book 14) | Jeff Kinney | 4.9 | 9413 | 8 | 2019 | Fiction | 1 |
546 | You Are a Badass: How to Stop Doubting Your Gr... | Jen Sincero | 4.7 | 14331 | 8 | 2016 | Non Fiction | 1 |
547 | You Are a Badass: How to Stop Doubting Your Gr... | Jen Sincero | 4.7 | 14331 | 8 | 2017 | Non Fiction | 1 |
548 | You Are a Badass: How to Stop Doubting Your Gr... | Jen Sincero | 4.7 | 14331 | 8 | 2018 | Non Fiction | 1 |
549 | You Are a Badass: How to Stop Doubting Your Gr... | Jen Sincero | 4.7 | 14331 | 8 | 2019 | Non Fiction | 1 |
550 rows × 8 columns
1.2 Identify Multi-Year Bestsellers¶
Some books appear on the bestseller list in multiple years. Recognising these allows tracking how attributes like price and rating change over time.
SQL Example:
# Connect to the database
conn = sqlite3.connect(db_path)
query = """
SELECT Name, Author, Year, Price, [User Rating],
LAG(Price) OVER (PARTITION BY Name, Author ORDER BY Year) AS Prev_Year_Price,
LAG([User Rating]) OVER (PARTITION BY Name, Author ORDER BY Year) AS Prev_Year_Rating
FROM book_sales bs
WHERE (Name, Author) IN (
SELECT Name, Author
FROM book_sales bs2
GROUP BY Name, Author
HAVING COUNT(DISTINCT Year) > 1
)
ORDER BY Name, Author, Year;
"""
df = pd.read_sql_query(query, conn)
# Show the data
df
Name | Author | Year | Price | User Rating | Prev_Year_Price | Prev_Year_Rating | |
---|---|---|---|---|---|---|---|
0 | A Man Called Ove: A Novel | Fredrik Backman | 2016 | 8 | 4.6 | NaN | NaN |
1 | A Man Called Ove: A Novel | Fredrik Backman | 2017 | 8 | 4.6 | 8.0 | 4.6 |
2 | All the Light We Cannot See | Anthony Doerr | 2014 | 14 | 4.6 | NaN | NaN |
3 | All the Light We Cannot See | Anthony Doerr | 2015 | 14 | 4.6 | 14.0 | 4.6 |
4 | Becoming | Michelle Obama | 2018 | 11 | 4.8 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... |
290 | Wonder | R. J. Palacio | 2017 | 9 | 4.8 | 9.0 | 4.8 |
291 | You Are a Badass: How to Stop Doubting Your Gr... | Jen Sincero | 2016 | 8 | 4.7 | NaN | NaN |
292 | You Are a Badass: How to Stop Doubting Your Gr... | Jen Sincero | 2017 | 8 | 4.7 | 8.0 | 4.7 |
293 | You Are a Badass: How to Stop Doubting Your Gr... | Jen Sincero | 2018 | 8 | 4.7 | 8.0 | 4.7 |
294 | You Are a Badass: How to Stop Doubting Your Gr... | Jen Sincero | 2019 | 8 | 4.7 | 8.0 | 4.7 |
295 rows × 7 columns
1.3 Flag Box Sets Seperately¶
Box sets can skew single-title analysis due to higher prices or aggregated reviews. These were flagged by detecting certain keywords in the book title.
SQL Example:
# Connect to the database
conn = sqlite3.connect(db_path)
query = """
SELECT *
FROM book_sales bs
WHERE NOT (
LOWER(Name) LIKE '%box%'
OR LOWER(Name) LIKE '%set%'
OR LOWER(Name) LIKE '%collection%'
OR LOWER(Name) LIKE '%trilogy%'
OR LOWER(Name) LIKE '%series%'
OR LOWER(Name) LIKE '%boxed%'
)
"""
df = pd.read_sql_query(query, conn)
# Show the data
df
Name | Author | User Rating | Reviews | Price | Year | Genre | |
---|---|---|---|---|---|---|---|
0 | 10-Day Green Smoothie Cleanse | JJ Smith | 4.7 | 17350 | 8 | 2016 | Non Fiction |
1 | 11/22/63: A Novel | Stephen King | 4.6 | 2052 | 22 | 2011 | Fiction |
2 | 12 Rules for Life: An Antidote to Chaos | Jordan B. Peterson | 4.7 | 18979 | 15 | 2018 | Non Fiction |
3 | 1984 (Signet Classics) | George Orwell | 4.7 | 21424 | 6 | 2017 | Fiction |
4 | 5,000 Awesome Facts (About Everything!) (Natio... | National Geographic Kids | 4.8 | 7665 | 12 | 2019 | Non Fiction |
... | ... | ... | ... | ... | ... | ... | ... |
511 | Wrecking Ball (Diary of a Wimpy Kid Book 14) | Jeff Kinney | 4.9 | 9413 | 8 | 2019 | Fiction |
512 | You Are a Badass: How to Stop Doubting Your Gr... | Jen Sincero | 4.7 | 14331 | 8 | 2016 | Non Fiction |
513 | You Are a Badass: How to Stop Doubting Your Gr... | Jen Sincero | 4.7 | 14331 | 8 | 2017 | Non Fiction |
514 | You Are a Badass: How to Stop Doubting Your Gr... | Jen Sincero | 4.7 | 14331 | 8 | 2018 | Non Fiction |
515 | You Are a Badass: How to Stop Doubting Your Gr... | Jen Sincero | 4.7 | 14331 | 8 | 2019 | Non Fiction |
516 rows × 7 columns
1.4 Handle Null Values Thoughtfully¶
Missing or 'NULL' values can compromise data integrity and analysis quality, particularly in critical fields such as 'Name' or 'Author'.
SQL Check for Nulls:
# Connect to the database
conn = sqlite3.connect(db_path)
query = """
SELECT
SUM(CASE WHEN Name IS NULL THEN 1 ELSE 0 END) AS null_name_count,
SUM(CASE WHEN Author IS NULL THEN 1 ELSE 0 END) AS null_author_count
FROM book_sales bs ;
"""
df = pd.read_sql_query(query, conn)
# Show the data
df
null_name_count | null_author_count | |
---|---|---|
0 | 0 | 0 |
Result: The query returned 0 for both counts, including there are no missing values in these critical fields.
Since no null values were found, no deletion or imputation was necessary, and the dataset retained all records for analysis.
Summary: For data quality assurance, null-check queries were performed, confirming there are no missing 'Name' or 'Author entries. This validation step ensures that subsequent analyses using these fields will be reliable.
2. Exploratory Data Analysis¶
This stage examines core attributes of the dataset to identify leading titles, high-performing authors, price distributions, and historical patterns. Queries focus on summarising the dataset in ways that reveal initial trends for deeper analysis.
2.1 To 10 Highest-Rated Books Overall¶
Determines the highest-rated titles by user rating, with review counts used to break ties.
SQL Query:
# Connect to the database
conn = sqlite3.connect(db_path)
query = """
SELECT Name, Author, MAX([User Rating]) AS User_Rating, MAX(Reviews) AS Reviews, MAX(Price) AS Price
FROM book_sales
GROUP BY Name, Author
ORDER BY User_Rating DESC, Reviews DESC
LIMIT 10;
"""
df = pd.read_sql_query(query, conn)
# Show the data
df
Name | Author | User_Rating | Reviews | Price | |
---|---|---|---|---|---|
0 | Oh, the Places You'll Go! | Dr. Seuss | 4.9 | 21834 | 8 |
1 | Harry Potter and the Chamber of Secrets: The I... | J.K. Rowling | 4.9 | 19622 | 30 |
2 | Jesus Calling: Enjoying Peace in His Presence ... | Sarah Young | 4.9 | 19576 | 8 |
3 | The Very Hungry Caterpillar | Eric Carle | 4.9 | 19546 | 5 |
4 | Brown Bear, Brown Bear, What Do You See? | Bill Martin Jr. | 4.9 | 14344 | 5 |
5 | Dog Man: Fetch-22: From the Creator of Captain... | Dav Pilkey | 4.9 | 12619 | 8 |
6 | Last Week Tonight with John Oliver Presents A ... | Jill Twiss | 4.9 | 11881 | 13 |
7 | Harry Potter and the Sorcerer's Stone: The Ill... | J.K. Rowling | 4.9 | 10052 | 22 |
8 | Wrecking Ball (Diary of a Wimpy Kid Book 14) | Jeff Kinney | 4.9 | 9413 | 8 |
9 | Strange Planet (Strange Planet Series) | Nathan W. Pyle | 4.9 | 9382 | 6 |
%%html
<iframe src="https://public.tableau.com/views/booksalesvisualisations/Top10Highest-RatedBooks?:language=en-GB&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link:embed=y&:showVizHome=no" width="1000" height="400"></iframe>
2.2 Top 10 Books by Genre¶
Ranks the best-rated Fiction and Non-Fiction titles separately to account for differences in audience and market behaviour.
SQL - Fiction:
# Connect to the database
conn = sqlite3.connect(db_path)
query = """
SELECT Name, Author, MAX([User Rating]) AS User_Rating, MAX(Reviews) AS Reviews, MAX(Price) AS Price
FROM book_sales
WHERE Genre = 'Fiction'
GROUP BY Name, Author
ORDER BY User_Rating DESC, Reviews DESC
LIMIT 10;
"""
df = pd.read_sql_query(query, conn)
# Show the data
df
Name | Author | User_Rating | Reviews | Price | |
---|---|---|---|---|---|
0 | Oh, the Places You'll Go! | Dr. Seuss | 4.9 | 21834 | 8 |
1 | Harry Potter and the Chamber of Secrets: The I... | J.K. Rowling | 4.9 | 19622 | 30 |
2 | The Very Hungry Caterpillar | Eric Carle | 4.9 | 19546 | 5 |
3 | Brown Bear, Brown Bear, What Do You See? | Bill Martin Jr. | 4.9 | 14344 | 5 |
4 | Dog Man: Fetch-22: From the Creator of Captain... | Dav Pilkey | 4.9 | 12619 | 8 |
5 | Last Week Tonight with John Oliver Presents A ... | Jill Twiss | 4.9 | 11881 | 13 |
6 | Harry Potter and the Sorcerer's Stone: The Ill... | J.K. Rowling | 4.9 | 10052 | 22 |
7 | Wrecking Ball (Diary of a Wimpy Kid Book 14) | Jeff Kinney | 4.9 | 9413 | 8 |
8 | Strange Planet (Strange Planet Series) | Nathan W. Pyle | 4.9 | 9382 | 6 |
9 | Dog Man: For Whom the Ball Rolls: From the Cre... | Dav Pilkey | 4.9 | 9089 | 8 |
SQL - Non-Fiction:
# Connect to the database
conn = sqlite3.connect(db_path)
query = """
SELECT Name, Author, MAX([User Rating]) AS User_Rating, MAX(Reviews) AS Reviews, MAX(Price) AS Price
FROM book_sales
WHERE Genre = 'Non Fiction'
GROUP BY Name, Author
ORDER BY User_Rating DESC, Reviews DESC
LIMIT 10;
"""
df = pd.read_sql_query(query, conn)
# Show the data
df
Name | Author | User_Rating | Reviews | Price | |
---|---|---|---|---|---|
0 | Jesus Calling: Enjoying Peace in His Presence ... | Sarah Young | 4.9 | 19576 | 8 |
1 | The Magnolia Story | Chip Gaines | 4.9 | 7861 | 5 |
2 | Unfreedom of the Press | Mark R. Levin | 4.9 | 5956 | 11 |
3 | Hamilton: The Revolution | Lin-Manuel Miranda | 4.9 | 5867 | 54 |
4 | Obama: An Intimate Portrait | Pete Souza | 4.9 | 3192 | 22 |
5 | Humans of New York : Stories | Brandon Stanton | 4.9 | 2812 | 17 |
6 | Becoming | Michelle Obama | 4.8 | 61133 | 11 |
7 | Unbroken: A World War II Story of Survival, Re... | Laura Hillenbrand | 4.8 | 29673 | 16 |
8 | The 5 Love Languages: The Secret to Love that ... | Gary Chapman | 4.8 | 25554 | 8 |
9 | The Boys in the Boat: Nine Americans and Their... | Daniel James Brown | 4.8 | 23358 | 12 |
%%html
2.3 Most Frequent Authors¶
Counts the number of appearances each author has in the dataset to highlight the most consistently successful writers.
SQL Query:
# Connect to the database
conn = sqlite3.connect(db_path)
query = """
SELECT Author, COUNT(*) AS appearances
FROM book_sales
GROUP BY Author
ORDER BY appearances DESC
LIMIT 10;
"""
df = pd.read_sql_query(query, conn)
# Show the data
df
Author | appearances | |
---|---|---|
0 | Jeff Kinney | 12 |
1 | Suzanne Collins | 11 |
2 | Rick Riordan | 11 |
3 | Gary Chapman | 11 |
4 | American Psychological Association | 10 |
5 | Gallup | 9 |
6 | Dr. Seuss | 9 |
7 | Rob Elliott | 8 |
8 | Stephenie Meyer | 7 |
9 | Stephen R. Covey | 7 |
%%html
<iframe src="https://public.tableau.com/views/MostFrequentAuthors/MostFrequentAuthors?:language=en-GB&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_linkembed=y&:showVizHome=no" width="1000" height="350"></iframe>
2.4 Price Distribution¶
Groups titles into price bands to assess the market split between low-cost, mid-range, and premium books.
SQL Query:
# Connect to the database
conn = sqlite3.connect(db_path)
query = """
SELECT
CASE
WHEN Price < 5 THEN 'Under $5'
WHEN Price < 10 THEN '$5–9.99'
WHEN Price < 15 THEN '$10–14.99'
ELSE '$15+'
END AS price_range,
COUNT(*) AS num_books
FROM book_sales
GROUP BY price_range
ORDER BY price_range;
"""
df = pd.read_sql_query(query, conn)
# Show the data
df
price_range | num_books | |
---|---|---|
0 | $10–14.99 | 149 |
1 | $15+ | 163 |
2 | $5–9.99 | 187 |
3 | Under $5 | 51 |
%%html
<iframe src="https://public.tableau.com/views/PriceDistribution_17552559415380/PriceDistribution?:language=en-GB&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_linkembed=y&:showVizHome=no" width="1000" height="600"></iframe>
2.5 Year with the Most Bestsellers¶
Identifies the years with the highest volume of bestselling entries.
SQL Query:
# Connect to the database
conn = sqlite3.connect(db_path)
query = """
SELECT Year, COUNT(*) AS num_books
FROM book_sales
GROUP BY Year
ORDER BY num_books DESC
LIMIT 1;
"""
df = pd.read_sql_query(query, conn)
# Show the data
df
Year | num_books | |
---|---|---|
0 | 2019 | 50 |
2.6 Average User Rating by Year¶
Calculates yearly average ratings to observe any changes in reader satisfaction trends.
SQL Query:
# Connect to the database
conn = sqlite3.connect(db_path)
query = """
SELECT Year, AVG([User Rating]) AS avg_rating
FROM book_sales
GROUP BY Year
ORDER BY Year;
"""
df = pd.read_sql_query(query, conn)
# Show the data
df
Year | avg_rating | |
---|---|---|
0 | 2009 | 4.584 |
1 | 2010 | 4.558 |
2 | 2011 | 4.558 |
3 | 2012 | 4.532 |
4 | 2013 | 4.554 |
5 | 2014 | 4.622 |
6 | 2015 | 4.648 |
7 | 2016 | 4.678 |
8 | 2017 | 4.660 |
9 | 2018 | 4.668 |
10 | 2019 | 4.740 |
%%html
<iframe src="https://public.tableau.com/views/booksalesvisualisations/AverageUserRatingPerYear?:language=en-GB&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_linkembed=y&:showVizHome=no" width="1000" height="600"></iframe>
Summary: These outputs establish a data-driven overview of leading titles, high-impact authors, pricing patterns, and key years. The results form the basis for subsequent trend, correlation, and performance analyses.
3. Trends Over Time¶
This section explores key book sales trends using SQL queries for data extraction, Python 9pandas0 for data analysis, and Tableau for insightful visualisations.
3.1 Authors With Longest Consecutive Appearances¶
Identify which authors maintained bestseller status for the most consecutive years.
Step 1: SQL Query¶
# Connect to the database
conn = sqlite3.connect(db_path)
query = """
SELECT Author, COUNT (*) AS appearances
FROM book_sales bs
GROUP BY Author
ORDER BY appearances DESC;
"""
df = pd.read_sql_query(query, conn)
# Show the data
df
Author | appearances | |
---|---|---|
0 | Jeff Kinney | 12 |
1 | Suzanne Collins | 11 |
2 | Rick Riordan | 11 |
3 | Gary Chapman | 11 |
4 | American Psychological Association | 10 |
... | ... | ... |
243 | Adult Coloring Book Designs | 1 |
244 | Admiral William H. McRaven | 1 |
245 | Adir Levy | 1 |
246 | Adam Mansbach | 1 |
247 | Adam Gasiewski | 1 |
248 rows × 2 columns
%%html
<iframe src="https://public.tableau.com/views/AuthorswithLongestConsecutiveappearances/AuthorswithLongestConsecutiveAppearances?:language=en-GB&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_linkembed=y&:showVizHome=no" width="1000" height="400"></iframe>
Result Summary¶
The analysis reveals a handful of authors with remarkable staying power, consistently producing bestsellers year after year. These long streaks may reflect both popularity and prolific publishing. In Tableau, line charts showed these trajectories, with top authors like Jeff Kinney maintaining consecutive annual hits, setting them apart from others.
3.2 How Average Reviews and Ratings Evolved¶
We tracked how reader engagement and satisfaction changed over time.
SQL Query¶
# Connect to the database
conn = sqlite3.connect(db_path)
query = """
SELECT
Year,
AVG(Reviews) AS avg_reviews,
AVG([User Rating]) AS avg_user_rating
FROM
book_sales
GROUP BY
Year
ORDER BY
Year;
"""
df = pd.read_sql_query(query, conn)
# Show the data
df
Year | avg_reviews | avg_user_rating | |
---|---|---|---|
0 | 2009 | 4710.12 | 4.584 |
1 | 2010 | 5479.62 | 4.558 |
2 | 2011 | 8100.82 | 4.558 |
3 | 2012 | 13090.92 | 4.532 |
4 | 2013 | 13098.14 | 4.554 |
5 | 2014 | 15859.94 | 4.622 |
6 | 2015 | 14233.38 | 4.648 |
7 | 2016 | 14196.00 | 4.678 |
8 | 2017 | 12888.40 | 4.660 |
9 | 2018 | 13930.42 | 4.668 |
10 | 2019 | 15898.34 | 4.740 |
Tableau Visualisations¶
%%html
<iframe src="https://public.tableau.com/views/booksalesvisualisations/AverageUserRatingPerYear?:language=en-GB&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_linkembed=y&:showVizHome=no" width="1000" height="600"></iframe>
%%html
<iframe src="https://public.tableau.com/views/AverageReviewsPerYear/AverageReviewsPerYear?:language=en-GB&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_linkembed=y&:showVizHome=no" width="1000" height="600"></iframe>
Python Analysis¶
# Connect to the database
conn = sqlite3.connect(db_path)
query = """
SELECT
Year,
AVG(Reviews) AS avg_reviews,
AVG([User Rating]) AS avg_user_rating
FROM
book_sales
GROUP BY
Year
ORDER BY
Year;
"""
df = pd.read_sql_query(query, conn)
yearly_stats = df.groupby('Year').agg({
'avg_reviews': 'mean',
'avg_user_rating': 'mean'
}).reset_index()
yearly_stats
Year | avg_reviews | avg_user_rating | |
---|---|---|---|
0 | 2009 | 4710.12 | 4.584 |
1 | 2010 | 5479.62 | 4.558 |
2 | 2011 | 8100.82 | 4.558 |
3 | 2012 | 13090.92 | 4.532 |
4 | 2013 | 13098.14 | 4.554 |
5 | 2014 | 15859.94 | 4.622 |
6 | 2015 | 14233.38 | 4.648 |
7 | 2016 | 14196.00 | 4.678 |
8 | 2017 | 12888.40 | 4.660 |
9 | 2018 | 13930.42 | 4.668 |
10 | 2019 | 15898.34 | 4.740 |
Result Summary¶
The results show average reviews increased steadily from 2010 to 2019, indicating more customers sharing their opinions each year. Average ratings remained high and relatively stable, with slight upward movement suggesting consistent satisfaction with top books. Tableau line charts highlighted years with noticeable jumps in reviews, possible effects of growing online retail and review culture.
3.3 Price Trends for Multi-Year Bestsellers¶
I investigated whether prices rise, fall, or remain steady for books that were bestsellers across multiple years.
Step 1: SQL Query¶
# Connect to the database
conn = sqlite3.connect(db_path)
query = """
SELECT Name, Year, Price
FROM book_sales
WHERE Name IN (
SELECT Name FROM book_sales GROUP BY Name HAVING COUNT(DISTINCT Year) > 1
)
ORDER BY Name, Year;
"""
df = pd.read_sql_query(query, conn)
# Show the data
df
Name | Year | Price | |
---|---|---|---|
0 | A Man Called Ove: A Novel | 2016 | 8 |
1 | A Man Called Ove: A Novel | 2017 | 8 |
2 | All the Light We Cannot See | 2014 | 14 |
3 | All the Light We Cannot See | 2015 | 14 |
4 | Becoming | 2018 | 11 |
... | ... | ... | ... |
290 | Wonder | 2017 | 9 |
291 | You Are a Badass: How to Stop Doubting Your Gr... | 2016 | 8 |
292 | You Are a Badass: How to Stop Doubting Your Gr... | 2017 | 8 |
293 | You Are a Badass: How to Stop Doubting Your Gr... | 2018 | 8 |
294 | You Are a Badass: How to Stop Doubting Your Gr... | 2019 | 8 |
295 rows × 3 columns
Step 2: Python Processing¶
multi_year = df.groupby('Name').filter(lambda x: x['Year'].nunique() > 1)
multi_year[['Name', 'Year', 'Price']].sort_values(['Name', 'Year']).to_csv('multi_year_books.csv', index=False)
multi_year[['Name', 'Year', 'Price']].head()
Name | Year | Price | |
---|---|---|---|
0 | A Man Called Ove: A Novel | 2016 | 8 |
1 | A Man Called Ove: A Novel | 2017 | 8 |
2 | All the Light We Cannot See | 2014 | 14 |
3 | All the Light We Cannot See | 2015 | 14 |
4 | Becoming | 2018 | 11 |
Step 3: Tableau Visualisation¶
%%html
<iframe src="https://public.tableau.com/views/Top10BestsellingBookswithMultipleYears/Top10BestsellingBookswithMultipleYears?:language=en-GB&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_linkembed=y&:showVizHome=no" width="1000" height="600"></iframe>
Result Summary¶
Visualising these books in Tableau, most multi-year bestsellers showed stable pricing, but there were notable cases of price drops in later years, perhaps as part of promotions or to boost continued sales. A few titles even saw price hikes, possibly reflecting increased demand or new editions. This analysis reveals how publishers adjust strategies to maximise longevity and revenue.
# Connect to the database
conn = sqlite3.connect(db_path)
query = """
SELECT Year, Genre, AVG([User Rating]) AS avg_rating, AVG(Price) AS avg_price, AVG(Reviews) AS avg_reviews
FROM book_sales
GROUP BY Year, Genre
ORDER BY Year, Genre;
"""
df = pd.read_sql_query(query, conn)
# Show the data
df
Year | Genre | avg_rating | avg_price | avg_reviews | |
---|---|---|---|---|---|
0 | 2009 | Fiction | 4.591667 | 15.583333 | 6534.333333 |
1 | 2009 | Non Fiction | 4.576923 | 15.230769 | 3026.230769 |
2 | 2010 | Fiction | 4.615000 | 9.700000 | 8409.250000 |
3 | 2010 | Non Fiction | 4.520000 | 16.000000 | 3526.533333 |
4 | 2011 | Fiction | 4.619048 | 11.619048 | 10335.285714 |
5 | 2011 | Non Fiction | 4.513793 | 17.620690 | 6482.758621 |
6 | 2012 | Fiction | 4.495238 | 12.285714 | 19896.238095 |
7 | 2012 | Non Fiction | 4.558621 | 17.482759 | 8162.931034 |
8 | 2013 | Fiction | 4.545833 | 10.708333 | 19986.833333 |
9 | 2013 | Non Fiction | 4.561538 | 18.192308 | 6739.346154 |
10 | 2014 | Fiction | 4.631034 | 10.172414 | 19382.862069 |
11 | 2014 | Non Fiction | 4.609524 | 20.809524 | 10994.952381 |
12 | 2015 | Fiction | 4.652941 | 9.352941 | 23706.117647 |
13 | 2015 | Non Fiction | 4.645455 | 10.969697 | 9353.484848 |
14 | 2016 | Fiction | 4.715789 | 12.631579 | 19563.263158 |
15 | 2016 | Non Fiction | 4.654839 | 13.516129 | 10906.387097 |
16 | 2017 | Fiction | 4.737500 | 8.833333 | 14611.833333 |
17 | 2017 | Non Fiction | 4.588462 | 13.730769 | 11297.538462 |
18 | 2018 | Fiction | 4.738095 | 8.761905 | 12710.428571 |
19 | 2018 | Non Fiction | 4.617241 | 11.793103 | 14813.862069 |
20 | 2019 | Fiction | 4.820000 | 9.350000 | 18507.150000 |
21 | 2019 | Non Fiction | 4.686667 | 10.566667 | 14159.133333 |
Step 2: Tableau Visualisation¶
%%html
<iframe src="https://public.tableau.com/views/Fictionvs_Non-FictionRatingsPriceandReviewsOverTime/Fictionvs_Non-FictionRatingsPriceandReviewsOverTime?:language=en-GB&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_linkembed=y&:showVizHome=no" width="1000" height="600"></iframe>
Result Summary¶
This visualisation tracks the comparitive trends of User Ratings, Average Prices, and Average Reviews for Fiction and non-Fiction bestsellers from 2008 to 2020.
- User Ratings: Both Fiction and Non-Fiction consistently receive high average ratings. Occasional fluctuations highlight periods with standout books or shifting reader preferences.
- Average Price: Non-Fiction titles are typically priced higher than Fiction, though prices for both genres show a general downward trend in recent years, possible reflecting increased competition or changes in publishing models.
- Average Reviews: Reader engagement has grown over time. Fiction books tend to attract more reviews overall, but Non-Fiction titles are catching up, indicating rising popularity and interest.
Dashed lines on each panel visualise the direction and magnitude of these changes, making it easy to compare genre performance and observe market shifts across the year.
Patterns and Anomalies: Key Takeaways¶
- Certain authors consistently dominate year after year; their sustained streaks reflect both popularity and loyalty.
- Average reviews have increased annually which is evidence of growing community feedback and digital engagement.
- Multi-year bestsellers typically hold their price, but special offers or new editions can drive changes, revealing publisher tactics.
- Fiction and Non-Fiction behave differently: Fiction attracts more reviews; Non-Fiction sustains higher prices. Both maintain strong ratings, but market shifts are revealed in Tableau charts.
Conclusion¶
Through integrated SQL/Python analysis and Tableau visualisation, the data reveals how bestselling book trends shift, who stands out, and how consumer behaviour evolves. These insights inform future marketing, pricing and publishing strategies.
4. Popularity vs. Price Analysis¶
Section 5 investigates the relationships between bestseller book price, reader popularity (using reviews as a proxy), and user ratings. This section showcases correlation analysis and comparitive SQL skills.
4.1 Is There a Correlation Between Book Price and Number of Reviews?¶
Goal: Determine if higher-priced books receive more or fewer reviews, using review count as a measure of popularity.
SQL Query:
# Connect to the database
conn = sqlite3.connect(db_path)
query = """
SELECT Price, AVG(Reviews) AS avg_reviews, COUNT(*) AS num_books
FROM book_sales
GROUP BY Price
ORDER BY Price
Limit 10;
"""
df = pd.read_sql_query(query, conn)
# Show the data
df
Price | avg_reviews | num_books | |
---|---|---|---|
0 | 0 | 11124.083333 | 12 |
1 | 1 | 720.000000 | 1 |
2 | 2 | 5633.600000 | 5 |
3 | 3 | 5983.000000 | 1 |
4 | 4 | 10750.937500 | 32 |
5 | 5 | 10621.555556 | 36 |
6 | 6 | 11685.710526 | 38 |
7 | 7 | 15178.521739 | 23 |
8 | 8 | 17235.461538 | 52 |
9 | 9 | 10307.578947 | 38 |
Tableau Visualisation¶
%%html
<iframe src="https://public.tableau.com/views/AverageReviewsbyBookPrice/AverageReviewsbyBookPrice?:language=en-GB&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_linkembed=y&:showVizHome=no" width="1000" height="600"></iframe>
This scatter plot illustrates the relationship between book price and average number of reviews. The chart reveals a clear negative correlation: as book prices increase, the average number of reviews tends to decrease. Most books clustered in the lower price range ($0–$100) receive significantly higher review counts, while higher-priced books ($200 and above) generally attract fewer reviews. This pattern suggests that more affordable books are more popular among readers, whereas expensive titles are less likely to generate widespread engagement.
4.2 Do Higher-Rated Books Tend to Cost More or Less?¶
Goal: Analyse whether higher user ratings are associated with higher or lower book prices.
SQL Query:
# Connect to the database
conn = sqlite3.connect(db_path)
query = """
SELECT Price, AVG([User Rating]) AS avg_rating, COUNT(*) AS num_books
FROM book_sales
GROUP BY Price
ORDER BY Price
Limit 10;
"""
df = pd.read_sql_query(query, conn)
# Show the data
df
Price | avg_rating | num_books | |
---|---|---|---|
0 | 0 | 4.766667 | 12 |
1 | 1 | 4.500000 | 1 |
2 | 2 | 4.520000 | 5 |
3 | 3 | 4.700000 | 1 |
4 | 4 | 4.690625 | 32 |
5 | 5 | 4.761111 | 36 |
6 | 6 | 4.594737 | 38 |
7 | 7 | 4.600000 | 23 |
8 | 8 | 4.738462 | 52 |
9 | 9 | 4.634211 | 38 |
Tableau Visualisation¶
%%html
<iframe src="https://public.tableau.com/views/AverageUserRatingAcrossPriceRanges/AverageUserRatingAcrossPriceRanges?:language=en-GB&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_linkembed=y&:showVizHome=no" width="1000" height="600"></iframe>
Summary¶
This chart displays the average user rating for bestseller books grouped by price range. The visualisation reveals that user ratings remain consistently high across all price bins, with little variation between cheaper and more expensive books. This suggests that book price does not significantly influence user satisfaction; readers rate books positively regardless of how much they cost.
4.3 Which Genres Give More "Value" for the Price?¶
Goal: Find which genre offers the most average rating per pound (rating_per_pound
), highlighting the best "value" for readers.
SQL Query:
# Connect to the database
conn = sqlite3.connect(db_path)
query = """
SELECT Genre,
AVG([User Rating]) AS avg_rating,
AVG(Price) AS avg_price,
AVG([User Rating]) / NULLIF(AVG(Price), 0) AS rating_per_pound
FROM book_sales
GROUP BY Genre
ORDER BY rating_per_pound DESC;
"""
df = pd.read_sql_query(query, conn)
# Show the data
df
Genre | avg_rating | avg_price | rating_per_pound | |
---|---|---|---|---|
0 | Fiction | 4.648333 | 10.850000 | 0.428418 |
1 | Non Fiction | 4.595161 | 14.841935 | 0.309607 |
This compares book genres based on their average reader rating per pound, revealing which categories deliver the greatest value to readers. By dividing the average rating by average price for each genre, the visualisation highlights those genres that offer the highest satisfaction for each pound spent. The results make it clear that some genres consistently provide a better reading experience at lower cost, helping readers identify where they can get the most enjoyment for their money and guiding publishers or authors interested in market positioning.
# Connect to the database
conn = sqlite3.connect(db_path)
query = """
SELECT Name, Author, COUNT(DISTINCT Year) AS years_on_list
FROM book_sales
GROUP BY Name, Author
ORDER BY years_on_list DESC
LIMIT 10;
"""
df = pd.read_sql_query(query, conn)
# Show the data
df
Name | Author | years_on_list | |
---|---|---|---|
0 | Publication Manual of the American Psychologic... | American Psychological Association | 10 |
1 | StrengthsFinder 2.0 | Gallup | 9 |
2 | Oh, the Places You'll Go! | Dr. Seuss | 8 |
3 | The 7 Habits of Highly Effective People: Power... | Stephen R. Covey | 7 |
4 | The Very Hungry Caterpillar | Eric Carle | 7 |
5 | Jesus Calling: Enjoying Peace in His Presence ... | Sarah Young | 6 |
6 | The Four Agreements: A Practical Guide to Pers... | Don Miguel Ruiz | 6 |
7 | First 100 Words | Roger Priddy | 5 |
8 | Giraffes Can't Dance | Giles Andreae | 5 |
9 | How to Win Friends & Influence People | Dale Carnegie | 5 |
Summary:
This query highlights the titles that have achieved enduring success by appearing as bestsellers in the greatest number of different years. Identifying these books showcases which stories or authors maintain strong appeal over time, reflecting sustained reader interest and cultural impact.
5.2 On Average, How Many Years Does a Bestseller Remain in the Dataset?¶
SQL Query:
# Connect to the database
conn = sqlite3.connect(db_path)
query = """
SELECT AVG(years_on_list) AS avg_years_on_list
FROM (
SELECT Name, Author, COUNT(DISTINCT Year) AS years_on_list
FROM book_sales
GROUP BY Name, Author
) AS book_years;
"""
df = pd.read_sql_query(query, conn)
# Show the data
df
avg_years_on_list | |
---|---|
0 | 1.558405 |
Summary:
This query calculates the typical lifespan of a bestseller, offering insight into how quickly books rise and fall in popularity. The average number of years spent on the list highlights broader patterns in reading trends and the overall volatility or stability of the bestseller market.
5.3. Which Authors Have Books That Reappear the Most Over Time?¶
SQL Query:
# Connect to the database
conn = sqlite3.connect(db_path)
query = """
SELECT Author, SUM(years_on_list) AS total_book_years_on_list
FROM (
SELECT Author, Name, COUNT(DISTINCT Year) AS years_on_list
FROM book_sales
GROUP BY Author, Name
) AS book_years
GROUP BY Author
ORDER BY total_book_years_on_list DESC
LIMIT 10;
"""
df = pd.read_sql_query(query, conn)
# Show the data
df
Author | total_book_years_on_list | |
---|---|---|
0 | Jeff Kinney | 12 |
1 | Suzanne Collins | 11 |
2 | Rick Riordan | 11 |
3 | Gary Chapman | 11 |
4 | American Psychological Association | 10 |
5 | Gallup | 9 |
6 | Dr. Seuss | 9 |
7 | Rob Elliott | 8 |
8 | Stephenie Meyer | 7 |
9 | Stephen R. Covey | 7 |
Summary:
This aggregation surfaces authors whose works collectively dominate the bestseller lists over time. It reflects both the productivity and enduring popularity of top writers, spotlighting those who consistently capture reader attention year after year.
Conclusion¶
This project provides a comprehensive analysis of bestselling books, exploring key factors that drive popularity, value, and longevity in the market. By examining relationships between price, ratings, and review counts, comparing Fiction and Non-Fiction trnds, and investigating which books and authors maintain bestseller status over multiple yars, the analysis offers valuable insights into reader preferences and market dynamics. The results highlight consistent patterns such as the tendency for affordable books to attract more reviews, the high user satisfaction across genres and price ranges, and the enduring appeal of certain authors and titles. These findings not only deepen our understanding of bestseler performance but also offer practical guidance for publishers, authors, and readers seeking to navigate and succeed in the competitive landscape of the book market.
# Close the connection when done
conn.close()