Comprehensive analysis of bestselling books using SQL and Python to uncover market trends and reader preferences.

book_sales

In [ ]:

%pip install pandas
%pip install nbconvert

Import dependencies and connect to db¶

In [118]:

import sqlite3
import pandas as pd

# Path to your SQLite database file
db_path = '/Users/sararodrigues/Documents/Data Analyst/book_sales'  # or .db for example

# Connect to the database
conn = sqlite3.connect(db_path)

# Example: Query data into a pandas DataFrame
query = "SELECT * FROM book_sales LIMIT 10;"
df = pd.read_sql_query(query, conn)

# Show the data
print(df.head())

# Close the connection when done
conn.close()

                                                Name  \
0                      10-Day Green Smoothie Cleanse   
1                                  11/22/63: A Novel   
2            12 Rules for Life: An Antidote to Chaos   
3                             1984 (Signet Classics)   
4  5,000 Awesome Facts (About Everything!) (Natio...   

                     Author  User Rating  Reviews  Price  Year        Genre  
0                  JJ Smith          4.7    17350      8  2016  Non Fiction  
1              Stephen King          4.6     2052     22  2011      Fiction  
2        Jordan B. Peterson          4.7    18979     15  2018  Non Fiction  
3             George Orwell          4.7    21424      6  2017      Fiction  
4  National Geographic Kids          4.8     7665     12  2019  Non Fiction

Analysing Trends and Insights in Bestselling Books Using SQL¶

Project Overview¶

This project presents an in-depth analysis of a bestselling books dataset, which includes attributes such as book title, author, user rating, number of reviews, price, year of bestseller status, and genre. The dataset spans multiple years and features both fiction and non-friction titles.

The analysis is designed to uncover key market trends and reader engagement insights, answering questions such as which authors dominate bestseller lists, how book popularity and pricing evolve over time, and what factors contribute to enduring bestseller success.

Areas of focus include:

Data preparation and cleaning for accuracy and consistency.
Exploratory analysis to identify standout books, author performance, and overall market patterns.
Time-series and trend analysis across years and genres.
Correlation and value assessment of price, popularity, and ratings.
Advanced SQL methodologies such as window functions, ranking, and Common Table Expressions (CTEs).
Clear presentation of actionable results for authors, publishers, and retailers.

By integrating advanced SQL techniques with business-focused reporting, this project demonstrates how data analysts can support strategic, decision-making in the publishing industry.

1. Dataset Preparation and Cleaning¶

Accurate and reliable analysis starts with thorough data preparation. This section details the key cleaning steps applied to the bestselling books dataset to ensure consistency and reliability of insights.

1. 1 Remove Exact Duplicates¶

Duplicate records can distort summary statistics and rankings. Using SQL window functions, exact duplicates-rows that have identical values across all columns were identified and removed, keeping only one representative record for each duplicate set.

SQL Example:

In [119]:

# Connect to the database
conn = sqlite3.connect(db_path)

# Example: Query data into a pandas DataFrame
query = """
WITH Deduped AS (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY Name, Author, [User Rating], Reviews, Price, Year, Genre
      ORDER BY Name
    ) AS rn
  FROM Book_Sales bs  
)
SELECT *
FROM Deduped
WHERE rn = 1;
"""
df = pd.read_sql_query(query, conn)

# Show the data
df

Out[119]:

	Name	Author	User Rating	Reviews	Price	Year	Genre	rn
0	10-Day Green Smoothie Cleanse	JJ Smith	4.7	17350	8	2016	Non Fiction	1
1	11/22/63: A Novel	Stephen King	4.6	2052	22	2011	Fiction	1
2	12 Rules for Life: An Antidote to Chaos	Jordan B. Peterson	4.7	18979	15	2018	Non Fiction	1
3	1984 (Signet Classics)	George Orwell	4.7	21424	6	2017	Fiction	1
4	5,000 Awesome Facts (About Everything!) (Natio...	National Geographic Kids	4.8	7665	12	2019	Non Fiction	1
...	...	...	...	...	...	...	...	...
545	Wrecking Ball (Diary of a Wimpy Kid Book 14)	Jeff Kinney	4.9	9413	8	2019	Fiction	1
546	You Are a Badass: How to Stop Doubting Your Gr...	Jen Sincero	4.7	14331	8	2016	Non Fiction	1
547	You Are a Badass: How to Stop Doubting Your Gr...	Jen Sincero	4.7	14331	8	2017	Non Fiction	1
548	You Are a Badass: How to Stop Doubting Your Gr...	Jen Sincero	4.7	14331	8	2018	Non Fiction	1
549	You Are a Badass: How to Stop Doubting Your Gr...	Jen Sincero	4.7	14331	8	2019	Non Fiction	1

550 rows × 8 columns

1.2 Identify Multi-Year Bestsellers¶

Some books appear on the bestseller list in multiple years. Recognising these allows tracking how attributes like price and rating change over time.

SQL Example:

In [120]:

# Connect to the database
conn = sqlite3.connect(db_path)

query = """
SELECT Name, Author, Year, Price, [User Rating],
       LAG(Price) OVER (PARTITION BY Name, Author ORDER BY Year) AS Prev_Year_Price,
       LAG([User Rating]) OVER (PARTITION BY Name, Author ORDER BY Year) AS Prev_Year_Rating
FROM book_sales bs 
WHERE (Name, Author) IN (
  SELECT Name, Author
  FROM book_sales bs2 
  GROUP BY Name, Author
  HAVING COUNT(DISTINCT Year) > 1
)
ORDER BY Name, Author, Year;
"""
df = pd.read_sql_query(query, conn)

# Show the data
df

Out[120]:

	Name	Author	Year	Price	User Rating	Prev_Year_Price	Prev_Year_Rating
0	A Man Called Ove: A Novel	Fredrik Backman	2016	8	4.6	NaN	NaN
1	A Man Called Ove: A Novel	Fredrik Backman	2017	8	4.6	8.0	4.6
2	All the Light We Cannot See	Anthony Doerr	2014	14	4.6	NaN	NaN
3	All the Light We Cannot See	Anthony Doerr	2015	14	4.6	14.0	4.6
4	Becoming	Michelle Obama	2018	11	4.8	NaN	NaN
...	...	...	...	...	...	...	...
290	Wonder	R. J. Palacio	2017	9	4.8	9.0	4.8
291	You Are a Badass: How to Stop Doubting Your Gr...	Jen Sincero	2016	8	4.7	NaN	NaN
292	You Are a Badass: How to Stop Doubting Your Gr...	Jen Sincero	2017	8	4.7	8.0	4.7
293	You Are a Badass: How to Stop Doubting Your Gr...	Jen Sincero	2018	8	4.7	8.0	4.7
294	You Are a Badass: How to Stop Doubting Your Gr...	Jen Sincero	2019	8	4.7	8.0	4.7

295 rows × 7 columns

1.3 Flag Box Sets Seperately¶

Box sets can skew single-title analysis due to higher prices or aggregated reviews. These were flagged by detecting certain keywords in the book title.

SQL Example:

In [121]:

# Connect to the database
conn = sqlite3.connect(db_path)

query = """
SELECT *
FROM book_sales bs 
WHERE NOT (
  LOWER(Name) LIKE '%box%' 
  OR LOWER(Name) LIKE '%set%' 
  OR LOWER(Name) LIKE '%collection%' 
  OR LOWER(Name) LIKE '%trilogy%' 
  OR LOWER(Name) LIKE '%series%' 
  OR LOWER(Name) LIKE '%boxed%'
)
"""
df = pd.read_sql_query(query, conn)

# Show the data
df

Out[121]:

	Name	Author	User Rating	Reviews	Price	Year	Genre
0	10-Day Green Smoothie Cleanse	JJ Smith	4.7	17350	8	2016	Non Fiction
1	11/22/63: A Novel	Stephen King	4.6	2052	22	2011	Fiction
2	12 Rules for Life: An Antidote to Chaos	Jordan B. Peterson	4.7	18979	15	2018	Non Fiction
3	1984 (Signet Classics)	George Orwell	4.7	21424	6	2017	Fiction
4	5,000 Awesome Facts (About Everything!) (Natio...	National Geographic Kids	4.8	7665	12	2019	Non Fiction
...	...	...	...	...	...	...	...
511	Wrecking Ball (Diary of a Wimpy Kid Book 14)	Jeff Kinney	4.9	9413	8	2019	Fiction
512	You Are a Badass: How to Stop Doubting Your Gr...	Jen Sincero	4.7	14331	8	2016	Non Fiction
513	You Are a Badass: How to Stop Doubting Your Gr...	Jen Sincero	4.7	14331	8	2017	Non Fiction
514	You Are a Badass: How to Stop Doubting Your Gr...	Jen Sincero	4.7	14331	8	2018	Non Fiction
515	You Are a Badass: How to Stop Doubting Your Gr...	Jen Sincero	4.7	14331	8	2019	Non Fiction

516 rows × 7 columns

1.4 Handle Null Values Thoughtfully¶

Missing or 'NULL' values can compromise data integrity and analysis quality, particularly in critical fields such as 'Name' or 'Author'.

SQL Check for Nulls:

In [122]:

# Connect to the database
conn = sqlite3.connect(db_path)

query = """
SELECT
SUM(CASE WHEN Name IS NULL THEN 1 ELSE 0 END) AS null_name_count,
SUM(CASE WHEN Author IS NULL THEN 1 ELSE 0 END) AS null_author_count
FROM book_sales bs ;
"""
df = pd.read_sql_query(query, conn)

# Show the data
df

Out[122]:

	null_name_count	null_author_count
0	0	0

Result: The query returned 0 for both counts, including there are no missing values in these critical fields.

Since no null values were found, no deletion or imputation was necessary, and the dataset retained all records for analysis.

Summary: For data quality assurance, null-check queries were performed, confirming there are no missing 'Name' or 'Author entries. This validation step ensures that subsequent analyses using these fields will be reliable.

2. Exploratory Data Analysis¶

This stage examines core attributes of the dataset to identify leading titles, high-performing authors, price distributions, and historical patterns. Queries focus on summarising the dataset in ways that reveal initial trends for deeper analysis.

2.1 To 10 Highest-Rated Books Overall¶

Determines the highest-rated titles by user rating, with review counts used to break ties.

SQL Query:

In [123]:

# Connect to the database
conn = sqlite3.connect(db_path)

query = """
SELECT Name, Author, MAX([User Rating]) AS User_Rating, MAX(Reviews) AS Reviews, MAX(Price) AS Price
FROM book_sales
GROUP BY Name, Author
ORDER BY User_Rating DESC, Reviews DESC
LIMIT 10;
"""
df = pd.read_sql_query(query, conn)

# Show the data
df

Out[123]:

	Name	Author	User_Rating	Reviews	Price
0	Oh, the Places You'll Go!	Dr. Seuss	4.9	21834	8
1	Harry Potter and the Chamber of Secrets: The I...	J.K. Rowling	4.9	19622	30
2	Jesus Calling: Enjoying Peace in His Presence ...	Sarah Young	4.9	19576	8
3	The Very Hungry Caterpillar	Eric Carle	4.9	19546	5
4	Brown Bear, Brown Bear, What Do You See?	Bill Martin Jr.	4.9	14344	5
5	Dog Man: Fetch-22: From the Creator of Captain...	Dav Pilkey	4.9	12619	8
6	Last Week Tonight with John Oliver Presents A ...	Jill Twiss	4.9	11881	13
7	Harry Potter and the Sorcerer's Stone: The Ill...	J.K. Rowling	4.9	10052	22
8	Wrecking Ball (Diary of a Wimpy Kid Book 14)	Jeff Kinney	4.9	9413	8
9	Strange Planet (Strange Planet Series)	Nathan W. Pyle	4.9	9382	6

In [124]:

%%html
<iframe src="https://public.tableau.com/views/booksalesvisualisations/Top10Highest-RatedBooks?:language=en-GB&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link:embed=y&:showVizHome=no" width="1000" height="400"></iframe>

2.2 Top 10 Books by Genre¶

Ranks the best-rated Fiction and Non-Fiction titles separately to account for differences in audience and market behaviour.

SQL - Fiction:

In [125]:

# Connect to the database
conn = sqlite3.connect(db_path)

query = """
SELECT Name, Author, MAX([User Rating]) AS User_Rating, MAX(Reviews) AS Reviews, MAX(Price) AS Price
FROM book_sales
WHERE Genre = 'Fiction'
GROUP BY Name, Author
ORDER BY User_Rating DESC, Reviews DESC
LIMIT 10;
"""
df = pd.read_sql_query(query, conn)

# Show the data
df

Out[125]:

	Name	Author	User_Rating	Reviews	Price
0	Oh, the Places You'll Go!	Dr. Seuss	4.9	21834	8
1	Harry Potter and the Chamber of Secrets: The I...	J.K. Rowling	4.9	19622	30
2	The Very Hungry Caterpillar	Eric Carle	4.9	19546	5
3	Brown Bear, Brown Bear, What Do You See?	Bill Martin Jr.	4.9	14344	5
4	Dog Man: Fetch-22: From the Creator of Captain...	Dav Pilkey	4.9	12619	8
5	Last Week Tonight with John Oliver Presents A ...	Jill Twiss	4.9	11881	13
6	Harry Potter and the Sorcerer's Stone: The Ill...	J.K. Rowling	4.9	10052	22
7	Wrecking Ball (Diary of a Wimpy Kid Book 14)	Jeff Kinney	4.9	9413	8
8	Strange Planet (Strange Planet Series)	Nathan W. Pyle	4.9	9382	6
9	Dog Man: For Whom the Ball Rolls: From the Cre...	Dav Pilkey	4.9	9089	8

SQL - Non-Fiction:

In [126]:

# Connect to the database
conn = sqlite3.connect(db_path)

query = """
SELECT Name, Author, MAX([User Rating]) AS User_Rating, MAX(Reviews) AS Reviews, MAX(Price) AS Price
FROM book_sales
WHERE Genre = 'Non Fiction'
GROUP BY Name, Author
ORDER BY User_Rating DESC, Reviews DESC
LIMIT 10;
"""
df = pd.read_sql_query(query, conn)

# Show the data
df

Out[126]:

	Name	Author	User_Rating	Reviews	Price
0	Jesus Calling: Enjoying Peace in His Presence ...	Sarah Young	4.9	19576	8
1	The Magnolia Story	Chip Gaines	4.9	7861	5
2	Unfreedom of the Press	Mark R. Levin	4.9	5956	11
3	Hamilton: The Revolution	Lin-Manuel Miranda	4.9	5867	54
4	Obama: An Intimate Portrait	Pete Souza	4.9	3192	22
5	Humans of New York : Stories	Brandon Stanton	4.9	2812	17
6	Becoming	Michelle Obama	4.8	61133	11
7	Unbroken: A World War II Story of Survival, Re...	Laura Hillenbrand	4.8	29673	16
8	The 5 Love Languages: The Secret to Love that ...	Gary Chapman	4.8	25554	8
9	The Boys in the Boat: Nine Americans and Their...	Daniel James Brown	4.8	23358	12

%%html

2.3 Most Frequent Authors¶

Counts the number of appearances each author has in the dataset to highlight the most consistently successful writers.

SQL Query:

In [127]:

# Connect to the database
conn = sqlite3.connect(db_path)

query = """
SELECT Author, COUNT(*) AS appearances
FROM book_sales
GROUP BY Author
ORDER BY appearances DESC
LIMIT 10;
"""
df = pd.read_sql_query(query, conn)

# Show the data
df

Out[127]:

	Author	appearances
0	Jeff Kinney	12
1	Suzanne Collins	11
2	Rick Riordan	11
3	Gary Chapman	11
4	American Psychological Association	10
5	Gallup	9
6	Dr. Seuss	9
7	Rob Elliott	8
8	Stephenie Meyer	7
9	Stephen R. Covey	7

In [148]:

%%html
<iframe src="https://public.tableau.com/views/MostFrequentAuthors/MostFrequentAuthors?:language=en-GB&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_linkembed=y&:showVizHome=no" width="1000" height="350"></iframe>

2.4 Price Distribution¶

Groups titles into price bands to assess the market split between low-cost, mid-range, and premium books.

SQL Query:

In [149]:

# Connect to the database
conn = sqlite3.connect(db_path)

query = """
SELECT
CASE
WHEN Price < 5 THEN 'Under $5'
WHEN Price < 10 THEN '$5–9.99'
WHEN Price < 15 THEN '$10–14.99'
ELSE '$15+'
END AS price_range,
COUNT(*) AS num_books
FROM book_sales
GROUP BY price_range
ORDER BY price_range;
"""
df = pd.read_sql_query(query, conn)

# Show the data
df

Out[149]:

	price_range	num_books
0	$10–14.99	149
1	$15+	163
2	$5–9.99	187
3	Under $5	51

In [168]:

%%html
<iframe src="https://public.tableau.com/views/PriceDistribution_17552559415380/PriceDistribution?:language=en-GB&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_linkembed=y&:showVizHome=no" width="1000" height="600"></iframe>

2.5 Year with the Most Bestsellers¶

Identifies the years with the highest volume of bestselling entries.

SQL Query:

In [169]:

# Connect to the database
conn = sqlite3.connect(db_path)

query = """
SELECT Year, COUNT(*) AS num_books
FROM book_sales
GROUP BY Year
ORDER BY num_books DESC
LIMIT 1;
"""
df = pd.read_sql_query(query, conn)

# Show the data
df

Out[169]:

	Year	num_books
0	2019	50

2.6 Average User Rating by Year¶

Calculates yearly average ratings to observe any changes in reader satisfaction trends.

SQL Query:

In [170]:

# Connect to the database
conn = sqlite3.connect(db_path)

query = """
SELECT Year, AVG([User Rating]) AS avg_rating
FROM book_sales
GROUP BY Year
ORDER BY Year;
"""
df = pd.read_sql_query(query, conn)

# Show the data
df

Out[170]:

	Year	avg_rating
0	2009	4.584
1	2010	4.558
2	2011	4.558
3	2012	4.532
4	2013	4.554
5	2014	4.622
6	2015	4.648
7	2016	4.678
8	2017	4.660
9	2018	4.668
10	2019	4.740

In [171]:

%%html
<iframe src="https://public.tableau.com/views/booksalesvisualisations/AverageUserRatingPerYear?:language=en-GB&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_linkembed=y&:showVizHome=no" width="1000" height="600"></iframe>

Summary: These outputs establish a data-driven overview of leading titles, high-impact authors, pricing patterns, and key years. The results form the basis for subsequent trend, correlation, and performance analyses.

3. Trends Over Time¶

This section explores key book sales trends using SQL queries for data extraction, Python 9pandas0 for data analysis, and Tableau for insightful visualisations.

3.1 Authors With Longest Consecutive Appearances¶

Identify which authors maintained bestseller status for the most consecutive years.

Step 1: SQL Query¶

In [172]:

# Connect to the database
conn = sqlite3.connect(db_path)

query = """
SELECT Author, COUNT (*) AS appearances 
FROM book_sales bs 
GROUP BY Author
ORDER BY appearances DESC;
"""
df = pd.read_sql_query(query, conn)

# Show the data
df

Out[172]:

	Author	appearances
0	Jeff Kinney	12
1	Suzanne Collins	11
2	Rick Riordan	11
3	Gary Chapman	11
4	American Psychological Association	10
...	...	...
243	Adult Coloring Book Designs	1
244	Admiral William H. McRaven	1
245	Adir Levy	1
246	Adam Mansbach	1
247	Adam Gasiewski	1

248 rows × 2 columns

In [182]:

%%html
<iframe src="https://public.tableau.com/views/AuthorswithLongestConsecutiveappearances/AuthorswithLongestConsecutiveAppearances?:language=en-GB&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_linkembed=y&:showVizHome=no" width="1000" height="400"></iframe>

Result Summary¶

The analysis reveals a handful of authors with remarkable staying power, consistently producing bestsellers year after year. These long streaks may reflect both popularity and prolific publishing. In Tableau, line charts showed these trajectories, with top authors like Jeff Kinney maintaining consecutive annual hits, setting them apart from others.

3.2 How Average Reviews and Ratings Evolved¶

We tracked how reader engagement and satisfaction changed over time.

SQL Query¶

In [215]:

# Connect to the database
conn = sqlite3.connect(db_path)

query = """
SELECT
    Year,
    AVG(Reviews) AS avg_reviews,
    AVG([User Rating]) AS avg_user_rating
FROM
    book_sales
GROUP BY
    Year
ORDER BY
    Year;
"""
df = pd.read_sql_query(query, conn)

# Show the data
df

Out[215]:

	Year	avg_reviews	avg_user_rating
0	2009	4710.12	4.584
1	2010	5479.62	4.558
2	2011	8100.82	4.558
3	2012	13090.92	4.532
4	2013	13098.14	4.554
5	2014	15859.94	4.622
6	2015	14233.38	4.648
7	2016	14196.00	4.678
8	2017	12888.40	4.660
9	2018	13930.42	4.668
10	2019	15898.34	4.740

Tableau Visualisations¶

In [216]:

%%html
<iframe src="https://public.tableau.com/views/booksalesvisualisations/AverageUserRatingPerYear?:language=en-GB&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_linkembed=y&:showVizHome=no" width="1000" height="600"></iframe>

In [217]:

%%html
<iframe src="https://public.tableau.com/views/AverageReviewsPerYear/AverageReviewsPerYear?:language=en-GB&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_linkembed=y&:showVizHome=no" width="1000" height="600"></iframe>

Python Analysis¶

In [239]:

# Connect to the database
conn = sqlite3.connect(db_path)

query = """
SELECT
    Year,
    AVG(Reviews) AS avg_reviews,
    AVG([User Rating]) AS avg_user_rating
FROM
    book_sales
GROUP BY
    Year
ORDER BY
    Year;
"""
df = pd.read_sql_query(query, conn)



yearly_stats = df.groupby('Year').agg({
    'avg_reviews': 'mean',
    'avg_user_rating': 'mean'
}).reset_index()

yearly_stats

Out[239]:

	Year	avg_reviews	avg_user_rating
0	2009	4710.12	4.584
1	2010	5479.62	4.558
2	2011	8100.82	4.558
3	2012	13090.92	4.532
4	2013	13098.14	4.554
5	2014	15859.94	4.622
6	2015	14233.38	4.648
7	2016	14196.00	4.678
8	2017	12888.40	4.660
9	2018	13930.42	4.668
10	2019	15898.34	4.740

Result Summary¶

The results show average reviews increased steadily from 2010 to 2019, indicating more customers sharing their opinions each year. Average ratings remained high and relatively stable, with slight upward movement suggesting consistent satisfaction with top books. Tableau line charts highlighted years with noticeable jumps in reviews, possible effects of growing online retail and review culture.

3.3 Price Trends for Multi-Year Bestsellers¶

I investigated whether prices rise, fall, or remain steady for books that were bestsellers across multiple years.

Step 1: SQL Query¶

In [ ]:

# Connect to the database
conn = sqlite3.connect(db_path)

query = """
SELECT Name, Year, Price
FROM book_sales
WHERE Name IN (
SELECT Name FROM book_sales GROUP BY Name HAVING COUNT(DISTINCT Year) > 1
)
ORDER BY Name, Year;
"""
df = pd.read_sql_query(query, conn)

# Show the data
df

Out[ ]:

	Name	Year	Price
0	A Man Called Ove: A Novel	2016	8
1	A Man Called Ove: A Novel	2017	8
2	All the Light We Cannot See	2014	14
3	All the Light We Cannot See	2015	14
4	Becoming	2018	11
...	...	...	...
290	Wonder	2017	9
291	You Are a Badass: How to Stop Doubting Your Gr...	2016	8
292	You Are a Badass: How to Stop Doubting Your Gr...	2017	8
293	You Are a Badass: How to Stop Doubting Your Gr...	2018	8
294	You Are a Badass: How to Stop Doubting Your Gr...	2019	8

295 rows × 3 columns

Step 2: Python Processing¶

In [246]:

multi_year = df.groupby('Name').filter(lambda x: x['Year'].nunique() > 1)
multi_year[['Name', 'Year', 'Price']].sort_values(['Name', 'Year']).to_csv('multi_year_books.csv', index=False)
multi_year[['Name', 'Year', 'Price']].head()

Out[246]:

	Name	Year	Price
0	A Man Called Ove: A Novel	2016	8
1	A Man Called Ove: A Novel	2017	8
2	All the Light We Cannot See	2014	14
3	All the Light We Cannot See	2015	14
4	Becoming	2018	11

Step 3: Tableau Visualisation¶

In [249]:

%%html
<iframe src="https://public.tableau.com/views/Top10BestsellingBookswithMultipleYears/Top10BestsellingBookswithMultipleYears?:language=en-GB&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_linkembed=y&:showVizHome=no" width="1000" height="600"></iframe>

Result Summary¶

Visualising these books in Tableau, most multi-year bestsellers showed stable pricing, but there were notable cases of price drops in later years, perhaps as part of promotions or to boost continued sales. A few titles even saw price hikes, possibly reflecting increased demand or new editions. This analysis reveals how publishers adjust strategies to maximise longevity and revenue.

3.4 Fiction vs. Non-Fiction Comparative Trends¶

Compare ratings, reviews, and price trends for each genre.

Step 1: SQL Query¶

In [250]:

# Connect to the database
conn = sqlite3.connect(db_path)

query = """
SELECT Year, Genre, AVG([User Rating]) AS avg_rating, AVG(Price) AS avg_price, AVG(Reviews) AS avg_reviews
FROM book_sales
GROUP BY Year, Genre
ORDER BY Year, Genre;
"""
df = pd.read_sql_query(query, conn)

# Show the data
df

Out[250]:

	Year	Genre	avg_rating	avg_price	avg_reviews
0	2009	Fiction	4.591667	15.583333	6534.333333
1	2009	Non Fiction	4.576923	15.230769	3026.230769
2	2010	Fiction	4.615000	9.700000	8409.250000
3	2010	Non Fiction	4.520000	16.000000	3526.533333
4	2011	Fiction	4.619048	11.619048	10335.285714
5	2011	Non Fiction	4.513793	17.620690	6482.758621
6	2012	Fiction	4.495238	12.285714	19896.238095
7	2012	Non Fiction	4.558621	17.482759	8162.931034
8	2013	Fiction	4.545833	10.708333	19986.833333
9	2013	Non Fiction	4.561538	18.192308	6739.346154
10	2014	Fiction	4.631034	10.172414	19382.862069
11	2014	Non Fiction	4.609524	20.809524	10994.952381
12	2015	Fiction	4.652941	9.352941	23706.117647
13	2015	Non Fiction	4.645455	10.969697	9353.484848
14	2016	Fiction	4.715789	12.631579	19563.263158
15	2016	Non Fiction	4.654839	13.516129	10906.387097
16	2017	Fiction	4.737500	8.833333	14611.833333
17	2017	Non Fiction	4.588462	13.730769	11297.538462
18	2018	Fiction	4.738095	8.761905	12710.428571
19	2018	Non Fiction	4.617241	11.793103	14813.862069
20	2019	Fiction	4.820000	9.350000	18507.150000
21	2019	Non Fiction	4.686667	10.566667	14159.133333

Step 2: Tableau Visualisation¶

In [252]:

%%html
<iframe src="https://public.tableau.com/views/Fictionvs_Non-FictionRatingsPriceandReviewsOverTime/Fictionvs_Non-FictionRatingsPriceandReviewsOverTime?:language=en-GB&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_linkembed=y&:showVizHome=no" width="1000" height="600"></iframe>

Result Summary¶

This visualisation tracks the comparitive trends of User Ratings, Average Prices, and Average Reviews for Fiction and non-Fiction bestsellers from 2008 to 2020.

User Ratings: Both Fiction and Non-Fiction consistently receive high average ratings. Occasional fluctuations highlight periods with standout books or shifting reader preferences.
Average Price: Non-Fiction titles are typically priced higher than Fiction, though prices for both genres show a general downward trend in recent years, possible reflecting increased competition or changes in publishing models.
Average Reviews: Reader engagement has grown over time. Fiction books tend to attract more reviews overall, but Non-Fiction titles are catching up, indicating rising popularity and interest.

Dashed lines on each panel visualise the direction and magnitude of these changes, making it easy to compare genre performance and observe market shifts across the year.

Patterns and Anomalies: Key Takeaways¶

Certain authors consistently dominate year after year; their sustained streaks reflect both popularity and loyalty.
Average reviews have increased annually which is evidence of growing community feedback and digital engagement.
Multi-year bestsellers typically hold their price, but special offers or new editions can drive changes, revealing publisher tactics.
Fiction and Non-Fiction behave differently: Fiction attracts more reviews; Non-Fiction sustains higher prices. Both maintain strong ratings, but market shifts are revealed in Tableau charts.

Conclusion¶

Through integrated SQL/Python analysis and Tableau visualisation, the data reveals how bestselling book trends shift, who stands out, and how consumer behaviour evolves. These insights inform future marketing, pricing and publishing strategies.

4. Popularity vs. Price Analysis¶

Section 5 investigates the relationships between bestseller book price, reader popularity (using reviews as a proxy), and user ratings. This section showcases correlation analysis and comparitive SQL skills.

4.1 Is There a Correlation Between Book Price and Number of Reviews?¶

Goal: Determine if higher-priced books receive more or fewer reviews, using review count as a measure of popularity.

SQL Query:

In [257]:

# Connect to the database
conn = sqlite3.connect(db_path)

query = """
SELECT Price, AVG(Reviews) AS avg_reviews, COUNT(*) AS num_books
FROM book_sales
GROUP BY Price
ORDER BY Price
Limit 10;
"""
df = pd.read_sql_query(query, conn)

# Show the data
df

Out[257]:

	Price	avg_reviews	num_books
0	0	11124.083333	12
1	1	720.000000	1
2	2	5633.600000	5
3	3	5983.000000	1
4	4	10750.937500	32
5	5	10621.555556	36
6	6	11685.710526	38
7	7	15178.521739	23
8	8	17235.461538	52
9	9	10307.578947	38

Tableau Visualisation¶

In [259]:

%%html
<iframe src="https://public.tableau.com/views/AverageReviewsbyBookPrice/AverageReviewsbyBookPrice?:language=en-GB&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_linkembed=y&:showVizHome=no" width="1000" height="600"></iframe>

This scatter plot illustrates the relationship between book price and average number of reviews. The chart reveals a clear negative correlation: as book prices increase, the average number of reviews tends to decrease. Most books clustered in the lower price range ($0–$100) receive significantly higher review counts, while higher-priced books ($200 and above) generally attract fewer reviews. This pattern suggests that more affordable books are more popular among readers, whereas expensive titles are less likely to generate widespread engagement.

4.2 Do Higher-Rated Books Tend to Cost More or Less?¶

Goal: Analyse whether higher user ratings are associated with higher or lower book prices.

SQL Query:

In [267]:

# Connect to the database
conn = sqlite3.connect(db_path)

query = """

SELECT Price, AVG([User Rating]) AS avg_rating, COUNT(*) AS num_books
FROM book_sales
GROUP BY Price
ORDER BY Price
Limit 10;
"""
df = pd.read_sql_query(query, conn)
# Show the data
df

Out[267]:

	Price	avg_rating	num_books
0	0	4.766667	12
1	1	4.500000	1
2	2	4.520000	5
3	3	4.700000	1
4	4	4.690625	32
5	5	4.761111	36
6	6	4.594737	38
7	7	4.600000	23
8	8	4.738462	52
9	9	4.634211	38

Tableau Visualisation¶

In [269]:

%%html
<iframe src="https://public.tableau.com/views/AverageUserRatingAcrossPriceRanges/AverageUserRatingAcrossPriceRanges?:language=en-GB&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_linkembed=y&:showVizHome=no" width="1000" height="600"></iframe>

Summary¶

This chart displays the average user rating for bestseller books grouped by price range. The visualisation reveals that user ratings remain consistently high across all price bins, with little variation between cheaper and more expensive books. This suggests that book price does not significantly influence user satisfaction; readers rate books positively regardless of how much they cost.

4.3 Which Genres Give More "Value" for the Price?¶

Goal: Find which genre offers the most average rating per pound (rating_per_pound), highlighting the best "value" for readers.

SQL Query:

In [271]:

# Connect to the database
conn = sqlite3.connect(db_path)

query = """

SELECT Genre,
AVG([User Rating]) AS avg_rating,
AVG(Price) AS avg_price,
AVG([User Rating]) / NULLIF(AVG(Price), 0) AS rating_per_pound
FROM book_sales
GROUP BY Genre
ORDER BY rating_per_pound DESC;
"""
df = pd.read_sql_query(query, conn)
# Show the data
df

Out[271]:

	Genre	avg_rating	avg_price	rating_per_pound
0	Fiction	4.648333	10.850000	0.428418
1	Non Fiction	4.595161	14.841935	0.309607

This compares book genres based on their average reader rating per pound, revealing which categories deliver the greatest value to readers. By dividing the average rating by average price for each genre, the visualisation highlights those genres that offer the highest satisfaction for each pound spent. The results make it clear that some genres consistently provide a better reading experience at lower cost, helping readers identify where they can get the most enjoyment for their money and guiding publishers or authors interested in market positioning.

5. Bestseller Longevity Analysis¶

This section explores how long books and authors remain on the bestseller list, analysing patterns of longevity and repeated popularity within the dataset.

5.1 Which Books Stayed on the List for the Largest Number of Years?¶

SQL Query:

In [273]:

# Connect to the database
conn = sqlite3.connect(db_path)

query = """

SELECT Name, Author, COUNT(DISTINCT Year) AS years_on_list
FROM book_sales
GROUP BY Name, Author
ORDER BY years_on_list DESC
LIMIT 10;
"""
df = pd.read_sql_query(query, conn)
# Show the data
df

Out[273]:

	Name	Author	years_on_list
0	Publication Manual of the American Psychologic...	American Psychological Association	10
1	StrengthsFinder 2.0	Gallup	9
2	Oh, the Places You'll Go!	Dr. Seuss	8
3	The 7 Habits of Highly Effective People: Power...	Stephen R. Covey	7
4	The Very Hungry Caterpillar	Eric Carle	7
5	Jesus Calling: Enjoying Peace in His Presence ...	Sarah Young	6
6	The Four Agreements: A Practical Guide to Pers...	Don Miguel Ruiz	6
7	First 100 Words	Roger Priddy	5
8	Giraffes Can't Dance	Giles Andreae	5
9	How to Win Friends & Influence People	Dale Carnegie	5

Summary:

This query highlights the titles that have achieved enduring success by appearing as bestsellers in the greatest number of different years. Identifying these books showcases which stories or authors maintain strong appeal over time, reflecting sustained reader interest and cultural impact.

5.2 On Average, How Many Years Does a Bestseller Remain in the Dataset?¶

SQL Query:

In [275]:

# Connect to the database
conn = sqlite3.connect(db_path)

query = """

SELECT AVG(years_on_list) AS avg_years_on_list
FROM (
SELECT Name, Author, COUNT(DISTINCT Year) AS years_on_list
FROM book_sales
GROUP BY Name, Author
) AS book_years;
"""
df = pd.read_sql_query(query, conn)
# Show the data
df

Out[275]:

	avg_years_on_list
0	1.558405

Summary:

This query calculates the typical lifespan of a bestseller, offering insight into how quickly books rise and fall in popularity. The average number of years spent on the list highlights broader patterns in reading trends and the overall volatility or stability of the bestseller market.

5.3. Which Authors Have Books That Reappear the Most Over Time?¶

SQL Query:

In [277]:

# Connect to the database
conn = sqlite3.connect(db_path)

query = """

SELECT Author, SUM(years_on_list) AS total_book_years_on_list
FROM (
SELECT Author, Name, COUNT(DISTINCT Year) AS years_on_list
FROM book_sales
GROUP BY Author, Name
) AS book_years
GROUP BY Author
ORDER BY total_book_years_on_list DESC
LIMIT 10;

"""
df = pd.read_sql_query(query, conn)
# Show the data
df

Out[277]:

	Author	total_book_years_on_list
0	Jeff Kinney	12
1	Suzanne Collins	11
2	Rick Riordan	11
3	Gary Chapman	11
4	American Psychological Association	10
5	Gallup	9
6	Dr. Seuss	9
7	Rob Elliott	8
8	Stephenie Meyer	7
9	Stephen R. Covey	7

Summary:

This aggregation surfaces authors whose works collectively dominate the bestseller lists over time. It reflects both the productivity and enduring popularity of top writers, spotlighting those who consistently capture reader attention year after year.

Conclusion¶

This project provides a comprehensive analysis of bestselling books, exploring key factors that drive popularity, value, and longevity in the market. By examining relationships between price, ratings, and review counts, comparing Fiction and Non-Fiction trnds, and investigating which books and authors maintain bestseller status over multiple yars, the analysis offers valuable insights into reader preferences and market dynamics. The results highlight consistent patterns such as the tendency for affordable books to attract more reviews, the high user satisfaction across genres and price ranges, and the enduring appeal of certain authors and titles. These findings not only deepen our understanding of bestseler performance but also offer practical guidance for publishers, authors, and readers seeking to navigate and succeed in the competitive landscape of the book market.

In [278]:

# Close the connection when done
conn.close()

Book Sales Analysis

Import dependencies and connect to db¶

Analysing Trends and Insights in Bestselling Books Using SQL¶

Project Overview¶

1. Dataset Preparation and Cleaning¶

1. 1 Remove Exact Duplicates¶

1.2 Identify Multi-Year Bestsellers¶

1.3 Flag Box Sets Seperately¶

1.4 Handle Null Values Thoughtfully¶

2. Exploratory Data Analysis¶

2.1 To 10 Highest-Rated Books Overall¶

2.2 Top 10 Books by Genre¶

2.3 Most Frequent Authors¶

2.4 Price Distribution¶

2.5 Year with the Most Bestsellers¶

2.6 Average User Rating by Year¶

3. Trends Over Time¶

3.1 Authors With Longest Consecutive Appearances¶

Step 1: SQL Query¶

Result Summary¶

3.2 How Average Reviews and Ratings Evolved¶

SQL Query¶

Tableau Visualisations¶

Python Analysis¶

Result Summary¶

3.3 Price Trends for Multi-Year Bestsellers¶

Step 1: SQL Query¶

Step 2: Python Processing¶

Step 3: Tableau Visualisation¶

Result Summary¶

3.4 Fiction vs. Non-Fiction Comparative Trends¶

Step 1: SQL Query¶

Step 2: Tableau Visualisation¶

Result Summary¶

Patterns and Anomalies: Key Takeaways¶

Conclusion¶

4. Popularity vs. Price Analysis¶

4.1 Is There a Correlation Between Book Price and Number of Reviews?¶

Tableau Visualisation¶

4.2 Do Higher-Rated Books Tend to Cost More or Less?¶

Tableau Visualisation¶

Summary¶

4.3 Which Genres Give More "Value" for the Price?¶

5. Bestseller Longevity Analysis¶

5.1 Which Books Stayed on the List for the Largest Number of Years?¶

5.2 On Average, How Many Years Does a Bestseller Remain in the Dataset?¶

5.3. Which Authors Have Books That Reappear the Most Over Time?¶

Conclusion¶