Regression Is Difficult? Not Anymore

“This blog explains the concept of regression through a fictional story, making its meaning easier to understand and its workings simpler to decode.”

Rooshabhkumar Mehta

7/24/2025 · 15 min read

Classes right after lunch tend to feel sleepy and lazy, but they are worth attending when a good statistician is translating the formulas for the common man and woman.

Dear Readers, welcome to an engaging class at a highly regarded college in Mumbai, set along the shoreline. Through the window, the sun met the sea, and the sand along the shore reflected the sparkling water. Although the sun was drifting toward the horizon, it still blazed brightly overhead. Cool winds, carrying a trace of warmth, flowed through the window into the classroom, and the group of fifteen students slowly sank into a shared afternoon drowsiness.

Professor Vijaya Tikkoo, a woman in her forties, entered the classroom and at once sensed the room’s fading energy. She drew a deep breath and wandered a few steps from the podium to the window, where the breeze from the Arabian Sea played gently with her curly hair and set the pallu of her sari flowing like an iconic Bollywood scene.

She asked a student to write the title of the day’s discussion, “Regression,” on the blackboard.

Professor Vijaya Tikkoo asked with a smile, “What exactly is unclear about this technique?” Noticing the class’s post-lunch drowsiness and low energy, she paused for a few moments, read the students’ tired faces, and then turned and walked toward the window.

The grand window opened onto a balcony facing the Arabian Sea and the Mumbai shore.

Prof Vijaya gazed at the sea, sipping tea from a mug just served by a college chaiwallah bhaiya. She breathed deeply, occasionally peering out the window. Her green eyes caught the golden sunlight, shining like sunlit fields just above the horizon.

By this time the students were speaking quietly with one another, in whispers and gestures, trying to arrive at the correct answer.

Then Professor Vijaya said, “I would like to start with this nuance:

Regression is the art of predicting Y by studying the effect of X.”

Looking at the students’ faces, Professor Vijaya sensed their boredom with the topic. The course coordinator had already told her that the class struggled to understand regression, even after other faculty had explained it with practical examples. While she had physically entered the classroom only moments earlier, her mind had been working for nearly half an hour, searching for the right analogy to make regression clear. She finally thought of a classroom exercise that might seem unrelated at first but could help the students grasp the concept. With that idea in mind, she decided to try the experiment in the classroom.

Pointing to a boy named Suresh, she said, “Suresh, if you don’t mind, could you name a friend in this class who can describe your qualities?”

Suresh stood up and replied, “Swapnil is my friend. He really cares about me.”

Professor Vijaya and the class turned toward Swapnil. Swapnil stood, smiled at Suresh, then looked at Professor Vijaya and said:

“Ma’am, he is smart and intelligent, knows a lot about finance, plays basketball well, and is dedicated to the gym. He can be short-tempered at times, but he is mostly a good friend.”

Professor Vijaya nodded and said, “Hmm, fair enough.” She glanced out the window, and for a moment the sunlight lit up her green eyes. Around the room, the students looked surprised and whispered to one another, wondering why she had asked the question. A few quietly continued discussing Suresh’s qualities, while the topic of regression gradually slipped out of the students’ discussion, though not from the professor’s mind.

After a pause of a few seconds, Professor Vijaya turned toward Suresh and asked, “Suresh, is there any other friend of yours who could describe you well and tell us more?”

Suresh paused for a few seconds and then said firmly, “Ma’am, Raj is also a good friend of mine, and he understands me well.”

Professor Vijaya turned toward Raj, and the entire class shifted its attention to him. Raj stood up with a thoughtful expression and said, “I think Suresh is smart. As Swapnil said, he is knowledgeable in finance and dedicated to the gym and to basketball, and he befriends people easily.” Raj spoke with pauses of a few seconds, counting off Suresh’s qualities on his fingers.

The professor pointed this out and whispered, “There you are. Much of what you said overlaps with Swapnil’s description, but you also added something new: that Suresh connects well with people.”

Then, slowly and clearly, holding sharp eye contact with the students, she said in a deep voice, “Each new response adds variation to the data. The moment a new set of values is added, the mean usually changes, and when the mean changes, the standard deviation can change as well!”
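For readers who like to see this with numbers, here is a minimal Python sketch. The ratings are made up for illustration (they are not from the story); the only point is that one new response shifts both the mean and the standard deviation.

```python
from statistics import mean, stdev

# Hypothetical 1-to-10 ratings of Suresh (illustrative numbers only).
ratings = [6, 7, 7, 8]
mean_before, sd_before = mean(ratings), stdev(ratings)

# One more friend responds with a rating of 10.
ratings_after = ratings + [10]
mean_after, sd_after = mean(ratings_after), stdev(ratings_after)

print(f"mean: {mean_before:.2f} -> {mean_after:.2f}")
print(f"standard deviation: {sd_before:.2f} -> {sd_after:.2f}")
```

With these assumed numbers the mean moves from 7.00 to 7.60 and the sample standard deviation grows as well, because the new rating sits far from the earlier ones.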

Keeping a smile on her face, she said, “That is one way to visualize so-called ‘variation in data’. Another thing to notice is that Swapnil and Raj used several similar observation points, though Raj added a little more. Notice that their views are broadly alike, which suggests correlation between variables. Even so, Raj and Swapnil together still do not fully explain Suresh’s personality. The untold part of Suresh’s personality can be viewed as EEERRror (she gave more emphasis to this word) in this journey. When independent variables (IVs) are correlated, their patterns match each other. They may explain the dependent variable to some extent, but not completely. Some part of the dependent variable’s pattern is still not explained by the IVs.”

Standing near the blackboard, Professor Vijaya suddenly turned. At that very moment, a gust of wind swept through the classroom, lifting her hair as she spoke carefully to clarify the meaning of error in regression. “The unknown or unspoken qualities of Suresh are not themselves the error,” she explained. “The error lies in the difference between what we predict and what we actually observe. Similarly, in statistical data, we estimate the dependent variable based on its assumed relationship with the independent variables. The gap between the predicted value and the observed value of the dependent variable is called the error.”
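The professor's definition of error can be sketched in a few lines of Python. The observed and predicted values below are hypothetical, chosen only to show the subtraction.

```python
# Hypothetical observed values of Y and the values a model predicted for them.
observed = [12.0, 15.0, 19.0, 22.0]
predicted = [13.0, 15.5, 18.0, 21.5]

# Error (residual) = observed value minus predicted value.
errors = [y - y_hat for y, y_hat in zip(observed, predicted)]

# Squaring stops positive and negative errors from cancelling out.
sum_squared_errors = sum(e ** 2 for e in errors)

print(errors)
print(sum_squared_errors)
```

Some errors are negative (the model predicted too high) and some are positive (it predicted too low), which is why regression software works with squared errors when it judges a model.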

The students now understood the concept clearly, as if the gust of wind and Professor Vijaya’s words had together swept away their confusion. She tied back her hair without pin, while on the other side the students, smiling with relief, broke into gentle applause.

The students listened closely. Many of their furrowed brows eased, showing that they were beginning to understand the abstract ideas of variation, change, and correlation through this analogy.

Next, Professor Vijaya turned to two very good friends of Suresh, Kamya and Nancy. She moved toward the blackboard, called out their names, and asked what they knew about Suresh’s personality.

Kamya said, “Ma’am, I would like to add something new to what Swapnil and Raj said. I have noticed those general behavioral traits and habits too, but Suresh also takes very good care of his grandmother and his sister. I am a friend of his girlfriend,” (at this, the class and Professor Vijaya smiled for a moment, but then everyone listened again) “and she says Suresh is very caring and a feminist, someone who will fight for you no matter what.”


Nancy said, “Ma’am, I’ve heard similar things from Suresh’s girlfriend, and his sister is also a close friend of mine. She often talks about how their mother’s life changed during COVID.”

For a moment, the class fell silent, and the smiles on everyone’s faces faded slightly. With a small, generous gesture, Professor Vijaya’s loosely tied hair came undone, but she remained calm, watching the discussion and encouraging the girls to continue.

“But he takes such good care of his sister that she never feels the absence of her mother: the warmth, comfort, and guidance,” Nancy continued softly. “She always says, ‘My brother has shaped himself in such a way that I never feel like I’ve lost my mother.’”

A few girls in the class managed faint smiles, their eyes glistening with unshed tears. Soon, a gentle round of applause began. The professor, too, showed a warm expression of kindness, smiling as she acknowledged the moment with a graceful nod and a subtle gesture of appreciation.

The professor took a deep breath and started writing all these observations on the blackboard.

She said, “In our regression model, we understand the dependent variable by examining its relationship with the independent variables. Each independent variable (IV) tells us something about the dependent variable, either repeating what other IVs already show or contributing new information.

This process also shows how independent variables often overlap in what they reveal about the dependent variable. Because they do not fully explain it, their combined explanatory power is summarized by R-square: the percentage of variation in the dependent variable explained by the independent variables.”
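To make R-square concrete, here is a small sketch with one independent variable. The data and the helper names `fit_line` and `r_squared` are assumptions for illustration; the calculation is the usual "one minus unexplained variation over total variation" for a least-squares line.

```python
def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for one independent variable."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b1 * x_bar, b1


def r_squared(xs, ys):
    """Share of the variation in Y explained by the fitted line."""
    b0, b1 = fit_line(xs, ys)
    y_bar = sum(ys) / len(ys)
    unexplained = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    total = sum((y - y_bar) ** 2 for y in ys)
    return 1 - unexplained / total


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r2 = r_squared(x, y)
print(f"R-square = {r2:.2f}")
```

Here R-square comes out at 0.60: the line explains about 60% of the variation in Y, and the remaining 40% is left unexplained, much like the parts of Suresh's personality that no friend has described yet.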

Preparing to summarize, she continued:

“Here, we should also note the theoretical concept of multicollinearity, an important consideration in regression. One practical way to address it is by grouping highly correlated variables. It is often better to focus on these groups rather than on the individual influence of each independent variable, provided the real-world context supports this approach.

These groups are more commonly called Factors. Highly correlated independent variables can be combined through factor analysis and then used to examine their effect on the dependent variable. For example, the qualities of Suresh described by Swapnil and Raj can be grouped as ‘Behavioral Traits, Skills & Habits,’ while the qualities described by Kamya and Nancy can be grouped as ‘Personality Traits (Internal, Stable Qualities).” Professor Vijaya wrote these groups on the blackboard and gave the students time to absorb the idea.
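As a rough illustration of spotting overlapping variables, the sketch below measures the correlation between two sets of scores. Everything in it is an assumption: the scores are invented, the 0.8 cut-off is only a common rule of thumb, and averaging the two lists is a simple stand-in for a group, not a real factor analysis.

```python
from math import sqrt


def pearson_r(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    a_bar = sum(a) / len(a)
    b_bar = sum(b) / len(b)
    co_variation = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    spread_a = sum((x - a_bar) ** 2 for x in a)
    spread_b = sum((y - b_bar) ** 2 for y in b)
    return co_variation / sqrt(spread_a * spread_b)


# Hypothetical scores that Swapnil and Raj give Suresh on five qualities.
swapnil_scores = [7, 8, 6, 9, 5]
raj_scores = [6, 8, 6, 9, 4]

r = pearson_r(swapnil_scores, raj_scores)
print(f"correlation = {r:.2f}")

# Assumed rule of thumb: treat |r| above 0.8 as "highly correlated".
if abs(r) > 0.8:
    # Simple stand-in for a factor: the average of the two overlapping IVs.
    combined = [(s + t) / 2 for s, t in zip(swapnil_scores, raj_scores)]
    print(combined)
```

The two friends' scores move almost in step, so carrying both into a model adds little new information. A proper factor analysis weights the variables rather than averaging them, but the idea of replacing overlapping variables with one group is the same.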


After a few minutes of contemplation, Professor Vijaya began explaining what to check in the regression result table.

“In the real world, some events and outcomes are affected by other events, whether that effect is assumed or evident. In both cases, people are curious to measure the relationship and, through it, to peep into the future. The mathematically estimated insights that regression produces can contribute to present-day policy formulation.

This technique, more than a century old, has earned its fame and still offers estimates for data in which a linear pattern is dominant and visible.”

She paused and emphasized the next line in a near whisper.

“My dear students, it is very important to check the pattern graphically before you click a few buttons or run an AI-generated script to fit a regression. If you know the pattern, you can choose the right model for it.”

“The relationship between dependent and independent variables is not always linear, so check the pattern before making assumptions.” She said in normal pitch of voice.
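Why the pattern matters can be shown with a deliberately extreme, made-up case: Y depends perfectly on X, but along a U-shaped curve. The sketch below fits a straight line by least squares and checks how much it explains.

```python
# Hypothetical data with a perfect but curved (U-shaped) relationship: Y = X squared.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# Slope of the best straight line (least squares).
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

# Share of the variation in Y that the straight line explains.
unexplained = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
total = sum((y - y_bar) ** 2 for y in ys)
r2 = 1 - unexplained / total

print(f"slope = {slope:.2f}, R-square = {r2:.2f}")
```

Both the slope and R-square come out as zero, so a linear regression would report "no relationship" even though X determines Y completely. A scatter plot would have revealed the curve at a glance, which is exactly why the professor asks for a graphical check first.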

The role of regression is simple to understand: it forecasts Y by studying X.

Unlike descriptive statistics, regression is basic and simple to interpret, yet it requires subject knowledge of the field to yield insights for decision making.

Now she moved to the center of the classroom and began explaining the regression result tables in simple terms.

“So, let’s understand what scares you the most in regression: the system-generated tables.”

She wrote 'ANOVA' on the blackboard, and the students eagerly began taking notes.

She said, “ANOVA tests whether your assumption about the relationship between the independent and dependent variables is statistically supported. You have all read this line many times.”

With a smile she added, “Before running a regression in Excel or any other software, you usually begin with an assumption: for example, that irrigation affects rice production, that flexible work schedules influence productivity, or that a particular dose of a drug can raise B12 levels.”

She paused and then said with emphasis, “ANOVA basically tells you whether the relationship you assume among the variables is real or not.

Whether the assumed relationship counts as ‘real’ is judged from the data you chose for the regression.”

Reading the students’ expressions, she sensed that they were following and added, “ANOVA simply clarifies whether the relationship structure you had in mind before the analysis actually exists in your data.

Here, if the p-value is less than your chosen significance level (commonly 0.05), your prior assumption is supported by the data. In that case we conclude that H₀ is rejected and the regression model is significant.”

Seeing that the students no longer looked confused, she added, “In multiple regression, several independent variables work together to predict one dependent variable. ANOVA evaluates whether the overall relationship is statistically significant, while the t-test assesses the significance of each individual independent variable in relation to the dependent variable. The p-value is read the same way as before: if it is less than the chosen significance level, the independent variable significantly affects the dependent variable. And if it does...” She paused, letting the silence settle, then spoke in a slow, peaceful tone, emphasizing the line: “And if it does affect the dependent variable, it tells us something about it. From the pattern in past numbers, it tells us what value of the dependent variable is likely to be observed for a given value of the independent variable.”
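The two checks can be sketched by hand for the simplest case of one independent variable, where the ANOVA F statistic and the slope's t statistic carry the same information (t squared equals F). The data are hypothetical, and the critical value is taken from a standard F table rather than computed; in practice your software reports the p-value directly.

```python
from math import sqrt

# Hypothetical sample with one independent variable (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar

# Split the total variation in Y into explained and unexplained parts.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # unexplained
sst = sum((y - y_bar) ** 2 for y in ys)                       # total
ssr = sst - sse                                               # explained

# ANOVA F statistic: explained variation per model degree of freedom
# divided by unexplained variation per residual degree of freedom.
mse = sse / (n - 2)
f_stat = (ssr / 1) / mse

# t statistic for the slope: the estimate divided by its standard error.
t_stat = b1 / sqrt(mse / sxx)

print(f"F = {f_stat:.2f}, t = {t_stat:.2f}, t squared = {t_stat ** 2:.2f}")

# Critical F for (1, 3) degrees of freedom at the 0.05 level, from a standard F table.
F_CRITICAL = 10.13
print("model significant at 0.05:", f_stat > F_CRITICAL)
```

With only five made-up observations, F is 4.50, well below the critical value, so this tiny model would not be called significant at the 0.05 level. With several independent variables the F-test and the individual t-tests no longer coincide, which is why the result tables report both.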

Here a student raised a hand and, with Professor Vijaya’s permission, asked, “Ma’am, ANOVA tells us whether what we assume about the variables and their relationship is statistically significant or not. The significant effect of an individual independent variable can be confirmed using the t-test. But what about the prediction?”

Professor Vijaya took a deep breath and gazed out of the window from her position near the blackboard. After a pause of a couple of seconds, she said:

"See, the story of regression starts with one basic question: Is the model real?

We check this through ANOVA.

If the model is real, the next question is: Are all the variables significant?

If yes, then we ask: How much change do these variables bring in the dependent variable?

This is estimated through b₁, b₂, and so on.

Here we use the word ‘estimation’ deliberately, because this entire journey of predicting the future value of Y from X is carried out on sample data.

The prime interest of researchers lies in quantifying the effect of X on Y.

These estimates can be single numbers like b₁ or b₂, or they can be expressed as ranges, such as confidence intervals.

The most fascinating and interpretable result of linear regression is the meaning of b₁.

b₁ tells us how much Y changes, on average, when X increases by 1 unit. People use this to run what-if exercises, deliberately changing X by some units and estimating the resulting Y. This should be done only after the t-test has confirmed that the individual variable is significant.

In other words, it measures the per‑unit effect of X on Y.
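A tiny sketch of this per-unit reading, using assumed coefficients (the values 2.2 and 0.6 are hypothetical and not estimated from anything in the story):

```python
# Hypothetical fitted coefficients (not estimated from any data in the story).
b0 = 2.2   # estimated Y when X is 0
b1 = 0.6   # estimated change in Y per 1-unit increase in X


def predict(x):
    """Predicted Y for a given X from the fitted line."""
    return b0 + b1 * x


# Raise X by exactly one unit and compare the two predictions.
change = predict(5) - predict(4)
print(round(change, 2))
```

Whatever starting value of X you pick, moving it up by one unit changes the predicted Y by b₁, which is what makes the coefficient so easy to interpret.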

But then an important question arises:"

She gazed out the window for a moment before raising the questions. The classroom fell so quiet that the waves along the shoreline, barely a hundred meters away, seemed to play softly in the background while the students continued writing with deep interest.

"How accurately can the model measure Y?

Or, how accurate is the predicted version of Y that is generated from X?

Can it have errors?"

Suresh, Kamya, and many other students whispered, “Of course.”

Professor Vijaya then asked a girl named Priya to share her view.

Priya said, “Ma’am, in the real world a regression model may need many more variables than we have data for. We may have a strong gut feeling about those variables, yet we cannot include them because they are not in the data file.

Maybe that is why the predictions of the dependent variable are sometimes above the real values and sometimes below them.”

With a smile, Professor Vijaya raised a question: “If the model has errors in predicting Y, sometimes above and sometimes below the real values as you said, then how much can we trust its predictive ability?”

The students again stopped writing, curiosity showing on their faces.

Noticing this, Professor Vijaya said:

“There is no black-and-white answer to these questions, because regression is a procedure, and its estimates are produced through a systematic process.

So, let us understand its different aspects one by one.

It starts with the subject matter of the field. In the real environment there can be many variables like X that affect Y, some of them known to researchers through past literature or through intuitions formed in the current study. Moreover, as Priya said, many constraints (financial, administrative, or other) keep some variables out of the data file. In that case the model may be compromised, because the full list of variables is not available to predict Y.

The problem is not only the failure to bring in every relevant variable, but also that we can usually gather only sample data rather than population data. On top of that, some gaps may have been left in the sampling procedure, knowingly or unknowingly.

So it is overambitious to expect 100% accurate predictions from regression.

But rather than being depressed about a tiny sample against a gigantic population, or about having too few of the variables that affect Y, we should be optimistic about the hidden stories waiting to be revealed.

Keep your motivation high, make sure your journey does not skid off the road of theoretical assumptions and rules, and stay hydrated with the real meaning of the measures of regression.

This will help you let the cat out of the bag.

As someone once said beautifully, it is always better to have something than nothing.”

The students began applauding, smiling and gesturing to show that their doubts had been cleared.

“Even when most of the variables you believe could be an X affecting Y are absent, focus on what you actually have, what you are blessed with, in your spreadsheet.

The main measures that define the strength of a regression model not only help you gather the points of your story but also give you statistical confirmation of model fit (good model or bad model).

One of them is b₁, which tells you the amount of change in Y for a 1-unit change in X.”

After the discussion, she could see that the students felt satisfied and confident in their understanding. Seeing their smiles, she wrote the equation on the board and continued her explanation.

b₁ = ∑( X – X̄ )( Y – Ȳ ) / ∑( X – X̄ )²

"b₁ is basically a ratio: how the X and Y values vary together, divided by the variation within X.

This way b₁ can tell how Y changes due to changes in X.
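The formula on the board can be followed step by step in Python. The five data points are hypothetical, and the intercept line uses the standard companion formula b₀ = Ȳ − b₁X̄, which is an addition here and not something written on the board in the story.

```python
# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# Numerator: how X and Y vary together.
co_variation = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
# Denominator: how much X varies on its own.
x_variation = sum((x - x_bar) ** 2 for x in xs)

b1 = co_variation / x_variation
b0 = y_bar - b1 * x_bar  # standard companion formula, not shown on the board

print(co_variation, x_variation)
print(round(b1, 2), round(b0, 2))
```

For these numbers the joint variation is 6 and the variation within X is 10, so b₁ is 0.6: each extra unit of X goes with an estimated 0.6 more of Y.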

b₀ describes the scenario in which the independent variable is absent (X = 0). In that case, is it fair to assume that Y disappears or becomes zero?"

Some students whispered “no” again, glancing at one another with cautious confidence while keeping their attention on Professor Vijaya.

Professor Vijaya smiled and said firmly, “Right, it is not logical to assume that the dependent variable becomes zero when the independent variable is absent. Instead, b₀ tells us the expected value of Y when X is zero. For example, if we are predicting product sales based on advertising spending, it would be unfair to assume that sales become zero simply because no money is spent on advertising. Many other independent variables may still influence sales; advertising expenditure is only one of them. In practice, if we use only one independent variable because of practical constraints and b₀ is large or statistically significant, it may indicate that other important variables should also be included to explain Y more completely.”
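The advertising example can be sketched with invented figures. The spending and sales numbers below are assumptions chosen for illustration, fitted with the same least-squares formulas as before.

```python
# Hypothetical monthly figures (illustrative units, e.g. thousands of rupees).
ad_spend = [0, 10, 20, 30, 40]
sales = [50, 62, 71, 83, 94]

x_bar = sum(ad_spend) / len(ad_spend)
y_bar = sum(sales) / len(sales)

b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ad_spend, sales))
      / sum((x - x_bar) ** 2 for x in ad_spend))
b0 = y_bar - b1 * x_bar

print(f"b0 = {b0:.1f}, b1 = {b1:.2f}")
```

The fitted b₀ is about 50.2: even with zero advertising, the model expects sales of roughly 50 units, a baseline driven by everything the model does not include. Each extra unit of advertising adds an estimated 1.09 units of sales on top of that.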

Professor Vijaya then moved toward the blackboard and wrote five points.

Remember, for linear regression:

• It quantifies the change in Y for a 1-unit change in X (through b₁).

• It estimates the value of Y in the absence of X (theoretically X = 0), through b₀.

• ANOVA tells you at the outset whether your theory (your assumption about the relationship between X and Y) is statistically real (significant, p-value < 0.05) or imaginary (insignificant, p-value > 0.05).

• Linear regression rests on some assumptions; only when they are fulfilled can you trust the model.

• R-square tells you the percentage of the variation in Y that can be explained by X.

-Rooshabhkumar Mehta

Reference: Hair, J. F., Black, W. C., Babin, B. J., Anderson, R. E., & Tatham, R. L. (2006). Multivariate data analysis (6th ed.). Pearson Prentice Hall.

Special thanks to Google Gemini AI for picture generation.

Disclaimer

This blog contains fictional stories created for educational and illustrative purposes. All characters, events, and examples are imagined and should not be interpreted as real or as professional advice in statistics, research, medicine, finance, or any other field.

The author makes no guarantees about accuracy or completeness and is not liable for any actions taken based on this content. Readers should verify all concepts through reliable academic or professional sources.