Regression analysis is one of the most widely used statistical techniques, and it forms the basis for many further approaches. Ordinary least squares (OLS) regression provides insight into the dependencies between variables and enables predictions. But do we really understand the logic and scope of this method? In this tutorial, we'll go through the basics of OLS regression in R, using a dataset from a B2B logistics company as an example. For more in-depth advice on data analysis within R, you can always contact us for professional help – our experts know all applications of statistical analysis in R!

## Exploring relationships with OLS regression in R

Scientists are often interested in understanding the relationship between two (or more) concepts. Statistics offers the tools to examine such dependencies using linear OLS regression analysis. However, the starting point of this analysis is always the scientist's idea of the influence that an independent variable (X) has on a dependent variable (Y). In symbols, this idea or *hypothesis* could look like this:

X \rightarrow Y

This expression states that Y is affected by X. OLS linear regression analysis allows us to test this idea from a scientific point of view.

## Sample dataset: Do rush orders predict total orders?

In this article we use the statistical software R for the analysis and a sample dataset of daily demand forecast orders to illustrate our steps. The dataset can be downloaded as a .csv file and then imported into R with the following statements:

`data <- read.csv("Daily_Demand_Forecasting_Orders.csv", header=TRUE, sep=";")`

`head(data)`

The data set includes 60 daily observations (i.e. rows of the table) of demand forecast orders from a real Brazilian logistics company, measured by 13 parameters (i.e. columns of the table). However, for the purposes of this OLS regression in R, we will only focus on two columns or variables, namely:

- Urgent orders (number)
- Total orders (amount)

We will analyze whether the number of urgent orders has a significant impact on the number of total orders. After all, urgent orders can require more resources and thus play a leading role in determining the overall business capacity of the company. In addition, we show how OLS linear regression can be used to predict the number of total orders based on the available information.

## OLS regression in R: visual representation and formula

The idea of OLS regression is easiest to explain graphically. Let's say we're interested in how total orders are affected by urgent orders.

Our two variables can be plotted on the axes of the 2D plot using the following R code:

`attach(data)`

`plot(x=Urgent.order, y=Target..Total.orders., main="Urgent and Total Orders", xlab="Urgent Orders", ylab="Total Orders")`

Number of rush and total orders from a logistics company over a period of 60 days (n=60).

The points on the chart above are the real observations of the daily amounts of urgent and total orders. Each point on the graph represents one day in the 60-day observation period.

A trend can be seen visually: the higher the number of rush orders the company has, the higher the number of its total orders. So far that seems logical. But unfortunately, our visual inspection is not considered a solid analytical tool. Therefore, we use the analytical tools of regression analysis in R to gain further insights.

### Investigation of possible linear relationships

Rush orders represent the independent variable, denoted by X. Total orders, on the other hand, is the dependent variable, denoted by Y. Let's first assume the simplest form of interaction between X and Y, the linear one: geometrically speaking, this relationship forms a straight line. It would mean that we could draw a line through the cloud of observations. But how are we supposed to draw that line? One could think of many possibilities:

3 possible regression lines for the observation points

### Finding the best line for the OLS regression

Above we saw 3 possible regression lines. But which of these is best for OLS regression?

Intuitively it is clear that we want a line to be as close as possible to the observed points. This is exactly the idea of regression analysis. If we want the line to be optimal, we want the distance (red spans in the image below) between the line and the observation points to be minimal:

Regression line drawn through the observation points (green) along with some of the residuals (red)

### How is the minimum distance determined?

So we want the distance between the regression line and the observed values to be minimal – but what exactly does a minimal distance mean?

One could say that each *distance* should be as close to zero as possible. But checking every single gap in the graph above doesn't sound very practical.

Alternatively, we could require the *sum of distances* to be minimal. However, some of these distances are positive, while others are negative. Summing them up would therefore give a result close to zero even if the individual gaps are large.

Fortunately, statisticians have found an elegant solution: we require the *sum of squared distances*, or in mathematical terms the *sum of squared residuals*, to be minimal. Denoting the residuals by *e*, we get the mathematical representation:

\sum_{i=1}^{n} e_{i}^{2} \rightarrow \min

Because of this logic, the procedure is called ordinary least squares estimation or OLS regression analysis.
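
To make this concrete, here is a small R sketch (with hypothetical numbers, not the logistics dataset) that computes the bivariate least squares estimates from the closed-form formulas and checks them against R's built-in `lm()`:

```r
# Hypothetical example data, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 4.3, 5.9, 8.2, 9.8)

# Closed-form bivariate OLS estimates:
# slope beta = cov(x, y) / var(x), intercept alpha = mean(y) - beta * mean(x)
beta  <- cov(x, y) / var(x)
alpha <- mean(y) - beta * mean(x)

# The same estimates via R's built-in lm() function
fit <- lm(y ~ x)
coef(fit)  # intercept and slope match alpha and beta
```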

As outlined above, OLS regression is a standard statistical technique and is implemented in all statistical software. In R, the built-in function `lm()` performs the regression and calculates the optimal regression line.

Before analyzing the R output, let's review regression as a linear dependency. It is known that a line can be formulated analytically as:

y=\alpha+\beta\cdot x

Since the line in our case only approximates the points but does not run through them exactly, we use a slightly different representation:

y=\alpha+\beta\cdot x+e

This formula accounts for the error or remainder term *e*. Here, α denotes the intercept of the line with the y-axis, and β stands for the slope of the regression line.

## Incorporating R functionality: Interpreting OLS regression output in R

Below we outline the syntax to produce an OLS regression output in R.

The R function `lm()` (linear model) is used, and the output containing the relevant information is produced by the `summary()` function:

`model_l <- lm(Target..Total.orders. ~ Urgent.order, data=data)`

`summary(model_l)`

We will now go through and interpret the output step by step.

### OLS Regression in R: The *Call* Section

The first part of the Call output summarizes the regression analysis model in R. In our case, it is a bivariate OLS regression since we only have two variables: a dependent variable and an independent variable. Note that there can only ever be one dependent variable (Y). A regression with two or more independent variables (X) is called multiple OLS regression. These more advanced types of regression are beyond the scope of this article.

### OLS Regression in R: The *Residuals* Section

The next section, Residuals, contains the information about the model's residuals. Here the residuals are summarized by descriptive statistics.

### OLS Regression in R: The *Coefficients* Section

The most important information is contained in the "Coefficients" section. Here we see the least squares estimates of both the intercept α and the slope β of our model. The slope relates to the independent variable and can therefore be found in the Urgent.order row of the table above.

The results of the least squares estimation appear in the Estimate column of the table. Consequently, α = 14.676 and β = 2.407.

#### Interpretation of the OLS regression coefficients

The intercept coefficient can be interpreted as follows: it is the value of the dependent variable Y (total orders) when the independent variable X (rush orders) is zero. In our case, this would mean that the total number of daily orders without urgent orders is around 14-15. However, the value of the intercept is of little practical use for our regression analysis in R.

In contrast, in OLS regression analysis, the interpretation of the slope is always paramount. The slope coefficient β shows how the dependent variable responds to a one-unit change in the independent variable. So if urgent orders increase by one, the total number of orders increases by about 2.4. As we initially assumed, this could be because urgent orders require intensive resource allocation. This information is crucial for business planning.

Other columns in the table contain the standard errors of the estimates and the values of the t-test statistics. The t-test is used to test the following hypotheses:

H_{0}: The independent variable has no significant impact on *Y*

H_{1}: The independent variable has a significant impact on *Y*

The decision on the hypothesis is made on the basis of the *p-value* in the last column of the table, "Pr(>|t|)". If the p-value is less than the error rate, which is usually set at 5% or 0.05, we can reject the null hypothesis and accept the alternative. In our case, we conclude that urgent orders have a significant impact on total orders, since the p-value is 3.72 × 10⁻¹¹.
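
As a sketch of how this decision rule can be automated, the coefficient table of `summary()` can be accessed programmatically; the data here are synthetic, for illustration only:

```r
# Synthetic data, for illustration only
set.seed(1)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30)
model <- lm(y ~ x)

# coef(summary(...)) returns the coefficient table as a matrix;
# its last column, "Pr(>|t|)", holds the t-test p-values
coefs <- coef(summary(model))
p_slope <- coefs["x", "Pr(>|t|)"]

# Decision rule at the usual 5% error rate
significant <- p_slope < 0.05
```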

### OLS Regression in R: The *Model* Section

The last part of the output looks at the overall performance of the OLS regression model. Here we see the R-squared measure, which describes the percentage of the total variance explained by the model. The highest possible value of R-squared is 1, which means that the model explains 100% of the real-world dependencies; the lowest possible value is 0, which means the model is practically useless. The output includes both the multiple R-squared and the adjusted R-squared measure. For bivariate regression analysis in R, either measure is appropriate to assess model performance. In our case, the model explains about 52% of the total variance or information, which is not a bad result.

The F-test provides further important information. Unlike the t-test, this test does not concern a single variable or coefficient, but instead tests the model as a whole:

H_{0}: No variable has a significant impact on *Y*. In other words, the model does not provide any additional information.

H_{1}: At least one variable has a significant impact on *Y*. In other words, the model provides significant insights into real-world dependencies.

The decision rule here (and in any other statistical test) is the same as for the t-test above: if the p-value is less than 0.05, we reject the null hypothesis and accept the alternative, and vice versa. In our case, the F-test statistic is 66.09 on 1 and 58 degrees of freedom (df), with a p-value of 3.719 × 10⁻¹¹. This means that the model is statistically significant.

In summary, it can be said that the number of rush orders from the logistics company has a statistically significant positive effect on the total order volume. We arrived at this result using the bivariate OLS regression analysis in R.

## Did we do everything right? Diagnostics of OLS regression in R

After the regression in R is done, the question arises whether the model is mathematically justifiable. Like any other theory, OLS regression analysis is based on several assumptions, namely:

- Linearity: *Y* depends on *X* through a linear relationship
- Independence: the observation points (x_{i}, y_{i}) are independent of each other, which means that the residuals are also independent
- Normality: the residuals *e* are normally distributed
- Homoscedasticity: probably the least intuitively clear assumption. It means that the variance of the residuals remains constant throughout the regression dataset.

### How to check OLS regression assumptions in R with the *plot()* function

There are various methods and even ready-made packages to test the assumptions of OLS in R. In this article, we'll look at one of these options using R's built-in `plot()` function:

`par(mfrow=c(2,2))`

`plot(model_l)`

The first command is used to change the visual output and allow four charts to be displayed at the same time:

Charts verifying assumptions for OLS regression

The first plot, "Residuals vs. Fitted", is useful for assessing linearity and homoscedasticity: linearity is satisfied if the residuals (points on the plot) are predominantly distributed around the zero line. Homoscedasticity means that there is no visible pattern in the residuals; this is sometimes described as the residuals being scattered like "stars in the night sky". Another diagnostic tool that can be used is the Breusch-Pagan test for homoscedasticity. However, this is not the subject of this introductory tutorial. For in-depth statistical advice on OLS regression, you can contact our experts at any time!
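
As a minimal, self-contained sketch (on synthetic data), the normality of the residuals can also be checked numerically with the Shapiro-Wilk test from base R; the Breusch-Pagan test mentioned above would require the additional `lmtest` package:

```r
# Synthetic data, for illustration only
set.seed(42)
x <- 1:50
y <- 3 + 2 * x + rnorm(50)
model <- lm(y ~ x)

# Shapiro-Wilk test on the residuals (base R)
# H0: the residuals are normally distributed
sw <- shapiro.test(residuals(model))
sw$p.value  # values above 0.05 give no evidence against normality

# Homoscedasticity could additionally be tested with
# lmtest::bptest(model) if the lmtest package is installed.
```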

The second plot, or QQ plot, is used to test the normality assumption: the closer the residual points are to the 45-degree dotted line, the more likely the normality assumption is satisfied. As we can see, this is mostly the case with the observed values.

The third plot is also useful for checking the homoscedasticity assumption: as in the first plot, we don't want to see any particular pattern in the residuals here.

Based on the above considerations, one can conclude that the assumptions of the bivariate regression in R are satisfied. We can therefore consider the results of the least squares estimate to be significant.

For a further and more detailed analysis of the assumptions of regression analysis in R, you can always consult the professional statistical analysis services of Novustat.

## What does the future hold: predictions with the OLS model in R

One of the most exciting features of regression analysis in R is the ability to predict values of the dependent variable Y given the values of the independent variable X.

In our example, we might want to predict the total order volume of our logistics company when urgent orders are 100 items. To do this, we first need to define a new data set with these values:
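
Since the dataset is not bundled here, the following sketch rebuilds a synthetic stand-in whose coefficients roughly match the estimates above; only the column names `Urgent.order` and `Target..Total.orders.` are taken from the article's dataset:

```r
# Synthetic stand-in for the logistics data; the coefficients 14.7 and 2.4
# mimic the estimates found above and are assumptions for illustration
set.seed(7)
Urgent.order <- runif(60, 50, 150)
Target..Total.orders. <- 14.7 + 2.4 * Urgent.order + rnorm(60, sd = 20)
data <- data.frame(Urgent.order, Target..Total.orders.)

model_l <- lm(Target..Total.orders. ~ Urgent.order, data = data)

# New observation: 100 urgent orders
newdata <- data.frame(Urgent.order = 100)

# Point prediction with a 95% prediction interval
pred <- predict(model_l, newdata = newdata, interval = "prediction", level = 0.95)
pred
```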

Then, with the built-in function `predict()`, we get the corresponding total order values at the 95% confidence level:

Therefore, we can estimate that for urgent orders of 100 items, the total orders will be around 255 items. The `predict()` function also allows the creation of prediction/confidence intervals and includes other prediction options not covered in our introductory tutorial.

## Moving on: Possible variations and extensions of the OLS model in R

This tutorial showed the basic workings of OLS regression in R. In addition, we also explained the statistical and mathematical logic behind this approach.

We used a simple example of bivariate regression analysis in R. However, it is often advantageous to include more than one predictor (independent variable X) in the model. This can often improve the model's goodness of fit and predictive power.

However, we have addressed all the important steps in performing regression analysis in R. Novustat experts will be happy to support you and advise you statistically on further specific questions on the subject.

## FAQs

### How to perform OLS regression in R? ›

**The following step-by-step example shows how to perform OLS regression in R.**

- Step 1: Create the Data. For this example, we'll create a dataset that contains the following two variables for 15 students: ...
- Step 2: Visualize the Data. ...
- Step 3: Perform OLS Regression. ...
- Step 4: Create Residual Plots.

**What is OLS regression analysis in R? ›**

OLS regression in R programming is **a type of statistical technique that is used for modeling**. It is used to analyze the linear relationship between a response variable and one or more explanatory variables. If the relationship between the two variables is linear, a straight line can be drawn to model their relationship.

**How do you explain OLS regression? ›**

Ordinary least squares (OLS) regression is **a statistical method of analysis that estimates the relationship between one or more independent variables and a dependent variable**; the method estimates the relationship by minimizing the sum of the squares of the differences between the observed and predicted values of the dependent variable.

**Is OLS regression the same as multiple regression? ›**

**Multiple regression is an extension of simple linear (OLS) regression**, which uses just one explanatory variable; multiple regression uses two or more. MLR is used extensively in econometrics and financial inference.

**How do you calculate regression using OLS? ›**

**OLS: Ordinary Least Square Method**

- Take the difference between the dependent variable and its estimate.
- Square the difference.
- Take the summation over all data points.
- To get the parameters that minimize the sum of squared differences, take the partial derivative with respect to each parameter and set it equal to zero.

**Why do you use OLS regression? ›**

Intuitively speaking, the aim of the ordinary least squares method is **to minimize the prediction error between the predicted and real values**. One may ask why we choose to minimize the sum of squared errors instead of the sum of errors directly.

**What are the limitations of OLS? ›**

The four frequently violated OLS assumptions that do not affect GRNN are **(1) linear functional relationship, (2) data distribution, (3) resilience to outliers, and (4) independence of observations**. The reasons why these assumptions cause problems for OLS are reviewed and then reexamined in GRNN context.

**What are the assumptions of OLS multiple regression? ›**

Five main assumptions underlying multiple regression models must be satisfied: **(1) linearity, (2) homoskedasticity, (3) independence of errors, (4) normality, and (5) independence of independent variables**.

**What are the advantages of OLS? ›**

**Ordinary least squares (OLS) models**

- Advantages: The statistical method reveals information about cost structures and distinguishes between different variables' roles in affecting output. ...
- Disadvantages: Large data set is necessary in order to obtain reliable results.

**What are the four assumptions behind OLS linear regression? ›**

Linearity: The relationship between X and the mean of Y is linear. Homoscedasticity: The variance of residual is the same for any value of X. Independence: Observations are independent of each other. Normality: For any fixed value of X, Y is normally distributed.

### Is OLS the same as logistic regression? ›

**In OLS regression, a linear relationship between the dependent and independent variable is a must, but in logistic regression, one does not assume such things**. The relationship between the dependent and independent variable may be linear or non-linear.

**How do you improve OLS regression? ›**

**How to improve the accuracy of a Regression Model**

- Handling Null/Missing Values.
- Data Visualization.
- Feature Selection and Scaling.
- Feature Engineering.
- Feature Transformation.
- Use of Ensemble and Boosting Algorithms.
- Hyperparameter Tuning.

**Under what conditions is an OLS model appropriate to use? ›**

In a nutshell, your linear model should **produce residuals that have a mean of zero, have a constant variance, and are not correlated with themselves or other variables**. If these assumptions hold true, the OLS procedure creates the best possible estimates.

**Why is OLS the best estimator? ›**

OLS Estimator is Efficient

An estimator that is unbiased and has the minimum variance is the best (efficient). The OLS estimator is the best (efficient) estimator because **OLS estimators have the least variance among all linear and unbiased estimators**.

**What reduces OLS regression? ›**

The most commonly used procedure used for regression analysis is called ordinary least squares (OLS). The OLS procedure minimizes **the sum of squared residuals**.

**Is OLS lm in R? ›**

OLS regression in R

**The standard function for regression analysis in R is `lm`**. Its first argument is the estimation formula, which starts with the name of the dependent variable – in our case y – followed by the tilde sign `~`.

**Does LM function in R use OLS? ›**

**The R function `lm()` is used to create the OLS regression model**. If the model generates a straight-line equation, it resembles linear regression. OLS regression is a good fit for numerical datasets in machine learning. The bivariate regression takes the form of the equation y = α + β·x + e shown earlier.

**How do you calculate r squared OLS? ›**

**R-squared is computed as one minus the ratio of the sum of squared residuals (SSR) to the total sum of squares (SST)**:

R^{2} = 1 - \frac{SSR}{SST} = 1 - \frac{\sum_{i} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i} (y_{i} - \bar{y})^{2}}

The sum of squared residuals measures how far the data lie from the fitted values, and the total sum of squares measures how far the data lie from their mean.
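
A short R sketch (on synthetic data) confirming that this manual computation agrees with the value reported by `summary()`:

```r
# Synthetic data, for illustration only
set.seed(3)
x <- 1:20
y <- 1 + 0.8 * x + rnorm(20)
model <- lm(y ~ x)

# Manual R-squared: 1 - SSR / SST
ssr <- sum(residuals(model)^2)   # sum of squared residuals
sst <- sum((y - mean(y))^2)      # total sum of squares
r2_manual <- 1 - ssr / sst

# Agrees with the value reported by summary()
r2_builtin <- summary(model)$r.squared
```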

**Can you use OLS in non linear regression? ›**

The teaching of OLS in nonlinear regression analysis is discussed for models that are nonlinear in parameters only in the sense that **a suitable (log) transformation can make them linear in parameters**. In this case, the method of ordinary least squares (OLS) is applied to the transformed equations.

**Does OLS assume linearity? ›**

The Assumption of Linearity (OLS Assumption 1) – **if you fit a linear model to data that are non-linearly related, the model will be incorrect and hence unreliable**. When you use the model for extrapolation, you are likely to get erroneous results. Hence, you should always plot a graph of observed versus predicted values.

### Does OLS assume linear relationship? ›

Thus, **linearity in parameters is an essential assumption for OLS regression**. However, whenever we choose to go for OLS regression, we just need to ensure that the 'y' and 'x' (or the transformed 'y' and the transformed 'x') are linearly related. The linearity of β's is assumed in the OLS estimation procedure itself.

**Does OLS assume normal distribution? ›**

**OLS does not require that the error term follows a normal distribution** to produce unbiased estimates with the minimum variance. However, satisfying this assumption allows you to perform statistical hypothesis testing and generate reliable confidence intervals and prediction intervals.


**Why OLS is not used in logistic regression? ›**

OLS assumes that the distribution should be normally distributed, but in logistic regression, **the distribution may be normal, Poisson, or binomial**.

**What is the difference between glm and OLS? ›**

In OLS, the assumption is that the residuals follow a normal distribution with mean zero and constant variance. This is not the case in a GLM, where the variance of the predicted values is a function of E(y).

**What is a good R-squared value for regression? ›**

For example, in scientific studies, the R-squared may need to be above 0.95 for a regression model to be considered reliable. In other domains, an R-squared of just 0.3 may be sufficient if there is extreme variability in the dataset.

**What is the difference between R and R-squared? ›**

**The Pearson correlation coefficient (r) is used to identify patterns in things whereas the coefficient of determination (R²) is used to identify the strength of a model**.

**What is the formula for the OLS estimator? ›**

In all cases the formula for the OLS estimator remains the same: **\hat{\beta} = (X'X)^{-1}X'y**; the only difference is in how we interpret this result. OLS estimation can be viewed as a projection onto the linear space spanned by the regressors.
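
This matrix formula can be verified directly in R on synthetic data by comparing it with `lm()`:

```r
# Synthetic data, for illustration only
set.seed(5)
n <- 25
x <- rnorm(n)
y <- 4 + 1.5 * x + rnorm(n)

# Design matrix with a leading column of ones for the intercept
X <- cbind(1, x)

# OLS estimator: beta_hat = (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

# Matches the coefficients from lm()
coef(lm(y ~ x))
```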
