Overview of the Partial Least Squares Node
The Partial Least Squares (PLS) node is located on the Model tab of the SAS Enterprise Miner toolbar. Partial least squares is a tool that you can use to model continuous and binary targets. The SAS Enterprise Miner PLS node is based on SAS/STAT PROC PLS. The SAS Enterprise Miner PLS node analyzes one target variable and produces DATA step score code and standard predictive model assessment results.
Data mining problems that might traditionally be approached using multiple linear regression techniques become more difficult when there are many input variables or there is significant collinearity between variables. In these instances, regression models tend to overfit the training data and do not perform well when modeling other data. Often this is the case when just a few latent variables among the many input variables are responsible for most of the variation in response or target variable values.
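The instability that collinearity causes can be illustrated with a small sketch. It is not part of SAS Enterprise Miner: the function name fit_two_inputs and the five-row data set are hypothetical, and the fit is a no-intercept least-squares regression on already centered data. The two inputs are almost identical, so a change of 0.1 in a few target values is enough to move the fitted coefficients a long way.

```python
def dot(u, v):
    """Sum of element-wise products of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def fit_two_inputs(x1, x2, y):
    """Least-squares coefficients (b1, b2) for y ~ b1*x1 + b2*x2, no intercept."""
    s11, s12, s22 = dot(x1, x1), dot(x1, x2), dot(x2, x2)
    s1y, s2y = dot(x1, y), dot(x2, y)
    det = s11 * s22 - s12 * s12  # close to zero when x1 and x2 are collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2

# Hypothetical centered data: x2 differs from x1 by at most 0.01.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [-1.99, -1.01, 0.0, 0.99, 2.01]

# Two targets that differ by at most 0.1 in any row.
y_a = [-4.0, -2.0, 0.0, 2.0, 4.0]
y_b = [-3.9, -2.1, 0.0, 1.9, 4.1]

coef_a = fit_two_inputs(x1, x2, y_a)
coef_b = fit_two_inputs(x1, x2, y_b)
print("coefficients for y_a:", [round(b, 4) for b in coef_a])
print("coefficients for y_b:", [round(b, 4) for b in coef_b])
```

For this data the coefficients move from approximately (2, 0) to approximately (-8, 10), even though the two targets are nearly the same. Coefficients this sensitive to small changes in the training data are one reason such models do not perform well on other data.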
Partial least squares is useful for extracting the latent input variables that account for the greatest variation in the predicted target. To many data miners, PLS means “projection to latent structures.” PLS is useful for identifying latent variables from a large pool of inputs. However, the analytical results of the PLS tool are not useful for identifying variables of minor or no importance. That task should be performed using other SAS Enterprise Miner data mining tools, such as the Variable Selection node.
It is difficult to identify the weighting of the latent input predictor variables that PLS uses, because the weights are based on cross-product relations with the target variable rather than on the covariances among the input variables themselves, as is typical in common factor analysis.
Partial Least Squares Node Algorithm
The PLS algorithm is a multivariate extension of multiple linear regression that was developed by Herman Wold in the 1960s as an econometric technique. Since then, PLS has been widely used in industrial modeling and process control systems, where processes can have hundreds of input variables and scores of outputs. Today PLS is also used in data mining projects for marketing, social sciences, and education.
The PLS algorithm reduces the set of variables (both input and target) to principal component matrices. The input variable components are used to predict the scores on the target variable components, and then the target variable component scores are used to predict the value of the target variable.
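The component extraction described above can be sketched for the single-target case with a NIPALS-style iteration. This is a minimal illustration under stated assumptions, not the code that the PLS node or PROC PLS runs: the names pls1_fit and pls1_predict and the eight-row data set are hypothetical, and with only one target the target-side component reduces to a single regression coefficient (q) per input component.

```python
import math

def dot(u, v):
    """Sum of element-wise products of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def pls1_fit(X, y, n_components):
    """Extract n_components latent components for a single target (sketch)."""
    n = len(X)
    x_mean = [sum(col) / n for col in zip(*X)]
    y_mean = sum(y) / n
    E = [[v - m for v, m in zip(row, x_mean)] for row in X]  # centered inputs
    f = [v - y_mean for v in y]                              # centered target
    components, scores = [], []
    for _ in range(n_components):
        # Weights come from cross-products of each input column with the target.
        w = [dot(col, f) for col in zip(*E)]
        norm = math.sqrt(dot(w, w))
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]             # component scores
        tt = dot(t, t)
        p = [dot(col, t) / tt for col in zip(*E)]  # input loadings
        q = dot(f, t) / tt                         # target coefficient
        # Deflate: remove what this component explains before the next one.
        E = [[e - ti * pj for e, pj in zip(row, p)] for row, ti in zip(E, t)]
        f = [fi - q * ti for fi, ti in zip(f, t)]
        components.append((w, p, q))
        scores.append(t)
    return {"x_mean": x_mean, "y_mean": y_mean,
            "components": components, "scores": scores}

def pls1_predict(model, x):
    """Score one new row by repeating the projection and deflation steps."""
    e = [v - m for v, m in zip(x, model["x_mean"])]
    y_hat = model["y_mean"]
    for w, p, q in model["components"]:
        t = dot(e, w)
        y_hat += q * t
        e = [ej - t * pj for ej, pj in zip(e, p)]
    return y_hat

# Hypothetical data: the second input is nearly twice the first (collinear).
X = [[1.0, 2.1, 5.0], [2.0, 3.9, 3.0], [3.0, 6.2, 6.0], [4.0, 7.8, 2.0],
     [5.0, 10.1, 7.0], [6.0, 11.9, 1.0], [7.0, 14.2, 8.0], [8.0, 15.8, 4.0]]
y = [4.2, 4.9, 8.1, 8.0, 12.3, 11.6, 16.2, 15.9]

model = pls1_fit(X, y, n_components=2)
t1, t2 = model["scores"]
print("dot product of the two score vectors:", dot(t1, t2))
print("prediction for first row:", pls1_predict(model, X[0]))
```

The weights are built from cross-products between the inputs and the target, and each component is removed from the inputs before the next one is extracted. As a result, the score vectors t1 and t2 are orthogonal up to rounding error, even though the first two raw inputs are strongly collinear.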
Why does the PLS linear method work when collinearity exists between input variables? The component scores that become inputs in the predictive model are linear combinations of the original input variables, and they are constructed so that no correlation exists between them.
Resources: SAS Help