^{1}Department of Pharmaceutical Quality Assurance, A.R. College of Pharmacy & G.H. Patel Institute of Pharmacy, Vallabh Vidyanagar
^{2}Department of Pharmaceutical Chemistry, A.R. College of Pharmacy & G.H. Patel Institute of Pharmacy, Vallabh Vidyanagar
Chemometrics is the use of mathematical and statistical methods to improve the understanding of chemical information and to correlate quality parameters or physical properties to analytical instrument data. It is a data-driven multidisciplinary science that allows maximum collection and extraction of useful information from the analytical data of numerous application areas such as chemistry, biochemistry, medicine, biology and chemical engineering. This review focuses mainly on chemometric models such as the multi-way models parallel factor analysis (PARAFAC), Tucker-3, and N-partial least squares (N-PLS), and the bilinear models principal component regression (PCR) and partial least squares (PLS), together with their applications in the pharmaceutical sciences. Chemometric approaches can be used to analyze the data obtained from various instruments including near infrared (NIR), attenuated total reflectance Fourier transform infrared (ATR-FTIR), high-performance liquid chromatography (HPLC), and terahertz pulse spectroscopy. These techniques have been used in the quality assurance and quality control of pharmaceutical solid dosage forms. The review also lists software used for multivariate analysis.
Chemometrics is the application of statistical and mathematical methods to analytical data to permit maximum collection and extraction of useful information. The utility of chemometric techniques as tools enabling multidimensional calibration of selected spectroscopic, electrochemical, and chromatographic methods is demonstrated. Application of this approach mainly for interpretation of UV-Vis and near-IR (NIR) spectra, as well as for data obtained by other instrumental methods, makes identification and quantitative analysis of active substances in complex mixtures possible, especially in the analysis of pharmaceutical preparations present in the market. Such analytical work is carried out by the use of advanced chemical instruments and data processing, which has led to a need for advanced methods to design experiments, calibrate instruments, and analyze the resulting data. This review will concentrate on gaining an understanding of how chemometrics can be useful in the modern analytical laboratory. A selection of the most challenging problems faced in pharmaceutical analysis is presented, the potential for chemometrics is considered, and some consequent implications for utilization are discussed. Advances in electronics and computing over the past 30 years have revolutionized the analytical laboratory. Technological developments have allowed instruments to become smaller, faster, and cheaper, while continuing to increase accuracy, precision, and availability. Data analysis methods have also benefitted from advances in technical computing; commercially available mathematical programming packages allow scientists to perform complex calculations with a few simple keystrokes. Furthermore, software sold with many commercial instruments contains automatic data processing algorithms (e.g., Fourier transform analysis, data filtering, and peak recognition). 
The advances in computing allow researchers to obtain increasing amounts of chemically relevant information from their data; however, this is not always achieved using simple data-processing techniques. Svante Wold first coined the term “kemometri” (“chemometrics” in English) in 1972 by combining the words kemo for chemistry and metri for measure. Presently, the journal Chemometrics and Intelligent Laboratory Systems defines chemometrics as “the chemical discipline that uses mathematical and statistical methods to design or select optimal procedures and experiments, and to provide maximum chemical information by analyzing chemical data”. The field of chemometrics has also benefitted from technological advances in the past 30 years, causing the number of researchers using chemometric methods to grow. Chemometrics has brought us a large number of valuable tools that allow us to resolve and display complex chemical information from often dynamic, unstable systems. In general, the main areas of focus for these techniques have been multivariate calibration; pattern recognition, classification, and discriminatory analysis; and reactant and reaction process modeling and monitoring. The purpose of this review is to describe various multivariate calibration chemometric methods in combination with UV-Vis spectrophotometry, near-IR spectroscopy (NIRS), fluorescence spectroscopy, electroanalysis, chromatographic separation, and flow-injection analysis for the analysis of drugs in pharmaceutical preparations. [1] The term “chemometrics” was coined by Svante Wold in 1972 and is concerned with the application of mathematical and statistical techniques, supported by computer science, to extract chemical and physical information from complex data. Various algorithms and analogous approaches are available for processing and evaluating the data. 
They can be applied in various fields such as medicine, pharmacy, food control, and environmental monitoring, and chemometrics continues to expand into new fields such as metabonomics. Chemometrics is applied in areas comprising experimental parameter optimization, data quality improvement, identification and quantification of targeted chemical components, pattern recognition techniques for clustering and classification, multivariate model establishment to correlate chromatographic properties with molecular descriptors, and prediction of properties or activities of chemical compounds or technological materials (quantitative structure-activity or structure-property relationships), with the aim of finding hidden relationships between the available data and the desired information. [2] Chemometrics has been evolving as a subdiscipline in chemistry for over 30 years as the need for advanced statistical and mathematical methods has increased with the increasing sophistication of chemical instrumentation and processes. “Chemometrics is a chemical discipline that uses mathematics, statistics, and formal logic (a) to design or select optimal experimental procedures; (b) to provide maximum relevant chemical information by analyzing chemical data; and (c) to obtain knowledge about chemical systems.” In the 1972 review of Statistical and Mathematical Methods in Analytical Chemistry, there were only two areas of active study, “Curve Fitting” and “Statistical Control”, where curve fitting was described as the analytical chemists’ area of interest whereas chemical engineers were reported to be primarily concerned with quality control. In 1971, Svante Wold of the University of Umeå in Sweden coined the term “chemometrics” in a grant proposal, and shortly after, his collaboration with Bruce Kowalski of the University of Washington brought the name to the United States. 
As part of the development of chemometrics as a separate subdiscipline, the International Chemometrics Society was formed in 1974. The first paper with chemometrics in the title appeared in 1975; in it, Kowalski suggested that chemometrics had developed to the point that it was now a functioning research area in the chemical sciences. He pointed to the value of pattern recognition and indicated that there were vehicles for publication of research results in journals such as the Journal of Chemical Information and Computer Sciences, but that it would be appropriate for Analytical Chemistry to publish such work. Although a limited number of papers were published in these journals, there were significant impediments to publishing chemometrics articles, as there remained considerable skepticism in the analytical community as to the need for complex data analysis tools. In the view of many established chemists, the need for complicated data analysis tools was a sign that the proper experiments had not been performed; they did not see that advanced data analysis was integral to the maximal use of evolving new technologies. The introduction of the section “Computer Techniques and Optimization” in Analytica Chimica Acta in 1977 was the first journal publication that was clearly dedicated to this developing area. In 1980, Analytical Chemistry changed the name of the review section on Statistical and Mathematical Methods in Analytical Chemistry to Chemometrics, and the field entered the mainstream. Subsequently, in 1982, the separate section in Analytica Chimica Acta was terminated because chemometrics had become sufficiently well accepted to eliminate the need for this special attention. 
Subsequently, two journals dedicated to chemometrics were launched: Chemometrics and Intelligent Laboratory Systems and Journal of Chemometrics. These were intended to provide a vehicle for discussion of the details of the methods, while applications would generally be published in the broader analytical journals. Thus, the critical point is that chemometric methods had become commonly accepted into the practice of chemistry. New analytical instruments have continued to be developed, producing data that demand new and more effective data analysis methods, while the increasing capability of personal computers has permitted more computationally intensive calculations to be performed without the need for access to “super” computers. This combination of developments has opened many new options for improving data analysis methods. Thus, over the intervening years, chemometrics has come to play a significant role within analytical chemistry, including incorporation into the operating systems of a number of commercial analytical instruments. This paper will review the major areas of chemometrics related to analytical chemistry including multivariate calibration, pattern recognition, and mathematical mixture resolution and then highlight some of the new directions that chemometric methods are taking. [3]
Chemometrics [4]
The term chemometrics was first coined in 1971 to describe the growing use of mathematical models, statistical principles, and other logic-based methods in the field of chemistry and, in particular, the field of analytical chemistry. Chemometrics is an interdisciplinary field that involves multivariate statistics, mathematical modeling, computer science, and analytical chemistry. Some major application areas of chemometrics include
In many respects, the field of chemometrics is the child of statistics, computers, and the “information age.” Rapid technological advances, especially in the area of computerized instruments for analytical chemistry, have enabled and necessitated phenomenal growth in the field of chemometrics over the past 30 years. For most of this period, developments have focused on multivariate methods. Since the world around us is inherently multivariate, it makes sense to treat multiple measurements simultaneously in any data analysis procedure. For example, when we measure the ultraviolet (UV) absorbance of a solution, it is easy to measure its entire spectrum quickly and with low noise, rather than measuring its absorbance at a single wavelength. By properly considering the distribution of multiple variables simultaneously, we obtain more information than could be obtained by considering each variable individually. This is one of the so-called multivariate advantages. The additional information comes to us in the form of correlation. When we look at one variable at a time, we neglect correlation between variables, and hence miss part of the picture. A recent paper by Bro described four additional advantages of multivariate methods compared with univariate methods. Noise reduction is possible when multiple redundant variables are analyzed simultaneously by proper multivariate methods. For example, low-noise factors can be obtained when principal component analysis is used to extract a few meaningful factors from UV spectra measured at hundreds of wavelengths. Another important multivariate advantage is that partially selective measurements can be used, and by use of proper multivariate methods, results can be obtained free of the effects of interfering signals. A third advantage is that false samples can be easily discovered, for example in spectroscopic analysis. 
For any well characterized chemometric method, aliquots of material measured in the future should be properly explained by linear combinations of the training set or calibration spectra. If new, foreign materials are present that give spectroscopic signals slightly different from the expected ingredients, these can be detected in the spectral residuals and the corresponding aliquot flagged as an outlier or “false sample.” The advantages of chemometrics are often the consequence of using multivariate methods. The reader will find these, and other advantages highlighted throughout the book. [4]
What is Chemometrics? [5]
The definition of chemometrics is evident in its name, where chemo- means chemical and metrics means measurement; thus, chemometrics is the study of chemical (and biochemical) measurements and is a branch of analytical chemistry. Examples of chemometric applications include:
ORIGIN AND DEVELOPMENT OF CHEMOMETRICS [6,7]
In 1971, a Swedish scientist Svante Wold coined the term “kemometri,” in Swedish and in English it is equivalent to “chemometrics”. The science of chemometrics can briefly be described as the interaction of certain mathematical and statistical methods in chemical measurement processes. It has been developed as a consequence of the change in the data obtained with the emergence of new analytical techniques as well as microprocessors. During 1986–1987 two journals – named “Chemometrics and Intelligent Laboratory Systems” and “Journal of Chemometrics” – were published. The breakthrough in chemometrics came in the 21st century by various software development companies, which promoted equipment intellectualization and offered new methods for the construction of new and high-dimensional hyphenated equipment. This hyphenated equipment has opened many new options for data analytical method improvement. Now, chemometrics plays a major role in analytical chemistry. [6]
Chemometrics is a branch of science that applies statistical and mathematical methods to extract data related to the chemical and physical phenomena involved in a manufacturing process. It can be applied to predictive problems, such as predicting target properties or desired features, and to descriptive problems, such as model composition, identification, and understanding. Chemometrics is applied in multivariate data collection and analysis. Various algorithms and analogous approaches are available for processing and evaluating the data. They can be applied in various fields, such as medicine, pharmacy, food control, and environmental monitoring. Some of the chemometric models used for analysis are described below. [7]
DIFFERENT CHEMOMETRICS MODELS FOR ANALYSIS OF DATA [8]
Chemometric methods can be categorized in several different ways, for example into clustering, regression, and explorative methods. Chemometricians have adopted methods from other research fields such as econometrics and psychometrics, where bilinear partial least squares and multiway methods, respectively, had been applied and refined. Methods are separated according to how they explore the data arrays, and a distinction can be drawn between bilinear, non-linear, and multiway methods as well as between projection, latent variable, and factor-based methods. However, some methods fall into more than one of these categories.
Bilinear model:
Bilinearity means the system is linear with respect to its decomposition, i.e. the system is linear in its estimated parameters. In bilinear models, the data are arranged in matrices so that each row corresponds to a sample and each column to a variable. Bilinear chemometric techniques include the following:
Principal Component Analysis (PCA)
This is a simple, non-parametric technique used for extracting the relevant information from data sets on the basis of their similarities and differences. It is widely used in multivariate data analysis. PCA is used for dimensionality reduction and multivariate data compression in many fields of science. During process monitoring, it can be used to establish the correlation structure between variables and to examine changes in it, thereby reducing the number of variables that need to be followed in the process. If a variety of variables are measured for a series of sites, objects, or persons, then each variable will have a variance, and usually the variables will be associated with each other; that is, there will be covariance between pairs of variables. Today, PCA is one of the most widely used multivariate models because of its broad applicability to multivariate problems. [8-12,28-30]
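The dimensionality reduction described above can be sketched with a singular value decomposition of the mean-centred data matrix. This is a minimal illustration, not a validated implementation: the function name `pca_svd` and the simulated data (20 samples, 50 variables, two latent factors) are assumptions introduced here.

```python
import numpy as np

def pca_svd(X, n_components):
    """PCA of X (rows = samples, columns = variables) via SVD."""
    Xc = X - X.mean(axis=0)                       # column-wise mean centring
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]    # sample coordinates
    loadings = Vt[:n_components].T                     # orthonormal variable directions
    explained = (s ** 2) / np.sum(s ** 2)              # fraction of variance per component
    return scores, loadings, explained[:n_components]

# Hypothetical data: two latent factors plus a small amount of noise
rng = np.random.default_rng(0)
T_true = rng.normal(size=(20, 2))
P_true = rng.normal(size=(2, 50))
X = T_true @ P_true + 0.01 * rng.normal(size=(20, 50))

scores, loadings, explained = pca_svd(X, n_components=2)
print(scores.shape, loadings.shape, explained.sum())
```

In this simulated case two components capture almost all of the variance, so the 50 measured variables can be summarised by two scores per sample.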
Special cases of PCA:
The most widely used special cases of PCA are principal component regression (PCR), soft independent modelling of class analogy (SIMCA) and multi-way principal component analysis (MPCA).
Principal component regression (PCR) is a regression analysis technique that is based on principal component analysis (PCA). Typically, it considers regressing the outcome (also known as the response or the dependent variable) on a set of covariates (also known as predictors, explanatory variables, or independent variables) based on a standard linear regression model, but uses PCA for estimating the unknown regression coefficients in the model. In PCR, instead of regressing the dependent variable on the explanatory variables directly, the principal components of the explanatory variables are used as regressors. One major use of PCR lies in overcoming the multicollinearity problem, which arises when two or more of the explanatory variables are close to being collinear. Soft independent modelling of class analogy (SIMCA) is a statistical method for supervised classification of data. The method requires a training data set consisting of samples (or objects) with a set of attributes and their class membership. The term soft refers to the fact that the classifier can identify samples as belonging to multiple classes and does not necessarily produce a classification of samples into non-overlapping classes. SIMCA has gained widespread use as a classification method, especially in applied statistical fields such as chemometrics and spectroscopic data analysis.
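The PCR procedure described above (PCA on the predictors, then regression of the response on the scores) can be sketched as follows. The function names `pcr_fit` and `pcr_predict` and the simulated collinear data are assumptions made for illustration only.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Principal component regression: PCA on X, then least squares on the scores."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                      # loadings of the retained components
    T = Xc @ P                                   # scores used as regressors
    q = np.linalg.lstsq(T, yc, rcond=None)[0]    # regress y on the scores
    b = P @ q                                    # coefficients in the original variable space
    b0 = y_mean - x_mean @ b                     # intercept
    return b, b0

def pcr_predict(X, b, b0):
    return X @ b + b0

# Hypothetical collinear data: 100 variables driven by 2 latent factors, 30 samples
rng = np.random.default_rng(0)
T_true = rng.normal(size=(30, 2))
X = T_true @ rng.normal(size=(2, 100)) + 0.01 * rng.normal(size=(30, 100))
y = T_true @ np.array([1.0, -2.0])

b, b0 = pcr_fit(X, y, n_components=2)
rmse = np.sqrt(np.mean((pcr_predict(X, b, b0) - y) ** 2))
print(rmse)
```

Here there are more variables than samples and the columns of X are strongly collinear, so ordinary multiple linear regression would be ill-posed, whereas the regression on two scores is well conditioned.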
Multi-way principal components analysis (MPCA) is an efficient tool for reducing higher dimensional data arrays. MPCA allows the detection of spatial and temporal factors of influence and the classification of the parameters can be considered according to these factors. Rene et al., developed a multi-way principal components analysis for a complex data array resulting from physicochemical characterization of natural waters. Molloy et al., applied a multiway principal component analysis for Identification of process improvements in pharmaceutical manufacture. The need for some form of PCA which is weighted according to measurement errors has been recognized for some time. Informally this has been addressed through a variety of scaling procedures, such as range scaling and auto scaling as well as more elaborate schemes. Under the right conditions these can reduce PCA to a maximum likelihood method. However, a number of researchers have developed methods to more rigorously incorporate measurement errors into the modeling process. This is normally done by minimizing the usual weighted residual sum of squares in accordance with some p-dimensional model. Mathematically, if X is an m×n data matrix, this corresponds to minimization of
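$$S^2 = \sum_{i=1}^{m}\sum_{j=1}^{n}\frac{(x_{ij}-\hat{x}_{ij})^2}{\sigma_{ij}^2} \qquad (1)$$
where $x_{ij}$ is an element of X, $\sigma_{ij}$ is the standard deviation of its measurement error, and $\hat{x}_{ij}$ corresponds to the estimated value of the measurement. In the general case where there are no offsets in the model, this is given by
$$\hat{X} = AB \qquad (2)$$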
where A is m×p and B is p×n. By analogy to PCA, A and B correspond to score and loading matrices, but in equation (2) the individual vectors are not required to be orthogonal. A variety of methods have been devised to obtain A and B through minimization of equation (1), and these differ largely in their representation of the problem, the constraints applied to the solution, and their approach to the nonlinear optimization. Gabriel and Zamir describe a method based on ‘criss-cross regressions’ as a means to obtain lower-rank approximations of the matrix X. Paatero and Tapper have described what they call ‘positive matrix factorization’ (PMF) and have applied this to environmental problems. In addition to satisfying the minimization criterion, PMF also requires A and B to be positive. In this work, maximum likelihood PCA (MLPCA) is presented as an alternative to the above methods. Although MLPCA is similar to these methods in that it seeks to minimize equation (1), it has some important differences. First, it is formulated in terms of singular value decomposition (SVD), which is a very common method for implementing PCA. Second, the standard MLPCA algorithm consists of an alternating least squares procedure which is robust, easy to implement, and very efficient compared with conventional gradient search methods. Finally, unlike the methods described above, which require measurement errors to be independent, MLPCA allows the inclusion of error covariance in virtually any form. The MLPCA method described here should not be confused with maximum likelihood common factor analysis (MLCFA), which appears frequently in the literature outside of chemistry. Although the terms are often used interchangeably by chemists, PCA and factor analysis are distinctly different approaches to multivariate analysis. The principles of MLCFA were originally developed by Lawley and Maxwell and later employed in programs such as LISREL. 
More recently, MLCFA has appeared in the chemical literature with claims that it performs better than PCA. However, MLCFA was developed with the intention of finding structural models for random variables. As such, it estimates covariance matrices for random variables and does not generally use information about measurement errors. Another errors-in-variables method that has become popular recently is total least squares (TLS). This method uses SVD for the purpose of developing a regression model and is similar to MLPCA in some ways. However, it is less general in its ability to obtain maximum likelihood estimates of model parameters.
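To make the weighted residual sum of squares discussed above concrete, the sketch below evaluates it for an ordinary rank-p truncated SVD estimate of X. It does not implement MLPCA itself. The function names, the simulated data, and the uniform error standard deviation of 0.05 are assumptions for illustration.

```python
import numpy as np

def weighted_objective(X, X_hat, sigma):
    """Sum of squared residuals, each scaled by its measurement-error standard deviation."""
    return np.sum(((X - X_hat) / sigma) ** 2)

def truncated_svd_estimate(X, p):
    """Rank-p estimate X_hat = A B, with A (m x p) as scores and B (p x n) as loadings."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    A = U[:, :p] * s[:p]
    B = Vt[:p]
    return A, B, s

rng = np.random.default_rng(0)
m, n, p = 15, 8, 2
X_clean = rng.normal(size=(m, p)) @ rng.normal(size=(p, n))
sigma = np.full((m, n), 0.05)                  # hypothetical uniform error s.d.
X = X_clean + sigma * rng.normal(size=(m, n))

A, B, s = truncated_svd_estimate(X, p)
obj = weighted_objective(X, A @ B, sigma)
print(obj)
```

When every measurement has the same error standard deviation, the objective is just the unweighted residual sum of squares divided by a constant, so the truncated SVD (ordinary PCA without centring) already minimizes it. When the error standard deviations differ between elements this is generally no longer the case, which is the situation the weighted methods described above are meant to address.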
Partial Least Squares (PLS) [8,13-16,31]
Partial least squares (PLS) is a regression method for multivariate data. It is one of the most widely implemented methods for describing the relationship between different sets of observed variables by means of latent variables. The basic idea is that it models the relations between sets of observed variables through a small number of latent variables (which are not directly observed or measured), combining regression, dimension reduction, and modelling tools. The latent variables are chosen to maximize the covariance between the different sets of variables. PLS is related to canonical correlation analysis (CCA) and can be used as a discrimination tool and as a dimension reduction method like principal component analysis (PCA). PLS is widely used because it can process large chemical data sets. For example, the flow properties of pharmaceutical powders have been determined by near infrared (NIR) spectroscopy using the partial least squares technique, and Cordeiro et al. conducted multivariate spectroscopic determination of lamivudine-zidovudine associations by partial least squares regression. Regression by means of projections to latent structures (PLS) is today a widely used chemometric data analysis tool. It applies to any regression problem in industrial research, development, and production (RDP), regardless of whether the data set is short/wide or long/lean, contains linear or non-linear systematic structure, has or lacks missing data, and is possibly also ordered in two or more blocks across multiple model layers. PLS exists in many different shapes and implementations. The two-block predictive PLS version is the form most often used in science and technology. It is a method for relating two data matrices, X and Y, by a linear multivariate model, but it goes beyond traditional regression in that it also models the structure of X and Y. 
PLS derives its usefulness from its ability to analyze data with many, noisy, collinear, and even incomplete variables in both X and Y. PLS has the desirable property that the precision of the model parameters improves with the increasing number of relevant variables and observations. The regression problem, i.e., how to model one or several dependent variables, responses, Y, by means of a set of predictor variables, X, is one of the most common data-analytical problems in science and technology. Examples in RDP include relating Y = analyte concentration to X = spectral data measured on the chemical samples (Example 1), relating Y = toxicant exposure levels to X = gene expression profiles for rats for the different doses (Example 2), and relating Y = the quality and quantity of manufactured products to X = the conditions of the manufacturing process (Example 3). Traditionally, this modelling of Y by means of X is done using MLR, which works well as long as the X-variables are fairly few and fairly uncorrelated, i.e., X has full rank. With modern measuring instrumentation, including spectrometers, chromatographs, sensor batteries, and bio-analytical platforms, the X-variables tend to be many and also strongly correlated. We shall therefore not call them "independent", but instead "predictors", or just X-variables, because they usually are correlated, noisy, and also incomplete.
In handling numerous and collinear X-variables, and response profiles (Y), PLS allows us to investigate more complex problems than before, and analyze available data in a more realistic way. However, some humility and caution is warranted; we are still far from a good understanding of the complications of chemical, biological, and technological systems. Also, quantitative multivariate analysis is still in its infancy, particularly in applications with many variables and few observations (objects, cases).
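The two-block predictive PLS model described above can be sketched for a single response (often called PLS1) with the NIPALS algorithm. This is a minimal illustration under stated assumptions: the function name `pls1_fit` and the simulated data are introduced here, and issues such as choosing the number of components by cross-validation, scaling, and missing data are not handled.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """PLS regression for one response via NIPALS with deflation of X and y."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    n_vars = X.shape[1]
    W = np.zeros((n_vars, n_components))    # X weights
    P = np.zeros((n_vars, n_components))    # X loadings
    q = np.zeros(n_components)              # y loadings
    for a in range(n_components):
        w = Xk.T @ yk                       # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                          # latent variable (score vector)
        tt = t @ t
        p = Xk.T @ t / tt
        q[a] = (yk @ t) / tt
        Xk = Xk - np.outer(t, p)            # remove the modelled part of X
        yk = yk - q[a] * t                  # remove the modelled part of y
        W[:, a], P[:, a] = w, p
    b = W @ np.linalg.solve(P.T @ W, q)     # coefficients for the centred X
    b0 = y_mean - x_mean @ b
    return b, b0

# Hypothetical data: many collinear X-variables, few observations
rng = np.random.default_rng(0)
T_true = rng.normal(size=(30, 2))
X = T_true @ rng.normal(size=(2, 100)) + 0.01 * rng.normal(size=(30, 100))
y = T_true @ np.array([1.0, -2.0])

b, b0 = pls1_fit(X, y, n_components=2)
rmse = np.sqrt(np.mean((X @ b + b0 - y) ** 2))
print(rmse)
```

Each weight vector w is chosen from the covariance between the current X residuals and the y residuals, which is how the latent variables come to describe both the structure of X and its relation to Y.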
Multi-way models [8,17,18,19]
Multi-way models are used when the data are multivariate and linear in more than two dimensions. They can be regarded as models in n dimensions for which the system is linear in each of the n dimensions. A trilinear system is often visualized as a data cube and is called 3-way data or a 3-way array, whereas a bilinear system is a rectangular matrix that can be considered 2-way data. Multi-way modelling originated in psychological data treatment, where bilinear data analysis methods were not adequate. Multi-way methods are also applied to process control as well as regression analysis. Methods such as multiway principal component analysis (MPCA) and multiway partial least squares (MPLS) have been recognized as useful tools for monitoring batch data, since they improve the understanding of the process and summarize its behaviour in a batch-wise manner. However, when the raw data are of higher order it becomes difficult for these models to describe them, and multi-way methods that work directly on three-way or higher arrays, such as parallel factor analysis (PARAFAC), PARAFAC-2, Tucker-3, and N-partial least squares (N-PLS), are the methods of choice.
Parallel factor analysis (PARAFAC) [8, 20-24]
Parallel factor analysis (PARAFAC) is a decomposition method used for the modelling of three-way or higher data and is chiefly intended for data having consistent variable profiles within each batch. The history of PARAFAC is as follows. Cattell (1944) reviewed seven principles for the choice of rotation in component analysis and advocated the principle of “parallel proportional profiles” as the most fundamental. This principle means that two data matrices with the same variables ought to contain the same components. By using this assumption as a constraint, Harshman (1970) proposed a new method to analyse two or more data matrices that contain scores for the same persons on the same variables and termed the method PARAFAC. PARAFAC can be regarded as a generalization of the principal component analysis (PCA) projection method to a multi-way array. The PARAFAC model has a second-order advantage, i.e. it can handle interferents in new samples by fitting the new interferent with an extra component.
Parallel factor analysis-2 (PARAFAC-2) [8, 20-24]
PARAFAC-2 is also designed for modelling N-way data but, in contrast to PARAFAC, it handles experiments of different lengths and variable profiles that are shifted or in a different phase. In PARAFAC, trilinearity is a basic requirement, whereas PARAFAC-2 allows deviations from trilinearity. However, it should be noted that PARAFAC-2 can accommodate such deviations only to some extent, in one mode only, and in cases where the deviations from linearity are regular. Both techniques are mainly applied to extract chemical information from experiments that form a 3-way or higher data structure, for example chromatographic data, fluorescence spectroscopy, time-resolved spectroscopic data with overlapping spectral profiles, and process data.
Tucker-3 [8, 25]
The Tucker-3 method can be used for compression and analysis of an N-way array. The Tucker-3 model consists of loading matrices, one per mode, that are typically rectangular, and a (P, Q, R)-dimensional core array G. The Tucker-3 core array differs from the PARAFAC core in that it may have non-zero off-diagonal elements, whereas PARAFAC has a so-called super-diagonal core array; PARAFAC can thus be expressed as a special case of the Tucker-3 model. Because it provides a loading matrix for each of the N modes, the model can be used for exploring N-way array data. The generality of the Tucker-3 model, and the fact that it covers the PARAFAC model as a special case, has made it an often used model for decomposition, compression, and interpretation in many applications.
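The relation between the core array and the loading matrices can be sketched as below. Note that the decomposition used here is a truncated higher-order SVD, chosen for brevity; it is a swapped-in technique and not the least squares fitting procedure usually used for Tucker-3. The function names and the simulated arrays are assumptions for illustration.

```python
import numpy as np

def unfold(X, mode):
    """Matricize a 3-way array along the given mode."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def tucker3_hosvd(X, ranks):
    """Loading matrices from the SVD of each unfolding, then the core by projection."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(X, mode), full_matrices=False)
        factors.append(U[:, :r])
    A, B, C = factors
    G = np.einsum('ijk,ip,jq,kr->pqr', X, A, B, C)   # (P, Q, R) core array
    return G, A, B, C

def tucker3_reconstruct(G, A, B, C):
    return np.einsum('pqr,ip,jq,kr->ijk', G, A, B, C)

# Hypothetical array with exact Tucker-3 structure and a (2, 3, 2) core
rng = np.random.default_rng(0)
G_true = rng.normal(size=(2, 3, 2))
A0, B0, C0 = rng.normal(size=(6, 2)), rng.normal(size=(7, 3)), rng.normal(size=(5, 2))
X = tucker3_reconstruct(G_true, A0, B0, C0)

G, A, B, C = tucker3_hosvd(X, ranks=(2, 3, 2))
rel_err = np.linalg.norm(X - tucker3_reconstruct(G, A, B, C)) / np.linalg.norm(X)

# PARAFAC as a special case: a super-diagonal core with equal dimensions in all modes
F = 2
G_diag = np.zeros((F, F, F))
G_diag[np.arange(F), np.arange(F), np.arange(F)] = 1.0
Af, Bf, Cf = rng.normal(size=(6, F)), rng.normal(size=(7, F)), rng.normal(size=(5, F))
X_parafac = np.einsum('if,jf,kf->ijk', Af, Bf, Cf)
same = np.allclose(tucker3_reconstruct(G_diag, Af, Bf, Cf), X_parafac)
print(rel_err, same)
```

The second part shows numerically that a Tucker-3 model whose core has ones on the super-diagonal and zeros elsewhere reproduces the PARAFAC model built from the same loadings.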
N- Partial least squares (N-PLS) [8, 26]
To handle multiway data, an extension of the PLS method named N-PLS was introduced; it uses the dependent and independent variables to find latent variables that describe maximal covariance. N-PLS decomposition starts by constructing a distinct PARAFAC-like model for the dependent (response) variables and maximizes the covariance between the two arrays. In one study, the concentration of triglycerides in humans was investigated using fluorescence spectroscopy within the region 220-900 nm. Nonlinear partial least squares with a cubic B-spline-based nonlinear transformation was used as the chemometric strategy. Wavelengths within the regions of 300-367 nm and 386-392 nm in the first derivative of the original fluorescence spectrum formed the optimized wavelength combination for the prediction model.
Classical Least Squares (CLS) [1,27]
CLS involves the application of multiple linear regression (MLR) to the classical expression of the Beer-Lambert law of spectroscopy:
A = KC (or dA/dλ = KC)
where the matrixes A and dA/dλ represent the zero-order absorbance and derivative absorbance matrixes, respectively; C is the concentration matrix; and K is the calibration coefficient matrix. The CLS technique can only be applied to systems where every constituent in the sample is known. If there is the possibility of contaminants in the (unknown) samples that were not present in the calibration mixtures, the model will not be able to predict the constituent concentrations accurately.
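A CLS calibration can be sketched in a few lines. The layout assumed here (wavelengths in rows and samples in columns, so that A = KC holds as a matrix product), the simulated pure-component spectra, and the interferent are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_wavelengths, n_constituents, n_cal = 40, 3, 10

# Hypothetical pure-component spectra (columns of K) and calibration concentrations
K_true = rng.uniform(0.1, 1.0, size=(n_wavelengths, n_constituents))
C_cal = rng.uniform(0.0, 1.0, size=(n_constituents, n_cal))
A_cal = K_true @ C_cal                      # Beer-Lambert law: A = K C

# Calibration: least squares estimate of K from known A and C
K_hat = A_cal @ np.linalg.pinv(C_cal)

# Prediction: least squares estimate of c for a new spectrum
c_new = np.array([0.2, 0.5, 0.3])
a_new = K_true @ c_new
c_pred = np.linalg.pinv(K_hat) @ a_new

# Same sample with an interferent that was absent from the calibration mixtures
interferent = rng.uniform(0.1, 1.0, size=n_wavelengths)
c_bad = np.linalg.pinv(K_hat) @ (a_new + 0.3 * interferent)

print(c_pred, c_bad)
```

In this noise-free simulation the concentrations of the clean sample are recovered, while the spectrum containing the unmodelled interferent gives biased estimates, which illustrates the limitation described above.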
Inverse Least Squares (ILS) [1,27]
ILS, sometimes known as P-matrix calibration, is so called because it involves the application of MLR to the inverse expression of the Beer-Lambert law of spectroscopy:
C = PA (or C = P × dA/dλ)
where the matrices A and dA/dλ represent the zero-order absorbance and derivative absorbance matrices, respectively; C is the concentration matrix; and P is the calibration coefficient matrix. Since it is not necessary to know the composition of the training mixtures beyond the constituents of interest, the ILS method is better suited to more complex types of analyses not handled by CLS. Disadvantages of the ILS method are that wavelength selection can be difficult and time-consuming, the number of wavelengths used in the model is limited by the number of calibration samples, and a large number of samples is required for accurate calibration.
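A minimal sketch of ILS calibration is given below, again on synthetic, noise-free data and assuming NumPy. The sensitivities, the concentrations, and the choice of three wavelengths with eight calibration samples are hypothetical. Only the analyte concentrations are supplied to the calibration, although an interferent also varies in the mixtures.

```python
import numpy as np

rng = np.random.default_rng(2)
n_wl, n_cal = 3, 8                      # few selected wavelengths, more samples

# Two absorbing species: the analyte and an interferent (hypothetical)
K_true = rng.uniform(0.1, 1.0, size=(n_wl, 2))
C_all = rng.uniform(0.5, 2.0, size=(2, n_cal))
A_cal = K_true @ C_all                  # noise-free absorbances

# Only the analyte concentrations are assumed to be known
c_analyte = C_all[0:1, :]               # shape (1, n_cal)

# Calibration: least-squares estimate of P from C = PA
P_hat = c_analyte @ np.linalg.pinv(A_cal, rcond=1e-10)

# Prediction for a new mixture containing both species
c_new = np.array([1.1, 1.6])            # analyte, interferent
a_new = K_true @ c_new
analyte_pred = float((P_hat @ a_new)[0])

print(analyte_pred)
```

Because the interferent varied in the calibration set, the inverse model accounts for it implicitly. With real, noisy spectra the pseudoinverse becomes unstable when the number of wavelengths approaches the number of calibration samples, which is the limitation noted above.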
Principal Component Regression (PCR) and Partial Least Squares Regression (PLS) [1]
Among the different regression methods available for multivariate calibration, the factor analysis-based methods, including PLS and PCR, have received considerable attention in the chemometrics literature. PLS and PCR can be used directly for ill-conditioned data by extracting latent variables (factors); the number of latent variables is lower than the number of objects. These techniques are powerful multivariate statistical tools that have been successfully and widely applied to the quantitative analysis of spectroscopic data because of their ability to overcome problems common to such data, such as collinearity, band overlaps, and interactions, and because of the ease of their implementation due to the availability of software. Here, only a brief introduction to PCR and PLS is given, as the techniques are routinely used. PCR is a widely used regression model for data having a large degree of covariance in the independent (predictor) variables or where ill-conditioned matrices are present. Instead of regressing the concentrations onto the original measured variables (the spectra), PCR performs a principal component analysis (PCA) decomposition of the spectral data X and then regresses the concentration information onto the principal component scores. The collinearity problem is avoided by omitting the lower-ranked principal components, whose vectors have small magnitude, which in turn reduces the noise (error) present within the system. PLS regression is related to both PCR and MLR. PCR aims to find the factors that capture most of the variance within the data before regression onto the concentration variables, whereas MLR seeks a single factor that best correlates the data with the concentrations. PLS attempts to maximize the covariance, thus capturing variance and correlating the data together. Because PLS searches for the factor space most congruent to both matrices, its predictions are often reported to be better than those of PCR.
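The PCR procedure described above (PCA decomposition of the spectra, followed by regression of the concentrations onto the retained scores) can be sketched as follows. This is an illustrative implementation assuming NumPy; the function names `pcr_fit` and `pcr_predict`, the synthetic two-component spectra, and the noise level are hypothetical.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Principal component regression of y on X (samples x wavelengths)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # PCA via singular value decomposition of the centered spectra
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    U, s, Vt = U[:, :n_components], s[:n_components], Vt[:n_components]
    # Scores are T = U * s; their columns are orthogonal, so the
    # regression of yc on T reduces to (U^T yc) / s
    b_scores = (U.T @ yc) / s
    beta = Vt.T @ b_scores              # coefficients per wavelength
    return x_mean, y_mean, beta

def pcr_predict(model, X_new):
    x_mean, y_mean, beta = model
    return y_mean + (X_new - x_mean) @ beta

rng = np.random.default_rng(3)
K = rng.uniform(0.1, 1.0, size=(2, 20))        # hypothetical pure spectra
C_cal = rng.uniform(0.5, 2.0, size=(12, 2))    # calibration concentrations
X_cal = C_cal @ K + 1e-3 * rng.normal(size=(12, 20))
y_cal = C_cal[:, 0]                            # analyte of interest

model = pcr_fit(X_cal, y_cal, n_components=2)

c_new = np.array([1.0, 1.5])
x_new = c_new @ K
pred = float(pcr_predict(model, x_new[None, :])[0])
print(pred)
```

Two components are retained here because the synthetic mixtures contain two absorbing species; the discarded components carry only the simulated noise.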
PCR and PLS techniques share many similarities, and the theoretical relationships between them have been covered extensively in the literature. Both PLS and PCR decompose the data into spectral loadings and scores and then build the model with the aid of these new variables. In PCR, the data decomposition is done using only spectral information, while PLS uses both spectral and concentration data. Historically, PCR predates PLS. However, since its introduction, PLS appears, by most accounts, to have become the method of choice among chemists. PLS almost always required fewer latent variables than PCR, but this did not appear to influence predictive ability. Global models, such as PLS, implicitly endeavor to include the variation due to external effects in the model, in much the same way as unknown chemical interferences can be included in an inverse calibration model. Provided the interfering variation is present in the calibration set, an inverse calibration model can, in the ideal case of additivity and linearity, easily correct for the variation due to unknown interferences. It is assumed in global calibration models that the new sources of spectral variation can be modeled by including a limited number of additional PLS factors. Owing to the increase in the calibration model's dimensionality, it becomes necessary to measure a large number of samples under changing conditions in order to make a good estimation of the additional parameters. When highly nonlinear effects are present in the spectra, many additional PLS factors are necessary to model the spectral differences, and occasionally it is not possible to model these differences at all.
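To make the difference from PCR concrete, the sketch below implements PLS for a single response (PLS1) with the commonly used NIPALS-style deflation, in which each weight vector is computed from both the spectra and the concentrations. It assumes NumPy; the function name `pls1_fit` and the synthetic data are hypothetical, and this is one of several equivalent PLS algorithms rather than a specific implementation from the cited literature.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """PLS1 regression of y on X (samples x wavelengths) by deflation."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr                   # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = Xr @ w                      # scores
        tt = t @ t
        p = Xr.T @ t / tt               # spectral loadings
        q_a = yr @ t / tt               # concentration loading
        Xr = Xr - np.outer(t, p)        # deflate spectra
        yr = yr - q_a * t               # deflate concentrations
        W.append(w)
        P.append(p)
        q.append(q_a)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, beta

rng = np.random.default_rng(4)
K = rng.uniform(0.1, 1.0, size=(2, 20))        # hypothetical pure spectra
C_cal = rng.uniform(0.5, 2.0, size=(12, 2))    # calibration concentrations
X_cal = C_cal @ K + 1e-3 * rng.normal(size=(12, 20))
y_cal = C_cal[:, 0]

x_mean, y_mean, beta = pls1_fit(X_cal, y_cal, n_components=2)

c_new = np.array([1.0, 1.5])
x_new = c_new @ K
pred = float(y_mean + (x_new - x_mean) @ beta)
print(pred)
```

The only structural difference from PCR is the line computing `w`: the latent variables are steered by the concentrations instead of by spectral variance alone.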
Artificial Neural Networks (ANNs) [1]
The calibration problem is normally thought to be a linear one. However, Long et al. found that the calibration becomes nonlinear when high levels of noise are present. In this case, an ANN approach can provide better prediction results. ANNs were originally designed to mimic the function of the human brain. They consist of a number of simple processing units (or neurons) linked by weighted, modifiable interconnections. ANNs have been developed for quantitative analysis of samples during the last decade. Compared with MLR, ANN is a more flexible modeling methodology, since both linear and nonlinear functions can be used (or combined) in the processing units. This allows more complex relationships between a high-dimensional descriptor space and the given retention data, and may lead to better predictive power of the resulting ANN model compared with MLR. However, the major disadvantage of ANN is also directly related to its complex model structure. Compared to MLR, it suffers from the perception of being a "black box" heuristic tool, and ANN models are generally more difficult to interpret. Furthermore, to get robust calibrations by ANN, the number of samples must be higher than the number of weights to be estimated, which normally implies the use of a large number of calibration samples. For this reason, methods for reducing the input dimensionality by mathematical preprocessing (fast Fourier transform, PCA, and variance analysis) are often applied before building ANNs.
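The effect of input reduction on the number of weights can be illustrated with a short calculation. The sketch below assumes a fully connected network with one hidden layer and bias terms; the function name `mlp_weight_count` and the sizes (700 wavelengths, 5 principal component scores, 5 hidden units) are hypothetical.

```python
def mlp_weight_count(n_inputs, n_hidden, n_outputs=1):
    """Adjustable weights, including biases, in a one-hidden-layer network."""
    hidden_layer = (n_inputs + 1) * n_hidden
    output_layer = (n_hidden + 1) * n_outputs
    return hidden_layer + output_layer

# Raw spectrum with 700 wavelengths fed directly to the network
raw = mlp_weight_count(n_inputs=700, n_hidden=5)
# Same network fed with 5 principal component scores instead
reduced = mlp_weight_count(n_inputs=5, n_hidden=5)

print(raw, reduced)   # 3511 36
```

Under these assumptions, a robust calibration on the raw spectra would need more than 3511 samples, whereas the reduced network needs more than 36.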
Locally Weighted Regression (LWR) [1]
LWR applies PCR or PLS combined with a weighting scheme such that the calibration samples closest to the sample to be predicted are given higher weight. Many variants are possible, and the variant that should probably be preferred is one that has PLS for all samples with equal weights as the limiting case. In that case, when the data are linear and not clustered, the local method automatically becomes a global method. LWR was shown to perform well when clustered or nonlinear data were analyzed, but it has not been studied to the same extent as the preceding methods, so fewer diagnostics are available, and it is less well known in which cases the method should be preferred or avoided. One disadvantage is that it is more time-consuming than other methods, because it requires that several sets of latent variables be determined. There are also indications that it is less robust towards wavelength shifts than other methods.
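The weighting idea can be sketched with a simplified example. Note the assumptions: for brevity, the local model below is a weighted linear regression on the raw variable and not a local PCR or PLS model, the weights follow the tricube function (a common but not the only choice), and the function name `lwr_predict`, the quadratic test data, and the neighbourhood size are hypothetical. NumPy is assumed.

```python
import numpy as np

def lwr_predict(X_cal, y_cal, x0, k):
    """Predict y at x0 from a weighted linear fit to the k nearest samples."""
    d = np.linalg.norm(X_cal - x0, axis=1)
    idx = np.argsort(d)[:k]                     # k nearest calibration samples
    d_max = d[idx].max()
    w = (1.0 - (d[idx] / d_max) ** 3) ** 3      # tricube weights
    X_loc = np.column_stack([np.ones(k), X_cal[idx]])
    sw = np.sqrt(w)
    # Weighted least squares via row scaling by the square-root weights
    coef, *_ = np.linalg.lstsq(X_loc * sw[:, None], y_cal[idx] * sw, rcond=None)
    return float(np.concatenate(([1.0], x0)) @ coef)

# Nonlinear response on a single variable (hypothetical data)
x = np.linspace(0.0, 3.0, 31)
X_cal = x[:, None]
y_cal = x ** 2

x0 = np.array([1.5])
local_pred = lwr_predict(X_cal, y_cal, x0, k=7)

# Global straight-line fit for comparison
X_glob = np.column_stack([np.ones(len(x)), x])
coef_glob, *_ = np.linalg.lstsq(X_glob, y_cal, rcond=None)
global_pred = float(coef_glob[0] + coef_glob[1] * x0[0])

print(local_pred, global_pred)
```

For this curved response the local fit lies much closer to the true value of 2.25 than the global straight line. If the neighbourhood were enlarged to all samples with equal weights, the local model would coincide with the global one.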
Radial Basis Functions (RBF) Combined with PLS [1]
RBF-PLS is a local method. It seems to perform particularly well for difficult data structures. Too little is known, however, about its practical use. It is not clear yet in which exact circumstances it works well and what pitfalls may occur.
Table 1: Concepts of the main chemometrics techniques applied in questioned document analysis [31]
Software used for multivariate analysis
Table 2: List of software used for multivariate analysis [31]
TECHNIQUES USED IN CHEMOMETRICS [7]
Nowadays, various instrumental techniques such as HPLC, NIR, and FTIR are being combined with chemometric models such as PLS, CLS, PCR, and other multivariate analysis methods for the evaluation of different pharmaceutical properties of tablets, powders, granules, and so forth. The most popular technique used is NIR spectroscopy.
APPLICATIONS [8]
Applications of chemometrics in the pharmaceutical field:
Chemometrics can be broadly applied to exploratory analysis, regression analysis, and classification of data. The specific applications of various chemometric models in the pharmaceutical field are as follows.
A. Diagnosis and drug synthesis
B. Powder flow properties [8,28]
To determine pharmaceutical properties such as mean particle size, angle of repose, tablet porosity, and tablet hardness, NIR spectra of antipyrine granules were measured and analysed by principal component regression. With an increase in the amount of water, the mean particle size of the granules was found to increase from 81 µm to 650 µm, and it was possible to make larger spherical granules with a narrow particle size distribution using a high-speed mixer.
C. Water content determination in excipients [8,32]
The water content of hygroscopic pharmaceutical excipients largely affects the manufacturing processes and the performance of the final product. The water content of commonly used tablet disintegrants, namely crospovidone, croscarmellose sodium, and sodium starch glycolate, was studied by simple linear regression.
D. Dissolution studies
Ana Rita explored the application of near-infrared spectroscopy and multivariate data analysis to monitor in-situ and in real-time dissolution tests of an immediate release formulation containing folic acid and four excipients.
E. Tablet-parametric method
The drug content and hardness of intact theophylline tablets have been predicted using an artificial neural network and near-infrared spectroscopy. Tanabe et al. employed NIR spectroscopic methods in combination with principal component regression (PCR) analysis for predicting the hardness of tablet formulations consisting of berberine chloride, lactose, and potato starch. The reflectance NIR spectra of various compressed tablets were used as a calibration set to establish a calibration model for tablet hardness, in which the predicted and actual hardness values showed a linear relationship with an r² of 0.925.
F. Formulation development
Formulation and assessment of protein-loaded solid dispersions were performed by nondestructive methods such as powder X-ray diffraction (PXRD) and near-infrared chemical imaging (NIRCI) in combination with principal component analysis and partial least squares regression.
G. Salt and polymorph screening
Mixtures of polymorphs of carbamazepine can be detected with PCA score plots, and multivariate regression methods such as PLS can be used to estimate the composition of these mixtures.
H. Pharmaceutical analysis
Various chemometric methods in combination with UV-Visible spectrophotometry, NIR spectroscopy, fluorescence spectroscopy, electroanalysis, chromatographic separation, and flow-injection analysis for the analysis of drugs in pharmaceutical preparations have been reported.
Chemometrics is of great industrial importance in various chromatographic research areas, because a lot of experimental work must be carried out to optimize different columns, test compounds, mobile phases and their pH, flow rate, and peak shape parameters. Chromatographic techniques coupled with chemometric tools provide useful information on separation and elution time. Validation parameters such as robustness and ruggedness are also best evaluated with the help of chemometrics. For binary mixture analysis, HPLC is combined with different calibration techniques such as PLS, PCR, and CLS; the combined methods are referred to as HPLC-CLS, HPLC-PCR, and HPLC-PLS. Several liquid chromatographic methods in combination with chemometrics have been applied and reported in method optimization and validation studies in diverse research areas. Chemometrics has also been widely used in combination with various spectroscopic techniques for the analysis of active molecules in dosage forms, plant materials, and biological samples. A chemometrics-assisted simple UV spectroscopic determination of carbamazepine in human serum was reported, and the results were compared with reference methods.
Furthermore, multivariate chemometrics has been used in metabolomics (the study of small-molecule metabolite profiles) along with GC/MS and NMR techniques to characterize the metabolic profiles of biofluids, understand the mechanism of pathogenesis, and uncover potential biomarkers of disease progression.
J. Chemometric aids in medicine and pharmacy [32,33]
Present-day automated laboratory instruments in biological and medical research produce a vast volume of measurement data, which are difficult to absorb and interpret. A challenging task for the powerful mathematical and statistical methods of chemometrics is therefore to reduce these data and reveal all the useful information. The important role of chemometrics in medicine was recognized long ago, shortly after it became widespread in the life sciences. There exist several areas of medicine as well as pharmacy where the aid of chemometrics is essential and indispensable:
K. Quality control of laboratory tests, measurement standardization [31]
A common way of evaluating the biochemical state of "normality" or "abnormality" is based on statistically derived reference intervals, which serve as the standards by which laboratory test results are judged. Careful design of the experimental protocol is the key to carrying out any evaluation of clinical diagnostic value. Reference interval development has classically relied on concepts elaborated by the International Federation of Clinical Chemistry Expert Panel on Reference Values during the 1980s. An important part of quality control is the statistical comparison of agreement between laboratory tests. Selection of an optimal measurement method is usually based on comparison of a newly developed procedure with the traditional one. In such a case, both compared methods represent random variables, as neither of them is error-free. Consequently, the use of ordinary least squares regression is not appropriate, since its basic assumption of an error-free independent variable is violated, which may cause critical errors when standard regression procedures are used.
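The consequence of ignoring the error in the independent variable can be illustrated with simulated data. The sketch below compares ordinary least squares with Deming regression, which is one commonly used errors-in-variables alternative; the source does not name a specific replacement method, so this choice is an assumption, as are the simulated measurements, the error levels, and the function names. The parameter `delta` is the ratio of the error variance of the new method (y) to that of the traditional method (x).

```python
import random
from math import sqrt

def centered_sums(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return mx, my, sxx, syy, sxy

def ols_fit(x, y):
    """Ordinary least squares: assumes x is measured without error."""
    mx, my, sxx, _, sxy = centered_sums(x, y)
    slope = sxy / sxx
    return slope, my - slope * mx

def deming_fit(x, y, delta=1.0):
    """Deming regression: allows measurement error in both x and y."""
    mx, my, sxx, syy, sxy = centered_sums(x, y)
    d = syy - delta * sxx
    slope = (d + sqrt(d * d + 4.0 * delta * sxy * sxy)) / (2.0 * sxy)
    return slope, my - slope * mx

# Two methods measuring the same true values with equal random error
random.seed(0)
truth = [random.uniform(0.0, 10.0) for _ in range(500)]
x = [t + random.gauss(0.0, 2.0) for t in truth]   # traditional method
y = [t + random.gauss(0.0, 2.0) for t in truth]   # new method

ols_slope, ols_intercept = ols_fit(x, y)
deming_slope, deming_intercept = deming_fit(x, y, delta=1.0)
print(ols_slope, deming_slope)
```

Both simulated methods are unbiased, so the true slope is 1. The ordinary least squares slope is attenuated towards zero by the error in x, which could be misread as a proportional bias between the methods, while the Deming estimate stays close to 1.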
L. Medical diagnosis confirmation or prediction, assessment of effectiveness of laboratory tests [31]
Advanced implementation of chemometric and statistical algorithms helps to clarify many important practical problems in medicine. It was found that appropriate use of multivariate statistics, e.g. the outputs of principal component analysis, discriminant analysis techniques, and logistic regression, together with standard statistical tools such as correlation analysis, analysis of variance (ANOVA), and ROC curves, may enable better confirmation and/or prediction of a medical diagnosis.
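As a small illustration of how a ROC curve summarizes the effectiveness of a laboratory test, the sketch below computes the area under the ROC curve (AUC) as the fraction of healthy/diseased pairs that the test ranks correctly. The test values and the function name `roc_auc` are hypothetical, and higher values are assumed to indicate disease.

```python
def roc_auc(negatives, positives):
    """AUC as the fraction of (negative, positive) pairs ranked correctly;
    tied pairs count one half."""
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical laboratory test results
healthy = [1.0, 1.2, 1.5, 2.0]
diseased = [1.4, 2.1, 2.5, 3.0]

auc = roc_auc(healthy, diseased)
print(auc)   # 14 of 16 pairs are ranked correctly: 0.875
```

An AUC of 0.5 corresponds to a test with no discriminating ability and 1.0 to perfect separation of the two groups.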
M. Monitoring of the health state of the patients [31]
A further task of chemometrics is to assist in detecting changes in a patient's condition with respect to the progress of the disease and the reaction to medical treatment. Another target of chemometrics is the determination of patient prognosis, that is, prediction of the future medical state of a patient on the basis of present laboratory test results and clinical care. Several typical examples are briefly discussed in the next paragraphs.
N. Drug design, structure-activity and structure-property relationships [31]
Multivariate chemometric models can also be used for the prediction of biological properties from chemical data obtained either from various instrumental measurements or by calculation of descriptors derived from the molecular structure. Drug development is a tedious and resource-demanding process consisting of several steps. Chemometrics plays an important role especially in the starting step, the search for compounds with strong activity against the causative agents of disease. In recent decades, QSAR (Quantitative Structure-Activity Relationships) has become an important tool in drug discovery and in toxicology. The goal of QSAR is to investigate how to predict the biological activity of a set of compounds of interest and to elucidate the specific action of the administered drug; it deals with the generalization of relations between the chemical and biological properties of the examined set of chemically similar derivatives.
CONCLUSION:
REFERENCES:
Singh I, Juneja P, Kaur B, Kumar P. Pharmaceutical applications of chemometric techniques. International Scholarly Research Notices 2013;2013(1):795178.
Drashti N. Bhalodia, Dharmendrasinh A. Baria, A Review On Chemometrics In Pharmaceutical Analysis, Int. J. of Pharm. Sci., 2024, Vol 2, Issue 9, 58-78. https://doi.org/10.5281/zenodo.13625694