Essential raw material for stable operation

By Diana Bouchard

In today's industrial plants, process information systems collect vast quantities of data every day. A single variable read once a second for a single day generates 86,400 values — almost half a million bytes of information.

Suddenly, after years of having too little data, we have more than we can process or understand.

Don't mistake quantity for quality. Here are some areas of concern when facing a data mountain.

Missing values: These occur for two reasons: either the process is shut down, or some stage of the data collection chain has failed. If the process shuts down, eliminate that period from the analysis. For short runs of missing values, interpolation can fill the gap. With long runs, interpolation is less trustworthy because of the large number of artificial values it introduces.
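As a sketch, short gaps can be bridged by linear interpolation while long runs are left for the analyst to handle. The `max_gap` threshold below is an assumption to tune for your own process:

```python
def fill_short_gaps(values, max_gap=3):
    """Return a copy of values with short interior runs of None linearly
    interpolated; longer runs are left as-is (too many artificial values)."""
    out = list(values)
    i = 0
    while i < len(out):
        if out[i] is None:
            start = i
            while i < len(out) and out[i] is None:
                i += 1
            run = i - start
            # Interpolate only interior gaps with known endpoints,
            # and only when the gap is short enough to trust.
            if 0 < start and i < len(out) and run <= max_gap:
                lo, hi = out[start - 1], out[i]
                for k in range(run):
                    out[start + k] = lo + (hi - lo) * (k + 1) / (run + 1)
        else:
            i += 1
    return out
```

For example, `fill_short_gaps([1.0, None, 3.0])` fills the single gap with 2.0, while a five-sample gap is left untouched.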

Outliers: These grossly out-of-range values arise from sensor or communications malfunctions, data recording anomalies, human error, or occasionally a real exceptional event. A single out-of-range data point can ruin an otherwise decent linear regression equation fitted to process data. How exceptional must a data point be to be an outlier? Choose a cutoff strategy beforehand and apply it uniformly, rather than arbitrarily removing points that "look like" outliers.
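One uniform cutoff strategy is a robust median-based rule. The factor `k=5` below is an illustrative choice, not a standard; the point is that the rule is fixed in advance and applied to every point:

```python
from statistics import median

def flag_outliers(values, k=5.0):
    """Flag points whose distance from the median exceeds k times the
    median absolute deviation (MAD). Median-based statistics resist
    being dragged around by the outliers themselves."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread at all: nothing can be called an outlier this way.
        return [False] * len(values)
    return [abs(v - med) > k * mad for v in values]
```

Applied to `[10, 11, 9, 10, 10, 100]`, only the 100 is flagged, no matter how it "looks."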

Not enough variability: A variable that does not vary contributes no useful information. Low variability can result from too-aggressive round-off of data readings or a measuring instrument that is not sensitive enough. Good automatic control also removes variability. One cure for low variability is the classic "bump test," which forces a normally stable variable to change, inducing changes in associated process variables. 
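A rough screening test for low variability might compare the observed spread of a variable against the resolution at which it was recorded; the factor of 3 here is purely illustrative:

```python
def too_little_variability(values, resolution):
    """Heuristic sketch: if the total spread of the data amounts to only a
    few counts of the recording resolution, the series carries little
    usable information for analysis. The factor of 3 is an assumption."""
    spread = max(values) - min(values)
    return spread <= 3 * resolution
```

A series recorded to 0.1 units that only ever moves between 10.0 and 10.1 would be flagged, suggesting round-off, an insensitive instrument, or tight automatic control.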

Too much variability (noise): If measurement noise is swamping the signal, look for over-sensitivity of the measurement system stemming from excessive gain, poor disturbance rejection, or poor sensor calibration or maintenance. In other cases, the noise may reflect true process instability. To deal with noise, start by examining the sensor and measurement system. Filtering techniques can also mathematically separate signal from noise. 
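A moving average is one of the simplest filtering techniques for suppressing high-frequency noise; the window length below is an assumption to be matched to your process dynamics:

```python
def moving_average(values, window=5):
    """Trailing moving average: each output is the mean of the last
    `window` samples (fewer at the start of the series). Longer windows
    suppress more noise but blur fast process changes."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - window + 1)
        chunk = values[lo:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```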

Sampling frequency: In a continuous process, every data value is a sample from a continuous stream of constantly varying information. Choice of sampling frequency is a tradeoff between information and economy of storage. Remember that information lost to a low sampling frequency is gone forever, while higher-resolution data can always be mathematically resampled later. A variable must be sampled often enough to capture the phenomena of interest. If a quality variable changes on a one-hour timescale, it cannot be studied using daily average data.
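Resampling in the safe direction — from high resolution down — can be sketched as block averaging; detail discarded by a low sampling rate, by contrast, cannot be reconstructed:

```python
def downsample(values, factor):
    """Average consecutive blocks of `factor` samples, e.g. turning
    once-a-second data into once-a-minute data with factor=60.
    Any trailing partial block is dropped."""
    return [sum(values[i:i + factor]) / factor
            for i in range(0, len(values) - factor + 1, factor)]
```

For example, `downsample([1, 2, 3, 4], 2)` yields `[1.5, 3.5]`.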

Filtering: Filtering is the processing of "raw" process data to remove components assumed to be invalid (out-of-range values or artificial zeroes) or else meaningless (high-frequency noise). Since the signal-to-noise ratio in observed process data is unknown, there is always a risk of removing real process information. For this reason, it is desirable to record the raw process data and filter afterwards. Careful filtering can remove the distracting effects of outliers and noise and allow the most important process interactions to come through.  
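A common first-order filter is the exponentially weighted moving average; the smoothing factor `alpha=0.2` is an illustrative assumption. Note that the function returns a new series, in keeping with the advice to keep the raw data and filter afterwards:

```python
def ewma(values, alpha=0.2):
    """Exponentially weighted moving average (first-order filter).
    Smaller alpha smooths more heavily; alpha=1 returns the raw data.
    The raw series is left untouched, so no information is discarded."""
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out
```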

Data compression: Data compression arose to reduce storage requirements for process data, back when mass storage was expensive and of limited capacity. Prices of mass storage have declined sharply in recent years, and routine data compression is no longer justifiable.

The roots of poor data quality usually lie in data collection procedures and the configuration of the process information system. Decide which variables to measure and record. Basic science and engineering, or experienced plant personnel, can usually tell you "what affects what" and which variables are important. Without process context, data become useless. Record the location and frequency of each measurement; the tag number, if available; how the value was registered (sensor, lab test, and the like); the normal operating value and range; accuracy and reliability; and any controllers that affect the variable.

Process data are the essential raw material for stable operation, consistent high quality, and effective process control.
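The per-variable context described above could be captured in a simple record; the field names and the example tag numbers below are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class VariableRecord:
    """Metadata worth recording alongside each measured variable."""
    name: str
    tag_number: str       # instrument tag, if available
    location: str         # where in the process it is measured
    frequency_s: float    # sampling interval in seconds
    source: str           # "sensor", "lab test", and the like
    normal_value: float   # normal operating value, engineering units
    normal_range: tuple   # (low, high) operating range
    controllers: list     # controllers that affect this variable

# Hypothetical example entry:
stock_flow = VariableRecord(
    name="stock flow", tag_number="FI-101", location="machine approach",
    frequency_s=1.0, source="sensor", normal_value=420.0,
    normal_range=(380.0, 460.0), controllers=["FIC-101"],
)
```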

Diana Bouchard is a senior member of ISA and has an M.Sc. in Computer Science.