The effect of sample size on model error is examined through several commercial data sets, using five trade-off techniques: ACA, ACA/HB, CVA, HB-Reg and CBC/HB. Using the total sample to generate surrogate holdout cards, numerous subsamples are drawn, utilities estimated and model results compared to the total sample model. Latent class analysis is used to model the effect of sample size, number of parameters and number of tasks on model error.
The effect of sample size on study precision is a perennial issue for commercial market researchers. Sample size is generally the single largest out-of-pocket cost component of a commercial study. Determining the minimum acceptable sample size therefore plays an important role in the design of an efficient commercial study.
For simple statistical measures, such as confidence intervals around proportions estimates, the effect of sample size on error is well known (see Figure 1). For more complex statistical processes, such as conjoint models, the effect of sample size on error is much more difficult to estimate. Even the definition of error is open to several interpretations.
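As a point of reference, the precision of a simple proportion estimate can be computed directly from the normal-approximation confidence interval. The minimal sketch below (Python, purely illustrative and not part of the original analysis) shows how the 95% half-width shrinks with sample size for the worst-case proportion of 0.5.

```python
import math

def ci_half_width(p, n, z=1.96):
    """Half-width of the normal-approximation 95% confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

# Worst-case proportion (p = 0.5) at several sample sizes
for n in (30, 50, 100, 200, 400, 1000):
    print(f"n = {n:4d}: +/- {ci_half_width(0.5, n):.3f}")
```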
Many issues face practitioners when determining sample size.
Some of these issues are statistical in nature, such as number of attributes and levels, and some of these issues are managerial in nature, such as value of the information, cost and timing. The commercial researcher needs to address both types of issues when determining sample size.
The intent of this paper is to examine a variety of commercial data sets in an empirical way to see if some comments can be made about the effect of sample size on model error. Additionally, the impact of several factors (number of attributes and levels, number of tasks, and trade-off technique) on model error will also be investigated.
For each of five trade-off techniques, ACA, ACA/HB, CVA, HB-Reg, and CBC/HB, three commercial data sets were examined (the data sets for ACA and CVA also served as the data sets for ACA/HB and HB-Reg, respectively). Sample sizes across the data sets ranged from 431 to 2,400.
Since these data sets were collected from a variety of commercial marketing research firms, there was little control over the number of attributes and levels or the number of tasks. Thus, while there was some variation in these design characteristics, there was less experimental control than would be desired, particularly with respect to trade-off technique.
Table 1
Notice in Table 1 above that the number of parameters and number of tasks are somewhat correlated with trade-off technique. CBC/HB data sets tended to have fewer degrees of freedom (number of tasks minus number of parameters) than CVA data sets. ACA data sets had a much greater number of parameters than either CBC/HB or CVA data sets. These correlations occur quite naturally in the commercial sector. Historically, choice models have been estimated at the aggregate level while CVA models are estimated at the individual level. By aggregating across respondents, choice study designers could afford to use fewer tasks than would be necessary to estimate individual-level conjoint models. Hierarchical Bayes methods allow for the estimation of individual-level choice models without making any additional demands on the study’s experimental design. A major benefit of ACA is its ability to accommodate a large number of parameters.
For each data set, models were estimated using randomly drawn subsets of the total sample at sample sizes of 200, 100, 50 and 30. In the cases of ACA and CVA, no new utility estimation was required, since each respondent’s utilities are a function of that respondent’s data alone. However, for CBC/HB, HB-Reg and ACA/HB, new utility estimations occurred for each draw, since each respondent’s utilities are a function not only of that respondent but also of the “total” sample. For each sample size, random draws were replicated up to 30 times, with the number of replicates increasing as sample size decreased: five replicates for n=200, 10 for n=100, 20 for n=50 and 30 for n=30. The intent was to stabilize the estimates and get a true sense of the accuracy of models at each sample size.
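A minimal sketch of this resampling plan is shown below. The sample sizes and replicate counts are those reported in the text; the data structures and seeding are illustrative assumptions, and the re-estimation step for the HB techniques is only indicated in a comment.

```python
import random

# Replicate counts by subsample size, as described above
REPLICATES = {200: 5, 100: 10, 50: 20, 30: 30}

def draw_subsamples(respondent_ids, seed=0):
    """Yield (sample_size, replicate_index, ids) for every random draw."""
    rng = random.Random(seed)
    for n, reps in REPLICATES.items():
        for r in range(reps):
            yield n, r, rng.sample(respondent_ids, n)

# For ACA and CVA, the existing individual utilities are simply subset on each
# draw; for the HB techniques (ACA/HB, HB-Reg, CBC/HB), utilities would be
# re-estimated on each draw, since every respondent's estimates borrow
# strength from the rest of the sample.
```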
Since it was anticipated that many, if not all, of the commercial data sets analyzed in this paper would not contain holdout choice tasks, models derived from reduced samples were compared to models derived from the total sample. That is, in order to evaluate how well a smaller sample size was performing, 10 first-choice simulations were run for both the total sample model and each of the reduced sample models, with the total sample model serving to generate surrogate holdout tasks. Thus, mean absolute error (MAE) was the measure with which models were evaluated, each sub-sample model being compared to the total sample model. In all, 990 models (5 techniques x 3 data sets x 66 sample size/replicate combinations) were estimated and evaluated, and 9,900 simulations (990 models x 10 simulations) were run as the basis for the MAE estimates.
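A sketch of the MAE calculation is given below, assuming the first-choice shares from the total-sample model and a sub-sample model are available as parallel lists of product shares (in percentage points) for each of the 10 simulation scenarios; the data layout and function name are illustrative.

```python
def mean_absolute_error(total_shares, subsample_shares):
    """Mean absolute difference in simulated shares, in share points.

    Both arguments are lists of scenarios, each scenario being a list of
    product shares from a first-choice simulation.
    """
    diffs = [
        abs(t - s)
        for scenario_t, scenario_s in zip(total_shares, subsample_shares)
        for t, s in zip(scenario_t, scenario_s)
    ]
    return sum(diffs) / len(diffs)
```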
Additionally, correlations were run, at the aggregate level, between the mean utilities from each of the sub-sample models and the total sample model. Correlation results were reported in the form 100 * (1 - r²) and are referred to, for the remainder of this paper, as the mean percentage of error (MPE).
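The MPE is a simple transformation of that aggregate-level correlation; a sketch, assuming the mean utilities are supplied as two equal-length lists:

```python
from statistics import correlation  # available in Python 3.10+

def mean_percentage_of_error(total_utils, subsample_utils):
    """MPE = 100 * (1 - r^2), where r is the correlation between the mean
    utilities of a sub-sample model and the total-sample model."""
    r = correlation(total_utils, subsample_utils)
    return 100 * (1 - r ** 2)
```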
It should be noted that there is an indeterminacy inherent in conjoint utility scaling that makes correlation analysis potentially meaningless. Therefore, all utilities were scaled so that the levels within each attribute summed to zero (effects coding). This allowed meaningful correlation analysis to occur.
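A sketch of that zero-centering step, assuming each respondent's utilities are stored as a mapping from attribute to a list of level utilities (the structure and values are illustrative only):

```python
def zero_center(utilities):
    """Rescale utilities so the levels within each attribute sum to zero."""
    centered = {}
    for attribute, levels in utilities.items():
        level_mean = sum(levels) / len(levels)
        centered[attribute] = [u - level_mean for u in levels]
    return centered

# Illustrative example
raw = {"brand": [1.2, 0.4, -0.1], "price": [2.0, 1.0, 0.0, -0.5]}
print(zero_center(raw))
```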
Since each subsample was being compared to a larger sample, of which it was also a part, there was a sample bias inherent in the calculation of error terms.
Several studies using synthetic data were conducted to determine the magnitude of this bias and to develop correction factors for adjusting the raw error terms.
For each of four scenarios, two data sets of 200 random numbers between 1 and 20 were generated, and the exercise was repeated 10 times. In the four scenarios, the first 100, 75, 50 and 25 data points, respectively, were identical across the two data sets, and the remaining 100, 125, 150 and 175 data points were independent of one another.
The correlation between the two data sets, r, approximately equals the degree of overlap, n/N, between the two data sets (Table 2).
Table 2
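A sketch of that synthetic exercise is shown below. The range of random values (1 to 20), the sample size of 200 and the four overlap levels match the description above; the averaging over 10 repetitions and all names are illustrative.

```python
import random
from statistics import correlation, mean  # correlation requires Python 3.10+

def overlap_correlation(n_overlap, size=200, reps=10, seed=0):
    """Average correlation between two samples sharing their first n_overlap values."""
    rng = random.Random(seed)
    rs = []
    for _ in range(reps):
        shared = [rng.randint(1, 20) for _ in range(n_overlap)]
        a = shared + [rng.randint(1, 20) for _ in range(size - n_overlap)]
        b = shared + [rng.randint(1, 20) for _ in range(size - n_overlap)]
        rs.append(correlation(a, b))
    return mean(rs)

for n_overlap in (100, 75, 50, 25):
    print(f"overlap {n_overlap}/200: r ~ {overlap_correlation(n_overlap):.2f}")
```

As the output suggests, the correlation tracks the proportion of shared observations, consistent with the r ≈ n/N pattern reported in Table 2.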
To extend the concept further, a random sample of 200 was generated; a second sample of 100 was created in which each member equaled a member of the first sample; and a third, independent random sample of 100 was generated.
For each of the three samples, the mean was calculated. This process was replicated 13 times and the mean data are reported below (Table 3).
The mean absolute difference (MAE) between the first two data sets is 0.147308, while the MAE between the first and third data sets is 0.218077. Dividing the MAE for the first two data sets by the finite population correction factor, sqrt(1 - n/N), makes the two MAEs quite similar: 0.147308 / sqrt(1 - 100/200) ≈ 0.208, which is close to 0.218077.