22. Improved Total Score Distribution
[Figure: score-distribution histograms for WG Form A, the WG Short Form, and WG II Form D. WGII_Tot: Mean = 27.24, Std. Dev. = 6.34, N = 935]
24. Reliability
WG II – Form D (N = 1011)
Scale                    Number of Items   r (alpha)
Recognize Assumptions    12                .80
Evaluate Arguments       12                .57
Draw Conclusions         16                .70
Total                    40                .83
25. Evaluate Arguments Conditional Reliability
EA scores were more reliable in the moderately low to moderately high ability range.
26. Factor Structure
Confirmatory Factor Analyses of Watson-Glaser II Form D (N = 306)
Chi-square   df    GFI   AGFI   RMSEA
175.66       132   .94   .93    .03
27. Intercorrelations
Intercorrelations Among Watson-Glaser II Form D Subtest Scores (N = 636)
Scale                      Mean   SD    1     2     3
1. WG II Total             27.1   6.5
2. Recognize Assumptions    8.2   3.1   .79
3. Evaluate Arguments       7.7   2.2   .66   .26
4. Draw Conclusions        11.2   3.1   .84   .47   .41
31. History of WG Convergent Validity
Cognitive Ability Tests                            N     r
Advanced Numerical Reasoning Appraisal             452   .68
Miller Analogies Test for Professional Selection   63    .70
Raven's Advanced Progressive Matrices              41    .53
Achievement Tests                                  N     r
ACT Composite                                      203   .53
SAT – Math                                         147   .39
SAT – Verbal                                       147   .43
32. WG II Convergent Validity – Cognitive Ability
Watson-Glaser II Form D (N = 63)
WAIS-IV Composite/Subtest     Total Score   Recognize Assumptions   Evaluate Arguments   Draw Conclusions
Full-Scale IQ                 .52           .31                     .21                  .62
Perceptual Reasoning Index    .46           .20                     .25                  .56
Working Memory Index          .44           .24                     .13                  .59
Verbal Comprehension Index    .42           .34                     .10                  .46
Processing Speed Index        .14           .09                     -.01                 .22
Fluid Reasoning Composite     .60           .32                     .36                  .67
34. WG II Predictive Validity
Correlations for Watson-Glaser II Scores and Performance Ratings (N = 65)
Watson-Glaser II Form D Score
Supervisory Performance Criteria               Total Score   Recognize Assumptions   Evaluate Arguments   Draw Conclusions
Core Critical Thinking Behaviors               .44           .33                     .17                  .48
Evaluating Quality of Reasoning and Evidence   .43           .32                     .17                  .46
Bias Avoidance                                 .36           .31                     .20                  .30
Creativity                                     .38           .25                     .15                  .45
Job Knowledge                                  .34           .14                     .34                  .30
Overall Performance                            .17           .03                     .04                  .37
Overall Potential                              .39           .13                     .21                  .53
35. WG II Predictive Validity
Performance Comparisons of Highly Ranked Critical Thinkers vs. a Contrast Group
                         Highly Ranked (N = 23)   Contrast Group (N = 12)
Watson-Glaser II         Mean    SD               Mean    SD               p value   Cohen's d
Total Score              31.8    3.9              25.5    6.7              <.01      1.27
Recognize Assumptions     9.5    1.3               7.6    3.0               .01       .95
Evaluate Arguments        9.2    1.6               6.8    1.9              <.01      1.38
Draw Conclusions         13.1    2.5              11.1    2.9               .04       .76
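The Cohen's d figures on this slide are the standard pooled-SD effect size and can be reproduced from the reported means, SDs, and group sizes. A minimal Python sketch (the function name is ours; tiny discrepancies from the reported values come from rounding in the published means and SDs):

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d with the pooled standard deviation in the denominator."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# Total Score: Highly Ranked 31.8 (SD 3.9, n = 23) vs. Contrast 25.5 (SD 6.7, n = 12)
d_total = cohens_d(31.8, 3.9, 23, 25.5, 6.7, 12)  # ~1.26 (reported: 1.27)

# Draw Conclusions: 13.1 (SD 2.5, n = 23) vs. 11.1 (SD 2.9, n = 12)
d_dc = cohens_d(13.1, 2.5, 23, 11.1, 2.9, 12)     # ~.76
```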
Note to sales team – R&D will need to build out our ratings of controversial items after the test is published. Since the ratings were done solely in the US, we don’t know how the controversy may vary cross-culturally. We’ll likely use a multi-faceted approach for addressing this: 1) by analyzing cross-cultural performance on items to see if any items function differently in different cultures; 2) by obtaining “controversy” ratings from our colleagues in other countries like the UK, France, etc.
Both are from Evaluate Arguments, added item is from WG II - Form D
Samples used consisted primarily of people from our populations of interest: Execs, Directors, Mgrs, Supervisors, Professionals, and Individual Contributors.
Note to sales team: The alpha value for EA of .57 is still low by conventional standards, so there will likely be questions about this. One point worth mentioning is that this is a substantial improvement over the reliability of the previous EA scale, which typically exhibited alphas in the .20s. Another is that we've informally observed that EA may be more multidimensional than the other subtests, which would lower its internal consistency (alpha) reliability. It tends to correlate more strongly with personality in some cases (e.g., MBTI Feeling), so it may be more of a multidimensional combination of personality, attitudes, and cognitive ability than the other subtests.

From the Manual (FYI): Cronbach's alpha and the standard error of measurement (SEM) were calculated for Watson-Glaser II Form D total and subtest scores using Classical Test Theory. Because Form E was developed using a common-item approach (i.e., no single examinee had data on all 40 items), traditional methods of estimating internal consistency were not applicable. Split-half reliability for Form E was instead estimated using a method based on Item Response Theory (IRT), since IRT has more flexibility in handling missing data. Reliability was calculated from the ability estimates calibrated for the odd and even halves of the test, using the 27 items for which all examinees had complete data (i.e., items retained from Form B). The calculations used a sample drawn from the customer database (N = 2,706). A correction was then applied to estimate the reliability of the 40-item form using the Spearman-Brown prophecy formula. Results are presented in Table 6.2, and descriptions of the sample used to estimate reliability for Form D are presented in Table 6.3. Internal consistency reliabilities for the total scores were .83 and .81 for Forms D and E, respectively.
Consistent with research on previous Watson-Glaser forms, these values indicate that Forms D and E total scores possess adequate reliability. Internal consistency reliabilities for the Form D subtests Recognize Assumptions (.80) and Draw Conclusions (.70) were both adequate. Internal consistency reliability for the Form D Evaluate Arguments subtest was .57, which is low. It is possible that this subtest is measuring a multidimensional construct (see Chapter 2). Overall, subtest scores showed lower estimates of internal consistency reliability as compared to the total score, suggesting that the subtest scores alone should not be used when making selection decisions.
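For anyone who wants to sanity-check the reliability figures above: the Spearman-Brown prophecy formula and the Classical Test Theory SEM are both one-liners. A minimal Python sketch (the .76 split-half input is illustrative, not a value from the manual; the SD of 6.5 and alpha of .83 are the Form D total values from the slides):

```python
import math

def spearman_brown(reliability, k):
    """Projected reliability when the test length is multiplied by factor k."""
    return k * reliability / (1 + (k - 1) * reliability)

# Classic split-half step: doubling a half-test correlation of .70
full_test = spearman_brown(0.70, 2)       # ~.82

# Form E-style correction: stepping a 27-item reliability estimate
# up to the full 40-item length (illustrative input of .76)
r_40 = spearman_brown(0.76, 40 / 27)      # ~.82

# SEM under Classical Test Theory: SD * sqrt(1 - reliability),
# using the Form D total SD (6.5) and alpha (.83)
sem_total = 6.5 * math.sqrt(1 - 0.83)     # ~2.7 raw-score points
```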
Online assessment delivery is offered, as well as paper-based formats (typically only for large, preferred customers).
Confirmatory factor analysis (CFA) can be used to determine how well a specified theoretical model explains observed relationships among variables. Common indices used to evaluate how well a specified model explains observed relationships include the goodness-of-fit index (GFI), the adjusted goodness-of-fit index (AGFI), and the root mean squared error of approximation (RMSEA). GFI and AGFI values each range from 0 to 1, with values exceeding .9 indicating a good fit to the data (Kelloway, 1998). RMSEA values closer to 0 indicate better fit, with values below .10 suggesting a good fit to the data, and values below .05 a very good fit to the data (Steiger, 1990). CFA can also be used to evaluate the comparative fit of several models; smaller values of chi-square relative to the degrees of freedom in the model indicate better relative fit.

During the tryout stage, a series of confirmatory models was compared: Model 1 specified critical thinking as a single factor; Model 2 specified the three-factor model; and Model 3 specified the historical five-factor model. The results, which are presented in Table 7.1 and Figure 7.1, indicated that Model 1 did not fit the data as well as the other two models. Both Model 2 and Model 3 fit the data, and there was no substantive difference between the two in terms of model fit. However, the phi coefficients in the five-factor model were problematic and suggest that the constructs are not meaningfully separable; for example, the phi coefficient was 1.18 between Inference and Deduction and .96 between Deduction and Interpretation. Given this evidence, the three-factor model was confirmed as the optimal model for the Watson-Glaser II.

During standardization there was an opportunity to replicate the confirmatory factor analyses that were run during the tryout stage. A sample of 636 people participated in the validity studies. The general characteristics of this sample are provided in Table 7.5.
Two hundred people did not provide all of the data needed for validation (e.g., job performance ratings), so this subgroup is not described in Table 7.5. The results of the confirmatory factor analysis supported the three-factor model (GFI = .97; AGFI = .96; RMSEA = .03), providing further evidence for the three scales of the Watson-Glaser II.
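The RMSEA values quoted in these analyses can be reproduced directly from the chi-square, degrees of freedom, and sample size via the standard formula RMSEA = sqrt(max(χ² − df, 0) / (df · (N − 1))). A quick Python check using the tryout-stage figures from the Factor Structure slide (χ² = 175.66, df = 132, N = 306):

```python
import math

def rmsea(chi_square, df, n):
    """Root mean squared error of approximation for a fitted CFA model."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))

value = rmsea(175.66, 132, 306)
print(round(value, 2))  # 0.03, matching the reported RMSEA
```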
Correlations among the Watson-Glaser II Form D subtests are presented in Table 7.2. The correlations were low to moderate, with Draw Conclusions and Recognize Assumptions correlating highest (.47) and Recognize Assumptions and Evaluate Arguments correlating lowest (.26). These correlations indicate that there is a reasonable level of independence and non-redundancy among the three subtests.
Explanatory notes for EA being low: (from manual) “The lowest was for Evaluate Arguments, which was expected. In previous Watson-Glaser Forms, Evaluate Arguments was the subtest that performed least well psychometrically, and contained several items that some customers found objectionable.” The Form D mean is slightly lower because it is more difficult (a good thing), and the SD is slightly larger because it differentiates better than the Short Form, resulting in more of a spread of scores across examinees. Note that there is no equivalent subscale for Draw Conclusions on the WG Short Form, so the Draw Conclusions correlations were based on the relationship between the WG II Draw Conclusions scale and a composite created using the WG Short Form’s Inference, Deduction, and Interpretation subscales.
Note to sales team – this is from the manual in case anyone asks about online vs. paper equivalence: Occasionally, customers inquire about the equivalence of online versus paper administration of the Watson-Glaser. Studies of the effect of test administration mode have generally supported the equivalence of paper and computerized versions of non-speeded cognitive ability tests (Mead & Drasgow, 1993). To ensure that these findings held true for the Watson-Glaser, in 2005, Pearson conducted an equivalency study using paper-and-pencil and computer-administered versions of the Short Form (Watson & Glaser, 2006). This study is presented in this manual for the reader's convenience. In this study, a counterbalanced design was employed using a sample of 226 adult participants from a variety of occupations. Approximately half of the group (n = 118) completed the paper form followed by the online version, while the other participants (n = 108) completed the tests in the reverse order. Neither mode of administration yielded consistently higher raw scores, and mean score differences between modes were less than one point (0.5 and 0.7). The variability of scores also was very similar, with standard deviations ranging from 5.5 to 5.7. The coefficients indicate that paper-and-pencil raw scores correlate very highly with online administration raw scores (.86 and .88, respectively). The high correlations provide further support that the two modes of administration can be considered equivalent. Thus, raw scores on one form (paper or online) may be interpreted as having the same meaning as identical raw scores on the other form. Given these results, no equivalency study was conducted for the Watson-Glaser II.
Miller Analogies Test = Verbal Analogies (hence, the highest)
Raven's APM = Nonverbal Reasoning
ANRA = Quantitative Reasoning
The recent release of the WAIS-IV created an opportunity to examine the correlation between WAIS-IV and Watson-Glaser II scale scores. The Watson-Glaser II was administered to 63 individuals with a Bachelor's degree or higher (a group similar to individuals in the Watson-Glaser II target population) who had recently taken the WAIS-IV (within the prior 11 to 23 months). The sample is described in Table 7.X, which is at the end of this section. The WAIS-IV consists of 15 subtests that measure cognitive ability across four domains: Verbal Comprehension, Perceptual Reasoning, Working Memory, and Processing Speed. It was expected that the Watson-Glaser II total score would be highly related to the WAIS-IV total score. Further, it was hypothesized that the total score would correlate with the WAIS-IV Working Memory Index and the Perceptual Reasoning Index, and to a lesser extent the Verbal Comprehension Index. Reasoning and working memory are needed to perform critical thinking tasks that involve maintaining information in conscious awareness (e.g., a conclusion, a premise) and mentally manipulating the information to arrive at an answer. The Watson-Glaser II is a verbally loaded test and, as such, should correlate with Verbal Comprehension. Conversely, Processing Speed is an important component of cognitive ability, but it is not viewed as core to critical thinking. Finally, the WAIS-IV has a composite, Fluid Reasoning, which measures the ability to manipulate abstractions, rules, generalizations, and logical relationships. It was hypothesized that this composite would be strongly correlated with the Watson-Glaser II total score because both scales measure reasoning. Table 7.4 presents the means, standard deviations, and correlation coefficients. The results indicated that Watson-Glaser II total scores were significantly related to the WAIS-IV Full Scale IQ score.
As predicted, Watson-Glaser II total scores were also significantly related to Working Memory (.44), Perceptual Reasoning (.46), and Verbal Comprehension (.42), but not Processing Speed (.14). At the Watson-Glaser II subtest level, it was expected that the Draw Conclusions scale would be more highly correlated with the WAIS-IV than the Recognize Assumptions or Evaluate Arguments scales. Relative to the other subtests, Draw Conclusions requires mental manipulation of a larger number of logical relationships, resulting in greater complexity. Therefore, it was predicted that, among the three subtests, Draw Conclusions would be more strongly correlated with the Perceptual Reasoning, Working Memory, and Verbal Comprehension Indices and the Fluid Reasoning Composite than Recognize Assumptions and Evaluate Arguments. As predicted, the Draw Conclusions scale was more highly correlated with WAIS-IV Perceptual Reasoning (r = .56 versus r = .20 for Recognize Assumptions and r = .25 for Evaluate Arguments), Working Memory (r = .59 versus r = .24 for RA and r = .13 for EA), Verbal Comprehension (r = .46 versus r = .34 for RA and r = .10 for EA), and Fluid Reasoning (r = .62 versus r = .31 for RA and r = .21 for EA). These results also suggest that the Recognize Assumptions scale is more closely associated with working memory, verbal comprehension, and fluid reasoning than Evaluate Arguments.
The Watson-Glaser II Form D and MBTI were administered to 60 medical professionals working in a northeastern hospital network. They were engaged in a leadership development program. Correlations of WG II Form D scores with MBTI scores:

MBTI Index            Total Score   Recognize Assumptions   Evaluate Arguments   Draw Conclusions
Extrovert/Introvert   .12           .07                     .08                  .11
Sensing/iNtuiting     .07           .10                     -.06                 .07
Thinking/Feeling      -.10          .09                     -.27                 -.11
Judging/Perceiving    -.17          -.08                    -.20                 -.13
The relationship between the Watson-Glaser II and job performance was examined using a sample of 68 managers and their supervisors from the claims division of a national insurance company. Managers completed the Watson-Glaser II, and the supervisors of these participants rated the participants' job performance across thinking domains (e.g., creativity, analysis, critical thinking, job knowledge) as well as overall performance and potential. Table 7.7 presents means, standard deviations, and correlations. Results showed that the Watson-Glaser II total score correlated .28 with supervisory ratings on a scale of core critical thinking behaviors and .25 with ratings of overall potential. The pattern of relationships at the subtest level indicated that Draw Conclusions correlated significantly with all performance ratings except Job Knowledge (r = .23, ns). Evaluate Arguments was significantly related only to Job Knowledge (r = .26), and Recognize Assumptions was not significantly related to any of the performance dimensions.
A second study examined the relationship between Watson-Glaser II scores and job performance using 35 professionals at a large financial services company. Incumbents were ranked by human resources staff familiar with their performance. The ranking involved categorizing incumbents into a "top" and a "contrast" group based on critical thinking effectiveness demonstrated over time. Pearson provided the human resources staff with a list of behaviors typically exhibited by strong and weak critical thinkers, respectively, to help guide the rankings. Table 7.8 presents a comparison of the average Watson-Glaser II total and subtest scores achieved by each group. As expected, the group of top-ranked critical thinkers achieved higher Watson-Glaser II total and subtest scores than the contrast group.
The type of norms available and their composition characteristics are updated frequently, so it is best to contact an Account Manager (1 888 298 6227) or access TalentLens.com for the most current offerings. The Watson-Glaser II norms were derived from the existing Watson-Glaser norms through an extrapolation process. The raw scores on the Watson-Glaser II and Watson-Glaser Form A were converted to ability estimates using Rasch-model difficulty parameters. These ability estimates were then converted to a common scale (i.e., scaled scores), facilitating the comparison of scores across forms. This link across forms was used to allow the normative samples collected for Form A to be converted for use with the Watson-Glaser II. Fourteen occupational position or level groups created for Watson-Glaser Form A were selected as the normative samples. These groups contained relatively large numbers (average n = 967).
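The Rasch-based linking described above relies on the fact that, under the Rasch model (P(correct) = exp(θ − b) / (1 + exp(θ − b)) for ability θ and item difficulty b), a raw score is a sufficient statistic for ability: the ability estimate for a given raw score is the θ at which the expected number-correct equals that raw score. A minimal sketch of that raw-score-to-ability conversion, using made-up item difficulties rather than the operational Watson-Glaser parameters:

```python
import math

def rasch_p(theta, b):
    """Rasch model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def ability_from_raw(raw, difficulties, lo=-6.0, hi=6.0, tol=1e-6):
    """Solve sum_i P_i(theta) = raw for theta by bisection.
    Valid for interior raw scores (0 < raw < number of items)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # Expected number-correct is strictly increasing in theta
        if sum(rasch_p(mid, b) for b in difficulties) < raw:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical 10-item form with difficulties centered on zero
bs = [-1.5, -1.0, -0.5, -0.25, 0.0, 0.0, 0.25, 0.5, 1.0, 1.5]
theta = ability_from_raw(5, bs)  # near 0.0 for a raw score of half the items
```

Putting two forms' ability estimates on a common scale in this way is what allows a norm table built for one form to be carried over to another.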