Canadian Human Rights Tribunal

T.D. 2/96 Decision rendered on February 15, 1996:

THE CANADIAN HUMAN RIGHTS ACT (R.S.C., 1985, c. H-6 (as amended))

HUMAN RIGHTS TRIBUNAL

BETWEEN:

PUBLIC SERVICE ALLIANCE OF CANADA

Complainant

- and -

CANADIAN HUMAN RIGHTS COMMISSION

Commission

- and -

TREASURY BOARD

Respondent

DECISION OF THE TRIBUNAL

Tribunal: Donna Gillis, Chairperson
Norman Fetterly, Member
Joanne Cowan-McGuigan, Member

Appearances: Andrew Raven Counsel for the Public Service Alliance of Canada

Rosemary Morgan and René Duval Counsel for the Canadian Human Rights Commission

Duff Friesen, Lubomyr Chabursky and Deborah Smith Counsel for Treasury Board

Location of Hearing: Ottawa, Ontario

TABLE OF CONTENTS

I. INTRODUCTION

II. ISSUE

III. LEGISLATION

IV. BURDEN OF PROOF

V. STANDARD OF PROOF

VI. FACTS

A. THE WILLIS PLAN

B. THE WILLIS PROCESS

(i). Data-Gathering

(ii). Willis Questionnaire

(iii). Coordinators

(iv). Screeners and/or Reviewers

C. THE EVALUATION PROCESS

(i). Master Evaluation Committee

(ii). Multiple Evaluation Committees

(iii). Process for Evaluation of Questionnaires

(iv). Training of the Multiple Evaluation Committees

(v). Master Evaluation Committee's Evaluations

(vi). Multiple Evaluation Committees' Evaluations

(vii). Re-Training of Multiple Evaluation Committees

(viii). Sore-Thumbing

D. RELIABILITY TESTING

(i). Inter-Rater Reliability Testing

(ii). IRR Testing in the Multiple Evaluation Committees

(iii). Inter-Committee Reliability Testing

(iv). ICR Testing in the Multiple Evaluation Committees

(v). Wisner 222 Re-Evaluations

E. THE COMMISSION

(i). Commission Investigation

(ii). Sunter's Analysis

F. ROLE OF CONSULTANTS IN RE-EVALUATIONS

G. WHETHER THE RESULTS SHOULD BE ADJUSTED - THE EXPERTS

VII. DECISION AND ANALYSIS

VIII. CONCLUSION

APPENDIX A - COMMITTEE MANDATES

I. INTRODUCTION

1. The Canadian Human Rights Commission (the Commission) is established under the Canadian Human Rights Act, R.S., 1985, c. H-6 as amended (the Act), and is a party in this complaint, representing the public interest.

2. The Commission presented six witnesses qualified to testify as experts. The first witness to appear was Dr. Nan Weiner, an expert in pay equity and compensation. The second expert to testify was Norman D. Willis, an expert in pay equity and job evaluation. They were followed by two expert statisticians, Dr. Richard Shillington, an expert in data analysis, and Alan Sunter, an expert in statistics. Also called were two employees of the Commission, Paul Durber and James Sadler. Durber is an expert in pay equity, job evaluation and related areas, and Sadler is an expert in pay equity and job evaluation.

3. The Respondent, Treasury Board (the Employer), is the employer of employees who work in the Federal Public Service of Canada listed in Schedule I, Part I of the Public Service Staff Relations Act, 1966-67, c. 72, s. 1 (the PSSRA). Apart from Willis, the Employer called only one expert to testify, Fred Owen. Owen was a former Willis consultant and an expert in pay equity and job evaluation.

4. The Complainant, the Public Service Alliance of Canada (the Alliance), is an employee organization within the meaning of the PSSRA. The Alliance has been certified under the PSSRA to act as bargaining agent for a number of bargaining units in the Federal Public Service. The Alliance is the third largest union in Canada representing approximately 170,000 employees, 70 per cent of whom work outside of the National Capital Region. The Alliance is composed of 18 components which are, with the exception of one or two components, male-dominated. The largest bargaining unit represented by the Alliance is the Clerical and Regulatory Group (the CR Group) which consists of approximately 50,000 employees. This bargaining unit is 80 per cent female and includes employees performing an extremely wide range of functions.

5. The Alliance called four experts to testify during the course of this hearing. The first was Dr. Pat Armstrong, accepted by the Tribunal as an expert in job evaluation and pay equity. The Alliance also called Dr. Eugene Swimmer, an expert in labour economics and statistics. The Tribunal accepted one Alliance employee, Margaret Jaekl, as an expert in pay equity and job evaluation. Another individual, Margaret I. Krachun, who at the time of the hearing was employed by the Alliance, was accepted as a layperson with some experience in evaluation gained while a member of one of the evaluation committees.

6. The case originally before the Tribunal arose from complaints filed by both the Alliance and the Professional Institute of the Public Service of Canada (the Institute) alleging violation of s. 11 of the Act. The Institute called one expert witness, Dan Butler, a negotiator with the Institute. He was accepted by the Tribunal as an expert expressing the opinion of the Institute on several issues before the Tribunal, primarily on wage adjustment methodology.

7. The human rights complaints before the Tribunal now pertain only to the complaints of the Alliance. The Institute's complaints are no longer before us. Those complaints were resolved by a negotiated settlement between the Employer and the Institute. A Consent Order was issued by the Tribunal dated May 31, 1995, giving effect to their settlement.

8. In the case of the Alliance, two complaints remain for our determination. The first complaint, dated December 19, 1984, alleges discriminatory practice contrary to ss. 7, 10 and 11 of the Act with respect to employees in the female-dominated CR Group. It is only the s. 11 portion of the 1984 CR Group complaint which has been referred to the Tribunal for ruling. The complaint presented on behalf of the employees in the CR Group affects the rights of approximately 50,000 workers who belong to this group.

9. The second complaint, dated February 16, 1990, alleges the results obtained through the process of the Joint Union-Management Initiative on Equal Pay for Work of Equal Value have demonstrated the existence of wage rates which are in contravention of s. 11 of the Act with respect to employees in the female-dominated occupational groups: Clerical and Regulatory; Secretarial, Stenographic and Typing; Data-Processing; Educational Support; Hospital Services; and Library Science. This complaint of the Alliance was filed with the Commission shortly after the breakdown of the Joint Union-Management Initiative (which will be detailed later). That complaint relies upon the job evaluation data generated by the study resulting from this initiative, claiming, in support of its position, that employees in the identified complainant groups continue to suffer wage rate discrimination contrary to s. 11 of the Act, notwithstanding unilateral payments announced by the Employer in January of 1990.

10. From the outset, the Alliance's preferred position was to attempt to resolve equal pay issues through negotiations with the Employer at the bargaining table. It was only when these measures failed to lead to corrective action that the complaint mechanism of the Act was invoked.

11. The human rights complaints of the Alliance are not the first s. 11 complaints the Alliance has presented under the Act. The earlier complaints include the complaint of the Library Science Group (the LS Group) and the Hospital Services Group (the HS Group) on behalf of employees in the female-dominated sub-groups in the General Services Group (the GS Group).

12. In each of these cases, monetary compensation in the form of wage adjustments was paid to affected employees. The LS Group complaint was resolved with the understanding that final corrective action would await the outcome of the study. In the matter of the HS Group complaint, which was the subject of a July 15, 1987 Order of another, earlier tribunal, it was expressly understood by the parties that the s. 11 complaint would likewise await final wage gap computations after the conclusion of the study.

13. Each Federal Public Service employee occupies a position which is classified in accordance with the Employer's classification system. The Employer's classification system is comprised of 69 occupational groups, each with its own classification standard (job evaluation system).

14. In the classification system, positions are classified as belonging to occupational groups, sub-groups (where applicable) and levels. Occupational groups are designated by two-letter abbreviations; sub-groups by three-letter abbreviations. A position is the smallest organizational unit and represents a unique set of tasks and duties performed by an individual. The Employer has the same number of positions as it has employees. On the other hand, a job in the Federal Public Service is a grouping of positions which have the same key duties and responsibilities.

15. The occupational groups are assembled into six occupational categories as follows: (i) the Scientific and Professional Category; (ii) the Administrative and Foreign Service Category; (iii) the Technical Category; (iv) the Administrative Support Category; (v) the Operational Category; and (vi) the Executive Category.

16. In March of 1985, the government initiated pro-active measures to implement the principles of equal pay for work of equal value in the Federal Public Service. It invited unions and management to participate as partners in a senior level Joint Union-Management Initiative (the JUMI). The JUMI was directed by a committee (the JUMI Committee). The JUMI Committee was asked to prepare a detailed implementation plan in the area of equal pay for work of equal value. The unions, not only the Alliance, but other unions as well, accepted the government's invitation. The Alliance, at the time of accepting this invitation, had established a consistent policy of supporting the principle of equal pay for work of equal value. At the time of the voluntary initiative, there were three outstanding complaints before the Commission under s.11 of the Act.

17. The action plan agreed to by the JUMI Committee was to conduct a study (the JUMI Study) pursuant to s. 11 of the Act to determine the degree of sex discrimination in pay and to devise methods for system wide correction in order to eliminate sexually based wage disparities (Exhibit HR-11A, Tab 9, Annex B). The Commission was invited to be a participant of the JUMI Study to fulfil the role of an observer at committee meetings and to provide interpretation and guidance when required by the JUMI Committee. (Exhibit HR-11A, Tab 7). The Commission held all s. 11 complaints, which had been filed before the JUMI Study commenced, in abeyance. The Commission agreed that any new complaints received during the JUMI Study, which might be affected by the study, were to be held in abeyance as well.

18. The JUMI Committee had equal representation from the Employer and eight different unions. The JUMI Committee's first task was to define the parameters of the JUMI Study. Pivotal to its operation was the requirement for joint agreement between management and union representatives on the process to be used during the JUMI Study (the JUMI Process). Neither the unions nor management was to act independently or make decisions in the course of the JUMI Study without joint approval. The JUMI Committee hired Willis & Associates, a consulting firm based in Seattle, Washington, to assist in the Study. Willis & Associates was founded and directed by Norman Willis.

19. Early on in the JUMI Study, the JUMI Committee made it abundantly clear to Willis that he had no decision-making authority in the conduct of the JUMI Study. Willis' role was to attend the meetings and to give advice at the request of the JUMI Committee.

20. The JUMI Committee established sub-committees at various stages which were called upon by the JUMI Committee to provide advice, to perform certain tasks, and to make recommendations to the JUMI Committee with respect to particular issues. Agreement by members of the JUMI Committee was required in order to form a sub-committee. Each sub-committee thus formed had equal representation from union and management sides.

21. In the fall of 1987, the JUMI Committee established the Equal Pay Study Secretariat (the EPSS) to conduct the administrative work associated with the JUMI Study. The EPSS was managed by a Treasury Board representative, Pierre Collard. The objective of the EPSS was to provide administrative support to the multiple evaluation committees in the JUMI Study and it was responsible for the coordination of all support activities.

22. In addition to hiring Willis & Associates, the JUMI Committee eventually agreed on other important matters. The JUMI Committee agreed to evaluate positions from male- and female-dominated occupational groups using a common evaluation plan. A comparison of wages paid to male- and female-dominated occupational groups performing work of equal value could then be made. The JUMI Committee agreed the study would be position specific, using a representative sample of positions. A position-specific study means every different job selected for evaluation is evaluated separately, as opposed to predominant-use studies, in which the positions selected for evaluation are those that best represent a classification or grouping of jobs. The JUMI Committee agreed only positions from male- and female-dominated occupational groups, as defined in s. 13 of the Equal Wages Guidelines (the Guidelines), were to be included in the representative sample.

23. As of March, 1985, based on s. 13 of the Guidelines (which prescribes the criteria defining sex predominance), the parties agreed there were 9 female-dominated occupational groups, 53 male-dominated occupational groups and 8 gender-neutral occupational groups. For clarity, s. 13 of the Guidelines is reproduced as follows:

13. For the purpose of section 12, an occupational group is composed predominantly of one sex where the number of members of that sex constituted, for the year immediately preceding the day on which the complaint is filed, at least

  (a) 70 per cent of the occupational group, if the group has less than 100 members;
  (b) 60 per cent of the occupational group, if the group has from 100 to 500 members; and
  (c) 55 per cent of the occupational group, if the group has more than 500 members.
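For illustration only, the s. 13 thresholds can be restated as a simple computation. The following sketch is not part of the Guidelines or of the evidence; the function name and the sample figures (drawn loosely from the CR Group described above) are invented.

```python
# Illustrative sketch only: a hypothetical restatement of the s. 13
# predominance thresholds. Names and figures are invented.

def is_predominantly_one_sex(group_size: int, members_of_sex: int) -> bool:
    """True if the given sex meets the s. 13 threshold for a group of this size."""
    share = members_of_sex / group_size
    if group_size < 100:
        return share >= 0.70   # fewer than 100 members: 70 per cent
    if group_size <= 500:
        return share >= 0.60   # 100 to 500 members: 60 per cent
    return share >= 0.55       # more than 500 members: 55 per cent

# Example: a 50,000-member group that is 80 per cent female easily
# exceeds the 55 per cent threshold for groups of more than 500 members.
print(is_predominantly_one_sex(50_000, 40_000))  # True
```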

24. The nine female-dominated occupational groups represented by the Alliance and the Institute with their abbreviations are listed below:

- Clerical and Regulatory (CR);
- Data Processing (DA);
- Education Support (EU);
- Home Economics (HE);
- Hospital Services (HS);
- Library Science (LS);
- Nursing (NU);
- Occupational and Physical Therapy (OP); and
- Secretarial, Stenographic, Typing (ST).

25. Positions from gender-neutral occupational groups or the Executive Category were excluded from the study. The proposed JUMI Study, although service-wide in nature, was not intended to cover all employees providing services for the Government of Canada. The JUMI Study did not include employees of Crown Corporations nor did it include employees of separate employers. For purposes of the legislation, separate employers are identified in Part II of the PSSRA as follows:

- Atomic Energy Control Board
- Canadian Advisory Council on the Status of Women
- Canadian Security Intelligence Service
- Communications Security Establishment, Department of National Defence
- Economic Council of Canada
- Medical Research Council
- National Film Board
- National Research Council of Canada
- Natural Sciences and Engineering Research Council
- Northern Canada Power Commission
- Northern Pipeline Agency
- Office of the Auditor General of Canada
- Public Service Staff Relations Board
- Science Council of Canada
- Social Sciences and Humanities Research Council
- Staff of the Non-Public Funds, Canadian Forces

26. The sample eventually drawn was representative of positions by groups and levels for female-dominated occupational groups and by group for male-dominated occupational groups. Approximately 2,800 positions from female-dominated occupational groups and 1,500 positions from male- dominated occupational groups were ultimately included in the sample. The sample size and composition met with the approval of Statistics Canada.

27. The JUMI Committee agreed to use the Willis Job Evaluation Plan, with some amendments, as the appropriate job evaluation instrument for evaluating the representative sample of positions. The JUMI Committee also agreed to use the Willis Questionnaire, with amendments, to gather information on the positions to be evaluated. A communications strategy was recommended and agreed upon by a JUMI sub-committee to encourage selected incumbents to participate in the JUMI Study and to provide information on their positions. Position information was then collected from September, 1987 until January, 1989.

28. The JUMI Committee, acting on Willis' advice, established, as a first step in the process of evaluation, a Master Evaluation Committee (the MEC). The MEC was asked to evaluate 503 position questionnaires which were to serve as benchmarks and as a frame of reference for all subsequent evaluations by other evaluation committees. The MEC began its important task in September, 1987 and finished it in July, 1988. In the final analysis, the MEC completed 501 benchmark evaluations.

29. After the MEC completed its evaluations, the remaining evaluations were done by 14 evaluation committees (the multiple evaluation committees). The first five multiple evaluation committees began evaluating in September, 1988. By April, 1989, they had evaluated approximately 1,283 positions. In April, 1989, the multiple evaluation committees were expanded from five to nine. The nine committees included some members of the first five multiple evaluation committees as well as new members. The expanded committees evaluated approximately 1,400 positions between April, 1989 and September, 1989.

30. In May of 1989, the JUMI Committee decided, in view of the slow pace at which the questionnaires were being evaluated, that the sample size should be reduced by approximately 880 positions. The JUMI Committee then agreed to reduce the original sample from 4,300 positions to approximately 3,280 positions. The Office of the Chief Statistician for Statistics Canada was advised of the nature and reasons for the reduction in the sample size and approved the reduction. In the end, the MEC and the 14 multiple evaluation committees evaluated 3,185 positions from the reduced sample of positions.

31. The Commission's representatives functioned as observers throughout the evaluations of the MEC and the multiple evaluation committees. They were present during the meetings of the JUMI Committee and meetings of the multiple evaluation committee chairpersons.

32. Overall, the JUMI process had a number of shortcomings, largely due to the manner in which it operated. According to Willis, the JUMI Committee was ill-formed. Rather than working as a team, the JUMI Committee functioned in a negotiating mode with the unions on one side and the Employer on the other. As described by Willis, each side spoke with one voice. Because the Employer represented a singular position, this required the unions to caucus in order to respond in one voice. Rather than a joint union-management committee working together as a team, the proceedings were akin to union-management bargaining.

33. As a result, many decisions took a great deal of effort and time and were not easily or amicably achieved. For example, after the first JUMI Committee meeting which was held on September 16, 1985, it took until September 22, 1986, one year later, for the parties to reach an agreement on the Terms of Reference and Action Plan for the JUMI Study.

34. The length of time needed to carry out the JUMI Study prompted the Chief Commissioner of the Commission, on different occasions, to urge the President of the Treasury Board to resolve the outstanding issues occupying the JUMI Committee.

35. Another problem in the JUMI process was the inability of the management and union sides to reach closure on some major aspects of the JUMI Study. For example, when the MEC had completed its benchmark evaluations, Treasury Board withheld whole-hearted support of those evaluations. Although Treasury Board agreed to proceed with the rest of the evaluations, it continued to harbour doubts and indicated its intention to study the reliability of the MEC benchmarks independently.

36. Problems also arose during the course of the multiple evaluation committees' evaluations. Willis recommended disbanding one of the original five multiple evaluation committees. The JUMI Committee rejected this recommendation and could not agree on a resolution. In addition, there were some multiple evaluation committee challenges to the MEC benchmark evaluations. The JUMI Committee established a smaller version of the MEC (the Mini-MEC) to review and discuss these challenges. The Mini-MEC could not reach a consensus so in the end the matter was never fully resolved.

37. The JUMI Study was intended to encompass four phases. These phases were to be as follows:

Phase I

Agreement on the common evaluation plan to be used to determine the relative value of jobs and on the evaluation of benchmark positions.

Phase II

Agreement on the statistical methodology for sampling actual positions.

Phase III

Sampling and evaluation of actual positions, using the agreed to evaluation plan with benchmarks.

Phase IV

Determination of the degree of wage disparity and recommendations on corrective measures. These may include recommendations to resolve discriminatory aspects of the classification system which contribute to wage inequity as defined in Section 11 of the Canadian Human Rights Act.

(Exhibit HR-11A, Tab 9)

38. During the life of the JUMI Study, tension between the management and union sides persisted and intensified. There was disagreement between the union and management sides relating to the release of evaluation scores. The JUMI Committee agreed the data would be released after two-thirds of the evaluations were completed. According to Willis, following the release of the MEC evaluation scores on July 13, 1988, relationships in the JUMI Committee began to deteriorate. It then became apparent to Willis that the climate of the JUMI Committee had changed. When the MEC results were made available to the parties, the Employer's classification system became an issue for the Employer. Willis was troubled and mystified by correspondence he received from the management co-chair on August 18, 1988, which indicated the parties were not ad idem on the purpose of the JUMI Study.

39. During the last few months of the JUMI Study, an issue arose between the union and management sides relating to a report released by Willis & Associates concerning re-evaluations by a Willis consultant of 222 multiple committee evaluations. This issue was never resolved by the JUMI Committee and eventually led to the final breakdown of the JUMI Study.

40. The parties had contemplated eventual agreement upon a joint recommendation to the President of Treasury Board for implementation of pay equity. Phase IV of the JUMI Study was never achieved. After approximately four years, in December, 1989, the union side withdrew from the JUMI Study on a temporary basis. In January, 1990, the largest participant union in the JUMI Study, the Alliance, permanently withdrew from the JUMI Study.

41. Early in 1990, the Government of Canada made a decision to unilaterally implement immediate measures to achieve equal pay for work of equal value for female-dominated occupational groups in the Federal Public Service. The measures adopted by the government were based on the evaluation results of the JUMI Study with corrective adjustments for gender bias arising from the controversial report by Willis & Associates on the 222 re-evaluations. Those measures were referred to as the public service equal pay adjustments or the equalization payments. The equalization payments were applied to three female-dominated occupational groups, the CR, NU, and ST Groups.

42. Neither the Commission nor the Alliance, nor any of the other participant unions, was consulted by the Employer prior to the making of these voluntary adjustments. The parties were first informed of the Employer's decision when the President of the Treasury Board made an announcement on January 26, 1990. The adjustments involved payments of approximately $317 million for wages retroactive to April 1, 1985 and payments of $76 million annually in continuing adjustments. The lump sum payments by the government were made retroactive to March 31, 1985, the month in which the Treasury Board President first announced the establishment of the Joint Union-Management Committee to study how gender based wage discrimination would be eliminated in the Federal Public Service.

43. After the breakdown of the JUMI Study, the Commission and the Alliance made it clear to the Employer that the data generated by the JUMI Study would be presented as evidence to a Human Rights Tribunal.

44. The formal investigation of the s. 11 complaints lodged with the Commission commenced following the announcement of the equalization payments. Included in its investigation was an examination by the Commission of the equalization payments. This exercise was done to ensure full adherence to the Act and the Equal Wages Guidelines. Following a formal six-month investigation, the Commission decided to refer the s. 11 complaints to a Tribunal. That decision was made on October 16, 1990.

45. During the course of this hearing, when the Commission attempted to introduce the JUMI data into evidence, it was met with an objection by the Employer that the data was inadmissible on the grounds it had been created in an effort to resolve or avoid litigation and should therefore be treated as privileged. A voir dire was conducted by the Tribunal on this issue and, following its completion, the Tribunal dismissed the Employer's objection in a ruling rendered August 21, 1992 (see Voir Dire Ruling for further details).

46. The Employer alleges the job evaluation data generated in the course of the study is not sufficiently reliable for the adjudication of the complaints referred to the Tribunal. The Employer is not satisfied with the reliability of the evaluation results. The Employer's equalization payments indicate the extent to which the Employer is willing to rely upon the evaluation results. The Commission and the Alliance are seeking to use the evaluation data for a determination of wage disparity and pay adjustments under s. 11 of the Act.

II. ISSUE

47. As a result of a pro-active initiative by the Employer, the Complainant, together with 13 other public sector unions, and the Respondent entered into a pay equity study called the Joint Union/Management Initiative.

48. The JUMI Study began in 1985 and lasted until January, 1990, when it was aborted, first by the Complainant and then by the Respondent. The Complainant and the Respondent produced, over that period of time, job evaluation results.

49. Prior to the commencement of the JUMI Study, the Complainant had filed with the Commission a s. 11 wage discrimination complaint against the Respondent. After the breakdown of the JUMI Study, the Complainant filed a second and new complaint against the Respondent.

50. The Commission and the Complainant intend to use the job evaluation results from the JUMI Study as evidence of the value of work performed by male and female employees whose jobs are the subject of these complaints. The Commission and the Complainant further intend to use the job evaluation results as proof of a wage gap alleged by these complaints as contrary to s. 11 of the Act.

51. The Respondent submits the job evaluation results are unreliable for purposes of adjudication. More specifically, the Respondent alleges the job evaluation results are biased, inasmuch as the male-dominated questionnaires and the female-dominated questionnaires used to produce the results were treated differently by the individuals who performed the evaluations.

52. Therefore, the issue is whether or not the job evaluation results of the JUMI Study are reliable for purposes of the s.11 complaints referred to this Tribunal for deliberation.

III. LEGISLATION

53. The complaints before us allege wage discrimination on the basis of sex contrary to s.11 of the Act. Section 11 states:

11(1) It is a discriminatory practice for an employer to establish or maintain differences in wages between male and female employees employed in the same establishment who are performing work of equal value.

(2) In assessing the value of work performed by employees employed in the same establishment, the criterion to be applied is the composite of the skill, effort and responsibility required in the performance of the work and the conditions under which the work is performed.

...

(5) For greater certainty, sex does not constitute a reasonable factor justifying a difference in wages.

54. The equal pay for work of equal value provisions of s. 11 of the Act were the subject of a Supreme Court of Canada decision in the case of Syndicat des employés de production du Québec et de l'Acadie v. Canada (Canadian Human Rights Commission), [1989] 2 S.C.R. 879. That decision dealt with the issue of whether the Canadian Human Rights Commission's decision to dismiss a complaint pursuant to s. 36(3)(b) of the Act is required by law to be made on a quasi-judicial basis and, accordingly, reviewable by the Federal Court of Appeal under s. 28 of the Federal Court Act. The majority of the Court held that the Commission's decision was not reviewable by the Federal Court of Appeal under s. 28 of the Federal Court Act and thus, the Commission's decision was not one required to be made on a judicial or quasi-judicial basis.

55. Although the interpretation of s. 11 of the Act was not integral to the majority decision, Sopinka J., delivering the judgment of the majority, said at p. 903:

The intention of s.11 is to prohibit discrimination by an employer between male and female employees who perform work of equal value and not to guarantee to individual employees equal pay for work of equal value irrespective of sex.

56. In our view, as expressed by Sopinka J., the wording of s. 11 prohibits any practice by an employer to differentiate on the basis of sex when determining the wages or compensation to be paid between its male and female employees who perform work of equal value. For greater certainty, s. 11(5) makes it clear that sex does not constitute a reasonable factor justifying a difference in wages. Other sections of the Act also refer to prohibitions on the basis of sex. Section 3(1) of the Act includes sex as one of the prohibited grounds of discrimination. Section 7 of the Act declares that it is a discriminatory practice to refuse employment or differentiate adversely during the course of employment on a prohibited ground, i.e., sex. Section 10 of the Act declares that it is a discriminatory practice to establish or pursue a policy or practice or to enter into an agreement affecting recruitment, referral, hiring, promotion, training, apprenticeship, transfer or any other matter relating to employment or prospective employment that deprives or tends to deprive an individual or class of individuals of any employment opportunity on a prohibited ground of discrimination, i.e., sex.

57. The discriminatory practice alleged in the complaints before the Tribunal is that the Employer maintains a difference in wages between male and female employees employed in the same establishment who are performing work of equal value, contrary to s. 11. There are certain exceptions to the statutory prohibition against wage discrimination as stated by s. 11(4) of the Act. That section reads:

11(4) Notwithstanding subsection (1), it is not a discriminatory practice to pay to male and female employees different wages if the difference is based on a factor prescribed by guidelines, issued by the Canadian Human Rights Commission pursuant to subsection 27(2), to be a reasonable factor that justifies the difference.

58. The brief legislative history of s. 11 finds that the Government of Canada declared in 1976 that it would introduce a human rights bill. The major effect of the bill would be to prohibit discrimination on the grounds of race, colour, national or ethnic origin, religion, age, sex, marital status or physical handicap. In particular, the bill would establish the principle of equal compensation for work of equal value performed by persons of either sex. (Exhibit PIPSC-82).

59. The Background Notes to the Canadian Human Rights Bill, issued by the then Minister of Justice, indicate that the bill would consider, in relation to a prohibited ground, discriminatory practices such as the differentiation in wages based on sex between workers performing work of equal value. The notes state, at p. 4:

This provision is designed primarily to cope with female `work ghettoes'; it would enable workers performing one sort of job, such as secretarial work, to have their compensation related not only to that of other secretaries, but also to other jobs of equal value in the firm.

(Exhibit PIPSC-82, p. 3)

60. In 1977, the Government of Canada enacted the Act. The intent of s. 11 of the Act is to ensure that men and women who perform work of equal value receive equal compensation. Section 11 came into force on March 1, 1978. Section 27(2) of the Act authorizes the Canadian Human Rights Commission to pass guidelines interpreting the provisions of the Act. Since the proclamation of the Act in 1978, the Guidelines have twice been promulgated by the Commission. The first set of Guidelines passed pursuant to the Act was prepared to assist in the interpretation of s. 11 of the Act and was issued on September 18, 1978. These were revoked by the Guidelines dated November 18, 1986, and gazetted in December, 1986.

61. The 1986 Guidelines describe the manner in which s. 11 of the Act is to be applied and the factors that are considered reasonable to justify a difference in wages between males and females performing work of equal value in the same establishment. The 1986 Guidelines prescribed ten factors justifying a pay differential between male and female employees performing work of equal value. None of these exceptions plays a role in these complaints.

62. The dissenting opinion in Syndicat, supra, is helpful because it does address some of the prerequisite elements necessary to build a case under s. 11 of the Act. The dissent was delivered by L'Heureux-Dubé J. in which her Ladyship refers to earlier decisions of the Supreme Court of Canada, namely, Robichaud v. Canada (Treasury Board), [1987] 2 S.C.R. 84 and Canadian National Railway Co. v. Canada (Canadian Human Rights Commission), [1987] 1 S.C.R. 1114 (sub nom: Action Travail des Femmes) which reviewed complaints based on ss. 7 and 10 of the Act respectively. Both decisions make clear statements that intent is not a precondition to a finding of adverse discrimination under the Act. L'Heureux-Dubé J. notes the scope of protection under s. 11 differs from ss. 7 and 10 and says at p. 925:

As intent is not a prerequisite element of adverse discrimination, a complainant may build his or her case under ss. 7 and 10 by presenting evidence of the type adduced by the complainant in the present case. Statistical evidence of professional segregation is a most precious tool in uncovering adverse discrimination. Section 11, however, differs from ss. 7 and 10. Its scope of protection is delineated by the concept of equal value. That provision does not prevent the employer from remunerating differently jobs which are not equal in value. Wage discrimination, in the context of that specific provision, is premised on the equal worth of the work performed by men and women in the same establishment. Accordingly, to be successful, a claim brought under s. 11 must establish the equality of the work for which a discriminatory wage differential is alleged.

63. L'Heureux-Dubé J. is of the opinion that a complainant may build a case under ss. 7 and 10 without presenting or including as part of its case the element of intent. In her Ladyship's words, statistical evidence is a most precious tool in uncovering adverse discrimination.

64. L'Heureux-Dubé J. asserts that although the principle of equal pay for work of equal value is expressed in a straightforward manner, its application under s. 11 of the Act raises considerable difficulties. She maintains the concept is simple only in appearance. One element of difficulty is the concept of equality which, in her view, should not receive a technical or restrictive interpretation. In referring to the concept of equality, L'Heureux-Dubé J. says at pp. 926-27:

The prohibition against wage discrimination is part of a broader legislative scheme designed to eradicate all discriminatory practices and to promote equality in employment. In this larger context s.11 addresses the problem of the undervaluing of work performed by women. As this objective transcends the obvious prohibition against paying lower wages for strictly identical work, the notion of equality in s. 11 should not receive a technical or restrictive interpretation.

65. Another such difficulty, according to L'Heureux-Dubé J., lies in the concept of value. At p. 928, she states:

The notions of `skill', `effort', `responsibility' and `conditions' which one finds in the Act and the companion Equal Wage Guidelines are terms of art. They refer to the areas traditionally measured by industrial job evaluation plans.

66. Section 11(2) defines, in general terms, the manner in which the value of the work is to be assessed and establishes four criteria, namely, skill, effort, responsibility and working conditions. The criteria are defined in greater detail in s. 3 of the Guidelines, the companion to s. 11.

67. Madame Justice L'Heureux-Dubé observes that it is more than coincidence that the same four words used in this legislation were also used in the American counterpart and that these words are an indication that job evaluation plans can be used to determine whether jobs are of equal value under s. 11. However, she is of the opinion that the use of a job evaluation plan is not necessarily the only approach to the implementation of the provisions of s. 11. It is the Commission's view, as expressed in evidence by Durber, that in a s. 11 complaint, equality of work can be established through the use of a job evaluation plan but, may also be established through other less formal methodologies.

68. The Tribunal heard expert evidence that the purpose of a job evaluation plan is, in the context of a s. 11 complaint, to determine the relative worth of jobs within an organization. It involves a systematic process which first defines and establishes factors which relate to the four criteria identified in s. 11(2) of the Act. The factors are weighted against each other for their relative importance. Each job is assessed against each factor to develop a hierarchy of jobs. Various steps or stages are involved before a hierarchy is developed which include gathering job information, defining the jobs considered for evaluation, evaluating each job and assigning scores for each compensable factor.
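To make the mechanics concrete, the following minimal sketch shows how a point-factor plan of this general kind turns per-factor ratings into a hierarchy of jobs. It is illustrative only: the weights, ratings and job names are invented and are not the Willis Plan or any evidence before the Tribunal.

```python
# A hypothetical point-factor evaluation: invented weights and ratings,
# shown only to illustrate the process described above.

FACTOR_WEIGHTS = {            # relative importance of the four criteria
    "skill": 0.40,
    "effort": 0.20,
    "responsibility": 0.30,
    "working_conditions": 0.10,
}

def job_score(ratings: dict) -> float:
    """Combine per-factor ratings (0-100 here) into one weighted score."""
    return sum(FACTOR_WEIGHTS[f] * ratings[f] for f in FACTOR_WEIGHTS)

jobs = {
    "Job A": {"skill": 70, "effort": 50, "responsibility": 60, "working_conditions": 30},
    "Job B": {"skill": 55, "effort": 65, "responsibility": 75, "working_conditions": 40},
}

# Rank jobs by total score to form the hierarchy.
for name in sorted(jobs, key=lambda n: -job_score(jobs[n])):
    print(name, job_score(jobs[name]))   # Job B 61.5, Job A 59.0
```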

69. L'Heureux-Dubé J. commented on the use of job evaluation plans and the number of steps involved at p. 931:

All steps of such a job evaluation plan involve a measure of subjectivity. Social beliefs which have traditionally led to the undervaluing of women's work may bring a certain measure of bias in the design and application of these methods. To illustrate, job content information which is supplied by the employees can contain certain characteristics which, as a result of underlying values, may be overlooked in the assessment. There may be confusion between truly compensable characteristics and stereotyped notions of what are perceived to be inherent attributes of being a woman.

70. These comments were echoed by pay equity experts who testified at this hearing. While job evaluation procedures can be controlled to a certain extent, job evaluation remains an inherently subjective process. The value assigned to each job is an expression of opinion given by individuals and is a judgment call by the evaluators. According to Willis, the pay equity expert and consultant to the JUMI Committee, such a procedure may incorporate both random and systematic errors of judgment.

71. Willis testified that random errors are to be expected in an undertaking as large as the JUMI and can result from a lack of sufficient job information, assumptions about particular job aspects, inconsistent application of the Willis Plan (the job evaluation plan), or simply from a differing interpretation of the job information. Willis indicated that while random differences are expected and tend to cancel each other out, patterned differences are not expected and do not cancel each other out. These patterned differences, or systematic errors of judgment, according to Willis, are evidence of bias on the part of evaluators and should be avoided.
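Willis' distinction can be shown numerically. The sketch below uses invented numbers, not JUMI data: independent random errors leave the average score close to its true value, while a constant systematic bias shifts the average and does not cancel.

```python
# Hypothetical illustration of random versus systematic error.
import random

random.seed(1)
TRUE_VALUE = 500

# Random error only: independent noise around the true value.
random_only = [TRUE_VALUE + random.gauss(0, 20) for _ in range(1000)]

# Systematic error: the same noise plus a constant downward bias,
# as when evaluators consistently under-rate one group of jobs.
biased = [TRUE_VALUE + random.gauss(0, 20) - 15 for _ in range(1000)]

mean = lambda xs: sum(xs) / len(xs)
print(round(mean(random_only)))  # about 500: random errors largely cancel
print(round(mean(biased)))       # about 485: the bias persists in the mean
```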

72. Weiner, one of several pay equity experts who testified before the Tribunal, referred to the wage discrimination identified in s. 11 of the Act as one type of systemic discrimination. She describes the unintentional aspect of systemic discrimination in Volume 6, at p. 875, as follows:

Systemic discrimination is unintentional, impersonal, built into ongoing systems, often referred to as neutral systems, because they were never designed to discriminate.

Also, in Volume 6, at p. 877:

Systemic discrimination operates in systems. It goes on and on and on in the policy books and no one designed them to discriminate so it become [sic] much more difficult to identify that discrimination.

73. According to Weiner, this discrimination emanates from the practices and processes of an employer relating to compensation rather than from individual actions.

74. Of significance to the interpretation of systemic discrimination is the Supreme Court of Canada decision in CN, supra. In that decision, the Court upheld an order of a Canadian Human Rights Tribunal which imposed upon the Canadian National Railway a special employment program for employment equity. In upholding the remedial order, Dickson C.J., as he then was, in referring to the proper interpretative attitude toward human rights codes and acts said at p. 1134:

Human rights legislation is intended to give rise, amongst other things, to individual rights of vital importance, rights capable of enforcement, in the final analysis, in a court of law. I recognize that in the construction of such legislation the words of the Act must be given their plain meaning, but it is equally important that the rights enunciated be given their full recognition and effect. We should not search for ways and means to minimize those rights and to enfeeble their proper impact.

75. Dickson, C.J. elaborated on the purpose and objective of human rights legislation and on the Court's general attitude towards the interpretation of such legislation which is to give an interpretation that will advance the legislation's broad purposes. He referred to the Supreme Court's decision in Ontario Human Rights Commission v. Simpsons-Sears Ltd., [1985] 2 S.C.R. 536, which recognized that human rights legislation is directed not only at intentional discrimination but unintentional discrimination as well, and prohibits discrimination in situations of adverse affect discrimination.

76. The Supreme Court of Canada in CN, supra, recognized systemic discrimination in the context of employment equity as distinct from equal pay for work of equal value referred to by Weiner in her discussion relating to s. 11 of the Act. The Supreme Court recognized that s. 15(1) and by extension s. 41(2)(a) of the 1976-77 Canadian Human Rights Act as amended in 1985 were designed to resolve the problem of systemic discrimination. Dickson C.J. described systemic discrimination, at p. 1139, as follows:

In other words, systemic discrimination in an employment context is discrimination that results from the simple operation of established procedures of recruitment, hiring and promotion, none of which is necessarily designed to promote discrimination. The discrimination is then reinforced by the very exclusion of the disadvantaged group because the exclusion fosters the belief, both within and outside the group, that the exclusion is the result of natural forces, for example, that women just can't do the job...To combat systemic discrimination, it is essential to create a climate in which both negative practices and negative attitudes can be challenged and discouraged. The Tribunal sought to accomplish this objective through its Special Temporary Measures Order.

77. In his decision, Dickson C.J. emphasized that the Order of the Tribunal under review there was made to implement an employment equity program which was not simply compensatory but also prospective in its provisions, so as to confer benefits designed to improve employment opportunities for the affected group in the future. Further, Dickson C.J. reasoned that such a program was designed to break the continuing cycle of systemic discrimination in the employment of women. Dickson C.J. was of the opinion that the goal of the legislation, specifically with reference to s. 41(2)(a), was to eliminate the insidious barriers which would otherwise subject future job applicants, that is to say women, to the unfair employment practices that their forebears had experienced as a group. It was not, on the other hand, concerned so much with compensating past victims of discrimination or providing employment opportunities previously denied to specific individuals.

78. Dickson C.J. found the goal was not to compensate past victims or even to provide new opportunities for specific individuals who had been unfairly refused jobs or promotions in the past, rather it was an attempt to ensure that in the future applicant workers from the affected groups would not face the same insidious barriers that blocked their forebears.

79. In that case, the Chief Justice agreed with McGuigan J., the dissenting member of the Federal Court of Appeal, who found that s. 41(2)(a) of the Act (now s. 53(2)(a)) is designed to enable human rights tribunals to prevent future discriminatory employment practices against identifiable protected groups. The Chief Justice also reasoned that in an employment equity program there simply cannot be a radical dissociation of remedy and prevention. Further, he held that prevention is a broad term and that it is often necessary to refer to historical patterns of discrimination in order to design appropriate strategies for the future.

80. We find that s. 11 does not specifically recognize the phenomenon we referred to as systemic discrimination and is not a well-designed vehicle for breaking the cycle of discrimination. The comments of Dickson, C.J. in the CN case, supra, need to be taken in context. In that case, an Order made by a Tribunal pursuant to s. 41(2)(a), now s. 53(2)(a), requiring the Canadian National Railway to adopt a special employment equity program in relation to the affected female group who were seeking blue collar jobs, was under appeal. It arose from a complaint based on discriminatory employment practices and was decided in 1986.

81. The description of systemic discrimination by Dickson, C.J. in the CN case, supra, is, in our view, the kind of unintentional discrimination which s. 11 was designed to eliminate.

82. According to expert opinion, systemic discrimination has no single focus or origin; it develops over time. It is an attitudinal phenomenon which undervalues female work and thus differentiates against an individual or group based on gender or sex. Research has documented that the group of people most commonly affected by this type of discrimination is females, whose wages and salaries, relative to male wages and salaries, are lower. This kind of discrimination is rooted in attitudes, beliefs and mind sets about work traditionally performed by males and work traditionally performed by females.

83. Counsel for the Commission submitted s. 11 is part of a statutory regime which prohibits systemic discrimination on the basis of sex and the payment of different wages between groups of predominantly male and predominantly female employees performing work of equal value. Commission Counsel further submitted s. 11 is designed to remedy the historical undervaluing of female work and to address gender discrimination in pay. Counsel submits that proof of gender discrimination in pay is found if there is a wage gap between male- and female-dominated occupational groups performing work of equal value. (Volume 218, p. 28424).
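The comparison counsel describes can be pictured with a minimal sketch. The groups, point scores and salaries below are invented for illustration; they are not evidence from the JUMI Study.

```python
# Hypothetical wage-gap comparison at equal evaluated value.

# (group, dominant sex, evaluated points, annual salary) - invented data
groups = [
    ("Group M1", "male", 600, 42_000),
    ("Group F1", "female", 600, 36_500),
]

male = next(g for g in groups if g[1] == "male")
female = next(g for g in groups if g[1] == "female")

if male[2] == female[2]:          # same evaluated points: work of equal value
    gap = male[3] - female[3]     # the wage gap s. 11 is concerned with
    print(f"Wage gap at equal value: ${gap:,} per year")  # $5,500
```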

84. It is important at this point to understand the meaning of wage gap within the context of s. 11 of the Act. The Tribunal had the benefit of expert evidence from Armstrong, with expertise in job evaluation and pay equity, who described the overall wage gap between prevalent rates of pay earned by females as compared to males. Armstrong testified that in order to comprehend the wage gap one must understand the underlying factors which may have contributed to it. She stated there may well be some legitimate and unchangeable factors responsible to some extent for the existence of the wage gap.

85. A wage gap is not something clearly delineated. The Tribunal recognizes that salary differentials between male and female jobs can be a function of job requirements making some jobs intrinsically more valuable to the employer than other jobs. Such differentials are in contrast to differentials which are based entirely on gender differences and it is the latter resulting wage gap which the Tribunal believes s. 11 is intended to eliminate.

86. Section 11 incorporates the concept of equal pay for work of equal value in its wording. Weiner testified there are two questions which arise when one invokes this concept in the context of evaluation of jobs employing the same criteria: firstly, what is meant by equal value and, secondly, what is meant by equal pay. Weiner equates the concept of equal pay for work of equal value with the concept of pay equity.

87. The evidence before the Tribunal is that pay equity legislation addresses a trend that assumes systemic discrimination against female-dominated jobs. Some provinces have enacted pay equity legislation to remedy pay discrimination by identifying and redressing the wage gap through the implementation of pay equity plans. This latter legislation is pro-active because, Weiner says, its motive and intent is to provide a framework for redressing wage discrimination, rather than laying blame upon employers or unions for historical wage discrimination. The difference between pro-active legislation and s. 11 is that s. 11 is complaint-based legislation, whereby a complainant alleges discrimination against some identified comparator group. Since s. 11(1) talks about discrimination between male- and female-dominated jobs either way, Weiner says that presumably under s. 11 one could have a male job alleging discrimination.

88. While the principle of equal pay for work of equal value underpins the provisions of s.11 and is frequently expressed as pay equity, there is in current usage of that phrase a pro-active connotation. There is, in fact, a significant difference between the principle enshrined in s. 11 which is complaint based and the pro-active approach to the problem of wage disparity which the experts in the field today accept and refer to as pay equity. The comments of Weiner in her testimony before the Tribunal are instructive and illustrative of the problem which is encountered when applying the principles of the Act and in particular s. 11 to remedying systemic discrimination in the work force. She stated in Volume 16, at p. 2124:

I agree with you that the Human Rights Commission law, including Section 11, is written with a complaint-based mind set. I think that was a mistake, but we didn't know that in 1977 when it was written. And really, while that makes a great deal of sense for many of the kinds of issues the Human Rights Act has to deal with, it doesn't fit as well with the systemic discrimination of something as complicated as the wage setting process.

So I think you are right, there is, to my mind, an anomaly of law makers, including a methodology that fit our 1970s thinking of how discrimination operated with some forward thinking about another problem, but not recognizing that the equal value pay equity problem was a systemic problem and didn't fit as well with a complaint-based mentality. [emphasis added]

89. Weiner, who is co-author of Pay Equity: Issues, Options and Experiences with Morley Gunderson, summarizes at the end of Chapter 8, at pp. 127-28, their conclusion regarding the federal legislation as follows:

That pay equity is an idea whose time has come is demonstrated by the initiation of pay equity in eight Canadian jurisdictions since 1985. In the previous ten years, only two jurisdictions had passed pay equity legislation. Unlike most of the subsequent legislation, these two early pieces of legislation, in the federal government and in Quebec, were complaint-based. The inability of such legislation to address a systemic problem like pay equity is evidenced by the employer-initiated enforcement mechanism in most of the recent legislation.

90. Some change has been instituted through the political movement in the United States to enact comparable worth plans which, in turn, has created a framework within which previously invisible or unacknowledged skills associated historically with female and minority work were made visible and worthy of compensation. The parallel pay equity movement in Canada saw the enactment of provincial legislation designed to redress systemic wage discrimination in compensation for work performed by employees in female-dominated jobs. Of relevance is the preamble to the Pay Equity Act (Ontario), 1987, which states that affirmative action is required to redress systemic wage discrimination. However, the legislative history of s. 11 does not document the same political motivation contained in that legislation or in other provincial legislation found in Manitoba, Ontario, Prince Edward Island, Nova Scotia and New Brunswick.

91. Provincial legislation is aimed at correcting systemic discrimination and provides a time frame and a procedure for achieving pay equity. The approach in the provincial legislation is future oriented and, while recognizing past injustices, the remedies are focused on achieving equity in employment as well as in pay. On the other hand, s. 11 of the Act is complaint based and is silent on the means for achieving equal pay for work of equal value. While the Guidelines passed pursuant to the Act expand on the four essential elements of s. 11(2), i.e., skill, effort, responsibility and working conditions, and define how value is to be assessed, who is an employee, what a group is and so on, they do not establish a programme or describe an appropriate methodology for achieving the goal of eliminating systemic discrimination. It is a phenomenon which is not expressly referred to either in the Act or in the Guidelines.

92. Referring again to Weiner commenting on s. 11, she states in Volume 16, at p. 2125, that the legislation does not recognize that equal pay for work of equal value was a systemic problem which did not fit well with a complaint-based mentality. Therein lies the difficulty with s. 11, which is not entirely compatible with the evolution and application of the principles of pay equity (or comparable worth) during the past two decades. Nevertheless, it is necessary, in view of the general nature and intent of the legislation which is to combat systemic discrimination, to adopt the reasoning of Chief Justice Dickson, at p. 1139 of CN, supra, where he states:

...it is essential to create a climate in which both negative practices and negative attitudes can be challenged and discouraged.

93. The wage setting process in the Federal Public Service is a highly complex process spanning many decades, each of which contributed new trends and developments, most notably, the introduction of collective bargaining in the 1960s. The advent of collective bargaining brought about contract negotiations which in turn affect the determination of wage rates. For the most part, job classification in the Federal Public Service has been determined by a job evaluation process; however, no single process has ever been stipulated and the result is a classification structure of multiple occupational groups with no common job evaluation plan. Rates of pay have been arrived at through this process with the aid of labour market surveys, largely provided by the Pay Research Bureau until 1992. It is apparent the classification system has been undergoing reform since 1990.

94. Evidence was led that the Government of Canada is committed to simplifying the job classification system in the Public Service through an initiative entitled PS2000. Part of this initiative is a commitment to compensate employees equitably, in a manner that is free of gender bias and maintains equal pay for work of equal value. A new classification system is being introduced to meet these commitments.

95. Documentary evidence reveals a Task Force has been examining and developing this initiative and, in November 1992, produced a draft pamphlet as a reference guide for public sector employees on preparing what are referred to as gender-neutral work descriptions.

96. The expert evidence reveals that compensation systems that rely on market surveys can result in wage disparities for jobs deemed to be of equal value. Research has shown that the market reflects an historical pattern of lower wages to employees in positions staffed predominantly by females. For the most part, market rates are established through the use of traditional job evaluation systems, which perpetuate the undervaluation of female work because they were not designed to capture the skills associated with it.

97. The pay equity experts explain that gender bias is reflected in existing compensation systems and pay practices. Historically, these systems and practices have undervalued female work. Since the purpose of s. 11 is to remove gender discrimination from pay, based on the intrinsic value of a job, any job evaluation system used to assess job value must be designed to eliminate factors that contribute to gender bias and include factors that will capture skills associated with female work which have, in the past, been overlooked.

98. We do find that s. 11 is a remedial section dealing with salary inequities which arise between jobs that are deemed by some process of evaluation to be of equal value. The salary inequity, or resulting wage gap, is the salary differential between rates of pay for male and female employees who are performing work of equal value and not the overall wage gap referred to by Armstrong. She was referring generally to differences in pay between males and females which can result from factors in addition to gender inequities. Armstrong's response to the following question in Volume 179, at p. 22879, lines 2 - 7, is informative:

If pay equity were to be achieved in all occupations, in all jobs, would the wage gap disappear?

THE WITNESS: The overall wage gap probably wouldn't disappear completely, no. There still might be a difference.

99. We must be assured the complaints seek to redress a wage gap based on wage differentials that are gender based and not resulting from other factors. It seems apparent that the existence of a wage gap per se is not proof of discrimination. To hold otherwise would negate the entire evaluation process which has, as its purpose, the comparison of jobs according to a plan or system for rating work according to the criteria prescribed in s. 11(1) of the Act.

100. We also find s. 11 is designed to eliminate economic inequality created by gender based wage discrimination. The discrimination is unintentional as the decision of Dickson C.J. in the CN case, supra, makes clear. It is nevertheless a subtle form of discrimination built into employment practices as they have existed over the years since females have become contributors to the work force. We recognize from the expert testimony of Weiner, Armstrong and Willis that systemic discrimination operates in systems and becomes incorporated into the wage setting practices of organizations and that classification of jobs may be the by-product of systemic discrimination. Since systemic discrimination is part of a system never designed to discriminate, Weiner says that it cannot be corrected instantly nor can pay equity be achieved quickly.

101. As remedial legislation, s. 11 addresses pay disparities in employers' compensation practices. Willis testified to the effect that s. 11 is not true pay equity legislation, but instead concerns itself with examining pay disparities and is probably a first step in the direction of true pay equity insofar as it requires that the wages of females be moved up to the same level as those of males. Willis testified in Volume 29, at p. 3760, line 22 to p. 3761, line 9:

The concept of pay equity has to do with compensation without gender bias; that is, compensation based on the intrinsic value of a job rather than the market value of a job.

I recognize there is a school of thought that says: Don't bother with job evaluation, just give us the money.

But in order to logically arrive at an intrinsic value of a job, you exercise a job evaluation plan; that is, job evaluation provides a way of taking any job apart and examining the amount of skill, effort, responsibility, and conditions of work that are required.

102. If employers use job evaluation systems that are gender biased in favour of male work, the result will be seen in differential wages paid to male and female jobs that ought to be considered of equal value. Job evaluation systems that traditionally favour male jobs do not value the skills and job content of jobs that are designated female work. Traditional job evaluation is most often designed to value characteristics of male work. On the other hand, pay equity job evaluation has as its goal the use of systems that remove gender bias in the valuing of work.

103. At this point, it is useful to recall some of the circumstances leading up to the issue before us. Under the JUMI, the parties engaged in a proactive study with the intention of developing parameters in which to implement the principle of equal pay for work of equal value as incorporated in s. 11 of the Act. The parties employed a pay equity expert, Willis, to assist with this study and used the Willis Job Evaluation Plan to assess selected jobs from male- and female-dominated occupational groups in the Federal Public Service. The JUMI never completed its task. The data generated in that study is now in evidence before the Tribunal. It was used by the Commission in its investigation of the complaints and is presented as proof of a breach of s. 11 of the Act. The Commission and the Alliance have called upon the Tribunal to accept the evaluation scores as evidence of the value of work. These results, they submit, can be used to establish the equality of work and are proof of a wage gap. It is alleged by the Employer that the results are not reliable. We are called upon to determine whether these results are reliable.

104. We have previously referred to the steps involved in the use of the job evaluation plan as discussed by L'Heureux-Dubé J. in the Syndicat decision, supra. The Tribunal heard lengthy evidence on the Willis Process which incorporates the typical steps involved in job evaluation. The reliability of the results, which is the issue before us, focuses on one particularly important step in job evaluation, namely, its application by the evaluators who had the responsibility of analyzing the job information and assigning points or scores to each of the job factors in the Willis evaluation plan.

105. Willis and Weiner agree that, for job evaluation to be effective in eliminating bias, it must be approached in a systematic fashion. At the same time, one needs to understand that job evaluation is an inherently subjective process. The question as to what constitutes bias is a complex one, and is central to the arguments submitted to us. Counsel for the Commission refers to the Equal Wages Guidelines passed pursuant to the provisions of the Act and in particular to s. 9(a) which reads:

9. Where an employer relies on a system in assessing the value of work performed by employees employed in the same establishment, that system shall be used in the investigation of any complaint alleging a difference in wages, if that system

(a) operates without any sexual bias;

106. Commission Counsel, supported by the Alliance, posits the question of the reliability of the results to be addressed by the Tribunal as follows:

Is there a pattern, a systematic variance of different treatment of male and female questionnaires (in the evaluation process) that was caused by or is attributed to gender bias or gender related bias. [emphasis added]

107. Respondent Counsel advocates a broad reading of the term sexual bias as used in s. 9(a) of the Guidelines. Respondent Counsel proposes that any bias that results in a different treatment of male and female questionnaires is a sexual bias and bases this submission on an interpretation of the testimony of Willis, who described bias in Volume 208, at p. 26937, lines 11 - 16, as follows:

A. Bias simply means that if there is a pattern of different treatment for male-dominated jobs versus female-dominated jobs, whether it's conscious or unconscious, that difference in treatment would represent an amount of bias.

108. According to Willis, it is possible to have a bias that is related to gender but is not a direct gender bias, and he refers to this bias as a gender preference. His example of a gender preference would be where an evaluator has a preference for individuals who wear blue shirts or have blue collars. Willis testified, for example, that trade jobs are known as blue collar jobs. If a preference for blue collars causes an evaluator to evaluate trades jobs more favourably, Willis says that this may not be a gender bias but rather a blue collar bias. According to Willis, this would bring about the same result as a direct gender bias.

109. The Employer submits that the meaning of gender bias can include attitudes toward one sex or the other that are conscious or unconscious. In the Employer's view, a bias can also relate to some characteristic which is not gender per se, but is itself related to gender, which they describe as a gender-related bias. Respondent Counsel submits that s. 11 is designed to redress both kinds of biases. Respondent Counsel also submits that if a blue collar preference results in a different treatment of male and female questionnaires this is a sexual bias as contemplated by s. 9(a) of the Guidelines. Respondent Counsel urges that the question to be addressed by the Tribunal is not the one posed by the Commission but rather is as follows:

Is there a pattern of different treatment of male and female questionnaires?

110. It is to be noted that the Commission and the Alliance do not express any difficulty in assigning a wide meaning to the term sexual bias and we refer to the remarks of Commission Counsel in Volume 230, at p. 30583, lines 14 - 25:

These are very different things. All those other things such as - - I mean, they referred to something called 'dirty work'. I don't know whether you would have a preference for people who do hard work outdoors. If that were gender related and had an effect on the way people perceive the work and rated the jobs, and in the end had a consequential gender effect on the jobs, then no one can dispute that that would be a gender bias contrary to section 9 of the Equal Wage Guidelines and therefore contrary to section 11 of the statute.

111. In formulating the question for the Tribunal to address, Respondent Counsel argues that their formulation does not require a causative factor for the different treatment of male and female questionnaires. The disagreement between the parties lies not in assigning a broad meaning to the words sexual bias but instead arises as to whether s. 11 requires the existence of a cause when different treatment of male and female questionnaires is found or whether, on the other hand, it is simply a matter of differential treatment of male and female jobs without the necessity of assigning cause. In support of the Employer's submission, they rely on a meaning of bias which in their view does not require a causal link or relationship under s. 11 of the Act.

112. There is a disagreement between the parties about the analysis and investigative findings of the Commission on the job evaluation process and the statistical evidence. The dispute centres on the submissions of the Commission and the Alliance that some differences in treatment of male and female questionnaires between committees and consultants are not based on gender or gender-related bias but are due to a value bias. The Commission and the Alliance rely on the statistical expert, Sunter, whose analyses, they submit, demonstrate the effect of a value bias which accounts for some, if not all, of the differences in treatment between the committee and the consultant. Sunter testified that the effect of the value bias has an appearance of gender bias and the difference in treatment between the committees and the consultants is as likely to be a consequence of value bias as it is gender bias.

113. In dismissing the need to know the cause of differential treatment of male and female questionnaires, Respondent Counsel relies on the testimony of Willis. Willis repeatedly stated during this hearing that after the evaluation process is finished there is no need to explore the reasons for the differences between the evaluations of the committees and the consultants. That testimony can be summarized in Willis' letter to Respondent Counsel dated May 19, 1994, which expands on job evaluation disparities and reads as follows:

Evaluation disparities represent a lack of consistency in the application of the evaluation system. Therefore, disparities are a cause for concern, and require attention to determine if they result in a pattern of different treatment for different kinds of jobs.

The question as to why disparities have occurred is important during the course of the committees' work. An understanding of the reasons can be helpful in the continued training of the members. However, after the evaluation phase of the study has been completed, the reasons for any disparities are no longer of any real importance. What is important is the existence of any pattern of bias that is developed among the evaluations. [emphasis added] (Exhibit R-164)

114. In the course of our hearing, in addition to his definition of bias as a different treatment between male and female jobs, Willis offered an opinion on the meaning of gender bias in a pay equity study. He states in Volume 80, at p. 9737, lines 13 - 18:

In the context of the pay equity study, gender bias has to do with the extent to which jobs that are traditionally held by one sex or the other are paid more favourably than jobs that are traditionally held by the opposite sex.

115. Willis refers to gender bias as both different treatment and different pay. To better understand Willis' definitions of bias, it is helpful to refer to the theory of disparate treatment considered in the decision American Federation of State, County and Municipal Employees, AFL-CIO et al. v. State of Washington et al., Nos. 84-3569, 84-3590, 770 F.2d 1401 (1985), United States Court of Appeals, 9th Circuit.

116. The plaintiffs in the American case alleged sex discrimination in compensation against the State of Washington pursuant to s. 703(a) of Title VII of the Civil Rights Act of 1964, 42 U.S.C. The United States District Court for the Western District of Washington had found in favour of the class of state employees, of which at least 70 per cent were female, and the state had appealed to the Court of Appeals, 9th Circuit. A relevant fact in the District Court decision was that Willis had conducted a study in 1974 to examine and identify salary differences pertaining to job classes predominantly filled by males compared to job classes predominantly filled by females, based on job worth. The 1974 Willis Report submitted into evidence concluded that, based on the job content of the 121 classifications evaluated, the tendency was for female classes to be paid less than male classes for comparable job worth, and that overall the disparity was approximately 20 per cent. Willis' study had deemed the male and female positions to be of comparable worth. Comparable worth, as defined by the State for the District Court, means the provision of similar salaries for positions that require or impose similar responsibilities, judgments and knowledge.

117. In the first instance, the district court had found a violation of Title VII premised upon the American disparate impact and disparate treatment theories of discrimination. As explained in the District Court's judgment, Title VII prohibits two types of employment discrimination: (i) intentional unfavourable treatment of employees based upon impermissible criteria; and (ii) facially neutral practices that have a discriminatory impact and are not justified by business necessity.

118. On appeal, Kennedy, Circuit Judge of the United States Court of Appeals, considered the allegations of disparate treatment and held that the unions had failed to prove a prima facie case of sex discrimination by a preponderance of the evidence. In citing reasons, Kennedy J. offers the following with regard to the Willis study at p. 1408:

We also reject AFSCME's contention that, having commissioned the Willis study, the State of Washington was committed to implement a new system of compensation based on comparable worth as defined by the study. Whether comparable worth is a feasible approach to employee compensation is a matter of debate...Assuming, however, that like other job evaluation studies it may be useful as a diagnostic tool, we reject a rule that would penalize rather than commend employers for their effort and innovation in undertaking such a study.

119. As noted in the decision of Kennedy J., under the disparate treatment theory, an employer's intent or motive in adopting a challenged policy is an essential element of liability for violation of Title VII. To establish liability, a plaintiff must show the employer chose a particular policy because of its effect on members of a protected class; it is insufficient for a plaintiff to allege under this theory that the employer was merely aware of the adverse consequences the policy would have on a protected class.

120. The United States Court of Appeals required proof of intent, as is necessary in a disparate treatment case, unlike s. 11, which addresses systemic discrimination, a form of unintentional discrimination. Willis' definitions of bias should be viewed within the context of that American jurisprudence which, we note, deals with a different statute than the Act and a different requirement of intention.

121. Referring to s. 9(a) of the Guidelines, supra, we note that it provides, inter alia, that an employer may use a system for assessing work if that system operates without any sexual bias. By way of contrast, and focusing on the issue which we must resolve, it is with the application of the system that we are concerned. In this regard, it is helpful to refer to the comments of Weiner, and to the following statement she made:

Even though I mentioned gender bias in job evaluation and gender bias in the evaluation systems, I think gender bias in the application of the system is key. If you give people bias free job information and a bias free evaluation system, people can still introduce gender bias when they apply it.

122. There is agreement between the parties that the evaluation system, i.e., the Willis Job Evaluation Plan, is bias free by any reasonable standard. According to Willis, if there is a pattern of differential treatment between male and female questionnaires, this is evidence of systematic biases occurring in the application of job evaluation. For purposes of determining whether bias is present in the results, Willis was not prepared to give an opinion based solely on his observations of the committee process, but instead said he would rely on statistical analysis of the data. According to Willis, there are two ways to determine if bias is present in the application of the plan: (i) observation by the consultants who participated in the process; and (ii) statistical analysis.

123. It was Willis' opinion that as a factual matter, he and his consultants who were present during the job evaluations, were able to ferret out and identify direct gender bias. They observed how evaluators responded to questions as to why they evaluated jobs a particular way. The consultants would not permit a rater to defend an evaluation based on opinions or conclusions.

124. Willis said that to the extent that indirect biases occur, they are more difficult to detect. Usually the only way to detect if indirect bias is operating is to do a statistical analysis of the results to determine if a pattern in the ratings exists. Since the evaluators are usually unconscious of these biases, they are not aware of making gender based judgments. Evaluators will apply points unevenly across male and female jobs, and male and female jobs will consistently receive low or high points. Generally speaking, a statistical analysis will reveal this type of pattern if indirect biases have entered the process.
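
The decision does not reproduce the statistical procedure itself. As a minimal sketch of the kind of pattern check described above, one might ask, in Python, whether re-evaluations shift male and female questionnaires in systematically different directions; all scores and magnitudes below are invented for illustration and are not the study's data.

    # Hypothetical sketch of a pattern check: do re-evaluations move
    # male and female questionnaires in opposite directions? The
    # (committee score, consultant re-evaluation score) pairs are
    # invented and do not come from the JUMI Study.
    def mean(xs):
        return sum(xs) / len(xs)

    male_jobs = [(410, 395), (520, 505), (300, 290)]
    female_jobs = [(260, 280), (340, 365), (410, 430)]

    male_shift = mean([c - k for k, c in male_jobs])      # about -13.3
    female_shift = mean([c - k for k, c in female_jobs])  # about +21.7
    print(male_shift, female_shift)

A persistent gap between the two shifts, repeated across many questionnaires, is the sort of pattern a formal statistical analysis would be designed to detect.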

125. Both the question posed by the Employer and the question posed by the Commission, in our view, restrict the Tribunal from fully assessing the question of reliability. The issue of reliability is not purely statistical and the questions as suggested restrict our assessment of the evidence to statistical measures.

126. It is important to bear in mind that the results were generated through a process of job evaluation overseen by a JUMI Committee with the advice and consultation from a pay equity expert. Willis testified that he recommended to the JUMI Committee certain safeguards in the process to ensure consistent and reliable results. These safeguards included reliable job information, balanced evaluation committees, selection and rating of benchmarks, sore-thumbing exercises, training of participants, quality checks in the form of testing of evaluators and committees and consultant participation.

127. Willis took a consistent position throughout the hearing that in order to analyze the results, he required a statistician. He viewed the role of the statistician as being to examine the data and to ascertain the extent of any problem found.

128. Willis said he would not support the reliability of the results based on the process alone. When asked to consider the process without the results, Willis said in Volume 78, at p. 9570, lines 12 - 22:

A. ...if all of my recommendations had been taken, if I had felt that the processes that were followed were all sound, then it is quite likely that I would have been able to support the results of the study without doing any testing.

This didn't happen. I have not yet supported the results of the study. But in the final analysis that testing is going to tell somebody, me or somebody else, whether or not the study was sound.

129. Expert Weiner has stated that the idea behind a job evaluation process is to be systematic, that the process should involve a series of steps. The goal in pay equity job evaluation should be, in her opinion, to apply the process fairly to all the jobs. Notwithstanding Willis' opinion, which focuses primarily on results rather than process, the Tribunal must be able to assess the checks and balances in the process, not only from a statistical perspective but also by assessing how the Willis job evaluation plan was applied by the job evaluation committees.

130. We are entitled to look at the Act as a whole, including the regulations and Guidelines passed pursuant thereto, in order to assist us in interpreting the meaning of s. 11(1); see Driedger, The Construction of Statutes, 3rd Edition by Ruth Sullivan, Chapter 11. It is our opinion that, given the state of development or evolution of the concept of equal pay for work of equal value at the time, the complaint based orientation of the legislation and the gender driven language of the relevant sections, causation is implicit in its provisions.

131. The wage gap to be redressed by s. 11 must be caused by gender based discrimination. Section 9(a) of the Guidelines is subordinate to the enabling legislation, the Act, and is authorized by s. 27(2) of that Act. There is a presumption in favour of the validity of regulations in the light of their enabling statute. In The Interpretation of Legislation in Canada, 2nd Edition, at p. 310, the learned author Pierre-André Côté comments as follows:

Finally it must be pointed out that the regulations are not only deemed to remain intra vires, but also to be formally coherent with the enabling statute.

132. Moreover, s. 16 of the Interpretation Act (Canada) provides:

Where an enactment confers powers to make regulations, expressions used in the regulations have the same respective meanings as in the enactment conferring the power.

133. For purposes of s. 11 of the Act we do not find it necessary to make a distinction between gender bias and gender preference. We are in agreement with the parties that the phrase sexual bias, as contained in s. 9(a) of the Guidelines, should encompass any bias in the context of job evaluation which has the end result of favouritism toward one gender. Moreover, we agree with Willis and the Employer that it is not necessary to determine why a particular evaluator is motivated to exhibit bias. However, we find it necessary to examine the differences between committees and consultants from both a statistical perspective and a process perspective to determine if a bias exists.

134. In our opinion, causation is implicit in the provision of the legislation and the Guidelines. Different treatment of male and female jobs must be proven to be gender-based. This is consistent with the opinions expressed by Willis as he does not merely talk about a different treatment but a different treatment that is an influence towards one gender or another (Volume 38, p. 4794) and a bias favouring a gender (Volume 38, p. 4792). It is the gender aspect of the treatment that concerns Willis and which concerns the Tribunal.

135. Accordingly, the Tribunal is interested in the gender aspect and based on our interpretation of the Act, the question to be addressed is:

Is there a different treatment of male and female questionnaires in the evaluation process that was caused by or attributed to gender bias or gender-related bias?

136. We will now address the question of whether there is gender-based bias present in the treatment of male and female questionnaires. Our enquiry will encompass the evidence of the process that generated these results and the statistical evidence presented at this hearing.

IV. BURDEN OF PROOF

137. In the case before the Tribunal the issues, because of the length and complexity of the evidence, have been argued and addressed by the parties in stages. The first stage relates to the reliability of the results generated by the JUMI study. The affirmative alleges that the results are reliable and free of gender bias by any reasonable standard. The negative alleges that the results are unreliable and coloured by gender bias to such a degree that they do not allow for an adjudicated resolution by the Tribunal.

138. The phrase burden of proof describes the duty which lies on one or the other of the parties either to establish a case or to establish the facts on a particular issue. See M.N. Howard, ed. Phipson on Evidence, 14th ed., (London: Sweet & Maxwell, 1990) para. 4-01.

139. In Miller v. Minister of Pensions, [1947] 2 All E.R. 372 (K.B.), Lord Denning, at p. 374, defines the degree of probability required to discharge the burden of proof in a civil case in these terms:

That degree is well settled. It must carry a reasonable degree of probability but not so high as is required in a criminal case. If the evidence is such that the tribunal can say: 'We think it more probable than not,' the burden is discharged, but if the probabilities are equal it is not.

140. Counsel for the Commission in her opening remarks conceded that the burden of establishing a prima facie case rests with the Alliance and the Commission. (Volume 218, p. 28337).

141. This concession by Counsel for the Commission simply recognizes the evidentiary rule frequently enunciated by the Courts and contained in the textbooks on this subject.

142. In the view of Sopinka J. et al., The Law of Evidence in Canada (Toronto: Butterworths, 1992), a prima facie case does not compel a specific determination unless there is a specific rule of law which demands such a conclusion. After examining and analyzing several decisions of the Supreme Court of Canada in which the Justices differ, the authors state that a prima facie case simply permits an adverse finding against the Employer in the absence of evidence to the contrary. The authors quote with approval a passage found in R. v. Girvin (1911), 45 S.C.R. 167 (S.C.C.) at p. 169 as follows:

I have always understood the rule to be that the Crown, in a criminal case, is not required to do more than produce evidence which, if unanswered, and believed, is sufficient to raise a prima facie case upon which the jury might be justified in finding a verdict. [emphasis added]

143. This passage was recently adopted in R. v. Mezzo, [1986] 1 S.C.R. 802 (S.C.C.) and the learned authors conclude at p. 73:

The terms prima facie evidence, prima facie proof, and prima facie case are meaningless unless the writer explains the sense in which the terms are used. For clarity and conciseness it is preferable...to explain the evidentiary effect consequent upon the proof of certain facts rather than to indiscriminately use these mixed Latin English idioms.

144. Because there appears to be some question as to the meaning of the phrase burden of proof as it applies in these circumstances, we refer again to Phipson on Evidence, supra. According to the learned author, it has three meanings, as follows:

  1. The persuasive burden, the burden of proof as a matter of law, i.e., the burden of establishing a case by a preponderance of evidence;
  2. The evidential burden, the burden of adducing evidence; and
  3. The burden of establishing the admissibility of evidence.

145. The persuasive burden, sometimes referred to as the legal burden, in a civil case rests on the party who substantially asserts the affirmative of the issue and is fixed at the beginning of the trial or hearing by the state of the pleadings, i.e., the complaints made pursuant to the legislation. It is settled as a question of law that the burden remains unchanged throughout the hearing exactly where the complaints place it, and only rarely shifts except under special circumstances.

146. The legal burden of proof normally arises after the evidence has been completed and the question is whether the trier of fact has been persuaded with respect to the issue or case to the civil or criminal standard of proof. The legal burden, however, ordinarily arises after a party has first satisfied an evidential burden in relation to that fact or issue. See The Law of Evidence in Canada, supra, at p. 58.

147. Stated another way, the legal burden does not play a part in the decision making process if the trier can come to a determinate conclusion on the evidence. If, however, the evidence leaves the trier in a state of uncertainty, the legal burden is applied to determine the outcome on a balance of probabilities. See also The Law of Evidence in Canada, supra, at p. 60 quoting a passage in a decision of the Privy Council in Robins v. National Trust Company, [1927] 2 D.L.R. 97, which reads in part as follows:

But onus as a determining factor of the whole case can only arise if the tribunal finds the evidence pro and con so evenly balanced that it can come to no sure conclusion. Then the onus will determine the matter.

148. This passage can be compared to the comments of McIntyre J. in Ontario Human Rights Commission v. Simpsons-Sears [1985] 2 S.C.R. 536 at 558:

But as a practical expedient it has been found necessary, in order to insure a clear result in any judicial proceeding, to have available as a `tie breaker' the concept of the onus of proof.

149. The evidential burden, on the other hand, may shift constantly throughout the hearing, accordingly as one scale of evidence or the other preponderates. The burden of proof in this sense rests upon the party who would fail if no evidence were produced at all, or no more evidence, as the case may be, were given on either side. In civil cases the evidential burden may be satisfied by any species of evidence sufficient to raise a prima facie case. It is for the Tribunal to decide as a matter of law whether there is sufficient evidence to satisfy the evidential burden, that is to say, to establish a prima facie case. See Phipson on Evidence, supra, at para. 4-10(b).

150. The burden of proof in any particular case depends on the circumstances in which the claim arises. In general, according to Phipson, the rule which applies is that he who invokes the aid of the law should be the first to prove his case. This rule is founded on considerations of good sense since, in the nature of things, a negative is more difficult to establish than an affirmative. See Robins v. National Trust Co., supra; Constantine Line v. Imperial Smelting Corp., [1942] A.C. 154, 174 per Lord Maugham.

151. Commission Counsel, in her oral presentation, in Volume 218, at p. 28349, line 25 to p. 28350, line 11, asserts as follows:

If a process is created which is considered by the experts to be the best process for identifying gender bias...then there is no reason to look further beyond that process. If that's the case, then there is prima facie evidence of a reliable process, which is [sic] the absence of evidence to the contrary would permit a finding of reliability. [emphasis added]

152. This rather broad statement of Counsel is supported by reference to Farnquist v. Blackett-Galway Insurance Ltd. (1969), 72 W.W.R. 161 (Alta. C.A.) (Allen J.A.) at pp. 172-73 and by OPSEU v. Ontario (Ministry of Community and Social Services) (1986), 15 O.A.C. 78 (Div. Ct.) at p. 79 which deal with proof on a balance of probabilities.

153. It is not clear what Counsel intends by use of the phrase If a process is created. For purposes of clarification, if Counsel means the procedures and structures put in place by Willis and the JUMI Committee for the evaluation of jobs, the evidence establishes that in general the structures are compatible with the requirements of the Act, the Guidelines and with the principles of pay equity as understood by the experts.

154. Assuming the process encompasses not only the procedures and structures, such as the evaluation plan, the training of the evaluators, the questionnaires and the collection of information, but also, and more importantly according to Weiner, the application of the evaluation system in a gender free manner, then one might accept Commission Counsel's statement as correct. But Counsel follows her statement with this comment in Volume 218, at p. 28357, line 22 to p. 28358, line 7:

The purpose of all this is to relate to the shifting burden that rests with the Defendant, once the Complainants have demonstrated that there is a prima facie case and the burden has shifted to the Defendant, it is incumbent upon them to prove on the balance of probabilities...that gender bias or whatever their allegation is going to be is in fact the cause of the event and, therefore, is in fact the cause of the unreliability of the results. [emphasis added]

155. The shifting burden of proof referred to by Counsel in the passage quoted above does not relieve the party who asserts the affirmative, in this case the Commission and the Alliance, from satisfying the evidential burden that the results of the study can and ought to be relied upon for purposes of adjudication. If, on the whole of the evidence, including both anecdotal and statistical testimony, the Tribunal can come to a determinate conclusion, it will not be necessary, in our opinion, to invoke the legal burden in order to reach a decision.

156. Counsel for the Commission in his opening remarks admitted that the process in the JUMI exercise was flawed but not so flawed as to vitiate the results. The anecdotal testimony of the participants in the study, the Willis consultants and some of the evaluators, raises questions about the impartiality of some evaluators as well as the functioning of certain committees. Incidents occurred which were disturbing and caused the consultants to experience discomfort about the process and its application. Additionally, there were differences between the committees and the consultants in re-evaluation exercises conducted by the consultants at different stages of the study. Analyses of the results, in turn, led to a critique of the data by qualified statistical experts. This short description of some of the problems which arose during the course of the study and which might have had some effect on the scores of the evaluators is not exhaustive.

157. The problems relating to the reliability of the results, whether arising from the evaluation process or from the statistical analyses on re-evaluations, will be addressed in a subsequent portion of this decision.

158. The Employer submits that if the Willis Process worked well, then the Complainant and the Commission have made out a prima facie case on reliability and there is no need therefore to look at the results for further evidence of reliability. If, on the other hand, the process did not work well, then the onus, according to the Employer, remains with the Alliance and the Commission to demonstrate through other evidence that the results are reliable and sound.

159. What is meant by other evidence is described by Respondent Counsel as consisting of statistical analyses performed by the statistical experts to demonstrate there is a systematic pattern in the disparities. But Counsel argues that by attacking the credibility and the usefulness of those disparities the Commission and more particularly the Alliance are left with no basis for comparison between the committee evaluations and the re-evaluations performed by the consultants, whose credibility and impartiality were under attack by the Alliance.

160. According to Respondent Counsel, in the eventuality that the process did not work well and if the Commission and the Alliance are to be precluded from relying on the statistical analyses, they are left with nothing and have therefore failed to establish a prima facie case.

161. With respect, the Tribunal is unable to accept the proposition advanced by Respondent Counsel. In our view, it is not simply a matter of choosing to accept or reject one or both of the alternatives presented to us.

162. Within the JUMI Study itself and under the direction of Willis, the approach adopted by the JUMI Committee was to use statistical tests as a means of validating the process. The whole Willis Process is a complex scheme that does not only include an exercise of job evaluation but encompasses many steps and stages, one of which is the validation of the results by testing for inter-rater and inter-committee reliability. Some of the testing uses statistical analysis. These tests are integral to the process Willis utilizes in large pay equity studies, and are most evident in the JUMI Study. Another significant test undertaken was a re-evaluation of 222 positions by the Willis consultant, Jay Wisner, who performed statistical tests on the re-evaluation results. The only statistical testing which did not occur during the process itself consisted of an additional re-evaluation of 300 positions conducted by the Commission in its investigation after the completion of the study.
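
The decision does not set out which reliability statistics were computed. One common measure of agreement in such exercises is the correlation between two sets of scores for the same positions; the following Python sketch, with invented scores, is offered only to illustrate the idea of inter-rater testing, not the actual JUMI calculations.

    # Hypothetical sketch of an agreement check between committee
    # scores and consultant re-evaluation scores for the same
    # positions, using the Pearson correlation coefficient.
    # Requires Python 3.10+ for statistics.correlation.
    from statistics import correlation

    committee = [300, 340, 410, 520, 260]   # invented scores
    consultant = [290, 365, 430, 505, 280]  # invented re-evaluations

    # A value near 1.0 indicates the two sets of ratings rank and
    # space the positions similarly; low values signal disagreement.
    print(round(correlation(committee, consultant), 3))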

163. The statistical analysis by the Commission combined the re-evaluations which occurred during the JUMI Study with the re-evaluations that were done subsequently. The act of combining these re-evaluations does not, in our view, create any artificial framework in respect of the evidence as it relates to the process or the evidence as it relates to statistical analyses. Neither process nor statistical measures operated in complete isolation from each other, but were interlocked in the sense that an understanding of one required an understanding of the other.

164. Accordingly, we are entitled to look at the whole of the evidence and to weigh it in the light of all the circumstances. We will be examining in great detail the testimony of the participants in the study, the expert evidence of the consultants, the expert testimony of the statisticians and others who had some involvement in the study. Our decision will therefore encompass all the evidence presented to us during the course of this hearing.

165. The elements required to satisfy the evidential burden in the present proceedings consist, in our opinion, of the following, which are based on the provisions of s. 11 of the Act, the companion Guidelines and the state of the pleadings:

  1. The complainant groups are female-dominated within the meaning of the Equal Wages Guidelines;
  2. The comparator groups are male-dominated within the meaning of the Equal Wages Guidelines;
  3. The value of work assessed is reliable; and
  4. A comparison of the wages paid for work of equal value produces a wage gap.

166. As mentioned previously, at this stage the Tribunal will address the third element, namely, whether the sampled positions in the JUMI Study have been properly evaluated so as to produce reliable results. It should be noted, moreover, that the parties, including the Employer, have agreed the Willis Plan is, in fact, an appropriate gender free evaluation plan for the JUMI Study which captures the criteria required to be measured by s. 11(2) of the Act.

167. In addressing the third element relating to the reliability of the evaluations, Counsel for the Commission enumerated several considerations which needed to be taken into account, namely: the plan allows for comparison between occupations; the process was designed to obtain reasonably reliable job information; there were additional procedures in place so as to ensure comprehensive job information; the Plan was, in fact, applied with reasonable consistency by the multiple committees; there was consistency in the job information; there was consistency in the results; and the salary data was reasonably reliable.

168. These considerations are, it seems to us, appropriate and helpful in evaluating the evidence and will be applied when the Tribunal assesses the evidence, both anecdotal and statistical, in the following sections of this decision.

V. STANDARD OF PROOF

169. The standard of proof determines the degree of probability that must be established by the evidence to entitle the party having the burden of proof to succeed in proving either his/her case or an issue in the case.

170. There are two levels of probability depending on whether the matter to be tried is of a criminal nature, in which case proof beyond a reasonable doubt is required, or is a civil matter, in which case the claimant is required to establish his/her case, or an issue therein, on a balance of probabilities, which is to say a greater likelihood that the conclusion advanced by the claimant is substantially the most probable of the possible views of the facts. See Duff J. in Clark v. The King (1921), 61 S.C.R. 608, at p. 616.

171. The standard applied by the Tribunal in Haldimand-Norfolk (29 May 1991), 0001-8 P.E.H.T., when interpreting Section 5(1) of the Pay Equity Act (Ontario), is contained in paragraph 24 of that decision, which reads as follows:

24. Having carefully considered the evidence and submissions in this case, we find that the parties have an obligation to ensure the collection of job content information meets the requirements of the Act to accurately identify skill, effort, responsibility and working conditions normally required in the work of both the female job classes in the establishment and the male job classes to be compared. Not only is this a necessary condition of a gender neutral comparison system but we also find that section 5 of the Act requires a standard of correctness, that is, the skills, effort, responsibility and working conditions must be accurately and completely recorded and valued. [emphasis added]

172. Section 5(1) of the Ontario Act reads as follows:

5(1). For the purposes of this Act the criterion to be applied in determining the value of work shall be a composite of the skill, effort and responsibility normally required in the performance of the work and the conditions under which it is normally performed.

173. Section 5(1) itself does not impose any particular standard which must be met by the parties in order to fulfil the criteria.

174. Accordingly, the decision of the Tribunal in the Haldimand-Norfolk case, insofar as it deals with the standard to be met in the collection of job information, is that Tribunal's interpretation of Section 5(1) of the Pay Equity Act (Ontario). It should be pointed out that the issues in that case were whether the employer had adopted a gender biased comparison system and whether it had failed to negotiate in good faith with its employees. The question of the reliability of the results which concerns us here was not directly addressed. That issue relates to the process. The process requires a standard by which to assess the collection of job information and a standard by which to assess the procedures for evaluating that information.

175. The issue before us relates to such matters as the format of the questionnaire, the procedures for gathering information about jobs, the follow-up procedures and safeguards, the composition and functioning of the committees, the application of the job evaluation plan and the vetting of committee results by statistical analysis.

176. The Commission and the Alliance have advocated a standard of reasonableness to be applied in assessing job information and job evaluation. A standard of reasonableness is also to be applied in assessing damages, by which, it is assumed, is meant the measure of consequential relief afforded to the complainants by the provisions of the Canadian Human Rights Act.

177. Respondent Counsel in his oral submissions when dealing with the onus of proof makes the following statement in Volume 226, at p. 29761, lines 16 - 24:

Another point on onus of proof is this. The employer's position is that the standard for assessing reliability, the standard for assessing the process to decide whether we have reliability, is one of reasonableness. Did the process work well? It is not a question of whether the process worked perfectly or whether the job information was perfect. The employer has never contended for perfection.

178. In commenting on the Haldimand-Norfolk case, Counsel also makes the following observations in Volume 226, at p. 29761, line 25 to p. 29762, line 13:

I might just contrast that with the Haldimand-Norfolk case, in which...the Tribunal said, And we want a standard of correctness. Correctness sounds very like that they were looking for perfection. I don't know...In any event, the employer's position on this is that we are looking for did the process work well. That doesn't mean perfectly.

179. So in the result the parties have themselves advocated a standard of reasonableness. Respondent Counsel's position is that when applying a standard of reasonableness to the results in this study the Tribunal must find that the results fall short of providing a reliable basis on which to render a favourable decision.

180. What standard ought the Tribunal to follow in assessing the reliability of the results? The concept of reasonableness should be viewed in the context of what pay equity or comparable worth hopes to achieve and how it expects to achieve its goal. There are, as well, practical considerations as to its effects in the work place on the parties involved.

181. Throughout his testimony, Willis, an acknowledged expert in his field, stressed that achievement of pay equity or equal pay for work of equal value as between male dominated jobs and female dominated jobs is not a scientific, mathematical or statistical endeavour. Rather it is an art based on a combination of analytical skills, comprehension, intuition and ultimately a subjective evaluation of the job within the framework of the plan while at the same time adhering to the discipline which the plan imposes.

182. In an article by Judy Fudge of Osgoode Hall Law School entitled The Legal Standard for Gender Neutrality under the Pay Equity Act (Ontario): Achieving the Impossible?, the learned author, in referring to a legal standard against which to judge the gender-neutrality of job comparison systems, states:

...to date there does not exist a conclusive method to demonstrate either gender bias or gender-neutrality in any particular job comparison system. For this reason, the Pay Equity Hearings Tribunal should adopt a reasonableness standard with respect to the issue of gender-neutrality.

183. She then outlines minimum criteria for developing a gender-neutral job evaluation system. It is not necessary to examine those criteria for our purposes since all of the parties to this enquiry have agreed that the Willis Plan adopted at the outset of the study satisfies the minimum criteria and is therefore gender-neutral. What is useful for our purposes is to note the observations of the author (at p. 20) where she acknowledges that:

...there does not exist a conclusive method to demonstrate either gender bias or gender-neutrality in any particular job comparison system. For this reason, the Pay Equity Hearings Tribunal should adopt a reasonableness standard with respect to the issue of gender neutrality.

184. In commenting on the actual job evaluation process, the author states:

No matter how scrupulous the design of the job comparison system in avoiding gender bias, bias can creep into the actual process of assigning job value points to jobs. In other words, the job evaluation system may be fair, but the application can be biased.

185. Fudge then goes on to describe the use of job evaluation committees which, if properly constituted and following clearly defined procedures, would minimize the possibility of bias.

186. Also we refer to the earlier comments of Armstrong that the overall wage gap probably wouldn't disappear completely if pay equity were achieved in all jobs.

187. What is apparent from these comments and from the nature of the subject is that equal pay for work of equal value is a goal to be striven for, which cannot be measured precisely and which ought not to be subjected to any absolute standard of correctness. Moreover, gender-neutrality in an absolute sense is probably unattainable in an imperfect world and one should therefore be satisfied with reasonably accurate results based on what is, according to one's good sense, a fair and equitable resolution of any discriminatory differentiation between wages paid to males and wages paid to females for doing work of equal value.

VI. FACTS

A. THE WILLIS PLAN

188. Work was evaluated during the JUMI Study within the framework of a job evaluation plan.

189. One of the early tasks of the JUMI Committee was to select a job evaluation plan. The JUMI Committee created a sub-committee to examine various job evaluation plans and make recommendations to the JUMI Committee at large. Several plans were examined by the Sub-Committee on a Common Evaluation Plan. In the end, this sub-committee recommended the Willis Plan, designed by Willis, with some minor modifications to better meet the criteria of the Act. Following consultations with representatives of the JUMI Committee, Willis agreed to make changes to the plan including changes to the working conditions chart.

190. The Commission also examined the Willis Plan and expressed its concern about the treatment of effort in respect to working conditions. There was also concern about the manner in which the plan dealt with accountability. Willis agreed to change aspects of the Willis Plan such that both physical and mental effort would be assessed in working conditions. He also agreed to changes in the treatment of accountability.

191. All participants, including the Commission, appeared satisfied with the changes. Paul Durber, Director of Pay Equity, Canadian Human Rights Commission and a pay equity expert, provided expert evidence as to how the requirements of s. 11 of the Act and the Guidelines were captured by the Willis Plan. Durber stated an essential element of the job evaluation plan, used for purposes of pay equity, is that the tool be gender bias free. In Durber's opinion, there is nothing on the face of the Willis Plan which appears gender biased and there is nothing in the plan that would make it difficult to measure work traditionally performed by men as compared to that of women.

192. The Willis Plan is complex in design. Willis developed this plan in 1974 after working with the consulting firm Hay & Associates for three years. It uses a matrix format which permits evaluation of the four factors of skill, effort, responsibility and working conditions to be broken down into subfactors. The matrix design allows for one, two or three sub-factors to be assessed on a single guide chart, with a total of four guide charts. A guide chart presents the criteria used in the Willis Plan. In some cases, one factor is embedded in another. For example, interpersonal skills are measured within the levels of managerial skills; thus, how one scores managerial skills affects the number of points given for each of the levels of interpersonal skills.

193. The Willis Plan is a point factor system, which simply means points are assigned to each factor. The point values are added together to arrive at a total point score for each job. The Willis Plan is designed geometrically. Willis has chosen a 15 per cent difference between any two levels of the plan. He finds this percentage of difference corresponds to a discernible difference in the semantic definitions of the different levels in the charts. He stated that, if the differences were too small, evaluators would be unable to make a choice.
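
As an arithmetic illustration of the geometric design described above, the following Python sketch generates a scale in which each level exceeds the one below it by 15 per cent. The base value and number of levels are assumptions chosen for illustration; the decision does not reproduce the actual point tables.

    # Hypothetical geometric point scale with a constant 15 per cent
    # step between adjacent levels, as Willis describes. The base
    # value (100) and six levels are invented for illustration.
    base = 100.0
    ratio = 1.15

    levels = [round(base * ratio ** n) for n in range(6)]
    print(levels)  # [100, 115, 132, 152, 175, 201]

Each step is large enough, on Willis' account, for evaluators to perceive a real difference between adjacent level definitions.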

194. The assigning of relative worth to each job is established by the number of points available for each factor in the Willis Plan, and an almost infinite number of points is available. The relative number of points available for each factor contributes to the conclusion of relative worth of different positions. (Volume 77, p. 9377).

195. Dr. Nan Weiner, President of NJ Weiner Consulting, Inc., a consultant specializing in pay and employment equity, was deemed an expert by the Tribunal in pay equity and compensation. She was asked to express an opinion on the Willis Plan, which she referred to as a system. Although she had not worked with the Willis Plan, she stated there was nothing to indicate that it would in any way undervalue female jobs. She indicated a weakness in the Willis Plan could be that, given the breadth and diversity of the Federal Public Service, four levels of interpersonal communication are simply not adequate to differentiate across all the jobs which were evaluated. In her opinion, the Willis Plan attributes more points to knowledge and skill, and accountability or responsibility, than to effort and working conditions, and in some respects favours white collar work over blue collar work. According to Weiner, it is important for the evaluators who use the system to ensure, through their discussions, the blue collar jobs are measured fairly.

196. In her view, it is not the distinguishing element of the work but how the system is adapted by the user that is important. In this respect, she says in Volume 11, at p. 1564, lines 20 - 23:

THE WITNESS: What is important is that you ask the people who actually use the system what they did to make sure the system was being used fairly for blue collar jobs.

197. Willis testified the current modified Willis Plan does not have the points on the charts. Thus, the evaluators never know what the points are. They evaluate using pluses and minuses. A computer program determines the points. According to Willis, this frees the evaluator from knowing the point relationships between different jobs.
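
As a rough sketch of the mechanism described in the preceding paragraph, the fragment below converts an evaluator's level-and-sign rating into points through a lookup the evaluator never sees. The table and its values are invented; the actual conversion used in the study is not in evidence here.

    # Hypothetical sketch: evaluators record only a level with an
    # optional plus or minus; a separate program supplies the hidden
    # point values. This mapping is invented for illustration.
    POINTS = {
        (2, "-"): 115, (2, ""): 132, (2, "+"): 152,
        (3, "-"): 152, (3, ""): 175, (3, "+"): 201,
    }

    def score(level, sign=""):
        """Return the hidden point value for a rating such as 3+."""
        return POINTS[(level, sign)]

    print(score(3, "+"))  # 201

Keeping the points out of the evaluators' sight, as Willis explains, prevents them from reasoning backwards from desired point relationships between jobs.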

198. Two elements emerge from the design of a pay equity job evaluation plan. The first is that a pay equity plan must be capable of capturing and appropriately valuing female and male work. The second is the allocation of weight assigned to the various factors of the plan. For example, in the weighting of factors, the Willis Plan attributes many more points to knowledge and skills than it does to working conditions and physical effort.
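
To illustrate the second element, factor weights in a point-factor plan can be read off as each factor's share of the total available points. The maxima below are invented for illustration and are not the Willis Plan's actual values; they merely show how a plan can attribute far more weight to knowledge and skills than to working conditions.

    # Hypothetical factor maxima: the share of total available points
    # per factor determines its weight. These numbers are invented.
    max_points = {
        "knowledge and skills": 800,
        "responsibility": 700,
        "effort": 200,
        "working conditions": 100,
    }

    total = sum(max_points.values())
    for factor, pts in max_points.items():
        print(f"{factor}: {pts / total:.0%}")
    # knowledge and skills: 44%, responsibility: 39%,
    # effort: 11%, working conditions: 6%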

199. Willis described his weighting scheme for the Plan. He testified the weights were validated using market rates of pay. Weiner criticized the use of the market as a measure of validation, stating that market influences on the wages for female-dominated jobs are inconsistent with pay equity. Simply expressed, when job evaluation systems undervalue work traditionally performed by women, this undervaluation becomes compounded in the marketplace. Accordingly, in her opinion, the market reflects the undervaluation of women's work.

200. The last formal validation of the weights in the Willis Plan was done in 1985. Early in the study, Willis agreed, at the request of the JUMI Committee, to do a validation study of the weights because of a concern expressed by the Commission. Willis did not believe this re-validation necessary as he had been using the system continuously and had no empirical evidence to demonstrate the factor weights were inappropriate. Management representatives ultimately decided not to perform the validation study because of cost considerations.

201. During the lengthy course of these proceedings, the Employer challenged the weighting of the Willis Plan and its validity as a tool for evaluating jobs in a gender bias free manner. On that basis, the Tribunal heard a considerable volume of evidence. It was not until written submissions from Respondent Counsel were available that the Tribunal and the other parties were advised the Employer agreed the Willis Plan was an appropriate and acceptable evaluation plan for the purposes of the study. Moreover, it was then agreed by the parties that the Willis Plan met the requirements of s. 11 of the Act and is an appropriate instrument within the meaning of s. 11 for these complaints. Reference is made to the Employer's written submissions at para. 41, p. 11, which read:

41. Nevertheless, for purposes of this litigation, the Employer accepts that the Willis Plan was an appropriate plan to use in evaluating jobs in the Federal Public Service. Therefore, the Tribunal need not decide whether weighting the Willis plan is valid.

202. Also, in oral argument, Respondent Counsel stated in Volume 218, at p. 28453, lines 4 - 11 as follows:

Then, having covered all of those points, we say this: For purposes of this litigation the employer accepts that the Willis Plan was an appropriate plan to use in evaluating jobs in the federal public service. So that, in my submission, was intended to be a complete indication that no issue is raised in respect of the Willis Plan.

203. The Commission has an obligation to assure the Tribunal the Willis Plan meets the requirements of s. 11 of the Act and s. 9 of the Guidelines. In this regard, Durber assured the Tribunal of the Commission's view, which essentially confirms that the Willis Plan meets the requirements of the Act.

204. During oral argument, all parties agreed on the suitability of the Willis Plan as an appropriate tool for dealing with the complaints before us. Therefore, the Tribunal is persuaded and does find as a matter of fact that the Willis Plan is an appropriate tool within the requirements of the Act and Guidelines for the job evaluations which form the basis for this adjudication.

205. The Willis Plan provides a tool to be used in assessing the relative value of work, but in and of itself it does not provide a methodology for determining the wage gap between female positions and male positions. The determination of any wage gap is a function of comparing evaluations between male and female jobs; the system itself does not do that without a further step.

B. THE WILLIS PROCESS

206. The Willis Process was developed by Willis over the approximately 24 years he spent as an independent consultant in the area of pay equity job evaluation. The Tribunal heard considerable testimony relating to the implementation of the Willis Process in the JUMI Study. In particular, the evidence covered the period of job evaluation which commenced in the fall of 1987 and concluded in the fall of 1989. In assessing the issue of reliability, we find it appropriate to review each aspect of the process to determine whether it achieved or fell short of achieving the aim of avoiding gender bias.

207. The Willis Process is a process for examining, assessing and evaluating jobs. Participants in this exercise were given the task of measuring the content of each of the positions examined, and asked to assign a value reflective of the total work of each position or job.

208. Although Willis testified the job evaluation plan must be a sound instrument, he also insisted the process within which the evaluation plan is used is more important than the plan itself. According to Willis, everything done in terms of the process was aimed primarily at avoiding evaluations which would suggest traditional relationships or stereotyping. It was designed to avoid anything that might be identified as gender bias. Willis maintained throughout the study that vigilance during the evaluation stage was of paramount importance, and he continually reinforced the need for objective, fair and equitable evaluations of all positions.

209. By way of historical background, the process eventually agreed upon by the JUMI Committee was not Willis' preferred choice. Willis initially submitted for the JUMI Committee's consideration a proposal outlining the processes and procedures to be employed in the study. Due to financial considerations under the control of the management side, his proposal was rejected. Willis then prepared a modified proposal. The JUMI Committee accepted the modified version after a number of make-or-buy decisions were made relating to certain aspects of the Willis Process.

210. The modifications in the data-gathering phase included:

  1. instead of having his consultants conduct the briefing of employees selected to complete questionnaires, he agreed federal employees could be trained to do this task;
  2. instead of having his consultants review or screen the completed questionnaires, he agreed to train a team of federal employees to perform this task under the direction of a consultant who would be available to oversee this process;
  3. instead of using consultants to conduct face to face interviews with select incumbents, he agreed to train a team of federal employees to conduct any face to face interviews deemed necessary; and
  4. in the later stage of the study, Willis reduced the amount of time and involvement that he and his consultants would have in the data-gathering phase of the study.

211. Willis believed these modifications would result in a study which would be of sufficient quality to meet the requirements of the Act based on a number of safeguards he instituted to ensure that complete and accurate job information was obtained.

212. We find it helpful to separately identify each step in the Willis Process, accompanied by the evidence relevant to each step, thereby assisting us in determining the issue of reliability and the effectiveness of the safeguards.

(i). Data-Gathering

213. Willis testified data-gathering is a critical and most important step in a study of this type. He identified four possible sources of job data.

214. One source of information is the job description. In reviewing a sample of job descriptions from the Federal Public Service, Willis determined they were out of date and not sufficient for this purpose.

215. A second source is the closed-ended questionnaire, which is described as similar to a multiple-choice question. Closed-ended questionnaires require very extensive, in-depth and detailed knowledge of the job in order to structure them properly. A great deal of familiarity with the work is required to construct the kinds of alternatives provided in a closed-ended questionnaire. This is easier to do in a smaller establishment where there is less variety in the kinds of jobs.

216. The advantage of a closed-ended questionnaire is that it relies less on the incumbent's comprehension of the range and content of the job and more on the knowledge and comprehension of the person who structures the questionnaire. The disadvantage is that if the person who structured the questionnaire is not aware of the whole range of the work involved, or is not fully aware of the kinds of considerations that must go into pay equity questionnaires, then the questionnaires may have fundamental bias built into them. (Volume 180, p. 22971). A closed-ended questionnaire is relatively easy for employees to complete but it is, according to Willis, fundamentally unsound because it permits employees to make value judgments about their work rather than providing factual information.

217. A third source is an open-ended questionnaire, which is more difficult to complete than a closed-ended questionnaire. Willis prefers an open-ended questionnaire to a closed-ended one. Armstrong explained that open-ended questionnaires are most useful when information must be collected for evaluation across a whole range of quite different jobs. Open-ended questionnaires are also more useful with a literate workforce, which is the case with most of the employees in the public service. (Volume 180, p. 22971).

218. Willis testified he constructed his questionnaire to obtain complete, definitive, accurate and up to date job information. (Volume 68, p. 8542). The data gathered in this study was obtained through an open-ended questionnaire (the Willis Questionnaire).

219. The fourth and last source of data-gathering involves a task force of professional job analysts who would interview each employee and then prepare the document. Willis has used this approach in a few instances but stated that it would be impractical in the context of the Federal Public Service. (Volume 29, p. 3696).

(ii). Willis Questionnaire

220. Willis discussed the advantages and disadvantages of open-ended and closed-ended questionnaires in the context of a large study such as the JUMI Study. He preferred an open-ended questionnaire because his aim was to prevent incumbents from making value judgments about their own work, which occurs when a closed-ended questionnaire is employed. Willis states in Volume 65, at p. 8084, lines 12 - 14:

I think the important thing is that the evaluator must make that value judgment, not letting the employee make it.

221. This questionnaire was used in many previous Willis & Associates pay equity studies in both Canada and the U.S. The JUMI Committee established a sub-committee to finalize the questionnaire's format and content. The amended Willis Questionnaire was agreed to by the JUMI Committee. A guidebook was appended to each questionnaire as a source of assistance to an employee completing a questionnaire. The guidebook was also amended by the JUMI Committee to reflect the Federal Public Service environment.

222. In summarizing his participation in the design of the questionnaire, Willis says in Volume 60, at p. 7429, line 18 to p. 7430, line 6:

A. The questionnaire has been, I would say, developed over a number of years. It is probably the most worked up questionnaire that is in existence going back all the way to 1974. We have tried to modify it and change it over the years to make it easier for people to complete, but at the same time it is a totally open-ended questionnaire, which I think is necessary.

The final design, of course, was a modification of the questionnaire by a sub-committee from the Joint Union/Management Committee. I think perhaps we have as good a questionnaire as you could expect to have for any study of this type.

223. Willis participated in the suggested changes in the questionnaires and the guidebook and approved all of the changes which were made. He testified he was satisfied with the questionnaire and the guidebook in the form in which they were used in the JUMI Study. (Volume 62, p. 7654).

224. A portion of the questionnaire provides space for the incumbent's supervisor to make comments. The Questionnaire Sub-Committee discussed and made changes to this portion of the questionnaire. It was Willis' view these changes were minor and he was satisfied with the questions in their final form. The questions for the supervisor read as follows:

Carefully review the completed questionnaire, but do not alter or eliminate any portion of the original response. Please answer the questions listed below. We also invite you to consult with your manager on this subject.

1. What do you consider the most important duties of this position and why? (Refer to Question III.)

2. Comment on the accuracy and completeness of the responses by the employee.

3. Please sign on page 34.

IMPORTANT: Significant differences of opinion noted by the immediate supervisor should be reviewed with the employee.

(Exhibit HR-34)

225. Willis stated this kind of check on position information is intended to address two concerns: firstly, the tendency of some employees to overstate their jobs to some degree when there is no supervisor to review the information; and secondly and more importantly, often the supervisor will have additional information which the employee forgets that might be helpful in evaluating the position.

226. One of the problems identified by Willis was obtaining good information from sophisticated professional level positions. (Volume 68, p. 8544). He stated the higher the level of knowledge and sophistication of the job, the more it requires the understanding and interpretation of principles and theories, hence greater difficulty is encountered by the individual incumbents in describing their work. In a higher level job it is more difficult for the employee to document and describe their work in a way an evaluator can understand. On the other hand, a very simple cleaning job which follows specific procedures can be documented with relative ease.

227. Another problem encountered in gathering information is ensuring adequate time is given to employees to complete the questionnaire. The time each incumbent requires is contingent upon the incumbent's ability to describe their job in writing. Not only is time an important element in this exercise, but so are the effort and care expended by each incumbent.

228. In Willis' expert opinion, the questionnaire was a good tool for obtaining factual up to date job information. In assessing the ability of the Willis Questionnaire to collect sufficient information for the evaluation committees, we note the remarks of Willis in Volume 62, at p. 7686, lines 16 - 22:

Q. In terms of any of the times that the consultants were sitting in, are you satisfied that by the time those questionnaires came to be evaluated that there was sufficient information for those jobs to have been properly evaluated in accordance with the Willis Plan?

A. Yes.

229. The JUMI Committee understood, from the outset of the study, the need to communicate to selected employees the importance of the study as well as the importance of providing thorough job information. As a result, the JUMI Committee established a Communications Sub-Committee to develop a communication strategy emphasizing the necessity of complete and accurate information and a prompt return of the questionnaires distributed for this purpose.

230. The communication strategy included such items as:

  1. a pay cheque stuffer explaining the purpose of the JUMI Study and containing an assurance to employees that classification levels would not be affected;
  2. letters to employees who were asked to complete a questionnaire;
  3. preparation of a video for employees designated as screeners/reviewers to be used in training of incumbents who would be filling out questionnaires; and
  4. training materials for coordinators.

231. Employees were given assurances from the JUMI Committee their participation would not have a negative impact on their careers. They were also assured any information provided would not be used for any purpose other than the JUMI Study. The incumbents from male-dominated occupational groups were informed that if a wage gap were found, their wages would remain unaffected.

232. To counter possible problems in using an open-ended questionnaire, Willis implemented checks and balances, or safeguards, to ensure that the evaluators had complete, definitive, accurate and current information. These safeguards will now be described.

(iii). Coordinators

233. Willis had originally proposed his consultants train incumbents in the completion of the Willis Questionnaire but accepted the JUMI Committee's decision to use coordinators as trainers, which he considered a valid make-or-buy decision.

234. The function of the coordinators included training incumbents on how to complete the questionnaires, conducting briefing sessions to explain the nature and intent of the study, responding to employee questions, distributing and explaining the Willis Questionnaire, assisting employees in completing the questionnaires when required, and coordinating the data-gathering process.

235. Coordinators were designated as either national or regional depending upon their purpose and locale. The selection process and the criteria applied by the Alliance in appointing coordinators were described in detail by an Alliance witness, Elizabeth Millar, Head, Classification & Equal Pay, Collective Bargaining Branch. Similarly, the selection process of the Institute was described by Kathryn Brookfield, Section Head of Research. On the other hand, no evidence was presented by the Employer as to the manner and criteria for selecting its employees for this role.

236. Coordinator training sessions were conducted in September and October, 1987. Materials, in the form of printed information, slides and videos, were given to coordinators to assist in the training of incumbents. The coordinator training program lasted about a day and a half. Some additional exercises in coordinator training included practice sessions on eliciting the support of individuals who might be reluctant to complete the questionnaire, dealing with language difficulties and making arrangements for interpreters where necessary. All training on the Willis Plan was conducted by a Willis Consultant.

237. With regard to the adequacy of coordinator training, Willis provided the following opinion in Volume 62, at p. 7657, lines 15 - 21:

Q. In terms of the training, you participated in the training of the coordinators, or your consultants did?

A. Yes.

Q. And you were satisfied with the training that was given to the coordinators?

A. Yes.

238. Following training, each coordinator was assigned a number of incumbents to train. The date for performing this task was left to the individual coordinator, although Willis wanted the training of incumbents undertaken as soon as possible. He also emphasized to the coordinators the importance of having incumbents complete the questionnaires as soon as possible after their incumbent training. Willis preferred the questionnaires be completed within two weeks of the employee training. Willis estimated it would take incumbents four to eight hours to properly complete the questionnaire.

239. Following the training of the coordinators, which was completed in October of 1987, approximately two-thirds of the questionnaires were received by February of 1988 and up to three-quarters by March of 1988. The Administrative Sub-Committee, established by the JUMI Committee, spent considerable time assessing the number of questionnaires which had been received and ways and means of obtaining all the remaining questionnaires. The final rate of return for the questionnaires was 95 per cent. A few questionnaires continued to come in over the summer and fall of 1988.

240. The Tribunal heard from Brookfield who testified that many of the coordinators from the Institute commenced their training task very soon after receiving coordinator training. She explained that the employee training went on for a considerable period of time because the coordinators had a large number of employees to train and the employees were not all at a single work site. These factors required staggered training sessions for coordinators to meet with different employee groups. Brookfield also indicated some of the incumbents could not be released from their work at the same time, and this factor also lengthened the period of time required for the training.

241. Brookfield also expressed the Institute's view as to the calibre of the training the coordinators provided. She said in Volume 168, at p. 21007, lines 5 - 16:

A. They said to me quite frankly the more they did it, they felt the better they got and that they had received input from previous training sessions about questions that employees would have and they would respond to them at that point. But then, after that, they might think of more information or another way they might have addressed that concern and they would incorporate it in their next training session, perhaps up front, or be able to raise, if there weren't questions, possibilities and things they had gleaned from other training sessions.

242. The Alliance had many more coordinators than the Institute, numbering approximately 100. Margaret Jaekl, Classification and Equal Pay Officer, Collective Bargaining Branch, of the Alliance, testified as to the effectiveness of coordinator training from feedback she received from Alliance coordinators. Jaekl states in Volume 200, at p. 25831, line 25 to p. 25833, line 8:

Q. Did you receive feedback from the co-ordinators as to how they felt their role was being received, first of all, by management and, second of all, by those that they were training in the filling in of the plan?

A. Yes. We had meetings from time to time with all of what we called our national co-ordinators. Each component had a national co-ordinator and then they had many regional co-ordinators, too.

...

A. The feedback we got generally was that they felt they were working well with their management counterpart. People were understanding their presentations. People were generally completing their questionnaires and returning them. Some people had questions and, in general, they felt comfortable that they were able to answer those questions.

243. The JUMI Committee sought cooperation from management in granting time to employees for training. The uncontradicted evidence is there was good cooperation from the Employer in providing the selected employees with sufficient time during normal working hours to attend the training session and to complete the questionnaire. Incumbents were given time off with pay to complete the questionnaire which could involve up to eight hours, where necessary.

244. In Willis' opinion, the shorter the time lapse between the training of coordinators and their training of the incumbents, the more effective the training would be. Willis' experience was that he was able to track the quality of the questionnaires in terms of how soon the incumbents completed them after receiving their training from the coordinators. According to Willis, quality goes downhill over time. In this particular case, Willis was not able to pinpoint when the quality began to decline. He found a variety of quality levels in the completed questionnaires and remarked the earlier questionnaires were of higher quality. The Department of National Defence questionnaires were completed right on schedule; Willis testified these employees completed the questionnaires as they were supposed to be done and the questionnaires were excellent. Willis and his consultants noticed a dropping off in quality the longer it took for the questionnaires to be returned.

245. There is little evidence concerning specific dates of coordinator-incumbent training sessions. Some portion of the delay can be attributed to the time supervisors took to read, comment on and sign employee questionnaires. Some supervisors waited until all of their employees had completed their questionnaires and signed them off en bloc. Willis admitted there was no way of knowing whether an employee had, in fact, filled the questionnaire out within the goal of 10 to 14 days after receiving their training or at a later time. It is noted there were 1,258 incumbent substitutions in total involving 837 questionnaires.

246. The evidence revealed the information from female employees came in sooner and was of better quality than the information received from male employees. Also, questionnaires from incumbents of high level technical and professional positions were returned later and contained weaker information than questionnaires from the incumbents of clerical and vocational positions.

(iv). Screeners and/or Reviewers

247. As the completed questionnaires were returned, one of the Willis Consultants, Jan Drury, was asked by Willis to select the best questionnaires for evaluation by the MEC. Drury expressed concerns to Willis about the overall quality of the questionnaires. As a result, Willis instituted a back-up procedure to obtain additional information. This involved a task force of employees, appointed by the JUMI Committee, referred to as screeners and/or reviewers. Their primary function was to screen incoming questionnaires for any gaps in information and/or inconsistencies.

248. According to Willis, the screening of questionnaires is an absolute necessity in the Willis Process, and his original recommendation for the study was that his consultants perform the screening and reviewing function, as they normally would. The JUMI Committee instead decided to train federal government employees to perform this task, triggering another make-or-buy decision. The screeners/reviewers functioned throughout the duration of the study.

249. The screeners and reviewers were trained by Drury. They received more extensive training than the coordinators because the screeners and reviewers had to be familiar with the Willis Plan in order to assess whether the questionnaires were properly completed.

250. Accordingly, the management side and the union side each appointed individuals to act as screeners/reviewers. Approximately 55 individuals functioned in this capacity. Their responsibilities included undertaking certain technical tasks for each questionnaire, such as removing all gender and classification references. After identifying questionnaires requiring additional information or clarification, the screener/reviewer was then required to draft questions to ask incumbents in order to complete the necessary information. They also obtained further factual information respecting technical terminology found in the questionnaires and presented this information in terms better understood by an evaluation committee.

251. Drury oversaw the work of the reviewers until March of 1988. Drury examined the review questions and notes drafted by the screeners/reviewers for each review completed on the questionnaires evaluated by the MEC. Subsequently, Diane Saxberg, on the union side, and Doug Edwards, on the management side, were appointed Chief Reviewers as of March 7, 1988. The Chief Reviewers were responsible for reviewing the draft questions of the screeners/reviewers.

252. The screeners/reviewers interviewed the incumbents to obtain the required information. A high percentage of these follow up interviews were done by telephone and only a limited number, less than a dozen, were done in person. In some instances, obtaining this information required several telephone calls, some of which were extremely lengthy. The responses were then written up and appended to the questionnaires before being conveyed to an evaluation committee. The written responses were referred to as reviewer notes.

253. Willis wanted the screeners/reviewers to identify areas in the questionnaires where something may have been overlooked or left out, or where there might have been contradictions between what the incumbent wrote and the comments of their supervisor. They were also instructed to be alert to expressions of opinion or conclusion not supported by fact.

254. The screeners/reviewers found only a handful of cases in which there was disagreement between the supervisor and incumbent. Saxberg testified in these situations she would talk to both individuals and, in most cases, reported the disagreement was more of a semantic nature than a substantive disagreement about job duties.

255. Willis explained, based on his past experience, about 50 per cent of the necessary interviews can be conducted by telephone but the other 50 per cent require a personal meeting, in order to obtain more substantive information, especially when dealing with higher level technical and professional jobs.

256. Willis stated the number of times that a questionnaire has to be supplemented, whether it is 80 per cent or 30 per cent of the cases, does not really impact on the quality of the questionnaire. According to Willis, it is the extra information which is obtained and put before the committee that counts.

257. The evidence indicates there were some reviewers who previously had functioned as evaluators on an evaluation committee and who had been identified as outliers in terms of their evaluations. Willis defined an outlier as an individual on an evaluation committee who diverges from the committee as a whole, giving higher or lower scores to certain kinds of jobs than the other members of the committee do. (Volume 29, p. 3793).
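
Willis' definition lends itself to a simple numeric check: compare each evaluator's scores against the committee consensus on the same jobs and flag anyone whose average deviation is persistently one-sided. The sketch below illustrates the concept only; the data and the threshold are invented, and the record does not set out how outliers were actually identified in the study:

    # Flag evaluators whose scores persistently diverge in one direction
    # from the committee consensus. All figures are illustrative.
    def flag_outliers(scores, consensus, threshold=0.10):
        """scores: {evaluator: [score per job]}; consensus: [score per job]."""
        outliers = []
        for evaluator, vals in scores.items():
            # mean signed deviation, as a fraction of the consensus score
            dev = sum((v - c) / c for v, c in zip(vals, consensus)) / len(consensus)
            if abs(dev) > threshold:
                outliers.append((evaluator, round(dev, 3)))
        return outliers

    consensus = [420, 310, 550, 275]
    scores = {"A": [430, 305, 560, 280],   # tracks the committee
              "B": [505, 370, 650, 330]}   # consistently high
    print(flag_outliers(scores, consensus))  # [('B', 0.194)]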

258. Willis indicated one way of checking for validity, in the situation where a screener/reviewer is also an outlier, is to examine the questions they draft and the answers they give and determine whether the answer responds to the question. Willis testified he saw no indication these individuals were not recording the answers to the questions raised.

259. The Tribunal heard evidence from three individuals who performed as screeners/reviewers. With regard to the effectiveness of telephone interviews for obtaining information, Christine Netherton states in Volume 173, at p. 21919, lines 5 - 20:

A. ...sometimes it only took half as long to get the information, but very often you would have to explain what the study was doing and they would say Oh, I filled that out and so and so. So there would be a lot of chat to get easy with them. And you tried not to rush people.

I think the information did come back on the whole. And you would get this response from other reviewers as well.

But there would be the person that did not like talking. I am talking of the impression I am left with. I am not saying that it was 100 per cent perfect. But the main impression is that in the majority of cases you did get good information via the telephone.

260. Another reviewer/screener who testified, Mary Crich, said she did not often see examples of conflict in the summary of duties and responsibilities between the incumbents and the supervisors. Crich gleaned from the telephone interviews that employees enjoyed the opportunity to speak with someone about their job.

261. Both Willis and Durber were asked about the competency and ability of the screeners/reviewers. A number of individuals who functioned as screeners/reviewers were familiar to Durber because of his lengthy experience in the Federal Public Service. Durber described them as professional job evaluators as well as analysts. In his opinion, they would tend to be more competent to perform the tasks assigned to them as reviewers than others without similar backgrounds. (Volume 164, at pp. 20505-07).

262. With respect to the adequacy of their work, Willis said the following about screeners/reviewers in Volume 65, at p. 8136, lines 2 - 7:

Q. But were you aware that after training reviewers had any difficulty understanding their job?

A. I don't believe any of them had any difficulties. At least none were expressed to me.

263. Following the screening/reviewing process, the questionnaires with the reviewers' notes would be turned over to an evaluation committee. If an evaluator on the committee required further information, questions would be drafted by the evaluation committee and would be passed back to the screener/reviewer to solicit the necessary information from the incumbent. The information obtained by the screener/reviewer would then be provided in writing and returned to the appropriate evaluation committee.

264. Under the direction of Durber, the Commission examined questionnaires with a view to assessing their quality. During the hearing, the Commission introduced a report, An Examination of the Quality of Questionnaire Information used by the Federal Equal Pay Study (Exhibit HR-245). The report was prepared by the Pay Equity Directorate of the Commission at the request of Durber, the investigator into these complaints. An experienced researcher, who possesses a Master's Degree in Canadian Studies from Carleton University, was commissioned to review a cross section of the evaluations. This included 63 benchmark evaluations and 588 non-benchmark questionnaires, a total of 651 questionnaires. Her task was to ascertain the apparent completeness and accuracy of all material in the questionnaire files collected as part of the JUMI Study. The researcher was closely supervised by Durber. As part of this work, Durber personally reviewed 36 files which were flagged by the researcher and found each to be in satisfactory condition.

265. The researcher reported the legibility of the descriptions in the questionnaires was good in all cases and that the open nature of the questionnaire appeared to provide scope for answers for both male- and female-dominated occupational groups. Many incumbents enlarged on their duties by adding pages to this portion of the questionnaire.

266. The Commission's report also recorded that supervisor signatures were affixed to over 99 per cent of the questionnaires and that in over 96 per cent of them the supervisors provided comments. Contradictory information from supervisors appeared in approximately 9 per cent of the questionnaires. Of the questionnaires where supervisors provided conflicting information, 95 per cent were resolved by subsequent interviews conducted by the screeners/reviewers.

267. Durber expressed his own expectations about the quality of the questionnaire information when he said in Volume 158, at p. 19761, line 23 to p. 19762, line 3:

I can only say that from my experience in the public service, what I did see was much superior to what I have seen in job descriptions and in job files, presentations even in grievance situations, just to try to put my own expectations into some sort of context.

268. During cross-examination by Respondent Counsel, Willis was asked whether the safeguards implemented by the JUMI Committee to address problems in the data-gathering stage achieved what he wanted. Willis said in Volume 78, at p. 9543, line 3 to p. 9546, line 1:

Q. Those safeguards -- and they were all described in your original proposals -- related to both information-gathering and evaluation. Right?

A. Yes.

Q. I am going to suggest to you that almost or wholly without exception the safeguards that were implemented -- and there were lots -- were not effective to achieve what you wanted them to do.

A. I think it is fair to say that there were degrees of effectiveness that I experienced.

Q. And the degree of effectiveness, I am going to suggest to you, is disappointing at best.

A. Yes.

Q. Part of the result of that is that when we come to the information that was made available to the five and nine committees, after all the shoring-up it was weaker than it should have been. Do you agree?

A. I am not sure what you mean by weaker than it should have been.

Q. Weaker than is desirable for a good evaluation.

A. I did feel -- and I expressed this to the Joint Union/Management Committee -- that the quality of the information was not as high as I would have liked. However, I felt that overall it was satisfactory for our purposes.

Q. I understand, but you have also told us that it was weaker than what you normally get in other studies.

A. Yes.

Q. Even because of some of the weaknesses in the safeguards we have to raise something of a question mark or a flag, if you will, over some of the information that was actually obtained, some that is actually there, because of some of our discussion that it wasn't written by a skilled job evaluator and some of the entries were made by outliers; all that discussion that we had. Do you agree with me?

A. Are you suggesting that some of the information may have been inaccurate?

Q. I am not saying that it is inaccurate. We don't know whether it is accurate or not. Our level of confidence in the information is below what we would like because the information is, to some extent, written by people who aren't skilled in doing this kind of writing, it was screened by people who aren't skilled in screening, interviews were conducted by people who aren't professional job analysts. That is what I am saying.

A. I think I did express to the Joint Union/Management Committee, or at least to the Mini-JUMI, that we would expect a wider amount of disparity because of the information being somewhat weak.

Q. But what I am suggesting to you, in addition, is that -- you say weak. I am asking you whether you agree with me that even with what we have, we have a somewhat reduced level of confidence in its accuracy.

A. I don't know if I can say that. Certainly what we would have liked would have been questionnaires that were more complete and that focused more on factual information. These are the things that always lead the evaluators to making certain assumptions, resulting in a wider range of possible disparities.

269. In spite of the fact that the quality of information was weaker than what was available to him in other studies in which he had been involved, Willis consistently maintained throughout the course of this hearing that the quality of the information was good enough for the purposes of this study.

270. We note, in the course of further cross-examination by Respondent Counsel, Willis again gave his opinion on the quality of the information. This response is found in Volume 69, at p. 8612, line 22 to p. 8615, line 15, where he says:

Q. So what you are saying is that after all the shoring up, the information was still wanting to some significant degree.

A. I would say that the information at best was satisfactory, but not superior.

Q. Would there have been a range -- you make me think of our performance appraisals. We can get satisfactory, fully satisfactory, and superior. Is that the table you are using?

A. Let me put it this way: I had some concerns about the quality. Telephone interviews and interviews by interviewers who were not professionally trained can never completely substitute for a well-completed questionnaire in the first place. While overall I would say the quality was sufficient for our purposes, particularly with the large numbers of evaluations -- again, if this had been Addiction Services with only 19 or 20 positions, I would have been very concerned because I knew that we had to tolerate a greater disparity than I would have hoped for as a consultant. Again, as long as the disparity is random and it will cancel itself out in the end, I felt that I could live with the result.

What happens when you do have a questionnaire -- two things happen when you have a questionnaire that is somewhat weak: (1) it slows down the process, as we found out; and (2) we have to anticipate that there will be a wider tolerance for disparity.

Q. All right. I want to stop you there because you said something again that I want to challenge you on.

You are saying that the disparity cancels itself out. That is if you are looking for gender bias.

A. If it is random, by definition it will cancel itself out. If there is a pattern that results, then it isn't random.

Q. I am going to suggest to you that what it does -- if it is random, it cancels out gender bias.

A. It will cancel out any bias.

Q. Any bias, all right. But what it doesn't cancel out is unreliability. If you have extensive disparity, what you have is a lower level of reliability. I thought we agreed on that yesterday.

A. I think a statistician would say that if you were dealing with a relatively small number, that would be very true. It is less true as the number of evaluations grows, and the disparity continues to be random -- that is, the pluses and the minuses tend to cancel each other out -- you can still achieve satisfactory reliability with a large number of evaluations.
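
The statistical point in this exchange, that random disparity tends to cancel as the number of evaluations grows, can be illustrated with a small simulation: the typical error of an average shrinks roughly in proportion to one over the square root of the number of evaluations. The noise level below is arbitrary and reproduces nothing from the study:

    # Random evaluation noise and sample size: the typical error of the
    # mean shrinks roughly as 1 / sqrt(n). All numbers are illustrative.
    import random
    import statistics

    random.seed(1)
    true_score = 500.0
    for n in (20, 200, 2000):
        trials = [abs(statistics.mean(true_score + random.gauss(0, 50)
                                      for _ in range(n)) - true_score)
                  for _ in range(200)]
        print(f"n = {n:4d}: typical error of the mean ~ {statistics.mean(trials):.1f} points")

A systematic tendency to score one kind of job higher or lower would not shrink in this way, which is the distinction Willis draws between random disparity and bias.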

271. We also note, during cross-examination by Respondent Counsel, Willis reiterates his previous testimony in Volume 78, at p. 9566, lines 19 - 22:

I think I have already said that I feel that with all of the work we did on the data gathering that the data is good enough for the purposes of this study.

272. And again at p. 9567, line 23 to p. 9568, line 14:

Now I am saying to you, Mr. Willis, what do we have to take away before you would say, I will not defend this study?

A. Number one, I did make the statement several times that the quality of information was good enough. I would have blown a whistle if I felt that the quality was so low that we couldn't depend on it.

Second, I did indicate that I felt strongly that I could not validate the results of the study if we couldn't do an assessment, an internal review of existing evaluations. That has been done.

But what it would take? It is possible that I might look at that final analysis and say that I agree that we cannot use the results, but I don't know that.

C. THE EVALUATION PROCESS

273. In a large study such as the JUMI, involving a significant number of positions, Willis utilizes multiple evaluation committees. One committee is his preferred approach, but with myriad jobs, it is necessary to rely on more than one committee in order to evaluate efficiently and properly. Overall, there were 16 evaluation committees established to evaluate questionnaires.

(i). Master Evaluation Committee

274. The challenge for Willis is to design a process which enables the various committees to be consistent with one another over a relatively long period of time. As a guide and procedural safeguard, Willis creates a steering committee or a master evaluation committee. Willis stated it is necessary and essential in a pay equity exercise to make comparisons among dissimilar jobs. The master evaluation committee has the primary responsibility for establishing the relationships among different jobs and setting the frame of reference for the multiple evaluation committees. This exercise is what may be described as the master evaluation committee discipline.

275. The MEC evaluations are referred to as benchmark evaluations. The MEC completed a total of 501 evaluations. Benchmark evaluations are critical in a process where multiple committees are used.

276. The MEC was composed of 10 members, one half management representatives and the other half union representatives. One management representative and one union representative were designated as co-chairs. Willis did not select the MEC members; this was left to the parties' discretion. Willis recommended its members have a government-wide perspective of work performed, analytical/conceptual skills, dedication to completing a tough assignment and an ability to submerge feelings of union or management affiliation in order to achieve a balanced approach to evaluations. The parties attempted to structure the MEC to reflect that balance.

277. Willis testified the MEC had a good balance of males and females with a good variety of backgrounds. The MEC also had an even number of union and management representatives.

278. According to Willis, the key to successful job evaluation is consistency in the interpretation of the evaluation factors as between the multiple evaluation committees. He uses three methods to test for consistency, all of which were employed in the JUMI Study. The first method is used where a consultant is facilitating an evaluation committee or acting as an advisor: the consultant independently evaluates the same job using the same information the committee members are absorbing, while at the same time looking for committee patterns which may differ from the independent consultant evaluation. The second method consists of comparing individual evaluators to the committee as a whole. Finally, the third method consists of comparing committees to one another. Testing for reliability between evaluators, inter-rater reliability, and testing for reliability between committees, inter-committee reliability, will be described and examined in greater detail in a following section.
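
The second and third of these methods reduce to straightforward comparisons of scores. As a conceptual illustration only (the reliability statistics actually applied in the study are reviewed in a later section), an inter-committee check on a shared set of benchmark jobs might be sketched as follows:

    # Illustrative inter-committee check: two committees evaluate the same
    # benchmark jobs, and a high correlation suggests a shared discipline.
    # Scores are invented; requires Python 3.10+ for statistics.correlation.
    from statistics import correlation

    committee_a = [410, 520, 300, 640, 275]
    committee_b = [425, 505, 310, 655, 290]
    print(f"inter-committee correlation: {correlation(committee_a, committee_b):.3f}")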

279. Benchmark evaluations provide a broad frame of reference for evaluation committees and are utilized to achieve consistency and function as a kind of quality control in the evaluation process. More specifically, the term discipline refers to the liberalness or conservativeness with which the MEC interprets the evaluation semantics.

280. The consultants must ensure the discipline is consistent among the different evaluation committees. The discipline adopted by the MEC places a heavy responsibility on the multiple evaluation committees to evaluate the jobs and to ensure their evaluations track well and are consistent with the jobs the MEC evaluated. That is, if the MEC evaluates a certain factor in a certain way, that approach must be adhered to by the other evaluation committees.

281. Willis testified if the multiple evaluation committees were permitted to create their own discipline, the end result would be that the evaluations would be inconsistent. The evaluations might be consistent within themselves, that is, the multiple evaluation committees might treat all jobs fairly and equitably, but the degree of liberalness with which they interpret the semantics might differ. If the master evaluation committee evaluates a factor in a certain way, that same approach must be adhered to by the multiple evaluation committees, otherwise over or under evaluation of questionnaires arises. In Volume 60, at p. 7396, line 18 to p. 7397, line 4, Willis stated:

Every evaluation committee adopts what I have referred to as a discipline, which is a conservativeness or liberalness in treatment of the evaluation factors. Once that discipline is established, if an evaluation comes in higher or the job is evaluated more liberally than the discipline would suggest by other evaluations, I would call that an over-evaluation. If the evaluation was more conservative than I would have expected compared with the overall consistency of the committee, then I would call that an under-evaluation.

282. Willis felt it critically important the MEC provide a sound evaluation basis for the other committees to use as a frame of reference. As to the relative quality of the questionnaires used by the MEC as compared to those used by the other committees, Willis stated the quality of those used by the MEC was higher.

283. Willis requested Drury select benchmarks for the MEC, based on a broad representation of the depth and breadth of the organization. The JUMI Committee formally approved the criteria for selection at its July 10, 1987 meeting. These criteria specify that benchmark positions would be representative of all occupational groups, different organizational levels, high population jobs, standard jobs, and a mix of male- and female-dominated occupational groups in the total study population sample. As well, care would be taken to ensure that there was a sampling of specialized positions and that consecutive levels within a job series would be minimized.

284. He also gave Drury another criterion for selecting benchmarks which was to pick questionnaires of the highest quality. Quality in this context, according to Willis, was completeness, definitiveness and factual content. Willis felt it was very important the MEC have the highest quality questionnaires.

285. The MEC did not enjoy the luxury of receiving all the questionnaires beforehand and then selecting those to be used as benchmarks. Willis was instructed by the JUMI Committee to begin the MEC's work as soon as the first 50 questionnaires were returned. In fact, some questionnaires were still being returned when the MEC had finished its work. While Willis was satisfied overall that the MEC provided a good frame of reference, he could not say each of the criteria approved by the JUMI Committee for selecting the MEC's questionnaires was satisfied in selecting the benchmarks.

286. At the beginning of the MEC's work, Willis functioned as the chair of the committee. After a period of time, Willis relinquished the role of chair to the MEC co-chairs who rotated on a weekly basis. The role of the chair was to facilitate the meeting, maintain a neutral posture so as not to influence the group, write the evaluations on the blackboard and lead the group through the consensus process. Willis spent some time with the co-chairs, coaching them as to what he was doing, and why he was doing certain things. It was about three weeks before they assumed this task. From that point on Willis sat in the back of the room as an observer and was called upon from time to time for interpretation. He also functioned as a facilitator during the sore-thumb or interim review sessions (another part of the process which will be explained later). He proceeded on that basis all the way through. Whenever it was time for a review session, he would take over from the group.

287. After the MEC had completed its work, Willis suggested, for efficiency purposes, that a portion of the MEC benchmarks be designated primary benchmarks. As the additional job evaluation committees began their work, they required access to the MEC benchmarks. Rather than having a complete set of all benchmark evaluations available to each evaluator, primary benchmarks were identified and provided to each individual evaluator. However, each evaluation committee was provided with one complete set of benchmarks.

288. The selection of primary benchmarks was based mainly on expected frequency of use and on other factors such as different organizational levels, different occupational groups, and the inclusion of different factors which were most representative of the jobs evaluated. At Willis' request, each of the MEC members produced a list of benchmarks, which Willis refined; in the end, approximately 100 primary benchmarks were identified.

(ii). Multiple Evaluation Committees

289. Each of the remaining multiple evaluation committees had seven members: three union and three management evaluators, with the seventh member functioning as either a management or union chair of the committee. Again, Willis left the selection of these members to the parties. The Tribunal heard evidence from the Alliance and the Institute that care was taken to select individuals who were articulate, analytical, able to defend the evaluations and willing to work as a team. In terms of balance between the sexes, the Alliance attempted, without success, to recruit equal numbers of males and females. Its female evaluators were, however, often members of male-dominated occupational groups.

290. Willis believes a mix of genders on a committee is important primarily because of perception. As he said, if a committee is all female, it could be viewed as a female-oriented study or might be perceived the other way if the committee was all male. Willis' experience is if a committee has good people on it, their gender is not important. Willis considers the background of the members more important than the sex of the individual doing the evaluation.

291. Willis had recommended that no Federal Public Service classification specialists sit on evaluation committees. Nevertheless, this did happen: seven evaluators nominated by the Employer had extensive knowledge of the classification system in the federal government. They served on four evaluation committees and on the MEC. Willis' concern about individuals with a classification background is that they tend to bring what he refers to as baggage to the evaluations. Willis believes someone who is totally inexperienced will likely be more objective than someone with years of experience in classification.

292. In this context, Willis described baggage as pre-existing knowledge and understanding of the relativities within an organization. Baggage refers to assumptions about work which are probably unconscious. He views baggage as biases, based on incomplete information, from which hidden agendas could arise.

293. Everyone carries baggage of one sort or another, according to Willis. It can, nevertheless, be minimized with an open mind and an objective, fair attitude when applied equally to all jobs so as not to improperly influence an evaluation.

294. Each of the five and nine evaluation committees consisted of seven members at all times; however, in many instances, substitutions did occur. The Tribunal heard direct evidence from 17 evaluators.

295. There was testimony from one of the 17 evaluators, Christine Netherton, a member of the first version of Committee #1 (it functioned after the MEC was finished) concerning the element of baggage. One member of her committee had a classification background. Netherton testified this particular individual had difficulty appreciating other points of view because of her background in classification. When this kind of problem emerged, the committee would attempt to discuss it with the member. Failing a resolution of the problem they would obtain the assistance of a consultant.

296. This problem was also identified by Willis with evaluators on Committee #3. The first formation of Committee #3 had numerous problems. Some of these can be attributed to the fact that certain of the management evaluators had backgrounds in classification. On the staff side there were evaluators committed to raising the scores of female-dominated occupational groups higher than was warranted. Willis described the way this committee functioned as almost a standoff. Further details of the problems in Committee #3 are canvassed in Willis' evidence in Volume 57, at p. 7090, line 12 to p. 7093, line 20:

A. Number 3 had some individuals on it who, on the staff side, were people who seemed to be committed to having the jobs of people in female occupational groups up as high as they could and two of the three on the management side were former classification people and they seemed to be devoted to keeping them as much in line as they could. It was almost a standoff.

The Chair of Committee #3 was a union representative and, while we counsel the chairs very carefully to take a neutral position -- that is, the chair, for their own credibility and not to have an undue influence, should be very careful how they led or how they facilitated the groups -- this particular chair almost became a fourth union evaluator. Not that she actually evaluated, but she entered into discussions in a way that leaned toward the union side rather than taking a neutral posture.

Of course, the chair has the opportunity of consensing and moving on. She would never move on until her side seemed to be well represented. This was one case where I felt that it was imperative that the chairperson be removed and I so recommended to the Joint Union/Management Committee.

Q. And what happened with regard to your recommendation?

A. Nothing. The management side supported my recommendation, but the staff side refused to go along with it.

Q. So, what was the result of this standoff? How do you feel it affected the evaluation process within Committee #3?

A. Interestingly enough, they tend to be a good match. One of the management side was a former classification manager and he was very forceful. It turned into a standoff in most cases.

The problem was that they were evaluating slow and slower. While I would expect eight or nine evaluations a day, I was not able to get that sort of productivity from any of the committees. But this particular committee was evaluating two or three jobs a day and they were themselves becoming extremely frustrated. So, I felt that the exercise was detrimental not only to productivity but to the health and well-being of the members themselves.

Q. You have mentioned two consequences of this standoff, being the health of the committee members and also the slow productivity rate. What is your opinion with regard to the actual evaluations performed by that committee?

A. I can't say that we found any pattern of bias that grew from that committee. I am sure that we would have gotten a pattern if it hadn't been three on one side and three on the other. I think the evaluations, at least as far as we could determine, were okay.

Q. What eventually happened to Committee #3?

A. At the time that we expanded from five to nine committee [sic], we were able to remove the chair from the leadership role and place her as a voting member of one of the other committees, one of the nine committees. Surprisingly enough, her attitude seemed to improve dramatically at that point.

Q. What do you mean her attitude improved?

A. In the opinion of the consultant sitting with this committee and in the opinion of the chair of the committee, she was handling herself more conscientiously than she had.

Q. So, with regard to the evaluations performed within the subsequent committee she worked on, do you have an opinion to give on that?

A. I didn't see a pattern of problem developing either with the individual or with the committee itself.

297. The best pay equity evaluation results are obtained from truly heterogeneous committees. The ideal profile of a job evaluation committee in this kind of exercise is to have individuals with different backgrounds and different experiences, approximately equal numbers of males and females, individuals representing different unions, and employees performing different functions and representing different departments and organizational levels. Willis' goal is to obtain individuals who can be called upon to evaluate conscientiously and fairly, which is not an easy task.

298. According to Willis, bias can occur in the use of job evaluation systems, not necessarily from the evaluation plan but on the part of an evaluator. This is the reason he considers the process itself more important than the job evaluation instrument. A heterogeneous committee cannot guarantee bias will not creep into a job evaluation process; however, with this kind of committee, there is a better chance of getting an objective result. As Willis states in Volume 29, at p. 3788, lines 18 - 22:

...we need people who are conscientious, who can be analytical, and who could be depended on to do their best to do what is right, rather than to protect their own particular field or area.

299. The Tribunal heard evidence on the backgrounds, ages, positions held, skills, strengths and weaknesses of the members of the multiple evaluation committees. With some exceptions, the evidence generally supports Willis' criteria of a balanced committee. On the other hand, the evidence clearly indicates there were some evaluators who carried baggage, who had agendas, and who could not be depended upon to evaluate jobs objectively. To determine the extent to which these evaluators may have affected the reliability of the results, we will review the other procedural safeguards used by Willis in the evaluation process and assess how well the process worked.

(iii). Process for Evaluation of Questionnaires

300. Willis described the difference between what is commonly known as traditional job evaluation and pay equity job evaluation. Willis stated traditional job evaluation has been used since the early 1940s to evaluate primarily management jobs. Its purpose is to achieve some basis for applying pay differences among different levels of managers. Pay equity, on the other hand, Willis stated, requires comparisons of dissimilar jobs at all levels within an organization and in the market.

301. The methodology tends to vary considerably between traditional job evaluation and pay equity job evaluation, although both utilize evaluation committees. Routinely, in traditional job evaluation, committees are made up of managers, information is collected using job descriptions, and interviews are conducted by consultants. The consultant's role becomes less intrusive once a committee is trained to evaluate. As Willis says, the learning curve goes up rather dramatically.

302. In pay equity job evaluation, Willis prefers committees that are:

  1. balanced, comprising equal representation of males and females;
  2. composed of a cross-section of different organizational levels; and
  3. representative of diverse backgrounds.

303. Willis further testified pay equity job evaluation committees have to be trained in how to look at a questionnaire and to analyze the importance of a job. During this process, they must submerge their personal feelings about how jobs tend to fit together. Willis stated different problems are encountered in pay equity job evaluation than in traditional job evaluation. Primarily, problems arise in pay equity because of people's feelings about job relationships.

304. Willis finds evaluators are comfortable in the context of traditional job evaluation because of their general understanding of the jobs, as, for example, a group of managers evaluating jobs in their own organization. Thus, people's feelings about job relationships become less important in that context than in a pay equity job evaluation process, where the consultants have to actually get the evaluators to look at things differently than they ordinarily would.

305. The Willis Process requires evaluators to evaluate independently and prescribes a particular procedure which must be followed during evaluations:

  1. each member of the committee reads the questionnaire on their own;
  2. the evaluators discuss the information and raise questions about job content, which Willis equates to the final step in the data-gathering process; during this discussion stage, Willis permits committee members to share any special factual knowledge about the nature of the work performed and the context in which it is performed;
  3. should any evaluator or committee require additional information, questions are drafted at this time and sent to a reviewer;
  4. when the committee members have a common understanding of the facts, each evaluator is required to independently and confidentially rate each factor pertaining to that position; and
  5. the consultant or chair collects all of the evaluation slips, which contain each individual evaluator's ratings, and transfers them to a blackboard, thus giving the committee a visual basis for making comparisons.

306. There follows a discussion period in which the evaluators talk about their evaluation differences. If an individual has a slightly different rating for any given factor, they are called upon to justify their rating in factual terms. Willis expects the other members of the evaluation committee to listen to the reasons of minority evaluators, and he refers to this part of the process as the consensus process. He permits individual evaluators to adjust their ratings at this point, but only if they can demonstrate factual reasons for the adjustment. The consensus score is then recorded and a rationale is prepared which explains the reasons for the particular evaluation of each position, using criteria defined in the evaluation plan and exemplified by the benchmarks.

307. Although Willis advised the JUMI Committee it was only necessary for the MEC to prepare rationales, the JUMI Committee decided that the multiple evaluation committees should prepare written rationales as well. Problems arose in relation to the rationales, as some were poorly written and difficult to decipher. Consequently, there were delays in their transcription. In the past, Willis has not used rationales because he does not consider them critical to the evaluation process, although he did testify they can be helpful. Willis counselled multiple evaluation committee members to use the rationales as a guide but, in every case, he wanted the evaluators to return to the MEC questionnaire and read it, rather than relying on a rationale. It was Willis' opinion that it was impossible to capture, in a one- or two-page rationale, all the things an evaluator would need to know in order to evaluate a position.

308. Either before or after reaching consensus, depending upon the preference of the individual committee, the committee looked at the MEC benchmarks and selected either similar or dissimilar jobs to ensure their ratings were consistent with the MEC benchmarks. If their evaluation scores were inconsistent with the MEC benchmark, the committee had to adjust its evaluation to accord with the benchmark evaluations.

309. Willis stated it becomes fairly obvious, particularly to the consultants, when an individual demonstrates a gender preference during this process because it is difficult for an individual to provide factual information to support a preference based on feelings. Willis does not require unanimity for consensus, but requires a two-thirds agreement by members of the evaluation committee. Any evaluator in the one-third minority has an opportunity to persuade the group their rating is the correct one. As Willis stated, ...in the final consensus, we have to have at least two thirds of the people who feel the evaluation is right. (Volume 38, p. 4737).

310. The evaluation committees tended to follow the evaluation process designed by Willis, that is, independent reading of the questionnaires, discussion among committee members to better understand the facts, individual rating of each subfactor, posting of individual ratings on a blackboard, arriving at a consensus and selecting appropriate MEC benchmarks.

311. Willis testified a good committee possessing good job information can usually evaluate 8 to 10 questionnaires per day. On that basis, the JUMI Committee initially established 5 multiple evaluation committees, with the expectation that each committee would be able to evaluate approximately 750 positions; however, productivity was much lower than originally anticipated for both the MEC and the multiple evaluation committees. Consequently, in order to deal with the time delay and to solve other problems, Willis recommended, and the JUMI Committee agreed on March 3, 1989, to reform and expand the five multiple evaluation committees to nine.

312. Many of the problems observed by Willis and his consultants occurred with the initial five evaluation committees. The circumstances surrounding these concerns are now detailed.

(iv). Training of the Multiple Evaluation Committees

313. The evaluators needed training in the use of the Willis Plan. Willis personally trained the MEC in October, 1987. Willis testified he was satisfied with the training of the MEC. (Volume 62, p. 7698).

314. Training of the first five evaluation committees was undertaken by Willis and his consultants. Willis met with all five evaluation committees for the first day, and thereafter he divided the members into evaluation committees and assigned a consultant to each committee. When the five evaluation committees expanded into nine, all new committee members received individual training or, if it was a new fully constituted committee, then training was undertaken with the whole committee.

315. One of Willis' goals in training a committee is to ensure comfort with the Willis Plan. His training usually consists of explanations of the Willis Process and on-the-job training with his consultants until evaluators become comfortable using the Willis Plan. Willis' approach is mostly one of learning by doing. Training usually spans a two-week period and, towards the end of the first day or perhaps into the second day, Willis distributes a questionnaire and has the group go through an evaluation exercise. Willis instructs evaluators not to make assumptions about the work, and to look for facts in the questionnaire.

316. Willis trains his own consultants in the Willis Process. Part of their training is directed at attitudinal problems relating to stereotypical work. This part of the training, with both consultants and committees, is informal. The perspective he conveys is to ignore whether a job is male-dominated or female-dominated. He discusses attitudes with his consultants and trains them to deal with such attitudes by breaking a job down into a number of parts and examining each component without regard to the sex of the incumbent. The same method is then imparted by his consultants in training evaluation committees.

317. Willis was asked to comment on a publication of the Ontario Pay Equity Commission (Exhibit PSAC-71), a commission established to assist in the implementation and administration of the Pay Equity Act (Ontario). The publication contained information on training job evaluation committees. Willis agreed in principle with the Ontario Pay Equity Commission's list of elements of appropriate training for an evaluation committee which include: information on the history of job evaluation; how salaries and wages were set in the past; pay equity and wage determination processes; how gender bias may enter into evaluation systems; trends in women's participation in the labour force; the rationale for pay equity; and specific mechanics of the system used by the organization in question.

318. However, Willis prefers his approach which, over the past 20 years, has been more pragmatic than the detailed criteria listed by the Ontario Pay Equity Commission. He says the following in Volume 209, at p. 27088, line 23 to p. 27089, line 16:

A. We have found through experience that the best way of dealing with differences in different kinds of jobs -- and, incidentally, there is no such thing as an all women's job or an all men's job any more. They are all some mix of men and women and there are all kinds of jobs.

There are some features in men's jobs and women's jobs that are somewhat hidden and there is such a variety of kinds of jobs, particularly in the public sector, that our experience has been that we can best deal with it if we don't try to focus on men's work versus women's work at all, but, rather, focus on breaking the job down into factors and examining those factors without regard to whether it is a woman's job or a man's job, making sure that all of the hidden elements, whatever they are, are brought out.

319. Willis' own training in gender sensitization was not achieved through any formal program but rather came from the school of hard knocks. (Volume 209, p. 27168).

320. On further cross-examination by Complainant Counsel, Willis agreed consciousness raising or gender sensitization is not a bad thing. In response to expert evidence from Weiner and Armstrong on this topic, advocating the kind of training recommended by the Ontario Pay Equity Commission, Willis, relying on his experience, said it may be helpful to have sensitivity training of this type, but that it is not absolutely necessary. More particularly, he confirms this in his response in Volume 209, at p. 27096, lines 9 - 17:

Q. So you are satisfied, then, that you can do a pay equity study and do fair evaluations of jobs without the kind of training that is suggested by Dr. Armstrong and Dr. Weiner?

A. I would say that we have ample experience in evaluating male and female jobs in cases where there has been no sensitivity training per se, but that the consultant's guidance is sufficient.

321. The mechanism or safeguard Willis uses to ensure sound, reliable results, in the absence of more formal gender sensitivity training, is one of consultant participation. With the exception of three weeks, Willis personally observed the work of the MEC. During his three week absence, he was replaced by his consultant, Drury.

322. There were five consultants working on this study, including Willis. When the first five evaluation committees began their evaluations, each committee had a designated consultant for training and consultation. (Volume 60, p. 7433).

323. Willis testified the role of the consultant is to evaluate privately while the evaluation committee is doing its evaluations. Willis said initially the MEC would have a short period of time to discuss the particular job selected for evaluation. As was his usual practice, Willis would have his own list of questions which needed to be answered with regard to a particular questionnaire. If this information was not brought out by the MEC members then Willis would raise these questions himself. This function was performed by his consultants with the later multiple evaluation committees.

324. Willis summarizes the role of the consultant as generally to serve as a group facilitator, to be a trainer, to answer the committee's questions about evaluation techniques, and at the same time, to observe the functioning of the committee and to maintain a finger on the pulse. When the consultants do their own evaluation of the job, they do not communicate to the committee the result of their evaluation. The purpose of these evaluations is to enable the consultant to track the committee evaluations. In Willis' opinion, the consultants have two advantages over the committees: (i) they are professional evaluators; and (ii) they do not carry any baggage.

325. In Willis' opinion, a disadvantage of the consultant's role is that, as outsiders, they do not know the environment of the organization as well as the evaluation committee members and thus do not know how differences are perceived within an organization. Willis also points out there is always the danger a consultant may be influenced by their knowledge of a job in another organization which may be similar but not exactly the same as a job within the study.

326. The consultant is not only examining the factual basis each evaluator is using to justify their own evaluation, but is also examining what the committee is doing and, more importantly, ascertaining the committee's rationale for what they are doing for each of the subfactors in the Willis Plan. The consultants exercise what Willis describes as an empirical judgment during this process.

327. Willis testified the MEC was a good and effective committee. Based on his own observation and the information received from Drury, he was satisfied with the degree of consistency the MEC had in terms of its own discipline. His overall assessment of the quality of their efforts was that they were evaluating based on facts.

328. From his personal observation, he identified two individuals who seemed to be outliers. The MEC did not tend to be influenced by these individuals. Willis' conclusions on how the other members of the MEC received and reacted to the outliers' comments were based on the remaining evaluators' reasons for their evaluations and the overall consensus of the group, neither of which was affected by the outliers.

329. With respect to the training of the multiple evaluation committees, Willis found at the conclusion of the first five days training, the majority of the members were barely comfortable with the system, but became more comfortable with the plan after two weeks of training. Individual evaluators who testified at this hearing also experienced an increase in comfort with the Plan as their work progressed.

330. Willis recognized the need for constant vigilance in maintaining an understanding of the plan so that evaluators would not revert to previous evaluation judgments. As a result, Willis met regularly with the initial five evaluation committees to review problems and suggest solutions. In addition, the Willis firm prepared technical advisories, written explanations by the consultants, which answered questions posed by the committees concerning the interpretation of the technical aspects of the Willis Plan.

(v). Master Evaluation Committee's Evaluations

331. Willis provided the Tribunal with his conclusions regarding the work of the MEC during the JUMI Study. He repeatedly maintained that the MEC's evaluations were unbiased, that he was comfortable with the MEC's work, that the information the MEC was using was based on facts and that, in his opinion, the MEC had done an excellent job.

332. On several occasions, throughout the JUMI Study, Willis was asked to assess the quality of the MEC's ratings. In response to Respondent Counsel, Willis commented in Volume 75, at p. 9202, lines 14 - 23, as follows:

I probably examined those 503 evaluations and the differences to death. It was over a six month period that I was continually challenged about them. Every time I reviewed them by myself and with other consultants, we came up to the consistent opinion that, while there was some differences, the Master Evaluation Committee's benchmarks were satisfactory, recognizing that this is not an exact science; it is an art.

333. The approach adopted by Willis to validate evaluation results was to re-evaluate selected questionnaires personally or to have one of his consultants do so. The first testing of the MEC evaluations was conducted by Willis consultant, Drury, in the spring of 1988. Willis testified the purpose of Drury's review was not to validate the results of the work of the MEC, but to ensure the MEC evaluators knew how to use the Willis Plan and that they understood and were interpreting it properly.

334. Willis was interested in whether the MEC evaluators were consistent among themselves. The MEC evaluators themselves wished to have a review as a double check while they were still learning the evaluation process. It was not made known to the Tribunal exactly how many MEC questionnaires Drury reviewed. Her review was done in the spring of 1988, and the MEC had been evaluating since the fall of 1987.

335. For purposes of her review, Drury used only those questionnaires evaluated when she was absent from the MEC discussions. In the end, she identified a total of 12 positions which she evaluated slightly differently from the MEC. Drury's differences arose only with female-dominated positions; the female evaluators on the MEC took exception to this fact and wrote a letter in protest. This letter was addressed to Willis and the Commission. Willis believes this controversy arose not so much because Drury was critical of the MEC's evaluations but rather because there was an appearance of singling out female jobs.

336. Willis reviewed the 12 evaluations Drury identified. He considered Drury's assessments sound and communicated his findings to the Commission. In three of the twelve questionnaires, there was less than a 2.5 per cent difference in points between Drury and the MEC. Drury then met with the committee once more. The MEC made some changes in view of Drury's re-evaluations, but seven out of the twelve were left unchanged. Of these seven, Drury deemed the MEC had undervalued two and overvalued five. Willis was not concerned with the small number of differences per se; he was more concerned with the fact that, of the seven, five were in one direction and two in the other. This fact led him to consider the possibility of a pattern of bias.

337. Willis again met with Drury and with the MEC. Of the five jobs Drury deemed over-evaluated, four were nursing positions. After discussing the content of these jobs with the MEC, Willis concluded the MEC's evaluations were satisfactory and he supported their original evaluations. When Willis discussed this situation later with Drury, she admitted her past experience with nursing positions had been in the State of Connecticut and this had coloured her evaluations. Willis reasoned that nurses in Connecticut do not possess the same breadth of knowledge required of nurses in the Canadian system. Thus, in Willis' opinion, Drury was probably a little off.

338. In his final analysis of the re-evaluations, Willis did not believe the MEC had over-reacted to the Drury re-evaluations in any systematic way. He expressed this opinion in a letter to the management side co-chair of the JUMI Committee, Lise Ouimet. The letter was written on December 5, 1988 and reads in part:

In the spring of 1988, we responded to a request from MEC to review and comment on the evaluations they had completed as of that time. Jan Drury reviewed the Committee's efforts and made recommendations regarding twelve evaluations. Of these, four were for total point adjustments of between 10.0 percent and 10.9 percent, four were between 11.0 percent and 15 percent and one was slightly greater than 15 percent.

The group reviewed her evaluations and explanations, both written and verbal, and changed their evaluations of two of the nine positions (including the one that showed a difference of slightly over 15 percent), leaving seven that are different from Jan Drury's evaluations by between 10 percent and 15 percent. Of these seven, Ms. Drury's evaluations were higher on two positions and lower on five positions. Five of these seven are nursing related positions.

Comments by MEC members indicated that they believed there is a slight difference in the roles of Government of Canada nursing positions having specialty assignments than Ms. Drury's experience with nurses in the U.S. would suggest. For example, MEC gave more weight to #83 Staff Nurse-Sexual Offenders Unit's role in counselling of offenders than Ms. Drury did.

I am not inclined to totally discount the MEC's judgment on this issue without more information and do not feel that these slight differences warrant concern. On the other hand, if you disagree, I suggest that these nursing positions be submitted to MEC for review including obtaining additional information regarding the significance of specialty assignments, and re-evaluated. (Exhibit R-35)

339. Another procedural safeguard designed by Willis to address the issue of disagreements arising between the multiple evaluation committees and the MEC was to permit and even encourage the multiple evaluation committees to submit their differences with explanations to the JUMI Committee. Willis had proposed in his plan that the MEC be reconvened to address these differences, and to either explain their evaluations in a more comprehensive manner or to adjust their evaluations to conform with the results of the multiple evaluation committees. He believes this is a vitally important exercise.

340. Willis explained that once a committee has evaluated a number of jobs, it develops a sense of confidence in its own ability to evaluate, and inevitably there will be minor differences in how a job is perceived. Willis said one approach would be to tell evaluation committees they would have to adopt the discipline of the MEC regardless of any disagreement. But Willis desired more open communication. If the multiple evaluation committees were not comfortable with an evaluation by the MEC, Willis felt they had an obligation and a right to note these differences, and that the MEC should review any challenges brought forward by the evaluation committees.

341. Pursuant to the above, a total of 48 challenges to the MEC evaluations were brought forward by the multiple evaluation committees. There was disagreement within the JUMI Committee as to whether or not the MEC should be reconvened to review these evaluations. At one stage, the consultants were asked to independently review approximately 33 adjustments suggested by the evaluation committees. Willis testified that in two-thirds of the cases the differences were so nominal that they were hardly worth considering, and those cases were discussed individually with the evaluation committees. Willis believed there were about 14 remaining questionnaire evaluations requiring review by the MEC.

342. Willis did not wish to say what the changes ought to be because he did not want to second-guess the MEC. He identified 14 questionnaires as problematic, each suggesting a possible problem or at least enough doubt that it ought to be revisited.

343. Willis felt strongly the MEC should be reconvened to put to rest the differences in interpretation between the MEC and the multiple evaluation committees. It was ultimately decided by the JUMI Committee that the MEC would not be reconvened; instead, a smaller version of the MEC (the Mini-MEC) was created. The Mini-MEC was composed of a small number of former MEC members, three in total: Willis; Joanne Labine, a union representative; and Michel Cloutier, a management representative. The latter two members had previously been identified as outliers on the MEC. Both outliers, Labine and Cloutier, had been identified by Willis in his direct observation of the MEC, and that conclusion had been confirmed by statistical analysis conducted by an independent statistician. It is one of the incomprehensible decisions of the JUMI Committee.

344. Not surprisingly, Willis questioned the JUMI Committee's decision to select those two outliers and suggested choosing two other individuals. According to Willis, they stone-walled me, and he lacked the authority to overrule the JUMI Committee's decision. He felt the two outliers were ill-prepared to represent the JUMI Committee because of gender bias in their original evaluations.

345. The Mini-MEC considered the consultant's recommended changes to the challenged benchmarks. The union representative agreed with the consultant, and the management representative rejected all of Willis' recommendations.

346. The Mini-MEC then suggested three options to the JUMI Committee, which were formulated without consulting Willis. These options included:

OPTION 1

It is proposed that MINI-MEC review the consultants' recommended changes to MEC bench marks (33).

-Should the two MINI-MEC members agree, the challenged bench marks and rationales will be amended.

-Should MINI-MEC after consultation with N.D. Willis or Jane [sic] Drury can not arrive at a decision, than MINI-MEC and the consultant will determine whether it is in the best interest of the study to remove the bench mark(s) in question.

OPTION 2

It is proposed that challenged MEC bench marks not be amended. In cases where MINI-MEC agrees that the rating is an inconsistent one, then the bench mark(s) in question(s) [sic] would be removed.

This proposal is based in part on our opinion that it is to[o] late to attempt to change bench marks and have the committees adjust their rating patterns.

It is to be noted that the two above options would necessitate re-sore thumbing in situations where a change to or the removal of a MEC bench mark is effected.

OPTION 3

It is proposed that should a situation arise where a committee is unable to reach a consensus on a rating, that the questionnaire be referred to MINI-MEC for resolution.

It is also recommended that further challenges to MEC bench marks not be accepted.

347. The JUMI Committee selected Option 2. As might be expected, the Mini-MEC could not agree which benchmark rating was inconsistent and, as a result, none of the benchmarks was removed. Although disappointed, Willis did not feel the integrity of the process was invalidated. At this stage he believed the evaluations were intact and reasonable.

348. Wisner performed the re-evaluations of the challenged benchmarks and suggested an average 4.2 per cent overall increase for the 14 benchmarks under review. The purpose of the consultant's review was to determine whether or not the differences were representative of a pattern of bias. The analysis by Willis did not demonstrate a pattern of bias, and Willis felt he could live with the JUMI Committee's decision not to recall the MEC. Willis did not believe the overall differences identified by this analysis between the consultants and the committee had a material adverse effect on the study.

349. Willis testified Wisner's analysis illustrates the percentage difference between Wisner's and the MEC's approaches. He stated the consultants tried throughout the study to refrain from imposing their evaluations on the MEC. Willis was asked when he would impose the consultants' evaluations on the committees. He responded in Volume 57, at p. 7053, line 14 to p. 7055, line 10:

THE CHAIRPERSON: Maybe you can tell us: Where do you draw the line between when you strongly make a recommendation or you strongly suggest or you advise? Where do you draw the line in terms of saying to the Committee, This should be done, or do you ever do that?

THE WITNESS: Yes. First of all, the consultant who is meeting or sitting with the committee will be privy to the questionnaire information and the discussion about the job. If we find at that time that people are not talking about the facts and are apparently not using the facts fairly and equitably, we would raise the question with the Evaluation Committee itself as it is proceeding.

On the other hand, after the fact, looking at a series of evaluations, we might disagree slightly with the committee, but our concern would be whether or not that difference might be a difference of honest interpretation by the committee, it might be a difference between what the consultant knows about that kind of a job -- and we can be a little bit misled ourselves as consultants -- as opposed to the committee, which may have a better handle or feel for the content of jobs in their organization.

If we identified a pattern that seemed to be resulting, then we would take very strong steps. Recognizing that these are value judgments, we have to have some tolerance. Just because we come out with an average of five or six per cent more overall than the committee, that doesn't necessarily mean that they are five or six per cent wrong. But our concern would be more: Is there a pattern here or is there a random difference? If it is a random difference, then we are not at all concerned, unless there is a possibility that they are misunderstanding how to use the instrument itself.

THE CHAIRPERSON: Why were you strongly advising that MEC reconvene?

THE WITNESS: Just because of the psychology of the committees, they felt very strongly about this. Even though there may be a very slight difference in a job, they feel uncomfortable if they haven't at least had a hearing.

350. According to Willis, the JUMI Committee's decision not to reconvene the MEC caused a considerable amount of frustration among the multiple evaluation committee evaluators. The consultants were obliged to tell the evaluators the JUMI Committee had made a policy decision and that there would be no changes in the benchmarks resulting from the committee challenges. Willis suggested that the committees try to work around the challenged benchmarks. He believes many evaluation committees tended to take alternate MEC evaluations for comparison with their own evaluations and tended to ignore the MEC evaluations which had been challenged.

351. Willis indicated the level of frustration was highest when the evaluation committees expected and were waiting for the MEC to reconvene. When they were informed this was not going to happen, they became more resigned.

(vi). Multiple Evaluation Committees' Evaluations

352. Some of the first five evaluation committees tended to negotiate rather than cooperate in trying to achieve a consensus. Willis found the evaluation committees tended to balance each other fairly well, but the obvious result was lower productivity. In the initial formation of the five evaluation committees, Committee #3 was more contentious and less productive than the others. Willis trained Committee #3 and led them for the first three weeks of their evaluations. He met with the chair weekly to try to work with her, as he considered her part of the problem. He sat with this committee a great deal of the time, working with them and monitoring their evaluations. He was better acquainted with this committee and their problems than with any other committee. Willis observed Committee #3 had individuals on the staff side who seemed committed to rating jobs from female-dominated occupational groups as high as they could, and two or three on the management side, some of whom had former classification backgrounds, who were devoted to keeping the jobs as much in line as they could. He described this as a stand-off.

353. Willis testified the chair of Committee #3, a union representative, sometimes assumed the role of evaluator and entered into the discussions in a way he believed was inappropriate for the chair. The proper role of a chair is to assume a neutral posture and to facilitate committee discussions. Willis subsequently recommended to the JUMI Committee that the chair of Committee #3 be removed and he eventually recommended the entire committee be disbanded. However, the JUMI Committee rejected his recommendations and nothing happened to improve Committee #3's situation until the reformation and expansion into the nine committees.

354. Willis described a well-functioning evaluation committee as a team working together with each member trying to evaluate fairly and equitably. His discomfort with Committee #3 was not with the actual evaluation ratings but rather with the manner in which the committee evaluated. This committee would debate until finally they would agree out of exhaustion.

355. Willis did not believe the stand-off he described between management and union evaluators on Committee #3 negatively affected the evaluation process in that committee. Neither he nor his consultants could detect any pattern of bias in Committee #3 evaluations. However, as a consequence of the standoff, some committee members experienced health problems and the productivity rate suffered noticeably.

356. Willis also had a problem with Committee #4. He testified Committee #4 was an excellent committee from its inception to about March of 1989. However, in the latter stages, due to substitutions and the reforming of this committee, problems developed. In April of 1989, Willis requested Committee #4 undergo a final sore-thumbing exercise. During this exercise, the chair of this committee came to him, almost in tears. Willis testified she said, I can't handle this any more. It has all broken down, they are all getting emotional, they are yelling at each other. We have a job to do and I quit. In the JUMI Committee minutes of October 31, 1989 (Exhibit R-44), Willis remarked, with regard to the consultant report on Committee #4, that the major problem with Committee #4 was its lack of objectivity, creating the disastrous consequence of two camps, separate agendas, and arbitrary and opposing viewpoints.

357. At this point the committee had evaluated 52 jobs. Willis then requested that the remaining committee members state in writing their individual concerns about the evaluations and suggest any changes which they thought were necessary. He then disbanded the committee. Subsequently, a Willis consultant, Robert Barbeau, reviewed the specified concerns, made recommendations and was asked to take appropriate action. The committee members made suggestions on a total of 25 jobs and there was only one in which the consultant differed significantly from the committee members. Willis described this as one instance where there was consultant influence on the evaluations, albeit a small amount.

358. Willis did not observe any problems occurring with Committee #1 or #2 during the initial formation of the five evaluation committees.

359. Willis' observations of Committee #5 were that the evaluators tended to be extreme, on one side or the other, but not as extreme as Committee #3 and their productivity tended to move along. Willis identified one female union representative demonstrating a female preference and two male management representatives demonstrating a male preference. Willis further testified a female union representative also demonstrated a male preference. Willis found two of these evaluators, a female union representative and one of the male management representatives, tended to cancel each other out. Willis observed that the other members of the committee were not influenced by these two individuals and tended to discount their positions.

360. Willis further found the evaluations generally produced by Committee #5 to be pretty good. He identified two members of the committee as outliers but, notwithstanding, later recommended they become chairs within the expanded nine committees because he considered them to be good evaluators.

361. The Tribunal heard evidence from three evaluators who were members of Committee #5. Each confirmed this committee's thoroughness in discussing jobs and diligence in completing their task. Their evidence further corroborates Willis' view that the outliers did not influence the consensus of the committee.

362. There was evidence provided by two evaluators who were members of the first version of Committee #5 to the effect that the questionnaires discussed in this committee were difficult. One of these evaluators, Mary Crich, explained the committee's long discussions resulted from very difficult male jobs.

363. Pauline Latour, another evaluator on Committee #5, states in Volume 171, at p. 21604, lines 20 - 25:

A. We had a difficult -- the questionnaires in Committee 5, I have a sense that they were more difficult to evaluate. There were many that we seemed to have unanswered questions. So, we definitely returned more questionnaires in Committee 5.

364. Latour further elaborates on this point in Volume 171, p. 21605, line 12 to p. 21606, line 9:

Q. You mentioned just a short time ago that there were some jobs that you recall as being more difficult to evaluate than others. Could you describe which of these -- give us perhaps some examples of the types of jobs that as a committee you found more challenging than others.

A. This perhaps is going to be a bit of a convoluted answer, but, for example, the jobs that we were comfortable with were jobs that we had rated many similar positions. For example, we evaluated many secretarial jobs which were evaluated at quite a range, from typists to senior executives. We had a good understanding of the nature of the work.

There were some positions where we evaluated basically one or two jobs that were related and we never had a sense of how that job actually fit in the section that that person worked in. So, because they were so unrelated, there were quite a few positions that were unrelated, we really had a difficult time just grasping the level of complexity of that position.

365. The Tribunal heard direct evidence from 15 witnesses, who were evaluators on Committees #1, #2, #3, #4, #5, #8 and #9, about their experiences and perceptions while serving on their respective committees. Evidence about Committees #6 and #7 was provided by Willis and another Willis consultant, Owen. Neither one expressed any serious concern about what they observed on either of these committees.

366. In terms of the direct evidence from these 15 witnesses, the Tribunal was impressed with their individual level of commitment to the study. Although job evaluation is a systematic process that is mentally challenging, the fact remains these individuals endeavoured to achieve a consensus evaluation for each position, eight hours a day, five days a week, over long periods of time. Willis observed variations in the productivity of the committees. The productivity record, based on a total of 3,185 questionnaires, is as follows: (i) Committee #1 - 466 evaluations; (ii) Committee #2 - 431 evaluations; (iii) Committee #3, before the expansion to 9 committees - 165 evaluations; (iv) Committee #4 - 200 evaluations in its first version, 52 evaluations in its second version and 160 evaluations after the expansion to 9 committees; (v) Committee #5 - 430 evaluations; and, after the expansion to 9 committees, (vi) Committee #6 - 197 evaluations; (vii) Committee #7, a francophone committee - 149 evaluations; (viii) Committee #8, also a francophone committee - 150 evaluations; and (ix) Committee #9 - 145 evaluations.

367. Given his experience in previous studies, Willis expects a certain amount of conflict within an evaluation committee because of the different backgrounds and perspectives of the various evaluators. However, Willis testified the degree and nature of the conflict he observed in this study within the evaluation committees made him feel uncomfortable.

368. Some of the problems which arose in the multiple evaluation committees had been anticipated by the JUMI Committee. The Testing Sub-Committee of the Willis Evaluation Plan, in its report of July 20, 1987 (Exhibit HR-11A, Tab 19), made recommendations in response to problems experienced by this sub-committee during a two-week trial period. Some of the problems included personality conflicts, weariness owing to constant concentration, and stress from being seconded from their regular jobs for long periods of time. As a consequence of this experience, the sub-committee recommended the rotation of evaluation committee members between evaluation committees, working a shorter day or week, and the utilization of alternate members to replace designated committee members for periods of time. These recommendations were never acted upon by the JUMI Committee, and the Tribunal was not provided with reasons for their rejection.

369. The evaluators testified they experienced tension as committee members, stress in reaching a consensus, personality conflicts, inflexibility on the part of some individual evaluators, difficulties with some chairpersons and screaming by some evaluators. In some instances, evaluators walked out of evaluation meetings because of frustration. Compounding these problems was the frequent substitution of members on some committees. This resulted in a change of dynamics requiring adjustments by both new and older members.

370. Coupled with these problems was a rigid working environment orchestrated and controlled by the Chief of the EPSS, who was apparently more flexible with management evaluators than with union evaluators. The Chief, Pierre Collard, closely monitored the arrival and departure times of the evaluators, the lunch breaks and the coffee breaks. He insisted doors remain closed at all times during deliberations (causing ventilation problems), limited access to telephones, and kept all supplies in locked compartments (thus creating time delays for obtaining supplies). These very stringent constraints intensified the frustrations already experienced by committee members. Moreover, some evaluators were from out of province and found it difficult to wait for long periods to be reimbursed for their travel expenses. This issue, in particular, was not resolved in a timely fashion.

371. Many evaluators who testified at this hearing expressed a willingness to adhere, and acknowledged the necessity of adhering, to the MEC benchmarks as well as to the requirement that evaluations be based upon facts contained in the questionnaires and not on any other extraneous considerations.

372. The only criticism the Tribunal heard concerning the committees' willingness to follow the MEC discipline was that for a short period of time, early in the evaluation process, Committee #1 tended to follow its own discipline rather than that of the MEC. This problem was corrected as soon as it had been identified by the Willis consultants.

373. Early in 1989, Willis began to express to the JUMI Committee his concerns, not directly related to the actual evaluations themselves, but regarding circumstantial things which had transpired. He referred to these incidents as smoke because they were largely rumours, and they included incidents which occurred both inside and outside the committee rooms. He became increasingly uncomfortable with how the evaluation committees were working and with what he described as confrontations between union and management sides. Although he could not identify anything specific which would suggest gender bias was developing, based on his own observations and those of his consultants, he knew some things were happening and some improper attitudes were developing, causing him a great deal of concern.

374. On several occasions, while Willis sat with a committee, it became clear to him that a position taken by a particular evaluator was very biased. Usually, the individual evaluator refused to change the score, even though lacking the facts upon which to base a rating. The frequency of these occasions began to disturb Willis. It was occurring, he observed, on both union and management sides, and arose more frequently among the members of the earlier formation of the five evaluation committees.

375. Willis said that he was made aware of union members attempting to recruit other evaluators to their bloc. He had not seen this phenomenon in any other evaluation study in which he has been involved. Willis did not observe directly any incidents regarding this recruitment. He was informed, however, by Owen, of an incident in which an evaluator approached another evaluator about the evaluations. Owen testified about the circumstances surrounding this incident, which occurred in February, 1989. Owen testified he overheard a conversation between two female evaluators who entered a room in which he was working. He overheard one female evaluator say to the other we don't think you're doing enough for women's jobs. According to Owen, the other evaluator became agitated, her voice increased in loudness and he heard her reply I didn't come here to build up some kinds of jobs. I came here to do an honest job of evaluating the work.

376. Owen further testified he observed a sort of faction-based behaviour in the committees. There were some union evaluators who seemed to be treating certain jobs in a similar way as union evaluators in other committees. He identified them as Alliance members. What troubled Owen was that, in his prior experience, which involved training and facilitating more than 50 evaluation committees, he had not observed any similar behaviour. He also noticed unusual scoring, long discussions advocating a particular choice, and the selection of benchmarks inappropriate to the particular evaluation at hand. In another incident, during the initial formation of the five evaluation committees, Owen was asked to chair Committee #3 because the regular chair was participating as an evaluator elsewhere. When the chair returned to the room, a very contentious argument concerning an evaluation was taking place. The chair asked Owen to rule on how to proceed and asked for points of order similar to Robert's Rules of Order. Owen was completely unfamiliar with Robert's Rules of Order and was thus unable to give an appropriate response. The chair's reaction was to order and instruct the Alliance evaluators on this committee to walk out, which they did, slamming the door as they left. Owen viewed this unhappy incident as an attempt by one side to control that particular committee.

377. Like Willis, Owen felt frustration at not having any real opportunity to intervene or take action.

378. Another incident noted by Owen occurred during the fall of 1988. Most of the Alliance members did not attend their committees on a particular day, having designated it the day for a sick-out to demonstrate their support for pay equity issues at the collective bargaining table. Apparently, collective bargaining was under way and there was some discussion among union members as to whether the proposals on pay equity would be withdrawn from the bargaining process. Two Alliance members who did not participate in the sick-out told Owen they were concerned about reprisals from their union for not having participated.

379. Among the committees, Willis felt the conflict was too much us versus them. Willis confirmed he had never seen so many participants with a classification background in a pay equity study, and this was an important aspect of the conflict in this case.

380. Willis testified if the Federal Public Service was his organization and he had control over the evaluation process and decision-making authority, he would have made some changes and continued with the study. His preference would have been to remove the personnel creating problems and engage more consultants to work closely with each committee.

381. Willis' expert opinion is that gender bias can operate very subtly in a pay equity study, and he felt in order to defend the results, he had to reassure himself there were no problems with the evaluations. Willis was not sure the actual problems which existed resulted in biased results. He stated in Volume 69, at p. 8654, lines 8 - 14:

I have mentioned that there was an interesting contradiction. I had some very strong concerns about attitudes, things we observed. However, when we attempted to look at what committees' results were and when we tried to look at comparison of similar jobs, we were not able to detect a clear pattern of a problem.

382. Willis testified that, based on his observations during the evaluations of the first five evaluation committees, he identified ten evaluators who he believed were exhibiting gender preferences. According to Willis, the majority were exhibiting a female preference. His approach in dealing with this problem was to counsel these individuals. At this stage, Willis could not determine whether the identified evaluators were influencing the group evaluations. He was concerned he had no evidence other than his personal observations for support. He alerted his consultants, who were already aware of the particular individuals. He and his consultants continued to track these individuals and to look at the results of the evaluations overall.

383. Another approach of the consultants was to break the evaluations down by occupational group, determine whether these individuals were influencing the group, and then attempt to identify, on an overall basis, whether there appeared to be any problems with bias. This tracking did not seem to indicate any significant bias.

384. When counselling individual evaluators, Willis would sit with them in a private room and discuss their evaluations and what changes he expected from them. Willis testified he did not see any difference or change in the evaluations of the individuals after they received counselling. While counselling some management evaluators, Willis was informed that they were evaluating so as to offset the evaluations of the union evaluators. For the most part, the evaluators whom Willis counselled did not deny the behaviour attributed to them.

385. Throughout the study, Willis also conducted committee counselling. He observed an evaluation committee as they were evaluating. In his interventions, he attempted to direct the evaluators to the facts, to look at the questionnaire and discuss the actual position rather than to make assumptions or stereotype. As to the effect of committee counselling Willis said the following in Volume 57 at p. 7087, line 9 to p. 7088, line 5:

Q. With regard to the last type of counselling you just gave for the evaluation committees as a group. I had already asked you as to your opinion of the efficacy for the individuals. Now I want to know what your opinion is with regard to how well the counselling of the evaluation committee groups worked?

A. That is a little hard to say. These committees were somewhat unusual compared to most committees I work with, in that I was not observing actual evaluation bias or any pattern that I could identify. On the other hand, I did not have committees that were all working together to accomplish a fair, equitable, conscientious result.

What I had in many committees were the staff on one side and the management on the other side and they were at loggerheads. This was a pattern that was not universal, but we found it on several committees. The extent to which our counselling affected them, in some cases, was negligible. [emphasis added]

386. During later testimony, Willis was asked to explain what he meant in the above excerpt by the words I did not have committees that were all working together to accomplish a fair, equitable, conscientious result. Willis explained his reference is primarily to the word conscientious. To him this word suggests an employee is working hard and meeting their own personal standards. In this context, Willis testified every individual he observed on every committee was evaluating conscientiously. On the other hand, the consultants attempted to instill a standard by which every job would be treated fairly, objectively and impartially. In that context, Willis said he observed evidence, which was not pervasive among all evaluators and committees, that this standard was not being consistently applied.

387. The testimony from the participating evaluators who were asked about how they personally approached evaluations was that they were honest, dedicated and conscientious. They observed the same commitment from most of their committee members.

388. Specific questions were posed to the evaluators who testified about Willis' concerns, referred to as smoke. The questions concerned rumours that some committees were block voting, meaning the union evaluators would vote together to obtain the same score for subfactors and the management evaluators would vote together to obtain theirs, and rumours about other methods of communication, including the use of sign language and hand signals to indicate how specific evaluators were scoring so as to influence decisions.

389. None of the evaluators who testified observed this kind of behaviour or any other kind of organized communication designed to over-evaluate female jobs and under-evaluate male jobs. Apparently, hand signals had been discussed in a social setting, which one witness believed resulted from frustrations expressed about the difficult process of job evaluation. The suggestion was made and received in a joking manner.

390. The Tribunal heard direct evidence of three separate incidents of inappropriate behaviour. In the first incident, both evaluators were female representatives of the Alliance who were involved in the conversation overheard by Owen, referred to earlier. An evaluator on Committee #4 testified she was approached by another evaluator on her committee concerning whether or not she was evaluating female-dominated jobs fairly. This witness had the impression that this individual wanted her to increase her ratings. The witness testified she responded by saying she was there to evaluate fairly and to the best of her ability in comparison to all of the jobs. As far as the witness was concerned, this was the end of the incident.


391. The second incident also occurred between two female Alliance evaluators. The witness testified she was approached by another evaluator who wanted to meet her outside of the committee room to discuss how to evaluate jobs. The gist of the meeting was that the second evaluator wanted the first evaluator to rate female-dominated jobs in a higher bracket, as she herself did. The first evaluator felt this was not an objective approach and told the second evaluator that her own ratings would continue to be objective.

392. With regard to the first and second incidents respectively, both witnesses testified the incident did not have any impact on their manner of evaluating. The evidence is clear that the individual who made the request in the first incident was noted by her committee for her biased ratings, which the committee had endeavoured, unsuccessfully, to change. Since she refused to change, she was basically ignored by the rest of her committee.

393. The third incident involved a female Institute evaluator. This evaluator testified there was a social gathering in her hotel room involving about 10 or 15 evaluators. Later in the evening, a conversation occurred between this evaluator and four other evaluators from the Alliance. The Institute evaluator testified she had been advocating an objective point of view for doing evaluations and two of the Alliance members became very aggressive toward her. Their response was that the study was an opportunity for women to have something done for them, that nothing was going to get done unless women's jobs were evaluated higher, and that the study was their last chance. The Institute evaluator testified things then got a little too personal. Another Alliance witness who testified described this incident as a verbal attack on the Institute evaluator.

394. With regard to this third incident, the Institute evaluator assumed the individuals who confronted her in her hotel room were in a position of authority vis-à-vis the Alliance and could call meetings and influence other Alliance evaluators. At the time of giving her testimony, she admitted she no longer had any basis for this belief and no longer felt there existed a common understanding among Alliance evaluators to act dishonestly.

395. Willis recalls he had discussions about problems in the evaluation committees with the Mini-JUMI, a sub-committee of the JUMI Committee formed to handle procedural problems of the evaluation committees. Willis testified he discussed with two of its members, Gaston Poiré and Elizabeth Millar, some of the evaluators he felt were creating problems. Willis suggested certain individuals be eliminated from the evaluation committees but testified he did not get the active support he expected. As a result, the JUMI Committee reassigned problem individuals when the committees expanded from five to nine. According to Willis, after the committees expanded, some committees worked well and some still had problems, but not to the same extent as the initial five evaluation committees. He stated nothing was worse than the original Committee #3; in his estimation, it was at the bottom of the barrel and after that it got better. (Volume 69, p. 8653).


396. Willis regarded what was happening in the evaluation committees as unacceptable. He concluded he needed to conduct a more in-depth analysis of the results if he was going to be able to support the outcome of the study. Although he had not identified gender bias in the evaluations by January and February of 1989, he said in Volume 58, at p. 7229, line 13 to p. 7230, line 3:

A. I think the one thing that characterized the whole study, the equal pay for work of equal value charge, was to evaluate a broad range of positions on a gender neutral basis. I think everything we did in terms of the process that we set up and the evaluation system that was used, the way we tried to work with the groups, was all aimed primarily at avoiding any evaluations that would suggest traditional relationships, or in any way any bias that could be identified as gender bias.

I feel that at all stages in this study it was paramount that we continue vigilance and continually reinforce the need for objective, fair, equitable evaluations of any and all kinds of positions.

397. A letter dated May 4, 1989 to the JUMI Committee co-chairs from Scott Gruber, a Willis consultant, contained a recommendation that a special analysis of evaluation committee results be undertaken. The letter reads in part:

This letter describes our proposal for a special analysis of evaluation committee results, which we believe is timely and appropriate. The question to be addressed is:

Have the evaluations of the five evaluation committees (#1 through #5) been consistent with the evaluations generated by the MEC?

...

The methodology for this analysis will be as follows:

  1. A sample would be selected randomly from the evaluation result of each committee. The sample size will be 10% of the positions evaluated, with a minimum of 25 per committee. This latter provision allows for a reasonable examination of the efforts of low productivity committees. Using these guidelines, the total sample will be approximately 140 positions.
  2. A Willis consultant, familiar with the MEC evaluations, will examine each of the 140 questionnaires and make comparisons with appropriate or corresponding MEC questionnaires.
  3. Based on this examination the consultant will then assess the soundness of the final, post-sorethumb consensus evaluations from the five committees together with their selected MEC benchmark questionnaires. Problems and trends will be identified, by committee and for the entire group.
  4. Gender domination information will be obtained for positions in the sample at this stage. Additional analysis will identify whether any committee, or committees, exhibited tendencies regarding male or female dominated groups in their final results. Other variables besides gender could also be included in the analysis at this stage.
  5. A report will be prepared and presented to you, describing the process of the research, the analysis, and the findings.

...

We view this as a quality assurance study, to examine the evaluation results of five committees, comprised of people with a diversity of education, experience, and occupation, that could not mirror the characteristics of the composition of the MEC...A major question to be explored is whether the committees have used the MEC benchmark evaluations consistently and properly in the comparison process.

...If the results show that the five committees have performed their respective tasks consistently with the MEC, many concerns regarding the study will be resolved. On the other hand, if problems are identified corrective actions can be taken and the continuing efforts of the nine committees will benefit from the knowledge gained.

(Exhibit HR-11B, Tab 32)
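By way of illustration only, the sampling rule described in step 1 of the letter (10 per cent of each committee's evaluated positions, with a floor of 25) can be sketched in Python as follows. This is a minimal sketch: the committee output counts are invented, and nothing here reproduces the consultants' actual procedure.

    import random

    def sample_positions(positions_by_committee, fraction=0.10, minimum=25):
        # Draw a random sample from each committee's evaluated positions:
        # 10% of the committee's output, but never fewer than 25, so that
        # low-productivity committees are still examined meaningfully.
        samples = {}
        for committee, positions in positions_by_committee.items():
            size = max(int(len(positions) * fraction), minimum)
            size = min(size, len(positions))  # cannot sample more than exist
            samples[committee] = random.sample(positions, size)
        return samples

    # Invented committee output counts, for illustration only.
    evaluated = {c: [f"position-{c}-{i}" for i in range(n)]
                 for c, n in {1: 400, 2: 350, 3: 240, 4: 150, 5: 100}.items()}
    print({c: len(s) for c, s in sample_positions(evaluated).items()})
    # -> {1: 40, 2: 35, 3: 25, 4: 25, 5: 25}, on the order of the
    #    approximately 140 positions the letter anticipates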

398. A snapshot assessment of the validity of the evaluations was requested, to be conducted on the 2,000 positions evaluated to date. Willis suggested one of his consultants examine 10 per cent of the completed evaluations and compare the committees' evaluations to the consultant's evaluations. In this way, he would at least satisfy himself there was no evidence of a problem, or would expose the possibility that a problem might exist. His intention at the time was to start with a small study, which might expose evidence of discrimination. If a problem was revealed, he anticipated conducting a second, more in-depth analysis, which would expose the extent of any problem indicated by the first study. He did not indicate to the JUMI Committee directly that he anticipated a two-tiered approach.

399. The proposal of a small study was accepted by the JUMI Committee and the analysis commenced in the spring of 1989. The analysis, entitled the Special Analysis of Evaluation Committees' Results (the Wisner 222), was prepared by the Willis consultant, Jay Wisner (Exhibit PSAC-4). Wisner examined and re-evaluated 222 of the committee evaluations from both the five and nine committees. When the sample of 222 positions was drawn, the multiple evaluation committees were still evaluating questionnaires and the nine committees had been operating for about three months.


(vii). Re-Training of Multiple Evaluation Committees

400. This step in the Willis Process involves retraining an evaluation committee or an individual evaluator. If the consultant noticed a problem, the objective of the retraining session was to bring the committee or individual back to the MEC discipline. Retraining could be as informal as that which took place during the life of the MEC, when Willis assisted the committee in interpretation of the plan, or it could involve more formal sessions, such as those which occurred during the work of the five and nine committees. After the initial training for the five evaluation committees during the week of September 19 - 23, 1988 (Exhibit HR-11B, Tab 27), the next formal retraining occurred in March-April, 1989, following the expansion of the multiple evaluation committees. Between these sessions, less formal training was provided by the consultants as required.

(viii). Sore-Thumbing

401. Another procedural safeguard in the Willis Process is a review process referred to as sore-thumbing, which is synonymous with the term interim review. According to Willis, the first interim review usually occurs after 25 to 30 jobs have been evaluated. These jobs are then listed in descending order of points and comparisons are made between the jobs, factor by factor. The idea is to look for sore-thumbs, that is to say, evaluations which may not have the same consistency as the other evaluations. A final sore-thumbing session occurs after all the jobs have been evaluated. This technique is designed to ensure consistency within a committee and reveals whether a committee has varied from its discipline. No evaluator was permitted to be involved in a sore-thumb exercise if they had not been present during the original evaluation.
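The interim review mechanic described above lends itself to a short sketch: jobs are listed in descending order of total points and each factor score is compared against those of similarly ranked jobs. The data layout, the comparison window and the flagging threshold below are assumptions made for illustration; they are not the actual Willis procedure.

    FACTORS = ["knowledge_skills", "mental_demands", "accountability", "working_conditions"]

    def sore_thumb_review(evaluations, window=2, threshold=0.25):
        # List jobs in descending order of total points, then flag factor
        # scores that deviate sharply from those of neighbouring jobs.
        ranked = sorted(evaluations, key=lambda e: e["total"], reverse=True)
        flags = []
        for i, ev in enumerate(ranked):
            neighbours = ranked[max(0, i - window):i] + ranked[i + 1:i + 1 + window]
            if not neighbours:
                continue
            for factor in FACTORS:
                avg = sum(n[factor] for n in neighbours) / len(neighbours)
                if avg and abs(ev[factor] - avg) / avg > threshold:
                    flags.append((ev["job"], factor, ev[factor], round(avg, 1)))
        return flags  # candidate sore-thumbs for the committee to revisit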

402. The MEC had five sore-thumb sessions which resulted in nominal changes. Overall, Willis was satisfied with the results of the MEC sore-thumbing. Each of the other evaluation committees also had four or five sore-thumb sessions. The evaluation committees' sore-thumb exercises had a different emphasis than the MEC's, simply because the concern was more with whether the committees were adhering to the MEC discipline. Their sore-thumbing took the form of reviewing their own evaluations and comparing them with the MEC discipline so as to ensure consistency.

403. If the evaluation committees were not consistent with the MEC discipline on a factor by factor basis, the result would be a lack of consistency in overall evaluations across the board. The degree of liberalness or conservativeness is not always the same from one factor to another. The important rule is all jobs must be treated the same way; that is, if the committees are going to be liberal in interpersonal skills, then they must be liberal with all jobs and if they are conservative in knowledge and skills, then they should be consistently conservative for this factor. Willis did not express a direct opinion on the effectiveness of the multiple evaluation committee sore-thumbing exercises.

D. RELIABILITY TESTING


404. As part of the Willis Process, Willis generally recommends reliability testing of the evaluations.

(i). Inter-Rater Reliability Testing

405. The first type of reliability testing is inter-rater reliability (IRR) testing which specifically identifies evaluators who may be developing patterns in their ratings inconsistent with the other members of their committee. Willis introduced the concept of IRR testing during the planning phase of the JUMI Study.

406. Willis explained IRR testing is advisable for two reasons. The first is personal: Willis finds, when counselling evaluators who demonstrate bias in their evaluation scores, it is helpful to have documentation of a statistical nature to support his observations and opinions. If it were otherwise, it would be the consultant's word against the evaluator's. Willis finds it helpful to use the IRR testing with the individual and to ask the evaluator to look at the pattern in their evaluations. This makes it easier to discuss the problem with the evaluator and to convince the evaluator to change. He testified that, in certain instances, evaluators would refuse to heed the suggestion their evaluations were biased unless confronted with statistical documentation.

407. The second reason Willis introduced the IRR testing is that, in a very large and important study like the JUMI, he felt the results would be subjected to public scrutiny and, in that sense, might be criticized if this procedure were not used.

408. Willis made it clear IRR testing is not necessary in order for him or his consultants to observe and identify outliers. Willis testified an experienced consultant will always recognize an outlier, but this testing provides some written statistical evidence.

409. Willis' recommendation for IRR testing was not accepted by the JUMI Committee in the initial planning stages. He later reintroduced this concept when the MEC started its work. There was some debate within the JUMI Committee about whether or not the testing should actually be undertaken. At the January 13, 1988 meeting of the JUMI Committee (Exhibit R-9), the management side agreed in principle there was a need to conduct IRR testing in addition to inter-committee reliability testing but questioned the current Willis proposal.

410. The JUMI Committee formed a sub-committee called the Inter-Rater Reliability and Methodology Sub-Committee (the IRR Sub-Committee) which was delegated to explore this issue. Its mandate was:

  1. to determine and make recommendations about the methodology and research necessary to test evaluation committee rater reliability
  2. to assess and make recommendations about research methodology as it applies to the JUMI Study as a whole (Exhibit HR-11A, Tab 26)


411. Willis testified he was not certain exactly why there was resistance from the JUMI Committee to IRR testing but, ultimately, it was decided by that committee to engage the consulting firm of Tristat Resources to perform the testing. For his part, Willis accepted and agreed to this arrangement. The testing was conducted by Dr. Richard Shillington, a statistician who testified as an expert at this hearing.

412. Willis was disappointed the actual IRR testing did not commence before the MEC had completed its work. As a result, he was unable to use the results in his counselling of the MEC evaluators whom he identified as outliers. Willis stated that, other than that, he was satisfied with the testing itself.

413. Originally, Willis had proposed to undertake the IRR testing at least three or four times during the course of the MEC's work, thus providing statistical information which he could use as a basis for discussion with evaluators who exhibited gender bias. During the MEC's work, Willis identified two evaluators as outliers, and the IRR testing later confirmed his observation. Willis met with them but, since no testing had yet taken place, he had no documentation to support his counselling.

414. Willis did not have authority to remove those he identified as outliers. At the time, Willis felt their biases were subtle, ineffective and not harmful to the MEC's work. In both cases, Willis' counselling had little or no effect. Willis testified the two outliers tended to cancel each other out. One was systematically favouring male jobs and the other female jobs. The IRR testing confirmed the identity of the two outliers.

415. Although the IRR testing did not assist Willis in his effort to counsel the outliers, in his opinion the testing could still be used as after-the-fact evidence of the consistency of the evaluation process. Shillington's report on the IRR testing of the MEC evaluations was released on July 31, 1988. The report is referred to as the Tristat Report.

416. Shillington first became involved in the JUMI Study in the spring of 1988. He was approached by a Treasury Board member of the IRR Sub-Committee and was asked if he would be interested in the work of the sub-committee. Although Shillington was retained by the Employer, he viewed the IRR Sub-Committee as his client. In the context of the IRR testing, Shillington conducted statistical tests to analyze and interpret inter-rater reliability. The purpose was to determine whether evaluators functioned consistently and whether evaluators treated questionnaires for male- and female-dominated occupational groups in a consistent fashion.

417. Shillington understood his role as assisting the IRR Sub-Committee to develop a methodology which could be used with the data to address their questions and to assist them in making some decisions. The IRR Sub-Committee was primarily interested in identifying evaluators who seemed to have a gender preference or a gender bias in their questionnaires, but there were other aspects as well. One of these was a determination of whether these evaluators were influential evaluators within their committee. (Influential evaluators, in this context, means evaluators who seemed to be able to shift the consensus score of the committee towards their own initial rating.)

418. Shillington used a combination of statistical tests, t-tests, chi-square and z-scores (which are similar to t-tests), to compare the differences between individual evaluator scores and committee averages and to determine whether there was a pattern of difference between male- and female-dominated questionnaires.
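A hedged sketch of the kind of comparison this paragraph describes is set out below, using Python and scipy: for a single evaluator, the deviations of initial scores from the committee average are split by the gender dominance of the questionnaire and compared with a two-sample t-test. The data layout is assumed; the actual Tristat computations are not reproduced in the record.

    from scipy import stats

    def gender_pattern_test(records):
        # records: list of (evaluator_score, committee_average, dominance),
        # where dominance is "male" or "female" for the occupational group.
        male_devs = [s - avg for s, avg, d in records if d == "male"]
        female_devs = [s - avg for s, avg, d in records if d == "female"]
        # Does the evaluator deviate from the committee average differently
        # on male-dominated versus female-dominated questionnaires?
        return stats.ttest_ind(male_devs, female_devs)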

419. Shillington identified two MEC evaluators demonstrating a systematic gender preference in their ratings. One was a male management representative who allocated male-dominated positions a higher rating than the committee and the other was a female union representative who allocated female-dominated positions a higher rating than the committee. These evaluators were the same individuals identified by Willis and who ultimately became members of the Mini-MEC. The IRR test results did not indicate there was a dramatic difference between their scores and the committee scores on every single questionnaire but a rather subtle, smaller pattern appeared fairly frequently.

420. The IRR Sub-Committee had also requested Shillington identify influential evaluators. The sub-committee put the question: were there particular raters who seemed to be able to shift the consensus score more often than other raters? To answer this, Shillington looked at questionnaires where the consensus score was not near the middle of the ratings in order to determine how often particular evaluators were in the situation where they had apparently moved the consensus score towards their own score. Using this methodology, some evaluators were identified as influential.
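A rough sketch of this influence test, under assumed data structures, follows: attention is restricted to questionnaires where the consensus landed away from the middle of the initial ratings, and a count is kept of how often each evaluator's initial rating lay on the side toward which the consensus moved. The off-centre threshold is an invented parameter.

    from statistics import median

    def influence_counts(questionnaires, off_centre=0.5):
        # questionnaires: list of (consensus_score, {evaluator: initial_rating}).
        counts = {}
        for consensus, ratings in questionnaires:
            mid = median(ratings.values())
            shift = consensus - mid
            if abs(shift) <= off_centre:
                continue  # consensus near the middle of the ratings; skip
            for evaluator, rating in ratings.items():
                if (rating - mid) * shift > 0:  # rating on the side of the shift
                    counts[evaluator] = counts.get(evaluator, 0) + 1
        return counts  # high counts suggest apparently influential evaluators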

421. Shillington was then asked to identify the extent to which evaluators who had shown a gender bias were influential. These test results indicated that the evaluators who demonstrated a significant level of influence over the committee were not the same two evaluators who had been identified as having a gender bias and that the most influential evaluators displayed no gender preference.

422. The third and last aspect of the IRR testing was the identification of questionnaires for re-review. This exercise arose from the identification of influential evaluators. The IRR Sub-Committee used the test results to identify questionnaires where the consensus score seemed to be either large or small compared to the initial ratings. Approximately 103 questionnaires were referred by the IRR Sub-Committee for re-review and characterized as unusual. Of the questionnaires referred, a single factor, working conditions, was responsible for 43 of them being identified as unusual.

423. Shillington testified regarding the limitations of the IRR testing methodology contained in his report of July 31, 1988. Using the methodology of comparing evaluator initial scores to committee average scores, an assumption had to be made, according to Shillington, that the committees are less biased than individual evaluators. In this context, the overall average of a committee is then considered more reliable.


424. Another limitation expressed by Shillington is found in the Tristat Report:

Further, the fact that a rater systematically favoured occupations dominated by one gender over another does not imply a gender preference. Since the sexes were not equally distributed in the population, it may simply have been a result of a bias for or against some other factor which was common in occupations dominated by one gender. For example, a bias in favour of advanced education would have caused a rater to be identified as having a preference in favour of males having been more common in senior positions. Similarly, individual rater preferences associated with technical skills, or physical labour would have lead to the appearance of a gender bias. (Exhibit HR-39, p. 5)

425. As to the limitations expressed in the above excerpt found on page 5 of the Tristat Report, Shillington testified in Volume 86, at p. 10653, line 10 to p. 10656, line 11:

THE WITNESS: The mathematical statistics are not a lot of help in that. That is basically an interpretation question. Thank you for drawing my attention to that limitation. When I was trying to summarize this report, I didn't mention it and it was an important limitation.

The mathematical statistics can be helpful in identifying that an individual was treating questionnaires from male-dominated groups differently than questionnaires from female-dominated groups. But it can't do a lot to help you understand why.

The limitation that is expressed in the section you pointed out, that it might be an indirect relationship to education, or blue- collar/white-collar preference, or things like that, is certainly a valid consideration. Someone who had a strong preference who thought that advanced education was undervalued or though that work outside was undervalued or overvalued could possibly appear to have a gender preference or a gender bias -- I will use those words interchangeably for a moment -- and you would have no way of knowing whether or not it was directly related to gender or an indirect relationship to something that is correlated with gender.

If that idea that I discussed of having hypothetical questionnaires inserted into the process, questionnaires that were basically rigged to appear to have a gender difference even though they were identical in all other respects, if you had done that, then you could have actually addressed some of this concern.

As part of that you could have said: If we had someone who had a high education preference and that high education preference might get reflected in gender bias, can we design three or four questionnaires which are all similar in terms of requiring advance qualifications, but are different in terms of a male/female composition? Then you could look at those questionnaires for these individuals and try to identify whether or not when you compared two jobs that had an advance qualification requirement but were slightly different in gender, whether or not those persons treated those jobs differently or not. Then you could try to distinguish between whether or not it was truly a gender preference that was operating or whether or not it was a high education preference.

The gender preference that is identified by the mathematics could be an indirect relationship to some other preference. Basically it is an interpretation question. The mathematics can't really help you, except, I guess, in judgment. The stronger the relationship is, the more striking the difference between the treatment of the male and female questionnaires, the more most people would, in judgment, conclude that it really was a gender preference operating and not something correlated with that.

You talk about a gender preference as opposed to a gender bias. We use the terms gender preference and gender bias fairly interchangeably in the work because of this concern that someone might get labelled as having a bias when in fact it is potentially related to a preference for education or blue collar, a secondary relationship. We caution someone that we should call it a preference.

For the two raters that were identified there, the relationship in the data was so strong that I would have a hard time believing that it wasn't gender that was driving the distinction between the way they treated the questionnaires.

426. The IRR Sub-Committee produced its own report concerning the IRR testing performed by Shillington. Their report was released on July 15, 1988, about two weeks before the Tristat Report was officially released. The Sub-Committee's report of July 15 differs from the Tristat Report of July 31, 1988 by referring to 103 problematic questionnaires requiring re-evaluation. On the other hand, the Tristat Report identified these questionnaires as unusual and suggested that they should be reviewed not re-evaluated. Shillington testified the IRR Sub-Committee's report used stronger language than he used and, in his opinion, the identified questionnaires should be looked at, nothing more.

427. Shillington attended the JUMI Committee meeting of July 15, 1988, when the IRR Sub-Committee report was tabled. The tabled report, written by a management representative, indicated 103 of approximately 500 benchmarks had been influenced, requiring further examination and possibly re-evaluation and sore-thumbing.

428. Shillington testified the use of the word influenced in that context did not reflect what was agreed upon in the IRR Sub-Committee. He indicated his July 31, 1988 report was his best recollection of the opinions formed by the IRR Sub-Committee. Shillington testified he was of the view the identification of the 103 questionnaires as having been influenced was not supported by the research he had done.

429. Willis also testified about this aspect of the Tristat Report (Exhibit HR-39) and the IRR Sub-Committee Report (Exhibit HR-11B, Tab 26B). Willis did not agree with the section of the IRR Sub-Committee report which dealt with influential raters; he stated the sub-committee appeared to overlook the fact that Willis considers it necessary and desirable that evaluation committee members be permitted to make adjustments in their evaluations at consensus time based on factual information. The fact there is a shift from the majority evaluators towards a minority evaluator in a number of cases is not by itself evidence of a problem. Willis testified in Volume 38, at p. 4803, lines 7 to 20:

There were, if I recall, a couple of raters on the evaluation committee who did have an influence not because they were biased but because they were bright analytical people that others respected. Usually when they had a statement about an evaluation or were asked to provide information relative to the facts of that rating, they generally had very sound reasons, and these reasons were respected. So there were occasions where other members of the Master Evaluation Committee did respond to them.

I don't consider that a limitation. I think that was one of the steps that was built into the process.

430. Eventually, Willis did follow through with a re-evaluation of the 103 questionnaires identified by the sub-committee. Willis and his consultants had been asked by the IRR Sub-Committee to question the assumption that the 103 questionnaires presented a problem. One of Willis' consultants, Jay Wisner, did the re-evaluations and prepared a report for the JUMI Committee entitled Analysis and Conclusions Concerning the Master Evaluation Committee's Work, dated July 1988 (Exhibit R-22). Willis testified he reviewed each of Wisner's evaluations and made some minor changes in the report. Willis testified it was our conclusion that a systematic review of further evaluations was not warranted, nor was a reconvening of the MEC necessary. Willis felt the evaluations were appropriate, and he was very comfortable with the overall results given the reasonable random disparity among the evaluations. Moreover, he felt at this point the JUMI Study should proceed as planned.

431. Willis' conclusions are contained in a report dated July, 1988 to the JUMI Committee. The conclusions concerning the re-examination of the IRR analysis are reproduced as follows:

After careful and intensive consideration of the questions raised by the IRR committee's analysis of the MEC's evaluations, we find that the principal recommendation of that report, that the MEC should be convened to re-examine a large number of evaluations, is not supported.


We have re-examined the evaluations which the IRR analysis indicated were unusual. We did not find evidence that any raters exercised undue influence over the group consensus evaluation. In our opinion, the great majority of the evaluations listed by the IRR committee are the product of accurate and consistent application of the evaluation plan by the MEC, and should not be changed.

For those few positions where we recommend re-evaluation, we found no pattern of influence by a minority resulting in evaluations with which we disagree; in some cases, we recommend movement further from the middle of the initial individual evaluations. We believe that the eventual re-examination by the MEC of the ten evaluations where we suggest some revision need not delay the convening of the five sub-committees. We recommend that these reviews be combined with reviews of benchmarks sought by one of the five sub-committees.

We have no significant concerns regarding the MEC's understanding and application of the evaluation plan. The MEC's pattern of application of the evaluation plan to positions (their discipline) differs in some respects from the pattern which the consultants would use. However, given the manner in which the MEC membership was determined, their discipline constitutes a more accurate reflection of the values of positions as commonly understood within the Government of Canada than the consultant could determine from an outside point of view. This kind of adaptation of the plan to the climate and conditions of an organization by an evaluation committee is expected and proper. We would be concerned if there were evidence of inconsistent application of the evaluation factors within or across position families. We did not encounter any evidence of such inconsistency. We believe that the framework of benchmark evaluations and the selection of principal benchmarks by the MEC provides a sound basis for the evaluation of the remaining positions by the five sub-committees. We have found no significant cause for concern and support the progression of the study as scheduled.

(Exhibit R-22, p. 8)

432. Of the 103 MEC questionnaires re-evaluated by Wisner and reviewed by Willis, ten were evaluated differently by the consultants and of these ten, only three were significantly different. Of the three, one was more significant than the others. It was Willis' judgment if the MEC was going to be reconvened, it would only be to review that one questionnaire which was more significant than the others.

433. By September of 1988, the management side of the JUMI Committee were still not satisfied with the manner in which the Willis Plan had been applied by the MEC and continued to express concerns about the MEC's work. At the September 15, 1988, JUMI meeting, the management side indicated a further analysis should be carried out on problematic benchmarks referred to in Willis' report of July, 1988 on MEC evaluations. The management side identified 100 benchmarks with problems and forwarded 46 of these 100 to Willis with a list of questions, observations and anomalies. In response to management's request, Willis & Associates conducted an independent review of these questionnaires and attempted to do a fresh evaluation, without regard to the MEC's prior evaluations but consistent with the general evaluation discipline established by the MEC.

434. A report of this work was submitted to the JUMI Committee in September, 1988 (Exhibit R-28). This analysis was done by the Willis consultant, Wisner. Willis & Associates agreed with management on a number of their challenges but, in the end, did not identify the existence of a gender pattern. As to the discipline adopted by Wisner in his independent evaluation of the 46 questionnaires, Willis said in Volume 56, at p. 6936, lines 3 to 11:

I think he was familiar enough with the Master Evaluation Committee's evaluations at this point. We had had discussions as to where they were conservative, where they were a little bit liberal, so that he was able to track, but fairly independently. I would say, though, that while it is not critical, it would appear that he was just a hair more liberal on the average than the Master Evaluation Committee.

435. In the final analysis, Willis and Wisner identified one evaluation out of the 46 which they considered had been misunderstood by the MEC. Willis' report provided explicit answers to each of the questions raised by the management side. In their judgment, the additional analysis supported their conclusion that the MEC had done a fully satisfactory job in applying the evaluation system to a broad range of positions. The report states:

We believe that a sound basis has been provided for evaluation of the remaining 3900 positions and that, at this stage, there is no logical reason to expect less than a high quality, defensible result from the study. (Exhibit R-28, p. 4 of the addendum)

436. In this report, Willis also provides general observations as to how differences can occur between the MEC and the consultants. The report states they can occur in three different ways and any one of these three ways could be caused by systematic bias on the part of evaluators. He identifies the three ways as follows:

3. Differences in evaluations of the same positions between the MEC and the consultants could occur in three different ways:

Misreading of the questionnaire. This could result if parts of the questionnaire were overlooked or not given appropriate consideration.

Different interpretations of the facts given. The consultants may draw interpretations from a more extensive experience in evaluating other jobs having similar functional responsibilities. On the other hand, evaluation committee members may have a better understanding than the consultants of the culture within the governmental organization resulting in slightly different job perspectives.

Misuse or misunderstanding of the evaluation system. This is expected only during the learning stages of the evaluation effort.

Any one of these three ways could be caused by systematic bias on the part of evaluators. (Exhibit R-28, pp. 1-2 of the addendum)

437. By late November, 1988, the management side of the JUMI Committee were still dissatisfied with the Wisner/Willis analysis of the MEC evaluations. Ouimet forwarded a four-page letter to Willis detailing her concerns (Exhibit HR-19). Willis responded on December 5, 1988, in a six-page letter in which he attempted to deal with those concerns. Part of that letter, in which he attempts to persuade Ouimet that some variance between evaluators will occur and explains the reasons for this variance, is reproduced below. He says:

Evaluation Tolerance

As I indicated in the Addendum to the Responses to the Management Side of the Joint Union/Management Initiative on Equal Pay for Work of Equal Value, it is expected that some variance in interpretation of position information provided to evaluators will occur. A tolerance of plus or minus 10 percent in random evaluation variance is acceptable between two teams evaluating the same positions, given complete and accurate factual information.

As a practical matter, analysis and assessment of evaluation reliability requires making judgments considering a number of variable factors, such as:

  1. Completeness, factual content and definitiveness of the information used. Lower quality of information normally results in wider random bias.
  2. The nature of the job. Is it unusual or complex, or one the evaluators should be reasonably capable of understanding (e.g. research scientist or cleaner)? To evaluate a position properly, the evaluators must be able to understand its content.
  3. How far removed is it in organizational level from the experience or knowledge of the evaluators? This is similar to the previous factor in that evaluators may have trouble conceptualizing a job that is several organizational levels above their own experience.
  4. Do evaluation variances depict a pattern? Does there appear to be a systematic bias, or is it a random variance? Systematic bias is much more significant than variance that is simply difference in interpretation or understanding of the job's requirements.
  5. If the comparison evaluations are by a consultant, could the deviation result from difference in understanding of the culture or value systems within the organization, resulting in a slightly different job perspective?

In essence a value judgment must be made as to the extent of allowable variance in scores and whether or not a problem exists. An assessment of this nature does not lend itself to precise and quantitative terms.

Of the fourteen MEC evaluations assessed as differing by more than 10 percent compared to consultant evaluations it was my considered judgment that one, MEC #428 Head Display Preparation Section, was not properly understood by MEC and should be submitted to that committee for questions to be asked, and re-evaluated. (Exhibit R-35, pp. 3-4)
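As a worked illustration of the plus or minus 10 per cent tolerance Willis describes, the check below flags positions whose two evaluations differ by more than 10 per cent of their average. The point totals are invented; the record does not give the actual scores for MEC #428.

    def beyond_tolerance(team_a, team_b, tolerance=0.10):
        # Flag positions whose two total scores differ by more than the
        # stated tolerance, measured against the average of the two scores.
        flagged = []
        for position in team_a:
            a, b = team_a[position], team_b[position]
            if abs(a - b) / ((a + b) / 2) > tolerance:
                flagged.append((position, a, b))
        return flagged

    mec = {"MEC #428": 520, "MEC #101": 310}         # invented scores
    consultants = {"MEC #428": 455, "MEC #101": 318}
    print(beyond_tolerance(mec, consultants))        # flags only MEC #428, about 13 per cent apart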

438. In the final analysis, the management side of the MEC did not whole-heartedly support the MEC benchmark evaluations. Although they were prepared to continue the study, their intention was to conduct further reviews of the benchmark evaluations; however, this further review was not addressed by the Employer in the presentation of its case.

439. The Tribunal heard limited testimony from evaluators on the subject of MEC challenges. Pauline Latour, one of the evaluators who testified before us on the issue of committee challenges to benchmarks, viewed the MEC challenges as a small issue. Only one particular benchmark caused her committee (Committee #5) difficulty. It was this committee's view that the position was rated higher by the MEC than it ought to have been. (Volume 171, pp. 21641-43).

(ii). IRR Testing in the Multiple Evaluation Committees

440. Shillington also conducted IRR testing on the remaining five and nine evaluation committees using the same methodology he used to identify outliers on the MEC. Willis was provided with two written reports on the IRR testing of the five and nine evaluation committees. The first disclosure made to Willis occurred in May of 1989 and was primarily based on the original five evaluation committees. The second disclosure occurred in July of 1989 and was based on the expanded nine evaluation committees.

441. The IRR Sub-Committee reported to the JUMI Committee, at its meeting of August 25, 1989, that an analysis of individual ratings to the end of July, 1989, revealed 11 outliers: six female evaluators from the staff side, three female evaluators from the management side and two male evaluators from the management side. Seven of these outliers expressed an apparent preference for male positions and four expressed an apparent preference for female positions.

442. The sub-committee further reported seven of the outliers had been previously identified in the earlier disclosure. However, Willis, in his testimony, was able to recall eight outliers who had been previously identified. The previously identified outliers were reported again in the second disclosure by the IRR Sub-Committee in order to confirm Willis' opinion as to the ineffectiveness of his intervention and counselling following the first disclosure.

443. The JUMI Committee decided the names of the outliers would only be revealed to Willis and the Chief of the EPSS. It was Willis' understanding that the JUMI Committee's decision to deal with the question of outliers in this confidential manner was made to protect the individuals concerned. The JUMI Committee had made an earlier decision that they were not going to remove any evaluators from the committees and that it would not be productive to release their names at this point.

444. Shillington prepared exhibits identifying what he referred to as an underlying attitudinal dimension of these outliers. He was unable to explain why these differences occurred or what they were. Exhibits HR-117 and HR-133 indicate the male and female preferences crossed union/management lines and female/male lines. With respect to the crossover of male/female lines, some female evaluators displayed a male preference; however, no male evaluators displayed a female preference.

(iii). Inter-Committee Reliability Testing

445. Willis testified inter-committee reliability (ICR) testing is designed to determine whether evaluations from a series of committees are related. As explained by Willis, this testing looks at consistency between committees and identifies where committees need to be retrained. Willis testified ICR testing is not designed to identify any form of bias. In the JUMI Study, it was intended instead, as a means for assessing whether or not the evaluation committees were adapting successfully to the discipline of the MEC.

(iv). ICR Testing in the Multiple Evaluation Committees

446. The process generally involved taking a series of questionnaires and submitting each questionnaire to every committee. Each of the evaluation committees performed an evaluation on the same questionnaire and the consultant then attempted to identify the extent to which different committees rated the same job similarly or differently. According to Willis, the first ICR testing began early in 1989 and included 26 tests altogether. The ICR testing continued until July of 1989.

447. The JUMI Committee established an ICR testing sub-committee (the ICR Sub-Committee) to establish policy and oversee procedures for the testing. The ICR Sub-Committee consisted of three management representatives, two staff representatives, Willis, one of his consultants and two Commission representatives. The purpose of the ICR Sub-Committee, as stated in its report of March 3, 1989, is listed as follows:

  1. examine the results of the tests administered to the evaluation committees in relation to the baseline provided by the consultants,
  2. examine the baseline score provided by the consultants,
  3. determine the significant differences in the consensus ratings of the committees in relation to the benchmarks and the baseline,
  4. formulate, if needed, recommendations for training, re-training by the consultant and/or other courses of action for JUMI considerations, and
  5. identify procedural/process problems and potential for improvement including the revisions to the formulation of rationales.

448. The ICR Sub-Committee requested that the Commission conduct the actual testing. The Commission determined the timing of the tests, distributed the questionnaires and explained the process to the committees. The JUMI Committee asked Willis to evaluate the test questionnaires and provide a baseline score for each of the test jobs.

449. The baseline score was the independent evaluation of the test questionnaires by the consultants. In each case, Willis had two consultants review the questionnaire and arrive at their own independent evaluation, which was then compared with the test evaluations done by the five or nine committees. The purpose of the comparison between the baseline score and the committee score was to identify any deviation between, first, the individual committees and the consultants' evaluations and, second, the consensus of the multiple evaluation committees and the consultants' evaluations. This comparison identified areas where the multiple evaluation committees needed to be retrained because of difficulty in interpreting the evaluation factors. A minimal sketch of such a comparison follows.
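This sketch rests on assumed data structures: each committee's consensus score on a test questionnaire is measured against the consultants' baseline, and the overall consensus of the committees is measured against the same baseline. The scores shown are invented.

    from statistics import mean

    def icr_deviations(baseline, committee_scores):
        # committee_scores: {committee_number: consensus_score} for one test.
        per_committee = {c: s - baseline for c, s in committee_scores.items()}
        overall = mean(committee_scores.values()) - baseline
        return per_committee, overall

    per_committee, overall = icr_deviations(400, {1: 392, 2: 410, 3: 430, 4: 388, 5: 405})
    print(per_committee, round(overall, 1))
    # committee 3 stands out (+30); deviations of this kind would prompt retraining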

450. Willis used rationales in the ICR testing to analyze differences between the baseline scores and the committee scores. This use of committee rationales served a different purpose from the use of rationales generally in the evaluation committees. Willis explained why rationales were to be used in the ICR exercise, as distinct from his reasons for not wanting them used in the committees' regular evaluations, where he wanted the members to focus on the questionnaire itself.

451. The consultant baseline score was compared with each committee's consensus score and also to the overall consensus of the five, and later nine, evaluation committees.

452. Willis had minor input into the procedure that was adopted by the ICR Sub-Committee and he was opposed to their approach. In other studies, Willis always provided a list of questionnaires to his clients, and then introduced the questionnaires into the committees' portfolios in such a way that the committees did not know which questionnaires were part of the test. In the case of the JUMI Study, time was set aside and the questionnaires were distributed to the evaluators, who then became aware of the testing. The Commission randomly selected the questionnaires and approached the Willis consultants about an hour before the test to give them an option as to which questionnaires should be used for testing. The consultants were not given the opportunity of selecting questionnaires that were more complete. As a result, Willis testified there was frustration on the part of the evaluators, as well as varying levels of conscientiousness in completing the tests.

453. The procedure for ICR testing as conducted by the Commission was very strict at first. It was announced there was going to be a test. The Commission was on hand to oversee the test and had an observer in each room. The questionnaires were distributed and the evaluators were informed they could not leave the room during the actual period of the test.

454. In other studies, if the committees needed more information on the test questionnaires, Willis arranged to have individuals waiting at telephones to answer any questions. No provision was made for this in the actual ICR testing. Consequently, each committee was allowed to make its own assumptions and fill in any gaps in the information. Committees were required to write down their assumptions, but the problem was that each committee made different assumptions. Willis testified that because committees were making different assumptions, variance occurred in the evaluations. For these reasons, Willis was not comfortable with the results of the ICR testing.

455. Willis found the committees did not take the ICR tests as seriously as they did their actual evaluations. He observed a considerable amount of resentment on the part of the evaluators and this increased over time. Moreover, the committees had to stop their regular evaluations to go through the testing exercise. The committees were being pressured by Willis to keep moving but, at times, were subjected to two tests a week. As Willis stated in Volume 58, at p. 7166, lines 13 - 16:

They resented it every step of the way and some of them quite frequently took the testing with somewhat less than a serious approach.

456. On February 6, 1989, Willis produced a report on the first nine tests conducted between November 7, 1988 and January 5, 1989. This report examined the variation among the original five evaluation committees and essentially concluded they had learned the Willis system and were evaluating positions in line with the MEC discipline when they felt comfortable with that discipline.

457. The ICR Sub-Committee report of March 3, 1989 was based on an analysis of the first 11 tests conducted. The report noted:

  1. that the consultants needed to go through a revision of the initial training program with the committees and to address problems that were identified;
  2. there was some concern with respect to evidence of cross-family job comparisons and the job evaluation process ought to be amended to provide for these comparisons; and
  3. the rationales needed more attention and a revised job evaluation process ought to be developed.

458. Willis was asked to describe the amount of variance between the consultant scores and the consensus of the five committees on the first 11 tests. He responded in Volume 58, at p. 7227, lines 1 to 9, as follows:

A. Considering the various handicaps and expressions of frustration and concern that we heard, I think that they did very well. I felt very positive, particularly after discussing with each committee what their differences were, why they had selected the assumptions they had. While I did agree that the additional re-training was desirable, I felt very positive about how well they were doing.

459. The ICR Sub-Committee attached to their report a description of an Improved Evaluation Process which it recommended for adoption by the committees. The revised process provided for comparisons of benchmarks outside of the job family. It asked evaluators to first reference benchmarks in relation to their factor ratings before independent ratings were passed to the committee chair for posting.

460. Willis testified the committees tried the improved job evaluation process and found it was not really practical and actually required more time than the original process. The evaluation committees resisted the change so it was finally dropped.

461. Willis also produced a report on the ICR testing. He described the sub-committee's report as similar to his own report for the most part, but did not completely concur with all of the sub-committee's findings.

462. The essential purpose of the ICR testing was to identify whether or not committees understood and were applying the Willis Plan in accordance with the MEC discipline. This testing also gave the consultants an opportunity to examine whether or not the committee's reasons for their ratings were suggestive of gender bias.

463. Willis did a careful review, factor by factor, of what each committee did, why they did it, and how the consensus was reached for each factor and for the total. When one committee's score differed from the other committees' scores, Willis explored the reasons for the difference. If those reasons suggested, in any way, they were influenced by a particular gender or by a particular kind of job, it was information that would be available to the consultant for follow up action.

464. Within this testing framework, Willis was asked whether he found any evidence of gender preference in the work of the committees during the first series of ICR tests. His response is given in Volume 58, at p. 7227, line 10 to p. 7228, line 19:

Q. At this point -- and it looks like we are in the late winter or the early spring of 1989, when only half of the tests had been performed -- did you have any evidence from these tests, or otherwise, that there might be a problem of gender preference exhibited by the evaluators?

A. Unrelated to the ICR testing, in the early part of the year I began to express some concerns, not related directly to the actual evaluations themselves, but I had some concerns regarding some circumstantial things that had been happening. I became increasingly uncomfortable with how the committees were working with the confrontations between the staff and the management side, and some of the circumstantial things that I had observed happening.

In stressing with my consultants working with the groups, and doing our own analysis of how committees were actually evaluating, and how occupational groups were coming out among and between the committees, I could identify nothing specific that would suggest there was a gender bias that was developing.

Nevertheless, I had strong mixed emotions because I knew some things that were happening, some attitudes that were apparent that were giving me a great deal of concern.

So at this point in the study I had some problems with my own level of comfort. I discussed these problems individually and with the members of the mini JUMI and collectively with them as a group. I felt that I was going to have to take some sort of an analysis, a more in depth analysis of results if I was going to be able to support the outcome of the study.

465. Willis did not think it possible to identify gender bias simply by looking at the results of the ICR testing. If there is gender bias, Willis finds evaluators usually tend to talk about their conclusions or opinions rather than about the facts of the questionnaire. He instructed his consultants to watch very carefully for this sort of behaviour, but he does not think a consultant can decide whether or not there is bias just by looking at a score or on a job by job basis. He testified that on an individual job evaluation, a consultant has to look at the reasons why the committees selected what they did and what was stated in the rationale, and then quiz the evaluators personally as to the reasons for the differences. In Willis' opinion these tests did not provide any conclusive evidence of gender bias, and the information obtained from them should be discounted because the committees did not take the testing seriously.

466. In late May or early June of 1989, Willis recommended to the ICR Sub-Committee that the testing be discontinued because it was becoming very clear to him the evaluation committees were growing more and more frustrated with the procedure. He also concluded the tests were beyond any point of usefulness. Willis understood it was at the insistence of the Treasury Board representative on the sub-committee that the testing continue. The sub-committee did not accept his recommendation and continued with the testing into July of 1989. It was Willis' opinion the reaction he observed by the evaluation committees to these tests might have an effect on the reliability of the results. (Volume 59, p. 7291).


467. Although the testing continued, Willis did not perform any additional formal analysis on the results. He reviewed the remaining tests submitted to him by the sub-committee and continued to meet individually with committees.

468. A draft final report of the 26 ICR tests was prepared by Michel Papineau, a management representative on the ICR Sub-Committee. This report is dated October 26, 1989. Willis had no input into this draft. The conclusion reads as follows:

The ICR test results tend to support the gender preferences found in the IRR report and in the Consultant's study of a sample of 220 questionnaires already evaluated by the committees. The differences are such that there is little doubt as to whether or not these are due to systematic or random biases. The proportion of these discrepancies are significant enough to exceed the degree of tolerance expressed by the consultant. Thus, it is strongly recommended that further investigations be conducted prior to reaching any conclusion based on the evaluation results.

(Exhibit HR-90, p. 4)

469. Papineau concluded there was evidence of gender bias in the evaluations, but in Willis' judgment the analysis of the ICR testing should be discounted for two reasons. First, the 26 evaluations were too small a number from which to draw any firm conclusions; secondly, the committees were not taking these tests as seriously as the actual evaluations and were rushing through them as quickly as they could without much discussion. It is Willis' opinion the tests were not valid for any particular use after about the first 10 or 12 tests. (Volume 59, p. 7297).

470. As to the assertion in the report that further studies should be undertaken, Willis testified he had already decided, on the basis of the Wisner 222, that a further study needed to be undertaken, and this draft ICR report did not add to his conviction.

471. According to Willis, the Wisner 222 was not related to the ICR testing at all. He testified he would have asked for the Wisner 222 whether or not the JUMI Committee had agreed to conduct the ICR testing. Willis saw it as a totally separate issue.

472. Elizabeth Millar, a union member of the ICR Sub-Committee, employed by the Alliance as Head of Classification and Equal Pay Section, testified she was under the impression the ICR testing was being taken very seriously by the committees. She testified one of the problems of the ICR Sub-Committee was in getting timely feedback to the committees. Millar said she did not think the ICR Sub-Committee functioned in an effective manner after May, 1989. She stated the management representatives on the ICR Sub-Committee appeared to adopt a different agenda from the rest of the committee. These representatives wanted an increase rather than a decrease in the schedule of testing to the end of the evaluation process.

473. By memorandum dated November 10, 1989, Millar responded to the draft report prepared by Papineau. Essentially, she found the draft unacceptable to the Alliance as it did not reflect the discussions and deliberations which took place within the ICR Sub-Committee. The analysis contained in the report did not reflect the committee's findings and the conclusion contained in the report had never been discussed by the ICR Sub-Committee. In his testimony, Willis agreed with Millar that the concluding statement contained in the draft report was perhaps overstated. It implied the comparison left little doubt as to the existence of gender bias. (Volume 59, p. 7304).

474. In Papineau's memorandum, which is attached to the minutes, he indicates his intention to table the report at the next JUMI Committee meeting which was held on October 31, 1989 (Exhibit R-44). However, the final report was not tabled at this time, since the report had only been distributed one week prior.

475. Millar testified in the spring of 1989, she observed a change in attitude by the management representative on the ICR Sub-Committee toward the consultants. She said the Employer's attitude before May of 1989 was more accepting of and in tune with the consultant's view so that the sub-committee was able to reach agreement in problem areas. It was agreed the evaluation committees had trouble understanding the Willis Plan and needed further help in training. She testified after May, 1989, the Employer representatives became very critical of the consultants and the ICR Sub-Committee meetings became extremely difficult. She recalled one particular meeting in which Scott Gruber, a Willis consultant, reported on one of the tests that had been done. Gruber had met with all committees to discuss the results and found overall the work was going well. The Treasury Board representatives took issue with Gruber's report. According to Millar, one Employer representative commented to the effect the committee ought to have expected something better from the consultants.

476. Millar referred to another incident in Volume 185, at p. 23775, line 17 to p. 23776, line 11 as follows:

At one meeting in which Mr. Owen was the consultant, two Treasury Board representatives turned up with reports that we hadn't known were in the preparation which had calculated the difference between each committee score and the base line score and had used these calculations to indicate whether or not a problem had existed.

Mr. Owen, who I have described as unfailingly polite and a kind individual, as well as very competent, became extremely agitated. He threw his pencil across the desk and accused both the Treasury Board representatives of neither understanding job evaluation or the Willis Plan. Mr. Willis reported to me later on that he had worked with Fred Owen a long time and he had never seen him so angry. Needless to say, these reports, the uncommissioned reports, were never accepted by the subcommittee and were never tendered further.

Mr. Owen was not questioned about this incident.

477. Willis testified the ICR testing fell short of his expectations. He said for future ICR testing, he would arrange to do it covertly so the evaluators would not know they were being tested. He did comment concerning the ICR testing results as follows in Volume 59 at p. 7352, lines 2 - 7:

But the bottom line is, apparently, in spite of lack of management support, in spite of some variances in the quality of information and in spite of some attitudinal problems, the result was within satisfactory limits.

478. Willis was also asked whether there was any indication in the first 11 ICR tests that the committees evaluated higher along gender lines. He testified it was his assessment there did not appear to be a gender preference. Any differences in interpretation between the consultants and the committees on the problem-solving factor in the Willis Plan were due more to a lack of clear understanding of how to use the evaluation system than anything else.

(v). Wisner 222 Re-Evaluations

479. Willis testified it was clear to him there were agendas on both the staff side and the management side affecting the way evaluators worked together. He observed attitude problems on the part of some of the evaluators. As the study proceeded, Willis became concerned that he could not defend the results without doing further analysis. Willis' discomfort did not result from anything he was able to observe in terms of actual gender bias in the evaluations; it centred primarily on what he viewed as an attitude problem on the part of the evaluators. Because this was a large and important study, he wanted to be sure there was no subtle bias creeping into the process.

480. Willis made a recommendation to the JUMI Committee to conduct a snapshot assessment of the validity of the evaluations, with the intention that if his preliminary analysis revealed the possibility of problems, he would subsequently do a more in-depth analysis. When Willis made his proposal to the JUMI Committee in the spring of 1989, he did not advise that he anticipated adopting a two-tiered approach if a problem was encountered in the first small study undertaken. His recommendation to the JUMI Committee was made about the time the first 11 ICR tests had been completed. At this point, the committees had evaluated approximately 2000 questionnaires, and Willis wanted to examine 10 per cent of these completed evaluations, using one of his consultants to independently evaluate a sample.

481. Willis testified the only way he knows of determining whether gender bias is present in an evaluator's work is to look for patterns of preference for one gender or the other. In his opinion, the only possible way of identifying gender bias would have been to have an impartial third party, such as one of his consultants, re-evaluate selected questionnaires, and then to compare the results between the committees and the consultants. Willis usually solicits the assistance of a statistician to perform this comparison. Willis refers to differences between the committees and the consultants as disparities.

482. During the course of the hearing, there were questions about whether consultants should or should not be considered the baseline for comparison. Willis pointed out the JUMI Committee had agreed to use the consultants as the baseline in the ICR studies, and that agreement was expressed in writing by the JUMI Committee.

483. Willis believes his consultants who were involved in the JUMI Study were unbiased and testified to this in Volume 208, at p. 26934, lines 10 to 16, as follows:

We understand the system. I think it would not be appropriate to say that all consultants are necessarily unbiased. However, our experience, our background, our intent, our own philosophy, has always been not to favour one side or the other, but to walk in the middle road, if you will.

484. Willis testified the disparities form the basis for identifying whether or not there is a gender-based pattern. In this context, he said bias simply means that if there is a pattern of different treatment for male-dominated jobs versus female-dominated jobs, then the different treatment would result in some degree of gender bias.

485. The positions included in the Wisner 222 were selected randomly from a list of all the evaluations provided by the EPSS. The sample taken included at least 10 per cent of the total number of positions evaluated by the nine evaluation committees at the time of the Wisner 222. The sample included the full range of evaluation levels and the variety of types of work seen by the nine committees.

486. Wisner did not testify at this hearing. His study was explained by Willis who described the method used by Wisner in his analysis. First, Wisner read the position questionnaire and any reviewer notes. He then determined whether a similar position was included among benchmark positions evaluated by the MEC. When there was a similar position, Wisner reviewed the benchmark questionnaire to confirm his impression and adopted the MEC benchmark evaluation as the consultant evaluation. When there was no similar set of duties among the MEC benchmarks, Wisner proceeded to do an independent evaluation of the position, supported by reference to appropriate benchmarks. Many of the positions included in the sample were found to require this step. After determining an evaluation, Wisner reviewed the committee evaluation for that position. He paid particular attention to the committee's use of benchmarks and the facts they used to support their evaluation. Wisner then adjusted his evaluation as appropriate in view of the committee's rationale and benchmark references.

487. When Wisner found differences between his final evaluation and the committee's evaluation, he wrote a brief rationale in support of his position.

488. Wisner then proceeded to do a special analysis on the results. This analysis was initiated in order to assess the quality of the position evaluations by the nine evaluation committees. As stated in his report of July, 1989, the considerations he included in determining the quality of the evaluations were:

  1. Proper use of the Willis evaluation system in accordance with the Guide to Position Measurement and the training and technical advisories issued by the consultant.
  2. Consistency of the evaluations by the nine committees with the benchmark evaluations and evaluation discipline established by the Master Evaluation Committee.
  3. Absence of any systematic bias in the evaluations by the nine committees.

(Exhibit PSAC-4, p. 1)

489. Wisner's analysis also included statistical testing. According to Willis, Wisner is a statistician. His finding on the first consideration, regarding the proper use of the Willis evaluation system, was that there was no evidence, with two possible exceptions, of any consistent misinterpretation or misapplication of the evaluation factors and dimensions. As to the two exceptions, he noted that the number of positions sampled was so small that it was impossible to draw any firm conclusion about them.

490. Regarding the overall consistency of evaluations by the nine committees with the MEC evaluation discipline, he found that the committee and the consultant had an exact match in 70 of the 222 positions, and that an additional 34 positions showed differences of +/- 2.5 per cent, so that almost 47 per cent of the positions in the sample had approximately the same overall evaluation. He concluded these differences indicated fair consistency of evaluation between the nine committees and the MEC benchmarks. Since Wisner found more than half of the positions differed by more than 2.5 per cent, he recommended that further analysis of the differences was warranted.
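Restating the arithmetic underlying the almost 47 per cent figure from the counts in the preceding paragraph:

\[
\frac{70 + 34}{222} = \frac{104}{222} \approx 46.8\%
\]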

491. As to the third consideration, in analyzing the differences between the consultant evaluations and the committee evaluations, Wisner found for the female-dominated positions, 35 were under-evaluated compared to the consultant, 40 were over-evaluated and 43 had no difference; and for the male-dominated positions, there were 55 under-evaluated, 22 over-evaluated and 27 with no difference. His report states at p. 5:

This indicates that female dominated positions were over evaluated somewhat more often than the total sample, and male dominated positions were under evaluated somewhat more often than the total sample.

492. And his conclusions at p. 8 read:

The findings of the analysis described above suggest that the consistency of the evaluations by the nine committees with the MEC benchmarks is less than would be desirable, and that there may be some gender-related bias in the evaluation results. It is the consultant's opinion that these findings indicate that a wider review of the evaluations by the nine committees would be proper. Such a review would serve to confirm or refute the apparent problems found in the sample of positions examined in this study. [emphasis added]

493. Wisner, however, advises caution in dealing with his report. The statistical analysis of the relationship between gender dominance and the evaluation differences between the committees and himself is based on a comparatively small number of positions, and his finding of a relationship between the two variables does not mean that there has been deliberate or unconscious sex bias in the evaluations. He goes on to say there are a number of other possible explanations for the differences. He refers, for example, to the tendency of positions in the male-dominated classifications to have more complex duties and responsibilities than the majority of positions in the female-dominated classifications. He suggests the observed pattern of evaluation differences could occur if the committees tended to under-evaluate more complex positions in relation to the MEC discipline as viewed by the consultant.

494. Willis' covering letter of July 17, 1989, addressed to the co-chairs of the JUMI Committee, which accompanied Wisner's report, states in the third paragraph:

Our findings indicate the existence of some systematic divergence from MEC evaluations. Statistically, however, the size of the sample reviewed, 222 evaluations, was insufficient to permit specific conclusions as to the degree of the problem.

(Exhibit PSAC-4)

495. Willis was asked in Volume 58 to clarify exactly what it was he was trying to state in this letter. He responded in Volume 58, at p. 7249, lines 1 - 5, as follows:

A. The results of our analysis appeared to suggest that there is some pattern of deviation from the Master Evaluation Committee's evaluations. It could be interpreted as a gender bias.

496. At the completion of the Wisner 222 there were about 1,000 evaluations remaining. Since the nine evaluation committees had just started their work, Willis felt it was critical that a more extensive analysis be done as soon as possible to correct a potential problem. He recommended to the JUMI Committee an additional analysis be undertaken without delay.

497. In his testimony, Willis referred to the following table contained on p. 4 of Wisner's Report to explain why he wanted a further study and his concern about possible gender bias:

Table 1 Per Cent Differences

Group    <-15%   -14.99   -9.99   -4.99   -2.49    0    0.01   2.50   5.00   10.00   >15
                 to       to      to      to            to     to     to     to
                 -10.0    -5.00   -2.50   -0.01         2.49   4.99   9.99   14.99

Female     6       8        7       5       9     43      9     10      4      9      8

Male       8      15       13       9      10     27      6      4      4      4      4

Total     14      23       20      14      19     70     15     14      8     13     12

498. Willis testified the above table breaks down the total group of the 222 evaluations. In the Female line, the 43 under the 0 column indicates Wisner and the committees agreed on 43 evaluations. The columns to the right of the 43 contain the committee evaluations above Wisner's, totalling 40, and the columns to the left of the 43 contain the committee evaluations below Wisner's, totalling 35. In the Male line, the 27 under the 0 column indicates Wisner and the committees agreed on 27 evaluations. The right-hand columns indicate the committees rated 22 evaluations higher than Wisner, and the left-hand columns indicate the committees rated 55 evaluations lower than Wisner. This suggested to Willis the beginning of a pattern: the 55 male-dominated evaluations rated lower than the consultant's were approximately twice as many as the 27 which agreed with the consultant, and more than twice as many as the 22 which were rated higher.
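The following is a minimal sketch, in Python and purely for illustration (no such program was used by the parties), of how the under-evaluated, agreed and over-evaluated counts cited in paragraphs 491 and 498 follow from the Table 1 bins:

    # Per cent difference bins from Table 1, ordered left to right.
    # Bins to the left of 0: committee rated below the consultant
    # (under-evaluated); the 0 bin: exact agreement; bins to the
    # right of 0: committee rated above the consultant (over-evaluated).
    table = {
        "Female": [6, 8, 7, 5, 9, 43, 9, 10, 4, 9, 8],
        "Male":   [8, 15, 13, 9, 10, 27, 6, 4, 4, 4, 4],
    }

    for group, counts in table.items():
        under = sum(counts[:5])   # the five negative bins
        agree = counts[5]         # the 0 bin
        over = sum(counts[6:])    # the five positive bins
        print(group, "under:", under, "agree:", agree, "over:", over)

    # Prints: Female under: 35 agree: 43 over: 40
    #         Male under: 55 agree: 27 over: 22
    # matching the figures in paragraphs 491 and 498.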

499. This aspect of the Wisner 222 concerned Willis. Another concern with the report was that it showed one female-dominated occupational group (ST) in which the numbers indicated a comparatively large degree of over-evaluation. This to Willis was some evidence, however slight, of gender bias.

500. Willis stated the Wisner 222 was very limited. It was not intended as a basis on which to make a determinative judgment as to whether or not true gender bias existed and to what extent. He testified it contained enough evidence to justify a further look before he could feel comfortable in defending the results.

501. Following the release of the Wisner 222, the unions sent a letter to Durber expressing their concerns. This letter is dated September 27, 1989. The letter, written by Christine Manseau, the union co-chair, indicates the unions did not agree the Wisner 222 Report supported the contention there was gender bias in the evaluations. Paragraph 2 of the letter reads as follows:

Our analysis shows that, on average, there is remarkably little difference between the evaluation scores of the consultant and the committees. Of the 118 female positions in the sample, the average consultant score is 182 and the average committee score is 181. Of the 104 male positions, the average consultant score is 273 and the average committee score is 263, a difference of 3.7%. We do not believe these differences are significant and we note that they are well within the + or - 5% accuracy level for average scores that the parties agreed to in dealing with the issue of sample reduction and overall sample size for the JUMI study.

(Exhibit PSAC-5)
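The 3.7% difference cited in the letter for the male positions can be verified directly from the averages quoted:

\[
\frac{273 - 263}{273} \approx 3.7\%
\]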

502. According to Kathryn Brookfield of the Institute, after having received the Wisner 222, the unions expressed concerns as to how the data in the report matched with the conclusions. Brookfield testified the union looked at the distribution of the evaluations from female-dominated occupations and did not see evidence of an imbalance in the evaluations and yet the report came to that conclusion. Brookfield testified the unions wanted to sort out in the Wisner 222 why the data and the conclusions did not agree. Until that question was resolved, the unions did not have sufficient confidence to ask Willis to go ahead and repeat the exercise. Brookfield further stated the unions wanted to meet with the Treasury Board representatives, go through the report, discuss the differences and see if they could come to some understanding about them.

503. There was considerable debate within the JUMI Committee as to whether Willis should undertake further re-evaluations. Willis met privately with members of the Mini-JUMI as well as with the full JUMI Committee to request a more in-depth analysis. He never wavered from his position that a further analysis was needed, although the extreme positions taken by some evaluators seemed to settle down during the course of the summer of 1989 as work began with new, fresh and, in some cases, reorganized committees. He said in Volume 58, at p. 7285, lines 4 to 8:

A. Call it a gut feel, I just felt that the importance, the size of the study was such that I wanted a better feeling of confidence that I could, in fact, defend the results.

504. At that time, Walt Saveland, an employee of the Commission, did a technical examination of the Wisner 222 analysis. Saveland was a staff person with the Commission in Policy and Research. Durber had asked him to assist in interpreting the Wisner 222 and to pinpoint the problem of bias. The Saveland Report, Exhibit PSAC-6, entitled Technical Observations and Suggestions on Willis & Associates Special Analysis of Working committee Results, provided a list of male jobs to which the Commission ought to give priority attention because the committees differed from the consultant by 10 per cent or more. In the end, the list contained 25 jobs, notwithstanding 27 had been identified; Wisner and the committees were found to agree on an additional 2 jobs, which had somehow been missed.

505. The Saveland Report appears to pinpoint the source of apparent gender bias to the male-dominated questionnaires. The balance of the Saveland report, from page 6 onward, uses a number of statistical measurements which, according to the statistical expert, Sunter, are absolute nonsense. (Volume 105, pp. 12696-97).

506. Paragraph 2 of this report states as follows:

The most important evidence of apparent gender bias is found among male-dominated jobs. A pivotal role seems to be played by 27 jobs in which Committee evaluations where [sic] between 5 and 15% lower than Consultant evaluations. (Evidence of apparent gender bias was also found among the clerical portion of female-dominated jobs.) (Exhibit PSAC-6)

507. Saveland's report states it is this kind of asymmetry in the male-dominated line which indicates apparent gender bias. Saveland explored the effects of asymmetry by expanding the standard for relative agreement from +/-2.5 per cent to a standard of +/-5 per cent. If the expanded standard is imposed for the category of relative agreement with respect to the female-dominated line, it results in a perfectly symmetrical distribution with a sizable majority of jobs, showing 76 relative agreements. For the male-dominated jobs, 56 are now counted in relative agreement, but apparent under-evaluations outnumber over-evaluations by exactly 3:1, or 36 to 12.
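Again purely as an illustrative sketch in Python, assuming the Table 1 counts reproduced earlier, Saveland's expanded +/-5 per cent agreement band works out as follows:

    # Table 1 counts again; indices 3 through 7 are the bins falling
    # within +/- 5 per cent (-4.99 to -2.50, -2.49 to -0.01, 0,
    # 0.01 to 2.49 and 2.50 to 4.99).
    table = {
        "Female": [6, 8, 7, 5, 9, 43, 9, 10, 4, 9, 8],
        "Male":   [8, 15, 13, 9, 10, 27, 6, 4, 4, 4, 4],
    }

    for group, counts in table.items():
        agree = sum(counts[3:8])  # relative agreement at +/- 5 per cent
        under = sum(counts[:3])   # under-evaluations beyond -5 per cent
        over = sum(counts[8:])    # over-evaluations beyond +5 per cent
        print(group, "agreement:", agree, "under:", under, "over:", over)

    # Prints: Female agreement: 76 under: 21 over: 21  (symmetrical)
    #         Male agreement: 56 under: 36 over: 12    (3:1 asymmetry)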

508. The report concludes with technical suggestions. One suggestion was a re-examination of the specific jobs in dispute, to be done by some existing or newly formed review committee whose members are experienced in job evaluation. The report states this review committee should consider all jobs in a suspect category, meaning all existing and additional male-dominated jobs (and possibly all clerical jobs). The report notes an examination of only selected jobs playing a pivotal role in gender bias runs the risk of losing objectivity. The report makes suggestions about what approach ought to be used when a review committee accepts or rejects a specific committee evaluation. The report also suggests, while the review committee is doing its work, the consultant could be re-evaluating the same jobs. Wisner would be the preferred consultant for job re-evaluation because, according to the report, he offers the best assurance of continuity. The report states at p. 24:

If others do the work for Willis and Associates, then quality-control procedures should be put in place to make sure that new Consultants would have done the previous work in exactly the same way. (Exhibit PSAC-6, p. 24)

509. Saveland was in attendance at the October 31, 1989 meeting of the JUMI Committee, where he presented his analysis of the Wisner 222. (His report was released subsequent to this meeting and bears the date November 10, 1989.) Brookfield testified that Saveland, in his presentation, had concurred with the unions' position, which was that there was no evidence in the report of systematic over-evaluation of female positions. Saveland also told the committee most of the differences between Wisner and the evaluation committees were found with 27 male positions.

510. Durber also attended the October 31, 1989 JUMI Committee meeting. The minutes (Exhibit R-44) state at p. 9 that Durber requested the JUMI Committee to indicate how it would deal with the apparent gender bias referred to in the Wisner 222. Durber offered the Commission's assistance to the JUMI Committee. At that time, the management side of the JUMI Committee was willing to do further reviews of the Willis results. The staff side position, communicated by Manseau, the union co-chair, was that prior to this meeting, the staff side were not in a position to proceed further with the Willis study. Manseau promised to reply to the management side by November 10, 1989, about whether the staff side would proceed and who would represent the staff side in the joint process.

511. Following the JUMI Committee meeting of October 31, 1989, Durber sent a letter dated November 10, 1989, to Manseau. In his letter, Durber notes the Commission's concern is with apparent gender bias and that the Commission had drawn no further conclusion at that time, but expected the parties to resolve the question of bias in a way that would satisfy the requirements of the Act. He referred to the fact that Saveland, in his written report, makes reference to reviewing the 27 male jobs, and he offered a caution that the separate exercise should be done with care to ensure objectivity.

512. In an attempt to understand the Wisner 222 Report, the unions approached their members who had been on the MEC to obtain information which might assist in explaining the differences between evaluations done by Wisner and those done by the committees. Brookfield testified she received information from the CATCA union. A member of CATCA, Rick Smith, was provided with the 27 male questionnaires and assigned by the union side to analyze them. The information he provided was reported and filed as Exhibit PIPSC-129. The author of the report did not testify at this hearing. His conclusions are contained on pages 2 and 3 of the report, which read as follows:

In summary, after careful review of the committee results and consultant results I find that the consultant has been consistently higher in ratings for several reasons. Some are outlined above and others are individually pointed out in his rationales. The % differences which I have indicated between Committee and Consultant range from insignificant (in my opinion), 5.4%, to 17.4% which is just at the edge of an acceptable error tolerance. I can find no evidence of bias nor can I say that I could discount the possibility. The committees and the consultant have provided complete, sound ratings with logical rationales to support them. They are slightly different in all cases but this is to be expected. My own analysis of the positions was often slightly different than both or leaning toward the committee or the consultant rating.

The process is not an exacting science and the Willis plan does not provide for a wrong or right evaluation of a job. A consensus is the best one can expect and I have no reason not to accept the ratings of the committees as they stand.

513. According to Brookfield, the unions were anxious to meet with the Treasury Board representatives with all the information the unions had gathered, including Smith's report, supra, to determine if the differences between the consultants and the committees could be explained.

514. Ouimet wrote to Manseau, by letter dated November 27, 1989, indicating the management side required a response to its request that Willis & Associates be instructed to do further work. The letter stated management required a response by December 1, 1989 or they would proceed unilaterally (Exhibit HR-17, Document 22).

515. The next meeting of the JUMI Committee was scheduled for December 13, 1989. Brookfield testified there was no opportunity for the union side to discuss with the management side the report received from Rick Smith of CATCA. It appears from the letter of November 27, 1989, from Ouimet that the management side had embarked upon a review of the 27 questionnaires identified by Saveland of the Commission. The second paragraph of the letter reads:

As requested, we are prepared to exchange comments on the 27 questionnaires identified in Mr. Willis' analysis on December 8; the modalities of a sub-committee will be discussed at the December 13 meeting. Its work however, is independent of the research required by Willis and Associates; this work must proceed immediately and would be concurrent with that of the committee if it is established. Even if the committee finds an explanation for the 27 questionnaires in question, we still require more evaluations to make bias estimates for the various employment groups in the study. At this late date, delays are a luxury we can ill afford. We require a response from you concerning Willis and Associates further work by December 1, or we will proceed unilaterally. (Exhibit HR-17, #22)

516. The union side concluded from its reading of this letter that, even if a joint process to find explanations for the 27 male questionnaires was undertaken, Treasury Board intended to proceed with Willis' recommendation for a further study with or without the consent of the unions. This became a reality when the union co-chair received Ouimet's letter of December 11, 1989, which reads in part:

We remain firm in the belief that the uncertainty surrounding these evaluations mandates further study. We accept the recommendation by Willis and Associates to undertake further analysis (supported, it would seem, by the CHRC). We have agreed with your proposal to examine the 27 evaluations cited by the CHRC as relevant to `apparent bias', but you have not responded to our proposal to proceed with further evaluations at the same time. To quote Mr. Durber `...we are anxious that the matter of gender bias be dealt with quickly'. Your responses to our letters leave us no choice but to conclude that you do not want to resolve this issue in the near future. We have decided therefore, to comply with the recommendations expressed by both the Consultant and the CHRC and to proceed as of December 11 at which time the process by which Willis and Associates may undertake further analysis will commence. We will keep you informed of the progress of the study. You may have our assurance as well, that the same methodology unanimously agreed to by JUMI in the first phase will be carefully followed. [emphasis added] (Exhibit HR-17, #7)

517. Willis testified the decision of the Employer to proceed unilaterally and authorize him to do additional re-evaluations was announced to the staff side without consulting him in advance. When the December 13, 1989 JUMI Committee meeting convened, a statement was read by Manseau. At the request of Manseau, the statement was appended to the minutes after which the unions withdrew and no further business was conducted. The statement made by Manseau is reproduced in full.

STATEMENT BY CHRISTINE MANSEAU CO-CHAIR OF JUMI ON BEHALF OF THE PUBLIC SERVICE UNIONS

For some time the unions represented at JUMI have not felt equal partners in this joint undertaking. We had wanted to discuss jointly the conclusions of the CHRC on the findings of Willis and Associates in an informal setting so that perhaps JUMI could arrive at a joint agreement on how to deal with their recommendations. We had suggested the establishment of a sub-committee to review jointly our conclusions on the consultant's evaluations reported in the Willis Special Analysis prior to proceeding with further analysis - we were denied that. We had asked further analysis not proceed unilaterally for we felt it would endanger the joint character of the Study and undermine JUMI's credibility - we were denied that.

In view of Ms. Ouimet's letter of December 11 announcing that Treasury Board has decided to proceed unilaterally with further analysis by Willis & Associates, we feel this Study is no longer joint. We therefore are not willing to participate in any discussions on any outstanding issue at this time.

We request this statement be recorded verbatim in the minutes and that the correspondence exchanged since the last meeting of JUMI be attached to the minutes.

(Exhibit HR-11B, Tab 34)

518. From the August 25, 1989 JUMI Committee meeting, when Willis first recommended a further study, to the December 13, 1989 JUMI Committee meeting, when the union side temporarily withdrew from the study, there was considerable tension between the parties. This tension manifested itself even earlier, during the work of the IRR and ICR Sub-Committees, but it was after the release of the Wisner 222 that the relationship between the management and union sides began to deteriorate rapidly.

519. From August 25, 1989 onward, the union side wanted to move forward with the JUMI Study: to conclude the evaluation phase, to determine the methodology for compensation and wage comparisons and, if a wage disparity was identified, to continue with bilateral and multilateral meetings as required. On the other hand, from the August meeting, the management side felt strongly that an additional study was required and that the matter of apparent gender bias could not be dismissed without this study.

520. As the parties became more entrenched in their positions throughout the fall of 1989, the tension escalated. Between November 7, 1989 and December 11, 1989, no fewer than 21 letters written between the JUMI co-chairs were introduced into evidence, with as many as three letters written by one side on the same day. As Brookfield said in Volume 169, at p. 21296, line 24 to p. 21297, line 9:

Q. Had you ever had that kind of flurry of paper before in the years that you had been involved in dealing with each other?

A. No. I think HR-17, over, I think we are talking, a six-week period, every issue imaginable about several -- four or five, issues are going on with correspondence, some it [sic] simultaneous, and I think that speaks rather directly to the fact that people were having a lot of difficulty communicating with each other, that there was this flurry of correspondence.

521. Since the unions refused to go along with a further analysis, Ouimet advised Willis the Employer intended to commission Willis & Associates to do the work on behalf of the Treasury Board. On December 19, 1989, Willis wrote to Ouimet declining to conduct a further analysis unilaterally on behalf of the Treasury Board. Willis testified he understood from the very beginning he was answerable only to the JUMI Committee, and he felt a unilateral engagement was inappropriate. Willis had hoped the JUMI Committee would reconvene. He was asked by a Treasury Board representative, Gaston Poiré, under what circumstances he would conduct the analysis. Willis responded that he would conduct a study of a larger sample if the Commission requested it, since the Human Rights Commission was an objective third party and it was their bill. (Volume 59, p. 7311).

522. In Willis' letter to Ouimet of December 18, 1989, he mentions for the first time what the information from a second study should provide. The relevant portion of the letter reads:

It is my belief that an expansion of this analysis is necessary to determine the extent of any actual bias that may exist in the evaluations. This information should afford a basis for any adjustment in evaluation results that may be required to assure a fair and objective study. [emphasis added] (Exhibit HR-92)

523. On January 23, 1990, the Alliance announced its permanent withdrawal from the initiative and three days later, on January 26, 1990, the President of the Treasury Board announced the implementation of equal pay for work of equal value adjustments, with the assurance the government's action did not prejudice any conclusions and findings of the Commission relating to the resolution of the issues still to be investigated by the Commission.

524. Brookfield testified she noticed a change in the attitude of the Employer toward the end of the study. She made reference to the fact the discussions between the unions and management were initially about apparent gender bias. Following the Wisner 222 report, however, the Treasury Board no longer discussed apparent gender bias and had changed its approach by suggesting it would adjust for actual gender bias.

525. The unions were very concerned about this change in the Treasury Board's approach after the Wisner 222. Brookfield testified there was correspondence about adjusting scores and referred to a letter written January 26, 1990, after the breakdown of the study (Exhibit HR-41), from the President of Treasury Board to Max Yalden, Chief Commissioner of the Commission, explaining the equalization payments were calculated on the basis of adjustments for gender bias made by Treasury Board.

526. In the letter, the President of the Treasury Board, Robert de Cotret, wrote to Yalden with details of the government's decision to implement service wide measures based on the evaluation results of the Joint Initiative. The letter does not refer to the extent of apparent gender bias identified in the Wisner 222, but instead alludes to the extent of gender bias. An excerpt from de Cotret's letter reads as follows:

It is my strong belief that an unprecedented study of this magnitude must be fair, statistically sound, and credible, given its significant ramifications. This further analysis was needed to determine the extent of gender bias and adjust the Initiative's evaluation results accordingly. I appreciate, therefore, the Commission's agreement to conduct this analysis to determine the extent of gender bias. [emphasis added]

(Exhibit HR-41)

527. The above excerpt seems to confirm the unions' belief that the management side's emphasis had shifted from a concern for the apparent gender bias raised in the Wisner 222 to an issue of adjusting results to account for actual gender bias. Brookfield testified it appeared to her the Treasury Board had made a decision there was definitive evidence of gender bias in the Wisner 222 and all that needed to be done was to adjust the scores for the bias.

528. In early 1990, Willis was contacted by the Commission. This contact was made after the Alliance had announced their withdrawal from the JUMI Study. Willis was informed by Durber that the Commission had determined an additional analysis was necessary based on re-evaluations to be undertaken by Willis & Associates. The Commission itself would, however, analyze the results of the Willis re-evaluations.

529. In Willis' opinion, the only alternative to a further study, would be to use some other evaluation system which would have, in effect, reconstructed much of the study. This exercise would have been extremely costly. Willis also expressed his opinion as to what ought to be done with the study results. He suggested the Tribunal has three alternatives:

  1. to implement the study as it is;
  2. to adjust the results; or
  3. to trash the study.

Willis maintained he would rule out trashing the study, and would adjust the results for any possible gender bias.

E. THE COMMISSION

530. When the Commission responded, in April of 1985, to the invitation of the President of the Treasury Board to support the JUMI, the Commission agreed to put on hold the investigation of s. 11 complaints filed prior to the announcement of the JUMI, as well as complaints filed subsequent to the announcement. The Commission indicated it would await the results of the study before taking action. This also depended upon the circumstances at the time of the filing of the complaints.

531. The Commission's response to the invitation was contained in a letter dated April 17, 1985 (Exhibit HR-18, Tab 18), from Gordon Fairweather to the Honourable Mr. de Cotret. That letter indicates that if the Commission satisfied itself the methodology employed in carrying out the study was consistent with s. 11 of the Act, then it would issue a special guideline advising that the study was consistent with the Act. It would also issue guidelines for the implementation of corrective action in accord with s. 11.

532. The Commission participated in the JUMI Process only as an observer. Representatives of the Commission attended the JUMI Committee meetings and when asked by members of the JUMI Committee provided clarification and advice relative to the JUMI Study. Participation by the Commission was mainly of a technical nature, and involved such tasks as selecting samples in the ICR testing and dealing with problems of interpretation relevant to the Act and Guidelines. Commission employees also attended as observers during the operation of the five and nine evaluation committees.

533. The Commission did not intend to be a party to settlements reached by the parties to the JUMI. It did, however, intend to examine any agreement reached to determine whether it met the requirements of s. 11 of the Act.

534. In early May, 1989, Durber joined the Commission as Chief of Equal Pay. This title was later changed to Director of Pay Equity. On June 12, 1989, Durber met the JUMI Committee co-chairs and expressed his concern that if the parties were unable to determine what should be done with the Wisner 222, the initiative could easily founder. Durber testified the co-chairs agreed at this meeting that all the parties, including the Commission, ought to have free access to the job evaluation results from the JUMI Study.

535. Durber advised the co-chairs at that time the question for the Commission was how to interpret the job evaluations that had been done. He emphasized if there was gender bias the Commission would have to be involved because it needed to know whether the evaluations were acceptable as evidence, should the Commission pursue the complaints filed by the Alliance.

536. No formal investigation of the complaints was done by the Commission until March 6, 1990. On that date, at the request of the Commission, the JUMI participants met with the Commission to review outstanding issues. By that time the JUMI had permanently broken down.

537. At the March 6, 1990 meeting, the Commission wanted to reduce the number of issues arising from the JUMI should the complaints be referred to a Tribunal. The Commission's press release, following the meeting, specified the Commission must be satisfied that all the requirements of the Act had been met. It also specified Treasury Board had given the Commission the calculations used to predict their adjustments, which the Commission would examine in its investigation.

(i) Commission Investigation

538. When the JUMI Study ended in the beginning of 1990, it became evident to the Commission its role as observer in the JUMI Study was also at an end and it was time to begin pursuing the normal complaint process. The question of apparent gender bias raised by the Wisner 222 was a part of the investigation into the complaints. The approach by the Commission was to treat the question of apparent gender bias as the first focus of its investigation into whether wage discrimination persisted in the Federal Public Service. The government had made equalization payments in January, 1990, and the Alliance maintained those payments had not closed the wage gap, leaving wage discrimination still in place.

539. Gender bias was a consideration when the President of the Treasury Board announced the wage equalization payments in January of 1990. The Treasury Board President had not indicated the extent to which the equalization payments accounted for the bias, but did state in his announcement the Commission would be examining the matter.

540. The Commission's approach to the investigation, as described in Exhibit HR-55, Notes for Presentation on Alleged Gender Bias in Job Evaluation of the Joint Initiative, was conservative in terms of the amount of evidence it sought in addressing the question of apparent gender bias.

541. Durber testified the Commission investigated all five complaints from both the Alliance and the Institute simultaneously. It was probably the speediest Commission investigation performed prior to that time because the Commission had before it all the job evaluation data gathered from the JUMI Study. The Commission had no need, therefore, to conduct its own job evaluations.

542. There were four areas for investigation by the Commission. The first involved the investigation of gender bias. The Commission had to decide whether they could rely on the job assessment information from the JUMI Study. The second involved looking at any wage gaps that might appear. The Commission had to develop a methodology to calculate wage gaps. The third area for investigation involved considering and valuing benefits. Finally, the fourth area, (not yet complete), involved parts of two complaints which bore on limitations on employment opportunities as a result of compensation practices.

543. An overview of the chronology begins with the Commission's investigation starting in March, 1990, and arriving at tentative conclusions on gender bias in July of that year. In the same month, the Commission briefed the parties on its findings regarding apparent gender bias in the committee evaluations. In August, 1990, the Commission produced a draft report on the wage gap and the parties were briefed on the Commission's interim findings regarding its conclusions.

544. There was also a meeting in August with the parties on the status of the Commission's investigation pertaining to the valuation of benefits. In September, 1990, the Treasury Board submitted a written response to the Commission's August draft report. The final investigation report went to the Commissioners in late September, 1990. The following October, the Commission made its decision with respect to the wage gap on the five complaints and requested the President of the Human Rights Tribunal to appoint a tribunal.

545. The Commission's investigation into the s. 11 complaints is contained in Exhibit HR-250, entitled, Investigator's Report: Wage Adjustment in the Federal Public Service - Possible Gender Bias in Job Evaluation Data. Durber released the Investigator's Report on this subject to the parties in September, 1990. The report contains the Commission's findings and conclusions relating to the question of apparent gender bias in the committee evaluations. The Commission's conclusions are found in para. 51 of that report which states as follows:

51. Conclusions

Commission staff have found that the Willis checks reveal some differences between consultants' evaluations and those performed by the Joint Initiative. Investigators do not find that these differences reveal patterns that can be correlated consistently with gender or occupation in the Joint Initiative evaluations. The extent of possible undervaluation of male jobs is less than 3%, but can likely be accounted for by differing understandings of work described, as well as the meaning of bench marks and the application of the Willis plan. It is not apparently the result of bias linked to sex. Moreover, the 3% is not evenly distributed across occupations. Certainly, the fact that two sets of independent groups (Willis consultants and the Quality Analysis Committee) could produce results varying by a margin of 2% to 3% indicates that such differences may be expected and be due to reasons other than bias.

(Exhibit HR-250, Part I)

546. Durber's oral evidence corroborates and confirms the contents of this report and focuses on the steps pursued by the Commission in investigating the possibility of gender bias in the committee evaluations. It is noted from Durber's evidence that further testing procedures were undertaken by the Commission subsequent to the commencement of this hearing. Both the investigative procedures conducted as part of the Commission's initial investigation and the subsequent testing conducted at the request of the Commission will be reviewed by the Tribunal.

547. The Investigator's Report indicates there was no clear evidence of gender bias in the evaluation results. The report contains a recommendation of formulae for equalizing pay between males and females, and recommends that pay ought not to be adjusted for possible gender bias. It proposed that the Commission accept its findings vis-a-vis the related complaints under s. 11.

548. A draft of the Investigator's Report (Exhibit HR-250) was provided to the parties for comment in the summer of 1990. The Treasury Board responded by letter and written report dated August 17, 1990, from Ouimet, in her capacity as Assistant Secretary, Classification, Human Resources Information and Pay Division, addressed to Durber. Ouimet testified during the voir dire hearing of the Tribunal but was not called when the hearing reconvened. The last paragraph of her letter concludes that the Commission's investigation was deficient and did not demonstrate a clear case that there was no gender bias. On the other hand, she expresses the view that it is unlikely any party would be able to demonstrate the existence of gender bias in the results. The paragraph is reproduced as follows:

On the other hand, it is unlikely that anyone could demonstrate gender bias does exist given that the Willis firm has not provided a baseline by which evaluation results may be compared from study to study. It is not possible to measure adequately the application of the plan so as to conclude definitively that bias does or does not exist. Do not conclude however, that we should not examine very closely all the rating inconsistencies raised by the various committees of the Joint Initiative, your own research, and ours. It is now vital that we leave aside the `why' behind rating anomalies and focus instead on how they may be corrected. We would be prepared to contribute to the design of an appropriate study to resolve rating inconsistencies. [emphasis added]

(Exhibit HR-46, p. 2)

549. In the detailed comments attached to her letter, Ouimet asks the rhetorical question, Is it possible to distinguish between evaluation biases along sex lines and the overall application of the Willis Plan in a manner that would assign an appropriate weight to each? The report states the answer must indicate the degree to which the question of gender bias is purely a statistical or substantive question. In the latter case, according to Ouimet, statistics may contribute little.

550. In the Treasury Board's written response to the Commission's final report on Possible Gender Bias in the Evaluation Data, which is contained in a letter from Ouimet to Durber dated September 7, 1990, the Treasury Board is clearly of the opinion a statistical study is not the best approach when determining possible gender bias. The following excerpts from her comments at p. 1 of the report are helpful in understanding the Employer's response:

In essence our disagreement can be summarized as follows: the Investigator embarked on a highly restricted look at gender bias through statistical research that was inappropriately conducted. Even if it were appropriate, the restricted nature of the overall study is such that nothing can be said about the issue of gender bias since the important issues implied by it were never examined.

The Commission quotes at length the position of the Public Service Alliance of Canada (PSAC) that many of the issues are non- statistical. We are in agreement with this position and have argued that statistical analysis in this area is useful only insofar as it may raise the possibility of a problem that would require a non-statistical approach to answer. Notwithstanding this objection, we are of the opinion that any statistical study, no matter how adequate, is not the best approach in this matter. There is so much judgement involved in the scoring of any job questionnaire that to determine gender bias statistically is difficult at best because it requires that a weight be assigned to every factor of judgement/bias/inconsistency, what have you, to the score itself. Since you have decided to restrict your study to a statistical analysis of Willis evaluation data, we feel compelled nevertheless to critique your study on statistical grounds.

The long critique we sent to you was an attempt to demonstrate, through statistical arguments, that the approach taken and the empirical findings do not, under any circumstance, permit you to conclude with certainty there is no gender bias in the Joint Initiative evaluations. The most you can conclude is that there is not enough evidence to decide one way or the other. You have not addressed any of our concerns systematically other than through an editorial comment that our 'statistical criticisms make rather too fine a point'. (Exhibit HR-250, Tab J, pp. 1-2)

551. The Treasury Board apparently had used an alternative line of enquiry into the question of possible gender bias, described in Ouimet's detailed comments of September 7, 1990. Using a different approach, she writes, the Treasury Board came to the same conclusion as Sunter, but in its view the conclusion is misleading since it only represents half the story and says nothing about how often questionnaires are under- or over-evaluated. The Treasury Board's overall conclusion is found on page 10, which states:

Using criteria provided by the Willis firm, it is not possible to conclude that while there may be statistically significant differences in patterns of evaluations, they are not substantively important. As shown above, the issue of level of difference has ignored the frequency dimension and the differences in patterns are indeed significant. We attempted to take into account mis- evaluations in order to see whether there was a gender pattern to them and it would appear there is.

We have analyzed the same data and using the same measure as the Sunter analysis, and yet reached different conclusions. We are convinced that the data show serious problems with the evaluations and that these problems look very much like gender bias; in any event, further analysis is required. We remain firm in our belief that the scores need to be adjusted, but we are prepared to discuss a different adjustment strategy from the one originally used. Any adjustment is going to be difficult to estimate given the significant differences between the two Willis studies. [emphasis added]. (Exhibit HR-250, Tab J)

552. We will now describe and examine specific factual information found in the Commission's investigation provided by Durber. On March 8, 1990, the Commission received from the Treasury Board a document (Exhibit HR-185) which explained the methodology used by the Employer in making its equalization payments. According to Durber, the Treasury Board paper, issued in March, 1990, estimated an average bias of +3 per cent for evaluations of positions from female-dominated occupations and of -4 per cent for evaluations of positions from male-dominated occupations. Accordingly, the wage equalization payments had incorporated a corresponding across-the-board adjustment when calculating equal pay for work of equal value. The adjustments resulted in payments to public service employees in female-dominated occupational groups which were lower than they would have been without those adjustments for possible gender bias.

553. The revision of scores is explained in the methodology paper as follows:

A score revision factor based on simple statistical techniques was estimated by the Treasury Board. All questionnaires except those rated by the Master Evaluation Committee and the Willis consultant were revised: ratings for female questionnaires were reduced by approximately 3% overall and male questionnaires were raised by roughly 4% overall. All policy analyses presented in the remainder of this report use the revised evaluation scores as described.

(Exhibit HR-185, pp. 6-7)
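A minimal sketch, in Python and purely for illustration, of how such an across-the-board score revision would operate. The factors 0.97 and 1.04 are assumptions taken from the approximate percentages quoted in the methodology paper; as noted below, the Treasury Board's actual revision factors were never produced:

    # Across-the-board score revision as described in the methodology
    # paper: female-dominated questionnaire scores reduced by roughly
    # 3 per cent, male-dominated scores raised by roughly 4 per cent.
    # The exact factors used by the Treasury Board were never produced;
    # 0.97 and 1.04 are assumptions for illustration only.
    REVISION = {"female": 0.97, "male": 1.04}

    def revise_score(score: float, gender_dominance: str) -> float:
        """Apply the across-the-board revision to a committee score."""
        return score * REVISION[gender_dominance]

    # A female-dominated position scored 200 by a committee would be
    # revised down to about 194; a male-dominated position scored 200
    # would be revised up to about 208.
    print(round(revise_score(200, "female"), 2))  # 194.0
    print(round(revise_score(200, "male"), 2))    # 208.0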

554. In attempting to understand Exhibit HR-185, which contains a good deal of detailed statistical jargon and information, Durber sent the report to seven independent individuals for their comments. These individuals included pay equity experts, Weiner, Dr. Morley Gunderson, Lois Haignere, Willis & Associates, Roberta Rob, Judith Davidson-Palmer, and a statistician, Sunter. Durber viewed these individuals as potential participants in a workshop the Commission had scheduled for April, 1990, to review the Treasury Board's methodology (Exhibit HR-185) and to advise him how he ought to deal with it.

555. The Commission had difficulty in obtaining data from the Treasury Board during its investigation of the complaints. Durber testified the actual data the Treasury Board used to arrive at its conclusions in HR-185 were never produced. The Commission had to project salaries and create its own salary databases because of the length of time it took the Treasury Board to provide salary information. A complete set of the salary data was finally provided to the Commission during these hearings.

556. On April 9, 1990, the Commission held its workshop and some of the individuals listed above attended, namely, Sunter, Roberta Rob, Judith Davidson-Palmer, and a representative of Willis & Associates. The others, who did not attend the meeting, provided written comments. Durber wanted to be as well informed as possible by some of the better minds in Canada on the issue of pay equity. (Volume 147, p. 18197). After fairly extensive consultation with these individuals, Durber consolidated the advice he received and formulated an investigation plan and hypothesis.

557. Following the meeting of April 9, 1990, Durber consolidated the advice resulting from his discussions with these individuals in order to clarify the issues that needed to be addressed by the Commission. A decision was made to challenge the Treasury Board methodology by detailed questioning.

558. The Commission was also interested in knowing whether the factors in the Willis Plan behaved differently for male-dominated occupational groups as opposed to female-dominated occupational groups. Durber contracted the Wyatt Company, an international firm of management consultants with a considerable job evaluation practice. The Wyatt firm was asked to use the database for all the JUMI Study job evaluations. The Wyatt firm looked at the data to determine whether the relationship between the factors was the same regardless of the gender of the group and regardless of the occupation from which the questionnaire was taken. Their report was provided to the Commission in early June, 1990. The Wyatt analysis demonstrates there were correlations between various factors, for example, the extent to which a score on mental demands correlates with knowledge. The conclusion from this report was that there appeared to be no significant differences in the correlations between the factors for the male and female jobs or between the overall patterns. The report further indicated there was some difference in scores on working conditions between male and female jobs. It was Durber's belief this was explainable by the nature of the work. (Volume 147, p. 18208).

559. The approach of the Commission in assessing gender bias was not to prove there was no bias but simply to find whether or not a reasonable person would see bias operating. (Volume 149, p. 18521). According to Durber, the fact that there is a different pattern for males as opposed to females, as for example in the Wisner 222, does not tell the investigator anything except, perhaps, whether one ought to look further.

560. During the initial investigation, a letter dated June 20, 1990, accompanied by a binder, was delivered by a representative of the Treasury Board to the Commission, which contained information relevant to the Commission's assessment of gender bias. The documents in the binder included IRR Sub-Committee documents, the ICR studies, the recommendations for changes to the MEC evaluations prepared by the Willis consultant, Drury, the Tristat Report, Willis' report on MEC's work dated July, 1988, questions referred to Willis in August, 1988 from the management side regarding the MEC evaluations, minutes of JUMI Committee meetings, copies of letters written in July, 1988 by the Alliance and the Institute to Drury regarding evaluation rationales and interpretation of the factors under the Willis Plan, Willis' response to committee challenges of the MEC evaluations, a copy of a letter from Willis regarding Committee #4 written on August 17, 1989, and copies of letters between the parties regarding the Wisner 222.

561. Durber stated the documentation in the binder provided by the Employer did not particularly pertain to gender bias. In the set of documents relating to the ICR Sub-Committee, Durber searched for specific evidence of gender bias. With regard to the Tristat Report, he testified he was looking for bottom line conclusions because he wanted to know whether in fact there had been indications, or hard evidence, of gender bias. As to the ICR studies, considering the small number, 25 tests, it was not possible, he said, to detect a trend.

562. Durber spoke with one of the Commission's observers, Brian Hargadon, concerning his observations of the ICR tests. Hargadon participated in all of the tests. He also administered some of the tests. Durber testified that in Hargadon's view the evaluation committees did not, over time, take the tests as seriously as when they had begun. Durber, therefore, found the ICR tests inconclusive on the question of gender bias.

563. On the question of the changes to the MEC evaluations prepared by the consultant, Drury, Durber primarily relied on Willis' opinion that the matters brought forward by her were resolved. As to the report prepared by Willis & Associates in July, 1988 and their analysis and conclusions regarding the MEC's work, Durber considered the bottom line in the report to be that there was no problem with gender bias. After reviewing the materials submitted to him by the Treasury Board, Durber came away with no better understanding of how gender bias might operate in the job evaluation results. Durber testified the Treasury Board material was not helpful and he needed to better understand whatever was going on with respect to so-called gender bias. Accordingly, he decided to look elsewhere for answers.

564. Durber stated the only discussions he had with Treasury Board staff about material contained in the binder occurred during a presentation he made to the Employer on July 5, 1990, regarding issues surrounding gender bias. A more detailed analysis of gender bias as viewed by the Treasury Board was not made available to the Commission until August, 1990, when the Treasury Board submitted its more detailed written submission concerning this subject to the Commission.

565. Part of the Commission's investigation into the question of apparent gender bias was a follow-through on the recommendation, contained in the Saveland Report, for further analysis of the 27 under-evaluated male jobs (subsequently reduced to 25). These jobs had been identified by Saveland as showing a difference of 10 per cent or more between Wisner and the evaluation committees. Durber convened a joint committee in the spring of 1990, composed of management and union employees under the chairmanship of Ron Renaud, Senior Consultant, Equal Pay Section of the Commission. They met for two weeks beginning on April 30, 1990. In the Commission's letter to the committee members, the committee was informed as follows:

The committee's mandate is to carry out a quality check of twenty seven positions that were evaluated by JUMI committees coming after MEC. In an analysis of 222 position evaluations by Willis and Associates, June, 1989, it was found that the evaluations were significantly different from the MEC discipline and contributed most to the finding of apparent gender bias. (Exhibit PIPSC-135)

566. Former MEC evaluators were selected to participate in this committee, including two management and three union representatives whose names were suggested by the Employer and the unions. Durber wanted participants who had a breadth of views. In his opinion, this goal was achieved. This committee was referred to as the Quality Analysis Committee (the QA Committee) and produced a report, The Quality Analysis Report.

567. Durber testified that within the context of the QA Committee, he was less interested in the fact there were differences between the consultant and the multiple evaluation committees than he was in what accounted for these differences. He was interested in knowing whether the QA Committee members perceived the multiple evaluation committee and consultant differences in a way that related to the fact these were male jobs, or whether they perceived any bias on the part of the multiple evaluation committees. He considered the five former MEC members to have a special insight into both the Willis Plan and the MEC discipline. He expected they would understand the mechanisms behind their own differences with the committees.

568. Durber testified the Commission was trying to determine if there was a reason, a motive, or some conscious or unconscious effort by the multiple evaluation committees to disfavour these male jobs. If gender bias was to be evident anywhere, he reasoned, it would be evident with these 25 jobs.

569. The procedure followed by the QA Committee in completing its assignment was for each committee member to read the questionnaire, independently evaluate the questionnaire, review the MEC benchmarks used by the JUMI Committee and those used by the Willis consultant, and then select additional appropriate MEC benchmarks.

570. The evidence before the Tribunal is contradictory as to whether or not the QA Committee was required to arrive at a consensus in its evaluations. According to Durber's evidence, the QA Committee was not asked to form a consensus. Durber testified the Commission asked each QA Committee member to report to the chair on their evaluations, then discuss them, but not arrive at a consensus. Durber further testified the Commission was not attempting to validate the ratings of the 25 jobs but simply wished to understand whether the members of the QA Committee might become aware, during this process, of any gender issues either in their own ratings or in the multiple evaluation committees' ratings.

571. On the other hand, two union members of the QA Committee testified the QA Committee was asked to reach a consensus and failed to do so. Their evidence is the QA Committee followed the same Willis procedure used by the evaluation committees. The only exception, according to these witnesses, was that the consensus had to be unanimous for each sub-factor in the Willis Plan, rather than the two-thirds majority required for consensus in the evaluation committees. An attachment to the letter dated April 23, 1990, from the chair to the QA Committee members confirms the unanimous agreement requirement for consensus. The relevant part of that document states:

Evaluation findings will be arrived at by committee consensus. This means that the evaluations by factor, sub-factor and points must be agreed to by each member of the committee. (Exhibit PIPSC-135, p. 3)

572. Durber testified that at the conclusion of the QA Committee's work, the chair of the Committee, Ron Renaud, reported to him that the differences in the ratings between the QA Committee and the evaluation committees were due to perceptions of the work, but that the QA Committee found the gender of the jobs played no role whatsoever in the ultimate evaluations. The written report does not include any reference to this verbal report from Renaud to Durber.

573. Durber concluded from this exercise that it would not be unusual to find a range of views between evaluators which would be reflected in a range of ratings. Durber interpreted the difference between Wisner and the evaluation committees as normal, honest disagreement about work as opposed to any problems with gender bias.

574. Durber stated in his evidence that the entire edifice of the question of gender bias which is before the Tribunal rests on a foundation of one person's view [i.e., Wisner's] of 25 questionnaires. (Volume 149, p. 18581).

575. Durber used the QA Committee Report to compare the average of the QA Committee evaluators' ratings to the total point score given by the JUMI evaluation committees and Wisner. According to Durber, this comparison indicated to him the QA Committee disagreed as often with Wisner as with the evaluation committees, and he states in Volume 149, at p. 18573, line 14 to p. 18574, line 22:

The patterns were that the low raters agreed, essentially, as often as they disagreed with the committee ratings.

The high end rater agreed only about one-third of the time with the committees, although a third of the time was still a reasonable number.

I concluded from this exercise that in fact one should expect a range of view, a range of ratings on jobs, that it wasn't unusual to find a range of ratings, that it certainly would not be unusual to find differences between any raters.

That permitted me to believe, interpret Mr. Wisner's differences from the committees as a normal, honest disagreement about work as opposed to any problems with gender bias.

The fact that they were male jobs may or may not have been coincidental, but I could not see any necessary reason to believe that there was bias operating as a result of the differences between Mr. Wisner and the committees.

I didn't, for example, conclude that Mr. Wisner was biased in favour of male jobs, which could have been one of the interpretations from his report. He being a male, one might have concluded that. But whether he was a professional consultant and objective or whatever was another issue.

But we did find these five individuals from MEC also disagreed less than Mr. Wisner, but probably about as often or a little more than Mr. Willis when he and his other three consultants had looked at male jobs.

576. One of the union representatives on the QA Committee, Tim Yates, was asked in chief about his understanding of the purpose of the QA Committee. His response was that its purpose was to look at the committee evaluations and ascertain whether they had chosen appropriate benchmarks and correctly applied them. Yates testified he could not recall any instances, during this review, where inappropriate benchmarks were used. As to differences between the consultant's evaluations and the committees' evaluations, Yates says the following in Volume 175, at p. 22226, lines 4 - 22:

A. Well, if one is to make a huge assumption, that we were the experts in the thing, sometimes we were higher than the consultant, sometimes we were lower than the committee. I think it was Mr. Willis who said many times, this is not a science.

I would say personally that what was the problem? It all appears to be within tolerance.

Q. What do you mean by it would all appear to be within tolerance? Where did that phrase come from?

A. The lowest possible difference is one step. One step is 15 per cent. That's the very slightest possible bit of shading in any factor is 15 per cent.

577. The other union representative who testified regarding the QA Committee was Mary Crich, who had been an alternate on the MEC and participated in the committee evaluations as a member of Committee #5. She was asked about her observations. On reflection, she found her participation on the QA Committee was a good experience because it led her to understand that what she had done as a committee member was precisely what the evaluation committees were supposed to have been doing. She found the QA Committee evaluations were reached by exactly the same discussions relating to the same points and with more or less the same kinds of agreements and disagreements she experienced in her evaluation committee.

578. As to Crich's understanding of the work of the QA Committee, she testified the individuals selected for the QA Committee knew the MEC discipline, and thus could decide whether or not the ratings of the evaluation committees respected the MEC discipline or differed significantly. Crich further testified that when the QA Committee finished its work, there was general agreement among the members that there was no bias. If there was a significant difference, it was, according to Crich, because the job was a genuinely difficult one to evaluate which had no comparable benchmark. She described the 25 jobs as very difficult jobs. Crich was asked in cross-examination what she understood was meant by bias. She responded in Volume 192, at p. 24830, lines 5 - 15:

A. What I remember the other participants saying is that there had been allegations in the media that there had been -- the results of the study were biased and by biased, that meant that the evaluations had not been fair to all jobs equally and that female jobs had been rated too high. I don't know if the -- also that male jobs had been rated too low. Maybe it was both. Maybe it was just one or maybe it was -- but that was -- the bias is that female jobs were rated too high.

579. A further clarification of this response was given by her in Volume 192, at p. 24841, lines 2 - 12:

Q. Mr. [sic] Crich, I just have one question, really, and I will try to phrase it as clearly as I can.

When your Quality Assurance Committee agreed that there was no bias in these questionnaires, these 27 questionnaires that you evaluated, were you looking at the reasons for the difference and, therefore, concluding that the gender of the questionnaire was not the reason for the differences?

A. That's correct.

580. Willis testified he had a number of problems with the QA Committee. He was disappointed with the composition of the committee and would have preferred that the total MEC membership be reconvened rather than only the five individuals selected. Another factor which troubled him was that although three of the members were from the MEC, two of them had acted only as alternates. Moreover, one of the members had been identified as an outlier in the Tristat Report. Willis also believed, since two of the members had participated in the evaluation committees, their opinion about the committee results might be suspect.

581. Another area of concern for Willis was that these five individuals had not done any prior evaluations for at least two years. Willis testified this committee should, at the very least, have been given a day or two of refresher training by the consultants. In his opinion, it would be difficult after a two year lapse in time to return and do evaluations, particularly evaluations which were to be critiqued. His biggest concern is noted in Volume 208, at p. 26950, lines 3 - 8:

However, I guess my biggest concern about the QA committee was it was my understanding that there was no consensus process. To me I look at the consensus phase of the evaluation process as being part of the data-gathering collection.

582. Willis stressed the consensus phase of the Willis Process is a very important exercise because it gives the committee members an opportunity to discuss the facts of the job and time for all members to consider the information thus elicited. It is the fine honing of the information which is important, according to Willis, at this stage of the process. When committee members change their evaluation at this point, Willis believes the change is appropriate as long as it is based on facts which are brought out as a result of the discussions. On this basis, Willis discounted the results of the QA Committee because an essential and critically important step was left out. Willis testified he might, to some extent, change his opinion regarding consensus if indeed the QA Committee had included the consensus process in its deliberations.

583. Durber testified that, in the normal course of an investigation, the Commission expects an employer to provide evidence in support of its defence. He testified the Commission received a defence from the Employer which says, in effect, it ought to be excused from accepting the results of its own study. Notwithstanding this position, the Employer, though duly represented, presented no evidence to support such a conclusion.

584. According to Durber, differences between a committee and a consultant are bound to occur, but the Commission needs to be vigilant about understanding those differences and their relationship to gender.

585. The Commission opted to conduct further analysis of the consistency of the evaluations by the nine evaluation committees compared to those of the MEC. Durber felt he had no alternative but to order another study so as to complete the picture. He was not happy with the alternative because, in his opinion, it was impossible to replicate job evaluations done by the committees. Durber felt uneasy about the validity of the process, which he described as people, in a sense, second-guessing what a rather large number of people had done over a period of time. Durber would have preferred to have the parties to the JUMI Study deal with the issue of apparent gender bias in their own way. He elaborates in Volume 149, at page 18599, lines 1 - 11:

But conceivably they might well have had committees explain their results, look at the differences between themselves and Mr. Wisner. There might well have been some judgments raised or brought to bear on the patterns themselves and on the differences between the committees and Mr. Wisner.

There could have been some good rationalization, if you like. But in the event, that proved not possible. Once the committees were gone, they were gone.

586. Durber said in the course of his investigation he did not contact Wisner because he preferred to relate to what he considered reasonable criteria for judging the quality of job evaluation. The issue, in Durber's view, was one of differences between committees and consultants and the process followed by the committees. Durber questioned why he should prefer to believe a consultant over the evaluation committees. Given a choice between the judgment of a group of well informed people as opposed to following the discipline of one individual, Durber would prefer to believe the group of people. This was one of the indirect measures which Durber used in drawing his conclusions about gender bias. Durber believed if he contacted Wisner, he then would have been bound to call each of the working committee members.

587. Durber contacted Willis to do a further evaluation in the early part of 1990. Willis confirmed his acceptance by letter of February 12, 1990, to Durber, which states in part:

The purpose will be to determine the extent of any systematic bias that may exist in the results of evaluation committee efforts.

...the sample size should be 300 positions, with at least 131 being from male dominated occupational groups and the balance from female dominated occupational groups.

As to the sample selection, the random selection methodology we used in the earlier special analysis would, I believe, be acceptable to the unions and management. The Human Rights Commission should have input into this methodology...

The method employed for the analysis will be the same as used in our previous analysis. Each selected questionnaire will be reviewed and a determination made as to whether a similar position is included among the Master Evaluation Committee's evaluations. In cases where a similar MEC benchmark exits [sic], the MEC evaluation will be adopted as the consultants evaluation. When no similar benchmark exists, the consultants will do an independent evaluation of the position, supported by reference to appropriate MEC benchmarks. Comparison will then be made with the sample committee evaluation and rationale for that position. When differences are found between the consultants evaluation and that of the committee, a written rationale explaining the consultants evaluation will be provided. (Exhibit HR-93)

588. Durber stated his objective in commissioning Willis to re-evaluate the additional 300 positions was essentially to pursue the issue which had been raised as a result of the Wisner 222 relating to possible gender bias. In view of s. 9 of the Guidelines, supra, Durber wanted to be assured there was no question of gender bias. He further testified he could see no alternative but to pursue the same approach as Wisner had because it was through that approach the issue had arisen in the first place.

589. Durber would have preferred to engage Wisner to perform the second set of re-evaluations but, in the meantime, Wisner had left the Willis firm. Accordingly, Willis was authorized to form a committee consisting of four consultants (collectively referred to as the Gang of Four), who were to perform the 300 re-evaluations (the Willis 300).

590. Willis testified he understood there was a concern the four consultants working together would arrive at a slightly different result than Wisner. Accordingly, their additional task involved selecting jobs from among the Wisner 222 and independently evaluating them without making any judgment as to differences between the Gang of Four's and Wisner's re-evaluations. Using the Gang of Four, Willis was to review approximately 20 per cent of the evaluations of the Wisner 222, i.e., 44 questionnaires, as a double check on Wisner's interpretation of the jobs.

591. The Gang of Four tried to match, as closely as possible, the methodology that had been used in the Wisner 222. The sample of positions was selected by the Commission and was taken from the total sample of evaluations excluding the MEC evaluations and any re-evaluations included in the Wisner 222. Willis was not asked to do any analysis of those re-evaluations. Once the Gang of Four completed the 300 re-evaluations, the results were turned over to the Commission for analysis.

592. The Gang of Four consisted of Willis, two of his associates, Owen and Davis, and one outside bilingual consultant, Esther Brunet. Questionnaires were assigned to each consultant and a second consultant reviewed each of those evaluations, so that there were always two consultants involved. The work took approximately two months. A report entitled Report to the CHRC Equal Pay, Quality Analysis of Sampled Committee Evaluations, Joint Initiative Equal Pay Study, was presented by Willis to the Commission in March of 1990.

593. Although this review was to assess the quality of the Wisner 222, Willis & Associates were instructed by the Commission not to draw conclusions as to the quality of either their work or Wisner's re-evaluations. In the course of these hearings, and in the context of this review, Willis was asked his opinion on the quality of the Wisner re-evaluations. He replied in Volume 59, at p. 7337, lines 11 - 24:

THE WITNESS: I was satisfied with the quality of the Wisner evaluations six or seven months earlier when I looked at his rationales and I looked at his actual evaluations. I had a great deal of confidence in Mr. Wisner's ability as a professional job evaluator.

I did not, at this point, sum up the 44 evaluations by our team of consultants and compare them in total with the Wisner evaluations. They were not identical, there were some differences. But I felt that it was up to Mr. Durber to analyze those differences and, in effect, decide whether the quality was consistent between both consultant teams.

594. In terms of analyzing the results of the 300 evaluations, Willis stated it would have been appropriate, in his opinion, to perform a statistical analysis to identify the existence or non-existence of a systematic pattern of gender bias. Had the Commission asked Willis to perform this analysis, he would have retained a statistician, Dr. Milczarek, who, in the normal course of events, performs this kind of analysis for him.

595. The last communication between Willis & Associates and the Commission concerned the 44 re-evaluations. This took the form of a letter dated May 1, 1990, written by the Willis consultant, Keith Davis, to the Commission. During the re-evaluation of the 300 positions and the review of the 44 Wisner re-evaluations, the Gang of Four inadvertently referred to a list relating to the working conditions factor in the Willis Plan. Changes had been made by the JUMI Committee to this factor which the Gang of Four had failed to take into account. Davis informed the Commission that, when using the re-evaluations, the working conditions factor needed to be changed. In the end, one re-evaluation by the consultant required a change.

596. The Tribunal had the benefit of hearing evidence from Esther Brunet concerning her participation in the re-evaluations as a member of the Gang of Four. Brunet was the only member of the Gang of Four who was an employee of the Federal Public Service. She had been involved in the JUMI Study as a chair in the first version of Committee #4. Her employment background at the relevant time was Director of Personnel, Finance and Administration with Status of Women Canada. Willis testified he needed a French-speaking consultant to participate in the Willis 300 and, because he and his staff had a great deal of confidence in Brunet's ability to evaluate, they contracted with her to evaluate the French questionnaires.

597. Brunet evaluated approximately 100 questionnaires out of the total of 300. About 70 per cent of those were French questionnaires. She first evaluated the questionnaires independently. If the evaluation committee had used only one benchmark, she would try to find more. Once her evaluation was done, she would look at the evaluation committee scores and rationales. If she felt the reason for the difference made sense, she would give the benefit of the doubt to the evaluation committee scores; if not, she would prepare her justification and present it to the other three consultants. During this presentation, Brunet would try to convince the other three team members of the need for the change she was proposing. If she was unable to persuade the other members, the evaluation committee scores remained as they were. Brunet explained the Gang of Four did not write rationales in the same manner as the committees because the reason for writing them was simply to justify the difference between the consultant and the committee.

598. Brunet's evaluations of the French questionnaires can be compared to the committee scores because she was the only consultant in the Gang of Four evaluating French questionnaires. The French questionnaires are summarized in Exhibit PIPSC-162, which confirms that for female-dominated questionnaires, Brunet's average score was 157.1 compared to the committees' average score of 157.9. With respect to the male-dominated questionnaires, her average score was 250.7 compared to the committees' average score of 249.7. Brunet rated the same as the committees except in eight cases, five from the female-dominated and three from the male-dominated questionnaires.
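
By way of illustration, the comparison summarized in Exhibit PIPSC-162 can be restated from the averages quoted above. Only the group averages are in evidence; the sketch below (in Python) simply computes the point and percentage differences between Brunet's averages and the committees'.

    # Illustrative restatement of the group averages quoted above from
    # Exhibit PIPSC-162; the underlying per-questionnaire scores are not
    # in the record, so only the averages are compared.
    averages = {
        "female-dominated": {"brunet": 157.1, "committees": 157.9},
        "male-dominated": {"brunet": 250.7, "committees": 249.7},
    }

    for group, scores in averages.items():
        diff = scores["brunet"] - scores["committees"]
        pct = 100.0 * diff / scores["committees"]
        print(f"{group}: {diff:+.1f} points ({pct:+.2f}%)")

    # female-dominated: -0.8 points (-0.51%)
    # male-dominated: +1.0 points (+0.40%)

On these figures, Brunet's averages differ from the committees' by less than one per cent in each group.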

599. Unlike other Commission investigations, the investigation here under s. 11 differed somewhat, in that the factual foundation for the complaints was known to the Commission because it had participated in the process as an observer from an early stage. The Commission observers attended the JUMI meetings and observed the committees during their evaluations on an ongoing basis from the commencement of the study. The Commission did not have enough observers to attend all of the committee sessions, and over the years the number of observers was reduced.

600. Daily notes were made by these observers when they attended an evaluation committee at work (Exhibit R-142), and these notes, which were quite extensive, were entered in evidence during the cross-examination of Durber. Durber had not read the notes himself. He asked Brian Hargadon, one of the Commission's observers, whether there was anything in the observer notes relating to the committee process in particular which needed to be explored as part of the investigation. Durber testified he received an overview from Hargadon about difficulties in the process of job evaluation, in arriving at consensus, and in dealing with the issues. However, at the end of the day, there was nothing in them to be concerned about in terms of the bottom line, that is to say the reliability of the results. Consequently, Durber did not consider it necessary for the observer notes to be provided to the Tribunal as evidence in this hearing.

601. Excerpts from the observer notes were read to Durber during his cross-examination, and he was asked whether he was given the information, either by Hargadon or in any other context, to form his conclusions about the JUMI Study. These excerpts include the following:

Committee #5 ...A gender bias problem appears to be developing in this committee.

There is one woman (Sherry) who gives higher scores than the rest of the group for female-dominated jobs and lower scores for male- dominated jobs. She also claims to have first hand knowledge of most jobs and when describing them makes extremely subjective comments reflecting this bias. She will rarely change her rating even if she has taken an extreme position.

There is also a man in the group (Paul) whose ratings reflect the opposite bias. However, his ratings tend to be closer to the consensus rating.

There is another woman in the group (Mary) who, in discussion, appears to have a strong alliance with Sherry. However, Mary's evaluations do not appear to indicate a bias.

Discussion tends to be extremely drawn out in this group as there are consistently opposing views...

(Exhibit R-142, Volume I, page 6)

Functioning of Committees:

In general, committees have settled into routines which are efficient and also reflect the uniqueness of each group. Given that working conditions are not ideal (i.e. working time is tightly structured and individuals with very different personalities and views must spend extended working hours together) committees are working well.

However, there are a few problems which need to be monitored. I do not know enough about Committee #3 to comment. Committee #5 also has its problems which affect productivity, although not to the same extent as Committee #3.

Members of Committee #5 have problems listening to the views of others. They constantly interrupt each other and often the emotional tenor of the Committee is extremely high.

Committee #5 needs a chair who can be very firm with such disparate and strong personalities. The present chair does not seem to have this capacity...

(Exhibit R-142, Volume I, page 90)

Committee #3

Splits in committee union/management. Job was well written and complete. Louise moved to conform with Jake & Al on K&S. No improvement on committee operations. Atmosphere tense.

Committee #2:

Committee works well.

(Exhibit R-142, Volume II, page 125)

Committee #4:

Took 5 hours to deal with this job (simple). New raters prolonged process, obstinate, even after clarification by consultant.

(Exhibit R-142, Volume II, page 130)

Committee #4:

...Language gender used, Chairperson trying to influence raters.

(Exhibit R-142, Volume II, page 176)

Committee #5:

...Pierre Collard noted a blow-up in Committee 5. He felt it may have been indirectly influenced by the fact that some members of #5 would have no jobs when this process is complete...

Wednesday - the Pay Equity Section, CHRC, rec'd a call from TB to intervene in a blow-out by 2 members of Committee 5.

Thursday - Brian H. and I wandered around committee and things were quiet.

(Exhibit R-142, Volume II, page 203)

Weekly meeting, Monday, October 31, 1988:

Ron [Renaud] brought out the point that the ground rules with regard to meeting of the consensus guidelines is not being followed. Result is that after it is all over one party could say that it was not a valid agreement because the rule was not followed, as covered in the procedures guidelines.

(Exhibit R-142, Volume I, page 36)

Additional Observer Notes dated November 24, 1988:

3. Majority vote. Committee 3 & 5 have a problem with this. Apparently they are not following the rules for consensus as spelled out on page 2 of the Working committee Procedures. Committee 2 follow the instructions with no exceptions...

There is also some question on reaching consensus by using the median. Fred had suggested this. For example, you have under working conditions, the following scores, 13, 13, 15, 17, 17. You should settle on 15 as the score. Should this be the solution?"

(Exhibit R-142, Volume I, page 79)

JUMI Committees - Observations:

Today, during my visit to committee number three, I noted that the committee was not observing the two third rule in order to reach consensus. Committee decided to take an average value as a consensus, however, I was consulted in the matter and they went along with my advice. Moreover, a comment was made: We do not follow this rule unless somebody is here observing us.

(Exhibit R-142, Volume II, p. 102)

Committee #6:

...Also assumption made in working conditions as the committee felt the incumbent was not thorough in filling out the questionnaire.

(Exhibit R-142, Volume II, p. 187)

Notes from Brian Hargadon to Ted Ulch:

I see a couple of problems, at least, with Committee #2.

...

Lack of utilization of the original bench marks. We are told that we as a committee do not have any obligation to follow them. Is this so?

(Exhibit R-142, Volume II, p. 111)

JUMI Committee:

...Keith and Sharon made comments on the analysis they did on their respective committees there was a concern shown by all of the people sitting in for the CHRC that it is obvious there are certain people rating consistently high or low, it may not be resolved soon enough if the information from the tests is not quickly analyzed.

...it has been suggested that we keep out of personal dynamics, that is fingering any person in a committee that may not be up to snuff because it could come back to haunt us. There is a feeling that some committee members, particularly union, are being advised on how to approach the evaluations which would best fit the interests of a specific union membership.

(Exhibit R-142, Volume I, p. 56)

Committee #2:

In position 2317, committee did not follow MEC benchmark and it seems that the position has been overrated...A comment was made: It does make a difference to have your presence here during evaluations. People here are not discussing the jobs at all.

(Exhibit R-142, Volume II, p. 180)

Weekly Activities:

...The problem is that committee 5 has well over 100 evaluations to sore thumb and there is a question as to how they were allowed to accumulate so many.

(Exhibit R-142, Volume II, p. 208)

Weekly Meeting, November 8, 1988:

...The members of the committee brought up a number of inconsistencies that have been noted in the various committees. There is a concern as to them being limited to questioning if there is an obvious standard set that may not be followed with other committees.

For example, one committee decided that level D under job knowledge can only be re-asked if the job requires a university degree. Ron asked the consultant if that was the case, and the answer was that was not correct.

Ron will be writing more specifics to be submitted to Ted under separate cover. There is a real concern that these inconsistencies will be allowed to go on and grow in number with the end result that the credibility of the committees, and indeed of us, will be challenged...

Committee #3 is continuing to have some problems. The committee will do their ratings, then search for a benchmark to fit the rating rather than check their rating against an appropriate bench mark.

(Exhibit R-142, Volume I, p. 45)

Consistency JUMI Study:

I would like to bring to your attention what I consider an important issue at this stage of the study and one that should be brought to the attention of JUMI.

Essentially, we should confirm our position that consistency is important; consistency with the MEC discipline and consistency of the five evaluation committees in applying the Willis Plan. I believe we have some legislative authority in the Equal Pay Guidelines in respect to consistency.

...

There have been a number of instances not only mentioned above where committees have for some jobs followed an evaluation process which is inconsistent with MEC and among the various evaluation committees...

-There are other situations like this which makes us concerned about inconsistencies and how we can help ensure that they are corrected as early as possible without compromising our role.

In summary, I recommend that JUMI be advised of our opinion as to how Acting situations are to be handled. In addition it would be timely to confirm our position on the importance of consistency; consistency with the MEC discipline and consistency of the five evaluation committees in applying the plan.

(Exhibit R-142, Volume I, p. 47)

Update on Observers Remarks, December 7, 1988:

The observers decided they wanted to go over a number of points that concerned them so a meeting was held this morning.

Before getting into the individual items I want to confirm that we are having some concern shown by various committees during testing time...

The reason for our numbers being diminished at the committees has been discussed with the observers so that we would give the same reason, a) other commitments and b) committees are now requiring less observation because of the time they have been in operation.

Committees 1,2, and 4 are operating quite well. Committee 5 does still have some problems however, they will probably sort themselves out.

Committee 3 is still not functioning up to par. The question arises whether the remaining observers, Sharon and Keith, should spend a disproportionate amount of time in committee 3 because of the problem. So the question remains, do we give preference to committee #3?

When we go back and look at the reason observers from the CHRC were brought into the picture, there is concern that our efforts will be for nothing should a) JUMI fold up, or b) we are to attest to the credibility of both the Master Evaluation Committee and the current five committees in operation.

As it stands now, no observer would attest to the evaluations being fair, balanced and objective. There are too many irregularities within committees and between committees.

(Exhibit R-142, Volume I, pp. 82-87)

602. It should be borne in mind the role of the observers was to act as a watchdog in the committee evaluation process. They were to observe, critique and, when asked to do so, suggest improvements in the functioning of the committees. The observers' notes need to be viewed in this context.

603. Durber accepted the bottom line opinion of the Commission observer, Hargadon, and decided not to rely on the notes as evidence of the reliability of the evaluation results.

604. The Tribunal heard testimony from witnesses who were evaluators on committees and who provided evidence in response to specific observer notes about their particular committee. Having considered Durber's responses to the questions raised during his testimony, the vagueness and lack of specificity of these notes, and the responses of the evaluators who testified at this hearing, the Tribunal finds as a fact that the notes do not significantly impact in a negative sense on the broader issue of reliability.

605. Another aspect of the Commission's investigation involved a three member committee organized by Durber to review re-evaluations conducted by the Treasury Board relating to the Nursing, Home Economics, Occupational and Physical Therapists and Computer Services benchmarks. These re-evaluations are contained in two reports which were presented to the Commission in July, 1990, in response to the Commission investigation into the question of apparent gender bias in the evaluation results. The reports are entitled Evaluation of CS Benchmarks and Corrected Version of NU, Annex B (Exhibit HR-252), and Final Report on Evaluation of Equal Pay Study Questionnaire (Exhibit HR-253).

606. The Commission had asked the Treasury Board whether the Employer subscribed to the observations offered in these reports, which raised questions about the specific job evaluations of the multiple evaluation committees. The Commission received no response from the Treasury Board to its enquiries. Durber concluded these reports could be viewed as possible evidence in the investigation but, in the short term, excluded them as valid evidence in the Commission's investigation, reserving, however, the option to advise the Tribunal of the documents in greater detail. Nonetheless, Durber decided to have a committee explore the substance of the reports (the Benchmark Review Committee).

607. The Benchmark Review Committee consisted of Esther Brunet; Christine Roberge, an employee of the Commission; and Brian Hargadon, an investigator for the Commission. Hargadon and Roberge were trained by Willis. In early September, 1990, the three participants, using the Willis Process, started to re-evaluate each of the evaluations found in the Treasury Board reports. These included 65 benchmark questionnaires. They also examined 203 multiple committee evaluations from the OP, HE, NU and CS Groups. The process, defined by Durber, was that all three committee members had to agree on the evaluation for each job that was re-evaluated. After reaching consensus, the committee then compared its score to the Treasury Board consultant score and the score of the multiple evaluation committees.

608. If the Benchmark Review Committee score was different from the Treasury Board and the evaluation committees' scores, there was an attempt to examine the reason why the scores were different. The Benchmark Review Committee then gave the benefit of the doubt to the Treasury Board consultants or to the evaluation committees; failing that, the Committee would justify its own score where it differed from those of the Treasury Board and the evaluation committees.

609. Since Durber was not informed by the Treasury Board as to the purpose of the reports provided in July, 1990, his conclusions were primarily based on the conclusions contained in the Benchmark Review Committee's report.

610. Brunet did not participate in writing the Committee's final report (Exhibit HR-254). It was prepared by the Commission members, Roberge and Hargadon, and was reviewed by Durber. The conclusion contained within the report, and attested to by Durber, is that no weight should be placed on the Treasury Board reports. The Benchmark Review Committee's examination confirmed the JUMI evaluations with very few exceptions.

611. An earlier draft of Exhibit HR-254 was prepared by the two Commission members of the Committee and is dated June of 1991. That draft was introduced in the cross-examination of Durber as Exhibit R-140. There were two passages, at pp. 26-27, which were not included in the final report. These pages refer to sore-thumbing and difficulties experienced by the evaluation committees in the use of benchmarks. Durber removed these from the final report. It is his opinion these pages were not particularly relevant to what they [the Benchmark Review Committee] were doing... (Volume 159, p. 19790). Durber instructed that these pages be dropped from the final version. In his view they were interesting comments on difficulties encountered with benchmarks but did not add to what the Commission already knew. In his opinion, they were more instructional for use in future pay equity exercises.

612. Durber testified he asked both Roberge and Hargadon about the considerations raised on pages 26 and 27 of the original report, Exhibit R-140. Durber testified he was told that the purpose of these two pages was to comment upon lessons learned, and their own perceptions of the difficulties the Commission might encounter in fulfilling its observer role in future initiatives. The Commission would, as a result, be forewarned of the problems which occurred during the JUMI Study, including the difficulties with the rationales. Durber did not consider their comments as solid evidence, but more as useful material for future work of the Commission.

613. With respect to the report of the Benchmark Review Committee, Durber considered that the matters contained in pages 26 and 27 would come forward through Willis during the Tribunal hearings. Durber claimed the Commission had neither the resources nor the time to begin an investigation of the MEC process while preparing for its participation in these hearings.

614. The Tribunal did have the benefit of Brunet's testimony relating to pages 26 and 27 of Exhibit R-140, which appears in Volume 214, at p. 27852, lines 5 - 15:

I noticed that pages 26 and 27 made me smile when I saw them because, when I was working with Christine and Brian Hargadon, Jim Sadler was heading the study from the Northwest Territories. He would often come and see how things were going, and all that. Once we found out that he was going up there, we said, How about we share some information that we have, so that you can bring it up.

When I saw pages 26 and 27, a lot of that I had input in.

615. Brunet was under the impression she would be called upon to review and sign the report. In fact, she was not asked to do so, although she did receive a copy of the report. She testified that while the committee was doing its work, Jim Sadler, an employee of the Commission who was heading a pay equity study in the N.W.T., often came to see how the Benchmark Review Committee was functioning. The Benchmark Review Committee suggested that they share information with Sadler so he could take it with him to the N.W.T. study.

616. Both Brunet's understanding of the comments contained on pages 26 and 27, and Durber's opinion as to their usefulness, are corroborated in Exhibit R-141, a letter written by Sadler addressed to a union representative involved in a pay equity study in the N.W.T. This study is referred to as the Joint Equal Pay Study (JEPS) and was using a newer version of the Willis Plan. Some of Sadler's comments in that letter were based on discussions he had with members of the Commission's Committee. Those discussions corroborate both Durber's and Brunet's evidence about the Committee's perception of sharing this information with the Commission.

617. Durber's opinions and conclusions about Exhibits R-140 (Draft Report) and R-142 (Observer Notes) led him to decide not to introduce these documents as part of the Commission's case. The Tribunal hearing is in the nature of a public enquiry, and the Commission's role is to represent the public interest. Decisions about the relevance of documentation garnered by the Commission during its investigation of the s. 11 complaint are within the purview of the Commission. In circumstances such as these, however, the Commission's decision to exclude these documents from its case is open to criticism if the documents are found to be relevant and sensitive to the issue of reliability.

618. Before proceeding further, the Tribunal is of the view the reports in question, namely Exhibits R-140 and R-142, should have been introduced in their entirety as part of the Commission's case with accompanying explanations. The decision as to their usefulness ought to have been left with the Tribunal. The Commission's case would have been better served had the entire exhibits been entered in the first place.

619. During cross-examination Durber offered a further explanation as to his reasons for not interviewing Wisner. He conducted an ex post facto review of Wisner's rationales for purposes of what he described as clarification. Durber used both the committees' and Wisner's rationales to do this analysis. It involved a review of each difference and a determination of the extent to which these differences cancelled one another out. After Durber categorized the differences between the committee and the consultant, he looked at the numbers to determine whether the distribution of these differences was patterned or random.
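
The record does not set out the mechanics of Durber's analysis. Purely as a hypothetical illustration of one way to test whether the directions of such differences are patterned or random, a two-sided sign test can be applied to the counts of upward and downward differences; the counts in the example below are invented and do not come from the evidence.

    # Hypothetical illustration only: a two-sided binomial sign test of
    # whether the directions of consultant-committee differences depart
    # from what random, unbiased disagreement would produce. This is not
    # Durber's actual method, which is not detailed in the record.
    from math import comb

    def sign_test_p_value(n_up, n_down):
        # Under the null hypothesis, upward and downward differences are
        # equally likely (ties excluded), so counts follow Binomial(n, 0.5).
        n = n_up + n_down
        k = max(n_up, n_down)
        tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
        return min(1.0, 2 * tail)

    # Invented example: if 18 of 25 differences ran in the same direction,
    # the split would be unlikely to arise by chance alone.
    print(f"p = {sign_test_p_value(18, 7):.3f}")  # roughly 0.04

Under such a test, an even split is consistent with random disagreement, while a heavily one-sided split would suggest a pattern of the kind being looked for.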

620. Willis was asked to comment on Durber's analysis, which was based on the examination of rationales. Willis replied in Volume 208, at p. 26939, that he had trouble with Durber's conclusions. Willis doubts very much that bias could be recognized by looking at rationales. In Willis' opinion, bias is very subtle and not something that can be looked at on a job by job basis. Willis testified in Volume 208, at p. 26939, lines 8 to 13:

You have to look at a total pattern and, to me, it would be totally inappropriate to single out certain ones of those re- evaluations and say, We will discount those. I think you either take them all and look at them at their face value or you don't take any of them.

621. According to Willis, if his consultants are doing an evaluation during the course of the study, the reasons for the differences are very important as they will provide the consultants with some basis for retraining a committee. Willis recognizes there is always going to be some random variance and random disparity after the study is completed, and therefore he does not, at this stage, concern himself with the reasons. In the context of Durber's analysis, Willis said he always expects some differences between consultants and committees, but he did not see any value in attempting to use those differences to analyze whether or not there is a problem. Willis elaborates further in Volume 208, at p. 26944, lines 17 - 23:

A. What I have said or at least what I intended was that since bias is a very subtle thing, I think our only opportunity for examining the extent to which there is a different interpretation of male versus female jobs is by looking at the total results after the study has been completed.

622. The analysis done by Durber was presented in mathematical form, as numbers and tables, with conclusions about symmetry between numbers and whether these numbers were demonstrative of patterns. The Tribunal's view is that this analysis has a statistical component because of the particular methodology used by Durber. Without the assistance of a qualified statistical expert, we are unable to properly interpret Durber's analysis, which, therefore, must be disregarded.

623. In 1992, during the appearance of Willis before the Tribunal, Durber decided to further investigate the quality of job information contained in the questionnaires. Accordingly, he retained a researcher who had no experience in job evaluation but who had pretty good analytical ability for the purpose of examining a cross-section of the questionnaires. The cross-section included 63 benchmarks and 587 non-benchmarks, for a total of 650 questionnaires. The researcher did not appear before the Tribunal.

624. The researcher's task was to look at the information to assess completeness, consistency, legibility, and whether the safeguards had been followed and, finally, to determine if there was an indication each questionnaire had been validated by the employer's supervisor.

625. Durber's evidence is he discussed with the researcher some of the characteristics that could lead to deciding whether or not the questionnaires were complete. In this regard, Durber prepared some procedures and questions for the researcher. As background, the researcher was provided with the purpose of the job information, the process used during the study to collect and screen the information, as well as information for identifying basic data such as department, questionnaire number, occupational group and other such information.

626. This project took the researcher two months to complete. Durber and the researcher met every week to discuss problems. Durber personally reviewed any questionnaires where problems were encountered, which involved approximately 5 per cent of the questionnaires. Durber testified he closely supervised the researcher during the examination of benchmark questionnaires.

627. The following is a list of criteria used by the researcher in this exercise:

  1. Legibility - can the questionnaire be read?
  2. Language - was the questionnaire in French or English?
  3. Script - was the questionnaire typed or handwritten?
  4. Signature - was the questionnaire signed?
  5. Comments - did the supervisor comment?
  6. Completion - had all parts of the questionnaire been completed?
  7. Consistency - was the supervisor consistent with the incumbent?
  8. Notes - was there evidence of interviewer or reviewer notes?
  9. Facts - did the questionnaire contain fact as opposed to editorial comment?

628. The report entitled An Examination of the Quality of Questionnaire Information used by the Federal Pay Equity Study (Exhibit HR-245) contained both findings and conclusions about the completeness and accuracy of the job information. In the Tribunal's view, Durber is expressing, in the report and in oral evidence, the opinions of his researcher, which may or may not be well founded. Due to the researcher's lack of expertise in pay equity job evaluation, it is the Tribunal's conclusion it must reject any opinions contained in this report. There is, however, factual content in the report, not based on opinion, which in our view is helpful. These are listed as follows:

Findings:

- Required questions were answered 95% of the time.

- Supervisors provided signatures on just over 99% of questionnaires. In just over 96%, the supervisor commented, seeming to contradict incumbents about 9% of the time. In 95% of these contradictions, subsequent interviews clarified the work.

- In two-thirds of the files, interviews were carried out, with supplementary information provided. The investigator noted that the latter was frequently extensive...

- Legibility of the description in questionnaires was in all cases good.

Conclusions:

- There was a system for reviewing and assuring the completeness of the information about work in the Joint Initiative.

- There was a system for ensuring the accuracy of the job information...through supervisory review.

- Those involved in reading questionnaires made efforts...to obtain further information to improve their understanding...where the supervisor and incumbent appeared to disagree about the work. (Exhibit HR-245)

(ii). Sunter's Analysis

629. The Commission asked a former director of Statistics Canada, Alan Sunter, to examine the full set of data from the Wisner 222 and the Willis 300 and look for patterns relating to gender composition. The Commission also requested Sunter to assess the statistical significance of the formulae relating to possible gender bias used by the Treasury Board in its March, 1990 methodology paper.

630. Sunter, a qualified statistical expert, had no background knowledge of pay equity prior to his involvement with the JUMI Study results. He became involved in the analysis of the JUMI data as a result of a request by Durber on April 6, 1990 to attend the workshop scheduled for April 9, 1990. The workshop was to focus on the Treasury Board methodology document (Exhibit HR-185). Sunter testified he was unable to contribute in a constructive way to the workshop and simply listened to the discussions. After the workshop, he met with Durber and began to realize there had been a large study addressing the question of pay equity between male- and female-dominated occupational groups. He also learned there had been subsequent re-evaluations of samples taken from the evaluations. This led to the question of whether there was gender bias in the evaluations, a matter of concern to the Commission.

631. The statistical evidence concerning the question of gender bias in the evaluation results was provided by Sunter and Shillington, both experts in statistics. Shillington was not employed by the Commission to do any statistical analysis of the results. However, because of Shillington's involvement in the IRR testing and other aspects of the JUMI Study, he testified before the Tribunal. During his appearance, he was requested to provide opinions on Sunter's analysis.

632. Sunter was asked specifically by Durber to perform three analyses. Firstly, he was to look at the question of gender bias in the re-evaluations and, for this purpose, he was given two sets of data, the Wisner 222 re-evaluations and the Willis 300 re-evaluations. Secondly, he was provided with the whole data set from the JUMI Study and was asked to examine the question of equal pay for work of equal value between male- and female-dominated occupational groups. Thirdly, he was given the Treasury Board methodology document (Exhibit HR-185) and asked to examine specifically the Treasury Board methodology and offer whatever criticism seemed appropriate.

633. Sunter's interpretation of the term gender bias used in his analysis of the data is provided in Volume 102, at p. 12275, lines 3 - 17:

A. I supposed gender bias to mean that there would be some systematic tendency of the evaluation committees to underscore positions from male-dominated occupations or to overscore positions from female-dominated occupations or perhaps both of those things.

Q. What do you mean by systematic tendency?

A. At this point, of course, I didn't know, but since the term bias had been used, then I assumed that bias would have to mean a consistent tendency that would display itself in some kind of recognizable pattern in the data, that I would see that when I looked at the data and performed some kind of analysis on the data.

634. Willis testified that a consultant trained and experienced in the application of the evaluation system, and possessing an objective viewpoint, can be expected to evaluate consistently and without a predilection towards either male- or female-dominated jobs, or towards either the management or union side. Willis asserts consultant evaluations are useful in examining the consistency of committee evaluations and, more importantly, in assessing any pattern of bias which may have occurred. Willis' view is that his consultants' experience, background, intent and philosophy have always been not to favour one side or the other but to walk the middle road. Willis' objective in doing the re-evaluations was to identify whether or not there was a gender-based pattern or difference in treatment between male- and female-dominated jobs. Willis referred to the differences between consultant and committee as disparities. It is within this framework that Sunter began to examine the data.

635. Sunter testified statisticians collect and analyze data under two quite distinct concepts: one he refers to as descriptive and the other as analytic. In his view, the distinction between these two broad areas of enquiry is important in respect of the work he did and his interpretation of the JUMI data.

636. Sunter compared the two re-evaluation data sets, the Wisner 222 and the Willis 300, against the committee evaluations of the same jobs to see whether, statistically, there was a patterned difference in the manner in which evaluators treated different types of positions and, if so, to measure the size of the differences he found.

637. Sunter performed a statistical test known as a t-test to measure whether there was a difference between the treatment of male and female questionnaires by consultants and committees using only the Wisner 222, then using only the Willis 300 and then pooling the two data sets together.

638. According to Shillington, who also performed t-tests in his IRR analysis, the t-test is a statistical test that summarizes information about how far two averages are from each other. In this case, the statistician looks at the male average and the female average to see if there is evidence the evaluators are treating male and female questionnaires differently. He states in Volume 86, at p. 10668, the t-test hinges on three things:

  1. How far apart are the two averages? The further apart the two averages are, the more likely the test is to conclude they come from different populations; that is, that the evaluator is treating male and female questionnaires differently.
  2. The larger the sample size, the more likely the test is to find significant evidence that the two populations are being treated differently.
  3. The more concentrated the values are, the easier it is to say that this is a true pattern.

639. If the difference in the average scores is substantial then, according to Shillington, it is more likely you will get a significant result in statistical terms of measurement. A significant difference reflects a true difference between two groups and demonstrates the result most likely could not have happened by chance. Statistical significance in this context pertains to mathematical probabilities and whether the numbers are unlikely to have happened by chance. (Volume 87, p. 10673).

640. Sunter testified about the limitations of the t-test. One such limitation is that when the sample is very large, even if the difference is minuscule, the t-test will find it to be significant. In other words, the t-test rejects the null hypothesis of no difference when the sample is large enough. Another limitation is that the t-test is not attentive to differences of practical importance; it simply follows a mathematical routine of testing the null hypothesis of no difference against the alternative hypothesis that a difference exists.
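
The behaviour Sunter describes can be illustrated with a minimal sketch in Python, using scipy's standard two-sample t-test on entirely hypothetical scores; none of the numbers below come from the JUMI data:

```python
# Illustrative sketch only: hypothetical scores standing in for the
# committee and consultant evaluations; these are not the JUMI data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def disparity_p_value(n, shift=2.0, spread=25.0):
    """Two-sample t-test on simulated evaluation scores whose true
    means differ by `shift` points; returns the p-value."""
    committee = rng.normal(500.0, spread, n)
    consultant = rng.normal(500.0 + shift, spread, n)
    t_stat, p_value = stats.ttest_ind(committee, consultant)
    return p_value

# The same small mean difference: unremarkable in a small sample, but
# "statistically significant" once the sample is large enough -- the
# limitation Sunter described.
print(disparity_p_value(n=30))      # typically p > 0.05
print(disparity_p_value(n=30000))   # typically p < 0.05
```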

641. Sunter found the size of the difference in the treatment of positions from male- and female-dominated occupational groups by committees and consultants was 2.3 per cent in the pooled data. He performed further t-tests to determine if the consultants and the committees differed in their treatment of female-dominated positions. The results showed that, for positions from female-dominated occupational groups, there was no statistically significant difference between the manner in which the consultants and the committees rated the positions, whether one compares the committees to the Wisner 222, the Willis 300 or the pooled consultant re-evaluations (522). The size of the non-significant difference in the treatment of female-dominated positions for the pooled data was 0.05 per cent. For the Wisner 222, this difference was 0.02 per cent and for the Willis 300, it was 0.07 per cent (Exhibit HR-191).

642. Sunter then performed the same t-test on the male-dominated positions. He determined that the consultant and committee ratings were significantly different for positions from male-dominated occupational groups. The size of the difference between the committee and the consultant treatment of positions from male-dominated occupational groups depended on which of the consultant re-evaluations, the Wisner 222 or the Willis 300, was used as a basis for comparison with the committee results. It also depended on whether the committee or the consultants were placed in the denominator of the equation. Sunter testified that, since there is no true value for any given questionnaire, there has to be some standard by which to compare committee and consultant evaluations. When it is contended the committee is biased relative to the consultant, Sunter states, the consultant is taken as the baseline or standard of comparison and the consultant scores are placed in the denominator of the equation to determine any difference in treatment.

643. The size of the difference in the treatment of male-dominated positions for the pooled consultant re-evaluations (522) was found to be 1.8 per cent, when the consultant evaluations are used as the denominator. For the Wisner 222, this difference was 2.5 per cent and for the Willis 300, it was 1.3 per cent (Exhibit HR-191).
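
Percentage figures of this kind follow the convention described in paragraph 642, with the consultant average as the baseline in the denominator. A minimal sketch, using hypothetical averages rather than the actual JUMI figures:

```python
# Minimal sketch of the convention in paragraph 642: the consultant is
# the standard of comparison, so consultant scores go in the
# denominator. The averages below are hypothetical, not the record's.
def relative_difference(committee_mean, consultant_mean):
    """Percentage by which the committee average differs from the
    consultant average, with the consultant as baseline."""
    return 100.0 * (committee_mean - consultant_mean) / consultant_mean

print(relative_difference(512.5, 500.0))  # -> 2.5 (per cent)
```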

644. Having found the consultant and committee ratings were significantly different for positions from male-dominated occupational groups, he testified the size of the difference in the treatment of male-dominated positions was twice as great in the Wisner 222, a difference of 2.5 per cent, as in the Willis 300, a difference of 1.3 per cent.

645. Sunter preferred using the Wisner and Willis pooled result (522) as more reliable in establishing the size of the difference between the committee and the consultants rather than using either the Wisner 222 or the Willis 300 independently. This difference is stated as 2.3 per cent.

646. As to whether there was any pattern in the differences between the committees and the consultants, Sunter found in over half of the evaluations between the consultants and the committees there was no difference at all. In separating the data, he found in about one-third of the comparisons between the Wisner 222 and the committees there was no difference and in about two-thirds of the comparisons between the Willis 300 and the committees there was no difference. He found it inconceivable that, given this number of agreements, there was a consistent pattern of discrimination.

647. Sunter testified that, having found differences between the committee and the consultant in the treatment of male questionnaires, he would not conclude the committee was biased or that the consultant was biased. In his opinion, the only conclusion to draw was that the committee and the consultant appear to have a bias relative to each other with respect to male evaluations. Sunter went on to say you may call this a relative bias, or you may attach the term gender bias to it. However, he had difficulty with the term gender bias because, without further testing, one could not conclude whose gender bias it is and whether the bias is merely incidental to gender or whether it is contingent on something else, which itself is incidental to gender.

648. The crucial question at this juncture in Sunter's evidence is whether the t-test results indicate a systematic pattern in the disparities or whether the differences are merely random. The Commission submits systematic patterns of gender differences must, by definition, be differences which are demonstrative of a system at work, something regular or methodical. (Para. 199 of written submissions).

649. The Employer submits a different treatment of male and female questionnaires is indicated by a pattern in the disparities such that the evaluation of female jobs systematically differs from the evaluation of male jobs. (Para. 289 of written submissions). The Employer's interpretation of pattern can be better understood in the following exchange with Sunter which appears in Volume 217, at p. 28243, line 8 to p. 28244, line 1:

Q. Mr. Sunter, I am just talking about the chi-square and the T test when you split the questionnaires by male and female. There was a pattern there.

A. There is a difference in the pattern. I wouldn't use the term pattern. There is a difference. We have acknowledged the difference. We are trying to explain the difference.

Q. But there is a difference in treatment, let's put it that way.

A. There is a difference in the average -- I don't like the term treatment, I must say, because it implies some physical process. There is a difference in the differences between consultant and committee scores. You may use the word treatment for that if you would like, but I prefer not to use the word treatment.

650. Sunter then attempted to explain and understand the differences between committees and consultants by fitting models to the data, which he says is necessary in order to attach meaning to the notion of gender bias. It is in this area of his analysis that Sunter emphasizes the distinction between the descriptive use of statistics and the analytic use, the latter involving his adaptation of models to data. Sunter testified that if gender bias is present in the results, a statistician expects to see some degree of consistency across evaluations which is somehow related to gender. Therefore, he tested the data for consistency by using models to illustrate how gender bias might operate.

651. Sunter examined three plausible models to explain how gender bias might affect the committee's results. One such model, which he termed additive, he described as a constant addition by the committee to the consultant scores or a constant subtraction by the consultant from the committee scores. Sunter eventually disposed of all of these models because the data did not support such configurations.
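
The additive model he describes is easy to state concretely: committee score equals consultant score plus a constant. A small sketch, on simulated scores, of how such a model can be checked against data:

```python
# Sketch of checking an "additive" model on simulated scores: does a
# single constant c make committee ~= consultant + c across the data?
# These numbers are invented; the relationship here is deliberately
# not additive, so the check should fail.
import numpy as np

rng = np.random.default_rng(4)
consultant = rng.uniform(100, 800, 200)
committee = 1.02 * consultant + rng.normal(0, 10, 200)

c = np.mean(committee - consultant)   # best-fitting constant
resid = committee - (consultant + c)
# Under an additive model the residuals would be patternless noise; a
# trend of residuals with score level argues against the model.
print(np.corrcoef(consultant, resid)[0, 1])  # clearly non-zero here
```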

652. Sunter again tested the differences between the committee and the consultant by using chi-square tests. He applied this test to the Wisner 222, the Willis 300 and the pooled data, and all of these tests indicated statistically significant results. Sunter nevertheless criticized the usefulness of chi-square analysis in these circumstances. In his opinion, the chi-square tests are not helpful in understanding the difference between the treatment of male- and female-dominated jobs by the consultants and the committees. His concern about the chi-square test is that it measures the frequency of the differences rather than their size, which is what the t-test measures. Therefore, significant results from the chi-square test can be misleading about the real difference between the numbers. Accordingly, he preferred to use t-tests, which showed a difference of 2.3 per cent from the pooled data as best representing the size of the difference between committees and consultants.
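
Sunter's point about frequency versus size can be illustrated with a short sketch: a chi-square test on a table counting how often committee and consultant scores differed says nothing about how large those differences were. The counts below are invented for illustration:

```python
# Sketch of Sunter's criticism: the chi-square test counts how often
# scores differed, not by how much. The 2x2 table is invented.
from scipy.stats import chi2_contingency

#            differed  agreed
table = [[180, 120],   # male-dominated questionnaires
         [120, 180]]   # female-dominated questionnaires

chi2, p, dof, expected = chi2_contingency(table)
# A small p-value here says the *frequency* of disagreement differs by
# gender; it says nothing about whether the disagreements were 1 point
# or 100 points in size.
print(chi2, p)
```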

653. Having seen no difference between the consultants and the committees, on average, for female-dominated occupations, Sunter went on to explore the idea of gender bias as an unconscious discrimination for or against occupational groups by gender. He suggested that the way gender bias might work in this context is that there are certain underlying male or female characteristics, and that occupational groups having more males will tend to show this pattern of discrimination rather strongly. Having tested for that, he did not find any such correlation between the degree of maleness of an occupation and a pattern of relative differences. Sunter concluded from his analysis that he was unable to find any consistent pattern of differences, and that there was no plausible or conclusive explanation for the differences between the committees and the consultants.

654. Sunter concluded from his analysis that, without a level of consistency in the incidence of differences along gender-differentiated lines between committees and consultants, he was unable to conclude the difference was attributable to gender bias. He says in Volume 102, at p. 12277, line 25 to p. 12279, line 1:

A. My general conclusion on the question of gender bias -- mind you, I still don't know what gender bias is, you understand, but my general conclusion on this was as follows. There was a slight difference between -- there was virtually no difference between the consultants and the committee on positions from female-dominated occupations. This could be put aside.

On positions from male-dominated occupations, there is indeed a difference, not large but indeed a statistically significant difference, between committee evaluations and consultant evaluations. This does not lead me to the conclusion, however, that there is gender bias, putting aside for the moment that I still don't quite know what I mean by gender bias because there are other possible explanations...

...

A. In order to conclude that this was gender bias, I would have to find some kind of consistency in the observations. I was unable to find the kind of consistency that would enable me to reach that conclusion.

655. He also found the lack of consistency in the differences and the absence of an alternative plausible model of gender bias did not justify adjusting committee scores in the manner adopted by Treasury Board in their 1990 methodology paper.

656. Sunter returned to the question of gender bias and explored other factors which occurred to him and were not pursued in his initial investigation. He explored factors that might be associated in some way with gender and that, therefore, could be considered possible causes, other than gender bias, for the difference in the scores. He examined other characteristics, such as perceived salary, nature of work and size of group, which he thought might be correlated with gender. Simply expressed, in ordinary language, Sunter explored the degree of association between the differences and some of the other characteristics of the data.

657. One characteristic which Sunter noted between male and female questionnaires is that the data showed female questionnaires coming from a small number of relatively large occupational groups. Male questionnaires, on the other hand, came from a large number of relatively small occupational groups. Sunter postulated evaluators might be more familiar with the female-dominated occupations, which included jobs such as clerks, secretaries and nurses, than with the male-dominated occupations, which included air traffic controllers, defence research scientists, patent examiners, etc. He divided the databases according to size of the group and, using group size as a proxy for familiarity with the type of work, he compared the differences between the consultants and the committees for the Wisner 222 and the Willis 300 data. Although this analysis did not yield statistically significant results, Sunter believes it did demonstrate a strong association between size of group and the pattern of differences between committee and consultants.

658. Another characteristic he noted which differentiated between male and female questionnaires is the relative distribution of positions from male- and female-dominated occupational groups across the range of evaluation points. He found 75 per cent of questionnaires from female-dominated occupational groups fell below a certain point value while only 25 per cent of positions from male-dominated occupational groups fell below the same value. He hypothesized any bias that relates to point distribution, such as a bias in favour of placement in the hierarchy of jobs, a bias in favour of or against managerial or supervisory positions, or a bias in favour of the skills acquired in post-secondary education, could look like a gender bias.

659. Sunter then performed several comparisons to see if the differences between committee evaluations and consultant re-evaluations were associated with the relative distribution of questionnaires in the high or low point range. Again, in this comparison, the results did not demonstrate a statistically significant difference. He concluded, however, they did show an association between high and low points and the differences between committees and consultants when split along gender lines. Sunter referred to this bias as a point bias or value bias; that is, the higher the value of the job, the more likely there is to be a difference between the committee-assigned score and the consultant-assigned score.

660. Willis responded to Sunter's evidence about value bias during his second appearance before the Tribunal, which followed Sunter's testimony. Willis said he would like to see further analysis as to whether the differences between the committees and the consultants might be associated with value bias. Willis wanted to know whether, if 10 per cent of the high evaluation scores were removed from the database, the extent of the differences between the consultants and the committees would be reduced. On this point, Willis says in Volume 211, at p. 27491, line 19 to p. 27492, line 4:

I had said I would rely on a statistician. This task was not given to me, but if it had been given to me and my statistician had said there is an appearance of bias here and it doesn't necessarily represent bias, I would say Okay, let's take those top ones out and let's see what it looks like then. Maybe it will be less than 1.8 per cent and maybe it won't. Since we are dealing with several million dollars, my suggestion would be that if it doesn't change that percentage, then I would tend to adjust.

661. As a result of Willis' comments, Sunter performed an additional analysis to determine whether the differences between the consultants and the committees could be reduced by a value effect. This analysis, termed the value effect analysis, was introduced by the Commission in response to the question raised by Willis. Sunter defined value effect in Volume 216, at p. 28049, line 23 to p. 28050, line 1:

A. The value effect would be some systematic tendency for differences between consultant and committee to show up in association with increases in value of the job.

662. Sunter's further statistical work explored how much of the difference between committee evaluations and consultant re-evaluations could be attributable to value bias. By this he meant the difference between how the committees and the consultants treated high and low point questionnaires. Sunter's analysis included statistical methods for standardizing the data because of what he described as a distribution problem. Because of this problem, he could not merely discard 10 or 20 per cent of the top end scores as suggested by Willis. On the basis of this analysis, Sunter concluded that at least one half of the apparent gender differences between the committees and the consultants is immediately accounted for by differences in value distribution.

663. Relying on the analysis he performed (Exhibit HR-265), Sunter testified that, once he removed the value effect, the overall difference of 2.3 per cent between the consultants and the committees was reduced by 1.2 per cent.
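
One simple way to picture this kind of adjustment, sketched below on simulated data, is to regress the committee/consultant disparity on the evaluation points and re-measure the gender gap in the residuals. This is only an illustration of the idea, not a reproduction of Sunter's actual standardization in Exhibit HR-265:

```python
# Illustration only, on simulated data; not Sunter's actual
# standardization. Maleness is generated so that it rises with job
# value, and the disparity here is driven by value alone.
import numpy as np

rng = np.random.default_rng(1)
n = 500
value = rng.uniform(100, 900, n)                       # evaluation points
male = (rng.uniform(0, 900, n) < value).astype(float)  # correlated with value
disparity = 0.004 * value + rng.normal(0, 1.0, n)      # value-driven

raw_gap = disparity[male == 1].mean() - disparity[male == 0].mean()

# Remove the linear value effect, then re-measure the gender gap.
slope, intercept = np.polyfit(value, disparity, 1)
residual = disparity - (slope * value + intercept)
adjusted_gap = residual[male == 1].mean() - residual[male == 0].mean()

print(raw_gap, adjusted_gap)  # the gap shrinks once value is removed
```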

664. Shillington expressed doubt on whether, statistically or otherwise, one can separate two data analysis issues: first, whether a pattern is related to gender; and second, whether the pattern is related to the scores being high or low. Shillington explains this problem in Volume 131, at p. 16045, line 23 to p. 16046, line 15:

The regressions were done in a way to try to see if there was a relationship between the differences between the consultants and the committee in gender.

It is also possible that any differences that might have existed between the consultant and the committee scores were not directly related to gender but perhaps were related to high values versus low values. This has been talked about here.

The confounding is introduced because there is a strong trend in the data for the male questionnaires to all have high values relative to the female and the female questionnaires have a fair tendency to come from the lower end of the spectrum, which means you cannot separate those two data analysis questions, or it is difficult to separate them.

And also in Volume 131, at p. 16048, line 16 to p. 16049, line 11:

In this circumstance, back to the analysis of the Willis scores and the possible adjustment, we have a situation which -- to the extent that there is a pattern here, if someone came and said this is possibly not due to gender, maleness or femaleness, but rather could be due to professionalization or some questionnaires having much higher values than others, you would have a problem extracting those two separate hypotheses from the analysis because you have a situation in which the males predominantly had high values, the females predominantly had low values. So maleness is confounded with high and low values.

That is reflected in the distribution. That is why it is a distribution question. The distribution of the Willis scores for the males tended to be quite a bit higher than the distribution of the Willis scores for the females. It is a confounding issue. That is why in interpreting it you are going to have to be cautious about that.

And further on this point, he says in Volume 131, at p. 16051, line 12 to p. 16052, line 5:

THE CHAIRPERSON: ...But just looking at these and what you can say about what they describe in terms of their distribution, what you can interpret from that is that the males tend to be high, the females tend to be low, but you can't, because of this confounding effect, you can't really interpret anything else with certainty. Is that ---

THE WITNESS: That is right. You have to be very careful when interpreting the results because you have to keep in mind that if somebody came with an alternative explanation for the data and the explanation was that this had nothing to do with gender, that this was high score/low score effects, you have collected your data in such a way that most of the high scores are males and most of the low scores are females. So they are two equally valid explanations for the same data.

665. While Sunter acknowledged difficulties in unconfounding data, he said he was able to isolate or distinguish from the disparities a portion that could be attributed to the different value distributions of the male and female questionnaires. Sunter maintained he did not find it difficult to make a differentiation between gender and value and that he could unconfound the data to this extent. Under cross-examination by Respondent Counsel, he was not prepared to agree that gender is a proxy for value or that value is a proxy for gender. He did agree, however, there are many factors correlated with gender, and that if the difference between committees and consultants stems from some other causal factor, which itself is associated with gender, then he could never determine how much of the difference would be attributable to gender bias. (Volume 217, p. 28247).
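
The confounding Shillington describes can be sketched with simulated data in which maleness and high point values travel together: a gender-only model and a value-only model then explain much the same portion of the disparities, so the data alone cannot choose between the two explanations. Everything below is invented:

```python
# Simulated sketch of confounding: maleness and high values travel
# together, so a gender-only model and a value-only model explain
# much the same thing. Invented data.
import numpy as np

rng = np.random.default_rng(2)
n = 500
male = rng.integers(0, 2, n).astype(float)
# male questionnaires sit mostly at the high end of the point range
value = np.where(male == 1, rng.normal(600, 80, n), rng.normal(300, 80, n))
disparity = 0.005 * value + rng.normal(0, 1.0, n)  # truly value-driven

def r_squared(x, y):
    """R^2 of a one-variable least-squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return 1.0 - resid.var() / y.var()

print(r_squared(value, disparity))  # value-only explanation
print(r_squared(male, disparity))   # gender-only explanation fits too
```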

666. Sunter believes the question of association of the differences in scores with other characteristics in the data becomes important if there is going to be some adjustment in the committee results to eliminate gender bias. In this context, Sunter believes it is important to demonstrate the magnitude of gender bias, how it operates and how it can be adjusted out of the actual data. Sunter believes the association of the differences in scores with value bias becomes vital at this stage.

667. Sunter concludes the whole question of association with other characteristics is intimately connected with the process of adjustment. Accordingly, Sunter found it difficult to separate the question of how to analyze the data from the question of what you wish to do with the results.

668. Sunter was aware of the Treasury Board's methodology paper in which the Treasury Board used and adjusted the Wisner 222 data when calculating the equalization payments of January 1990. Sunter refers to this adjustment as an across-the-board adjustment. He describes what he means by an across-the-board adjustment of evaluation scores in Volume 103, at p. 12426, lines 16 - 22:

What I do, if I am about to make an across-the-board adjustment, let us say, of values assigned to questionnaires from male-dominated occupations, would be to say, Let us increase all of these, all of them, by four per cent without exception. That is what I mean by an across-the-board adjustment.
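
An across-the-board adjustment, in the sense Sunter defines it, is a single uniform percentage applied to every score without exception. A trivial sketch, using his 4 per cent example with invented scores:

```python
# Trivial sketch of an across-the-board adjustment as Sunter defines
# it: one uniform percentage applied to every score, no exceptions.
# The 4 per cent figure is his example; the scores are invented.
def across_the_board(scores, pct=4.0):
    """Raise every score by the same percentage."""
    return [s * (1 + pct / 100.0) for s in scores]

print(across_the_board([250, 400, 615]))  # -> [260.0, 416.0, 639.6]
```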

669. In Sunter's view an across-the-board adjustment requires some consistency in the pattern of gender bias, and an across-the-board adjustment can only be made on the basis of an across-the-board bias. He explains this in Volume 103, at p. 12427, lines 8 - 10:

...these are two sides of the same coin. If I cannot find the one, it seems to me that I cannot be justified in doing the other.

670. According to Sunter, the Employer performed a regression analysis, another form of statistical measure, on the Wisner 222 data as described in their methodology paper (Exhibit HR-185). The regression analysis conducted by the Employer assessed differences in treatment between committees and the Wisner 222, and was the basis upon which the Employer calculated the unilateral adjustments to the scores in January, 1990. Sunter's critique of the Treasury Board's approach included an analysis of the overlapping confidence regions of the regression lines representing scores for male- and female-dominated jobs.

671. It was Sunter's opinion the Treasury Board's regressions should not have been used to adjust the scores from the female-dominated occupational groups at all. With respect to the male data, the regression lines comparing the Wisner 222 re-evaluations and the committee scores were significantly different over the second half of the point range of scores. Sunter found that, given the overlap of the male and female confidence regions up to the 250 Willis point mark, there is no strong evidence the consultants and the committees differed significantly or consistently below 250 Willis points.

672. Sunter concluded from his analysis of the regression lines there appeared to be no difference between the consultants and the committees for at least three-quarters of the female questionnaires. Accordingly, he found no justification in the Treasury Board regression lines for making relative adjustments to all of the male and female questionnaires.

673. Shillington, under cross-examination by Respondent Counsel, indicated he did not have any problems with the way Sunter conducted his analysis of the Treasury Board's adjustment methodology. He was of the opinion Sunter had drawn a reasonable conclusion from his analysis. (Volume 136, pp. 16741-42).

674. The Tribunal did not hear any expert evidence concerning Treasury Board's methodology of adjusting scores, other than what was provided by Sunter and Shillington about their understanding of the methodology contained in Exhibit HR-185.

675. Sunter testified regression analysis is an unsuitable statistical tool for identifying differences in evaluation scores between Wisner and the committees. The regression equations, in his estimation, do not provide support for the Treasury Board's adjustment of female questionnaire scores downward, averaging 3 per cent overall, and male questionnaire scores upward, averaging 4 per cent overall. In Sunter's opinion, which is supported by Exhibit HR-213, the regressions predict, for the first three-quarters of the female questionnaires, either an increase in the female questionnaire scores or no change at all.
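
The kind of comparison Sunter made can be sketched as follows: fit separate regressions of consultant re-evaluations on committee scores for male and female questionnaires, compute confidence bands for the fitted lines, and see where the bands overlap. The data and model below are simulated, not the computation in Exhibit HR-185:

```python
# Hedged sketch, on simulated data, of comparing confidence regions of
# two regression lines; this is not Exhibit HR-185's computation.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)

def fitted_band(committee, consultant, grid):
    """OLS of consultant scores on committee scores; returns the 95%
    confidence band for the fitted line over `grid`."""
    res = sm.OLS(consultant, sm.add_constant(committee)).fit()
    return res.get_prediction(sm.add_constant(grid)).conf_int()

grid = np.linspace(100, 800, 50)
fem_committee = rng.uniform(100, 500, 120)
fem_consultant = fem_committee + rng.normal(0, 15, 120)
male_committee = rng.uniform(200, 800, 120)
male_consultant = 1.02 * male_committee + rng.normal(0, 15, 120)

fem_band = fitted_band(fem_committee, fem_consultant, grid)
male_band = fitted_band(male_committee, male_consultant, grid)

# Where the bands overlap, the data do not show the lines differ.
overlap = (male_band[:, 0] < fem_band[:, 1]) & (fem_band[:, 0] < male_band[:, 1])
print(overlap)
```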

676. As to the three areas Sunter was asked to review at the request of the Commission, his conclusions on the gender bias analysis in the first two areas are: (i) there was nowhere near the level of consistency in the incidence of differences along gender-differentiated lines which would enable him to conclude there was gender bias (Sunter testified this is not to say there is no gender bias, only that one cannot conclude there is); and (ii) a review of the Treasury Board methodology, based on that finding, leads him to conclude there is no basis on which the Treasury Board could have justified any adjustment of the committee scores. The third aspect, which deals with an analysis of the differences in compensation between male- and female-dominated occupational groups, is not in issue at this stage of our decision.

F. ROLE OF CONSULTANTS IN RE-EVALUATIONS

677. Both statistical experts testified under cross-examination that consultant scores can be used as a reference point to compare committee and consultant scores on the assumption the consultant scores are free of gender-related bias. This is a term introduced by Respondent Counsel to describe a bias related not to gender itself but to some other characteristic which is itself related to gender.

678. Both statistical experts expressed the opinion they preferred committee scores over consultant scores. Shillington, in particular, found it difficult to accept that any individual could be free of gender related bias and he says the following in Volume 139, at p. 17084, line 4 to p. 17085, line 2:

A. I think that is more in the line of a decision that could be made. You have indicated that the issue of gender-related bias is the area of concern and not being as concerned as to whether or not it was directly related to gender or not. So I think deciding not to be concerned with the reason that the gender-related bias, if there is evidence of that, is present -- that is a decision.

If that sentence is to be interpreted to mean if you decide that you don't care for the reason, then you don't need to look for it, you are right. But I certainly never -- several times in testimony you asked me to assume that Mr. Wisner was without gender-related bias and I more than once said How can that be. How can someone be so free of thoughts about high score/low score, dirty work/clean work. How could this person be equally familiar with all jobs, but you asked me to assume that.

So, I am not sure that the sentence the way it is presented there is a fair or complete summary of my opinion about this, and I certainly can't speak for Mr. Sunter.

679. The position of the Employer essentially is that the consultants' re-evaluations are only used in the statistical analysis as a point of reference for determining whether there is a pattern of different treatment of male and female questionnaires by the committees. Willis testified the consultant scores are not to be substituted for committee scores; therefore, the Employer submits, using the consultants' re-evaluations as a reference point does not mean the consultant re-evaluations are to be preferred to the committees', because there is no substitution of scores. However, the Employer contends, for purposes of using consultant re-evaluations to determine a pattern of different treatment, the Tribunal may prefer the consultants' relative treatment of male and female questionnaires without preferring their scores on any one questionnaire. (Respondent's written submissions - paras. 319 and 320).

680. Shillington expressed the opinion that, in using the consultant scores as a reference point, an assumption had to be made that the consultant scores were to be preferred to the committees'. He gives the following response in Volume 136, at p. 16692, line 16 to p. 16693, line 15:

Q. When we are using the consultants as a reference point only, we are not saying that we prefer the consultant's score on any one questionnaire over the committee score. We are only making the assumption that the consultant scores across the board are free from gender-related bias.

A. But that you are not preferring them?

Q. But that we are not preferring them. So, we won't take the score on any one questionnaire and say the consultant scores are better. That's not a necessary assumption.

A. But I still think you have to end up assuming they are better and the example again is when I used -- suppose that the consultant didn't look at the questionnaires at all and the consultants just wrote down daytime temperatures, blood pressure, whatever. Right? They would certainly not be preferred and they certainly would not exhibit a gender preference if they just ignored the questionnaires totally. So, I think you do have to assume that the consultant scores are to be preferred.

681. Sunter testified the committees should be preferred to the consultants for four reasons. His first reason is based on his own experience in the field of statistics, which led him to conclude committees often apply a system better than the consultant who developed it. His remaining three reasons for supporting committee evaluations over consultant evaluations are based on his analysis of the data. One of his analyses tested for consistency between Wisner and the Gang of Four.

682. Sunter tested for consistency between Wisner and the Gang of Four by performing statistical tests such as t-tests and chi-square analysis. The results he obtained confirmed, in his mind, that Wisner and the Gang of Four differed among themselves. Sunter's conclusion was that if the consultants cannot agree among themselves, it cannot be the case that the consultant is always right. His analysis led him to conclude the consultants were not consistent among themselves and on this basis the committees should be preferred.

683. When Sunter was called as a reply witness by the Commission in November, 1994, he testified he had undertaken a further analysis on the question of the relative reliability of the committees and the consultants. Sunter also used standard statistical measures, in the form of regression analysis, to support the use of the committees as a point of reference in any analysis of gender bias. Sunter formed regression line comparisons using two sets of the data, the MEC scores and all the scores on which the committees and the consultants agreed, which led him to conclude any notion of committee bias for male job evaluations could not be sustained.

684. With the exception of Sunter's further analysis given in reply evidence, the remainder of Sunter's analyses were commented on by Shillington. Shillington concurred with Sunter's statistical conclusions, with the exception of one analysis, namely, Sunter's variance-covariance analysis. Shillington had an opportunity to meet with Sunter to discuss this analysis. Having had that opportunity, Shillington continued to maintain he had problems with drawing, from the variance-covariance analysis, the conclusion that the consultant is to be preferred to the committee. Shillington offers the following explanation in Volume 133, at p. 16306, lines 8 - 22:

So, I would have a difficult time believing that the data can help you unravel that -- that the data can actually help you decide that one rater is preferable to the other, unless you had a third set of numbers which you believe to be the correct values.

So, I look at the models and I say the models look reasonable and, yes, it's clear that the correlation matrix in one case is closer to the observed data than the correlation matrix in the other case, but even after discussing this, I have to step back and say: This may be true, but how can the data help you unravel which rater is better if you have no third set of numbers, which is the correct values?

685. Shillington went on to say his opinion on this aspect of Sunter's testimony did not detract from his approval of Sunter's analysis on the issue of gender bias. He responds as follows in Volume 133, at p. 16306, line 23 to p. 16307, line 23:

Q. Having had that opportunity to discuss this matter with Mr. Sunter and standing by your opinion, how does this opinion affect your opinion with regards to his approaches taken that we have seen summarized in HR-184 that deal with the gender bias issue?

A. This was one piece of Mr. Sunter's evidence, this was one piece of his argument for not preferring the consultants to the committees and there are other pieces to that argument. I don't have problems with the other parts that I have seen and I have indicated to -- I have given evidence on the other part, so I don't have problems with those pieces of evidence.

In general, despite the fact that I disagree with this part of his testimony, I don't have problems with the way he has handled the committees versus the consultants, even though I disagree with this particular step in his argument.

Q. I wasn't just referring to the committees versus the consultants, I was referring to the whole gender bias picture, all the other testing in HR-156.

A. I am restating that those analyses don't cause me a problem, no.

686. Shillington was asked to comment on Sunter's inclination to prefer committees to consultants, not for any statistical reason, but rather from Sunter's perspective that a decision of a group of individuals was preferable to a decision by an individual who may have had more advanced technical training. Shillington shared Sunter's opinion and indicated he too preferred the consensus of seven people chosen in a balanced way to one well-trained technical expert, at least on an issue like pay equity.

687. Both statisticians, Sunter and Shillington, agreed and informed the Tribunal that if we plan to use Sunter's t-test results to make adjustments to the evaluation scores of the committees, then the consultant scores are no longer simply a reference point but are, in effect, being preferred to the committee scores. In this context, the statisticians are of the opinion the consultant scores must be deemed to be free of gender bias and gender-related bias, before any adjustments are made to the committee scores.

688. Shillington testified the basis for his opinion was not statistical, but based on scientific reasoning and logic. His response is found in Volume 136, at p. 16706, line 14 to p. 16707, line 6:

A. I will leave it to you people to debate whether or not it's statistical. The question is if you are asking that you could use as a reference point for assessing gender-related preference someone who was consistent and unbiased, that wouldn't imply that you are preferring those scores. I think that's the nub of the question here.

Q. That's right.

A. I am having problems with that because I just don't see the logic to it. I'm saying it's scientific reasoning. To me it's logic.

Q. Can I put it to you this way: Your concern is that you can't see how someone can apply a plan consistently without gender-related bias and yet not be preferred. Is that it?

A. Yes.

689. Willis defended the impartiality and objectivity of his consultants, and testified the consultants' re-evaluations can be used as a point of reference for determining a pattern of different treatment of male and female questionnaires by committees. He based his opinion on his belief that his consultants had always followed a philosophy of not favouring one side or the other, that they had more experience in performing job evaluations and could evaluate consistently and without bias and, finally, that the consultants had more experience interpreting difficult questionnaires.

690. Fred Owen, a pay equity expert and a former consultant of Willis & Associates who participated in the JUMI Study, testified he believed it very important, in determining the reliability of evaluations in the JUMI Study, that the consultants provide a frame of reference in order to determine the accuracy of evaluations. It was his opinion the consultant evaluations could be used as a standard for comparison for several reasons. His first reason is the consultants have extensive knowledge of and experience with not only the evaluation plan but also evaluations of a wide variety of jobs. His second reason is the consultants had access to the entire array of jobs that were being evaluated, while individual committees only had access to a smaller group. His third reason is the consultants had no knowledge of the Employer's classification system or pay ranges for any of the classes of jobs and did not have any preconceived ideas about the pay system. He testified the consultants themselves did frequent, almost daily, quality checks not only to determine how consistently the MEC discipline was being applied, but also to check the evaluations done by the consultants themselves to determine if the consultants were correct in their evaluations.

691. In Owen's written opinion (Exhibit R-167), confirmed by his oral evidence, he outlined criteria for adopting the committee evaluations. He suggested that if the committees exhibited a good grasp of the evaluation plan, as demonstrated by the reasonableness of their evaluations, and if there was no observable attempt on the part of any committee members to manipulate the evaluation outcomes or to give prejudicial favour to any occupations or incumbents, there would be no need to assess committee evaluations against the consultant re-evaluations. In Owen's opinion, the evaluations fell short of these criteria owing to the lack of complete job information, as well as the observable behaviour on the part of some committee members who manipulated the evaluations so as to over-score female-dominated jobs and downgrade or under-score traditionally male-dominated jobs.

692. There is ample evidence the JUMI Committee, during the operation of the study, was prepared to use the consultants as a standard. The JUMI Committee had agreed to use the consultant scores as the baseline for comparison during the ICR testing. In that case, the consultants evaluated the test questionnaires that were provided to the committees, and the consultant scores functioned as a baseline for the ICR testing.

693. Throughout the study, the consultants were used by Willis as a standard to validate the committees' work. In a letter to Willis dated January 6, 1989, the JUMI Committee co-chairs requested that Willis provide baseline scores for the test questionnaires in the ICR. The letter reads in part:

...Your failure to provide baseline scores has delayed the work of the Inter-Committee Reliability (ICR) Sub-Committee as this information is necessary to analyze the consistency of ratings of committees with respect to a standard.

(Exhibit HR-82)

694. There were other occasions during the JUMI Study when both the management side and the union side, jointly and separately, requested the Willis consultants to review committee evaluations. Although this did not occur in the same framework as the ICR testing, in which the consultant scores were used as a baseline for comparison with committee scores, the consultants' opinion was nevertheless sought as a check on the quality of the committee scores. Consultant reviews with respect to the MEC benchmark evaluations have been previously described in this decision. There remains the agreement by the JUMI Committee to have Willis engage his consultant Wisner to do the 222 re-evaluations of the evaluation committees. There are also the less formal reviews done by the consultants during the operation of the five and nine evaluation committees to test for consistency.

695. The following excerpts are further examples of consultant evaluations of committee questionnaires being used by Willis to validate the results. In Volume 60, at p. 7435, lines 3 - 23:

Q. While the Master Evaluation Committee was performing their independent evaluations, were you also reviewing the questionnaires that they were looking at?

A. Yes.

Q. For what purpose?

A. Part of the job is to, in effect, validate the consistency of their evaluations. My role, for the most part, would be to review the questionnaires along with the committee, to listen to their discussions and to do my own personal evaluation of the job based on the information that was brought forth. Then I would track that.

While I did not give the committee my evaluation, I would track the consensus against my evaluation as a means of controlling and assuring myself that they were in fact being consistent in their interpretation of the information in the questionnaires and in the evaluation system itself.

Also in Volume 67, at p. 8429, lines 2 - 10:

A. I responded to a number of concerns expressed and re- expressed by the Treasury Board from the summer of 1989 -- the summer of 1988 on. I had felt that we had put to rest the issue of whether or not the Master evaluation committee was evaluating fairly and equitably. I, in effect, validated the results. I said they were creditable and credible and yet the problems kept surfacing.

696. The JUMI Committee's reaction, during the study, to Willis' request to conduct the Wisner 222 did not, at that time, call into question Wisner's impartiality. It is reasonable to conclude the parties themselves, at that time, assumed the consultants were bias-free in performing their role in the process.

697. The parties understood from Willis there was no correct score for any one questionnaire. As the process continued, the only measure taken in the event of possible gender bias, as contemplated by the parties and the consultant, was to implement steps for improving the process. These steps or safeguards have been previously described. As part of these safeguards, Willis counselled evaluators and provided additional training for either individual evaluators or committees.

698. Willis testified the use of consultant re-evaluations after the process is concluded is quite different from their use while the process is ongoing. Willis testified that, after the process, the re-evaluations are used to identify whether or not there is a gender-based pattern of difference. At the end of the study, Willis does not think it is particularly important to know the reasons for the disparities between consultants and committees because, in his opinion, it is only the existence of a pattern that is important.

699. Willis' firm belief is that he and his consultants are without any kind of pattern in their evaluations. Willis states in Volume 210, at p. 27323, lines 9 to 12:

A. It's my considered judgment that the experienced consultants with Willis & Associates tend to be bias-free or as nearly as it's humanly possible to be.

700. He went on to explain that by bias-free he meant there was no differentiation on a gender basis between males and females. He was questioned as to whether he believed his consultants were without gender-related differences, such as hierarchical treatment, where a consultant would be more liberal at the high end of a point scale or more conservative at the low end. He responded as follows in Volume 210, at p. 27323, line 23 to p. 27325, line 22:

A. That's an interesting point.

Q. That one is a little harder to say, is it?

A. Well, there is some evidence in a number of studies that we have done that it's difficult to get a good handle on a job that's two or three levels above your own. Alan Sunter made an observation that what might be viewed as gender bias might be something else.

Q. Yes, that's good. I'm going to talk to you lots about that point, so we don't have to -- bring me back to it later if I haven't dealt with it in detail. You say there are some studies to suggest it's difficult to get a handle on jobs three or four levels above your own, but your consultants were normally people who had very high-level jobs before they joined you, weren't they?

A. And they are consultants who have had some experience in evaluating higher level jobs. One of the problems in addition to it being difficult for a committee member to evaluate a job several levels above their own -- that is, having to have a good understanding of principles and theory and how is this important and what does strategic planning mean and things like this, things that are somewhat foreign to them -- and at the same time we find that the more complex jobs are more difficult to describe.

So, it's not unusual for -- I think it was Alan Sunter that suggested that perhaps the consultants had evaluated the higher level positions more liberally than the committees had.

Q. And that would be consistent. I gather what you are saying is that would be consistent with experience you have had in watching consultants and committees evaluate jobs?

A. I would say that would not necessarily be unusual.

Q. That's one take on it, that the consultants may be in a better position to appreciate those jobs. I would suggest the other factors at play with higher level jobs, I think I recall you telling us at one point that people tend to evaluate their own jobs more highly than they tend to evaluate other jobs that perhaps they are not as familiar with. Right?

A. I think maybe we are all a little bit biased in that direction.

701. Willis' rationale for not examining the reasons for the differences in the disparities between committees and consultants is that he believes it would be very difficult to pick out individual evaluations in order to explain the difference.

702. However, there were occasions during the study when Willis examined the consultants' (Drury and Wisner) evaluations to achieve an understanding of the differences between the consultants and the MEC. Willis did this sort of analysis with 46 MEC benchmarks that showed differences between Wisner and the MEC of more than 10 per cent. In this analysis, he was looking for any pattern or apparent pattern of gender bias. He did this by reviewing the differences and the reasons for the differences as identified by the committees' rationales.

703. Willis agreed in cross-examination the difficulty when comparing differences between the consultants and the committees lies in determining how much of the difference is attributable to a particular factor, because there is no guarantee it is just one factor which accounts for the disparity. (Volume 210, p. 27350).

704. As to the differences in the way the committee and the consultant treated higher level jobs, Willis testified he was willing to accept the fact the consultants were probably more liberal in evaluating the higher level positions. Based on his own experience, the consultants probably had a better understanding of the higher level jobs than the committees would. Willis gives a further opinion on this point in Volume 210, at p. 27355, line 18 to p. 27356, line 23:

Q. You have told us why they might have a better understanding of them, but you will also agree with me that it is possible that in those situations where you have fewer benchmarks -- right?

A. Yes.

Q. And you have to exercise more judgment. Right?

A. Yes.

Q. -- that the consultant's view of those jobs might be influenced by their experience with high level jobs outside of the federal public service.

A. And other studies which they have done. Yes, that's possible.

Q. So you can see that there are things that might make them in a better position to have a preferable view of those high level jobs. Right?

A. I don't think there is any question about that.

Q. You have just told us that one thing could be that they could be influenced by things outside, by their baggage from outside studies.

A. I would say that when we are talking about high level, complex positions, the consultants should have a better grasp on the content of the job than any one of the evaluation committees that may not have had that kind of experience on their teams.

G. WHETHER THE RESULTS SHOULD BE ADJUSTED - THE EXPERTS

705. Willis testified the Tribunal has three alternatives in dealing with the reliability of the results:

  1. to implement the study as it is;
  2. to adjust the results; or
  3. to trash the study.

706. As to option (i), Willis said without statistical analysis and the advice of a statistician he could not accept the results. In Volume 78, at p. 9576, line 19 to p. 9577, line 8, he said the following:

It is true that I was not happy with the various steps that were undertaken and to some extent we were able to do some shoring-up. However, without any analysis at all, without any opportunity to do some statistical analysis or to have it done and have some advice of a statistician, I don't think I could have accepted the results.


Once the study is complete, then it is possible to look at the results without regard to the other issues and make a separate determination: Do we have a consistent result or do we have a certain amount of bias and how much bias? In a sense, you do change into a different gear after the study is over.

707. With regard to the third option, Willis stated the following in Volume 78, at p. 9574, line 15 to p. 9575, line 7:

THE CHAIRPERSON: Could you tell us when the third option would be utilized?

THE WITNESS: I would want to sit down and talk to Milczarek and review all of the details with him. But it is possible, I assume, that the results would be so far out of line that they just would not be believable. At that point, they should be trashed.

If we had stopped after the 222 evaluations, nothing had happened after that, and I were asked by the decision-makers what to do with it, given no opportunity to analyze the results, at that point I would say there is nothing we can do with it. We can't use what we have so far for any valid results. The 222 was too small a test by itself to make any judgments. So, if we aren't going to be able to do anything more, then we have to forget the study.

708. In his last appearance before the Tribunal in June of 1994, Willis testified, as he had done previously, he would rule out trashing the study. Willis suggested the study was about fairness in the treatment of employees, and that the difference between the consultants and the committees which emerged from Sunter's analysis was so small that, in terms of a single employee's salary, by the time you take out the income tax, that is not enough to pay for coffee. (Volume 211, p. 27489). On the other hand, Willis remarked We are dealing with millions of dollars, so maybe there is more to it than just fairness to the employee. (Volume 211, p. 27489).

709. After having met with Sunter, Willis was interested in knowing how much of the difference between the committee and the consultant was really a value bias. It was Willis' opinion that if the value bias reduced the extent of the difference between the consultants and the committees to the point where it was immaterial, no adjustment to the committee evaluations was necessary. Willis suggested that if the difference between consultant re-evaluations and committee evaluations did not decrease after further analysis, then in view of the amount of money involved, he would tend to adjust. (Volume 211, p. 27492).

710. Although no witness testified on behalf of the Employer concerning the Treasury Board's methodology paper (Exhibit HR-185), the evidence demonstrates that the Treasury Board made an adjustment to the evaluation scores by taking the Wisner 222 re-evaluations as a baseline. The adjustment preceded the Employer's equalization payments of January, 1990. The Employer adjusted all scores, other than the benchmark scores, for which there was a consultant re-evaluation. The questionnaires were adjusted according to two regression equations contained in Exhibit HR-185, at p. 11, footnote 7. Shillington was asked his opinion on the regression equations contained in Footnote 7 and responded in Volume 134, at p. 16401, lines 13 - 25:

THE WITNESS: I would not adjust. I can tell you that when I first saw those equations and knew much less about the background to the data, I formed the opinion that I have expressed several times, that the onus is on the person -- before adjusting, I think there's an onus on the investigator to show the adjustment is warranted and the evidence here is that the adjustment does not warrant it and yet it was done. I formed that opinion as a statistician before I knew much more about the background of the study and nothing that I have heard in the background has changed that view.

711. Sunter expressed the same opinion as Shillington about the Treasury Board adjustments and said the following in Volume 106, at p. 12745, line 21 to p. 12747, line 10:

The point about this is that the regression equations given in the Treasury Board document do not even approach the level of certainty that I would consider necessary to make any adjustments at all to the male and female evaluations.

THE CHAIRPERSON: Could you explain that a little more.

THE WITNESS: Because they are not significantly different. If I wanted to make an across-the-board adjustment on the basis of gender, I would have to be virtually certain of a number of things.

One, I would have to be certain that the consultant is to be preferred to the committee, and I am by no means certain of that. As I tried to show yesterday, there are good reasons to doubt that.

Second, I would have to be sure that the reason for the difference is gender, not something which is merely related to gender in some fashion.

Finally, I would have to be sure of the numbers that I am using if I wanted to make an adjustment.

We have seen that the order of magnitude of difference between the consultants and the committee, depending on which particular equation you use and which particular set of observations you use, is of the order of about 2 to 2.5 per cent. Nevertheless, we have a methodology here that arrives at an adjustment of 7 per cent. How can that be?


The answer is that this regression analysis is a very poor, crude instrument for estimating the difference. Even if I were to believe all the other things, it remains a very poor instrument for making that adjustment because of the inherent uncertainty of the regression analyses themselves.
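
By way of illustration only, the following minimal sketch, written in Python and run on entirely hypothetical scores rather than the JUMI data, shows the kind of behaviour Sunter describes: a regression fitted to noisy pairs of committee and consultant scores can imply an adjustment at the high end of the scale that differs considerably from the simple average difference, and the fitted slope itself carries appreciable uncertainty.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    committee = rng.uniform(100, 900, size=50)              # hypothetical committee total scores
    consultant = committee * 1.02 + rng.normal(0, 40, 50)   # roughly a 2 per cent average difference plus noise

    slope, intercept, r, p, stderr = stats.linregress(committee, consultant)
    mean_diff = 100 * (consultant - committee).mean() / committee.mean()
    implied = 100 * ((slope * 850 + intercept) - 850) / 850  # adjustment the fitted line implies at a high score

    print(f"simple average difference: {mean_diff:+.1f} per cent")
    print(f"regression-implied adjustment at score 850: {implied:+.1f} per cent")
    print(f"fitted slope: {slope:.3f} (standard error {stderr:.3f})")

Whether the regression-implied figure exceeds or falls short of the average difference turns entirely on the noise in the particular sample, which is the inherent uncertainty of the regression analyses to which Sunter refers.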

712. Durber, on behalf of the Commission, supported the results of the study without adjustment. His conclusion on the issue of reliability is contained in Volume 154, at p. 19167, lines 4 - 24:

A. My conclusion is that the parties were enormously successful in producing a body of excellent job information. They went to enormous cost and effort to produce evaluation results. They tested those evaluation results, we have seen, exhaustively, at least they were exhaustive and I am not sure of the results on us.

I am quite confident that the studies I have looked at fall short of the quality of work that we see in this particular study. I think the parties deserve a great deal of credit for what they have produced and certainly I had the confidence in those results to suggest that the Commissioners rely upon them in examining evidence of a wage gap.

I do not believe that there is what I would characterize as evidence of bias. My bottom line is that the results should be taken as they are and that any calculation of wage disparities ought to be based with a great deal of confidence on the job evaluation results.

713. Sunter has consistently maintained throughout his testimony that no adjustment of the committee evaluations should be made. However, in response to questions raised by Willis and at the request of the Commission, he did suggest possible adjustment procedures for the evaluation results. We will elaborate more fully on these procedures in the event we conclude adjustment of evaluations is necessary.

VII. DECISION AND ANALYSIS

714. Throughout the JUMI Study, the Employer and the Alliance relied on the expert testimony of Willis to advance their positions. However, during the hearing and in both written and oral argument, there was considerable debate between the Treasury Board and the other parties concerning this consultant's role in the re-evaluation of questionnaires and whether the consultant could be relied upon to produce gender bias free evaluations.

715. The Tribunal finds the position of the Commission and the Alliance particularly puzzling. Willis' impartiality was not an issue prior to this hearing. In its submission, however, the Alliance cited a number of reasons why the committee evaluations should be preferred to the consultant evaluations. It pointed, for example, to consultant baggage and to other factors such as age, sex, education, and lack of gender sensitization training which, it alleged, would contribute to consultant gender bias.


716. In our view, the Alliance was attempting to discredit the witness upon whose expert opinion it relied in terms of the data-gathering and job-evaluation process which occurred during the study. By way of further illustration, we refer to the following exchange between Counsel for the Alliance and the Tribunal which appears in Volume 224, at p. 29495, line 17 to p. 29500, line 19:

THE CHAIRPERSON: Before you go on, Mr. Raven, I think I would like to respond to the word antagonism that you perceive from the Tribunal. I think it's a fairly -- it's a word that carries some connotation. I think that what the Tribunal has tried to do is understand your argument.

These parties have put forward or engaged these consultants to assist them in conducting a study over a period of five years. When we are faced with an argument that these consultants could be gender biased -- I think that what the Tribunal has tried to do is understand and to challenge you on these types of arguments that you are putting forward. I don't think that our conduct in doing that -- I don't think it's fair to say that we're antagonizing or we're being antagonized, or whatever. I think that's our role and we will continue to do that role to try to understand and appreciate what it is you are trying to put forward to us.

MR. RAVEN: I appreciate that. I was really attempting more to provide some added definition to my submission for that purpose.

MEMBER FETTERLY: Before you start, I would like to make a couple of comments about this issue.

To begin with, if this were a civil trial and Mr. Willis was your witness -- technically he's not; he's the Commission's witness -- would you be permitted to discredit him after having introduced him as a witness?

MR. RAVEN: Mr. Fetterly, let me respond to the question in this way. I am not attempting to discredit Mr. Willis. That may be where we ---

MEMBER FETTERLY: You certainly give that impression, Mr. Raven. Let me just add this: Mr. Willis and his fellow consultants were the only experts who were actually involved in JUMI. You rely and he has defended the MEC results not only before this Tribunal, but also before JUMI. He has defended the ICR results. And he has done that both before this Tribunal and before JUMI. He has defended the total results before this Tribunal. Basically, he has said that they should not be trashed. And it's his plan that was adopted as being a gender-neutral plan.

To hear you and, to some extent, Ms. MacLean, attack, in a sense, his neutrality really puts the Tribunal in a very awkward position. I find it a matter of real concern. It's not a question of antagonism.


MR. RAVEN: What I had hoped to do this morning is try to clarify where we're going with this. Your comment, Mr. Fetterly, is very apt. It allows us and it affords me an opportunity to deal with that.

There is, in no sense, an attack here on Mr. Willis. That suggests that Mr. Willis or anyone associated with his firm was guilty of some malfeasance or misconduct in the way they conducted themselves in the course of the study or in the way they --

MEMBER FETTERLY: Not at all. Not at all. Mr. Willis and his associates hold themselves up to be experts in pay equity. They promote their plan as being gender-free. They train evaluators in order to evaluate on a gender-free basis. Now you are saying that their own ability to evaluate on a gender-free basis is suspect. That to me is a real contradiction.

MR. RAVEN: The fact that the Willis Plan was accepted by the parties here as being gender neutral for purposes of this study is one thing. But if you will permit me to make this point, Mr. Fetterly, there's no personal attack on Mr. Willis or his associates. What we are trying to grapple with here is a very, very minute pattern difference between the consultants and the committees for the high end top quartile of male jobs, and we are now having to wrestle with the problem of whether we should adjust those scores to bring the committee scores in line with the consultants and whether there are compelling reasons to do that or not do that.

The submissions that are advanced here that I am about to get into is to raise with the Tribunal pertinent considerations in determining whether or not it makes a lot of sense to adjust in these circumstances. It's not intended as a personal attack on Mr. Willis.

MEMBER FETTERLY: That I understand. I think that's quite legitimate.

As I said to you yesterday, is it necessary, in order to achieve that, to allege that the consultants' ability to evaluate on a gender-free basis is suspect? Is it necessary for you to do that in order to establish or to argue that the committee results are to be preferred over the consultant results? I don't think it is.

MR. RAVEN: I tend to agree with you that there are a variety of reasons that support preferring the committees' scores, not just the questions that have been asked here and that are raised as to the manner in which the consultants themselves did these re- evaluations.

For example, if I understand Ms. MacLean's submission the other day, it was that the consultants had a slightly different discipline, a more liberal discipline, than the committees did.


Mr. Willis recognized that, and in his reports to the Joint Union/Management Committee, recognized that and found it quite suitable. In fact, he in his own words said Given the context, our previous understanding and application of the Willis discipline in other contexts is not to be preferred to MEC's.

I don't know that that necessarily raises the question of bias, conscious or unconscious, or pattern differences. It does, however, confirm that (1) there were differences in the discipline that Mr. Willis has adopted in other studies and the MEC discipline; (2) that the MEC discipline was more conservative; and (3) the committee scores were more conservative than the consultants in high-end male jobs. So I don't raise that necessarily as an allegation of bias.

717. Statistical evidence was introduced by the Commission which, they submit, shows inconsistency between the consultant Wisner, who conducted the 222 re-evaluations, and the Gang of Four, who conducted the 300 re-evaluations. This evidence was also introduced in the context of whether committee evaluations should be preferred to consultant re-evaluations. In our view, it has a similar effect of discrediting the very expert the Commission contracted to do a further study and upon whom they relied during their investigation. Reference is made here to paras. 184 and 185 of Commission Counsel's written submissions:

(184) A reasonable inference that the consultants as a group were not evaluating without gender bias or with relatively more gender bias than the committees may be drawn from the fact that Esther Brunet, a rater in the Willis II re-evaluations who was familiar with the federal public service, and considered a competent evaluator free of gender-bias by Mr. Willis, was almost 100% consistent with the committee evaluations (for the French-language questionnaires).

(185) If an allegation of gender bias is supported by inconsistent application of the evaluation plan to male and female evaluations, then it is important to assess the relative consistency of the consultant evaluations compared to the committee evaluations. Consistency can be measured statistically. The statistical evidence of consistency of raters - committees versus consultants - demonstrates that it is the committees who are more consistent in their ratings than the consultants. The existence of a greater degree of rater error on the part of the consultants is described by Mr. Sunter as conclusive evidence that the committee is to be preferred over the consultant. Thus, the allegation of gender bias in the committee results is not supported by the statistics, nor is an allegation that the consultant scores are more consistent or more reliable.

718. We are of the view there are other valid characteristics that can account for the differences between the Wisner 222 and the Willis 300 which should be considered quite apart from a pure statistical analysis. Although the two studies followed the same procedures, they are very different in other respects. The Wisner 222 was undertaken to validate a process which brought Willis discomfort. It was a smaller study conducted by a single consultant who had demonstrated a more liberal discipline than the MEC. Wisner's analysis was a snapshot assessment only and was not intended to portray the whole picture. Not only was the time frame between the Wisner 222 and the Willis 300 different but the sample of jobs re-evaluated in the Wisner 222 was drawn from a smaller population than the Willis 300. The Wisner 222 evaluations were taken from the evaluations of the multiple evaluation committees and excluded the MEC evaluations. The multiple evaluation committees had been operating for about three months at the time of the Wisner 222.

719. The Willis 300 was a larger-scale study, undertaken after the process was finished. The purpose of this study was to confirm or to dispute the analysis contained in the Wisner 222. Four consultants conducted the Willis 300, with two or more consultants working in tandem. One of the consultants was an evaluation committee member. The sample of jobs came from the entire population of jobs from the expanded evaluation committees, excluding the Wisner 222. Not surprisingly, there was greater agreement with the committee evaluations in this latter study.

720. The timing of the re-evaluations by the so-called Gang of Four, the range of the sample, the number of consultants involved, the process followed and the circumstances then prevailing make the results, in our opinion, more likely representative of any real difference between the evaluation committees and the consultants.

721. The Tribunal had ample opportunity to observe Willis as he testified, during his first appearance, which lasted 36 hearing days, and his second appearance, which lasted 4 hearing days. We found Willis to be a credible witness who demonstrated patience, cooperation, and most importantly, impartiality in all respects. The Tribunal accepted Willis as an expert in the field of pay equity. Willis' experience, prior to the JUMI Study, was garnered entirely from his participation in U.S. studies in comparable worth. He had gained general recognition as a pioneer in this field. He was accepted as a qualified pay equity expert in the American court system.

722. We have reviewed the many occasions when the JUMI Committee asked Willis and his consultants to review committee evaluations or provide a baseline for comparison with committee evaluations. That role was well established and endorsed before the breakdown of the study. We do not now intend to view Willis' role differently from that which he provided to the parties in the JUMI Study. All appropriate factors will be considered by the Tribunal if the issue of adjusting scores should arise.

723. The difficulties experienced by the multiple evaluation committees were not unexpected and should be accommodated and understood in the context of the sheer size of the Federal Public Service, its geographical dispersion and the multifaceted occupations and skills of its diversified workforce. These complicating factors, coupled with the logistical problems which were encountered, imposed a daunting challenge for all concerned. The experts, Armstrong and Durber, emphasized the difficulties inherent in the complex job evaluation process as it pertains to pay equity.

724. Given the nature of the JUMI process, the numerous participants with diverse backgrounds, and the working conditions within which the multiple committees functioned, the Commission and the Alliance submit job evaluation for purposes of pay equity will and must involve some conflict. This conflict, they submit, arises from a clash of values between evaluators who attempt, in a pay equity study, to question stereotypes and the attitudes of those with a more traditional mind set. Within that framework, the conflict which occurred is, it is claimed, understandable and in fact unavoidable.

725. Respondent Counsel submits not all committees were working together in a team effort but instead operated in an adversarial mode. Willis said some committees tended to feel themselves almost in a negotiation mode rather than a team of six or seven people trying to accomplish a common goal. Respondent Counsel submits it would be wrong for the Tribunal to accept the proposition pay equity job evaluation must inevitably involve conflict and adversity. Counsel submits pay equity job evaluation should be a cooperative problem-solving exercise in which evaluators work toward a common goal and evaluate based on the relevant facts. In the Employer's view, the process should instil confidence the relevant facts are being analyzed and that appropriate weight is being given to those facts. According to Counsel, when all these things happen, then the Tribunal can be confident the results are reliable.

726. In Weiner's opinion, the application of the plan is more important than the plan itself in ensuring gender bias free evaluations. She described the characteristics of the process which will prevent or minimize potential gender bias. In addition to having diverse committees of both genders and different organizational levels, Weiner stated other factors, such as the training of committees, discussion as to how gender bias might operate, complete and up to date job evaluation information and the manner in which the committee conducts itself, must all be considered. On this point, she says in Volume 8, at p. 1092, line 13 to p. 1093, line 3:

Q. Now, what about the way that the committee conducts its affairs on a day-to-day basis?

A. Traditionally, job evaluation committees strive to be very efficient. They try to evaluate as many jobs as possible in a day.

A pay equity committee has to take a different approach and open their questioning to asking for more information if they are unclear about something in the job information, to have a discussion about gender bias, to listen to themselves say things like, This is just a secretary, and realize what they are doing, how this dismisses women's work.

So all of those things take time, questioning, probing.


727. Weiner makes reference to questioning, probing in the context of committee evaluations. Although she did not comment directly on conflict in the committees, Weiner did insist that traditional values must be challenged in a pay equity job evaluation exercise.

728. The Tribunal is not persuaded, given the issue it has to decide, that it should be asked to define the nature and degree of what is permissible, acceptable and legitimate discussion within the committee framework. Moreover, it is most difficult to measure its effect, especially when traditional values are being challenged and debated in a pay equity context. Nor is the Tribunal prepared to suggest answers for the resolution of conflict between committee members who may individually entertain strong opinions one way or the other on this sensitive subject. The study and implementation of equal pay for work of equal value in Canada is a relatively new discipline which is still in the developmental stage. Nonetheless, we do find it necessary, considering Willis' concern about committee conduct and individual evaluator behaviour, to assess whether the process achieved its purpose of producing gender bias free evaluations.

729. With regard to the effectiveness of the safeguards in place during the study, and more specifically the procedures defined by Willis to be part of the Willis Process, we find the expert opinion of Willis to be most persuasive and informative. Because of its importance in assessing the results, we have described in some detail the procedures and the safeguards which he recommended be adopted in that process.

730. The Tribunal believes it is incumbent upon it to comment on the JUMI Process as orchestrated by the JUMI Committee. Suffice it to say, the JUMI Committee had a difficult working relationship from its inception. For incomprehensible reasons, the JUMI Committee chose to deprive both Willis and the Commission of real decision-making authority. This was done notwithstanding the impartiality of both Willis and the Commission, and their competence and broad experience in pay equity as compared with the parties themselves. In both the information gathering stage and in the evaluation stage of the JUMI Study, the JUMI Committee failed to follow Willis' advice and frequently refused to implement his recommendations. Some of the Willis recommendations were not implemented owing to make or buy decisions, largely controlled by the Employer and motivated by economic considerations. However, other Willis recommendations, not complicated by these considerations, were ignored as well.

731. Willis identified the JUMI Committee as a major weakness in the study and, in our view, his opinion is well-founded. The adversarial tone set by the JUMI Committee reflected the long-lasting and deep-rooted difficulties between management and union sides which permeated the JUMI Study throughout its entire life.

732. There is evidence the Chief of Pay Equity, an individual from the Treasury Board, viewed the JUMI Study, in Willis' words, as a bunch of bunk. (Volume 210, p. 27280). On the other hand, the Alliance wanted to follow a cohesive strategy as described in the correspondence from Millar, speaking for the Alliance, in announcing the Mont Ste. Marie meeting. This incident and others threatened the foundation of the JUMI Study from the beginning and contributed in no small measure to the resulting difficulties. The union/management split was evident in the manner in which the parties attempted to resolve the issues. It even manifested itself in the seating arrangements at the JUMI Committee meetings, with union and management on opposite sides. The parties opposed an attempt by Willis to change those seating arrangements. Willis said ...they looked at me like I was crazy. (Volume 60, p. 7459).

733. Willis disapproved of meetings the Alliance convened with their members prior to and during the course of the study. There was the meeting of Alliance members at Mont Ste. Marie before the commencement of the study itself, where the subject of under-evaluation of female work was discussed in the absence of the consultants and the other parties to the study. During the course of the study, the Alliance also held evening meetings in which the participants discussed their logistical problems but during which there was also discussion relating to evaluations. Further, the Alliance representative on the MEC attended the evening meetings and was available to answer questions concerning the MEC benchmarks. At one week-end meeting, occurring in the fall of 1988, the Alliance held a training session on pay equity job evaluation, without the knowledge of Willis or the other parties. At that meeting, members examined and discussed certain of the MEC benchmarks. During this week-end meeting, gender-sensitization training, as interpreted by the Alliance, was given to the participants. The Alliance justified this unusual action on the grounds it was necessary to correct what it conceived to be historical injustices to women as victims in the work force.

734. Within the framework of the study, Willis felt he lacked the necessary support and backing of those in authority, both on the government and on the union sides, while the study was ongoing. Although the sub-committee on communications had devised a strategic plan for communicating the JUMI Study to employees, Willis felt there was not enough emphasis on the need for communication from top management. He had initially proposed at least 10 consultant days for face-to-face meetings with department heads and union executives. No briefing sessions of this type were held and Willis believed this most likely resulted in the long delays before the employees completed their questionnaires.

735. Willis' evidence is that he designed the process to ensure a sound result; if the result is sound, it is immaterial whether the process is flawed. In examining the Willis Plan itself we find it to be an appropriate tool to evaluate jobs for the JUMI Study. During final argument, the Tribunal was informed there is no dispute between the parties concerning the Willis Plan. We refer to Respondent Counsel's written submission at para. 41:

41. Nevertheless, for purposes of this litigation, the Employer accepts that the Willis Plan was an appropriate plan to use in evaluating jobs in the Federal Public Service. Therefore, the Tribunal need not decide whether weighting of the Willis plan is valid.


736. We rely on Willis' expert opinion that the Willis Questionnaire, with slight modifications, was capable of capturing sufficient job information to ensure pay equity evaluation could be accomplished in the study. In his opinion the questionnaire contained sufficient information on which a well-trained and supervised job evaluation committee could provide reliable unbiased evaluations.

737. The degree of effectiveness of the safeguards provided for in the information gathering stage was disappointing to Willis. It was during this stage that efforts were made to ensure the questionnaires were properly completed. Details of these efforts are described in the decision under the heading, The Willis Process.

738. In assessing the role of the coordinators we find, given the breadth of the study, it would have been extremely difficult for Willis & Associates themselves to act as coordinators without significant time delays and significant additional expense to the JUMI Committee. Coordinators were responsible for communicating directly to employees who were targeted to complete the questionnaires. Also, the coordinators trained incumbents as to the proper manner in which they were to complete their questionnaires. The consultants were involved with the JUMI Committee in the preparation of training materials supplied to and for the training of coordinators. If the number of completed questionnaires is a measure of the quality of the work of the coordinators, then their work can be viewed as most satisfactory. The percentage of return was impressive; nearly 100 per cent of the questionnaires were returned.

739. Willis' greatest concern lay in the lengthy delays in returning the questionnaires. According to Willis, delay in the return of questionnaires impacts negatively on the quality of information, and the longer the delay the poorer the quality. There is little evidence as to what contributed to or caused these delays. The evidence does not show the incumbents failed to fill out the questionnaires in a timely fashion and within the required 10 to 14 days after receiving training. Furthermore, there is little available information concerning when the coordinator-incumbent training sessions were held. To an extent, the large number of substitutions almost certainly contributed to the delays.

740. Although the effectiveness of the coordinators' role appears weak, this did not deter Willis from continuing with the evaluations. He was willing to have the study proceed notwithstanding somewhat weaker information. Willis instituted other safeguards, such as the screeners/reviewers and the evaluators themselves, to ensure completeness of job information. We do not consider the limitations of the coordinators' role to impinge significantly on the issue of reliability.

741. The screeners/reviewers provided a sophisticated double check, or safeguard. They were responsible for ensuring the questionnaires contained factually complete information for evaluation by the committees.

742. The screening and reviewing function was not conducted by Willis. Its sufficiency must be assessed from the training given, the evidence of the witnesses who actually performed this function, the Commission's research (conducted by the outside researcher, Exhibit HR-245), and Willis' own observations and comments. The screeners/reviewers who testified believed they had done their job well. Through follow-up telephone interviews they believed they were able to obtain the required information. Although Willis would have preferred more face-to-face interviews, overall he saw no difficulty with their performance or the role they played in the JUMI Study.

743. The screeners/reviewers received the same initial training on the Willis Plan as was given to the MEC evaluators. They also received on-the-job training from the consultant when needed. We find they functioned well and with no apparent problems other than the involvement of some committee outliers in this work. However, there is no evidence the outliers, who bore this identification because they tended to evaluate differently than their committees, failed to perform their task fairly and competently or that they unduly influenced others. The six outliers who functioned as screeners/reviewers were relatively few in number compared to the many others who fulfilled this role.

744. It is understandable why Willis would have personally preferred hands-on involvement in the screener/reviewer function. However, it seems unlikely, given the volume of questionnaires, that one consultant could have accomplished this task during the time frame allocated. Having carefully reviewed the evidence as it relates to the collection of job information, we accept Willis' opinion and find as a fact the job information was of satisfactory quality when all the shoring-up is taken into account.

745. Consistency is an important feature in the process of pay equity job evaluation. The Willis Plan should be applied consistently especially when multiple evaluation committees are involved. This requirement, if met by the participants, does not necessarily imply the process is without gender bias and, on the other hand, lack of overall consistency between the committees does not necessarily imply that the evaluations are biased, nor is it crucial to the issue of reliability. In the final analysis, Willis' concern was whether the results were biased. However, within the context of this study and in assessing how well the process worked, we consider it prudent to comment on whether the multiple evaluation committees consistently applied the discipline established by the MEC.

746. There were some committees amongst the original five evaluation committees, namely Committees #1 and #2 and the first version of Committee #4, that worked well. After the restructuring of the original five multiple committees into nine multiple committees, the newly created nine committees appeared on the whole to have functioned well. Most of the multiple evaluation committees did, in fact, attempt to follow the MEC benchmarks, adhere to the discipline created by the MEC and follow the same job evaluation procedure as had the MEC. There is evidence, at least from the early ICR testing, of consistency between committees in interpreting the Willis factors and applying the plan. To some degree, the MEC benchmarks had a steadying effect on the functioning of the multiple evaluation committees and on the study as a whole. This is most evident from Willis' response to a question by the Tribunal regarding the first incarnation of Committee #3 in Volume 69, at p. 8676, lines 8 - 18:

But, as it worked out, one of the things maybe that helped to stabilized [sic] the evaluations was that we did have those Master Evaluation Committee benchmarks for them and maybe they just got so tired each fighting for their own side that they went along with the Master Evaluation Committee's benchmarks. I was not at all satisfied that I could leave it at that or let it rest at that. But I could not observe any particular problem in the actual evaluations that we were able to examine.

747. The Tribunal will now refer to the training the committees received in order to properly perform their function as evaluators. Willis' approach in dealing with gender stereotypes and traditional values is to direct evaluators to break down a job into its component parts and to evaluate each part separately so as to ensure bias free evaluations. Willis' opinion differs from Armstrong's about whether his method of training should have included a more formal kind of gender sensitivity training which would focus on under-valuation of female work. In our view, the fact this training was not formalized by Willis does not increase the potential for gender biased evaluation. Willis preferred on-the-job training and this approach was used by him successfully in previous studies. Moreover, the JUMI Committee had authority to decide what was to be included in the training and what training it expected to be provided. Willis was criticized by the Alliance, during this hearing, for not providing gender sensitivity training in the form espoused by Armstrong and in the reference material from the Ontario Pay Equity Commission. It should be noted, however, that the Alliance approved of Willis' training approach at the outset of the JUMI Study while it was a member of the JUMI Committee. The Alliance's criticism of Willis would seem to be motivated by Willis' disapproval of the Alliance undertaking this kind of training during one of their meetings held in the absence of the consultants and the other participants. In addition, Willis commented on another aspect having to do with the quality of such training in Volume 211, at p. 27483, line 24 to p. 27484, line 20:

Q. On another subject -- and this is one that you have discussed at some length with my friend Mr. Raven. It's the subject of training participants in a study to be sensitive to gender issues. Do you recall the subject?

A. Yes, I do.

Q. In deciding whether such training is beneficial, is it relevant to know something about the quality of the training?

A. Certainly.

Q. Could you comment on that, please?


A. I would think it would be important for whoever is providing the training of this nature to be accepted as an impartial individual and to have been trained in this area.

Q. If the training is not done well or impartially, could it have any effect other than off-loading baggage?

A. It's possible that it could have the effect of creating more baggage.

748. We hold the view, in recognizing Willis' extensive hands-on experience in conducting pay equity studies, that his practical approach has merit and is acceptable. We say this notwithstanding Armstrong's opinion, based, it would seem, entirely on research and on the available literature.

749. With respect to the actual job evaluation process, there is anecdotal evidence the process did not work as well as it ought to have. Willis testified about his discomfort with the behaviour of some committee members, particularly with the first version of Committee #3, which he characterized as consisting of two warring camps. He was prevented by the JUMI Committee from taking the appropriate remedial measures he believed were necessary concerning those evaluators who were evidencing gender bias on Committee #3.

750. The Tribunal had the benefit of observing and hearing witnesses who had participated in the evaluation committees. Their evidence can be characterized generally as an 'injection of reality' into the evaluation process which is best described as a lengthy, arduous, complicated, stressful and difficult process. In general, these evaluators did not express difficulty with the sufficiency of the information provided in the questionnaires. If and when further information was required by a committee to complete an evaluation, this was accomplished through the procedural safeguard established for that purpose, that is, having the screener/reviewer supplement, clarify or obtain new information.

751. Willis testified about some of the strengths of the JUMI Study. He regarded three strengths as being, firstly, the large number of individuals who participated on evaluation committees, secondly, the large number of diversified jobs evaluated and, thirdly, the large number of jobs in the sample, which enabled him to deal with slightly greater disparity in job information than a study with a smaller population. Willis believed the committees represented a pretty good balance of union and management employees with different backgrounds despite the difficulties the unions encountered in naming male representatives. There is evidence some of the female evaluators were members of male-dominated unions, which contributed to more diversification within the committees.

752. One of the problems Willis recognized was the participation of management individuals trained in classification. Seven evaluators nominated by the management side had extensive knowledge of the classification system in the federal government. They served on four of the evaluation committees and on the MEC. The problems associated with classification backgrounds surfaced during the evaluation process. The statistical evidence, however, did not indicate that the classification background of these individuals had an impact on the multiple evaluation committees' consensus scores. There is anecdotal evidence these individuals had little or no influence and tended to be ignored by the other participants.

753. Another problem which arose was the participation of some Alliance supporters who evidenced an agenda for increasing the value of female-dominated jobs. There were misguided attempts to influence the evaluations of some of its members through confrontation and intimidation. The quantitative differences in the consultant re-evaluations point to the committees under-evaluating some male-dominated jobs but do not demonstrate these misguided individuals accomplished their objective of persuading others to over-evaluate female-dominated jobs. As Sunter's analysis reveals, significant differences between the committees and the consultants exist almost entirely in the treatment of male-dominated questionnaires. Furthermore, the IRR test results reveal the majority of both management and union outliers exhibited a male preference. Thus, any conscious attempt by Alliance members to over-evaluate female-dominated jobs was unsuccessful. There is also some comfort to be had in the testimony of all of the Alliance evaluators who gave evidence to the effect there was no Alliance meeting at which members were told to over-evaluate female-dominated jobs or to under-evaluate male-dominated jobs.

754. Some of the evaluators were identified by both the consultants and the IRR test results as outliers. During the JUMI Study, efforts were made to assess whether the outliers were exercising influence on the committee's final consensus. The statistical analysis demonstrated their influence was negligible. As well, while directly observing the participation of the outliers in the evaluation committees, Willis could not detect them exerting any influence on the other members.

755. One of the most redeeming features of the JUMI Study was the work of the MEC which had the unqualified endorsement and support of Willis. When the MEC completed their work, Willis was satisfied they had done a good job. There were several reviews of MEC's work by the consultant, revealing some differences between the MEC and the consultant evaluations. Willis was not concerned with the extent of these differences, as there was no evidence of gender bias in the MEC evaluations. Willis said he anticipates differences between committees and consultants. In his view, the presence of those disparities does not necessarily mean the consultant is always right.

756. From Willis' perspective, there are four questions that need to be addressed in deciding whether or not a real problem exists. They are:

  1. What is the extent of the disparities on total scores in a specific evaluation;
  2. How frequently do the disparities occur;
  3. The rationale: why have the committees done what they have done; and
  4. Is there a pattern to the disparities, and if so what is the pattern?

757. When the study is over, Willis examines the total scores to answer two of the above four questions, namely, what is the extent of the disparities and how frequently do they occur. Willis' examination is done with the assistance of a statistician, upon whom he also relies for the answer to the fourth question, namely, whether there is a pattern in the disparities. There can be a number of reasons for the disparities referred to in question (iv) but, at this stage of the study, Willis is not interested in those reasons.

758. In Willis' opinion, when the study is completed, the appropriate consideration is how much the committees strayed from the consultant evaluations. In his view other considerations are, at this point, immaterial. His reason for considering only the bottom line results is that the evaluation committees are no longer functioning. An understanding of whether or not the committees were applying the plan correctly is no longer useful to the consultant because counselling and training are no longer feasible.

759. Willis expressed the view, on a number of occasions during his testimony, that the results were more important than the process. By results, he meant the comparisons between the committee evaluations and the consultant re-evaluations.

760. However, in view of our interpretation of s. 11 of the Act, which is that causation is implicit in the legislation, we must address the question of whether the differences between the consultants and committees arising during the process are based on gender or on some other consideration. It follows, therefore, that it is not only necessary but crucial that the evidence be examined in detail in order to determine whether or not the differences between the committees and the consultants are gender based.

761. There was evidence led by the Alliance concerning analyses done by two individual Alliance witnesses who examined committee and consultant rationales, with a view to explaining consultant and committee disparities. Prior to the commencement of the evidence of the first of these witnesses, the Employer provided an admission to the Tribunal which reads in part:

4. The Employer makes the following admission and clarification in order to narrow the issues and to avoid further unnecessary use of hearing time in tendering evidence.

5. The Employer admits that disparities between consultants and committees in the Wisner 222 and Willis 300 re-evaluations may have occurred for reasons other than gender bias in the Joint Initiative Committees.


6. To clarify the issues, the Employer will not rely on the reasons for disparities as evidence of gender bias in the process or bias in the results.

7. Therefore, the Employer contends that evidence analyzing the reasons for disparities does not assist the Tribunal to assess:

(a) the reliability of the process; or

(b) the reliability of the results.

(Exhibit R-154)

762. Willis had an opportunity to comment on the two analyses presented by the Alliance witnesses. Willis does not consider either of them helpful for identifying gender bias in a large study or for exploring consultant and committee disparities. In his experience, individual assessments of differences based on the rationales will not reveal the existence of gender bias. The Tribunal accepts Willis' view. Our determination will not be based on what is contained in the rationales for individual differences between committee evaluators and consultants on a given question, but instead will be based on an examination of all the evidence relevant to committee and consultant evaluations.

763. Willis wanted questionnaires that were complete and focused on factual information. Incomplete questionnaires lead evaluators to make assumptions which result in a wider range of possible disparities. The number of disparities in this study tended to be higher than what Willis usually experiences. On the other hand, Willis had never before participated in a study as large as the JUMI Study and was not in a position to supervise the entire 522 re-evaluations, some of which were done during the study and some after it was over.

764. We will now address Willis' questions (i), (iii) and (iv). Willis testified on numerous occasions about a tolerance level of differences between committee and consultant evaluations. The percentage variances he uses are simply a function of his experience and what he views as acceptable. Based on the quality of information available to the MEC, he would expect to find a 10 to 12 per cent random variance, either positive or negative, in evaluations. Because the information available to the multiple committees was not, in his opinion, of as high quality as was available to the MEC, he would expect to see between 15 and 20 per cent random variance in their case. There is more opportunity for evaluators to make assumptions when they are furnished with poorer quality information.

765. Willis testified random variance occurs when value judgments are made about the meaning of the facts presented in the questionnaire. Willis considers in a large study, such as the JUMI Study with the sheer numbers of jobs being evaluated, greater disparity is acceptable as a result of the relatively weak job information. Willis is concerned if, over time, the variance is no longer random and becomes systematic. He defines systematic variance as value or values which are consistently higher or lower than an objective evaluation of certain types of jobs. He treats the term systematic variance as equivalent to gender bias.
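
Willis' distinction lends itself to a small illustration. The following minimal sketch, in Python and on invented figures rather than the JUMI data, contrasts differences that scatter in both directions around zero with a consistent downward shift for one class of jobs; a one-sample t-test on the committee-minus-consultant differences is one simple probe for such a shift.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    random_diffs = rng.normal(0, 10, 60)       # value judgments scattering both ways around zero
    systematic_diffs = rng.normal(-5, 10, 60)  # consistently lower for one type of job

    for label, diffs in (("random", random_diffs), ("systematic", systematic_diffs)):
        t, p = stats.ttest_1samp(diffs, 0.0)   # is the mean difference distinguishable from zero?
        print(f"{label}: mean difference = {diffs.mean():+.1f} points, p-value = {p:.3f}")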


766. Shillington testified on the distinction between pattern and randomness in a large study and the difficulty in defining something as random. He said in Volume 86, at p. 10540, line 9 to p. 10541, line 13:

Q. How do you know that you have something that is random as opposed to something that is not, something that is patterned?

A. Sometimes you are comfortable using a term without trying to define it, and randomness is one of those terms that is easier for people to use comfortably. I think everybody knows what you mean, but as soon as you try to define it, it gets difficult.

If you show someone a pattern of numbers, quite often people will look at that pattern and you can say, Is it random or not? It is very difficult to show that a pattern is random. It is often easier to show that it is not.

Let me write down a sequence. Suppose we toss a coin four (4) times and we get heads, tails, heads, tails. You can look at that and say that that is a possible outcome from a fair coin. You have fifty (50) per cent heads and fifty (50) per cent tails. But if you continued getting heads, tails, heads, tails, heads, tails, heads, tails, something in our brain starts saying that this isn't random any more. Yes, you are getting half heads and half tails, but that is far too systematic.

Defining what is random is very, very difficult. It is much easier to say, This is not random. It looks like there is a pattern here.

767. He further states in Volume 86, at p. 10543, lines 1 - 8:

So, it is easy to show that it is not random, that there is a sequence. But proving it is random is virtually impossible.

We use the term random basically as a catch-all phrase for what we don't know. If you toss a coin over and over again, we say that the coin is random because we can't predict well the next outcome.
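
Shillington's coin-toss example can be made concrete with a short sketch. The Python below, offered purely as an illustration, checks a strictly alternating sequence in two ways: a frequency count, which the sequence passes, and a Wald-Wolfowitz runs test, which it fails because the alternation is far too systematic to be random.

    import math

    def runs_test_z(seq):
        # Wald-Wolfowitz runs test: z-score of the observed number of runs
        n1 = seq.count("H")
        n2 = seq.count("T")
        runs = 1 + sum(a != b for a, b in zip(seq, seq[1:]))
        mu = 2 * n1 * n2 / (n1 + n2) + 1
        var = (mu - 1) * (mu - 2) / (n1 + n2 - 1)
        return (runs - mu) / math.sqrt(var)

    seq = "HT" * 20  # heads, tails, heads, tails ... forty tosses
    print("fraction of heads:", seq.count("H") / len(seq))   # exactly 0.5, passing the frequency check
    print("runs test z-score:", round(runs_test_z(seq), 2))  # about 6.1, far beyond the usual +/- 1.96 band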

768. Willis confirmed that, at the conclusion of the study, he is willing to accept a wide disparity in evaluations provided there is no pattern. He does not like to see any pattern at all. He said in his earlier testimony that if the variance is less than 2 per cent, he probably would not adjust the evaluations. He said in Volume 61, at p. 7596, lines 5 to 11:

A. In the final analysis when the study is over, obviously in many cases we are involved in recommending and implementation. At that point I might decide that there needs to be some adjustment to correct. But obviously, if it is less than 2 per cent, the difference in pay is so minimal that I guess I would have to accept it.


769. As a rule of thumb, even with the very best job information available, Willis expects to see as much as plus or minus 10 per cent disparity between the committees and the consultants. Willis considers disparities over 10 per cent a red flag, signalling there may, or may not, be a real problem in the evaluations. In a large study, such as the JUMI, Willis seeks the assistance of a statistician to determine whether the disparities are systematic.
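
Willis' rule of thumb amounts to a simple screen, illustrated by the minimal Python sketch below; the jobs and scores are invented and merely stand in for committee evaluations paired with consultant re-evaluations.

    # Each entry pairs a committee score with a consultant re-evaluation (invented figures).
    pairs = [
        ("job A", 412, 430),
        ("job B", 305, 351),
        ("job C", 780, 690),
    ]

    for job, committee, consultant in pairs:
        disparity = 100 * (consultant - committee) / committee
        flag = "red flag" if abs(disparity) > 10 else "within tolerance"
        print(f"{job}: {disparity:+.1f} per cent -> {flag}")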

770. The nature of this exercise, which Willis describes as more an art than a science, renders it difficult to quantify job evaluation either statistically or mathematically. The Tribunal was occupied for a considerable time with the presentation of statistical evidence. In the end, we had opinions from the statistical experts, Shillington and Sunter, to the effect that statistical analysis cannot identify the existence of gender bias.

771. Sunter's conclusions are a product of hypothesis testing. In his interpretative analyses, he relies on probability criteria and mathematical models to explain variations in the data. His conclusions are not based entirely on scientific reasoning and mathematical applications but, in part, on assumptions about the nature of the world. Sunter acknowledged at different points in his testimony that his intuition assisted him in reaching his conclusions. The following examples, which are not exhaustive, are reproduced. In Volume 110, at p. 13221, lines 8 - 17, he remarked:

When I said that the original stuff is most unexpected it was because I felt that if the consultant is always right and the committee is always wrong, then my statistician's intuition tells me this should lead to a larger variance for committee scores and it should lead to a negative covariance and a negative correlation between difference and committee scores, which is exactly the relationship that you see reproduced by Model 2.

772. As well, in Volume 119, at p. 14387, lines 10 - 20, Sunter said:

There is a stronger, positive association between DIFF and CONS than there is between DIFF and COMM. Now, let me say that my statistician's intuition tells me -- I don't have to justify this, it's just that one develops an intuition, and my statistician's intuition is surprised by this, if it really is the consultant who is in error -- sorry, if it is the committee who is in error. I would expect the associations to be somewhat different, but I am just speaking intuitively now.

773. Also in Volume 123, at p. 15046, line 19 to p. 15047, line 2, Sunter said:

I think he asked whether they were relevant tools in the context of what Dr. Shillington was doing in the IRR, and I said yes. You know, he was in a different situation, concerned with different things, and I would assume that he used both of those tests as a result of some kind of intuitive assessment -- which, under the circumstances, he was perfectly entitled to make...

774. And once again in Volume 217, at p. 28225, lines 9 - 23, Sunter remarked:

Typically, in decisions theory, with decisions, you associate losses and gains with various decisions, and how you make a decision is a consideration -- if you wanted to do it technically, you would have to go into all that stuff, and I am trying to skirt over it and say, I have no loss function to offer here. I don't know how you should make that decision. If you challenged me to come up with one, I suppose I could, a decision-making function here.

This is why I am not taking a position on it. Make the adjustment or don't make the adjustment -- it depends on your kind of intuitive decision-making process, but I am not about to make that decision for you.

775. Both statisticians agree statistical analysis can lend weight to the evidence even though it may not be conclusive in itself. Shillington discusses significant and non-significant results in terms of weak or strong evidence. In his opinion, a significant result is not conclusive in itself. It may, however, lead a statistician to conclude a hypothesis is suspect, or the statistician may draw an inference which casts doubt on the hypothesis. In Sunter's opinion, statistical analysis will lend weight to something which already seems plausible; the analysis can very seldom by itself provide plausible explanations. In fulfilling this limited role, we believe statistical analyses are appropriate and helpful. Therefore, we conclude statistics are ancillary to the primary function of the evaluators, which is to render a value judgment, and to that of the Tribunal, which is to determine the reliability of the results.

776. In Sunter's last appearance before the Tribunal, he agreed there were limitations to the applicability of statistics for the determination of the issues before the Tribunal. This is found in Volume 217, at p. 28301, lines 13 - 22:

MEMBER FETTERLY: I guess the point that I am trying to get at is this: Statistics don't necessarily tell us the whole story. I think you might agree with that, would you not?

THE WITNESS: Yes, I would agree with that as a general observation.

MEMBER FETTERLY: So we may have to consider other factors that perhaps are not within the realm of your speciality.

777. Sunter's tests help identify the statistically significant differences between the Wisner 222, the Willis 300 and the combined database (522) compared with the committee evaluations. Sunter interprets the differences as not having a consistent pattern. He found significant differences between the consultants and the committees in both studies in the male-dominated questionnaires, but more so in the Wisner 222 than in the Willis 300. The results of his tests identified differences found mainly in the higher end male-dominated positions and in a few higher end female-dominated positions. Overall, the female-dominated questionnaires had a lower distribution in value than the male-dominated questionnaires. We are mindful of the fact the differences with the female-dominated questionnaires were not statistically significant.
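
The record does not set out the exact procedures underlying Sunter's tests, so the following is a minimal sketch only of the kind of paired significance test that could flag such differences; the paired t-test and all of the data below are our own assumptions for illustration:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # Hypothetical paired evaluations of the same 222 questionnaires.
    committee = rng.normal(500.0, 120.0, 222)
    consultant = committee + rng.normal(5.0, 20.0, 222)  # small systematic shift

    # A paired test asks whether the mean consultant-minus-committee
    # difference could plausibly be zero.
    t_stat, p_value = stats.ttest_rel(consultant, committee)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

A small but systematic shift across many paired evaluations produces a significant result, whereas scattered, unpatterned differences do not; this is the sense in which such tests locate where committee and consultant scores diverge.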

778. Shillington provided an opinion regarding Sunter's analyses of other possible causes for the differences between the consultant and the committee scores, that is to say, other than gender differences. One of these analyses included comparisons to determine if the differences were associated with the relative distribution of questionnaires in the higher and lower point ranges. Contrary to Sunter's view, Shillington was of the opinion it would be very difficult to separate these two data analysis questions, that is, whether gender or some factor other than gender is the cause of those differences. On this point, Shillington says in Volume 131, at p. 16045, line 21 to p. 16046, line 21:

A. Yes, and the analysis that is behind that.

The regressions were done in a way to try to see if there was a relationship between the differences between the consultants and the committee in gender. It is also possible that any differences that might have existed between the consultant and the committee scores were not directly related to gender but perhaps were related to high values versus low values. This has been talked about here.

The confounding is introduced because there is a strong trend in the data for the male questionnaires to all have high values relative to the female and the female questionnaires have a fair tendency to come from the lower end of the spectrum, which means you cannot separate those two data analysis questions, or it is difficult to separate them.

THE CHAIRPERSON: What do you mean?

THE WITNESS: You can't separate the question whether or not a pattern is related to gender or whether or not it is related to whether or not the scores were high or low.

779. On the same topic, he says in the same volume at p. 16048, line 16 to p. 16049, line 11:

In this circumstance, back to the analysis of the Willis scores and the possible adjustment, we have a situation which -- to the extent that there is a pattern here, if someone came and said this is possibly not due to gender, maleness or femaleness, but rather could be due to professionalization or some questionnaires having much higher values than others, you would have a problem extracting those two separate hypotheses from the analysis because you have a situation in which the males predominantly had high values, the females predominantly had low values. So maleness is confounded with high and low values.

That is reflected in the distribution. That is why it is a distribution question. The distribution of the Willis scores for the males tended to be quite a bit higher than the distribution of the Willis scores for the females. It is a confounding issue. That is why in interpreting it you are going to have to be cautious about that.

780. In the end, Shillington suggests these analyses should be used with caution, and we refer to his response in Volume 131, at p. 16049, line 20 to p. 16052, line 7:

THE WITNESS: It is more of an interpretation issue and, I think can't be stronger than -- I am not Mr. Sunter, but I think that we have to make sure that when we use these analyses, because of the differences in the distribution, we have to be cautious.

THE CHAIRPERSON: For example, when we compare regression lines, we usually look at the differences -- or we have been looking at the wage gap using regression lines, for example, in calculating a distance between them. So you are comparing them to see what is the distance.

THE WITNESS: Yes.

THE CHAIRPERSON: That is what I think when somebody says to me that you can't compare these two regression lines. So when Mr. Sunter is saying that you can't compare these two regression lines, I am saying compare them for what? That is why I am a bit confused.

Are you saying you can't interpret them, meaning that because in the male regression line you have distributions of both, high and low distributions, but a tendency to be higher, whereas in the females you have a distribution of a low and high but a tendency to be lower, but when you interpret these lines you can't say it is definitely associated with a gender-related bias, for example?

Is that what you mean?

THE WITNESS: Yes. I think it is more of an interpretation of whether or not the patterns that you are seeing are clearly related to gender or whether or not those patterns are related to high score versus low score because they are, in the data, occurring together. The males are predominantly high score and the females are predominantly low score.

THE CHAIRPERSON: So it is not comparing them in terms of calculating a wage gap. Is it?

THE WITNESS: I think that is a different issue which we will get to, I think.

THE CHAIRPERSON: Okay. But just looking at these and what you can say about what they describe in terms of their distribution, what you can interpret from that is that the males tend to be high, the females tend to be low, but you can't, because of this confounding effect, you can't really interpret anything else with certainty. Is that ---

THE WITNESS: That is right. You have to be very careful when interpreting the results because you have to keep in mind that if somebody came with an alternative explanation for the data and the explanation was that this had nothing to do with gender, that this was high score/low score effects, you have collected your data in such a way that most of the high scores are males and most of the low scores are females. So they are two equally valid explanations for the same data.

I think it is a caution in interpretation that I think is reasonable.
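
The confounding Shillington warns of can be made concrete with a minimal simulation. The sketch below uses entirely hypothetical data of our own construction, not the study's data, and every variable name in it is an assumption introduced for illustration. The differences it builds depend only on point value, yet gender alone appears to "explain" them nearly as well, because gender and point value move together in how the data were generated:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical data: male-dominated questionnaires cluster at high
    # point values, female-dominated ones at low point values.
    n = 200
    male = rng.random(n) < 0.5
    score = np.where(male,
                     rng.normal(700.0, 60.0, n),   # male-dominated: high values
                     rng.normal(400.0, 60.0, n))   # female-dominated: low values

    # Consultant-minus-committee differences driven ONLY by a "value
    # effect" (point level), with no direct gender term at all.
    diff = 0.02 * (score - score.mean()) + rng.normal(0.0, 3.0, n)

    # Gender nonetheless correlates strongly with the differences,
    # because gender is confounded with high versus low point values.
    print("corr(diff, score):", np.corrcoef(diff, score)[0, 1])
    print("corr(diff, male): ", np.corrcoef(diff, male.astype(float))[0, 1])

On such data both correlations come out large, and an analyst looking only at the output could not say whether the pattern was a gender effect or a value effect. That is precisely the interpretive caution Shillington urges.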

781. Sunter conducted further analysis for presentation in reply. He refers to this analysis as his value effect analysis, which attempts to explain further the difference in treatment of high point value and low point value questionnaires. The two statisticians hold opposing views as to whether such questions as value effect and gender can be separated out or unconfounded. We note Shillington's warning to exercise caution when attempting to unconfound the data in these circumstances. However, the analysis is useful in demonstrating that the differences between the consultants and the committees occur at the high end of the point range. Having found the applicability of statistics for the determination of the issue before us to be supportive rather than definitive, we are not convinced as to the necessity for, or the validity of, Sunter's other conclusions pertaining to his value effect analysis. Moreover, Sunter's earlier work, which focused on identifying significant differences, remains helpful and useful in understanding where the differences occur between the committees and the consultants.

782. We will now address Willis' question (iv). The Wisner 222 was completed while the study was still ongoing. At that time, Willis did not perform any in-depth analysis to determine the reasons for the differences between the Wisner 222 re-evaluations and the committees' evaluations as he had done with the previous consultant re-evaluations of the MEC benchmarks. Willis would have preferred to proceed immediately with the second part of his plan, which was to do a larger study. He believed this further study was desirable because the Wisner 222 was inconclusive on the question of gender bias. It was recognized at the time by Wisner himself that there could be other plausible explanations for what he defined as an observed pattern of evaluation differences in the Wisner 222 (Exhibit PSAC-4). Wisner does in fact suggest positions in male-dominated classifications, with more complex duties and responsibilities, might have been the cause. He states:

...Because this is true, the observed pattern of evaluation differences could occur if the committees tended to under evaluate more complex positions, in relation to the MEC discipline as seen by the consultant.

(Exhibit PSAC-4, p. 8)

783. It should be noted that during the MEC's work Willis observed the MEC adopted what he described as a conservative discipline. This is evidenced by the reluctance of the MEC to evaluate jobs above a certain level. The Willis evaluation plan provided for varied levels of complexity in functional job knowledge, from level A through level G. According to Willis, the high G level is a level that presupposes ...a requirement for an expertise or command of a professional sphere of knowledge. (Volume 35, p. 4448).

784. During the operation of the MEC, Willis felt there were four or five questionnaires which should have been evaluated at the G level. Willis tried, at a special session with the MEC evaluators, to encourage them to promote jobs beyond the F level. Willis testified in Volume 35, at p. 4448, line 19 to p. 4450, line 24, about the phenomenon he observed:

A. Out of the ones that the Master Evaluation Committee had evaluated. As we got toward the end, in fact, I even had a special session with them to see if we could break out of the high F into the G level. It was an interesting phenomenon. They all realized the problem, but they just could not seem to select any jobs to promote above that F level.

In fact, I said let's just pick one -- I want the other committees to feel that they have a highly professional job with true expertise. I don't want them to feel they can't go beyond the F level. So, pick the strongest job you can. Let's see if we can't promote it to the G level. And they just couldn't do it.

This was of some concern to me. That was mitigated, however, for two reasons: (1) there were several jobs at the high F level. The point totals for the high F are the same as for the light ---

THE CHAIRPERSON: Excuse me, you were saying there were several jobs at the high F level?

THE WITNESS: Yes, the F leaning toward G.

If you recall from the evaluation system, the G on the light side leaning toward F has the same point total. So I was not concerned from the standpoint of the points. But since they were the committee that was setting the frame of reference for the other committees, I wanted them to be able to exercise that G level. That didn't happen with the Master Committee.

As it worked out, I was in counselling later with the evaluation committees. I explained the problem to them. I don't remember how many jobs they ultimately evaluated at the G level, but I understand that they did break through and they did evaluate some of the 4,000 at the G level.

Q. So was this tendency in the end something that you felt was beyond a concern?

A. The other mitigating factor was that even though they were very conservative here, this conservatism was consistent. Looking at the alignment, I felt that the internal alignment was still appropriate. So while they were very conservative at the top, it did not create, let's say, an inversion in the evaluation relationships.

There were so few jobs -- and I remember discussing this with Paul Durber after the study. They looked at those jobs that might have gone to a higher level and there were so few of them that they wouldn't have affected the results materially.

785. Early in the process, specifically with consultant re-evaluations of the MEC positions, there is evidence the consultants were evaluating differently than the committees. This first occurred when the Willis consultant, Drury, did her review of the MEC's evaluations at the MEC's own request. It also occurred later on, during Wisner's review of challenges to the MEC evaluations. Wisner's discipline was noted to be slightly more liberal than the MEC's. Willis testified to this effect in Volume 56, at p. 6940, lines 14 - 24:

Q. So, this goes back to your comment that Mr. Wisner was probably more liberal.

A. He was slightly more liberal, but that didn't bother me. I had a reason for not wanting to do the evaluations myself or to have Jan Drury do them, even though we had discussed the Committee, I was willing to accept the fact that Mr. Wisner's discipline might be slightly different. But it was the consistency in evaluation differences that I was looking for. So, Wisner made the best choice.

786. Willis was willing to acknowledge Wisner's discipline might be slightly different from the committees'. This did not concern him as long as there was no pattern in the differences.

787. Some evaluations were easier to do than others depending on the information in the questionnaire. Willis testified the responses from incumbents in female-dominated occupations were returned more quickly and contained better information than those from incumbents in male-dominated occupations. He was asked if this could have an effect on the reliability of the evaluations in a restricted sense. His response is contained in Volume 68, at p. 8575, lines 3 - 13:

Q. But what I am trying to get at here is: Could that affect the reliability of the evaluations based on, let's say, occupational groups? In other words, were you getting more reliable information from predominantly female groups and less reliable from predominantly male groups.

A. I haven't tested that, but I believe that is a possibility, certainly, since the quality of the information does generally tend to be better from female-dominated groups.

788. Willis further testified the questionnaires from incumbents in high level technical and professional jobs were slower to be returned and contained weaker information than questionnaires from incumbents in clerical and vocational jobs. In this regard, Willis explains that, generally speaking, the professional and technical jobs are more difficult for the evaluators to understand. He says in Volume 69, at p. 8582, lines 11 - 20:

THE WITNESS: Professional and technical level questionnaires would be less easy to understand than, say, trades or clerical.

MR. FRIESEN:

Q. And that is partly because they were not as well described in the information.

A. Partly, and partly because it is more difficult to understand a more complex job. [emphasis added]

789. Willis' opinion is verified by the testimony of at least two evaluators. Crich, a member of the first version of Committee #5, testified her committee had difficulty evaluating questionnaires from male-dominated occupational groups. In her view, this contributed to the problems experienced by that committee. We also have testimony from Latour, also a member of Committee #5, as to the difficulty this committee experienced in evaluating technical jobs.

790. For the most part, the QA Committee's work must be discounted in view of Willis' criticisms of their work. However, the evidence of two of the participants, Crich and Yates, merits consideration because it illustrates the difficulty experienced by the QA Committee members when evaluating the 25 male questionnaires identified from the Wisner 222 and their inability to achieve consensus in those cases.

791. By way of contrast, the consultants did not experience the same difficulty in evaluating the more complex questionnaires as did the committee members. The consultants had the benefit of professional job evaluation experience and training, which enabled them to evaluate those positions more easily than the committee evaluators. The fact the committee evaluators lacked that kind of professional expertise contributed, we believe, to the inefficiency of the job evaluation process and the lengthy discussions which took place during the evaluations.

792. Willis expressed a high regard for the competency and experience of his consultants in conducting pay equity job evaluations. Willis agreed his consultants were more liberal in evaluating higher level positions. Considering the consultants' experience, background and education, he also believed they probably had a better understanding of the higher level jobs than the committee members.

793. Illustrations provided by the Employer as to the effect of different treatment of female and male questionnaires on the wage gap were confirmed in Respondent Counsel's cross-examination of Sunter and Shillington. Different treatment (arising from gender bias) will have a direct impact on the wage gap. There are two distinct ways in which the wage gap will increase. An increase can occur if committees are under-evaluating male-dominated questionnaires. It can also occur when the committees are over-evaluating female-dominated questionnaires. In either case, the effect is the same. Expressed another way, the wage gap will be over-stated when either of these events occurs.
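
A hypothetical worked example, using numbers of our own choosing rather than figures from the evidence, illustrates the arithmetic. Suppose the male pay line fitted through the male-dominated questionnaires is

$$w_m(p) = 20{,}000 + 60p \qquad \text{(dollars at } p \text{ points)}$$

A female-dominated job paying \$44,000 and correctly evaluated at $p = 500$ points shows a gap of $w_m(500) - 44{,}000 = 50{,}000 - 44{,}000 = \$6{,}000$. If the committee over-evaluates that job at $p = 550$ points, the comparison point rises to $w_m(550) = \$53{,}000$ and the apparent gap grows to \$9,000. Symmetrically, if the committees under-evaluate the male-dominated questionnaires, the male wages are attributed to lower point values, the fitted male line sits higher at every point level, and the measured gap against the female jobs is again inflated. Either form of different treatment therefore overstates the wage gap.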

794. If the 2.3 per cent disparity between the committees and the consultants is attributable to gender bias, then it arises either because the evaluators were consciously or unconsciously treating male-dominated jobs less favourably than the consultants or, on the other hand, were over-valuing female-dominated questionnaires and were therefore biased against male-dominated questionnaires. Sunter's statistical analyses do not identify a preference for female-dominated questionnaires by the multiple evaluation committees. The IRR test results illustrate the majority of the outliers demonstrated a male preference yet, when the final committee evaluations are compared to the consultants' evaluations, the disparities are indicative of a bias against male-dominated jobs.

795. In determining why the differences occur, the Tribunal is entitled to look at some compelling facts. Most importantly, the MEC was conservative in its discipline relative to the consultants. Firstly, according to Willis, the MEC discipline was more accurate than the consultants', as reflected in his report to the JUMI Committee (Exhibit R-22) on the re-evaluations of the MEC evaluations which arose out of the IRR 103 challenges and the Treasury Board challenges. That report states in part:

We have no significant concerns regarding the MEC's understanding and application of the evaluation plan. The MEC's pattern of application of the evaluation plan to positions (their discipline) differs in some respects from the pattern which the consultants would use. However, given the manner in which the MEC membership was determined, their discipline constitutes a more accurate reflection of the values of positions as commonly understood within the Government of Canada than the consultant could determine from an outside point of view.

(Exhibit R-22, p. 8)

796. Secondly, the Willis consultants had an established discipline prior to the JUMI Study based on their experience in other studies. There is ample evidence from which to conclude the Willis discipline was more liberal than the MEC discipline. According to Owen, another Willis consultant, the Willis discipline influenced the consultants in their evaluations performed during the JUMI Study. The consultants were experienced and professional evaluators. They were more familiar with higher level jobs, both managerial and technical, a familiarity gained through previous pay equity exercises. The JUMI Study was the first time the consultants had done any evaluations in the Federal Public Service.

797. Thirdly, overall the evaluation committees followed the MEC's discipline. There were three or four occasions where the evaluation committees actually evaluated above the F level to the low G level. According to Willis, this by no means altered the MEC discipline.

798. Fourthly, outliers did not exert an observable influence on the committee evaluations, either in the MEC or in the multiple evaluation committees. The statistical evidence corroborates Willis' own conclusions that the outliers had no discernible effect on the evaluations of the other committee members.

799. Finally, both Willis and the evaluators testified the high level positions were difficult to evaluate. The distribution of questionnaires between male- and female-dominated occupational groups was not the same in terms of value. The more difficult questionnaires were in the high level male-dominated jobs, where the greatest difference between the evaluation committees and the consultants occurred.

VIII. CONCLUSION

800. In light of these facts, as well as other matters previously referred to by the Tribunal, it is reasonable to conclude that the disparity between the consultants and the committees resulted from, and is explainable by, the conservative discipline established by the MEC, the evaluators' inexperience and difficulty with evaluating high level jobs, and the very subjective nature of the exercise. We conclude this resulted in a phenomenon which manifested itself in a reluctance on the part of the MEC to attribute high scores to higher level questionnaires. Factors such as weak job information and difficulty in comprehending job information also contributed to this phenomenon. In applying the standard of proof of reasonableness required under s. 11 of the Act, it is reasonable to conclude the difference between the committees and the consultants was not a gender difference. We find as a matter of fact the disparities resulted from an inability and/or reluctance on the part of the evaluators to evaluate high level male-dominated jobs according to the discipline of the consultants.

801. The conservative mindset of the MEC evaluators was the origin of this phenomenon, which spread and continued throughout the work of the multiple committees. This conservatism had its most telling effect on the male-dominated jobs at the higher end of the scale.

802. During his testimony Willis was unable to give his unqualified support to the JUMI Study results. He was, however, of the opinion the results should not be trashed and that they could be accepted at face value or with some adjustments by the Tribunal. In view of Willis' discomfort, lingering questions remained about how well the process worked.

803. This hearing has spanned 232 days to date. The Tribunal was afforded a wide range of both expert opinion evidence and non-expert evidence, including anecdotal evidence. In addressing the issue of reliability, we are mindful of the large number of agreements between the consultants and the evaluation committees on the re-evaluations. The standard of proof in this case is one of reasonableness. We find, for the most part, the committees and the consultants were able to agree on the evaluation scores, except with the more complex, professional and technical jobs distributed at the high range of the male-dominated jobs. The phenomenon, beginning with the MEC, carried over into the multiple committee evaluations and was nourished by other factors, which contributed to the disparity between the consultants and the committees.

804. We find as a fact that the evidence establishes the evaluation results are sufficiently reliable, by any reasonable standard, as a basis on which to calculate the existence or otherwise of a wage gap between male and female employees employed in the same establishment who are performing work of equal value within the meaning of s. 11 of the Act and the Guidelines. The Employer has failed to provide any evidence which would cause the Tribunal to find otherwise or to change its decision.

Dated at Vancouver, British Columbia, this 19th day of January, 1996.

Donna Gillis, Chairperson

Norman Fetterly, Member

Joanne Cowan-McGuigan, Member

APPENDIX A - COMMITTEE MANDATES

1. Sub-Committee on a Common Evaluation Plan

(a) Committee Mandate

The official mandate of this sub-committee was to determine what evaluation plans to examine and make recommendations to the JUMI Committee at large.

2. JUMI Committee

(a) Committee Mandate

The task of the JUMI Committee was to develop agreed parameters under which equal pay for work of equal value, as incorporated in the provisions of s. 11 of the Canadian Human Rights Act, could be implemented and to prepare a detailed plan for its implementation covering that portion of the Public Service for which the Treasury Board represents the employer.

3. Sub-Committee on Communications Strategy

(a) Committee Mandate

The mandate of this sub-committee was to analyze communication alternatives and recommend the most effective ones for implementation.

4. Sub-Committee for Training

(a) Committee Mandate

The mandate for this sub-committee was to draft and recommend a training package for coordinators. This sub-committee later became the Administrative Sub-Committee.

5. Testing Sub-Committee on the Willis Evaluation Plan

(a) Committee Mandate

The main objective of this sub-committee was to present to the JUMI Committee recommendations related to:

  1. the modification or clarification of the definitions and the factors pertaining to the four evaluation charts of the Evaluation Plan.
  2. the choice between the Working Conditions Evaluation Chart No. 1 or 2.

6. Sub-Committee on the Willis Questionnaire

(a) Committee Mandate

The mandate of this sub-committee was to finalize the format and contents of the Willis questionnaire (including developing examples). The sub-committee was asked to review the questionnaire and ensure that its questions were sufficient to gather the necessary data.

7. Administration Sub-Committee

(a) Committee Mandate

The mandate of this sub-committee was to conduct examination and discussion of, and present recommendations and/or make decisions on, all matters related to the administration of the Equal Pay for Work of Equal Value Study, with the exception of those responsibilities assigned to the Equal Pay Study Secretariat. Specifically, this sub-committee:

  1. devised, implemented and monitored any administrative action required by JUMI;
  2. provided the EPSS with guidance regarding administrative issues;
  3. recommended to JUMI actions (to be) taken;
  4. ensured the smooth administrative operation of the Study, within the framework established by JUMI, through setting priorities, delegating work, resolving issues and assessing the progress of the Study; and
  5. co-ordinated required training for coordinators, evaluators, reviewers and secretaries.

8. Master Evaluation Committee

(a) Committee Mandate

The primary purpose of the Master Evaluation Committee (MEC) was to evaluate a representative sampling of positions and, in so doing, provide the frame of reference for the five evaluation committees (later expanded to nine) to rely on, so that at the conclusion of the position evaluation stage of the study all 4,400 position evaluations would relate to one another fairly and equitably. The mandate of the Master Evaluation Committee was to:

  1. establish benchmark position ratings for approximately 600 positions through initial evaluation of a representative number of positions sampled, and a frame of reference to guide subordinate evaluation committees in the evaluation process;
  2. provide advice and assistance to subordinate evaluation committees in particularly difficult evaluation cases;
  3. implement a monitoring system to ensure consistent and bias-free rating by subordinate evaluation committees; and
  4. as final authority, resolve controversial cases where an evaluation committee has made every effort to arrive at an agreed rating but has been unsuccessful in doing so.

9. Mini-JUMI Committee

(a) Committee Mandate

The mandate of the Mini-JUMI Committee was to deal with procedural problems arising from the study. Initially, the JUMI Committee dedicated a large amount of time to discussing procedural problems but eventually decided to create the Mini-JUMI Committee to deal with them.

10. Equal Pay Study Secretariat

(a) Committee Mandate

The Equal Pay Study Secretariat was a Joint Union/Management Secretariat. It was located in the Jackson Building and provided all administrative support to the evaluation process in the Study. The Chief was responsible for the co-ordination of all support activities and the effective communication of JUMI and Administrative Sub-Committee instructions.

11. Inter-Rater Reliability and Methodology Sub-Committee

(a) Committee Mandate

The mandate of this sub-committee was:

  1. to determine and make recommendations about the methodology and research necessary to test evaluation committee rater reliability; and
  2. to assess and make recommendations about research methodology as it applies to the JUMI Study as a whole.

12. Five Multiple Evaluation Committees

(a) Committee Mandate

The mandate of the five evaluation committees was to:

  1. evaluate approximately 750 positions each; and
  2. keep the Master Evaluation Committee abreast of their evaluation proceedings, results and issues, through chairpersons.

The five evaluation committees were reorganized into nine evaluation committees on April 14, 1989.

13. Inter-Committee Reliability Sub-Committee

(a) Committee Mandate

The mandate of this sub-committee was to:

  1. examine the results of the tests administered to the evaluation committees in relation to the baseline provided by the consultants;
  2. examine the baseline score provided by the consultants;
  3. determine the significant differences in the consensus ratings of the committees in relation to the benchmarks and the baseline;
  4. formulate, if needed, recommendations for training, re-training by the consultant and/or other courses of action for JUMI consideration; and
  5. identify procedural/process problems and potential for improvement, including revisions to the formulation of rationales.

14. Mini-MEC

(a) Committee Mandate

The Mini-MEC was charged with the task of reviewing the committee challenges to the MEC's evaluations. The JUMI Committee directed Johanne Labine of PSAC and Michel Cloutier of the Treasury Board, both of whom sat on the MEC, to review the working conditions of all 100 benchmarks for shift work, overtime and living conditions, assess the amount of points to be changed, if any, and correct rationales.

It was ultimately decided that the MEC would not be reconvened, and that a Mini-MEC, a nucleus, or a small number of evaluators from the MEC would undertake this exercise. There were two members from the MEC who were selected to represent this Mini-MEC, Michel Cloutier and Johanne Labine. The idea was that Mr. Willis would meet with the two of them and resolve any differences.

15. Sub-Committee on Total Compensation

(a) Committee Mandate

The draft terms of reference for this sub-committee as of September 21, 1989 were:

  1. To identify the elements of compensation in the Federal Government that comprise wages as defined in Section 11(6) of the Canadian Human Rights Act;
  2. To compile the data required to establish wages for the positions evaluated;
  3. To devise a method to cost total compensation for purposes of correcting any identified wage disparities.

16. Quality Analysis Committee

(a) Committee Mandate

Paul Durber of the Commission created the Quality Analysis Committee to examine the 25 male-dominated jobs noted as possibly undervalued in the Wisner report in May, 1990. The purpose of the committee was to shed light on whether the maleness of the jobs might help to account for their rating and whether, conversely, the differences between Mr. Wisner and the committees were due to simple perceptions of the work.
