A Fully Explicated Evidence-to-Conclusions Model for Two Child Custody Tests,

The Perception-of-Relationships Test (PORT) and Bricklin Perceptual Scales (BPS)

 

Barry Bricklin, Ph.D.

Adjunct Associate Professor

The Institute for Graduate Clinical Psychology, Widener University

 

Michael H. Halbert

Consultant to Management

Bala Cynwyd, PA

 

 

The following summary of research with the PORT and BPS was published as two separate articles.  One is called “Can Child Custody Data be Generated Scientifically?”  The second is called “Perception-of-Relationships Test and Bricklin Perceptual Scales: Validity and Reliability Issues.”

 

They appeared in the American Journal of Family Therapy.  Both came out in 2004, Volume 32, the first on pages 119-138, and the second on pages 189-203 (two separate journal issues).

 

THE ARTICLES ARE LONGER THAN THE SUMMARY PRESENTED HERE, AND CONTAIN BRAND NEW NORMATIVE, RELIABILITY AND VALIDITY DATA.

Abstract

Existing and new validity data on 3,880 cases from the Bricklin Perceptual Scales and Perception-of-Relationships Test address the assertion that custody data cannot be generated scientifically.  Reliability (93 percent stability over 8 months) and validity data (90 percent agreement with multiple independent criteria) are presented, but without a fully explicated chain linking evidence to conclusions, one ends up with unresolvable, typically all-or-none, disputations about the adequacy of one’s evidence.  This chain includes, among others, the (confusing) role of values in science, how system complexities profoundly affect measurement choices, and the value of information to a decision-maker.  Psychometric indices, the usual source of such arguments, cannot alone address any of these areas.  New research addresses differentiating test-retest changes that are due to errors of measurement from true changes in measured variables.

Can Child Custody Data Be Generated Scientifically?

The article begins by looking at whether it is not only practically but even theoretically possible to create child custody data scientifically.  Doubts have been expressed in both areas (Krauss & Sales, 2000, pp. 859-870; O’Donohue & Bradley, 1999, pp. 314-315).  It is argued that many, if not all, of these doubts stem from too narrow a definition of “science,” the complexities of creating system-specific measurements, and confusion about the unavoidable roles in science of value-driven choices that cannot, by themselves, be right or wrong.  System complexities impact the creation of suitable measurement units (ordinal and/or interval) as well as the choice of one’s reference standard (normative, criterion, single-participant).  Further, the challenges of validating system-specific data may be quite different from those that arise with non-system-specific targets.  Confusion also exists in regard to the value of using test data in real life.  Several points are made on the need, in evaluating scientific merit, to present a highly explicated chain of reasoning that links evidence to conclusions.  Presenting such a chain is all one can do to demonstrate the degree to which an approach deserves to be called “scientific.”  For one thing, it should be noted that custody assessment tools are particularly difficult to assess for merit, since a decision-maker cannot weigh their value by checking their data against those derived from a widely accepted model.  None exists (Krauss & Sales, 2000, pp. 845-870; 1999, pp. 88-90).  Another reason for the detailed chain is to avoid the kind of arguments exemplified by proponents and opponents of the Rorschach test (Exner, 2001, pp. 386-388; 2002, pp. 391-404; Ganellen, 2001; Garb, Wood, Lilienfeld, & Nezworski, 2002, pp. 455-457; Meyer, 2001, pp. 389-396; Weiner, Spielberger, & Abeles, 2002, pp. 7-12; Wood, Nezworski, Garb, & Lilienfeld, 2001, pp. 350-373) and those that involve basic disagreements about what the word “evidence” means in the phrase “evidence-based” (formerly, “empirically-validated”) practice (Anthony, Rogers & Farkas, 2003; Gonzales, Ringeisen & Chambers, 2002).  What one side offers as evidence is not viewed as evidence by the other side.

A Four-Tier Description of a Scientific Model

Our chain, spelled out by Piotrowski (1957, pp. 14-22), offers a simple but thorough description of a scientific model, similar to the one endorsed by Albert Einstein (1936).  There are four tiers.  The first consists of concepts.  “Intelligence,” “depression” and, it is later argued, a “tree” are concepts, that is, not completely definable by external sensory data.  The second consists of principles, which state the relations among concepts.  The third consists of empirical equivalents, which define what one looks for in the world of sensory experience to exemplify a concept; they are most frequently the hidden cause of unresolvable arguments.  The fourth is validation, which refers to the degree to which the relations among the empirical equivalents of the concepts correspond to the relations among the concepts as stated in the principles.  There are no a priori ways to determine whether empirical equivalents are well chosen, except by how well the four tiers work together to achieve some specific predictive goal.

The Often Confusing Role of Values in Assessing the Merits of Scientific Contributions

Many believe values (personal or group) somehow contaminate the “objectivity” of the scientific process; for example, that the value-driven nature of a best-interests determination makes it intrinsically impossible to approach it scientifically.  But all scientific endeavors are value-driven, and not simply as constructivists or postmodernists believe, but in more basic ways (Berger & Luckmann; Gonzales, Ringeisen, & Chambers, 2002, pp. 204-209).  Value-driven decisions are needed because “science” is not a closed system, that is, one that is logically complete and internally consistent, that possesses all the propositions and theorems needed to deduce all of its conclusions, and that can prove any statement in it true or false.

System Complexities and Measurement Choices

Most evaluators think of a system as an interactional model in which stable traits interact.  This can be seen in the way they conceptualize and write about their evaluations.  There are sections called “Mr. Jones,” “Mrs. Jones,” child “Mary Jones,” child “Sam Jones,” as though one can assess each element in a system as a separate entity and then somehow add up the parts.  In systems-based decisions, the elements of the system cannot be evaluated apart from the interactions of those elements within the system.  As people move in and out of systems, the relevant measurement reference standard can shift.  There are aspects of a custody evaluation in which it is helpful to know how Child 1 assigns value to his or her parents, which requires a single-participant reference (a child’s scores are compared with others of his or her own scores), in addition to how value would be assigned to a parent by comparing him or her with other parents, which requires a group reference.  Note also that system complexities can have profound effects on the choice of validating empirical equivalents.  The parent from whom a child seeks emotional closeness and/or active help can change dramatically depending on the family systems in which the child-parent interactions take place (Bricklin & Elliot, 2002[a]; 2002[b]).

Estimating Value to a Decision-Maker

Value is not inherent to data; it is evoked when the data are used to inform a decision.  Two inputs are required: the likelihood of the decision-maker choosing one or another course of action before having some particular information (in decision theory, called the “priors”), and how those likelihoods change as a result of the new information.  The value of information is the amount by which it reduces the likelihood of the decision-maker choosing any but the “best” course of action.  From the framework of this paradigm, when two experts disagree about the value of any particular instrument, they may both be right.  They may be making different assumptions about who is to use its measurement data, how they are to be used, what other information is available, and what cost estimates are involved.
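The definition above can be illustrated with a minimal sketch.  It is not taken from the articles: the function name, the three courses of action and all of the probabilities are hypothetical, and the sketch assumes that the “best” course of action is known for the sake of the illustration.  It simply computes how much a piece of information lowers the likelihood of choosing any but the best course, and shows how two experts who assume different priors can arrive at different values for the same instrument.

```python
from typing import Dict


def information_value(prior: Dict[str, float],
                      posterior: Dict[str, float],
                      best: str) -> float:
    """Reduction in the probability of choosing any course but `best`.

    `prior` and `posterior` map each course of action to the likelihood
    that the decision-maker chooses it before and after seeing the data.
    All names and numbers here are illustrative assumptions.
    """
    for label, dist in (("prior", prior), ("posterior", posterior)):
        if abs(sum(dist.values()) - 1.0) > 1e-9:
            raise ValueError(f"{label} probabilities must sum to 1")
    p_wrong_before = 1.0 - prior[best]
    p_wrong_after = 1.0 - posterior[best]
    return p_wrong_before - p_wrong_after


# Hypothetical posterior: after seeing the test data, the decision-maker
# chooses course "A" with probability 0.80.
after_test = {"A": 0.80, "B": 0.15, "neither": 0.05}

# Expert 1 assumes the decision-maker starts out nearly undecided.
undecided = {"A": 0.50, "B": 0.40, "neither": 0.10}
# Expert 2 assumes other available information already points to "A".
well_informed = {"A": 0.78, "B": 0.17, "neither": 0.05}

value_for_expert_1 = information_value(undecided, after_test, best="A")
value_for_expert_2 = information_value(well_informed, after_test, best="A")
print(round(value_for_expert_1, 2), round(value_for_expert_2, 2))
```

Under these assumed numbers the same data are worth a great deal to the first decision-maker and very little to the second, which is the sense in which two disagreeing experts may both be right.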

A Four-Tier Description of the Perception-Of-Relationships Test (PORT) and Bricklin Perceptual Scales (BPS)

Concepts and Empirical Equivalents of the Tests

The PORT and BPS illuminate the degree to which the exchange of information between a child and various caretakers leads to comfortable and effective behaviors on the part of the child within different family systems.  Such interchanges may involve the child’s seeking of psychological closeness and support or help with various tasks, or a parent’s teaching or modeling of essential behaviors (competency skills, trustworthiness, etc.).

Validation Issues (Principles)

Criterion data were selected that reflected how the exchange of information between a child and caretaker affected the comfort and effectiveness with which a child subsequently functioned.  We sought data that were gathered over lengthy periods of time (in our samples, 4 months to 7 years) and were derived from multiple and independent sources of information (ecologic validity).  The original PORT study used a one-way mirror setup where three trained psychologists observed children and their parents interacting in spontaneous, semi-structured and highly structured tasks.  A quantified observation protocol was used, which we believe is essential to validate a test instrument when the validating team lacks time-rich data gathered from multiple and independent sources.  Note well that large sample sizes, in and of themselves, do not necessarily contribute to specific predictive efficiency, unless one is simply estimating population means.  Further, tools that rely on large samples in the absence of time-rich data gathered from multiple and independent sources often depend upon “thick assumptive structures” (Bricklin & Halbert, in press).

The empirical equivalents of the original PORT validity criterion concepts (Bricklin, 1989; Bricklin & Bricklin, 1999, p. 340) were a large number of (mostly non-verbal) interchanges between parents and children observed from behind one-way mirrors by trained psychologists who concentrated primarily on how children reacted, and secondarily on what parents knew and how they behaved.

Interrater agreement among the mental health professional (MHP) observers was very high, in the 90 percent range, partly because the categories in which the data were summarized were formulated to match the requirements of the legal system (discussed later): A > B (or B > A); A ≅ B; neither A nor B is adequate.  Interrater agreement on the number of positive interactions noted was also good, in the 80 to 85 percent range.  We describe the statistical profile of our quantified observational protocol here, since it is essential to understanding the validity data.  Note, however, that these data were obtained with our specific procedures.  In our original two samples (n=60; n=37), the following data emerged.  In a one-hour session with both parents present, so that the child could choose with whom to interact, the distribution ranged from zero to 12 positive reactions per caretaker.  Six to nine positive interactions per caretaker characterized about 70 percent of the children in our samples.  Fewer than five positive interactions was rare, as were scores greater than nine.  The mean number of positive reactions was 7.4, with a standard deviation of 1.2.  These scores are not meaningfully comparable to the scores one would get if interest centered on counting the number of positive and/or negative interactions initiated by parents (Kerig & Lindahl, 2002; Lahey, Conger, Atkeson & Treiber, 1984).  In another study (Bricklin, 2003), we recorded the number of positive interactions among children and their caretakers in four groups.  Groups 1 and 2 contained “good enough” parents (Schutz, Dixon, Lindenberger, & Ruther, 1989, pp. 16-24).  Group 3 had “high conflict” disputants (caretakers who were engaged in continual litigation for two or more years) and Group 4 had caretakers whose parental rights had been terminated.  The number of positive interactions as measured by our observational system declines steeply across the four groups, with scores of 0, 1 or 2 earned only by some members of Group 3 and, mostly, by those in Group 4.

What Degree of Precision is Required of Validating Data?

All mental health professionals who offer validity designations are directed to use judgmental categories that reflect the rather narrow range of choices utilized by our legal system: A>B; B>A; A≅B; neither A nor B.

Statistical data gathered between 1961 and 1997 are presented first.  New data are given later.

PORT and BPS Normative Data, 1961-1997

PORT Normative Data (1961-1997), n=1,581

Sex: 797 females; 784 males

Age: Mean age 7.76; SD=0.17

SES: Low-Middle to High-Middle

Race: 98 percent Caucasian; 2 percent all other

BPS Normative Data (1964-1997), n=2,389

Sex: 1,202 females; 1,187 males

Age: Mean age 8.94; SD=2.40

SES: Low-Middle to High-Middle

Race: 98 percent Caucasian; 2 percent all others

PORT Validity Data (1961-1997), n=1,381

The percent-of-agreement rate is listed following the sample size.  Structured task problem-solving by children with access to both parents, observed from behind a one-way screen by three psychologists (1961), n=30, 90 percent; courtroom judges (1964-1981), based on all data available, n=45, 89 percent; agreement with BPS choices (1964-1981), n=23, 83 percent; courtroom judges (1981-1985), based on all data available, n=42, 95 percent; agreement with BPS choices (1981-1983), n=30, 84 percent; two psychologists, based on family therapy notes plus consultation with relevant therapists with families seen over two- to five-year intervals (1980-1985), n=30, 93 percent; courtroom judges (1986-1990), based on all data available, n=76, 93 percent; independent psychologists based on all clinical (except for PORT and BPS scores) and life-history data available (1995-1997), n=1,038, 89 percent.

BPS Validity Data (1964-1997), n=2,279

Agreement with PORT choices (1964-1981), n=23, 83 percent; two psychologists, based on family therapy notes plus consultation with relevant therapists with families seen over two- to seven-year intervals (1980-1983), n=21, 100 percent; courtroom judges (1980-1983), n=30, 90 percent; “Would” questionnaire choices (a “disguised” semi-projective test, asking what Mommy/Daddy would do in certain situations, e.g., “You get a bad mark on a test”) (1980-1983), n=23, 87 percent; PORT choices (1981-1983), n=30, 84 percent; courtroom judges based on all available information (1984-1990), n=179, 96 percent; independent psychologists based on all clinical and life-history data available (1988), n=141, 97 percent; independent psychologists based on all clinical and life-history data available (1992-1995), n=1,765, 88 percent; independent psychologists based on all clinical and life-history data available (1995-1997), n=67, 87 percent.
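For readers who want a single summary figure, the percent-of-agreement rates listed above for each test can be pooled by weighting each study by its sample size.  The sketch below is our own illustrative arithmetic, not a computation reported in the articles; the function and variable names are hypothetical, and pooling assumes, loosely, that studies with different validity criteria can be averaged and that their samples do not overlap.

```python
from typing import List, Tuple

# (sample size, percent agreement), transcribed from the study lists above.
Study = Tuple[int, float]

PORT_STUDIES: List[Study] = [
    (30, 90), (45, 89), (23, 83), (42, 95),
    (30, 84), (30, 93), (76, 93), (1038, 89),
]

BPS_STUDIES: List[Study] = [
    (23, 83), (21, 100), (30, 90), (23, 87), (30, 84),
    (179, 96), (141, 97), (1765, 88), (67, 87),
]


def pooled_agreement(studies: List[Study]) -> float:
    """Sample-size-weighted mean of the percent-of-agreement rates."""
    total_n = sum(n for n, _ in studies)
    if total_n == 0:
        raise ValueError("at least one study with a nonzero sample is required")
    return sum(n * pct for n, pct in studies) / total_n


port_pooled = pooled_agreement(PORT_STUDIES)
bps_pooled = pooled_agreement(BPS_STUDIES)
print(f"PORT pooled agreement: {port_pooled:.1f} percent")
print(f"BPS pooled agreement:  {bps_pooled:.1f} percent")
```

Because the largest samples dominate a weighted mean, the pooled figures mostly reflect the independent-psychologist studies; the smaller studies remain informative mainly as checks against different criteria.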

PORT/BPS Interrater Reliability

Interrater reliability of PORT scoring was obtained from two samples of seminar attendees (n=36; n=41) in which more than half of the scorers had no prior experience with the PORT.  Four different percent-of-agreement scores were obtained: (1) the points scored on Task I (the most complex task); (2) the parent-of-choice (POC) on Task I; (3) the overall Task-Difference Score (TDS) for all seven tasks; (4) the overall POC based on seven tasks.  The percent-of-agreement rates, respectively, were: 74; 90; 82; 92.  No interrater data for the BPS were gathered, since scoring it is mechanical and requires only the ability to read Arabic numerals and to recognize when one is larger than another.  It is also assumed that an evaluator can add and subtract numbers between zero and 32.

New Test-Retest and Validity Data: 1997-2002

Purposes

One purpose was to gather test-retest data with larger samples than had been used before.   Another was to formulate clinical hypotheses that could detect patterns that would red-flag test changes in the parent-of-choice (POC) over time and to investigate whether such changes should be considered errors of measurement or true changes in the measured variables.

How the Sample Was Formed and Validity Issues

Mental health professionals, abbreviated MHPs, were recruited from among those who had written or phoned the Professional Academy of Custody Evaluators (PACE) with a custody-related question between 1995 and 2002.  They were also recruited at seminars given by PACE.  Each MHP who made a validity criterion designation had to have either continual contact with the families of the tested children and opportunities to observe each child with his or her parents, or continual exchanges of information with an MHP who had such contact.  Each MHP who made an independent validity criterion designation was instructed to use all of the test, documentary, data-based observation protocol and other clinical/life-history information available (except for PORT or BPS scores).  This included numerous consultations with the MHP who had ongoing contact with each child and his or her family over the time spans involved.

The Participants

One hundred and twenty-seven children took the PORT at least two times, with a time span between Test One and Test Two of at least six months.  Ninety-three of these children also took the BPS.  The mean interval between Test One and Test Two turned out to be eight months, with none less than six months.  One group consisted of children who came from intact families.  There were 57 children in this group.  Fifty-four of the 57 were in some form of psychotherapy.  A second group was composed of two children whose parents were about to divorce, although the parents were still living together.  One of the two children was in psychotherapy.  A third group was made up of children whose pre-divorce parents were living separately.  Five children came from this group.  One of the five was in therapy.  The fourth group consisted of children whose parents had already divorced.  There were 63 children in this group, 16 of whom were in some form of psychotherapy.  The relative proportions of children across these groups in the BPS sample were essentially the same as for the PORT.

PORT and BPS Normative Data, 1997-2002

PORT Normative Data (1997-2002), n=127

Sex: 61 females; 66 males

Age: Mean age 7.87; SD=2.101

SES: Low-Middle to Upper-Middle

Race: 92 percent Caucasian; 8 percent all other

BPS Normative Data (1997-2002), n=93

Sex: 47 females; 46 males

Age: Mean age 7.88; SD=1.473

SES: Low-Middle to Upper-Middle

Race: 92 percent Caucasian; 8 percent all other

The Research Hypotheses, Their Rationales and the Criterion Validation Methodology

The PORT and BPS use ordinal and interval scales.  Although the BPS uses a point score to record how a child expresses a parent’s value to him or her in a single life area, these scores represent an ordinal scale, where the only meaningful information is A>B, B>A or A=B.  While Point Scores represent ordinal scaling, their use makes scoring each item easier and allows an evaluator at least to generate clinical hypotheses when the point spread between two caretakers is large, even though this spread does not represent true interval scaling.  Our practical and theoretical reason for using ordinal data for each BPS (and PORT) item was to discourage an evaluator’s tendency to be overly swayed by a parent’s favorable (or unfavorable) showing in but a single life area.  Further, it would be as difficult to develop a true interval scale to reflect “parental value” to a child as it would be to develop an interval scale to measure the value of different taste sensations.  The ordinal data are summed.  The PORT POC is based on parental value to a child in 7 family system areas, and the BPS POC in 32 childcare areas.  The test-designated parent-of-choice is the parent who has greater value to the child in the greater number of life areas.  This number is called the Item Difference Score (IDS) on the BPS and the Task-Difference Score (TDS) on the PORT, and represents the number of categories in which one parent has greater value to the child than the other.  Both the Item Scores and the Item Difference Scores on the BPS (a scale from zero to 32) and the Task and Task-Difference Scores on the PORT (a scale from zero to seven) represent interval scaling, and can be used as such.  For example, one could compute, for any group of children, the magnitudes of BPS IDSs and thus obtain a group reference standard that represents how children in general differentially assign value to their caretakers (see Tables 1 and 2).
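The scoring logic just described can be sketched in a few lines.  This is a minimal illustration under stated assumptions, not the published scoring procedure: the function name and the example scores are hypothetical, the difference score is read here as the absolute difference between the number of areas favoring each parent, and tied areas are assumed to count for neither parent.

```python
from typing import Sequence, Tuple


def parent_of_choice(scores_a: Sequence[int],
                     scores_b: Sequence[int]) -> Tuple[str, int]:
    """Return (parent-of-choice label, difference score).

    Each pair of point scores is reduced to ordinal information only
    (A>B, B>A or A=B); the size of the point spread is ignored.
    Assumption: the difference score is |areas won by A - areas won by B|.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both parents need a score for every area")
    wins_a = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    wins_b = sum(1 for a, b in zip(scores_a, scores_b) if b > a)
    difference = abs(wins_a - wins_b)
    if wins_a > wins_b:
        return "A", difference
    if wins_b > wins_a:
        return "B", difference
    return "neither", difference


# Hypothetical point scores in four life areas.  Parent B has one very
# large point spread (9 versus 3), but parent A is favored in three areas.
parent_a_scores = [5, 3, 4, 2]
parent_b_scores = [1, 9, 2, 1]
choice, difference_score = parent_of_choice(parent_a_scores, parent_b_scores)
print(choice, difference_score)
```

In this hypothetical example a single lopsided area does not determine the outcome, which is the behavior the ordinal item scoring is meant to encourage.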

Our paper addresses whether PORT concurrent validity data are similar to future validity data.  They are.

The paper addresses the stability of test-retest data on the PORT and BPS over an eight-month interval.  They are quite stable.  (Since the paper on which this descriptive summary is based has been submitted for publication, we cannot present the actual tables.  Test-retest stability for both tests over an eight-month interval is 97 percent.  Instability increases sharply as the TDS and IDS approach zero and one on the PORT, and zero, one, two or three on the BPS.)
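One way to examine the pattern described in the preceding paragraph is to compute the percentage of children whose POC is unchanged on retest, separately for narrow and wide difference scores.  The sketch below is only an illustration: the records are invented, the function names are hypothetical, and the cutoff separating “narrow” from “wide” scores is an assumption the user must supply.

```python
from typing import List, Optional, Tuple

# (POC at Test One, POC at Test Two, difference score at Test One)
Record = Tuple[str, str, int]


def percent_stable(records: List[Record]) -> Optional[float]:
    """Percent of records whose POC is the same on both tests."""
    if not records:
        return None  # no children in this stratum
    unchanged = sum(1 for first, second, _ in records if first == second)
    return 100.0 * unchanged / len(records)


def stability_by_margin(records: List[Record],
                        narrow_max: int) -> Tuple[Optional[float], Optional[float]]:
    """Stability for difference scores <= narrow_max and > narrow_max."""
    narrow = [r for r in records if r[2] <= narrow_max]
    wide = [r for r in records if r[2] > narrow_max]
    return percent_stable(narrow), percent_stable(wide)


# Invented retest records for six children (not data from the study).
example_records: List[Record] = [
    ("A", "A", 5), ("B", "B", 7), ("A", "B", 1),
    ("B", "B", 3), ("A", "A", 1), ("B", "A", 0),
]

overall = percent_stable(example_records)
narrow_pct, wide_pct = stability_by_margin(example_records, narrow_max=1)
print(overall, narrow_pct, wide_pct)
```

A marked gap between the two strata would suggest that POC changes concentrate where the child assigned nearly equal value to both caretakers at the first testing.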

Several hypotheses investigated factors that may influence a change in either a PORT or BPS test-retest POC.  Two central issues were involved.  One is the issue of whether such changes should be considered errors of measurement or true changes in the measured variables.  The second is a practical issue.  It was our hope to be able to alert evaluators to situations in which a change in POC might be expected.

Our results reveal several test/clinical signs that can help an evaluator to differentiate scenarios that are highly unstable from those that are not.

The data revealed the extent to which each PORT task reflects the child’s dynamics as he or she functions within specific family systems, for example, with each parent alone, with both parents together, and so forth.

The article concludes with a discussion of the decision paradigms we encounter in the real psycholegal world as regards the resolution of disputed custody issues.  In general, they reflect the three reference standard categories discussed.  The first category, the criterion reference, is exemplified by a judge who believes and acts upon the premise, inter alia, “Fathers should play an important and equal-to-the-mother role in raising children almost regardless of the facts of the instant case.”  The second category, the group reference, is exemplified by those who base their proffered information and/or conclusions mainly on data derived from individual measurements compared to those of previously examined groups.  The third category, the single-participant reference, is exemplified by those who derive their main information from data that point to “parental value to this particular child functioning within specific family systems.”  There is a fourth category, inhabited, so far as we can determine, by an ultra-cautious group of academics, who not only refuse to address ultimate legal issues (as do many of us), but refuse to offer any information at all except pointers derived from traditional psychological sources on areas needing improvement.

One could probably offer a good argument for any (or all) of these positions.  We have presented information on the importance of including in the decision process, along with the others, data from a single-participant reference standard, our attempt to measure specific parental value for a particular child in different family systems.