

Method

Participants and context

Science education in Panama emphasises environmental education to strengthen younger generations' awareness of their responsibility for the careful management of the country's rich natural and environmental resources. It also currently aims to join the worldwide trend of teaching epistemology and NOS issues at all levels of the educational system by promoting in-service teacher training in NOS.

Within this framework, the present study was made possible by the voluntary collaboration of many Panamanian students and teachers across the whole country, who were invited to participate by SENACYT (the governmental agency for science and technology). Each participant was randomly and anonymously assigned one of two questionnaires (Form 1 and Form 2), which a research team had constructed for a research project spanning several countries; the two forms jointly cover the entire theoretical framework of VOSTS (see the following subsection), and splitting the items between them keeps each form short enough to avoid respondent fatigue. The present study analyses the responses to the seven multiple-rating items on the epistemology of science included across the two forms; Form 1 and Form 2 were validly answered by 1887 and 1885 participants, respectively.

The study targeted four population groups: high-school students preparing to start university or first-year undergraduates (typically 17-18 years old on average, labelled here "young students", 61%), final-year undergraduates or recent graduates ("veteran students", 24%), teachers in initial training ("pre-service teachers", 7%), and practising teachers ("in-service teachers", 24%). These percentages sum to more than 100% because the groups are not mutually exclusive and many individuals belonged to two of them (e.g., an undergraduate preparing to be a teacher was counted both in a student group and in the pre-service teacher group).

Ages ranged from 16 years (youngest student) to 70 years (oldest teacher), and the gender distribution was roughly even (men: 52%; women: 48%). The in-service teachers had about 15 years of teaching experience on average, distributed over the primary (8%), secondary (30%), and university (62%) levels. Every group contained both science (66%) and non-science, i.e. humanities, (34%) specialities. The large sample is representative of the Panamanian student and teacher populations (confidence level = 95.5%; margin of error e < ±3%; p = q = 50%).
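As a quick arithmetic check (our illustration, not part of the original study), these sampling parameters can be plugged into the standard sample-size formula for estimating a proportion, n = z²pq/e²; with z ≈ 2 for a 95.5% confidence level, roughly 1111 respondents are required, well below the number of valid responses obtained per form:

    # Required sample size for estimating a proportion (illustrative check;
    # the values below restate the sampling parameters reported above).
    z = 2.0        # z-score for a 95.5% confidence level (about 2 sigma)
    p = q = 0.50   # maximum-variance assumption (p = q = 50%)
    e = 0.03       # margin of error (+/- 3%)

    n_required = (z ** 2) * p * q / (e ** 2)
    print(round(n_required))  # ~1111, below the ~1885 valid responses per form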

Research instrument

Seven NOS items on epistemology were drawn from the pool "Views on Science-Technology-Society" (VOSTS; Aikenhead, Ryan & Fleming, 1989) after faithful translation and careful adaptation to the Spanish language and cultural context.  The original VOSTS items were developed empirically from students' and teachers' open responses and interviews, which researchers then systematized into a multiple-choice format (Aikenhead & Ryan, 1992); they argue that this empirical process overcomes the "immaculate perception" objection and is equivalent to pilot testing, thus endowing the instrument with intrinsic construct validity.  Further, Lederman, Wade and Bell (1998) consider VOSTS a valid and reliable instrument for investigating standpoints concerning the NOS. Additionally, empirical reliability was first established by Botton and Brown (1998) using a unique-response model (respondents make a forced choice of one sentence in each item).

The seven items used in this study have a multiple-rating format (see Appendix B).  The item stem presents an epistemology issue and is followed by several sentences, each developing a reason for a particular position (belief) on the stem issue and labelled A, B, C, D, E, and so on. Both the stem and the sentences use simple, common, non-technical language, as they were empirically developed from students' answers. Further, the set of sentences does not reflect any particular philosophical standpoint; rather, it tries to cover a wide range of different positions, which supports a balanced evaluation of respondents' conceptions for each item.  The set of a respondent's indices on the sentences yields his or her evaluation profile on the item issue. In total, respondents rated 36 epistemological sentences (16 for the three items of Form 1, and 20 for the four items of Form 2) (Appendix B).

The items are labelled with a five-digit number, and each sentence within an item is identified by the item number plus the letter marking the sentence's position within the item (e.g., 90521D means Sentence D within Item 90521).  Some sentences carry an additional code _C_ appended to the tag number, indicating that a group of expert judges strongly agreed on the category assigned to that sentence; details are given elsewhere (Vázquez, Manassero & Acevedo, 2006).
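For illustration only, a small helper (hypothetical; not part of the original instrument) can split such tags into their components:

    # Hypothetical parser for sentence tags such as "90521D" or "90521D_C_",
    # following the tag layout described above.
    def parse_tag(tag):
        """Return the item number, sentence letter, and optional suffix code."""
        item = tag[:5]        # five-digit item number, e.g. "90521"
        letter = tag[5]       # letter of the sentence position, e.g. "D"
        suffix = tag.split("_")[1] if "_" in tag else None  # e.g. "C"
        return item, letter, suffix

    print(parse_tag("90521D"))     # ('90521', 'D', None)
    print(parse_tag("90521D_C_"))  # ('90521', 'D', 'C')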

Procedures

The multiple-rating response model and the methodological approach that enables its application to evaluating NOS conceptions were developed through a series of prior stages (which, despite their sequence, do not attempt to emulate any naïve step-model of the scientific method):

  • The translation and adaptation of items through back-translation processes into Spanish ("Questionnaire of Views on Science, Technology, and Society", Spanish acronym COCTS).
  • The analysis of the limitations of the single-response model, which VOSTS users had extensively applied, regarding validity and the information conveyed by the scores (Vázquez & Manassero, 1999).
  • The scaling of the sentences into a 3-category scheme (Adequate, Plausible, or Naïve) by a panel of expert judges (Vázquez, Manassero & Acevedo, 2006), according to the agreement between each sentence's content and the current views of scholars in history, philosophy, and sociology of science and technology (HPSST). The scheme is similar to those suggested by other researchers (Rubba, Schoneweg-Bradford & Harkness, 1993; Tedman & Keeves, 2001): Adequate (A) – the sentence expresses a fully acceptable NOS conception; Plausible (P) – although not totally adequate, the sentence expresses some acceptable aspects; and Naïve (N) – the sentence expresses a conception that is neither adequate nor plausible. The final letter of each sentence tag codes its assigned category (e.g., 90521D_A_ means that Sentence D of Item 90521 belongs to the Adequate category).
  • The design of a new multiple response model (MRM) in which the respondents rate their degree of agreement with all the sentences within the item. The MRM avoids the disadvantages of the "forced" choice and maximizes the information about the respondent's thinking about the question issue (Vázquez & Manassero, 1999).
  • The construction of a metric that provides a normalized, homogeneous, and invariant index in [-1, +1] for each sentence, whose value is computed from the respondent's rating and the sentence's category. The sentence indices are then averaged to compute item indices that efficiently summarize the respondent's conception of an item (Manassero, Vázquez & Acevedo, 2003a, 2003b).

Response and metric

The MRM asks participants to express their agreement or disagreement with each item sentence on a nine-point scale (1 to 9, from disagreement to agreement).  A respondent who does not wish to answer may choose one of two reasons ("I do not understand the issue" or "I do not have sufficient knowledge about the issue") or leave the sentence blank, so the MRM never forces a choice.

Each raw rating score (1-9) is transformed into a homogeneous, invariant, normalized response index in the interval [-1, +1] through a scaling procedure that takes into account the category of the sentence (Adequate, Plausible, or Naïve) previously assigned by the panel of expert judges (further details in Vázquez, Manassero & Acevedo, 2006).  For instance, an adequate sentence expresses an appropriate view on the issue, so the scaling assigns the top index score +1 to total agreement (9), the bottom score -1 to total disagreement (1), and proportional values to the in-between scores.  A naïve sentence expresses a view that is neither adequate nor plausible, so its scaling is the inverse of that for adequate sentences.  For a plausible sentence, the scaling assigns +1 to the middle raw score (5), -1 to the two extremes (1 and 9), and proportional values in between (see Table 1).  This three-category scaling of Likert multiple-rating responses not only avoids forced choices but also discourages guessing the "right" response pattern (Eagly & Chaiken, 1993), and it is similar to the three-point rubric recently applied in other papers (e.g., Akerson & Donnelly, 2010).

Table 1: Correspondence between the respondent's direct score on each sentence and the normalized index produced by the scaling transformation, according to the sentence's category (Adequate, Plausible, or Naïve); the index represents the standardized value of the respondent's belief about the sentence.

Direct score scale [1-9] (respondent's degree of agreement):

                           9        8        7        6        5        4        3        2        1
                       Total     Near     High  Partial  Partial  Partial      Low     Near     Null
                                total              high               low               null

Scaling transformation into normalized indices [-1, +1] (standardized value of belief), by category of the sentence:

Adequate                   1     0.75     0.50     0.25        0    -0.25    -0.50    -0.75       -1
Plausible                 -1    -0.50        0     0.50        1     0.50        0    -0.50       -1
Naïve                     -1    -0.75    -0.50    -0.25        0     0.25     0.50     0.75        1
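As a concrete illustration, the piecewise-linear scaling of Table 1 can be written as a small function (a minimal sketch; the function name and input handling are ours):

    # Scaling of a raw agreement score (1-9) into the normalized index [-1, +1]
    # according to the sentence's category, reproducing Table 1.
    def sentence_index(raw, category):
        if not 1 <= raw <= 9:
            raise ValueError("raw score must be in 1..9")
        if category == "Adequate":    # linear: 1 -> -1, ..., 9 -> +1
            return (raw - 5) / 4
        if category == "Naive":       # inverse of Adequate: 1 -> +1, 9 -> -1
            return (5 - raw) / 4
        if category == "Plausible":   # tent-shaped: 5 -> +1, 1 and 9 -> -1
            return 1 - abs(raw - 5) / 2
        raise ValueError("unknown category")

    assert sentence_index(9, "Adequate") == 1.0
    assert sentence_index(8, "Plausible") == -0.5   # matches Table 1
    assert sentence_index(2, "Naive") == 0.75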

The value of the index represents the degree of match between the respondent's opinion, as originally expressed through the raw agreement scores, and the current views of HPSST experts.  The higher (lower) the index, the stronger (weaker) the match, regardless of the category of the sentence (this is the invariance property of the indices).  Thus, the closer an index is to the maximum positive value (+1), the better informed (closer to the experts' views on NOS) the respondent's conception is, whereas the closer it is to the maximum negative value (-1), the more misinformed (distant from current NOS conceptions) it is.  As misinformed conceptions correspond to the lowest negative index values and informed conceptions to the highest positive values, for brevity they will simply be referred to as "negative" and "positive", without implying any bias in the meaning of these words.

The sentence indices form the basis for further computations and statistical analyses.  For instance, the average of an item's sentence indices yields the global weighted item index, which constitutes a quantitative evaluation of the respondent's overall conception of the item issue.
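Continuing the sketch above, the item index can be computed as the mean of the sentence indices (here a plain, equally weighted average for illustration; the exact weighting of the original metric is described in the cited papers):

    # Item index as the average of its sentence indices (equal-weight sketch;
    # reuses sentence_index() from the sketch after Table 1).
    def item_index(ratings):
        """ratings: list of (raw_score, category) pairs, one per rated sentence."""
        indices = [sentence_index(raw, cat) for raw, cat in ratings]
        return sum(indices) / len(indices)

    # Example: a three-sentence item rated 9, 5 and 2.
    print(item_index([(9, "Adequate"), (5, "Plausible"), (2, "Naive")]))  # ~0.92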

Table 2. Item labels and issues about the nature of science displayed across the two questionnaire forms (Form 1 and Form 2) together with their reliability parameters.

Form 1 (F1) items                          Reliability (a)    Form 2 (F2) items                                           Reliability (a)
F1_90211 scientific models (inference)          0.611         F2_90111 observations (theory-laden)                             0.477
F1_90411 tentativeness                          0.520         F2_90311 classification schemes (inference)                      0.661
F1_90621 scientific method                      0.586         F2_90521 role of assumptions (hypotheses, theories, laws)        0.649
                                                              F2_91011 epistemological status (creativity)                     0.684

Note: (a) Cronbach's alpha.

The reliability coefficients (Cronbach's alpha) computed from the raw scores of all sentences are excellent (0.922 for Form 1 and 0.925 for Form 2). The reliability coefficients of the seven epistemological items (each computed from the few sentence scores belonging to that item) are, as expected, lower (Table 2), due to the mechanical effect of the sharp reduction in the number of sentences contributing to each single-item coefficient.
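For reference, Cronbach's alpha can be computed from a respondents-by-sentences matrix of raw scores with the standard variance formula (a generic sketch, not the authors' actual computation):

    import numpy as np

    # Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals).
    def cronbach_alpha(scores):
        scores = np.asarray(scores, dtype=float)
        k = scores.shape[1]                           # number of sentences
        item_vars = scores.var(axis=0, ddof=1).sum()  # sum of per-sentence variances
        total_var = scores.sum(axis=1).var(ddof=1)    # variance of respondents' totals
        return k / (k - 1) * (1 - item_vars / total_var)

The formula also makes the point above visible: with fewer sentences (smaller k), alpha tends to shrink even when the sentences are just as internally consistent.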

Statistical analysis

The indices provide a homogeneous, invariant, and normalized meaning for the scores across all sentences and items; for example, they measure the magnitude of the correctness of a conception.  The index scores allow various further computations: averaging and relating different variables, applying inferential statistics for hypothesis testing and group comparisons, or establishing cut-off points that define different achievement levels (Vázquez, Manassero & Acevedo, 2006). Inferential results are usually analysed in terms of probabilistic measures (p-values) of the significance of any apparent differences found.

Given that p-values carry no information about the relevance of the magnitude of a difference, the effect size (the difference between means expressed in standard-deviation units) is often used to this end.  The resulting values are usually interpreted on the basis of some simple criterion (Cohen, 1988), classifying differences into intervals labelled trivial (d < .10), small (.10 ≤ d < .20), medium (.20 ≤ d < .50), large (.50 ≤ d < .80), and so on.  For a large sample, an effect size over 0.30 usually corresponds to statistically significant differences (p < .01). Therefore, we shall henceforth use the term "relevant" for differences that satisfy both an effect size over 0.30 and statistical significance (p < .01), which means they have some practical educational value. Differences that do not pass this threshold will be considered "irrelevant", even though they might still be statistically significant or interesting from other perspectives (e.g., personnel evaluation).
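As a worked example (with made-up numbers), Cohen's d with a pooled standard deviation can be computed as follows; a difference of 0.15 index points between two groups with a common SD of 0.40 gives d ≈ 0.38, which would count as "relevant" here provided it is also significant at p < .01:

    import math

    # Cohen's d with a pooled standard deviation (standard formula).
    def cohens_d(mean1, mean2, sd1, sd2, n1, n2):
        pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
        return (mean1 - mean2) / math.sqrt(pooled_var)

    print(round(cohens_d(0.25, 0.10, 0.40, 0.40, 900, 900), 2))  # 0.38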

