Chapter 8
Quantitative Data Collection Techniques
Revised - 11 November 2002
- Technical characteristics
- Validity: the extent to which inferences made from the results are appropriate,
meaningful, and useful 8.4, 8.5, 8.6, 8.8
- Two types of inferences commonly found in education
- Assessing achievement and thus focusing on the extent to which the
content of a test represents the larger domain of content that could have
been examined
- Assessing abstract traits that tend to be highly related to achievement
(e.g., intelligence, creativity, self-esteem, attitudes, reasoning, etc.) and
thus focusing on the extent to which the construct is represented
- Types of evidence
- Evidence based on test content: the extent to which the test covers the
content it is supposed to cover 8.7
- Types
- Face validity: a cursory examination of the content of a
test
- Content validity: a systematic examination of the content
of a test and the level of cognition at which it is tested
(see Bloom's Taxonomy 8.2)
- Estimated by creating a table of specifications and mapping
items into it 8.21
- Evidence based on response processes: the extent to which
observations of performance strategies or responses to specific tasks
are consistent with what is intended to be measured
- Examples
- Solving a mathematical problem demonstrating
mathematical reasoning
- Performing a complicated physical activity like designing
a floor plan or performing a chemistry experiment
- Estimated by examining respondents' explanations and patterns
of responses
- Evidence based on internal structure: the extent to which different parts
of an instrument and the items related to these parts are related in
prescribed manners (i.e., construct validity) 8.9
- Example: A "Survey of School Attitudes" scale might be
developed around four dimensions: curriculum, teachers and
administrators, physical environment, and social relationships
among students. Items written to each dimension should be
unique to that dimension and strongly related to one another.
Each dimension should represent a unique aspect of the
construct of school attitudes and therefore not be related highly
to the other dimensions.
- Estimated statistically by inter-item correlations and factor
analyses of dimensions
- Evidence based on relations to other variables: the extent to which
scores on a predictor measure relate in a predictable manner to scores
on a criterion measure
- Convergent and discriminant evidence
- Convergent evidence: scores on the predictor measure
relate positively to those on the criterion measure (e.g.,
ACT scores are related to freshman GPAs, GRE scores
are related to graduate school GPAs)
- Discriminant evidence: scores on the predictor measure
relate negatively to those on the criterion measure (e.g.,
frequency of counseling sessions and disruptive
behavior, lesson effectiveness and off-task behaviors)
- Estimated with correlation coefficients
- Criterion-related evidence
- Predictive validity: scores on a predictor measure are
collected first while scores on the criterion measure are
collected at some point in the future (e.g., ACT scores
are collected during the student's senior year in high
school while freshman GPA is collected after the student
completes the first year of college)
- Concurrent validity: scores on the predictor and criterion
measures are collected at the same time (e.g., a
teacher's perception of her effectiveness is collected at
the same time as an observer's ratings of that teacher's
effectiveness are made)
- Estimated with correlation coefficients
- Importance of validity evidence
- A researcher must establish validity for the measures being used
- Validity evidence is a matter of degree, not presence or absence
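The outline notes that evidence based on relations to other variables is estimated with correlation coefficients. A minimal sketch of that calculation follows, assuming a Pearson correlation is the coefficient of interest. The function name `pearson_r` and the ACT and GPA values are hypothetical, invented only for illustration; they are not data from the chapter.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical predictor (ACT) and criterion (freshman GPA) scores.
act_scores = [18, 21, 24, 27, 30]
freshman_gpa = [2.4, 2.9, 2.8, 3.4, 3.6]

validity_coefficient = pearson_r(act_scores, freshman_gpa)
print(round(validity_coefficient, 2))
```

Whether the result counts as predictive or concurrent evidence depends only on when the criterion scores were collected, not on the arithmetic.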
- Reliability: the extent to which the results are consistent, that is, they are similar over
different forms of the same instrument or occasions of data collection 8.4, 8.5, 8.6, 8.8
- Conceptual formula: Obtained Score = True Score + Error
- Sources of measurement error in test construction and administration
- Changes in time limits
- Changes in directions
- Different scoring procedures
- Interrupted testing session
- Race of the test administrator
- Time the test is taken
- Sampling of items
- Ambiguity in wording
- Misunderstood directions
- Effect of heat, light, ventilation, etc. in the testing situation
- Differences in observers
- Sources of measurement error associated with the person taking the test
- Reactions to specific items
- Health
- Motivation
- Mood
- Fatigue
- Luck
- Fluctuation in memory or attention
- Attitudes
- Test-taking skills
- Ability to comprehend instruction
- Anxiety
- Types of reliability estimates
- Stability: taking the same test on two occasions (i.e., test-retest)
- Equivalence: taking two different forms of the same test at the same time
(i.e., parallel forms)
- Equivalence and stability: parallel forms one of which is taken at one
time and the other at a later time
- Internal consistency: artificially splitting a single test into two halves
- Split-half: literally any combination of halves of the test (e.g.,
odd-even, first half-second half, etc.)
- Kuder Richardson formulae: statistical formulae estimating the
average of all combinations of split-halves
- KR 20: uses test and item statistics in a complicated
formula
- KR 21: uses test statistics only but under-estimates the
KR 20
- Cronbach alpha: similar to the KR 20 but applicable to non-dichotomous responses (e.g., Likert scales, Semantic
Differential scales, etc.)
- Agreement: extent to which two or more people agree about what was
observed or rated (i.e., inter-rater reliability)
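The internal consistency and agreement estimates listed above can be sketched in a few lines. This is an illustration under stated assumptions, not the chapter's own procedure: it uses population variances (textbooks differ on this point), assumes the total scores are not all identical, and uses a small made-up response matrix. The function names and the `responses` data are hypothetical.

```python
from statistics import mean, pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha; scores holds one row of item scores per respondent."""
    k = len(scores[0])
    item_variances = [pvariance(col) for col in zip(*scores)]
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def kr20(scores):
    """KR 20 for dichotomous (0/1) items, using item and test statistics."""
    k = len(scores[0])
    # p is the proportion answering each item correctly; q = 1 - p.
    sum_pq = sum(mean(col) * (1 - mean(col)) for col in zip(*scores))
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum_pq / total_variance)


def kr21(scores):
    """KR 21, which needs only the test mean, variance, and item count."""
    k = len(scores[0])
    totals = [sum(row) for row in scores]
    m = mean(totals)
    return (k / (k - 1)) * (1 - m * (k - m) / (k * pvariance(totals)))


def percent_agreement(rater_a, rater_b):
    """Proportion of observations on which two raters gave the same rating."""
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return matches / len(rater_a)


# Hypothetical right/wrong (1/0) responses: 5 students by 4 items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print(kr20(responses), kr21(responses), cronbach_alpha(responses))
print(percent_agreement([1, 2, 3, 4], [1, 2, 3, 5]))
```

For this toy data KR 20 and alpha both come out at 0.80 and KR 21 at about 0.67, consistent with the point that KR 21 under-estimates KR 20 when item difficulties differ.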
- Interpreting reliability coefficients
- All reliability coefficients range from 0 to 1; the higher the coefficient,
the greater the reliability
- Factors positively affecting the reliability coefficient
- Heterogeneity of the group taking the test
- Greater number of items
- Greater variability in scores
- Moderate levels of item difficulty
- Items that discriminate effectively between high and low
achievers
- Importance of reliability
- A researcher must establish reliability for the measures being used
- Reliability is a necessary, but not sufficient, condition for validity
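One way to see why a greater number of items tends to raise the reliability coefficient is the Spearman-Brown formula, which the outline does not name and which is brought in here only as an illustration. It assumes the added items are comparable to the existing ones. The function name `spearman_brown` and the starting reliability of 0.50 are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by length_factor.

    length_factor = 2 means doubling the number of comparable items.
    """
    r = reliability
    k = length_factor
    return (k * r) / (1 + (k - 1) * r)


# Hypothetical test with reliability 0.50, lengthened 1x, 2x, and 4x.
for factor in (1, 2, 4):
    print(factor, round(spearman_brown(0.50, factor), 2))
```

With a length factor of 2 the same formula is commonly used to step a split-half correlation up to an estimate for the full-length test.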
- Tests
- Cognitive tests: measurement of student cognition
- Types of tests
- Standardized: tests which are characterized by prescribed methods for
administering, scoring, and interpreting scores 8.11
- Typically developed in very meticulous, rigorous ways and
therefore technically sound
- Broad based applications and uses
- Teacher-made: tests which are developed by teachers for use in their
classrooms
- Typically less technically sophisticated than standardized tests
- Specific applications and uses
- Content of tests
- Achievement: measurement of what a student knows
- Aptitude
- Measurement of a student's potential to know
- Usually used to predict future performance
- Interpretation of test scores 8.12
- Norm-referenced: scores are interpreted relative to the scores of others
taking the test (i.e., a norming group)
- Johnny performed better than 95% of the other students
- NRT scores include percentiles, stanines, CEEB scores (i.e.,
ETS scores), etc.
- Criterion-referenced: scores are interpreted relative to what the student
knows
- Sally knows how to add, subtract, and multiply, but she does not
know how to divide
- Typically interpreted relative to some performance standard
(e.g., pass or fail, competent or incompetent, etc.)
- ERIC Test Locator 8.3
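The difference between norm-referenced and criterion-referenced interpretation can be sketched with the same raw score read two ways. The sketch assumes a percentile rank defined as the percentage of the norming group scoring below the student, and a simple cut score as the performance standard. The function names, the norming group, and the cut score of 70 are all hypothetical.

```python
def percentile_rank(score, norm_scores):
    """Percentage of the norming group that scored below the given score."""
    below = sum(1 for s in norm_scores if s < score)
    return 100.0 * below / len(norm_scores)


def meets_standard(score, cut_score):
    """Criterion-referenced reading: did the student reach the standard?"""
    return score >= cut_score


# Hypothetical norming group with scores 1 through 100.
norming_group = list(range(1, 101))
student_score = 96

# Norm-referenced: standing relative to others who took the test.
print(percentile_rank(student_score, norming_group))

# Criterion-referenced: standing relative to a fixed standard.
print(meets_standard(student_score, cut_score=70))
```

The first result changes if the norming group changes; the second depends only on the standard.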
- Alternative assessments: measurement of student performance and achievement in
"authentic" contexts (e.g., giving a speech, conducting a science experiment, writing an
original short story, etc.)
- Performance assessments: measuring student proficiency in cognitive skills by
directly observing how a student performs the skill 8.13
- Represents a holistic perspective on the skill
- Performed in an "authentic" context
- Criterion-referenced in orientation
- Portfolios: purposeful, systematic collection and evaluation of student work that
documents progress toward learning objectives
- Due to the dependence on subjective ratings and subsequent lack of reliability,
there is a need for sound rubrics for scoring these assessments 8.20
- Affective scales 8.14, 8.16
- Measures of interests, attitudes, self-concept, values, personality traits, beliefs,
etc.
- Characteristics of concern
- Response set
- A pattern of consistent responses based on format rather than
the trait being measured (e.g., strongly agreeing to every
statement once this response pattern is established)
- Controlled by using both positively and negatively worded items
so that a response set is not possible
- Faking
- Disguising the true response to an item, typically by choosing
the socially desirable or normal response rather than responding
honestly
- Controlled with anonymity or confidentiality
- Reliability is typically low for affective scales
- The lack of "right" or "wrong" answers implies that scores are often
interpreted from a norm-referenced perspective, and, thus, the
characteristics of the norming group become important
- Construct validity evidence is difficult to establish
- Questionnaire: a set of paper and pencil questions to which responses are requested
- Types of items
- Open (respondents create their responses) or closed (respondents choose their
response from alternatives given to them)
- Scaled item formats
- Likert: a question or statement followed by a set of scaled responses
- Typical format is 1) Strongly Agree 2) Agree 3) Neutral 4)
Disagree 5) Strongly Disagree
- Alternative formats describe levels of importance (i.e., very
important to not important at all), time (i.e., always to never),
performance (i.e., excellent to very poor), happiness (extremely
happy to extremely sad), etc.
- Neutrality
- Neutral positions can be obtained by offering an odd
number of points on the scale (e.g., 1 - 3, 1 - 5, etc.)
- Forced choices (positive or negative) can be obtained
by offering an even number of points on the scale (e.g.,
1 - 4, 1 - 6, etc.)
- Semantic differential: a statement or question followed by a set of
bipolar adjectives called anchors and a set of responses placing the
individual between these anchors
- Anchors include bipolar adjectives like easy - hard, like - dislike,
fair - unfair, etc.
- The typical format is Easy: ___ ___ ___ ___ ___: Hard where
the respondent checks the blank that corresponds to their
feelings
- Neutral or forced choice responses can be controlled through
the number of points on the scale as in the Likert format
- Ranked responses: statements or questions that require the respondent
to rank their response alternatives
- Typical rankings are based on importance, fairness, desire, etc.
- Forces respondents to differentiate among alternatives
- Checklists: statements or questions followed by a number of options
from which the respondent checks all that apply
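Scales that mix positively and negatively worded Likert items usually need the negative items reverse-scored before item scores are summed, so that every item points in the same direction. A minimal sketch follows, assuming a 1 to 5 scale. The function names, the item positions, and the sample responses are hypothetical.

```python
def reverse_score(score, low=1, high=5):
    """Flip a scaled response so that low becomes high and vice versa."""
    if not low <= score <= high:
        raise ValueError("score is outside the scale range")
    return low + high - score


def scale_total(responses, negative_items, low=1, high=5):
    """Sum item scores after reverse-scoring the negatively worded items.

    responses: list of item scores for one respondent.
    negative_items: set of list positions holding negatively worded items.
    """
    total = 0
    for position, score in enumerate(responses):
        if position in negative_items:
            score = reverse_score(score, low, high)
        total += score
    return total


# Hypothetical respondent; the items at positions 1 and 3 are negatively worded.
print(scale_total([1, 5, 2, 4], negative_items={1, 3}))
```

The same arithmetic applies to an even-numbered forced-choice scale by changing `low` and `high`.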
- Development
- Justify the use of the technique
- Define the objective of the questionnaire
- Write the questions
- Make items clear
- Avoid double-barreled questions that contain two or more ideas (e.g.,
"Do you think the test was easy and fair?" asks for an opinion about
both difficulty and fairness)
- Questions should be relevant
- Respondents must be competent to respond (e.g., reading levels are
appropriate)
- Short and simple items are best
- Avoid negative items (e.g., "the test is fair" is a better statement than
"the test is not fair")
- Avoid biased items or terms
- Organize the questionnaire and layout
- Review the scale and pilot test for technical merit
- Pilot testing involves administering the questionnaire to a sample of
subjects to identify difficulties with directions, items, item formats,
responses, time limits, etc.
- Advantages and disadvantages
- Advantages
- Efficient
- Practical
- Useful with relatively large samples
- Provides for standardized instructions
- Disadvantages
- Data is typically self-reported
- Possibility of misinterpreting questions
- Low return rates, typically less than 50%, for mailings
- Interview schedule: oral questions and answers
- Types
- Structured: specific questions are followed by a set of responses from which the
respondents choose their responses
- Semi-structured: fairly specific questions with open-ended responses
- Unstructured: great latitude in terms of asking broad questions and flexibility of
responses
- Development
- Justify the use of this technique
- Define objectives
- Write the questions and responses if it is a structured interview
- Organize the interview
- The interview process
- Importance of personal characteristics (e.g., appearance, personality,
age, experience, racial background, gender, etc.)
- Use of probing for further clarification of an answer
- Advantages and disadvantages
- Advantages
- Flexibility and elaboration
- Opportunity to establish trust
- Depth of information
- Disadvantages
- Time consuming and typically expensive as a result
- Typically small samples
- Difficult data analysis
- Observation schedule: recordings of naturally occurring behavior seen or heard by the observer
- In quantitative research the observer remains detached and objective
- Development
- Justify the use of this technique
- Define observational units
- Types of observational data 8.19
- High inference: requires the observer to make judgements (e.g.,
enthusiastic, happy, etc.)
- Low inference: requires recording specific behaviors without
making judgements in a larger sense (e.g., number of times a
student leaves his seat, number of questions a student asks,
etc.)
- Types of observations
- Duration: length of time a behavior lasts
- Frequency count: number of times a behavior occurs
- Interval: behaviors that occur within a specified interval of time
- Usually ten seconds to one minute
- Frequency counts within the interval
- Continuous: a brief description of behavior over an extended
period of time
- Time sampling
- A selection of fixed or random time periods that will be
used to observe
- Used in conjunction with all of the types of observations
discussed above
- Training issues
- Observations must be objective, unbiased, and accurate
- Bias: idiosyncratic perceptions of the observer
- Advantages and disadvantages
- Advantages
- Records behavior as it naturally occurs
- Data is not self-reported
- Minimal effects of social desirability
- No response set effect
- Disadvantages
- Time consuming and thus expensive
- Difficult to conduct for complex behaviors
- Observer effects are difficult to control
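The duration, frequency count, and interval types of observation described above can all be derived from one low-inference record of when a behavior started and stopped. The sketch below assumes each episode is logged as a (start, end) pair in seconds from the start of the session and that an episode is counted in the interval where it begins. The function names and the out-of-seat data are hypothetical.

```python
from math import ceil


def frequency_count(episodes):
    """Number of times the behavior occurred."""
    return len(episodes)


def total_duration(episodes):
    """Total time the behavior lasted, in seconds."""
    return sum(end - start for start, end in episodes)


def interval_counts(episodes, session_length, interval_length):
    """Frequency counts within each fixed-length interval of the session."""
    n_intervals = ceil(session_length / interval_length)
    counts = [0] * n_intervals
    for start, _end in episodes:
        # Ignore any episode logged outside the observed session.
        if 0 <= start < session_length:
            counts[int(start // interval_length)] += 1
    return counts


# Hypothetical record: times a student was out of his seat in one minute.
out_of_seat = [(3, 8), (12, 15), (17, 19), (41, 50)]

print(frequency_count(out_of_seat))
print(total_duration(out_of_seat))
print(interval_counts(out_of_seat, session_length=60, interval_length=10))
```

For this record the frequency is 4, the total duration is 19 seconds, and the ten-second interval counts are [1, 2, 0, 0, 1, 0].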
- Unobtrusive measures: measures not influenced by the subjects' awareness that they are
participating
- Types
- Physical traces
- Archives
- Running record
- Private record
- Simple observation
- Contrived observation (e.g., taping)
- Development follows those procedures described above for observations and interviews
- Advantages and disadvantages
- Advantages
- Controls for the Hawthorne Effect
- Controls for role selection
- Controls for response sets
- Disadvantages
- Time consuming and thus expensive
- Difficult to conduct for complex behaviors
- Observer effects are difficult to control
- Strengths and weaknesses of quantitative data collection techniques (See Table 8.7, p. 276)
- Websites related to general measurement issues
- Measurement 8.1
- The Quality of Assessment 8.10
- Types of Educational Measures 8.18
Original outline prepared for Addison, Wesley, Longman by Jeffrey Oescher, University of New Orleans