Applying Cognitive Diagnostic Models to Mechanics Concept Inventories

Vy Le, School of Education, Iowa State University, Ames, IA, 50011, USA
Jayson M. Nissen, Nissen Education and Research Design, Monterey, CA, 93940, USA
Xiuxiu Tang, College of Education, Purdue University, West Lafayette, IN, 47907, USA
Yuxiao Zhang, College of Education, Purdue University, West Lafayette, IN, 47907, USA
Amirreza Mehrabi, School of Engineering Education, Purdue University, West Lafayette, IN, 47907, USA
Jason W. Morphew, School of Engineering Education, Purdue University, West Lafayette, IN, 47907, USA
Hua Hua Chang, College of Education, Purdue University, West Lafayette, IN, 47907, USA
Ben Van Dusen, School of Education, Iowa State University, Ames, IA, 50011, USA
Abstract

In physics education research, instructors and researchers often use research-based assessments (RBAs) to assess students’ skills and knowledge. In this paper, we support the development of a mechanics cognitive diagnostic to test and implement effective and equitable pedagogies for physics instruction. Adaptive assessments using cognitive diagnostic models provide significant advantages over the fixed-length RBAs commonly used in physics education research. As part of a broader project to develop a cognitive diagnostic assessment for introductory mechanics within an evidence-centered design framework, we identified and tested student models of four skills that cross content areas in introductory physics: apply vectors, conceptual relationships, algebra, and visualizations. We developed the student models in three steps. First, we based the model on learning objectives from instructors. Second, we coded the items on RBAs using the student models. Third, we tested and refined this coding using a common cognitive diagnostic model, the deterministic inputs, noisy “and” gate (DINA) model. The data included 19,889 students who completed either the Force Concept Inventory, Force and Motion Conceptual Evaluation, or Energy and Momentum Conceptual Survey on the LASSO platform. The results indicated a good to adequate fit for the student models, with high accuracies for classifying students on many of the skills. The items from these three RBAs do not cover all of the skills in enough detail; however, they will form a useful initial item bank for the development of the mechanics cognitive diagnostic.

Keywords: computerized adaptive testing, DINA model, mechanics cognitive diagnostic, formative assessments, LASSO.

I Introduction

Since the development of the Force Concept Inventory (FCI), research-based assessments (RBAs) have played an important role in shaping the landscape of physics education research (PER) [Madsen et al., 2017; Docktor and Mestre, 2014]. RBAs have provided instructors and researchers with empirical evidence about how students learn and change throughout courses [Madsen et al., 2017, 2016]. Researchers used data from RBAs to assess the impact of curricular and pedagogical innovations [Madsen et al., 2017]. RBAs also play a central role in documenting inequities in physics courses before and after instruction [Van Dusen and Nissen, 2020; Wilcox and Lewandowski, 2016]. In previous studies, researchers have primarily used the data from the RBAs as summative assessments to evaluate the effectiveness of a course [Wilcox and Lewandowski, 2016; Thornton et al., 2009; Stoen et al., 2020].

Although some instructors use RBAs as formative assessments to inform their instruction, such as creating groups with diverse content knowledge [Laverty et al., 2022; Wilson, 2008], two common shortcomings of existing RBAs hamper their use as formative assessments: 1) a lack of easily actionable information and 2) a lack of timely information [Madsen et al., 2016]. Examining overall RBA scores on student pre-tests can inform an instructor how well-prepared a group of students is. Still, the overall RBA scores do not help instructors identify the specific skills students need to gain to be successful. Instructors and researchers also examine student gains in scores from the first (pre-test) to the last (post-test) week of class. While this is a useful measure of the impact of instruction, it is an inherently retrospective activity that cannot inform instruction throughout a course.

To address the shortcomings of existing RBAs, we are developing the mechanics cognitive diagnostic (MCD). The MCD is a cognitive diagnostic (CD) computerized adaptive testing (CAT) assessment [Leighton and Gierl, 2007]. CD-CATs are adaptive assessments that can cover the specific content and skills an instructor needs or wishes to assess. CDs assess which skills a student has or has not mastered [Cui et al., 2012]. CATs can adapt to students’ proficiency level and skill mastery profile, making assessment individualized and more efficient. These features allow an instructor to administer a CD-CAT as a formative assessment throughout a semester. The MCD will provide instructors with student-level and course-level assessments of student content knowledge and skill acquisition to help tailor instruction to students’ needs.

To support the development of the MCD, we investigated the skills assessed by three RBAs commonly used in introductory college mechanics courses [Madsen et al., 2017]. This research develops the models for the student skills and the evidence for assessing those skills as a component of the larger development of the MCD. The MCD will leverage this information to provide instructors with timely and actionable formative assessments.

II Research Question

To support the development of the MCD to measure skills across introductory mechanics content areas, we developed and applied a model of four skills to three commonly used RBAs for introductory mechanics courses. To this end, we ask the following research question:

  • What skills and content areas do three RBAs for introductory mechanics cover?

III Definitions

To support readers’ interpretation of our research, Table 1 includes a selection of terms and their definitions.

Table 1: Definitions of terms.
Term | Definition
Computerized adaptive testing (CAT) | Administered on computers, the test adaptively selects appropriate items for each person to match student proficiency [Morphew et al., 2018; Chang, 2015; Weiss, 1982].
Proficiency | The proficiency of a person reflects the probability of answering test items correctly. The higher the individual’s proficiency, the higher the probability of a correct response. Different fields refer to proficiency as skill, ability, latent trait, omega…
Skills | A latent attribute that students need to master to correctly answer items and that cut across content areas [Chang, 2015; Helm et al., 2022; Li and Traynor, 2022].
Cognitive diagnostic (CD) assessment | An assessment method that evaluates students on a set of specific skills to determine mastery. In contrast to traditional assessment methods that measure students on a single proficiency, CD provides diagnostic information on skill strengths and weaknesses to support personalized educational strategies [Ravand and Robitzsch, 2015; De La Torre and Minchen, 2014].
Classification accuracy | The proportion at which a cognitive diagnostic model accurately classifies a student’s skill mastery status, assessed using a simulation-based approach [Wang et al., 2015].
Deterministic inputs, noisy “and” gate (DINA) model | A cognitive diagnostic model assuming that a student must master all the required skills to correctly solve an item. The absence of any required skills cannot be compensated by the mastery of others. This model operates within a binary framework, categorizing each skill as either mastered or not mastered [Haertel, 1984; Junker and Sijtsma, 2001; De La Torre and Minchen, 2014].
Evidence-centered design (ECD) | A framework for developing educational assessments based on establishing logical, evidence-based arguments [Mislevy et al., 2003].

IV Literature Review

Many physics education researchers and instructors use existing fixed-length RBAs. PhysPort [Phy, n.d.a] and the LASSO platform [Alliance, 2020] provide lists and resources of these RBAs. Initially, instructors administered these RBAs with paper and pencil, but the administration is moving to online formats [Van Dusen et al., 2021]. This move to online data collection has led to the development of CATs for introductory physics that have advantages over fixed-length tests. In this section, we discuss RBAs in introductory mechanics, options for administering RBAs online, CAT broadly, and the application of CAT to RBAs in physics.

IV.1 RBAs in introductory mechanics

PhysPort [Phy, n.d.b] provides an extensive list of RBAs for physics along with other pedagogical resources. PhysPort, however, does not administer assessments online. RBA developers and researchers have instead often relied on Qualtrics or the LASSO platform [Alliance, 2020; Van Dusen et al., 2021] to administer online the RBAs they develop or use. Administering RBAs online lets instructors assess students in class or outside of class to save class time, and it allows automatic analysis of the collected data and aggregation of the data for research [Van Dusen, 2018].

PhysPort describes 117 RBAs [Phy, n.d.b], including 16 RBAs for introductory mechanics. Each RBA targets content areas and skills important for physics learning, and the title of an RBA often states its focus. For example, our study analyzed data from three RBAs. The Force Concept Inventory (FCI) [Hestenes et al., 1992] focuses on conceptual knowledge of forces and kinematics. The Force and Motion Conceptual Evaluation (FMCE) [Thornton and Sokoloff, 1998] provides similar coverage but also has four questions on energy. The Energy and Momentum Conceptual Survey (EMCS) [Singh and Rosengrant, 2003] covers exactly what the name states. Other assessment names also portray skills or content areas of interest to physics education: the Test of Understanding Graphs in Kinematics, the Test of Understanding Vectors in Kinematics, and the Rotational Kinematics Inventory. These names imply that graphs and vectors play an important role in many physics courses and that many physics courses cover rotation.

IV.2 Cognitive diagnostic - computerized adaptive testing

Computerized adaptive testing (CAT) uses item response theory (IRT) to establish a relationship between students’ proficiency levels and the probability of their success in answering test items [Collares, 2022]. CAT uses student responses to the preceding items to estimate the student’s proficiency and then selects the next item so that its difficulty aligns with that proficiency [Chang, 2015]. This continuous adaptation of item difficulty to student proficiency ensures that the test remains challenging and engaging for students throughout its duration and provides a more precise estimate of student proficiency than paper-and-pencil assessment [Morphew et al., 2018; Chang, 2015; Weiss, 1982]. Compared to paper-and-pencil assessment methods, CAT requires fewer items to accurately measure students’ proficiency while controlling the content variety of the selected items [Şahin and Özbasi, 2017]. Chen et al. [2008] show that CAT supports test security by drawing from a large item bank to control for item overexposure and that CAT can use pretest proficiency estimates for item selection and proficiency estimation to maximize test efficiency.
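The adaptive step described above can be illustrated with a minimal sketch. It assumes a Rasch (one-parameter IRT) model and maximum-information item selection, which is one common choice and not necessarily the rule used in the cited studies. The item bank, its difficulty values, and the function names are hypothetical.

```python
import math


def rasch_probability(theta, difficulty):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def item_information(theta, difficulty):
    """Fisher information of a Rasch item at proficiency theta: P * (1 - P)."""
    p = rasch_probability(theta, difficulty)
    return p * (1.0 - p)


def select_next_item(theta, item_bank, administered):
    """Return the id of the unused item with maximum information at theta."""
    candidates = {k: b for k, b in item_bank.items() if k not in administered}
    if not candidates:
        raise ValueError("item bank exhausted")
    return max(candidates, key=lambda k: item_information(theta, candidates[k]))


# Hypothetical item bank: item id -> difficulty on the proficiency scale.
bank = {"easy": -1.5, "medium": 0.1, "hard": 1.4}
next_item = select_next_item(theta=0.0, item_bank=bank, administered={"easy"})
```

Under the Rasch model, information peaks where item difficulty equals the proficiency estimate, so this rule selects the remaining item whose difficulty is closest to the current estimate.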

The combination of cognitive diagnostic (CD) models and CAT improves the assessment process and categorizes students based on their mastery of the distinct skills associated with each item. CD models aim to estimate how students’ cognitive proficiency relates to the specific skills or content necessary to solve individual test items [Chang, 2015; Collares, 2022], where a skill is a fundamental cognitive unit or proficiency that students need to acquire and master to answer certain items [Helm et al., 2022; Li and Traynor, 2022]. The deterministic inputs, noisy “and” gate (DINA) model is a CD model that facilitates the assessment of skill mastery profiles and the estimation of item parameters [Anamezie and Nnadi, 2018]. The DINA model uses a Q-matrix to specify the relationships between items and their required skills, thereby providing a structured framework for monitoring the mastery of distinct skills [Chen et al., 2015]. Researchers have applied the DINA model to evaluate students’ mastery of various skills, including problem-solving [Zhang et al., 2021], computational thinking [Li and Traynor, 2022], and domain-specific knowledge [Chen et al., 2015].

IV.3 CAT in physics education

We are unaware of any CD assessments in physics education research (PER). Researchers have, however, conducted studies on the effectiveness of CAT using IRT to evaluate students’ proficiencies [Morphew et al., 2018; Yasuda et al., 2021a]. Istiyono et al. [2018] used CAT to assess the physics problem-solving skills of senior high school students and found that most students’ competencies fell within the medium to low categories. Morphew et al. [2018] explored the use of CAT to evaluate physics proficiency and identify the areas where students needed to improve when preparing for course exams in an introductory physics course. Their study showed that students who used CAT improved their performance on subsequent exams. Yasuda et al. [2021b] indicated that CAT can reduce testing time through shorter test lengths while maintaining the accuracy of test measurement and administration. Yasuda et al. [2021a] examined item overexposure in the FCI-CAT; employing pre-test proficiency for item selection shortened test duration while maintaining accuracy and enhanced security by reducing item content memorization and sharing among students.

V Theoretical Framework

Figure 1: An evidence-centered design (ECD) model for the creation of the mechanics cognitive diagnostic (MCD). This paper focuses on two models of the ECD framework, the student models and the evidence models, because they align with our research question, which encompasses both skills and content areas. First, the student models determine the skills and content areas (K - kinematics, F - forces, E - energy, and M - momentum) that our assessment aims to measure. The student models ensure that the assessment reflects the competencies we intend to evaluate. Second, the evidence models, which comprise evidence rules and a measurement model, guide the evaluation process. These sub-models operate by analyzing student responses and their proficiency levels. Incorporating the DINA model within the evidence models refines the skills previously coded for each item. The remaining models (models 3-6) are outside the scope of this paper. The task model focuses on multiple-choice questions about kinematics, forces, energy, and momentum, each having a definitive right or wrong answer. The assembly model integrates the student models, evidence models, and task model, adjusting item difficulties based on student performance using the CD algorithm. The presentation model describes how tasks are presented, particularly online and in multiple-choice format. Lastly, the delivery system model ensures the integration of these models within the LASSO platform, covering aspects like security and timing for a well-rounded assessment system.

We drew on evidence-centered design (ECD) [Mislevy et al., 2003] to inform our development of the MCD. ECD was initially applied in the high-stakes contexts of the Graduate Record Examinations (GRE) [Mislevy et al., 2003; Sheehan et al., 2007], and has also been effectively utilized in PER for the development of RBAs [Pollard et al., 2021; Vignal et al., 2023]. There are three core premises in the ECD framework used in this study [Mislevy et al., 2003]:

  1. Assessment developers need content and context expertise to create high-quality items. In this analysis, we focused on three RBAs developed by physics education researchers: the FCI, FMCE, and EMCS.

  2. Assessment developers use evidence-based reasoning to evaluate students’ comprehension and identify misunderstandings accurately. In this analysis, we developed a Q-matrix that identified which underlying skills were required to correctly answer each item (more details in section VI.2).

  3. When creating assessments, developers must consider various factors such as resource availability, limitations, and usage conditions. For instance, the LASSO platform supports multiple-choice items and needs web-enabled devices, but it conserves class and instructor time.

Our work used the conceptual assessment framework provided by the ECD framework with its six models [Mislevy et al., 2003] (shown in Fig. 1) to guide assessment development. The models and their connections to our work are as follows:

  1. Student models focus on identifying one or more variables directly relevant to the knowledge, skills, or proficiencies an instructor wishes to examine. In this project, a qualitative analysis (see section VI.2) indicated that four skills (i.e., apply vectors, conceptual relationships, algebra, and visualizations) and four content areas (i.e., kinematics, forces, energy, momentum) would be optimum for our MCD.

  2. Evidence models include evidence rules and a measurement model to provide a comprehensive guide for updating information about a student’s performance. The evidence rules govern how observable variables summarize a student’s performance on specific tasks. In this project, we created our evidence models by scoring student responses and iteratively testing a Q-matrix that relates performance on each item to the underlying skills. The measurement model is the DINA model, which provides suggested revisions to the Q-matrices.

  3. Task model describes the structure of situations such that the situation collects the essential data for the evidence models. It defines a class of test items based on specific topics or areas of knowledge. In this project, the task model focuses exclusively on multiple-choice items pertaining to the topics of kinematics, forces, energy, and momentum, where each item has a definitive right or wrong answer.

  4. Assembly model describes how the three models above (the student models, evidence models, and task model) work together to form the psychometric frame of the assessment. In the broader project, we developed an IRT model and a CD model that coordinate the item difficulty, skills, and content areas of each next item based on a student’s prior responses. The IRT model selects items that are appropriately difficult for each student, and the CD model determines mastery of skills and content areas.

  5. Presentation model provides a realistic view of how tasks are presented across different evaluation settings. In this project, the items are presented in an online multiple-choice format.

  6. Delivery system model describes integrating all the models required for evaluation. In this project, we use the LASSO platform [Van Dusen, 2018; Nissen et al., 2022].

In this paper, we focus on the student models and evidence models (models 1-2). These models are instrumental in aligning our analysis with the research question. By evaluating the student models, we gain insights into the range of competencies RBAs are designed to assess. Similarly, through evidence models, we understand how these assessments capture and represent student understanding in various skills and content areas.

VI Materials and Methods

To answer the research question, we employed a mixed methods approach using qualitative coding to identify the skills and content areas to measure for the student models. Subsequent quantitative analyses drove the testing of the evidence models and iterative improvements of the student models. We first used artefacts from courses to build the student models of skills that cut across the content of introductory mechanics courses. We then identified RBAs with sufficient data available through the LASSO platform and coded each item for the skills it assessed. Last, we used an iterative process that applied DINA models to build the evidence models and to improve our definitions of the skills and the coding of the skills on each item. In this iterative process, the DINA model provided suggested changes to the item skill codes initially made by content experts. The suggested changes were accepted or rejected by content experts. We then ran a final DINA model on our revised codes.

VI.1 RBAs data collection and cleaning

Our analysis examined student responses on three RBAs: the FCI (30 items, 12932 students), FMCE (47 items, 5510 students), and EMCS (25 items, 1447 students). Our dataset came from the LASSO platform [Van Dusen, 2018; Nissen et al., 2022; Alliance, 2020]. LASSO provided post-test data from 19889 students across the three assessments. We removed assessments completed in less than five minutes and assessments with missing answers.
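The two exclusion rules above can be sketched as a simple filter. This is only an illustration under stated assumptions: the record layout (`duration_seconds`, `answers`, with `None` marking a missing answer) is hypothetical and does not reflect the LASSO export format or the authors' cleaning script.

```python
MIN_DURATION_SECONDS = 5 * 60  # five-minute threshold stated in the text


def is_valid_attempt(attempt, n_items):
    """Keep attempts that took at least five minutes and answered every item."""
    answers = attempt["answers"]
    complete = len(answers) == n_items and all(a is not None for a in answers)
    return attempt["duration_seconds"] >= MIN_DURATION_SECONDS and complete


def clean_attempts(attempts, n_items):
    """Return only the attempts that pass both exclusion rules."""
    return [a for a in attempts if is_valid_attempt(a, n_items)]


# Hypothetical three-item assessment with three attempts.
attempts = [
    {"student": "s1", "duration_seconds": 900, "answers": ["A", "C", "B"]},
    {"student": "s2", "duration_seconds": 120, "answers": ["A", "A", "A"]},
    {"student": "s3", "duration_seconds": 800, "answers": ["B", None, "D"]},
]
kept = clean_attempts(attempts, n_items=3)
```

In this toy data, the second attempt is dropped for finishing in under five minutes and the third for a missing answer.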

VI.2 Qualitative Data Analysis

We developed an initial list of skills and content areas covered in physics courses by coding learning objectives from courses using standards-based grading. We focused on standards-based grading because instructors explicitly list the learning objectives students should master during the course [Beatty, 2013]. Initially, we coded a wide range of skills based on our analysis of these RBAs [Madsen et al., 2017]. From this coding, we aggregated the broader set of skills into five initial skills: apply vectors, conceptual understanding, algebra, visualizations, and definitions. We discarded definitions as a skill because it represents a memorized response that the other skills cover in greater depth by asking students to apply or understand the concept. In addition, we are not aware of RBAs for introductory physics that ask definition questions. Table 2 lists the four skills and our definitions of the skills.

We initially coded content areas at a finer grain size to match the standards-based grading learning objectives; e.g., kinematics was split into four areas across two variables: 1D or 2D and constant velocity or constant acceleration. These content areas, however, were too fine-grained to develop an assessment of reasonable length for students to complete or an item bank of realistic size. Therefore, we simplified the content codes to kinematics, forces, energy, and momentum for these three RBAs. Table 3 lists the four content areas covered by these three RBAs and their definitions.

Based on this initial set of codes we developed, we coded each item for its relevant skills and content areas. Our coding team included three researchers with backgrounds in physics and teaching physics. Each item was independently coded by at least two team members. The three coders then compared the coding for the items and reached a consensus on all items. This consensus coding of the three assessments provided one of the inputs into the DINA analysis.

Table 2: Definition of the skills present in the FCI, FMCE, and EMCS assessments.
Skills | Definition
Apply Vectors | Item requires manipulating vectors in more than one dimension or has a change in sign for a 1-D vector quantity.
Conceptual Relationships | Item requires students to identify a relationship between variables and/or the situations in which those relationships apply.
Algebra | Item requires students to reorganize one or more equations. This goes beyond recognizing the standard forms of equations.
Visualizations | Item requires extracting information from or creating formal visualizations such as $xy$ plots, bar plots, or line graphs.

Table 3: Definition of the content areas present in the FCI, FMCE, and EMCS assessments.
Content Areas | Definition
Kinematics | Items concerning the motion of objects without reference to the forces that cause the motion.
Forces | Free body diagram, and Newtonian laws.
Energy | Conservation of energy, work, set up system, and the relationship between force and potential energy.
Momentum | Conservation of momentum and impulse.

VI.3 Quantitative data analysis

VI.3.1 DINA model

The DINA model is a commonly used cognitive diagnostic (CD) model [Haertel, 1984; Junker and Sijtsma, 2001] that aims to identify students’ strengths and weaknesses across the skills measured by an assessment. On a CD assessment, correctly answering each test item requires the mastery of different skills. The relationship between each item and its required skills is denoted by a Q-matrix [Tatsuoka, 2012], which is shown in Fig. 1. The Q-matrix is a $J \times K$ matrix in which $q_{jk} = 1$ indicates that a correct answer on the $j$th item requires the mastery of the $k$th skill, and $q_{jk} = 0$ otherwise.

The DINA model assumes that an examinee has to master all the required skills for the item to answer an item correctly; lacking one or more of the required skills would lead to an incorrect response [de la Torre, 2009]. The model, however, allows noise in the examinee’s item-answering process. Some examinees may not answer certain items correctly even if they have mastered all the required skills (“slipping”), while some examinees may answer some items correctly even though they have not mastered at least one of the required skills (“guessing”). The probability of examinee $i$ answering item $j$ correctly is given by

\[ P(Y_{ij} = 1 \mid \alpha_i) = (1 - s_j)^{\eta_{ij}} \, g_j^{1 - \eta_{ij}} \]

where the skill mastery vector is $\alpha_i = (\alpha_{i1}, \alpha_{i2}, \ldots, \alpha_{iK})$, with $\alpha_{ik} = 1$ indicating that the $i$th examinee has mastered the $k$th skill and $\alpha_{ik} = 0$ otherwise; $s_j$ is the slipping parameter, representing the probability of answering the $j$th item incorrectly when an examinee has mastered all the required skills; $g_j$ is the guessing parameter, representing the probability of answering the $j$th item correctly when an examinee has not mastered all the required skills; and $\eta_{ij}$ indicates whether the examinee has mastered all the required skills, calculated as follows:

\[ \eta_{ij} = \prod_{k=1}^{K} \alpha_{ik}^{q_{jk}} \]

Here $\eta_{ij} = 1$ indicates that all required skills have been mastered, and $\eta_{ij} = 0$ otherwise.
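As a minimal sketch of the two DINA equations above, the following Python functions compute the mastery indicator and the response probability for one examinee and one item. The Q-matrix row, the slipping and guessing values, and the mastery profiles are hypothetical and are not estimates from this study.

```python
def eta(alpha, q_row):
    """1 if the examinee has mastered every skill the item requires, else 0.

    Equivalent to the product of alpha_k ** q_k with the convention 0 ** 0 = 1.
    """
    return int(all(a == 1 for a, q in zip(alpha, q_row) if q == 1))


def p_correct(alpha, q_row, slip, guess):
    """DINA probability of a correct response: (1 - s)^eta * g^(1 - eta)."""
    e = eta(alpha, q_row)
    return (1.0 - slip) ** e * guess ** (1 - e)


# Hypothetical item requiring skills 1 and 2 out of K = 4 skills.
q_row = [1, 1, 0, 0]
slip, guess = 0.1, 0.2

master = [1, 1, 0, 1]      # has both required skills
non_master = [1, 0, 1, 1]  # lacks required skill 2

p_master = p_correct(master, q_row, slip, guess)          # equals 1 - slip
p_non_master = p_correct(non_master, q_row, slip, guess)  # equals guess
```

The example shows the non-compensatory nature of the model: the second examinee has mastered three of the four skills but still falls back to the guessing probability because one required skill is missing.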

In this study, we used the DINA model to analyze students’ response data for each of the three RBAs to further refine our item codes and to calibrate all items’ slipping and guessing parameters. The DINA analyses also generated skill mastery profiles for each student, which were not the focus of the research question in this paper. We implemented these psychometric analyses using the G-DINA package [Ma and de la Torre, 2020] in the R programming environment. We used RMSEA2 and SRMSR to assess the degree of model-data fit. RMSEA2 is the root mean square error of approximation (RMSEA) based on the M2 statistic using the univariate and bivariate margins. RMSEA2 ranges from 0 to 1, and RMSEA2 $< 0.06$ indicates good fit [Hooper et al., 2008; Hu and Bentler, 1999]. SRMSR, the standardized root mean squared residual, has acceptable values ranging between 0 and 0.08. Models with SRMSR $< 0.05$ can be viewed as well-fitted, and models with SRMSR $< 0.08$ are typically considered acceptable [Maydeu-Olivares and Joe, 2014; Hu and Bentler, 1999; Jang, 2018]. Additionally, the skill-level classification accuracy informed the reliability and validity of the CD assessment. Classification accuracy is the percentage of agreement between the observed and expected proportions of examinees in each of the skills.
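The cutoffs quoted above can be turned into a small helper for labeling fit indices. The sketch below is illustrative only: the function names and label strings are hypothetical, the example values are not results from this study, and the classification-accuracy function follows the simulation-based description in Table 1 as an assumption rather than reproducing the G-DINA package's computation.

```python
def describe_fit(rmsea2, srmsr):
    """Label model-data fit using the cutoffs quoted in the text."""
    rmsea2_label = "good" if rmsea2 < 0.06 else "above good-fit cutoff"
    if srmsr < 0.05:
        srmsr_label = "well-fitted"
    elif srmsr < 0.08:
        srmsr_label = "acceptable"
    else:
        srmsr_label = "above acceptable cutoff"
    return {"RMSEA2": rmsea2_label, "SRMSR": srmsr_label}


def skill_classification_accuracy(true_profiles, estimated_profiles, skill_index):
    """Share of simulated examinees whose estimated mastery of one skill
    matches the mastery status used to generate their responses."""
    matches = sum(
        t[skill_index] == e[skill_index]
        for t, e in zip(true_profiles, estimated_profiles)
    )
    return matches / len(true_profiles)


# Hypothetical fit indices, not values reported in this study.
labels = describe_fit(rmsea2=0.04, srmsr=0.06)

# Hypothetical simulated profiles over two skills for four examinees.
true_profiles = [(1, 0), (1, 1), (0, 0), (0, 1)]
estimated_profiles = [(1, 0), (1, 0), (0, 0), (1, 1)]
accuracy_skill_0 = skill_classification_accuracy(true_profiles, estimated_profiles, 0)
```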

The appropriateness of the Q-matrix plays an important role in CD assessments and affects the degree of model-data fit. Inappropriate specifications in the Q-matrix may lead to poor model fit and thus may produce incorrect skill diagnosis results for students. Therefore, we need a Q-matrix validation step in the study. The input Q-matrices for the DINA analysis for each RBA were constructed by content experts, as detailed in the prior section. In the Q-matrix validation step, the DINA analysis further examined each Q-matrix to identify potential misspecifications in the Q-matrices.

VI.3.2 Q-Matrix validation

The analysis fitted the DINA model to students’ post-assessment responses using the Q-matrix constructed by the three coders. The Proportion of Variance Accounted For method [de la Torre and Chiu, 2016] measured the relationships between the items and the skills specified in the provided Q-matrix. The analysis of the empirical response data suggested changes to the provided Q-matrix, which the three coders reviewed. The coders assessed how well the suggested modifications aligned with the skill definitions and revised the Q-matrix when they agreed with the suggested changes. The refined Q-matrix was then used in subsequent CD modeling analyses.
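The expert review step described above can be sketched as follows. The sketch does not compute the Proportion of Variance Accounted For statistic; it only assumes that a list of data-driven suggestions already exists and shows how accepted suggestions update the Q-matrix while rejected ones leave the expert coding unchanged. Item names, skill names, and the data layout are hypothetical.

```python
def apply_reviewed_changes(q_matrix, suggestions, accepted):
    """Return a revised copy of the Q-matrix with only the accepted changes."""
    revised = {item: dict(skills) for item, skills in q_matrix.items()}
    for item, skill, new_value in suggestions:
        if (item, skill) in accepted:
            revised[item][skill] = new_value
    return revised


# Hypothetical fragment of an expert-coded Q-matrix (1 = skill required).
q_matrix = {
    "item_A": {"vectors": 1, "conceptual": 0, "visualizations": 0},
    "item_B": {"vectors": 0, "conceptual": 1, "visualizations": 1},
}
# Data-driven suggestions: (item, skill, proposed value).
suggestions = [("item_A", "conceptual", 1), ("item_B", "visualizations", 0)]
# The coders accept the first suggestion and reject the second.
accepted = {("item_A", "conceptual")}

revised = apply_reviewed_changes(q_matrix, suggestions, accepted)
```

Returning a copy keeps the original expert coding available for comparison with the revised Q-matrix.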

Table 4: The skills and content areas for items from the FCI, FMCE, and EMCS. Note: “FCI_01” represents an abbreviation of the assessment name and the number of the item on the assessment.
Content Area | Apply Vectors | Conceptual Relationships | Algebra | Visualizations
Kinematics | FCI_07, FCI_08, FCI_09, FCI_12, FCI_14, FCI_21, FCI_22, FCI_23, FMCE_27, FMCE_28, FMCE_29, FMCE_41 | FCI_01, FCI_02, FCI_07, FCI_12, FCI_14, FCI_19, FCI_20, FCI_23, FMCE_22, FMCE_23, FMCE_24, FMCE_25, FMCE_26, FMCE_27, FMCE_28, FMCE_29, FMCE_40, FMCE_41, FMCE_42, FMCE_43 | (none) | FCI_20, FMCE_22, FMCE_23, FMCE_24, FMCE_25, FMCE_26, FMCE_40, FMCE_41, FMCE_42, FMCE_43
Forces | FCI_05, FCI_11, FCI_13, FCI_17, FCI_18, FCI_25, FCI_26, FCI_27, FCI_29, FCI_30, FMCE_01, FMCE_03, FMCE_04, FMCE_05, FMCE_06, FMCE_07, FMCE_08, FMCE_09, FMCE_10, FMCE_11, FMCE_12, FMCE_13, FMCE_20, FMCE_21 | FCI_03, FCI_04, FCI_06, FCI_10, FCI_15, FCI_16, FCI_24, FCI_25, FCI_28, FMCE_01, FMCE_02, FMCE_03, FMCE_04, FMCE_05, FMCE_06, FMCE_07, FMCE_08, FMCE_09, FMCE_10, FMCE_11, FMCE_12, FMCE_13, FMCE_14, FMCE_15, FMCE_16, FMCE_17, FMCE_18, FMCE_19, FMCE_20, FMCE_21, FMCE_30, FMCE_31, FMCE_32, FMCE_33, FMCE_34, FMCE_35, FMCE_36, FMCE_37, FMCE_38, FMCE_39 | (none) | FMCE_14, FMCE_15, FMCE_16, FMCE_17, FMCE_18, FMCE_19, FMCE_20, FMCE_21
Energy | EMCS_01 | EMCS_01, EMCS_02, EMCS_03, EMCS_04, EMCS_06, EMCS_08, EMCS_09, EMCS_12, EMCS_13, EMCS_15, EMCS_17, EMCS_20, EMCS_22, EMCS_24, EMCS_25, FMCE_44, FMCE_45, FMCE_46, FMCE_47 | EMCS_15 | FMCE_44, FMCE_45
Momentum | EMCS_05, EMCS_11, EMCS_13, EMCS_23 | EMCS_03, EMCS_05, EMCS_07, EMCS_10, EMCS_13, EMCS_14, EMCS_16, EMCS_18, EMCS_19, EMCS_21 | EMCS_21 | (none)
Table 5: Q-matrix modifications and adoption rates
Total Items Possible Changes Suggested Changes Adopted Changes Adoption Rate Change Rate
FCI 30 90 11 7 64% 7.8%
FMCE 47 141 14 5 36% 3.5%
EMCS 25 75 1 1 100% 1.3%
Overall 102 306 26 13 50% 4.2%

Table 5 summarizes the number of data-driven modifications suggested by the analysis, the number adopted by the coders, and the resulting adoption and change rates for each of the three assessments. The FCI, for example, had 11 proposed changes out of 90 possible changes (30 items, each with three possible skills), and the coders adopted seven of these suggestions. For instance, the coders initially did not consider the conceptual relationships skill essential for FCI item 7. However, the empirical response data suggested that this skill was required to answer item 7 correctly. After review, the coders endorsed this modification, and the value in the Q-matrix at the intersection of item 7 and conceptual relationships was changed from “0” to “1”. Overall, the analysis identified only 8.5% of the codings (26 of 306) for re-examination. Of the 26 proposed changes, the coders adopted 13 across the three assessments, for an overall adoption rate of 50%. This iterative approach to informing the validity of the Q-matrix avoids overreliance on either expert opinion or empirical data and draws on both sources of information to improve the accuracy of the Q-matrix. Table 4 shows the final coding for each RBA item across the four content areas and four skills.
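
The rates in Table 5 follow from simple ratios of the reported counts. As a minimal sketch (the function and variable names are illustrative and not taken from the authors' analysis), the adoption rate is the number of adopted changes divided by the number of suggested changes, and the change rate is the number of adopted changes divided by the number of possible changes, assuming three possible skill codings per item:

```python
# Counts of suggested and adopted Q-matrix changes, as reported in Table 5.
# All names here are illustrative; they are not taken from the authors' code.
counts = {
    "FCI": {"items": 30, "suggested": 11, "adopted": 7},
    "FMCE": {"items": 47, "suggested": 14, "adopted": 5},
    "EMCS": {"items": 25, "suggested": 1, "adopted": 1},
}
SKILLS_PER_ITEM = 3  # assumption from the text: three possible skills per item


def q_matrix_rates(items, suggested, adopted):
    """Return (possible changes, adoption rate, change rate) for one assessment."""
    possible = items * SKILLS_PER_ITEM
    return possible, adopted / suggested, adopted / possible


rates = {name: q_matrix_rates(**c) for name, c in counts.items()}

# Pool the counts across assessments to obtain the overall row.
total = {
    key: sum(c[key] for c in counts.values())
    for key in ("items", "suggested", "adopted")
}
rates["Overall"] = q_matrix_rates(**total)

for name, (possible, adoption, change) in rates.items():
    print(f"{name}: {possible} possible, adoption {adoption:.0%}, change {change:.1%}")
```

Computing the overall row from pooled counts, rather than averaging the per-assessment rates, keeps the longer assessments from being underweighted.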

VII Findings

This section addresses the research question by detailing the skills and content areas measured by the three assessments, as detailed in Table 4. First, we present which of the four skills the items on the three assessments measured and the number of skills the items measured. The specific models relating the items to the four skills are presented in the Appendix, see Tables 12, 13, and 14. Second, we show the content areas covered in the three assessments. Finally, we examine the skills across content areas. This structure highlights the various aspects of the items in these three assessments.

VII.1 Skills

FCI - The FCI assessed three skills (Fig. 2). Eighteen items assessed the apply vectors skill, 17 items assessed the conceptual relationships skill, 1 item assessed the visualizations skill, and 0 items assessed the algebra skill. The majority of items assessed a single skill: 24 items (80%) assessed a single skill, 6 items (20%) assessed two skills, and 0 items assessed three skills (Table 6).

FMCE - The FMCE assessed the same three skills as the FCI (Fig. 2). All 47 items assessed the conceptual relationships skill, 19 items assessed the visualizations skill, 18 items assessed the apply vectors skill, and 0 items assessed the algebra skill. The majority of items assessed multiple skills: 13 items (28%) assessed a single skill, 31 items (66%) assessed two skills, and 3 items (6%) assessed three skills (Table 6).

EMCS - Similar to the FCI and FMCE, the EMCS assessed the apply vectors and conceptual relationships skills (see Fig. 2 and Table 4). The EMCS differed in that it included 2 items that assessed the algebra skill. Of the 25 EMCS items, 23 assessed the conceptual relationships skill (with items 3 and 13 both coded for energy and momentum), 5 items assessed the apply vectors skill, 2 items assessed the algebra skill, and 0 items assessed the visualizations skill. The EMCS was the only assessment with items assessing the algebra skill. The majority of items assessed a single skill. Twenty items (80%) assessed a single skill, 5 items (20%) assessed two skills, and 0 items assessed three skills (Table 6).

Figure 2: The distribution of items across skills, content areas, and assessments. Note that 2 items, 3 and 13, are counted for both energy and momentum under the conceptual relationships skill for the EMCS.
Table 6: The distribution of items across the number of skills they assess.
Number of skills assessed, N (%)
1 2 3
FCI 24 (80%) 6 (20%) 0 (0%)
FMCE 13 (28%) 31 (66%) 3 (6%)
EMCS 20 (80%) 5 (20%) 0 (0%)
Total 57 (56%) 42 (41%) 3 (3%)
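
The counts reported above for the EMCS can be reproduced directly from its Q-matrix (Table 14). The sketch below stores the Q-matrix as the set of item numbers coded “1” for each skill; this compact representation and the variable names are illustrative choices, not the authors' data format.

```python
from collections import Counter

# EMCS Q-matrix from Table 14, stored as the items coded "1" for each skill.
# The representation is an illustrative assumption, not the authors' format.
N_ITEMS = 25
emcs_q = {
    "apply_vectors": {1, 5, 11, 13, 23},
    "conceptual_relationships": set(range(1, N_ITEMS + 1)) - {11, 23},
    "algebra": {15, 21},
    "visualizations": set(),
}

# Number of items that assess each skill.
items_per_skill = {skill: len(items) for skill, items in emcs_q.items()}

# Number of skills each item assesses, then the distribution of those counts.
skills_per_item = {
    item: sum(item in items for items in emcs_q.values())
    for item in range(1, N_ITEMS + 1)
}
distribution = Counter(skills_per_item.values())

print(items_per_skill)
print({k: f"{v} ({v / N_ITEMS:.0%})" for k, v in sorted(distribution.items())})
```

The same two tallies applied to the FCI and FMCE Q-matrices (Tables 12 and 13) give the remaining rows of Table 6.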

VII.1.1 DINA model fit

The analysis fitted the DINA model with the refined Q-matrix to the response data. Table 7 outlines the model fit statistics for each assessment. According to the established criteria [Hu and Bentler, 1999; Bentler, 1990], the model demonstrated satisfactory fit (RMSEA2 < 0.05, SRMSR < 0.07) for FCI and EMCS, whereas the fit for FMCE was unsatisfactory (RMSEA2 = 0.090, SRMSR = 0.110). These results suggest that the model adequately represents the underlying structure of data for FCI and EMCS but might not capture the latent structure of FMCE well.

Table 7: Model fit by assessment
Absolute Fit Statistics
RMSEA2 SRMSR
FCI 0.048 0.062
FMCE 0.090 0.110
EMCS 0.028 0.041
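
As a small illustration of how the cutoffs quoted in the text apply to Table 7, the sketch below flags an assessment as meeting the criteria only when both statistics fall below their cutoffs. The dictionary layout and function name are illustrative assumptions.

```python
# Absolute fit statistics from Table 7.
fit = {
    "FCI": {"RMSEA2": 0.048, "SRMSR": 0.062},
    "FMCE": {"RMSEA2": 0.090, "SRMSR": 0.110},
    "EMCS": {"RMSEA2": 0.028, "SRMSR": 0.041},
}

# Cutoffs quoted in the text (RMSEA2 < 0.05 and SRMSR < 0.07).
CUTOFFS = {"RMSEA2": 0.05, "SRMSR": 0.07}


def meets_cutoffs(stats, cutoffs=CUTOFFS):
    """True only if every fit statistic is strictly below its cutoff."""
    return all(stats[name] < limit for name, limit in cutoffs.items())


satisfactory = {rba: meets_cutoffs(stats) for rba, stats in fit.items()}
print(satisfactory)
```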

VII.1.2 DINA model classification accuracy

Table 8 presents the classification accuracy [Wang et al., 2015] for each skill across the three assessments. As discussed in the skills section, not all of the skills were measured by each of the assessments; 9 of the 12 skill-assessment combinations were possible. For those skills that were measured, 7 of the 9 classification accuracies were high (over 0.9). The classification accuracies of visualizations for the FCI (0.79) and algebra for the EMCS (0.63) were notably lower. These lower accuracies reflect the small number of items measuring these skills (Fig. 2).

Table 8: Skill classification accuracy by assessment
Apply Vectors Conceptual Relationships Algebra Visualizations
FCI 0.97 0.96 - 0.79
FMCE 0.96 0.98 - 0.91
EMCS 0.94 0.95 0.63 -
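
One way to read the accuracies in Table 8: the model yields, for each student, a posterior probability of having mastered each skill, and the student is classified as a master when that probability is at least 0.5. A simple posterior-based estimate of attribute-level accuracy, in the spirit of the indices of Wang et al. [2015], averages the posterior probability of the class each student was assigned to. The sketch below illustrates that idea with made-up posterior probabilities; it is an assumption about the general approach and may differ from the exact index the analysis computed.

```python
def attribute_accuracy(mastery_probs, threshold=0.5):
    """Average posterior probability of the class each student is assigned to.

    mastery_probs: posterior probabilities of mastery for one skill,
    one value per student (hypothetical inputs in this sketch).
    """
    if not mastery_probs:
        raise ValueError("need at least one student")
    # A student at or above the threshold is classified as a master, so the
    # probability that the classification is right is p; otherwise it is 1 - p.
    agreement = [p if p >= threshold else 1.0 - p for p in mastery_probs]
    return sum(agreement) / len(agreement)


# Hypothetical posteriors: a skill measured by many items tends to give
# probabilities near 0 or 1; a skill measured by few items leaves them near 0.5.
well_measured = [0.98, 0.03, 0.95, 0.01, 0.99]
poorly_measured = [0.70, 0.40, 0.65, 0.35, 0.60]

print(round(attribute_accuracy(well_measured), 3))
print(round(attribute_accuracy(poorly_measured), 3))
```

Under this reading, a skill with only one or two items leaves most posteriors close to 0.5, which pulls the estimated accuracy down.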

VII.2 Content Areas

Table 9: The number of items across content areas they assess. Note that 2 items, 3 and 13, are counted for both energy and momentum under the conceptual relationships skill for the EMCS. These were the only 2 items that assessed multiple content areas.
Total Kinematics Forces Energy Momentum
FCI 30 12 18 0 0
FMCE 47 12 31 4 0
EMCS 25 0 0 15 12
Overall 102 24 49 19 12
Table 10: The distribution of items across the number of content areas they assess.
Number of content areas assessed, N (%)
1 2
FCI 30 (100%) 0 (0%)
FMCE 47 (100%) 0 (0%)
EMCS 23 (92%) 2 (8%)

FCI - The FCI assessed two content areas (Fig. 2 & Table 9). Eighteen items assessed forces, 12 items assessed kinematics, and 0 items assessed energy and momentum. All 30 items (100%) assessed a single content area (Table 10).

FMCE - The FMCE assessed three content areas (Fig. 2 & Table 9). Thirty-one items assessed forces, 12 items assessed kinematics, 4 items assessed energy, and 0 items assessed momentum. Similar to FCI, all 47 (100%) items assessed a single content area (Table 10).

EMCS - The EMCS assessed two content areas (Fig. 2 & Table 9). Fifteen items assessed energy, 12 items assessed momentum, and 0 items assessed kinematics and forces. Unlike the FCI and the FMCE, 23 items assessed a single content area, and 2 items assessed two content areas (8%) (Table 10).

VII.3 Skills × content areas

Table 11: The intersection of skills and content areas covered by each RBA.
Content Areas Apply Vectors Conceptual Relationships Algebra Visualizations
Kinematics FCI, FMCE FCI, FMCE - FCI, FMCE
Forces FCI, FMCE FCI, FMCE - FMCE
Energy EMCS FMCE, EMCS EMCS FMCE
Momentum EMCS EMCS EMCS -
Figure 3: The distribution of items from the three assessments (FCI, FMCE, and EMCS) across skills and content areas. Note: Each assessment contains a different number of items, and some items assess multiple skills and content areas.

The distribution of skills assessed was not consistent across content areas (Fig. 3 and Table 11). This inconsistency follows from several aspects of the three RBAs. The FCI and FMCE did not measure the algebra skill. The EMCS did not measure the visualizations skill. The majority of items came from the FMCE and FCI, which focused more on forces than on kinematics. Across the three RBAs, very few items measured the apply vectors skill for energy (1) and momentum (4), even though applying vectors is central to momentum. Likewise, very few items measured the visualizations skill for energy (2) and momentum (0).
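
The gaps described above can be tallied from the energy and momentum rows of Table 4. The sketch below counts items per content area and skill and flags cells that fall below a minimum item count. The threshold of 5 is a hypothetical choice for illustration (loosely motivated by the 5 EMCS items that measured apply vectors), not a criterion from the study.

```python
# Items from the energy and momentum rows of Table 4 for two skills.
coverage = {
    ("energy", "apply_vectors"): ["EMCS_01"],
    ("energy", "visualizations"): ["FMCE_44", "FMCE_45"],
    ("momentum", "apply_vectors"): ["EMCS_05", "EMCS_11", "EMCS_13", "EMCS_23"],
    ("momentum", "visualizations"): [],
}

MIN_ITEMS = 5  # hypothetical threshold for illustration only

# Number of items in each content area x skill cell.
cell_counts = {cell: len(items) for cell, items in coverage.items()}

# Cells that would need new items under the hypothetical threshold.
gaps = sorted(cell for cell, n in cell_counts.items() if n < MIN_ITEMS)

for (content, skill), n in cell_counts.items():
    print(f"{content:9s} {skill:15s} {n} item(s)")
```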

VIII Discussion

This study supports the development of the MCD within the ECD framework by focusing on the student and evidence models (Fig. 1). For the student models, the three RBAs measured all four skills, though to different extents, across the four content areas. For the evidence models, the three RBAs assessed most of the skills with high classification accuracies. These results indicate that the combined items from the three RBAs will provide an adequate initial item bank for the further development of the MCD.

VIII.1 Student models - The four skills

The three RBAs - FCI, FMCE, and EMCS - each included items that assessed three of the four skills across two to three content areas. The three RBAs all included a majority of items that assessed the conceptual relationships skill, which follows from their conceptual focus. In addition to measuring the conceptual relationships skill, all three RBAs also included sufficient items to assess the apply vectors skill with high classification accuracies. The FMCE included sufficient items to assess the visualizations skill. These results, in addition to the existence of other RBAs that focus specifically on visualizations and vectors [Zavala et al., 2017; Barniol and Zavala, 2014], indicate that these three skills are common learning objectives of physics instruction.

The three RBAs did not include enough items assessing the algebra skill to inform the extent to which that skill fits within our student models. This likely follows from these RBAs being conceptual assessments developed to refocus physics instruction from memorization and application of equations to a deeper understanding of the conceptual relationships linking the physical world. Applying and manipulating equations was, however, a common learning objective in the standards-based grading rubrics we used to develop our student models. Many instructors and students may want formative assessments on algebra skill to support their teaching and learning.

Most items in both the FCI and EMCS required mastery of a single skill, while the majority of items in the FMCE needed multiple skills. Requiring multiple skills to answer an item correctly can have two effects. First, requiring mastery of more than one skill typically makes the items more difficult to answer. This is consistent with prior findings that the FMCE is more difficult than the FCI [Thornton et al., 2009]. Second, multi-skill items can provide different information than a single-skill item, and item banks should include a mix of single- and multi-skill items to pick from to maximize the information generated by each item a student answers. Combining the three assessments into a single test bank provides a more even mix of single- and multi-skill items than any of these three RBAs.

VIII.2 Evidence models - Model fits and classification accuracies

The DINA model fit the FCI and EMCS well, but the fit for the FMCE was marginal. The length and difficulty of the FMCE may have driven this marginal fit. The large number of items assessing multiple skills may have also been a factor. Post-hoc analyses to test these possibilities indicated that they were not major contributors to the marginal model fit of the FMCE. The additional analyses included generalized DINA models, DINA models of the first and second half of the FMCE, and separate DINA models of the students from calculus- and algebra-based physics courses. The marginal fit likely follows from our post-hoc application of our skill model to the FMCE. That this model fits two assessments well and one assessment marginally indicates that the student models of the skills are broadly applicable to physics learning and that items from these three assessments can form the initial item bank of a cognitive diagnostic.

The three RBAs had high classification accuracy (above 0.9) for the apply vectors and conceptual relationships skills, as shown in Table 8. This makes sense for the FMCE and FCI, given that they each had at least 17 items for each of the apply vectors and conceptual relationships skills (Fig. 2). Although the EMCS only had 5 items measuring apply vectors, the classification accuracy was still 0.94. This finding indicates that a relatively small number of items can still accurately assess a skill. The numbers of items measuring the algebra skill on the EMCS (2) and the visualizations skill on the FCI (1) were not sufficient to generate useful classification accuracies (<0.8). Combining the three assessments into a single item bank should provide sufficient coverage of the apply vectors, conceptual relationships, and visualizations skills, but it will not offer enough items to assess the algebra skill. Additionally, the combined item bank will require additional items to assess the visualizations and apply vectors skills in the content areas of energy and momentum.

IX Limitations

The DINA analysis assumes that students must have mastered every skill assessed by an item to answer that item correctly. A less restrictive analysis, such as generalized DINA, which allows some questions to be answered by mastering only a subset of skills or by students who have only partially mastered skills, may provide a better fit. The three RBAs constrained the skills that the analyses could test. This was an obvious issue for the algebra skill, which was only assessed by two items on one assessment. Physics instructors also likely value and teach other skills that they would want to assess, such as the ability to decompose complex problems into smaller pieces to solve, as assessed by the Mechanics Reasoning Inventory [Pawl et al., 2012]. The analysis does not test the extent to which the items and assessments act differently across populations, e.g., gender, race, or type of physics course. Mixed evidence exists about the measurement invariance [Morley et al., 2023] and differential item functioning [Traxler et al., 2018; Henderson et al., 2018] of the FCI and FMCE. The combination of items from these three assessments administered through a cognitive diagnostic at a large scale will provide a dataset to identify and understand item differences and potential item biases between groups of students.
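
The conjunctive assumption noted at the start of this section can be made concrete with the standard DINA item response function: a student's ideal response to an item is correct only when they have mastered every skill the Q-matrix requires for that item, and slip and guess parameters add noise to that ideal response. The sketch below uses the Q-matrix row for FCI item 7 from Table 12; the slip and guess values are made up for illustration and are not estimates from this study.

```python
def dina_prob_correct(alpha, q_row, slip, guess):
    """Probability of a correct response under the DINA model.

    alpha: 0/1 mastery indicators for one student, one per skill.
    q_row: 0/1 indicators of the skills the item requires.
    slip:  probability that a student with all required skills answers wrong.
    guess: probability that a student missing a required skill answers right.
    """
    # The "and" gate: every required skill must be mastered.
    has_all_required = all(a == 1 for a, q in zip(alpha, q_row) if q == 1)
    return 1.0 - slip if has_all_required else guess


# Skill order: apply vectors, conceptual relationships, algebra, visualizations.
q_fci_07 = [1, 1, 0, 0]  # FCI item 7 in Table 12
slip, guess = 0.10, 0.20  # hypothetical values

p_both = dina_prob_correct([1, 1, 0, 0], q_fci_07, slip, guess)
p_one = dina_prob_correct([1, 0, 0, 0], q_fci_07, slip, guess)
p_none = dina_prob_correct([0, 0, 1, 1], q_fci_07, slip, guess)
print(p_both, p_one, p_none)
```

A student who has mastered only one of the two required skills is treated the same as a student who has mastered neither, which is the restriction that less conjunctive models relax.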

X Conclusions

Combining 102 items from three RBAs into a single item bank to create a CD-CAT provides a solid foundation for building the MCD. The limited number of items assessing the algebra skill and the apply vectors and visualizations skills for energy and momentum point to these as specific areas for improvement of the item bank. Delivering the MCD online, fortunately, has the advantage of allowing for the inclusion of new items under development to fill in gaps in the item bank. The combined item bank will also produce better classification accuracy by having more items to draw on. The high classification accuracy (0.941) for the apply vectors skill on the EMCS, however, indicates that even just 5 items can provide a high classification accuracy. This result indicates that shorter assessments may allow for high levels of classification accuracy for skills while also using fewer questions.

Using LASSO as the delivery system for the MCD provides instructors with an adaptive tool to assess students’ skills and knowledge across content areas or in specific content areas. In particular, using a cognitive diagnostic for the assembly model allows instructors to design formative assessments by choosing the skills and content areas to measure. Integrating guidelines and constraints on test lengths will help instructors design accurate assessments of those skills and content areas. The cognitive diagnostic also allows flexible timing; instructors can design pre- or post-tests that cover many skills and content areas or weekly tests focused on a few skills for one content area.

For researchers, the MCD will collect longitudinal data across skills and content areas. This data can inform the development of learning progressions or the transfer of skills across content areas, such as applying vectors in mathematical, kinematics, and momentum content areas. Developing more items that cover multiple content areas can inform how physics content interacts, which current RBAs do not assess. Because LASSO is free for instructors to use, the data will likely also represent a broader cross-section of physics learners [Nissen et al., 2021] than physics education research has historically included [Kanim and Cid, 2020].

Acknowledgments

This research was made possible through the financial support provided by National Science Foundation Grant No. 2141847. We extend our appreciation to LASSO for their support in both collecting and sharing data for this research.

Appendix

The appendix includes the Q-matrices for the three assessments we used to conduct the DINA model analysis.

Table 12: The table provides the Q-matrix for each FCI item, represented as binary coding.
FCI Item Apply Vectors Conceptual Relationships Algebra Visualizations
1 0 1 0 0
2 0 1 0 0
3 0 1 0 0
4 0 1 0 0
5 1 0 0 0
6 0 1 0 0
7 1 1 0 0
8 1 0 0 0
9 1 0 0 0
10 0 1 0 0
11 1 0 0 0
12 1 1 0 0
13 1 0 0 0
14 1 1 0 0
15 0 1 0 0
16 0 1 0 0
17 1 0 0 0
18 1 0 0 0
19 0 1 0 0
20 0 1 0 1
21 1 0 0 0
22 1 0 0 0
23 1 1 0 0
24 0 1 0 0
25 1 1 0 0
26 1 0 0 0
27 1 0 0 0
28 0 1 0 0
29 1 0 0 0
30 1 0 0 0
Table 13: The table provides the Q-matrix for each FMCE item, represented as binary coding.
FMCE Item Apply Vectors Conceptual Relationships Algebra Visualizations
1 1 1 0 0
2 0 1 0 0
3 1 1 0 0
4 1 1 0 0
5 1 1 0 0
6 1 1 0 0
7 1 1 0 0
8 1 1 0 0
9 1 1 0 0
10 1 1 0 0
11 1 1 0 0
12 1 1 0 0
13 1 1 0 0
14 0 1 0 1
15 0 1 0 1
16 0 1 0 1
17 0 1 0 1
18 0 1 0 1
19 0 1 0 1
20 1 1 0 1
21 1 1 0 1
22 0 1 0 1
23 0 1 0 1
24 0 1 0 1
25 0 1 0 1
26 0 1 0 1
27 1 1 0 0
28 1 1 0 0
29 1 1 0 0
30 0 1 0 0
31 0 1 0 0
32 0 1 0 0
33 0 1 0 0
34 0 1 0 0
35 0 1 0 0
36 0 1 0 0
37 0 1 0 0
38 0 1 0 0
39 0 1 0 0
40 0 1 0 1
41 1 1 0 1
42 0 1 0 1
43 0 1 0 1
44 0 1 0 1
45 0 1 0 1
46 0 1 0 0
47 0 1 0 0
Table 14: The table provides the Q-matrix for each EMCS item, represented as binary coding.
EMCS Item Apply Vectors Conceptual Relationships Algebra Visualizations
1 1 1 0 0
2 0 1 0 0
3 0 1 0 0
4 0 1 0 0
5 1 1 0 0
6 0 1 0 0
7 0 1 0 0
8 0 1 0 0
9 0 1 0 0
10 0 1 0 0
11 1 0 0 0
12 0 1 0 0
13 1 1 0 0
14 0 1 0 0
15 0 1 1 0
16 0 1 0 0
17 0 1 0 0
18 0 1 0 0
19 0 1 0 0
20 0 1 0 0
21 0 1 1 0
22 0 1 0 0
23 1 0 0 0
24 0 1 0 0
25 0 1 0 0

References

  • Madsen et al. [2017] Adrian Madsen, Sarah B McKagan,  and Eleanor C Sayre, “Resource letter RBAI-1: research-based assessment instruments in physics and astronomy,” American Journal of Physics 85, 245–264 (2017).
  • Docktor and Mestre [2014] Jennifer L Docktor and José P Mestre, “Synthesis of discipline-based education research in physics,” Physical Review Special Topics-Physics Education Research 10, 020119 (2014).
  • Madsen et al. [2016] Adrian Madsen, Sarah B McKagan, Mathew Sandy Martinuk, Alexander Bell,  and Eleanor C Sayre, “Research-based assessment affordances and constraints: Perceptions of physics faculty,” Physical Review Physics Education Research 12, 010115 (2016).
  • Van Dusen and Nissen [2020] Ben Van Dusen and Jayson Nissen, “Equity in college physics student learning: A critical quantitative intersectionality investigation,” Journal of Research in Science Teaching 57, 33–57 (2020).
  • Wilcox and Lewandowski [2016] Bethany R Wilcox and HJ Lewandowski, “Research-based assessment of students’ beliefs about experimental physics: When is gender a factor?” Physical Review Physics Education Research 12, 020130 (2016).
  • Thornton et al. [2009] Ronald K Thornton, Dennis Kuhl, Karen Cummings, and Jeffrey Marx, “Comparing the force and motion conceptual evaluation and the force concept inventory,” Physical Review Special Topics-Physics Education Research 5, 010105 (2009).
  • Stoen et al. [2020] Siera M Stoen, Mark A McDaniel, Regina F Frey, K Mairin Hynes,  and Michael J Cahill, “Force concept inventory: More than just conceptual understanding,” Physical Review Physics Education Research 16, 010105 (2020).
  • Laverty et al. [2022] James T Laverty, Amogh Sirnoorkar, Amali Priyanka Jambuge, Katherine D Rainey, Joshua Weaver, Alexander Adamson,  and Bethany R Wilcox, “A new paradigm for research-based assessment development,” in Proceedings of the Physics Education Research Conference (PERC) (2022) pp. 279–284.
  • Wilson [2008] Nance S Wilson, “Teachers expanding pedagogical content knowledge: Learning about formative assessment together,” Journal of In-Service Education 34, 283–298 (2008).
  • Leighton and Gierl [2007] Jacqueline Leighton and Mark Gierl, Cognitive diagnostic assessment for education: Theory and applications (Cambridge University Press, 2007).
  • Cui et al. [2012] Ying Cui, Mark J Gierl,  and Hua-Hua Chang, “Estimating classification consistency and accuracy for cognitive diagnostic assessment,” Journal of Educational Measurement 49, 19–38 (2012).
  • Morphew et al. [2018] Jason W Morphew, Jose P Mestre, Hyeon-Ah Kang, Hua-Hua Chang,  and Gregory Fabry, “Using computer adaptive testing to assess physics proficiency and improve exam performance in an introductory physics course,” Physical Review Physics Education Research 14, 020110 (2018).
  • Chang [2015] Hua-Hua Chang, “Psychometrics behind computerized adaptive testing,” Psychometrika 80, 1–20 (2015).
  • Weiss [1982] David J Weiss, “Improving measurement quality and efficiency with adaptive testing,” Applied psychological measurement 6, 473–492 (1982).
  • Helm et al. [2022] Christoph Helm, Julia Warwas,  and Henry Schirmer, “Cognitive diagnosis models of students’ skill profiles as a basis for adaptive teaching: an example from introductory accounting classes,” Empirical Research in Vocational Education and Training 14, 1–30 (2022).
  • Li and Traynor [2022] Tingxuan Li and Anne Traynor, “The use of cognitive diagnostic modeling in the assessment of computational thinking,” AERA Open 8, 23328584221081256 (2022).
  • Ravand and Robitzsch [2015] Hamdollah Ravand and Alexander Robitzsch, “Cognitive diagnostic modeling using R,” Practical Assessment, Research, and Evaluation 20, 11 (2015).
  • De La Torre and Minchen [2014] Jimmy De La Torre and Nathan Minchen, “Cognitively diagnostic assessments and the cognitive diagnosis model framework,” Psicología Educativa 20, 89–97 (2014).
  • Wang et al. [2015] Wenyi Wang, Lihong Song, Ping Chen, Yaru Meng, and Shuliang Ding, “Attribute-level and pattern-level classification consistency and accuracy indices for cognitive diagnostic assessment,” Journal of Educational Measurement 52, 457–476 (2015).
  • Haertel [1984] Edward Haertel, “An application of latent class models to assessment data,” Applied Psychological Measurement 8, 333–346 (1984).
  • Junker and Sijtsma [2001] Brian W Junker and Klaas Sijtsma, “Cognitive assessment models with few assumptions, and connections with nonparametric item response theory,” Applied Psychological Measurement 25, 258–272 (2001).
  • Mislevy et al. [2003] Robert J Mislevy, Russell G Almond,  and Janice F Lukas, “A brief introduction to evidence-centered design,” ETS Research Report Series 2003, i–29 (2003).
  • Phy [n.d.a] “Physport assessments: Force and motion conceptual evaluation,”  (n.d.a).
  • Alliance [2020] Learning Assistant Alliance, “Learning assistant alliance,”  (2020).
  • Van Dusen et al. [2021] Ben Van Dusen, Mollee Shultz, Jayson M Nissen, Bethany R Wilcox, NG Holmes, Manher Jariwala, Eleanor W Close, HJ Lewandowski,  and Steven Pollock, “Online administration of research-based assessments,” American Journal of Physics 89, 7–8 (2021).
  • Phy [n.d.b] “Physport: Browse assessments,”  (n.d.b).
  • Van Dusen [2018] Ben Van Dusen, “LASSO: A new tool to support instructors and researchers,” American Physical Society Forum on Education Fall 2018 (2018).
  • Hestenes et al. [1992] David Hestenes, Malcolm Wells, and Gregg Swackhamer, “Force concept inventory,” The Physics Teacher 30, 141–158 (1992).
  • Thornton and Sokoloff [1998] Ronald K Thornton and David R Sokoloff, “Assessing student learning of Newton’s laws: The force and motion conceptual evaluation and the evaluation of active learning laboratory and lecture curricula,” American Journal of Physics 66, 338–352 (1998).
  • Singh and Rosengrant [2003] Chandralekha Singh and David Rosengrant, “Multiple-choice test of energy and momentum concepts,” American Journal of Physics 71, 607–617 (2003).
  • Collares [2022] Carlos Fernando Collares, “Cognitive diagnostic modeling in healthcare professions education: an eye-opener,” Advances in Health Sciences Education 27, 427–440 (2022).
  • Şahin and Özbasi [2017] Alper Şahin and Durmus Özbasi, “Effects of content balancing and item selection method on ability estimation in computerized adaptive tests,” Eurasian Journal of Educational Research  (2017).
  • Chen et al. [2008] Shu-Ying Chen, Pui-Wa Lei,  and Wen-Han Liao, “Controlling item exposure and test overlap on the fly in computerized adaptive testing,” British Journal of Mathematical and Statistical Psychology 61, 471–492 (2008).
  • Anamezie and Nnadi [2018] Rose C Anamezie and Fidelis O Nnadi, “Parameterization of teacher-made physics achievement test using deterministic-input-noisy-and-gate (DINA) model,” Journal of Education and Practice 9, 101–109 (2018).
  • Chen et al. [2015] Yunxiao Chen, Jingchen Liu, Gongjun Xu, and Zhiliang Ying, “Statistical analysis of Q-matrix based diagnostic classification models,” Journal of the American Statistical Association 110, 850–866 (2015).
  • Zhang et al. [2021] Jiwei Zhang, Jing Lu, Jing Yang, Zhaoyuan Zhang, and Shanshan Sun, “Exploring multiple strategic problem solving behaviors in educational psychology research by using mixture cognitive diagnosis model,” Frontiers in Psychology 12, 568348 (2021).
  • Yasuda et al. [2021a] J Yasuda, N Mae, MM Hull,  and M Taniguchi, “Analysis to develop computerized adaptive testing with the force concept inventory,” in Journal of Physics: Conference Series, Vol. 1929 (IOP Publishing, 2021) p. 012009.
  • Istiyono et al. [2018] Edi Istiyono, Wipsar Sunu Brams Dwandaru, and Revnika Faizah, “Mapping of physics problem-solving skills of senior high school students using PhysProSS-CAT,” REID (Research and Evaluation in Education) 4, 144–154 (2018).
  • Yasuda et al. [2021b] Jun-ichiro Yasuda, Naohiro Mae, Michael M Hull, and Masa-aki Taniguchi, “Optimizing the length of computerized adaptive testing for the force concept inventory,” Physical Review Physics Education Research 17, 010115 (2021b).
  • Sheehan et al. [2007] Kathleen M Sheehan, Irene Kostin,  and Yoko Futagi, “Supporting efficient, evidence-centered item development for the GRE verbal measure,” ETS Research Report Series 2007, i–63 (2007).
  • Pollard et al. [2021] Benjamin Pollard, Robert Hobbs, Rachel Henderson, Marcos D Caballero,  and HJ Lewandowski, “Introductory physics lab instructors’ perspectives on measurement uncertainty,” Physical Review Physics Education Research 17, 010133 (2021).
  • Vignal et al. [2023] Michael Vignal, Gayle Geschwind, Benjamin Pollard, Rachel Henderson, Marcos D Caballero,  and HJ Lewandowski, “Survey of physics reasoning on uncertainty concepts in experiments: an assessment of measurement uncertainty for introductory physics labs,” arXiv preprint arXiv:2302.07336  (2023).
  • Nissen et al. [2022] Jayson M Nissen, Ian Her Many Horses, Ben Van Dusen, Manher Jariwala,  and Eleanor Close, “Providing context for identifying effective introductory mechanics courses,” The Physics Teacher 60, 179–182 (2022).
  • Beatty [2013] Ian D Beatty, “Standards-based grading in introductory university physics,” Journal of the Scholarship of Teaching and Learning , 1–22 (2013).
  • Tatsuoka [2012] Kikumi K Tatsuoka, “Architecture of knowledge structures and cognitive diagnosis: A statistical pattern recognition and classification approach,” in Cognitively Diagnostic Assessment (Routledge, 2012) pp. 327–359.
  • de la Torre [2009] J. de la Torre, “DINA model and parameter estimation: A didactic,” Journal of Educational and Behavioral Statistics 34, 115 (2009).
  • Ma and de la Torre [2020] Wenchao Ma and Jimmy de la Torre, “GDINA: An R package for cognitive diagnosis modeling,” Journal of Statistical Software 93, 1–26 (2020).
  • Hooper et al. [2008] Daire Hooper, Joseph Coughlan,  and Michael Mullen, “Evaluating model fit: a synthesis of the structural equation modelling literature,” in 7th European Conference on research methodology for business and management studies, Vol. 2008 (2008) pp. 195–200.
  • Hu and Bentler [1999] Li-tze Hu and Peter M Bentler, “Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives,” Structural equation modeling: a multidisciplinary journal 6, 1–55 (1999).
  • Maydeu-Olivares and Joe [2014] Alberto Maydeu-Olivares and Harry Joe, “Assessing approximate fit in categorical data analysis,” Multivariate Behavioral Research 49, 305–328 (2014).
  • Jang [2018] Sung Tae Jang, “The implications of intersectionality on southeast asian female students’ educational outcomes in the united states: A critical quantitative intersectionality analysis,” American Educational Research Journal 55, 1268–1306 (2018).
  • de la Torre and Chiu [2016] Jimmy de la Torre and Chia-Yi Chiu, “A general method of empirical Q-matrix validation,” Psychometrika 81, 253–273 (2016).
  • Bentler [1990] Peter M Bentler, “Comparative fit indexes in structural models.” Psychological Bulletin 107, 238 (1990).
  • Zavala et al. [2017] Genaro Zavala, Santa Tejeda, Pablo Barniol,  and Robert J Beichner, “Modifying the test of understanding graphs in kinematics,” Physical Review Physics Education Research 13, 020111 (2017).
  • Barniol and Zavala [2014] Pablo Barniol and Genaro Zavala, “Test of understanding of vectors: A reliable multiple-choice vector concept test,” Physical Review Special Topics-Physics Education Research 10, 010121 (2014).
  • Pawl et al. [2012] Andrew Pawl, Analia Barrantes, Carolin Cardamone, Saif Rayyan,  and David E Pritchard, “Development of a mechanics reasoning inventory,” in AIP Conference Proceedings, Vol. 1413 (American Institute of Physics, 2012) pp. 287–290.
  • Morley et al. [2023] Alicen Morley, Jayson M Nissen,  and Ben Van Dusen, “Measurement invariance across race and gender for the force concept inventory,” Physical Review Physics Education Research 19, 020102 (2023).
  • Traxler et al. [2018] Adrienne Traxler, Rachel Henderson, John Stewart, Gay Stewart, Alexis Papak,  and Rebecca Lindell, “Gender fairness within the force concept inventory,” Physical Review Physics Education Research 14, 010103 (2018).
  • Henderson et al. [2018] Rachel Henderson, Paul Miller, John Stewart, Adrienne Traxler,  and Rebecca Lindell, “Item-level gender fairness in the force and motion conceptual evaluation and the conceptual survey of electricity and magnetism,” Physical Review Physics Education Research 14, 020103 (2018).
  • Nissen et al. [2021] Jayson M Nissen, Ian Her Many Horses, Ben Van Dusen, Manher Jariwala,  and Eleanor W Close, “Tools for identifying courses that support development of expertlike physics attitudes,” Physical Review Physics Education Research 17, 013103 (2021).
  • Kanim and Cid [2020] Stephen Kanim and Ximena C Cid, “Demographics of physics education research,” Physical Review Physics Education Research 16, 020106 (2020).