Touchstone2: An Interactive Environment for Exploring Trade-offs in HCI Experiment Design

CHI 2019, Paper 217.

Keywords: counterbalancing, experiment design, power analysis, randomization, reproducibility

Abstract:

Touchstone2 offers a direct-manipulation interface for generating and examining trade-offs in experiment designs. Based on interviews with experienced researchers, we developed an interactive environment for manipulating experiment design parameters, revealing patterns in trial tables, and estimating and comparing statistical power. We al…

Code:

  • https://github.com/ZPAC-UZH/Touchstone2
  • https://github.com/ZPAC-UZH/TSL
Introduction
  • Human-Computer Interaction (HCI) researchers often compare the effectiveness of interaction techniques or other independent variables with respect to specified measures, e.g. speed and accuracy
  • Designing such experiments is deceptively tricky: researchers must not only control for extraneous nuisance variables, such as fatigue and learning effects, but weigh the costs of adding more conditions or participants versus the benefits of higher statistical power.
  • Cockburn et al. [5] argue persuasively in favor of pre-registering these decisions, in line with other scientific disciplines.
  • To make this possible, the HCI community needs a common language for defining and sharing experiment designs.
  • The authors also need tools for exploring design trade-offs and for capturing the final design for easy comparison with published designs.
Highlights
  • Human-Computer Interaction (HCI) researchers often compare the effectiveness of interaction techniques or other independent variables with respect to specified measures, e.g. speed and accuracy
  • This paper focuses on two aspects of experiment design: counterbalancing and a priori power analysis
  • An a priori power analysis lets experimenters determine the number of participants necessary to detect an effect of a specified size, given a significance criterion (a minimal power-analysis sketch follows this list)
  • We argue that existing Human-Computer Interaction experiment design platforms should be extended to support generating and visualizing alternative designs, based on randomization, power analysis, and other factors
  • We describe the user interface for specifying and comparing alternatives according to diverse criteria, e.g. randomization strategies, session length, and statistical power
  • Experiments are more likely to be reproducible when researchers have complete and unambiguous specifications of experiment designs, which may be unavailable in research papers due to the lack of a common language and page limits
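To make the a priori power analysis concrete, here is a minimal sketch in Python. It uses the statsmodels library rather than Touchstone2's own power engine, and the effect size, significance criterion, and power target are illustrative assumptions, not values from the paper.

```python
import math
from statsmodels.stats.power import TTestPower

# Solve for the number of participants needed to detect an assumed medium
# effect (Cohen's d = 0.5) in a within-subjects (paired) comparison at
# alpha = .05 with 80% power. All three inputs are illustrative assumptions.
analysis = TTestPower()
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                         alternative='two-sided')
print(f"Participants needed: {math.ceil(n)}")  # roughly 34 for these inputs
```

Increasing the assumed effect size or lowering the power target reduces the required sample, which is exactly the kind of trade-off a power chart makes visible.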
Methods
  • The authors recruited 10 researchers who had designed, run and published one or more controlled experiments: 2 post-docs, 7 Ph.D. students and 1 graduate assistant, in Economics (1), Biology (1), Psychology (2) and HCI (6).
  • Two authors observed the teams, answered questions about Touchstone2 and noted any bugs, problems, desired features or suggestions for improvement.
  • The authors encouraged participants to write any feedback or observations in the text area provided.
  • Participants shared their impressions of Touchstone2 in a final plenary discussion (15 minutes).
  • Data collection: The authors collected logs of each team’s experiment creation process, their final experiment design(s) and their written feedback, as well as the observers’ notes
Results
  • Participants highlighted the following design challenges. Time constraints (8/10): P3 works with small children with short attention spans, so sessions can last at most five minutes.
  • All teams adjusted parameters within each design, e.g. number of participants or counterbalancing strategies, and inspected how the trial tables change (a toy trial-table sketch follows this list).
  • Three said that power differences would influence their recruitment decisions: “If recruiting participants is not very hard I would probably perhaps [add more]. It seems more sound.” (P10)
  • One said she would use the power chart to justify recruiting fewer participants. “If I am struggling [recruiting], I think the chart is useful to say OK, no.” (P3)
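As a rough illustration of what “inspecting how trial tables change” involves, the sketch below generates a toy trial table in Python. The factor names (Technique, Size), the replication count, and the simple alternating counterbalancing rule are hypothetical and do not reproduce Touchstone2's TSL semantics.

```python
from itertools import product

def trial_table(n_participants, replications=2):
    """Cross two within-subjects factors for each participant, reversing the
    presentation order of Technique for every other participant as a very
    simple counterbalancing strategy (illustrative only)."""
    techniques = ["A", "B"]
    sizes = ["Small", "Large"]
    rows = []
    for p in range(1, n_participants + 1):
        order = techniques if p % 2 else list(reversed(techniques))
        for rep in range(1, replications + 1):
            for tech, size in product(order, sizes):
                rows.append((p, rep, tech, size))
    return rows

for row in trial_table(n_participants=2):
    print(row)  # (participant, replication, Technique, Size)
```

Changing the number of participants, the replication count, or the ordering rule immediately changes the table's length and structure, which is the kind of effect the teams explored interactively.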
Conclusion
  • The authors found that participants face numerous constraints, some predictable, e.g. P3’s limited session time; some emergent, e.g. P8’s discovery of a learning effect
  • They struggle to weigh the costs and benefits of different parameters and lack a standard way to represent and communicate their experiments.
  • The authors argue that existing HCI experiment design platforms should be extended to support generating and visualizing alternative designs, based on randomization, power analysis, and other factors
  • This requires a common format for representing experiments, so they can be replicated and shared within the HCI community (a hypothetical sketch of such a format follows this list).
  • Researchers navigate trade-offs not only in the design itself but also in their design process.
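As one possible illustration of such a common format, the sketch below encodes a design as a small, serializable data structure in Python. It is hypothetical: Touchstone2's actual interchange format is its TSL language, whose syntax is not reproduced here.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class Factor:
    name: str
    levels: list
    design: str = "within"        # "within" or "between" subjects

@dataclass
class ExperimentDesign:
    participants: int
    counterbalancing: str         # e.g. "latin-square"
    replications: int
    factors: list = field(default_factory=list)

# An illustrative design: 12 participants, Latin-square counterbalancing,
# 3 replications of a single four-level within-subjects factor.
design = ExperimentDesign(
    participants=12,
    counterbalancing="latin-square",
    replications=3,
    factors=[Factor("Technique", ["A", "B", "C", "D"])],
)
print(json.dumps(asdict(design), indent=2))  # a shareable, replicable record
```

A declarative record like this is what makes a design easy to pre-register, compare with published designs, and re-run.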
Tables
  • Table 1: C D A B
  • Table 2: D C B A
  • Table 3: B A D C
Related work
  • This paper focuses on two aspects of experiment design: counterbalancing (illustrated in the sketch below) and a priori power analysis. The research literature includes different conventions for representing experiment designs, and provides some software packages for ensuring counterbalancing and assessing power.

    Representing experiment designs

    Individual research disciplines use various techniques for optimizing experiment designs. For example, industrial manufacturing uses response surface designs [2] and the Taguchi method [23] for between-subjects designs. They treat product elements as experiment subjects and focus solely on determining the optimal number of levels for each independent variable. In the natural sciences, Soldatova and King [29] created a computer-readable ontology of scientific experiments (EXPO) that defines terms related to scientific discovery: research, null and alternative hypotheses, independent (IV) and dependent (DV) variables, and coefficients.
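To make the counterbalancing discussion concrete, here is a sketch of a standard balanced Latin square construction (a Williams design) in Python. It is a textbook construction, not necessarily the algorithm used by Touchstone2 or the packages cited above.

```python
def balanced_latin_square(conditions):
    """Williams-style balanced Latin square for an even number of conditions:
    every condition appears once per row and per column, and each condition
    immediately precedes every other condition equally often."""
    n = len(conditions)
    # First row follows the classic zig-zag index pattern 0, 1, n-1, 2, n-2, ...
    first_row = [0 if k == 0 else (k + 1) // 2 if k % 2 else n - k // 2
                 for k in range(n)]
    # Subsequent rows shift every index by the row number (mod n).
    return [[conditions[(idx + r) % n] for idx in first_row] for r in range(n)]

for row in balanced_latin_square(["A", "B", "C", "D"]):
    print(" ".join(row))
# A B D C
# B C A D
# C D B A
# D A C B
```

Each row gives one participant's presentation order, so condition orderings like those listed in the tables above can be generated systematically rather than by hand.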
Funding
  • This work was partially supported by European Research Council (ERC) grants No 321135 “CREATIV: Creating CoAdaptive Human-Computer Partnerships” and No 695464 “ONE: Unified Principles of Interaction”
References
  • [1] Monya Baker. 2016. 1500 scientists lift the lid on reproducibility. Nature 533, 11 (2016), 452–454. DOI:http://dx.doi.org/10.1038/533452a
  • [2] G. E. P. Box and K. B. Wilson. 1992. On the Experimental Attainment of Optimum Conditions. Springer New York, New York, NY, 270–310. DOI:http://dx.doi.org/10.1007/978-1-4612-4380-9_23
  • [3] Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative Research in Psychology 3, 2 (2006), 77–101. DOI:http://dx.doi.org/10.1191/1478088706qp063oa
  • [4] Stephane Champely, Claus Ekstrom, Peter Dalgaard, Jeffrey Gill, Stephan Weibelzahl, Aditya Anandkumar, Clay Ford, Robert Volcic, and Helios De Rosario. 2018. Package ’pwr’. R package version 1.2.2. https://CRAN.R-project.org/package=pwr
  • [5] Andy Cockburn, Carl Gutwin, and Alan Dix. 2018. HARK No More: On the Preregistration of CHI Experiments. In Proc. Human Factors in Computing Systems (CHI ’18). ACM, New York, NY, USA, Article 141, 12 pages. DOI:http://dx.doi.org/10.1145/3173574.3173715
  • [6] Jacob Cohen. 1988. Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
  • [7] David Roxbee Cox and Nancy Reid. 2000. The Theory of the Design of Experiments. CRC Press.
  • [8] Pierre Dragicevic. 2016. Fair statistical communication in HCI. In Modern Statistical Methods for HCI. Springer, 291–330.
  • [9] Franz Faul, Edgar Erdfelder, Albert-Georg Lang, and Axel Buchner. 2007. G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods 39, 2 (May 2007), 175–191. DOI:http://dx.doi.org/10.3758/BF03193146
  • [10] Ronald Aylmer Fisher. 1937. The Design of Experiments. Oliver and Boyd, Edinburgh; London.
  • [11] Tovi Grossman and Ravin Balakrishnan. 2005. The Bubble Cursor: Enhancing Target Acquisition by Dynamic Resizing of the Cursor’s Activation Area. In Proc. Human Factors in Computing Systems (CHI ’05). ACM, New York, NY, USA, 281–290. DOI:http://dx.doi.org/10.1145/1054972.1055012
  • [12] Transparent Statistics in Human–Computer Interaction working group. 2018. Transparent Statistics Guidelines. Technical Report. DOI:http://dx.doi.org/10.5281/zenodo.1186169 Available at https://transparentstats.github.io/guidelines.
  • [13] Matthew A. Jaro. 1989. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J. Amer. Statist. Assoc. 84, 406 (1989), 414–420.
  • [14] Daniel Kahneman, Jack L. Knetsch, and Richard H. Thaler. 1991. Anomalies: The Endowment Effect, Loss Aversion, and Status Quo Bias. Journal of Economic Perspectives 5, 1 (March 1991), 193–206. DOI:http://dx.doi.org/10.1257/jep.5.1.193
  • [15] Matthew Kay, Gregory L. Nelson, and Eric B. Hekler. 2016. Researcher-Centered Design of Statistics: Why Bayesian Statistics Better Fit the Culture and Incentives of HCI. In Proc. Human Factors in Computing Systems (CHI ’16). ACM, New York, NY, USA, 4521–4532. DOI:http://dx.doi.org/10.1145/2858036.2858465
  • [16] Ross D. King and others. 2009. The Automation of Science. Science 324, 5923 (2009), 85–89. DOI:http://dx.doi.org/10.1126/science.1165620
  • [17] Mark W. Lipsey. 1990. Design Sensitivity: Statistical Power for Experimental Research. Vol. 19. Sage.
  • [18] Wendy E. Mackay. 2002. Using video to support interaction design. DVD Tutorial, CHI 2, 5 (2002).
  • [19] Wendy E. Mackay, Caroline Appert, Michel Beaudouin-Lafon, Olivier Chapuis, Yangzhou Du, Jean-Daniel Fekete, and Yves Guiard. 2007. Touchstone: Exploratory Design of Experiments. In Proc. Human Factors in Computing Systems (CHI ’07). ACM, New York, NY, USA, 1425–1434. DOI:http://dx.doi.org/10.1145/1240624.1240840
  • [20] Xiaojun Meng, Pin Sym Foong, Simon Perrault, and Shengdong Zhao. 2017. NexP: A Beginner Friendly Toolkit for Designing and Conducting Controlled Experiments. Springer International Publishing, Cham, 132–141. DOI:http://dx.doi.org/10.1007/978-3-319-67687-6_10
  • [21] Tyler Morgan-Wall and George Khoury. 2018. skpr: Design of Experiments Suite: Generate and Evaluate Optimal Designs. R package version 0.54.3. https://CRAN.R-project.org/package=skpr
  • [22] Kevin R. Murphy, Brett Myors, and Allen Wolach. 2014. Statistical Power Analysis: A Simple and General Model for Traditional and Modern Hypothesis Tests. Routledge.
  • [23] Vijayan N. Nair, Bovas Abraham, Jock MacKay, John A. Nelder, George Box, Madhav S. Phadke, Raghu N. Kacker, Jerome Sacks, William J. Welch, Thomas J. Lorenzen, Anne C. Shoemaker, Kwok L. Tsui, James M. Lucas, Shin Taguchi, Raymond H. Myers, G. Geoffrey Vining, and C. F. Jeff Wu. 1992. Taguchi’s Parameter Design: A Panel Discussion. Technometrics 34, 2 (1992), 127–161. http://www.jstor.org/stable/1269231
  • [24] C. Papadopoulos, I. Gutenko, and A. E. Kaufman. 2016. VEEVVIE: Visual Explorer for Empirical Visualization, VR and Interaction Experiments. IEEE Transactions on Visualization and Computer Graphics 22, 1 (2016), 111–120. DOI:http://dx.doi.org/10.1109/TVCG.2015.2467954
  • [25] Ramana Rao and Stuart K. Card. 1994. The Table Lens: Merging Graphical and Symbolic Representations in an Interactive Focus + Context Visualization for Tabular Information. In Proc. Human Factors in Computing Systems (CHI ’94). ACM, New York, NY, USA, 318–322. DOI:http://dx.doi.org/10.1145/191666.191776
  • [26] Martin Oliver Sailer. 2013. crossdes: Construction of Crossover Designs. R package version 1.1. https://CRAN.R-project.org/package=crossdes
  • [27] SAS Institute Inc. 2016. JMP® 13 Design of Experiments Guide. SAS Institute Inc., Cary, NC, USA.
  • [28] Joseph P. Simmons, Leif D. Nelson, and Uri Simonsohn. 2011. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science 22, 11 (2011), 1359–1366.
  • [29] Larisa N. Soldatova and Ross D. King. 2006. An ontology of scientific experiments. Journal of The Royal Society Interface 3, 11 (2006), 795–803. DOI:http://dx.doi.org/10.1098/rsif.2006.0134
  • [30] Lisa Tweedie, Robert Spence, Huw Dawkes, and Hus Su. 1996. Externalising Abstract Mathematical Models. In Proc. Human Factors in Computing Systems (CHI ’96). ACM, New York, NY, USA, 406–ff. DOI:http://dx.doi.org/10.1145/238386.238587
  • [31] Jelte Wicherts, Coosje Veldkamp, Hilde Augusteijn, Marjan Bakker, Robbie Van Aert, and Marcel Van Assen. 2016. Degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to avoid p-hacking. Frontiers in Psychology 7 (2016), 1832.
  • [32] Daniel Wollschläger. 2017. Grundlagen der Datenanalyse mit R. Springer Berlin Heidelberg. DOI:http://dx.doi.org/10.1007/978-3-662-53670-4
  • [33] Koji Yatani. 2016. Effect Sizes and Power Analysis in HCI. Springer International Publishing, Cham, 87–110. DOI:http://dx.doi.org/10.1007/978-3-319-26633-6_5
Best Paper of CHI 2019