Hi! This is a poster on eye-tracked evaluation of subtitles in immersive VR 360 video, with Marta Brescia-Zapata and Pilar Orero from Universitat Autònoma de Barcelona, Krzysztof Krejtz at SWPS University in Poland, Chris Hughes from Salford in the UK, and myself from Clemson in South Carolina, the U.S.A. All right.

So subtitles are important not only for multilingual translation, but also for accessibility services. And while standards exist for 2D media, little has been explored for immersive media, in VR/AR and 360 video for example; that is what our poster is about.

So we designed a very controlled experiment where we compared head-locked and fixed subtitles, so that's position, versus monochrome and color subtitles, in a very controlled manner such that each viewer would only see one type of video content once. So they either saw head-locked in color or fixed in monochrome, but from the two different videos, either the one on the left there or the one on the right.

To do this we wanted to eye-track the experiment, to see what people are looking at inside VR 360, and to get a sense of how well the viewers actually looked at the subtitles. Our framework for evaluating this in VR is really the novelty of our contribution: we looked at psychophysiological measures, eye movements, and performance metrics, and reported those kinds of statistics. The framework is based on an HTC Vive Pro Eye, where we use a data manager to display the subtitles with a particle-system-like implementation from computer graphics, which runs in its own thread so that we can capture eye movements at 120 Hz (a sketch of this logging pattern appears below). We then record everything, including video position and caption position, through the manager, and later we can play back or output the results that we collected.

We hypothesized that head-locked subtitles, the ones that are always in front of the viewer, would be better for content comprehension, measured by lower task load, in other words lower cognitive demand, self-reported in this case, and also that they would be easier to read in some sense; we'll look at that in a minute. We used a 2-by-2 mixed design, with position as a between-subjects factor and color as a within-subjects factor. So, like I said, viewers would only see one type of content at a time, counterbalanced. Twenty-four participants took part in our study, most of them digital-media savvy.
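A quick aside on that recording framework: the talk does not detail the engine or eye-tracking API, so what follows is only a minimal sketch, in Python, of the logging pattern described, where a dedicated thread samples gaze at roughly 120 Hz and stores it alongside the current video time and caption position so the session can be replayed later. The provider functions and the CSV output are hypothetical placeholders, not the authors' code.

```python
# Illustrative sketch only (not the authors' implementation): the study's data
# manager runs against an HTC Vive Pro Eye; the provider functions below
# (gaze direction, video time, caption position) are hypothetical placeholders.
import csv
import threading
import time

SAMPLE_HZ = 120                     # eye-movement sampling rate cited in the talk
PERIOD = 1.0 / SAMPLE_HZ


def record_session(get_gaze_direction, get_video_time, get_caption_position,
                   out_path, stop_event):
    """Log gaze, video time, and caption position at a steady rate on its own thread."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "gaze_x", "gaze_y", "gaze_z",
                         "video_time", "cap_x", "cap_y", "cap_z"])
        next_tick = time.perf_counter()
        while not stop_event.is_set():
            gx, gy, gz = get_gaze_direction()      # gaze ray in world coordinates
            cx, cy, cz = get_caption_position()    # where the subtitle is drawn
            writer.writerow([time.time(), gx, gy, gz,
                             get_video_time(), cx, cy, cz])
            next_tick += PERIOD                    # keep a steady 120 Hz cadence
            time.sleep(max(0.0, next_tick - time.perf_counter()))


if __name__ == "__main__":
    # Dummy providers stand in for the real HMD and video-player hooks.
    stop = threading.Event()
    logger = threading.Thread(
        target=record_session,
        args=(lambda: (0.0, 0.0, 1.0),             # gaze straight ahead
              time.perf_counter,                   # stand-in for video time
              lambda: (0.0, -0.3, 1.0),            # fixed caption position
              "session01.csv", stop),
        daemon=True)
    logger.start()
    time.sleep(1.0)                                # record ~120 samples as a demo
    stop.set()
    logger.join()
```

In the actual system this loop would be fed by the Vive Pro Eye's gaze data and the data manager's caption positions, and playback would reconstruct the session from the logged rows.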
Here's the experiment. We started with a demographic questionnaire, then calibrated the eye tracker inside the HMD, provided instructions, played the clip, and then administered the NASA TLX, the self-reported task-load questionnaire; then video clip 2, counterbalanced, so that the order of presentation was balanced across participants; and then a debrief session to explain what was going on.

Results show that self-reported task load was indeed lower for head-locked than for fixed subtitles. Fixed subtitles are fixed in space within the video, so the user has to rotate their head and essentially find the subtitles that are positioned next to the speaker.

Ambient-focal attention, our K coefficient, a metric based on the fixation duration minus the amplitude of the subsequent saccade (the standard formulation is noted below; see the poster for further details), showed that for head-locked subtitles viewers were mainly focal, so they could read the subtitles, and then ambient, so they could move their eyes around afterwards. The observation was somewhat reversed for the fixed subtitles: viewers were ambient first, so they were most likely looking for the subtitles, and then focal in time to be able to read them.

With head-locked subtitles, total fixation time was split pretty much evenly between scene and subtitles; in other words, participants could see as much of the scene as of the subtitles, in a balanced kind of way, whereas the fixed subtitles allowed more viewing of the scene and less time on the subtitles. On the one hand that is good, because you get to see more of the scene as intended rather than spending most of your time on the subtitles; on the other hand, it could also mean that you're having a hard time finding the subtitles, which are positioned somewhere in the scene.

So head-locked subtitles are better, which was expected, and they allowed better content comprehension, as expected. We did not find much evidence of an impact of color in this particular instance. But I think this is a landmark study in the sense of processing eye movements as participants are viewing subtitles in VR. Those kinds of studies are scarce in 2D, and even more scarce in VR. Hopefully further work on this, including more controlled experiments, will lead to better content comprehension and better use of subtitles in video, whether for accessibility purposes or for translation.
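A note on the ambient-focal metric mentioned in the results: in its standard formulation (Krejtz et al.), the coefficient K is the mean, over fixation-saccade pairs, of the standardized fixation duration minus the standardized amplitude of the following saccade,

\[
\mathcal{K} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{d_i - \mu_d}{\sigma_d} - \frac{a_{i+1} - \mu_a}{\sigma_a}\right),
\]

where \(d_i\) is the duration of fixation \(i\), \(a_{i+1}\) the amplitude of the saccade that follows it, and \(\mu\), \(\sigma\) the corresponding means and standard deviations over the viewing session. Positive values indicate focal viewing (long fixations, short saccades) and negative values ambient viewing; the poster should be consulted for the exact computation used in this study.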
That's about it. Thank you very much. As I said, this is a large collaboration, also funded by the LEAD-ME COST Action (European Cooperation in Science and Technology), supported by the Horizon 2020 programme, and it involves several groups of people from Poland, the UK, Spain, and the U.S.A. Thank you so much.