Hi! This is a poster on eye-tracked evaluation of subtitles in immersive VR 360 video, with Marta Brescia-Zapata and Pilar Orero from Universitat Autònoma de Barcelona, Krzysztof Krejtz at SWPS University in Poland, Chris Hughes from Salford in the UK, and myself from Clemson in South Carolina, the U.S.A. All right.

So subtitles are important not only for multilingual translation, but also for accessibility services. And while standards exist for 2D media, little has been explored for immersive media, in VR/AR and 360 video for example; that is what our poster is about.

So we designed a very controlled experiment where we compared head-locked and fixed subtitles, so that's position, versus monochrome and color subtitles, in a very controlled manner such that each viewer would only see one type of video content once. So they either saw head-locked in color or fixed in monochrome, but from the two different videos, either the one on the left there or the one on the right.

To do this we wanted to eye-track the experiment, to see what people are looking at inside VR 360, and to get a sense of how well the viewers actually looked at the subtitles. Our framework for evaluating this in VR is really the novelty of our contribution: we looked at psychophysiological measures, eye movements, and performance metrics, and reported those kinds of statistics. The framework is based on an HTC Vive Pro Eye, where we use a data manager to display the subtitles with a particle-system-like implementation from computer graphics, which runs in its own thread so that we can capture eye movements at 120 Hz (a sketch of this logging pattern appears below). We then record everything, including video position and caption position, through the manager, and later we can play back or output the results that we collected.

We hypothesized that head-locked subtitles, the ones that are always in front of the viewer, would be better for content comprehension, measured by lower task load, in other words lower cognitive demand, self-reported in this case, and also that they would be easier to read in some sense; we'll look at that in a minute. We used a 2-by-2 mixed design, with position as a between-subjects factor and color as a within-subjects factor. So, like I said, viewers would only see one type of content at a time, counterbalanced. Twenty-four participants took part in our study, most of them digital-media savvy.
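A quick aside on that recording framework: the talk does not detail the engine or eye-tracking API, so what follows is only a minimal sketch, in Python, of the logging pattern described, where a dedicated thread samples gaze at roughly 120 Hz and stores it alongside the current video time and caption position so the session can be replayed later. The provider functions and the CSV output are hypothetical placeholders, not the authors' code.

```python
# Illustrative sketch only (not the authors' implementation): the study's data
# manager runs against an HTC Vive Pro Eye; the provider functions below
# (gaze direction, video time, caption position) are hypothetical placeholders.
import csv
import threading
import time

SAMPLE_HZ = 120                     # eye-movement sampling rate cited in the talk
PERIOD = 1.0 / SAMPLE_HZ


def record_session(get_gaze_direction, get_video_time, get_caption_position,
                   out_path, stop_event):
    """Log gaze, video time, and caption position at a steady rate on its own thread."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "gaze_x", "gaze_y", "gaze_z",
                         "video_time", "cap_x", "cap_y", "cap_z"])
        next_tick = time.perf_counter()
        while not stop_event.is_set():
            gx, gy, gz = get_gaze_direction()      # gaze ray in world coordinates
            cx, cy, cz = get_caption_position()    # where the subtitle is drawn
            writer.writerow([time.time(), gx, gy, gz,
                             get_video_time(), cx, cy, cz])
            next_tick += PERIOD                    # keep a steady 120 Hz cadence
            time.sleep(max(0.0, next_tick - time.perf_counter()))


if __name__ == "__main__":
    # Dummy providers stand in for the real HMD and video-player hooks.
    stop = threading.Event()
    logger = threading.Thread(
        target=record_session,
        args=(lambda: (0.0, 0.0, 1.0),             # gaze straight ahead
              time.perf_counter,                   # stand-in for video time
              lambda: (0.0, -0.3, 1.0),            # fixed caption position
              "session01.csv", stop),
        daemon=True)
    logger.start()
    time.sleep(1.0)                                # record ~120 samples as a demo
    stop.set()
    logger.join()
```

In the actual system this loop would be fed by the Vive Pro Eye's gaze data and the data manager's caption positions, and playback would reconstruct the session from the logged rows.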
Here's the experiment. We started with a demographic questionnaire, then calibrated the eye tracker inside the HMD, provided instructions, played the clip, and then administered the NASA TLX, the self-reported task-load questionnaire; then video clip 2, counterbalanced, so that the order of presentation was balanced across participants; and then a debrief session to explain what was going on.

Results show that self-reported task load was indeed lower for head-locked than for fixed subtitles. Fixed subtitles are fixed in space within the video, so the user has to rotate their head and essentially find the subtitles that are positioned next to the speaker.

Ambient-focal attention, our K coefficient, a metric based on the fixation duration minus the amplitude of the subsequent saccade (the standard formulation is noted below; see the poster for further details), showed that for head-locked subtitles viewers were mainly focal, so they could read the subtitles, and then ambient, so they could move their eyes around afterwards. The observation was somewhat reversed for the fixed subtitles: viewers were ambient first, so they were most likely looking for the subtitles, and then focal in time to be able to read them.

With head-locked subtitles, total fixation time was split pretty much evenly between scene and subtitles; in other words, participants could see as much of the scene as of the subtitles, in a balanced kind of way, whereas the fixed subtitles allowed more viewing of the scene and less time on the subtitles. On the one hand that is good, because you get to see more of the scene as intended rather than spending most of your time on the subtitles; on the other hand, it could also mean that you're having a hard time finding the subtitles, which are positioned somewhere in the scene.

So head-locked subtitles are better, which was expected, and they allowed better content comprehension, as expected. We did not find much evidence of an impact of color in this particular instance. But I think this is a landmark study in the sense of processing eye movements as participants are viewing subtitles in VR. Those kinds of studies are scarce in 2D, and even more scarce in VR. Hopefully further work on this, including more controlled experiments, will lead to better content comprehension and better use of subtitles in video, whether for accessibility purposes or for translation.
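A note on the ambient-focal metric mentioned in the results: in its standard formulation (Krejtz et al.), the coefficient K is the mean, over fixation-saccade pairs, of the standardized fixation duration minus the standardized amplitude of the following saccade,

\[
\mathcal{K} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{d_i - \mu_d}{\sigma_d} - \frac{a_{i+1} - \mu_a}{\sigma_a}\right),
\]

where \(d_i\) is the duration of fixation \(i\), \(a_{i+1}\) the amplitude of the saccade that follows it, and \(\mu\), \(\sigma\) the corresponding means and standard deviations over the viewing session. Positive values indicate focal viewing (long fixations, short saccades) and negative values ambient viewing; the poster should be consulted for the exact computation used in this study.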
That's about it. Thank you very much. As I said, this is a large collaboration, also funded by the LEAD-ME COST Action (European Cooperation in Science and Technology), supported by the Horizon 2020 programme, and it involves several groups of people from Poland, the UK, Spain, and the U.S.A. Thank you so much.