Hi, I'm Andrew Duchowski, and I'll be presenting our recent work on eye tracking in 360° video, with a focus on accessibility, standing in for Chris Hughes. He's really our lead developer in our LEAD-ME COST Action workgroup. Here in Vienna, I'll talk a little bit about the background and motivation for accessible, immersive video, and then about what we do: eye tracking in 360° video to test various usability applications and accessibility developments.

Okay, background and motivation. Here is what standard 2D subtitles look like, or what we are used to seeing when they are presented: in this case on TV, on a 2D display, shown in color, in a box, with a specific kind of font. You've seen all that before, and this kind of thing can show up on most devices these days. This slide is actually taken from, I think, the opera house in Barcelona, where you even have something called surtitles, with the subtitles above the stage, or even in the seat back in front of you. The point is that we need this kind of accessibility in modern extended reality applications, VR being one of them, and augmented reality as well. Yet another form is sign language, which I will quickly skip, because we're not really implementing or looking at it in our particular workgroup.

Instead, we're doing immersive video, which is 360°. This goes back to a BBC project that Chris was involved in, building the ImAc immersive accessibility player, looking at the media there and trying to get subtitles to show up in VR. The ImAc player and recorder were part of that design. It was user-centered back then, with user requirements, and they eventually built the platform, but the trouble is that the development was short-lived, or maybe limited in its extent, because too much was uncovered: too many variables, too many possibilities, and it was really hard to pin down what worked.

Anyway, what we want to do is do this for VR glasses specifically, and one of the concerns is limited resolution. So where do you put the subtitles? Do you put them at the bottom, as on a regular 2D display, or what? You have to remember that the user's head moves around here, so we get a 360° view, which is much different from the standard 2D TV display. Resolution is not so much of an issue, except that you don't want to put the subtitles just anywhere, because the projection would distort the text.
The other issue that showed up, or was uncovered in the ImAc player, is that the person shown here is not the one speaking. In fact, he's just standing there; the actual speaker is off to the side somewhere, where we don't see them at the moment. So how do you indicate where they are? They used an arrow to test this: he's not speaking, the speaker is over on the left, go look there!

Meanwhile, the other technical challenge is how you present video in 3D. You can use various geometric mappings, such as a cube map or an equirectangular projection. The difference between them is the distortion, and the plot below shows how much distortion there is and where. Most of it happens at the poles, as it were; that's why you often see the globe looking stretched. There's a video here, which doesn't want to play in the player, that just shows you what the distortion looks like. If it's done properly, the distortion isn't there: lines look like straight lines, rather than being bent into curves, as you would expect from the image here.

Other mappings include the standard cube map, where the distortions sit around the corners of the cube, and the equi-angular cube map, which seems like one of the better options but is a little more exotic and not quite as prevalent as the equirectangular projection. A lot of 360° cameras use equirectangular, so they give you that pretty much for free. Here you can see what the mapping looks like and where the distortions are; if you stay mostly in the center of the field of view, you're okay, and so it seems to work very well.

Now, user testing for accessibility. The trouble here is, first of all, that people might not be used to the technology. In fact, in some of our testing, which Marta may tell you about, when asked which type of subtitle they wanted, head-locked or fixed, people said, "What's VR?" So that's an interesting technological challenge in itself. Content is super important, because how do you actually get objective results, as opposed to subjective impressions? This is what they learned in the W3C group: they looked at various VR approaches, and here's what they came up with. Lots of things! Are the subtitles fixed in the scene, or head-locked, that is, linked to where your head is pointing?
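As a rough, illustrative sketch of the equirectangular mapping described above (assumed geometry, not code from ImAc or from the speaker's tools), the following Python snippet converts a viewing direction into equirectangular texture coordinates. Directions near the poles collapse onto entire rows of the image, which is exactly where the stretching, and the distorted text, comes from.

```python
import math

def direction_to_equirect(x, y, z):
    """Map a viewing direction to equirectangular (u, v) in [0, 1].

    Longitude (azimuth) spans the image width, latitude its height.
    Near the poles (y close to +1 or -1) one direction corresponds to an
    entire row of texels, which is why text placed there looks stretched.
    """
    r = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(x, -z)          # azimuth, in (-pi, pi]
    lat = math.asin(y / r)           # elevation, in [-pi/2, pi/2]
    u = lon / (2.0 * math.pi) + 0.5
    v = 0.5 - lat / math.pi
    return u, v

# Straight ahead lands in the middle of the frame; a near-vertical direction
# lands at the top edge, where the projection is most distorted.
print(direction_to_equirect(0.0, 0.0, -1.0))     # approximately (0.5, 0.5)
print(direction_to_equirect(0.0, 0.99, -0.14))   # v close to 0 (the pole)
```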
It turns out they didn't get a very good answer to all of this, even though they were able to implement it, and Chris was instrumental in that implementation. So this is a technical challenge. On the right, the little cone is the head direction and head position, and the plane is the part of the video where we can draw subtitles.

In the VR world, in 360° video, we locate elements using polar coordinates: a polar angle and an azimuth. Then, as you do in computer graphics, you build a scene graph. The scene means the world; the world includes a camera, the captions, the video, and the sphere, and the fake camera is where the head is. So how does each element move? Is it fixed, and so forth?

Chris came to this because, I think, they asked him to do it, and he said, "Oh yeah, sure, I'll be happy to do that, it's pretty easy." Then they gave him a video where the speakers were off to the side, in an airplane seat, and right away you could see that things didn't quite work. It wasn't obvious who was speaking when, and so on, so for fixed subtitles he basically had to reinvent them, using something based on particle systems. This is a screenshot from Star Trek II: The Wrath of Khan, the 1982 movie, I think, where each particle has a position inside this flame. Each of these strands in the video is a particle, and each has a position, a velocity (a speed and a direction), a color, an age, and a lifetime. Eventually the particles die and are removed from the scene. There's also transparency and things like that. So Chris thought: why not use the same computer graphics technique to represent subtitles as particles?

And that's what he did. With that you've got an emitter, so you can throw these captions, as he calls them, or subtitles, anywhere you want. Each one has a certain lifetime or lifespan, and each can have its own color, font, shape, box, transparency, whatever. And so here he is with all these options; there's a ton of things to pick from. On the right you can see the menu, which is quite involved: guide modes, responsive captions, time codes, all sorts of things. The trouble is, how do you actually test this? How do you establish what is useful out of this myriad of choices?

Here's what the captions look like, and some of the things that particle systems give you. You can use a physics engine to prevent collisions; you can stack captions so they don't overlap each other; you can use different colors; you can make sure they stay inside the scene. This is Chris's living room, by the way.
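To make the particle idea a bit more concrete, here is a minimal Python sketch (a simplified illustration, not Chris's actual implementation; all names are hypothetical) in which each caption behaves like a particle: it is emitted at a spherical position on the video sphere, given by an azimuth and a polar angle, it ages over time, and it is removed from the scene once its lifetime expires.

```python
import math
from dataclasses import dataclass

@dataclass
class CaptionParticle:
    text: str
    azimuth: float       # radians around the viewer
    polar: float         # radians down from the zenith
    lifetime: float      # seconds before the caption is retired
    color: str = "white"
    age: float = 0.0

    def position(self, radius: float = 10.0):
        """Spherical to Cartesian position on the video sphere."""
        x = radius * math.sin(self.polar) * math.cos(self.azimuth)
        y = radius * math.cos(self.polar)
        z = radius * math.sin(self.polar) * math.sin(self.azimuth)
        return x, y, z

    @property
    def alive(self) -> bool:
        return self.age < self.lifetime

class CaptionEmitter:
    """Spawns caption 'particles' and retires the ones whose lifetime is up."""

    def __init__(self):
        self.captions = []

    def emit(self, text, azimuth, polar, lifetime, color="white"):
        self.captions.append(CaptionParticle(text, azimuth, polar, lifetime, color))

    def update(self, dt: float):
        for caption in self.captions:
            caption.age += dt
        # As in any particle system, dead particles are dropped from the scene.
        self.captions = [c for c in self.captions if c.alive]

emitter = CaptionEmitter()
emitter.emit("Speaker is over on the left", azimuth=-math.pi / 2,
             polar=math.pi / 2, lifetime=3.0)
emitter.update(dt=1.0)    # after one second the caption is still alive
print([c.text for c in emitter.captions])
```

A physics pass over the emitter's live captions could then nudge overlapping ones apart, which is essentially the stacking and collision avoidance mentioned above.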
There he is; you can basically see him in the middle picture, standing up and talking to his kids, along with all the various things you can do to make these captions display in VR. We still don't know, though, which of all these is best. What's the most effective? What's the most accessible?

And so, like Henry Ford's customers, if people were asked what they wanted, they would have said, "Just give me the old-style 2D subtitles, so that I can see them." They would not want all these newfangled things. So here's how we test them in 360°, with the technology that we've conceived of here and that Chris implemented: we use an eye tracker to find out what you're looking at.

The HTC Vive Pro Eye is such an eye-tracking VR headset, and it's the one we use. We've got several of them: Dr. Krejtz has used one in Warsaw, Marta has taken one to Slovenia and Spain, Chris has one in the UK, and I've got one here in the States. We all share the source code, and we all test this and run this, when it works. It does 120 Hz eye tracking; that's the sampling rate. It comes with a software development kit, the SRanipal SDK, which gives us the gaze information as vectors. Here's what it looks like: you get a position for where the gaze ray emanates at each eyeball, the direction, and then the intersection on the screen, if you want it, or you can calculate it. So the geometry is there, and it works fairly well. We get the gaze origin, direction, pupil diameter, eye openness, and whether the eye is visible to the sensor.

We get all of this in Unity; we can output these files and read them back in. Chris has created a system architecture that relies on a threaded recorder: we record the data, both the screen and the eye movement data, save it to a file, and later read it all back in and play it back via a player that he has also developed.

So that is the technology. I think Marta will come on and show you an example from this video that we have, or from several videos. The key concern is that we've got all these options and we don't quite know which is best, so how do you test this? Part of the challenge is the technology itself: there is a learning step, where users have to put the helmet on and learn how to use it. And then there's the eye tracker, which gives us objective evidence of what people are actually doing. Are they actually reading the subtitles or not? This was not known before: people would say, "Oh, sure, I read the subtitles," but we actually have data that shows quite the contrary.
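For the "or you can calculate it" part, here is a minimal sketch, under the assumption that the gaze is reported as an origin and a direction in the headset's coordinate frame, of how the intersection with the video sphere can be computed and converted back to an azimuth and an elevation, i.e. which part of the 360° frame, and therefore which caption, was being looked at. The function and parameter names are illustrative, not the SDK's.

```python
import math

def gaze_on_sphere(origin, direction, radius=10.0):
    """Intersect a gaze ray with a viewer-centred video sphere.

    origin and direction are 3-tuples in headset coordinates; direction need
    not be normalised. Returns (azimuth, elevation) in radians, or None if
    the ray misses the sphere entirely.
    """
    ox, oy, oz = origin
    dx, dy, dz = direction
    norm = math.sqrt(dx * dx + dy * dy + dz * dz)
    dx, dy, dz = dx / norm, dy / norm, dz / norm

    # Solve |origin + t * direction|^2 = radius^2 for the positive root t.
    b = 2.0 * (ox * dx + oy * dy + oz * dz)
    c = ox * ox + oy * oy + oz * oz - radius * radius
    disc = b * b - 4.0 * c
    if disc < 0.0:
        return None
    t = (-b + math.sqrt(disc)) / 2.0
    px, py, pz = ox + t * dx, oy + t * dy, oz + t * dz

    azimuth = math.atan2(px, -pz)        # left/right within the 360 frame
    elevation = math.asin(py / radius)   # up/down
    return azimuth, elevation

# A gaze sample looking slightly to the left of straight ahead:
print(gaze_on_sphere(origin=(0.03, 0.0, 0.0), direction=(-0.2, 0.0, -1.0)))
```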
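And, sketching the threaded recorder idea in the same spirit (a generic illustration rather than the actual Unity architecture; the field names and the CSV format are assumptions), gaze samples are pushed onto a queue by the sampling loop and written to disk by a background thread, so logging at 120 Hz never stalls rendering, and the resulting file can later be read back for playback and analysis.

```python
import csv
import queue
import threading

class GazeRecorder:
    """Background-threaded recorder: the sampling loop only enqueues samples;
    a worker thread drains the queue and writes them out as CSV rows."""

    FIELDS = ["timestamp", "origin_x", "origin_y", "origin_z",
              "dir_x", "dir_y", "dir_z", "pupil_diameter", "eye_openness"]

    def __init__(self, path):
        self._queue = queue.Queue()
        self._file = open(path, "w", newline="")
        self._writer = csv.writer(self._file)
        self._writer.writerow(self.FIELDS)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def record(self, sample):
        """Called once per gaze sample; never blocks on disk I/O."""
        self._queue.put(sample)

    def _drain(self):
        while True:
            sample = self._queue.get()
            if sample is None:                  # sentinel: stop the worker
                return
            self._writer.writerow([sample[name] for name in self.FIELDS])

    def close(self):
        self._queue.put(None)
        self._worker.join()
        self._file.close()

recorder = GazeRecorder("gaze_log.csv")
recorder.record({"timestamp": 0.0, "origin_x": 0.03, "origin_y": 0.0, "origin_z": 0.0,
                 "dir_x": -0.2, "dir_y": 0.0, "dir_z": -1.0,
                 "pupil_diameter": 3.1, "eye_openness": 1.0})
recorder.close()
```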
So I will leave you at that. That is the technology that we've built up, Chris mainly. Marta and Dr. Krejtz have gone out and obtained really interesting data, and I think you'll see more of it during this LEAD-ME COST Action workshop. Thank you so much on behalf of myself, Dr. Krejtz, Marta, and of course the indubitable Chris Hughes in Salford. Thank you, and see you later.