id_notes/John C/1998-08-17_1999-03-17
[idsoftware.com]
Welcome to id Software's Finger Service V1.5!
Name: John Carmack
Email: johnc@idsoftware.com
Description: Programmer
Project: Quake Arena
Last Updated: 03/17/1999 01:53:50 (Central Standard Time)
-------------------------------------------------------------------------------
3/17/99
-------
First impressions of the SGI visual workstation 320:
I placed an order for a loaded system ($11k) from their web site two months
ago. It still hasn't arrived (bad impression), but SGI did bring a loaner
system by for us to work with.
The system tower is better than standard pc fare, but I still think Apple's
new G3 has the best designed computer case.
The wide aspect LCD panel is very nice. A while ago I had been using a dual
monitor LCD setup on a previous intergraph, but the analog syncing LCD
screens had some fringing problems, and the gamma range was fairly different
from a CRT. The SGI display is perfectly crisp (digital interface), and
has great color. The lighting is a little bit uneven top to bottom on this
unit -- I am interested to see how the next unit looks for comparison.
Unfortunately, the card that they bundle with the LCD if you buy it
separately is NOT a good 3D accelerator, so if you care about 3D and want
this LCD screen, you need to buy an sgi visual workstation. Several of the
next generation consumer cards are going to support digital flat panel
outputs, so hopefully soon you will be able to run a TNT2 or something
out to one of these.
The super memory system does not appear to have provided ANY benefit to the
CPU. My memory benchmarking tests showed it running about the same as a
standard intel design.
Our first graphics testing looked very grim -- Quake3 didn't draw the world
at all. I spent a while trying to coax some output by disabling various
things, but to no avail. We reported it to SGI, and they got us a fix the
next day. Some bug with depthRange(). Even with the fix, 16 bit rendering
doesn't seem to work. I expect they will address this.
Other than that, there haven't been any driver anomalies, and both the game
and editor run flawlessly.
For single pass, top quality rendering (32 bit framebuffer, 32 bit depth
buffer, 32 bit trilinear textures, high res screen), the SGI has a higher
fill rate than any other card we have ever tested on a pc, but not by too
wide of a margin.
If your application can take advantage of multitexture, a TNT or rage128
will deliver slightly greater fill performance. It is likely that the next
speed bump of both chips will be just plain faster than the SGI on all
fill modes.
A serious flaw is that the LCD display can't support ANY other resolutions
except the native 1600*1024. The game chunks along at ten to fifteen fps
at that resolution (but it looks cool!). They desperately need to support
a pixel doubled 800*512 mode to make any kind of full screen work
possible. I expect they will address this.
Vsync disable is implemented wrong. Disabling sync causes it to continue
rendering, but the flip still doesn't happen until the next frame. This
gives repeatable (and faster) benchmark numbers, but with a flashing
screen that is unusable. The right way is to just cause the flip to happen
on the next scan line, like several consumer cards do, or blit. It gives
tearing artifacts, but it is still completely usable, and avoids temporal
nyquist issues between 30 and 60 hz. I expect they will address this.
Total throughput for games is only fair, about like an intergraph. Any
of the fast consumer cards will run a quake engine game faster than the
sgi in its current form. I'm sure some effort will be made to improve
this, but I doubt it will be a serious focus, unless some SGI engineers
develop unhealthy quake addictions. :-)
The unified memory system allows nearly a gig of textures, and individual
textures can be up to 4k by 4k. AGP texturing provides some of this
benefit for consumer cards, but not to the same degree or level of
performance.
The video stream support looks good, but I haven't tried using it yet.
Very high interpolator accuracy. All the consumer cards start to break
up a bit with high magnification, weird aspects, or long edges. The
professional cards (intergraph, glint, E&S, SGI) still do a better job.
SGI exports quite a few more useful OpenGL extensions than intergraph
does, but multisample antialiasing (as claimed in their literature)
doesn't seem to be one of them.
Overall, it looks pretty good, and I am probably going to move over to
using the SGI workstation full time when my real system arrives.
I was very happy with the previous two generations of intergraph
workstations, but this last batch (GT1) has been a bunch of lemons, and
the wildcat graphics has been delayed too long. The current realizm-II
systems just don't have enough fill rate for high end development.
For developers that don't have tons of money, the decision right now
is an absolute no-brainer -- buy a TNT and stick it in a cheap system.
It's a better "professional" 3D accelerator than you could buy at any
price not too long ago.
3/3/99
------
On the issue of railgun firing rates -- we played with it for a while at
the slower speed, but it has been put back to exactly Q2's rate of fire.
I do agree with Thresh that the way we had it initially (faster than Q2,
but with the same damage) made it an overpowered weapon in the hands of
highly skilled players, which is exactly what we should try to avoid.
An ideal game should give scores as close to directly proportional to
the players' relative skills as possible. The better player should win
in almost all cases, but the game will be more entertaining if the
inferior players are not completely dominated.
Quake 1 had really bad characteristics that way -- Thresh can play
extremely talented players and often prevent them from scoring a single
point. We wouldn't put up with a conventional sport that commonly
gave scores of 20 to 1 in championship matches, and I don't think
we should encourage it in our games.
Eliminating health items is probably the clearest way to prevent
blowout games, but that has never been popular. Still, we should
try to avoid weapon decisions that allow the hyper-skilled to pull
even farther away from the rest of the crowd. They will still win,
no matter what the weapons are, just not by as wide a margin.
1/29/99
-------
The issue of the 8192 unit map dimension limit had been nagging at me for
a long time now. The reason for the limit is that coordinates are
communicated over the network in 16 bit shorts divided as a sign bit,
twelve unit bits, and three fractional bits. There was also another
side to the representation problem, but it was rarely visible: if
you timescale down, you can actually tell the fractional bit granularity
in position and velocity.
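In code, that 16 bit format works out to a signed 13.3 fixed point value. A sketch (illustrative only, not the actual network code -- the names are made up):

```c
#include <stdint.h>

/* sketch of the 16 bit coordinate format: one sign bit, twelve unit
 * bits, three fractional bits -- signed 13.3 fixed point.  That limits
 * coordinates to +/-4096 units (the 8192 unit map dimension) with a
 * 1/8 unit granularity. */
static int16_t CoordToNet(float f) {
    int fixed = (int)(f * 8.0f);          /* three fractional bits */
    if (fixed > 32767)  fixed = 32767;    /* clamp at the map limit */
    if (fixed < -32768) fixed = -32768;
    return (int16_t)fixed;
}

static float NetToCoord(int16_t net) {
    return net * (1.0f / 8.0f);
}
```

Any position that isn't a multiple of 1/8 gets truncated on the wire, which is exactly the granularity you can see in position and velocity when you timescale down.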
The rest of the system (rendering and gameplay) has never had any issues
with larger maps, even in quake 1. There are some single precision
floating point issues that begin to creep in if things get really huge,
but maps should be able to cover many times the current limit without
any other changes.
A while ago I had changed Q3 so that the number of fractional bits was
a compile time option, which allowed you to trade off fine grain precision
for larger size. I was considering automatically optimizing this for each
level based on its size, but it still didn't feel like a great solution.
Another aspect of the problem that wasn't visible to the public was that
the fractional quantization of position could cause the position to
actually be inside a nearby solid when used for client side prediction.
The code had to check for this and try to correct the situation by jittering
the position in each of the possible directions it might have been
truncated from. This is a potential issue whenever there is any loss of
precision whatsoever in the server to client communication.
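The jitter correction amounted to something like this sketch (PointIsSolid here is a stand-in for the real collision query so the idea is self-contained, and the search is simplified -- the actual code also had to try corner combinations):

```c
/* stand-in for the engine's point-contents query: for this sketch,
 * anything with x < 0 counts as solid */
static int PointIsSolid(const float p[3]) {
    return p[0] < 0.0f;
}

#define QUANT (1.0f / 8.0f)   /* one fractional-bit quantization step */

/* if the quantized position landed inside a solid, nudge it by one
 * quantization step along each axis until a free spot is found */
static int FixStuckPosition(float pos[3]) {
    if (!PointIsSolid(pos))
        return 1;
    for (int axis = 0; axis < 3; axis++) {
        for (int sign = -1; sign <= 1; sign += 2) {
            float test[3] = { pos[0], pos[1], pos[2] };
            test[axis] += sign * QUANT;
            if (!PointIsSolid(test)) {
                pos[0] = test[0]; pos[1] = test[1]; pos[2] = test[2];
                return 1;
            }
        }
    }
    return 0;   /* give up -- the real code tried more directions */
}
```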
The obvious solution is to just send the full floating point value for
positions, but I resisted that because the majority of our network
traffic is positional updates, and I didn't want to bloat it. There have
been other bandwidth savings in Q3, and LANs and LPB connections are also
relevant, so I was constantly evaluating the tradeoff.
Dealing with four or five players in view isn't a real problem. The big
bandwidth issues arrive when multiple players start unloading with rapid
fire weapons. (as an aside, I tried making 5hz fire weapons for Q3 to save
bandwidth, but no matter how much damage they did, 5hz fire rates just
seemed to feel slow and weak...)
I finally moved to a bit-level stream encoding to save some more bandwidth
and give me some more representational flexibility, and this got me thinking
about the characteristics of the data that bother us.
In general, the floating point coordinates have significant bits all through
the mantissa. Any movement along an angle will more or less randomize the
low order bits.
My little insight was that because missiles are evaluated
parametrically instead of iteratively in Q3, a one-time snapping of the
coordinates can be performed at their generation time, giving them fixed
values with less significant bits for their lifetime without any effort
outside their spawning function. It also works for doors and plats, which
are also parametrically represented now. Most events will also have
integral coordinates.
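A sketch of the parametric point (the struct layout and the snap granularity are illustrative, not the actual game code): because position is a pure function of the spawn state, snapping once at spawn keeps every later coordinate cheap to encode.

```c
/* illustrative missile state: snapped once at spawn, then every
 * position is derived parametrically rather than integrated frame
 * by frame */
typedef struct {
    float origin[3];     /* snapped to integral units at spawn */
    float velocity[3];
    float spawnTime;     /* seconds */
} missile_t;

static float SnapToInt(float f) {
    return (float)(int)f;   /* truncate toward zero, for brevity */
}

static void SpawnMissile(missile_t *m, const float org[3],
                         const float vel[3], float time) {
    for (int i = 0; i < 3; i++) {
        m->origin[i]   = SnapToInt(org[i]);
        m->velocity[i] = SnapToInt(vel[i]);
    }
    m->spawnTime = time;
}

/* position at any time is a pure function of the spawn state */
static void MissileOrigin(const missile_t *m, float time, float out[3]) {
    float dt = time - m->spawnTime;
    for (int i = 0; i < 3; i++)
        out[i] = m->origin[i] + m->velocity[i] * dt;
}
```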
The float encoder can check for an integral value in a certain range and
send that as a smaller number of bits, say 13 or so. If the value isn't
integral, it will be transmitted as a full 32 bit float.
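The encoder check looks roughly like this (a sketch against a word-sized buffer rather than a real bit stream, with a 13 bit integral range assumed -- not the actual Q3 encoder):

```c
#include <stdint.h>
#include <string.h>

#define FLOAT_INT_BITS 13
#define FLOAT_INT_BIAS (1 << (FLOAT_INT_BITS - 1))   /* 4096 */

/* a leading flag bit says whether the value was integral and small
 * enough for 13 bits; otherwise the raw 32 bit float pattern follows.
 * Returns the number of bits "written". */
static int EncodeFloatBits(float f, uint32_t out[2]) {
    int i = (int)f;
    if ((float)i == f && i >= -FLOAT_INT_BIAS && i < FLOAT_INT_BIAS) {
        out[0] = 0;                               /* flag: integral */
        out[1] = (uint32_t)(i + FLOAT_INT_BIAS);  /* biased 13 bit int */
        return 1 + FLOAT_INT_BITS;                /* 14 bits total */
    }
    out[0] = 1;                                   /* flag: full float */
    memcpy(&out[1], &f, sizeof(f));               /* raw IEEE bits */
    return 1 + 32;
}

static float DecodeFloatBits(const uint32_t in[2]) {
    if (in[0] == 0)
        return (float)((int)in[1] - FLOAT_INT_BIAS);
    float f;
    memcpy(&f, &in[1], sizeof(f));
    return f;
}
```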
The other thing I am investigating is sub-byte delta encoding of floating
point values. Even with arbitrary precision movement deltas, the sign and
exponent bits change with very low frequency except when you are very
near the origin. At the minimum, I should be able to cut the standard
player coordinate delta reps to three bytes from four.
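The observation behind that: XOR the old and new IEEE bit patterns and the sign and exponent byte usually comes out zero, so only the low three bytes need to go on the wire. A sketch of the size decision (the masks and byte counts are illustrative):

```c
#include <stdint.h>
#include <string.h>

/* decide how many bytes a float delta needs: if the sign bit and the
 * high exponent bits match the previous value, the top byte can be
 * elided and three bytes suffice */
static int DeltaBytesNeeded(float oldf, float newf) {
    uint32_t a, b;
    memcpy(&a, &oldf, 4);
    memcpy(&b, &newf, 4);
    return ((a ^ b) & 0xFF000000u) == 0 ? 3 : 4;
}
```

The exception is movement near the origin, where the exponent swings wildly as the magnitude crosses powers of two.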
So, the bottom line is that the bandwidth won't move much (it might even
go down if I cut the integral bits below 15), the maps become unbounded
in size to the point of single precision roundoff, and the client doesn't
have to care about position jittering (which was visible in Q3 code that
will be released).
1/10/99
-------
Ok, many of you have probably heard that I spoke at the macworld
keynote on tuesday. Some information is probably going to get
distorted in the spinning and retelling, so here is an info
dump straight from me:
Q3test, and later the full commercial Quake3: Arena, will be simultaneously
released on windows, mac, and linux platforms.
I think Apple is doing a lot of things right. A lot of what they are
doing now is catch-up to wintel, but if they can keep it up for the next
year, they may start making a really significant impact.
I still can't give the mac an enthusiastic recommendation for sophisticated
users right now because of the operating system issues, but they are working
towards correcting that with MacOS X.
The scoop on the new G3 mac hardware:
Basically, it's a great system, but Apple has oversold its
performance relative to intel systems. In terms of timedemo scores,
the new G3 systems should be near the head of the pack, but there
will be intel systems outperforming them to some degree. The mac has
not instantly become a "better" platform for games than wintel, it
has just made a giant leap from the back of the pack to near the
front.
I wish Apple would stop quoting "Bytemarks". I need to actually
look at the contents of that benchmark and see how it can be so
misleading. It is pretty funny listening to mac evangelist types
try to say that an iMac is faster than a pentium II-400. Nope.
Not even close.
From all of my tests and experiments, the new mac systems are
basically as fast as the latest pentium II systems for general
cpu and memory performance. This is plenty good, but it doesn't
make the intel processors look like slugs.
Sure, an in-cache, single precision, multiply-accumulate loop could
run twice as fast as a pentium II of the same clock rate, but
conversely, a double precision add loop would run twice as fast
on the pentium II.
Spec95 is a set of valid benchmarks in my opinion, and I doubt the
PPC systems significantly (if at all) outperform the intel systems.
The IO system gets mixed marks. The 66 mhz video slot is a good step
up from 33 mhz pci in previous products, but that's still half the
bandwidth of AGP 2X, and it can't texture from main memory. This
will have a small effect on 3D gaming, but not enough to push it
out of its class.
The 64 bit pci slots are a good thing for network and storage cards,
but the memory controller doesn't come close to getting peak
utilization out of it. Better than normal pci, though.
The video card is almost exactly what you will be able to get on
the pc side: a 16 mb rage-128. Running on a 66mhz pci bus, its
theoretical peak performance will be midway between the pci and
agp models on pc systems for command traffic limited scenes. Note
that current games are not actually command traffic limited, so the
effect will be significantly smaller. The fill rates will be identical.
The early systems are running the card at 75 mhz, which does put
it at a slight disadvantage to the TNT, but faster versions are
expected later. As far as I can tell, the rage-128 is as perfect
as the TNT feature-wise. The 32 mb option is a feature ATI can
hold over TNT.
Firewire is cool.
It's a simple thing, but the aspect of the new G3 systems that
struck me the most was the new case design. Not the flashy plastic
exterior, but the functional structure of it. The side of the
system just pops open, even with the power on, and lays the
motherboard and cards down flat while the disks and power supply
stay in the enclosure. It really is a great design, and the benefits
were driven home yesterday when I had to scavenge some ram out of
old wintel systems -- most case designs suck really bad.
---
I could gripe a bit about the story of our (lack of) involvement
with Apple over the last four years or so, but I'm calling that
water under the bridge now.
After all the past fiascos, I had been basically planning on ignoring Apple
until MacOS X (rhapsody) shipped, which would then turn it into a platform
that I was actually interested in.
Recently, Apple made a strategic corporate decision that games were a
fundamental part of a consumer oriented product line (duh). To help that
out, Apple began an evaluation of what it needed to do to help game
developers.
My first thought was "throw out MacOS", but they are already in the process
of doing that, and it's just not going to be completed overnight.
Apple has decent APIs for 2D graphics, input, sound, and networking,
but they didn't have a solid 3D graphics strategy.
Rave was sort of ok. Underspecified and with no growth path, but
sort of ok. Pursuing a proprietary api that wasn't competitive with
other offerings would have been a Very Bad Idea. They could have tried
to make it better, or even invent a brand new api, but Apple doesn't have
much credibility in 3D programming.
For a while, it was looking like Apple might do something stupid,
like license DirectX from microsoft and be put into a guaranteed
trailing edge position behind wintel.
OpenGL was an obvious direction, but there were significant issues with
the licensing and implementation that would need to be resolved.
I spent a day at apple with the various engineering teams and executives,
laying out all the issues.
The first meeting didn't seem like it went all that well, and there wasn't
a clear direction visible for the next two months. Finally, I got the all
clear signal that OpenGL was a go and that apple would be getting both the
sgi codebase and the conix codebase and team (the best possible arrangement).
So, I got a mac and started developing on it. My first weekend of
effort had QuakeArena limping along while held together with duct
tape, but weekend number two had it properly playable, and weekend
number three had it brought up to full feature compatibility. I
still need to do some platform specific things with odd configurations
like multi monitor and addon controllers, but basically now it's
just a matter of compiling on the mac to bring it up to date.
This was important to me, because I felt that Quake2 had slipped a bit in
portability because it had been natively developed on windows. I like the
discipline of simultaneous portable development.
After 150 hours or so of concentrated mac development, I learned a
lot about the platform.
CodeWarrior is pretty good. I was comfortable developing there
almost immediately. I would definitely say VC++ 6.0 is a more powerful
overall tool, but CW has some nice little aspects that I like. I
am definitely looking forward to CW for linux. Unix purists may
be aghast, but I have always liked gui dev environments more than
a bunch of xterms running vi and gdb.
The hardware (even the previous generation stuff) is pretty good.
The OpenGL performance is pretty good. There is a lot of work
underway to bring the OpenGL performance to the head of the pack,
but the existing stuff works fine for development.
The low level operating system SUCKS SO BAD it is hard to believe.
The first order problem is lack of memory management / protection.
It took me a while to figure out that the zen of mac development is
"be at peace while rebooting". I rebooted my mac system more times
the first weekend than I have rebooted all the WinNT systems I
have ever owned. True, it has gotten better now that I know my
way around a bit more, and the codebase is fully stable, but there
is just no excuse for an operating system in this day and age to
act like it doesn't have access to memory protection.
The first thing that bit me was the static memory allocation for
the program. Modern operating systems just figure out how much
memory you need, but because the mac was originally designed for
systems without memory management, significant things have to be
laid out ahead of time.
Porting a win32 game to the mac will probably involve more work
dealing with memory than any other aspect. Graphics, sound, and
networking have reasonable analogues, but you just can't rely
on being able to malloc() whatever you want on the mac.
Sure, game developers can manage their own memory, but an operating
system that has proper virtual memory will let you develop
a lot faster.
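Managing your own memory in that style usually means grabbing one big block up front and handing out pieces linearly, the way the quake engines do with their hunk allocators. A sketch (names and sizes illustrative, not engine code):

```c
#include <assert.h>
#include <stddef.h>

#define HUNK_SIZE (8 * 1024 * 1024)   /* one fixed block, sized up front */

static unsigned char hunk[HUNK_SIZE];
static size_t hunkUsed;

/* linear allocation out of the preallocated block -- no free list,
 * no fragmentation, and a hard failure if the budget is blown */
static void *Hunk_Alloc(size_t size) {
    size = (size + 15) & ~(size_t)15;      /* 16 byte alignment */
    assert(hunkUsed + size <= HUNK_SIZE);
    void *p = hunk + hunkUsed;
    hunkUsed += size;
    return p;
}

static void Hunk_FreeAll(void) {           /* e.g. between levels */
    hunkUsed = 0;
}
```

This works fine once it's in place, but it's exactly the kind of budgeting a proper virtual memory system lets you skip during development.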
The lack of memory protection is the worst aspect of mac development.
You can just merrily write all over other programs, the development
environment, and the operating system from any application.
I remember that. From dos 3.3 in 1990.
Guard pages will help catch simple overruns, but it won't do anything
for all sorts of other problems.
The second order problem is lack of preemptive multitasking.
The general responsiveness while working with multiple apps
is significantly worse than windows, and you often run into
completely modal dialogs that don't let you do anything else at all.
A third order problem is that a lot of the interfaces are fairly
clunky.
There are still many aspects of the mac that clearly show design
decisions based on a 128k 68000 based machine. Wintel has grown
a lot more than the mac platform did. It may have been because the
intel architecture didn't evolve gracefully and that forced the
software to reevaluate itself more fully, or it may just be that
microsoft pushed harder.
Carbon sanitizes the worst of the crap, but it doesn't turn it
into anything particularly good looking.
MacOS X nails all these problems, but that's still a ways away.
I did figure one thing out -- I was always a little curious why
the early BeOS advocates were so enthusiastic. Coming from a
NEXTSTEP background, BeOS looked to me like a fairly interesting
little system, but nothing special. To a mac developer, it must
have looked like the promised land...
12/30/98
--------
I got several vague comments about being able to read "stuff" from shared
memory, but no concrete examples of security problems.
However, Gregory Maxwell pointed out that it wouldn't work cross platform
with 64 bit pointer environments like linux alpha. That is a killer, so
I will be forced to do everything the hard way. It's probably for the
best, from a design standpoint anyway, but it will take a little more effort.
12/29/98
--------
I am considering taking a shortcut with my virtual machine implementation
that would make the integration a bit easier, but I'm not sure that it
doesn't compromise the integrity of the base system.
I am considering allowing the interpreted code to live in the global address
space, instead of a private 0 based address space of its own. Store
instructions from the VM would be confined to the interpreter's address
space, but loads could access any structures.
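The asymmetry can be expressed very cheaply in the interpreter's memory operations. A sketch (the block size and names are made up for illustration):

```c
#include <stdint.h>
#include <string.h>

#define VM_SIZE (1 << 20)          /* power of two so masking works */
static uint8_t vmData[VM_SIZE];    /* the interpreter's private block */

/* stores are masked into the VM's own address space, so a buggy or
 * hostile module can never scribble on the host */
static void VM_Store32(uint32_t addr, uint32_t value) {
    memcpy(&vmData[addr & (VM_SIZE - 4)], &value, 4);
}

/* loads are unrestricted: the VM may follow pointers returned by
 * system calls straight into the main game's structures */
static uint32_t VM_Load32(const void *hostAddr) {
    uint32_t value;
    memcpy(&value, hostAddr, 4);
    return value;
}
```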
On the positive side:
This would allow full speed (well, full interpreted speed) access to variables
shared between the main code and the interpreted modules. This allows system
calls to return pointers, instead of filling in allocated space in the
interpreter's address space.
For most things, this is just a convenience that will cut some development
time. Most of the shared accesses could be recast as "get" system calls,
and it is certainly arguable that that would be a more robust programming
style.
The most prevalent change this would prevent is all the cvar_t uses. Things
could stay in the same style as Q2, where cvar accesses are free and
transparently updated. If the interpreter lives only in its own address
space, then cvar access would have to be like Q1, where looking up a
variable is a potentially time consuming operation, and you wind up adding
lots of little cvar caches that are updated every frame or restart.
On the negative side:
A client game module with a bug could cause a bus error, which would not be
possible with a pure local address space interpreter.
I can't think of any exploitable security problems that read only access to
the entire address space opens, but if anyone thinks of something, let me
know.
11/4/98
-------
More extensive comments on the interpreted-C decision later, but a quick
note: the plan is to still allow binary dll loading so debuggers can be
used, but it should be interchangeable with the interpreted code. Client
modules can only be debugged if the server is set to allow cheating, but
it would be possible to just use the binary interface for server modules
if you wanted to sacrifice portability. Most mods will be able to be
implemented with just the interpreter, but some mods that want to do
extensive file access or out of band network communications could still
be implemented just as they are in Q2. I will not endorse any use of
binary client modules, though.
11/3/98
-------
This was the most significant thing I talked about at The Frag, so here it
is for everyone else.
The way the QA game architecture has been developed so far has been as two
separate binary dll's: one for the server side game logic, and one for the
client side presentation logic.
While it was easiest to begin development like that, there are two crucial
problems with shipping the game that way: security and portability.
It's one thing to ask the people who run dedicated servers to make informed
decisions about the safety of a given mod, but it's a completely different
matter to auto-download a binary image to a first time user connecting to a
server they found.
The quake 2 server crashing attacks have certainly proven that there are
hackers that enjoy attacking games, and shipping around binary code would
be a very tempting opening for them to do some very nasty things.
With quake and Quake 2, all game modifications were strictly server side,
so any port of the game could connect to any server without problems.
With Quake 2's binary server dll's not all ports could necessarily run a
server, but they could all play.
With significant chunks of code now running on the client side, if we stuck
with binary dll's then the less popular systems would find that they could
not connect to new servers because the mod code hadn't been ported. I
considered having things set up in such a way that client game dll's could
be sort of forward-compatible, where they could always connect and play,
but new commands and entity types just might not show up. We could also
GPL the game code to force mod authors to release source with the binaries,
but that would still be inconvenient to deal with all the porting.
Related to both issues is client side cheating. Certain cheats are easy to do
if you can hack the code, so the server will need to verify which code the
client is running. With multiple ported versions, it wouldn't be possible
to do any binary verification.
If we were willing to wed ourselves completely to the windows platform, we
might have pushed ahead with some attempt at binary verification of dlls,
but I ruled that option out. I want QuakeArena running on every platform
that has hardware accelerated OpenGL and an internet connection.
The only real solution to these problems is to use an interpreted language
like Quake 1 did. I have reached the conclusion that the benefits of a
standard language outweigh the benefits of a custom language for our
purposes. I would not go back and extend QC, because that stretches the
effort from simply system and interpreter design to include language design,
and there is already plenty to do.
I had been working under the assumption that Java was the right way to go,
but recently I reached a better conclusion.
The programming language for QuakeArena mods is interpreted ANSI C. (well,
I am dropping the double data type, but otherwise it should be pretty
conformant)
The game will have an interpreter for a virtual RISC-like CPU. This should
have a minor speed benefit over a byte-coded, stack based java interpreter.
Loads and stores are confined to a preset block of memory, and access to all
external system facilities is done with system traps to the main game code,
so it is completely secure.
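The trap mechanism can be as simple as a switch in the main game that the interpreter calls into. A sketch (trap numbers and arguments are made up for illustration; the real interface differs):

```c
#include <stdint.h>

/* traps are negative so they can't collide with ordinary VM code
 * addresses */
enum { TRAP_PRINT = -1, TRAP_MILLISECONDS = -2 };

static int currentTime;   /* stand-in for the real timer */

/* every external facility the module touches funnels through here,
 * where the host can validate everything before acting */
static intptr_t VM_SystemCall(intptr_t trap, intptr_t arg0) {
    switch (trap) {
    case TRAP_PRINT:
        /* arg0 would be an offset into the VM's data block, which
         * the host bounds-checks before reading the string */
        (void)arg0;
        return 0;
    case TRAP_MILLISECONDS:
        return currentTime;
    default:
        return -1;        /* unknown trap: refuse, don't crash */
    }
}
```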
The tools necessary for building mods will all be freely available: a
modified version of LCC and a new program called q3asm. LCC is a wonderful
project -- a cross platform, cross compiling ANSI C compiler done in under
20K lines of code. Anyone interested in compilers should pick up a copy of
"A retargetable C compiler: design and implementation" by Fraser and Hanson.
You can't link against any libraries, so every function must be resolved.
Things like strcmp, memcpy, rand, etc. must all be implemented directly. I
have code for all the ones I use, but some people may have to modify their
coding styles or provide implementations for other functions.
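The replacements are mostly textbook one-liners. Generic freestanding versions (these are illustrative, not id's implementations):

```c
/* minimal strcmp: compare until mismatch or terminator */
static int my_strcmp(const char *a, const char *b) {
    while (*a && *a == *b) { a++; b++; }
    return (unsigned char)*a - (unsigned char)*b;
}

/* minimal memcpy: byte-at-a-time copy */
static void *my_memcpy(void *dst, const void *src, unsigned n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--) *d++ = *s++;
    return dst;
}

/* a tiny deterministic rand(), since the module can't link against
 * the host's C library */
static unsigned long randSeed = 1;
static int my_rand(void) {
    randSeed = randSeed * 1103515245 + 12345;
    return (int)((randSeed >> 16) & 0x7FFF);
}
```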
It is a fair amount of work to restructure all the interfaces to not share
pointers between the system and the games, but it is a whole lot easier
than porting everything to a new language. The client game code is about
10k lines, and the server game code is about 20k lines.
The drawback is performance. It will probably perform somewhat like QC.
Most of the heavy lifting is still done in the builtin functions for path
tracing and world sampling, but you could still hurt yourself by looping
over tons of objects every frame. Yes, this does mean more load on servers,
but I am making some improvements in other parts that I hope will balance
things to about the way Q2 was on previous generation hardware.
There is also the amusing avenue of writing hand tuned virtual
assembly language for critical functions...
I think this is The Right Thing.
10/14/98
--------
It has been difficult to write .plan updates lately. Every time I start
writing something, I realize that I'm not going to be able to cover it
satisfactorily in the time I can spend on it. I have found that terse
little comments either get misinterpreted, or I get deluged by email
from people wanting me to expand upon it.
I wanted to do a .plan about my evolving thoughts on code quality
and lessons learned through quake and quake 2, but in the interest
of actually completing an update, I decided to focus on one change
that was intended to just clean things up, but had a surprising
number of positive side effects.
Since DOOM, our games have been defined with portability in mind.
Porting to a new platform involves having a way to display output,
and having the platform tell you about the various relevant inputs.
There are four principle inputs to a game: keystrokes, mouse moves,
network packets, and time. (If you don't consider time an input
value, think about it until you do -- it is an important concept)
These inputs were taken in separate places, as seemed logical at the
time. A function named Sys_SendKeyEvents() was called once a
frame that would rummage through whatever it needed to on a
system level, and call back into game functions like Key_Event( key,
down ) and IN_MouseMoved( dx, dy ). The network system
dropped into system specific code to check for the arrival of packets.
Calls to Sys_Milliseconds() were littered all over the code for
various reasons.
I felt that I had slipped a bit on the portability front with Q2 because
I had been developing natively on windows NT instead of cross
developing from NEXTSTEP, so I was reevaluating all of the system
interfaces for Q3.
I settled on combining all forms of input into a single system event
queue, similar to the windows message queue. My original intention
was to just rigorously define where certain functions were called and
cut down the number of required system entry points, but it turned
out to have much stronger benefits.
With all events coming through one point (the return values from
system calls, including the filesystem contents, are "hidden" inputs
that I make no attempt at capturing), it was easy to set up a
journalling system that recorded everything the game received. This
is very different than demo recording, which just simulates a network
level connection and lets time move at its own rate. Realtime
applications have a number of unique development difficulties
because of the interaction of time with inputs and outputs.
Transient flaw debugging. If a bug can be reproduced, it can be
fixed. The nasty bugs are the ones that only happen every once in a
while after playing randomly, like occasionally getting stuck on a
corner. Often when you break in and investigate it, you find that
something important happened the frame before the event, and you
have no way of backing up. Even worse are realtime smoothness
issues -- was that jerk of his arm a bad animation frame, a network
interpolation error, or my imagination?
Accurate profiling. Using an intrusive profiler on Q2 doesn't give
accurate results because of the realtime nature of the simulation. If
the program is running half as fast as normal due to the
instrumentation, it has to do twice as much server simulation as it
would if it wasn't instrumented, which also goes slower, which
compounds the problem. Aggressive instrumentation can slow it
down to the point of being completely unplayable.
Realistic bounds checker runs. Bounds checker is a great tool, but
you just can't interact with a game built for final checking, it's just
waaaaay too slow. You can let a demo loop play back overnight, but
that doesn't exercise any of the server or networking code.
The key point: Journaling of time along with other inputs turns a
realtime application into a batch process, with all the attendant
benefits for quality control and debugging. These problems, and
many more, just go away. With a full input trace, you can accurately
restart the session and play back to any point (conditional
breakpoint on a frame number), or let a session play back at an
arbitrarily degraded speed, but cover exactly the same code paths.
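The shape of the thing is simple once you see it: every input, time included, becomes a record in one stream, and replaying the stream makes the whole game a deterministic batch process. A sketch (the struct and names are illustrative, not Q3's actual event code, and the journal here is an in-memory array rather than a file):

```c
#include <string.h>

typedef enum { EV_NONE, EV_KEY, EV_MOUSE, EV_PACKET } evType_t;

typedef struct {
    int      time;      /* time is an input too, so it is journaled */
    evType_t type;
    int      value, value2;
} event_t;

#define JOURNAL_MAX 4096
static event_t journal[JOURNAL_MAX];
static int journalCount, replayIndex;

/* record mode: the system layer logs every real input as it arrives */
static void Journal_Record(event_t ev) {
    if (journalCount < JOURNAL_MAX)
        journal[journalCount++] = ev;
}

/* replay mode: the game pulls events from here instead of the OS,
 * covering exactly the same code paths at any speed */
static event_t Journal_NextEvent(void) {
    if (replayIndex < journalCount)
        return journal[replayIndex++];
    event_t none;
    memset(&none, 0, sizeof(none));
    return none;    /* EV_NONE: the journal is exhausted */
}
```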
I'm sure lots of people realize that immediately, but it only truly sunk
in for me recently. In thinking back over the years, I can see myself
feeling around the problem, implementing partial journaling of
network packets, and including the "fixedtime" cvar to eliminate most
timing reproducibility issues, but I never hit on the proper global
solution. I had always associated journaling with turning an
interactive application into a batch application, but I never
considered the small modification necessary to make it applicable to
a realtime application.
In fact, I was probably blinded to the obvious because of one of my
very first successes: one of the important technical achievements
of Commander Keen 1 was that, unlike most games of the day, it
adapted its play rate based on the frame speed (remember all those
old games that got unplayable when you got a faster computer?). I
had just resigned myself to the non-deterministic timing of frames
that resulted from adaptive simulation rates, and that probably
influenced my perspective on it all the way until this project.
It's nice to see a problem clearly in its entirety for the first time, and
know exactly how to address it.
9/10/98
-------
I recently set out to start implementing the dual-processor acceleration
for QA, which I have been planning for a while. The idea is to have one
processor doing all the game processing, database traversal, and lighting,
while the other processor does absolutely nothing but issue OpenGL calls.
This effectively treats the second processor as a dedicated geometry
accelerator for the 3D card. This can only improve performance if the
card isn't the bottleneck, but voodoo2 and TNT cards aren't hitting their
limits at 640*480 on even very fast processors right now.
For single player games where there is a lot of cpu time spent running the
server, there could conceivably be up to an 80% speed improvement, but for
network games and timedemos a more realistic goal is a 40% or so speed
increase. I will be very satisfied if I can make a dual pentium-pro 200
system perform like a pII-300.
I started on the specialized code in the renderer, but it struck me that
it might be possible to implement SMP acceleration with a generic OpenGL
driver, which would allow Quake2 / sin / halflife to take advantage of it
well before QuakeArena ships.
It took a day of hacking to get the basic framework set up: an smpgl.dll
that spawns another thread that loads the original opengl32.dll or
3dfxgl.dll, and watches a work queue for all the functions to call.
I get it basically working, then start doing some timings. It's 20%
slower than the single processor version.
I go in and optimize all the queueing and working functions, tune the
communications facilities, check for SMP cache collisions, etc.
After a day of optimizing, I finally squeak out some performance gains on
my tests, but they aren't very impressive: 3% to 15% on one test scene,
but still slower on the other one.
This was fairly depressing. I had always been able to get pretty much
linear speedups out of the multithreaded utilities I wrote, even up to
sixteen processors. The difference is that the utilities just split up
the work ahead of time, then don't talk to each other until they are done,
while here the two threads work in a high bandwidth producer / consumer
relationship.
I finally got around to timing the actual communication overhead, and I was
appalled: it was taking 12 msec to fill the queue, and 17 msec to read it out
on a single frame, even with nothing else going on. I'm surprised things
got faster at all with that much overhead.
The test scene I was using created about 1.5 megs of data to relay all the
function calls and vertex data for a frame. That data had to go to main
memory from one processor, then back out of main memory to the other.
Admittedly, it is a bitch of a scene, but that is where you want the
acceleration...
The write times could be made over twice as fast if I could turn on the
PII's write combining feature on a range of memory, but the reads (which
were the gating factor) can't really be helped much.
Streaming large amounts of data to and from main memory can be really grim.
The next write may force a cache writeback to make room for it, then the
read from memory to fill the cacheline (even if you are going to write over
the entire thing), then eventually the writeback from the cache to main
memory where you wanted it in the first place. You also tend to eat one
more read when your program wants to use the original data that got evicted
at the start.
What is really needed for this type of interface is a streaming read cache
protocol that performs similarly to the write combining: three dedicated
cachelines that let you read or write from a range without evicting other
things from the cache, and automatically prefetching the next cacheline as
you read.
Intel's write combining modes work great, but they can't be set directly
from user mode. All drivers that fill DMA buffers (like OpenGL ICDs...)
should definitely be using them, though.
Prefetch instructions can help with the stalls, but they still don't prevent
all the wasted cache evictions.
It might be possible to avoid main memory altogether by arranging things
so that the sending processor ping-pongs between buffers that fit in L2,
but I'm not sure if a cache coherent read on PIIs just goes from one L2
to the other, or if it becomes a forced memory transaction (or worse, two
memory transactions). It would also limit the maximum amount of overlap
in some situations. You would also get cache invalidation bus traffic.
I could probably trim 30% of my data by going to a byte level encoding of
all the function calls, instead of the explicit function pointer / parameter
count / all-parms-are-32-bits that I have now, but half of the data is just
raw vertex data, which isn't going to shrink unless I did evil things like
quantize floats to shorts.
Too much effort for what looks like a relatively minor speedup. I'm giving
up on this approach, and going back to explicit threading in the renderer so
I can make most of the communicated data implicit.
Oh well. It was amusing work, and I learned a few things along the way.
9/7/98
------
I just got a production TNT board installed in my Dolch today.
The riva-128 was a troublesome part. It scored well on benchmarks, but it had
some pretty broken aspects to it, and I never recommended it (you are better
off with an intel I740).
There aren't any troublesome aspects to TNT. It's just great. Good work, Nvidia.
In terms of raw speed, a 16 bit color multitexture app (like quake / quake 2)
should still run a bit faster on a voodoo2, and an SLI voodoo2 should be faster
for all 16 bit color rendering, but TNT has a lot of other things going for it:
32 bit color and 24 bit z buffers. They cost speed, but it is usually a better
quality tradeoff to go one resolution lower but with twice the color depth.
More flexible multitexture combine modes. Voodoo can use its multitexture for
diffuse lightmaps, but not for the specular lightmaps we offer in QuakeArena.
If you want shiny surfaces, voodoo winds up leaving half of its texturing
power unused (you can still run with diffuse lightmaps for max speed).
Stencil buffers. There aren't any apps that use it yet, but stencil allows
you to do a lot of neat tricks.
More texture memory. Even more than it seems (16 vs 8 or 12), because all of the
TNT's memory can be used without restrictions. Texture swapping is the voodoo's
biggest problem.
3D in desktop applications. There is enough memory that you don't have to worry
about window and desktop size limits, even at 1280*1024 true color resolution.
Better OpenGL ICD. 3dfx will probably do something about that, though.
This is the shape of 3D boards to come. Professional graphics level
rendering quality with great performance at a consumer price.
We will be releasing preliminary QuakeArena benchmarks on all the new boards
in a few weeks. Quake 2 is still a very good benchmark for moderate polygon
counts, so our test scenes for QA involve very high polygon counts, which
stresses driver quality a lot more. There are a few surprises in the current
timings...
---
A few of us took a couple days off in Vegas this weekend. After about
ten hours at the tables over Friday and Saturday, I got a tap on the shoulder...
Three men in dark suits introduced themselves and explained that I was welcome
to play any other game in the casino, but I am not allowed to play
blackjack anymore.
Ah well, I guess my blackjack days are over. I was actually down a bit for
the day when they booted me, but I made +$32k over five trips to Vegas in the
past two years or so.
I knew I would get kicked out sooner or later, because I don't play "safely".
I sit at the same table for several hours, and I range my bets around 10 to 1.
8/17/98
-------
I added support for HDTV style wide screen displays in QuakeArena, so
24" and 28" monitors can now cover the entire screen with game graphics.
On a normal 4:3 aspect ratio screen, a 90 degree horizontal field of view
gives a 75 degree vertical field of view. If you keep the vertical fov
constant and run on a wide screen, you get a 106 degree horizontal fov.
Because we specify fov with the horizontal measurement, you need to change
fov when going into or out of a wide screen mode. I am considering changing
fov to be the vertical measurement, but it would probably cause a lot of
confusion if "fov 90" becomes a big fisheye.
Many video card drivers are supporting the ultra high res settings
like 1920 * 1080, but hopefully they will also add support for lower
settings that can be good for games, like 856 * 480.
---
I spent a day out at Apple last week going over technical issues.
I'm feeling a lot better about MacOS X. Almost everything I like about
Rhapsody will be there, plus some solid additions.
I presented the OpenGL case directly to Steve Jobs as strongly as possible.
If Apple embraces OpenGL, I will be strongly behind them. I like OpenGL more
than I dislike MacOS. :)
---
Last Friday I got a phone call: "want to make some exhibition runs at the
import / domestic drag wars this Sunday?". It wasn't particularly good
timing, because the TR had a slipping clutch and the F50 still hasn't gotten
its computer mapping sorted out, but we got everything functional in time.
The tech inspector said that my cars weren't allowed to run in the 11s
at the event because they didn't have roll cages, so I was supposed to go
easy.
The TR wasn't running its best, only doing low 130 mph runs. The F50 was
making its first sorting out passes at the event, but it was doing ok. My
last pass was an 11.8 (oops) @ 128, but we still have a ways to go to get the
best times out of it.
I'm getting some racing tires on the F50 before I go back. It sucked watching
a tiny Honda race car jump ahead of me off the line. :)
I think ESPN took some footage at the event.