id Software's Usenet Group Posts Archive!

id_notes/John C/1998-08-17_1999-03-17



[idsoftware.com]


Welcome to id Software's Finger Service V1.5!



Name: John Carmack


Email: johnc@idsoftware.com


Description: Programmer


Project: Quake Arena


Last Updated: 03/17/1999 01:53:50 (Central Standard Time)


-------------------------------------------------------------------------------


3/17/99


-------


First impressions of the SGI visual workstation 320:



I placed an order for a loaded system ($11k) from their web site two months


ago. It still hasn't arrived (bad impression), but SGI did bring a loaner


system by for us to work with.



The system tower is better than standard pc fare, but I still think Apple's


new G3 has the best designed computer case.



The wide aspect LCD panel is very nice. A while ago I had been using a dual


monitor LCD setup on a previous intergraph, but the analog syncing LCD


screens had some fringing problems, and the gamma range was fairly different


from a CRT. The SGI display is perfectly crisp (digital interface), and


has great color. The lighting is a little bit uneven top to bottom on this


unit -- I am interested to see how the next unit looks for comparison.



Unfortunately, the card that they bundle with the LCD if you buy it


separately is NOT a good 3D accelerator, so if you care about 3D and want


this LCD screen, you need to buy an sgi visual workstation. Several of the


next generation consumer cards are going to support digital flat panel


outputs, so hopefully soon you will be able to run a TNT2 or something


out to one of these.



The super memory system does not appear to have provided ANY benefit to the


CPU. My memory benchmarking tests showed it running about the same as a


standard intel design.



Our first graphics testing looked very grim -- Quake3 didn't draw the world


at all. I spent a while trying to coax some output by disabling various


things, but to no avail. We reported it to SGI, and they got us a fix the


next day. Some bug with depthRange(). Even with the fix, 16 bit rendering


doesn't seem to work. I expect they will address this.



Other than that, there haven't been any driver anomalies, and both the game


and editor run flawlessly.



For single pass, top quality rendering (32 bit framebuffer, 32 bit depth


buffer, 32 bit trilinear textures, high res screen), the SGI has a higher


fill rate than any other card we have ever tested on a pc, but not by too


wide of a margin.



If your application can take advantage of multitexture, a TNT or rage128


will deliver slightly greater fill performance. It is likely that the next


speed bump of both chips will be just plain faster than the SGI on all


fill modes.



A serious flaw is that the LCD display can't support ANY other resolutions


except the native 1600*1024. The game chunks along at ten to fifteen fps


at that resolution (but it looks cool!). They desperately need to support


a pixel doubled 800*512 mode to make any kind of full screen work


possible. I expect they will address this.



Vsync disable is implemented wrong. Disabling sync causes it to continue


rendering, but the flip still doesn't happen until the next frame. This


gives repeatable (and faster) benchmark numbers, but with a flashing


screen that is unusable. The right way is to just cause the flip to happen


on the next scan line, like several consumer cards do, or blit. It gives


tearing artifacts, but it is still completely usable, and avoids temporal


nyquist issues between 30 and 60 hz. I expect they will address this.



Total throughput for games is only fair, about like an intergraph. Any


of the fast consumer cards will run a quake engine game faster than the


sgi in its current form. I'm sure some effort will be made to improve


this, but I doubt it will be a serious focus, unless some SGI engineers


develop unhealthy quake addictions. :-)



The unified memory system allows nearly a gig of textures, and individual


textures can be up to 4k by 4k. AGP texturing provides some of this


benefit for consumer cards, but not to the same degree or level of


performance.



The video stream support looks good, but I haven't tried using it yet.



Very high interpolator accuracy. All the consumer cards start to break


up a bit with high magnification, weird aspects, or long edges. The


professional cards (intergraph, glint, E&S, SGI) still do a better job.



SGI exports quite a few more useful OpenGL extensions than intergraph


does, but multisample antialiasing (as claimed in their literature)


doesn't seem to be one of them.



Overall, it looks pretty good, and I am probably going to move over to


using the SGI workstation full time when my real system arrives.



I was very happy with the previous two generations of intergraph


workstations, but this last batch (GT1) has been a bunch of lemons, and


the wildcat graphics has been delayed too long. The current realizm-II


systems just don't have enough fill rate for high end development.



For developers that don't have tons of money, the decision right now


is an absolute no-brainer -- buy a TNT and stick it in a cheap system.


It's a better "professional" 3D accelerator than you could buy at any


price not too long ago.




3/3/99


------


On the issue of railgun firing rates -- we played with it for a while at


the slower speed, but it has been put back to exactly Q2's rate of fire.



I do agree with Thresh that the way we had it initially (faster than Q2,


but with the same damage) made it an overpowered weapon in the hands of


highly skilled players, which is exactly what we should try to avoid.



An ideal game should give scores as close to directly proportional to


the players' relative skills as possible. The better player should win


in almost all cases, but the game will be more entertaining if the


inferior players are not completely dominated.



Quake 1 had really bad characteristics that way -- Thresh can play


extremely talented players and often prevent them from scoring a single


point. We wouldn't put up with a conventional sport that commonly


gave scores of 20 to 1 in championship matches, and I don't think


we should encourage it in our games.



Eliminating health items is probably the clearest way to prevent


blowout games, but that has never been popular. Still, we should


try to avoid weapon decisions that allow the hyper-skilled to pull


even farther away from the rest of the crowd. They will still win,


no matter what the weapons are, just not by as wide a margin.




1/29/99


-------


The issue of the 8192 unit map dimension limit has been nagging at me for


a long time now. The reason for the limit is that coordinates are


communicated over the network in 16 bit shorts divided as a sign bit,


twelve unit bits, and three fractional bits. There was also another


side to the representation problem, but it was rarely visible: if


you timescale down, you can actually tell the fractional bit granularity


in position and velocity.
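The wire format described above can be sketched in a few lines. This is an illustrative reconstruction of the scheme (not id's actual code): with three fractional bits, a coordinate is just the float scaled by 8 and truncated into a 16 bit short, which is where both the +/-4096 unit (8192 total) map limit and the 1/8 unit granularity come from.

```c
/* Pack a coordinate into 16 bits: sign bit, twelve unit bits, three
   fractional bits.  The sign/unit/fraction split falls out of two's
   complement once the value is scaled by 8.  Sketch of the scheme
   described above, not actual Quake source. */
short PackCoord(float f) {
    return (short)(f * 8.0f);     /* 3 fractional bits = units of 1/8 */
}

float UnpackCoord(short s) {
    return s * (1.0f / 8.0f);
}
```

Anything that is a multiple of 1/8 survives the round trip exactly; everything else gets truncated, which is the granularity you can see when you timescale down.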



The rest of the system (rendering and gameplay) has never had any issues


with larger maps, even in quake 1. There are some single precision


floating point issues that begin to creep in if things get really huge,


but maps should be able to cover many times the current limit without


any other changes.



A while ago I had changed Q3 so that the number of fractional bits was


a compile time option, which allowed you to trade off fine grain precision


for larger size. I was considering automatically optimizing this for each


level based on its size, but it still didn't feel like a great solution.



Another aspect of the problem that wasn't visible to the public was that


the fractional quantization of position could cause the position to


actually be inside a nearby solid when used for client side prediction.


The code had to check for this and try to correct the situation by jittering


the position in each of the possible directions it might have been


truncated from. This is a potential issue whenever there is any loss of


precision whatsoever in the server to client communication.
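The jitter correction described above can be sketched as follows. This is an illustrative sketch, not actual Quake code: `PointInSolid()` here is a toy stand-in for the engine's real collision query, and the 1/8 unit nudge matches the three fractional bits of the wire format.

```c
/* The received position was truncated to 1/8 unit granularity, so if
   it lands inside a solid, try nudging each axis by the amount it
   could have been truncated from until a free spot is found. */
#define FRACTION 0.125f              /* three fractional bits */

static int PointInSolid(const float pos[3]) {
    return pos[0] < 0.0f;            /* toy world: x < 0 is solid */
}

int FixQuantizedPosition(float pos[3]) {
    if (!PointInSolid(pos))
        return 1;                    /* already in open space */
    for (int dx = 0; dx <= 1; dx++)
        for (int dy = 0; dy <= 1; dy++)
            for (int dz = 0; dz <= 1; dz++) {
                float test[3] = { pos[0] + dx * FRACTION,
                                  pos[1] + dy * FRACTION,
                                  pos[2] + dz * FRACTION };
                if (!PointInSolid(test)) {
                    pos[0] = test[0];
                    pos[1] = test[1];
                    pos[2] = test[2];
                    return 1;        /* corrected */
                }
            }
    return 0;                        /* no nearby free spot found */
}
```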



The obvious solution is to just send the full floating point value for


positions, but I resisted that because the majority of our network


traffic is positional updates, and I didn't want to bloat it. There have


been other bandwidth savings in Q3, and LANs and LPB connections are also


relevant, so I was constantly evaluating the tradeoff.



Dealing with four or five players in view isn't a real problem. The big


bandwidth issues arrive when multiple players start unloading with rapid


fire weapons. (as an aside, I tried making 5hz fire weapons for Q3 to save


bandwidth, but no matter how much damage they did, 5hz fire rates just


seemed to feel slow and weak...)



I finally moved to a bit-level stream encoding to save some more bandwidth


and give me some more representational flexibility, and this got me thinking


about the characteristics of the data that bother us.



In general, the floating point coordinates have significant bits all through


the mantissa. Any movement along an angle will more or less randomize the


low order bits.



My small little insight was that because missiles are evaluated


parametrically instead of iteratively in Q3, a one-time snapping of the


coordinates can be performed at their generation time, giving them fixed


values with less significant bits for their lifetime without any effort


outside their spawning function. It also works for doors and plats, which


are also parametrically represented now. Most events will also have


integral coordinates.



The float encoder can check for an integral value in a certain range and


send that as a smaller number of bits, say 13 or so. If the value isn't


integral, it will be transmitted as a full 32 bit float.
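A minimal sketch of that encoder, assuming the 13 bit / +/-4096 figure from the text: one flag bit chooses between a small biased-integer field and the raw 32 bit float. The bitstream writer here just counts bits so the sketch is testable; the real thing would pack them into the network message.

```c
#include <stdint.h>
#include <string.h>

static int g_bitsWritten;

static void WriteBits(int value, int bits) {
    (void)value;
    g_bitsWritten += bits;           /* stand-in for the bit stream */
}

void WriteCoordFloat(float f) {
    if (f >= -4096.0f && f < 4096.0f && (float)(int)f == f) {
        WriteBits(1, 1);             /* flag: integral value */
        WriteBits((int)f + 4096, 13);/* 13 bits, biased positive */
    } else {
        uint32_t u;
        memcpy(&u, &f, 4);           /* raw IEEE-754 bit pattern */
        WriteBits(0, 1);             /* flag: full float follows */
        WriteBits((int)u, 32);
    }
}
```

An integral coordinate costs 14 bits on the wire instead of 33, which is why snapping missile, door, and plat origins to integers at spawn time pays off.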



The other thing I am investigating is sub-byte delta encoding of floating


point values. Even with arbitrary precision movement deltas, the sign and


exponent bits change with very low frequency except when you are very


near the origin. At the minimum, I should be able to cut the standard


player coordinate delta reps to three bytes from four.
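The observation can be checked directly: for two nearby coordinates, the IEEE-754 sign and exponent bits (the top nine of the 32) usually match, so a delta update only needs to resend the 23 mantissa bits, which is three bytes instead of four. This is a sketch of the idea, not the actual Q3 codec.

```c
#include <stdint.h>
#include <string.h>

uint32_t FloatBits(float f) {
    uint32_t u;
    memcpy(&u, &f, 4);               /* raw IEEE-754 bit pattern */
    return u;
}

/* True when only the 23 mantissa bits differ between the two values,
   i.e. a delta could skip the sign and exponent entirely. */
int SameSignAndExponent(float a, float b) {
    return (FloatBits(a) >> 23) == (FloatBits(b) >> 23);
}
```

The exception is movement near the origin, where small coordinate changes cross power-of-two boundaries and the exponent churns, just as the text says.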



So, the bottom line is that the bandwidth won't move much (it might even


go down if I cut the integral bits below 15), the maps become unbounded


in size to the point of single precision roundoff, and the client doesn't


have to care about position jittering (which was visible in Q3 code that


will be released).




1/10/99


-------



Ok, many of you have probably heard that I spoke at the macworld


keynote on tuesday. Some information is probably going to get


distorted in the spinning and retelling, so here is an info


dump straight from me:



Q3test, and later the full commercial Quake3: Arena, will be simultaneously


released on windows, mac, and linux platforms.



I think Apple is doing a lot of things right. A lot of what they are


doing now is catch-up to wintel, but if they can keep it up for the next


year, they may start making a really significant impact.



I still can't give the mac an enthusiastic recommendation for sophisticated


users right now because of the operating system issues, but they are working


towards correcting that with MacOS X.




The scoop on the new G3 mac hardware:



Basically, it's a great system, but Apple has oversold its


performance relative to intel systems. In terms of timedemo scores,


the new G3 systems should be near the head of the pack, but there


will be intel systems outperforming them to some degree. The mac has


not instantly become a "better" platform for games than wintel, it


has just made a giant leap from the back of the pack to near the


front.



I wish Apple would stop quoting "Bytemarks". I need to actually


look at the contents of that benchmark and see how it can be so


misleading. It is pretty funny listening to mac evangelist types


try to say that an iMac is faster than a pentium II-400. Nope.


Not even close.



From all of my tests and experiments, the new mac systems are


basically as fast as the latest pentium II systems for general


cpu and memory performance. This is plenty good, but it doesn't


make the intel processors look like slugs.



Sure, an in-cache, single precision, multiply-accumulate loop could


run twice as fast as a pentium II of the same clock rate, but


conversely, a double precision add loop would run twice as fast


on the pentium II.



Spec95 is a set of valid benchmarks in my opinion, and I doubt the


PPC systems significantly (if at all) outperform the intel systems.




The IO system gets mixed marks. The 66 mhz video slot is a good step


up from 33 mhz pci in previous products, but that's still half the


bandwidth of AGP 2X, and it can't texture from main memory. This


will have a small effect on 3D gaming, but not enough to push it


out of its class.



The 64 bit pci slots are a good thing for network and storage cards,


but the memory controller doesn't come close to getting peak


utilization out of it. Better than normal pci, though.




The video card is almost exactly what you will be able to get on


the pc side: a 16 mb rage-128. Running on a 66mhz pci bus, its


theoretical peak performance will be midway between the pci and


agp models on pc systems for command traffic limited scenes. Note


that current games are not actually command traffic limited, so the


effect will be significantly smaller. The fill rates will be identical.



The early systems are running the card at 75 mhz, which does put


it at a slight disadvantage to the TNT, but faster versions are


expected later. As far as I can tell, the rage-128 is as perfect


as the TNT feature-wise. The 32 mb option is a feature ATI can


hold over TNT.




Firewire is cool.




It's a simple thing, but the aspect of the new G3 systems that


struck me the most was the new case design. Not the flashy plastic


exterior, but the functional structure of it. The side of the


system just pops open, even with the power on, and lays the


motherboard and cards down flat while the disks and power supply


stay in the enclosure. It really is a great design, and the benefits


were driven home yesterday when I had to scavenge some ram out of old


wintel systems -- most case designs suck really bad.




---



I could gripe a bit about the story of our (lack of) involvement


with Apple over the last four years or so, but I'm calling that


water under the bridge now.



After all the past fiascos, I had been basically planning on ignoring Apple


until MacOS X (rhapsody) shipped, which would then turn it into a platform


that I was actually interested in.



Recently, Apple made a strategic corporate decision that games were a


fundamental part of a consumer oriented product line (duh). To help that


out, Apple began an evaluation of what it needed to do to help game


developers.



My first thought was "throw out MacOS", but they are already in the process


of doing that, and it's just not going to be completed overnight.



Apple has decent APIs for 2D graphics, input, sound, and networking,


but they didn't have a solid 3D graphics strategy.



Rave was sort of ok. Underspecified and with no growth path, but


sort of ok. Pursuing a proprietary api that wasn't competitive with


other offerings would have been a Very Bad Idea. They could have tried


to make it better, or even invent a brand new api, but Apple doesn't have


much credibility in 3D programming.



For a while, it was looking like Apple might do something stupid,


like license DirectX from microsoft and be put into a guaranteed


trailing edge position behind wintel.



OpenGL was an obvious direction, but there were significant issues with


the licensing and implementation that would need to be resolved.



I spent a day at apple with the various engineering teams and executives,


laying out all the issues.



The first meeting didn't seem like it went all that well, and there wasn't


a clear direction visible for the next two months. Finally, I got the all


clear signal that OpenGL was a go and that apple would be getting both the


sgi codebase and the conix codebase and team (the best possible arrangement).



So, I got a mac and started developing on it. My first weekend of


effort had QuakeArena limping along while held together with duct


tape, but weekend number two had it properly playable, and weekend


number three had it brought up to full feature compatibility. I


still need to do some platform specific things with odd configurations


like multi monitor and addon controllers, but basically now it's


just a matter of compiling on the mac to bring it up to date.



This was important to me, because I felt that Quake2 had slipped a bit in


portability because it had been natively developed on windows. I like the


discipline of simultaneous portable development.



After 150 hours or so of concentrated mac development, I learned a


lot about the platform.



CodeWarrior is pretty good. I was comfortable developing there


almost immediately. I would definitely say VC++ 6.0 is a more powerful


overall tool, but CW has some nice little aspects that I like. I


am definitely looking forward to CW for linux. Unix purists may


be aghast, but I have always liked gui dev environments more than


a bunch of xterms running vi and gdb.



The hardware (even the previous generation stuff) is pretty good.



The OpenGL performance is pretty good. There is a lot of work


underway to bring the OpenGL performance to the head of the pack,


but the existing stuff works fine for development.



The low level operating system SUCKS SO BAD it is hard to believe.



The first order problem is lack of memory management / protection.



It took me a while to figure out that the zen of mac development is


"be at peace while rebooting". I rebooted my mac system more times


the first weekend than I have rebooted all the WinNT systems I


have ever owned. True, it has gotten better now that I know my


way around a bit more, and the codebase is fully stable, but there


is just no excuse for an operating system in this day and age to


act like it doesn't have access to memory protection.



The first thing that bit me was the static memory allocation for


the program. Modern operating systems just figure out how much


memory you need, but because the mac was originally designed for


systems without memory management, significant things have to be


laid out ahead of time.



Porting a win32 game to the mac will probably involve more work


dealing with memory than any other aspect. Graphics, sound, and


networking have reasonable analogues, but you just can't rely


on being able to malloc() whatever you want on the mac.



Sure, game developers can manage their own memory, but an operating


system that has proper virtual memory will let you develop


a lot faster.



The lack of memory protection is the worst aspect of mac development.


You can just merrily write all over other programs, the development


environment, and the operating system from any application.



I remember that. From dos 3.3 in 1990.



Guard pages will help catch simple overruns, but it won't do anything


for all sorts of other problems.




The second order problem is lack of preemptive multitasking.



The general responsiveness while working with multiple apps


is significantly worse than windows, and you often run into


completely modal dialogs that don't let you do anything else at all.




A third order problem is that a lot of the interfaces are fairly


clunky.



There are still many aspects of the mac that clearly show design


decisions based on a 128k 68000 based machine. Wintel has grown


a lot more than the mac platform did. It may have been because the


intel architecture didn't evolve gracefully and that forced the


software to reevaluate itself more fully, or it may just be that


microsoft pushed harder.



Carbon sanitizes the worst of the crap, but it doesn't turn it


into anything particularly good looking.




MacOS X nails all these problems, but that's still a ways away.



I did figure one thing out -- I was always a little curious why


the early BeOS advocates were so enthusiastic. Coming from a


NEXTSTEP background, BeOS looked to me like a fairly interesting


little system, but nothing special. To a mac developer, it must


have looked like the promised land...




12/30/98


--------


I got several vague comments about being able to read "stuff" from shared


memory, but no concrete examples of security problems.



However, Gregory Maxwell pointed out that it wouldn't work cross platform


with 64 bit pointer environments like linux alpha. That is a killer, so


I will be forced to do everything the hard way. It's probably for the


best, from a design standpoint anyway, but it will take a little more effort.




12/29/98


--------


I am considering taking a shortcut with my virtual machine implementation


that would make the integration a bit easier, but I'm not sure that it


doesn't compromise the integrity of the base system.



I am considering allowing the interpreted code to live in the global address


space, instead of a private 0 based address space of its own. Store


instructions from the VM would be confined to the interpreter's address


space, but loads could access any structures.



On the positive side:



This would allow full speed (well, full interpreted speed) access to variables


shared between the main code and the interpreted modules. This allows system


calls to return pointers, instead of filling in allocated space in the


interpreter's address space.



For most things, this is just a convenience that will cut some development


time. Most of the shared accesses could be recast as "get" system calls,


and it is certainly arguable that that would be a more robust programming


style.



The most prevalent change this would prevent is all the cvar_t uses. Things


could stay in the same style as Q2, where cvar accesses are free and


transparently updated. If the interpreter lives only in its own address


space, then cvar access would have to be like Q1, where looking up a


variable is a potentially time consuming operation, and you wind up adding


lots of little cvar caches that are updated every frame or restart.



On the negative side:



A client game module with a bug could cause a bus error, which would not be


possible with a pure local address space interpreter.



I can't think of any exploitable security problems that read only access to


the entire address space opens, but if anyone thinks of something, let me


know.




11/4/98


-------


More extensive comments on the interpreted-C decision later, but a quick


note: the plan is to still allow binary dll loading so debuggers can be


used, but it should be interchangeable with the interpreted code. Client


modules can only be debugged if the server is set to allow cheating, but


it would be possible to just use the binary interface for server modules


if you wanted to sacrifice portability. Most mods will be able to be


implemented with just the interpreter, but some mods that want to do


extensive file access or out of band network communications could still


be implemented just as they are in Q2. I will not endorse any use of


binary client modules, though.




11/3/98


-------



This was the most significant thing I talked about at The Frag, so here it


is for everyone else.



The way the QA game architecture has been developed so far has been as two


separate binary dll's: one for the server side game logic, and one for the


client side presentation logic.



While it was easiest to begin development like that, there are two crucial


problems with shipping the game that way: security and portability.



It's one thing to ask the people who run dedicated servers to make informed


decisions about the safety of a given mod, but it's a completely different


matter to auto-download a binary image to a first time user connecting to a


server they found.



The quake 2 server crashing attacks have certainly proven that there are


hackers that enjoy attacking games, and shipping around binary code would


be a very tempting opening for them to do some very nasty things.




With quake and Quake 2, all game modifications were strictly server side,


so any port of the game could connect to any server without problems.


With Quake 2's binary server dll's not all ports could necessarily run a


server, but they could all play.



With significant chunks of code now running on the client side, if we stuck


with binary dll's then the less popular systems would find that they could


not connect to new servers because the mod code hadn't been ported. I


considered having things set up in such a way that client game dll's could


be sort of forwards-compatible, where they could always connect and play,


but new commands and entity types just might not show up. We could also


GPL the game code to force mod authors to release source with the binaries,


but that would still be inconvenient to deal with all the porting.



Related to both issues is client side cheating. Certain cheats are easy to do


if you can hack the code, so the server will need to verify which code the


client is running. With multiple ported versions, it wouldn't be possible


to do any binary verification.



If we were willing to wed ourselves completely to the windows platform, we


might have pushed ahead with some attempt at binary verification of dlls,


but I ruled that option out. I want QuakeArena running on every platform


that has hardware accelerated OpenGL and an internet connection.




The only real solution to these problems is to use an interpreted language


like Quake 1 did. I have reached the conclusion that the benefits of a


standard language outweigh the benefits of a custom language for our


purposes. I would not go back and extend QC, because that stretches the


effort from simply system and interpreter design to include language design,


and there is already plenty to do.



I had been working under the assumption that Java was the right way to go,


but recently I reached a better conclusion.



The programming language for QuakeArena mods is interpreted ANSI C. (well,


I am dropping the double data type, but otherwise it should be pretty


conformant)



The game will have an interpreter for a virtual RISC-like CPU. This should


have a minor speed benefit over a byte-coded, stack based java interpreter.


Loads and stores are confined to a preset block of memory, and access to all


external system facilities is done with system traps to the main game code,


so it is completely secure.
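The confinement above can be sketched as a masked sandbox. This is a minimal illustrative sketch, assuming a power-of-two memory block; the mask trick and all names are illustrative, not the actual Q3 interpreter.

```c
#include <stdint.h>

#define VM_MEMORY_SIZE (1 << 20)         /* 1 MB sandbox, power of two */

typedef struct {
    uint8_t memory[VM_MEMORY_SIZE];
} vm_t;

/* Every VM load and store is masked into the preset block, so no
   address computed by the interpreted code can escape the sandbox.
   Anything outside it (files, network, rendering) must go through a
   system trap handled by the trusted main game code. */
uint32_t VM_Load(vm_t *vm, uint32_t addr) {
    return vm->memory[addr & (VM_MEMORY_SIZE - 1)];
}

void VM_Store(vm_t *vm, uint32_t addr, uint8_t value) {
    vm->memory[addr & (VM_MEMORY_SIZE - 1)] = value;
}
```

Masking instead of branching keeps the per-instruction cost to a single AND, which matters when every memory access in the mod pays it.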



The tools necessary for building mods will all be freely available: a


modified version of LCC and a new program called q3asm. LCC is a wonderful


project -- a cross platform, cross compiling ANSI C compiler done in under


20K lines of code. Anyone interested in compilers should pick up a copy of


"A retargetable C compiler: design and implementation" by Fraser and Hanson.



You can't link against any libraries, so every function must be resolved.


Things like strcmp, memcpy, rand, etc. must all be implemented directly. I


have code for all the ones I use, but some people may have to modify their


coding styles or provide implementations for other functions.
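Routines like the ones named above are small enough to carry along in the game source. Minimal sketches (prefixed `my_` here only to avoid colliding with the real libc in this standalone example):

```c
#include <stddef.h>

int my_strcmp(const char *a, const char *b) {
    while (*a && *a == *b) {
        a++;
        b++;
    }
    /* compare as unsigned so the sign of the result is well defined */
    return (unsigned char)*a - (unsigned char)*b;
}

void *my_memcpy(void *dst, const void *src, size_t count) {
    char *d = dst;
    const char *s = src;
    while (count--)
        *d++ = *s++;
    return dst;
}
```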



It is a fair amount of work to restructure all the interfaces to not share


pointers between the system and the games, but it is a whole lot easier


than porting everything to a new language. The client game code is about


10k lines, and the server game code is about 20k lines.



The drawback is performance. It will probably perform somewhat like QC.


Most of the heavy lifting is still done in the builtin functions for path


tracing and world sampling, but you could still hurt yourself by looping


over tons of objects every frame. Yes, this does mean more load on servers,


but I am making some improvements in other parts that I hope will balance


things to about the way Q2 was on previous generation hardware.



There is also the amusing avenue of writing hand tuned virtual


assembly language for critical functions...



I think this is The Right Thing.




10/14/98


--------



It has been difficult to write .plan updates lately. Every time I start


writing something, I realize that I'm not going to be able to cover it


satisfactorily in the time I can spend on it. I have found that terse


little comments either get misinterpreted, or I get deluged by email


from people wanting me to expand upon them.



I wanted to do a .plan about my evolving thoughts on code quality


and lessons learned through quake and quake 2, but in the interest


of actually completing an update, I decided to focus on one change


that was intended to just clean things up, but had a surprising


number of positive side effects.



Since DOOM, our games have been defined with portability in mind.


Porting to a new platform involves having a way to display output,


and having the platform tell you about the various relevant inputs.


There are four principal inputs to a game: keystrokes, mouse moves,


network packets, and time. (If you don't consider time an input


value, think about it until you do -- it is an important concept)



These inputs were taken in separate places, as seemed logical at the


time. A function named Sys_SendKeyEvents() was called once a


frame that would rummage through whatever it needed to on a


system level, and call back into game functions like Key_Event( key,


down ) and IN_MouseMoved( dx, dy ). The network system


dropped into system specific code to check for the arrival of packets.


Calls to Sys_Milliseconds() were littered all over the code for


various reasons.



I felt that I had slipped a bit on the portability front with Q2 because


I had been developing natively on windows NT instead of cross


developing from NEXTSTEP, so I was reevaluating all of the system


interfaces for Q3.



I settled on combining all forms of input into a single system event


queue, similar to the windows message queue. My original intention


was to just rigorously define where certain functions were called and


cut down the number of required system entry points, but it turned


out to have much stronger benefits.



With all events coming through one point (The return values from


system calls, including the filesystem contents, are "hidden" inputs


that I make no attempt at capturing), it was easy to set up a


journalling system that recorded everything the game received. This


is very different than demo recording, which just simulates a network


level connection and lets time move at its own rate. Realtime


applications have a number of unique development difficulties


because of the interaction of time with inputs and outputs.
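The unified queue described above can be sketched as follows; every input, including the timestamp itself, becomes one record, and each record is written to a journal file as it is queued so a session can later be replayed as a deterministic batch run. Types and names here are illustrative, not the actual Q3 source.

```c
#include <stdio.h>

typedef enum { SE_KEY, SE_MOUSE, SE_PACKET } eventType_t;

typedef struct {
    int         time;            /* time is an input value too */
    eventType_t type;
    int         value, value2;   /* key + down, dx + dy, etc. */
} sysEvent_t;

#define MAX_QUEUED_EVENTS 256
static sysEvent_t eventQueue[MAX_QUEUED_EVENTS];
static int eventHead, eventTail;
static FILE *journalFile;        /* when set, record everything received */

void Sys_QueueEvent(int time, eventType_t type, int value, int value2) {
    sysEvent_t ev = { time, type, value, value2 };
    if (journalFile)
        fwrite(&ev, sizeof(ev), 1, journalFile);
    eventQueue[eventHead++ % MAX_QUEUED_EVENTS] = ev;
}

sysEvent_t Sys_GetEvent(void) {
    return eventQueue[eventTail++ % MAX_QUEUED_EVENTS];
}
```

On playback, the game pulls events from the journal instead of the system, and because the timestamps come from the file rather than the clock, the replay covers exactly the same code paths no matter how slowly it runs.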



Transient flaw debugging. If a bug can be reproduced, it can be


fixed. The nasty bugs are the ones that only happen every once in a


while after playing randomly, like occasionally getting stuck on a


corner. Often when you break in and investigate it, you find that


something important happened the frame before the event, and you


have no way of backing up. Even worse are realtime smoothness


issues -- was that jerk of his arm a bad animation frame, a network


interpolation error, or my imagination?



Accurate profiling. Using an intrusive profiler on Q2 doesn't give


accurate results because of the realtime nature of the simulation. If


the program is running half as fast as normal due to the


instrumentation, it has to do twice as much server simulation as it


would if it wasn't instrumented, which also goes slower, which


compounds the problem. Aggressive instrumentation can slow it


down to the point of being completely unplayable.



Realistic bounds checker runs. Bounds checker is a great tool, but


you just can't interact with a game built for final checking, it's just


waaaaay too slow. You can let a demo loop play back overnight, but


that doesn't exercise any of the server or networking code.



The key point: Journaling of time along with other inputs turns a


realtime application into a batch process, with all the attendant


benefits for quality control and debugging. These problems, and


many more, just go away. With a full input trace, you can accurately


restart the session and play back to any point (conditional


breakpoint on a frame number), or let a session play back at an


arbitrarily degraded speed, but cover exactly the same code paths.
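
A sketch of how such a journaling wrapper could work; the mode names, file format, and the Sys_GetEvent stub here are my assumptions, not the shipped Q3 code:

```c
#include <stdio.h>

/* Minimal stand-in for the unified event record. */
typedef struct { int time, type, value; } sysEvent_t;

typedef enum { JOURNAL_NONE, JOURNAL_RECORD, JOURNAL_PLAY } journalMode_t;

static journalMode_t journalMode = JOURNAL_NONE;
static FILE         *journalFile;

/* Stub for the single real system entry point. */
static int fakeClock;
static sysEvent_t Sys_GetEvent(void) {
    fakeClock += 16;                        /* pretend 16 ms per event */
    sysEvent_t ev = { fakeClock, 1, fakeClock & 0xff };
    return ev;
}

/* In record mode every event, timestamp included, is written out;
   in playback mode events come from the file instead of the OS,
   so the whole session replays deterministically. */
sysEvent_t Com_GetEvent(void) {
    sysEvent_t ev;
    if (journalMode == JOURNAL_PLAY) {
        if (fread(&ev, sizeof(ev), 1, journalFile) == 1)
            return ev;
        sysEvent_t none = { 0, 0, 0 };      /* journal exhausted */
        return none;
    }
    ev = Sys_GetEvent();
    if (journalMode == JOURNAL_RECORD) {
        fwrite(&ev, sizeof(ev), 1, journalFile);
        fflush(journalFile);                /* survive a crash next frame */
    }
    return ev;
}
```

Because the timestamps come back out of the file on playback, time stops being wall-clock and becomes just another replayed input, which is the whole trick.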



I'm sure lots of people realize that immediately, but it only truly sunk


in for me recently. In thinking back over the years, I can see myself


feeling around the problem, implementing partial journaling of


network packets, and including the "fixedtime" cvar to eliminate most


timing reproducibility issues, but I never hit on the proper global


solution. I had always associated journaling with turning an


interactive application into a batch application, but I never


considered the small modification necessary to make it applicable to


a realtime application.



In fact, I was probably blinded to the obvious because of one of my


very first successes: one of the important technical achievements


of Commander Keen 1 was that, unlike most games of the day, it


adapted its play rate based on the frame speed (remember all those


old games that got unplayable when you got a faster computer?). I


had just resigned myself to the non-deterministic timing of frames


that resulted from adaptive simulation rates, and that probably


influenced my perspective on it all the way until this project.



It's nice to see a problem clearly in its entirety for the first time, and


know exactly how to address it.




9/10/98


-------



I recently set out to start implementing the dual-processor acceleration


for QA, which I have been planning for a while. The idea is to have one


processor doing all the game processing, database traversal, and lighting,


while the other processor does absolutely nothing but issue OpenGL calls.



This effectively treats the second processor as a dedicated geometry


accelerator for the 3D card. This can only improve performance if the


card isn't the bottleneck, but voodoo2 and TNT cards aren't hitting their


limits at 640*480 on even very fast processors right now.



For single player games where there is a lot of cpu time spent running the


server, there could conceivably be up to an 80% speed improvement, but for


network games and timedemos a more realistic goal is a 40% or so speed


increase. I will be very satisfied if I can make a dual pentium-pro 200


system perform like a pII-300.



I started on the specialized code in the renderer, but it struck me that


it might be possible to implement SMP acceleration with a generic OpenGL


driver, which would allow Quake2 / sin / halflife to take advantage of it


well before QuakeArena ships.



It took a day of hacking to get the basic framework set up: an smpgl.dll


that spawns another thread that loads the original opengl32.dll or


3dfxgl.dll, and watches a work queue for all the functions to call.
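
A stripped-down version of that producer/consumer framework might look like the following; pthreads are used here for brevity (the real thing was a Windows DLL), and every name is made up:

```c
#include <pthread.h>
#include <stdatomic.h>

#define QUEUE_SIZE 4096                 /* must be a power of two */

typedef struct { int func, parm; } glCmd_t;

static glCmd_t     cmds[QUEUE_SIZE];
static atomic_uint cmdHead, cmdTail;    /* next write / next read */
static atomic_int  drawDone;

/* Game thread: append one command, spinning if the ring is full. */
static void QueueCmd(int func, int parm) {
    unsigned h = atomic_load_explicit(&cmdHead, memory_order_relaxed);
    while (h - atomic_load_explicit(&cmdTail, memory_order_acquire) == QUEUE_SIZE)
        ;                               /* consumer is behind; wait */
    cmds[h & (QUEUE_SIZE - 1)] = (glCmd_t){ func, parm };
    atomic_store_explicit(&cmdHead, h + 1, memory_order_release);
}

static long dispatched;                 /* stands in for calling the real driver */

/* Driver thread: drain commands and hand them to the wrapped DLL. */
static void *DriverThread(void *arg) {
    unsigned t = 0;
    (void)arg;
    for (;;) {
        while (t == atomic_load_explicit(&cmdHead, memory_order_acquire)) {
            if (atomic_load(&drawDone) &&
                t == atomic_load_explicit(&cmdHead, memory_order_acquire))
                return NULL;            /* producer finished and ring drained */
        }
        glCmd_t c = cmds[t & (QUEUE_SIZE - 1)];
        dispatched += c.parm;           /* real code: call through to opengl32 */
        atomic_store_explicit(&cmdTail, ++t, memory_order_release);
    }
}
```

This is exactly the high-bandwidth producer/consumer relationship whose fill and drain costs get measured a few paragraphs down: every command and every vertex crosses this queue.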



I get it basically working, then start doing some timings. It's 20%


slower than the single processor version.



I go in and optimize all the queuing and working functions, tune the


communications facilities, check for SMP cache collisions, etc.



After a day of optimizing, I finally squeak out some performance gains on


my tests, but they aren't very impressive: 3% to 15% on one test scene,


but still slower on the other one.



This was fairly depressing. I had always been able to get pretty much


linear speedups out of the multithreaded utilities I wrote, even up to


sixteen processors. The difference is that the utilities just split up


the work ahead of time, then don't talk to each other until they are done,


while here the two threads work in a high bandwidth producer / consumer


relationship.



I finally got around to timing the actual communication overhead, and I was


appalled: it was taking 12 msec to fill the queue, and 17 msec to read it out


on a single frame, even with nothing else going on. I'm surprised things


got faster at all with that much overhead.



The test scene I was using created about 1.5 megs of data to relay all the


function calls and vertex data for a frame. That data had to go to main


memory from one processor, then back out of main memory to the other.


Admittedly, it is a bitch of a scene, but that is where you want the


acceleration...



The write times could be made over twice as fast if I could turn on the


PII's write combining feature on a range of memory, but the reads (which


were the gating factor) can't really be helped much.



Streaming large amounts of data to and from main memory can be really grim.


The next write may force a cache writeback to make room for it, then the


read from memory to fill the cacheline (even if you are going to write over


the entire thing), then eventually the writeback from the cache to main


memory where you wanted it in the first place. You also tend to eat one


more read when your program wants to use the original data that got evicted


at the start.



What is really needed for this type of interface is a streaming read cache


protocol that performs similarly to the write combining: three dedicated


cachelines that let you read or write from a range without evicting other


things from the cache, and automatically prefetching the next cacheline as


you read.



Intel's write combining modes work great, but they can't be set directly


from user mode. All drivers that fill DMA buffers (like OpenGL ICDs...)


should definitely be using them, though.



Prefetch instructions can help with the stalls, but they still don't prevent


all the wasted cache evictions.



It might be possible to avoid main memory altogether by arranging things


so that the sending processor ping-pongs between buffers that fit in L2,


but I'm not sure if a cache coherent read on PIIs just goes from one L2


to the other, or if it becomes a forced memory transaction (or worse, two


memory transactions). It would also limit the maximum amount of overlap


in some situations. You would also get cache invalidation bus traffic.



I could probably trim 30% of my data by going to a byte level encoding of


all the function calls, instead of the explicit function pointer / parameter


count / all-parms-are-32-bits that I have now, but half of the data is just


raw vertex data, which isn't going to shrink unless I did evil things like


quantize floats to shorts.
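
The savings being described can be shown with a toy encoder; both formats here are invented for illustration:

```c
#include <string.h>
#include <stdint.h>

/* Current scheme: 32-bit function pointer, 32-bit parm count, and
   every parameter widened to 32 bits. */
static size_t EncodeVerbose(uint8_t *out, uint32_t func,
                            const uint32_t *parms, uint32_t count) {
    uint8_t *p = out;
    memcpy(p, &func,  4); p += 4;
    memcpy(p, &count, 4); p += 4;
    memcpy(p, parms, count * 4); p += count * 4;
    return (size_t)(p - out);
}

/* Byte-level scheme: a one-byte opcode implies the parameter layout,
   so small parameters can stay small. */
static size_t EncodeCompact(uint8_t *out, uint8_t opcode,
                            const uint8_t *parms, uint32_t count) {
    out[0] = opcode;
    memcpy(out + 1, parms, count);
    return 1 + count;
}
```

A glColor4ub-style call with four byte parameters takes 24 bytes in the verbose form but 5 in the compact one; raw vertex floats, though, stay 4 bytes either way, which is why the overall win caps out around the 30% he estimates.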



Too much effort for what looks like a relatively minor speedup. I'm giving


up on this approach, and going back to explicit threading in the renderer so


I can make most of the communicated data implicit.



Oh well. It was amusing work, and I learned a few things along the way.




9/7/98


------


I just got a production TNT board installed in my Dolch today.



The riva-128 was a troublesome part. It scored well on benchmarks, but it had


some pretty broken aspects to it, and I never recommended it (you are better


off with an intel I740).



There aren't any troublesome aspects to TNT. It's just great. Good work, Nvidia.



In terms of raw speed, a 16 bit color multitexture app (like quake / quake 2)


should still run a bit faster on a voodoo2, and an SLI voodoo2 should be faster


for all 16 bit color rendering, but TNT has a lot of other things going for it:



32 bit color and 24 bit z buffers. They cost speed, but it is usually a better


quality tradeoff to go one resolution lower but with twice the color depth.



More flexible multitexture combine modes. Voodoo can use its multitexture for


diffuse lightmaps, but not for the specular lightmaps we offer in QuakeArena.


If you want shiny surfaces, voodoo winds up leaving half of its texturing


power unused (you can still run with diffuse lightmaps for max speed).



Stencil buffers. There aren't any apps that use it yet, but stencil allows


you to do a lot of neat tricks.



More texture memory. Even more than it seems (16 vs 8 or 12), because all of the


TNT's memory can be used without restrictions. Texture swapping is the voodoo's


biggest problem.



3D in desktop applications. There is enough memory that you don't have to worry


about window and desktop size limits, even at 1280*1024 true color resolution.



Better OpenGL ICD. 3dfx will probably do something about that, though.



This is the shape of 3D boards to come. Professional graphics level


rendering quality with great performance at a consumer price.



We will be releasing preliminary QuakeArena benchmarks on all the new boards


in a few weeks. Quake 2 is still a very good benchmark for moderate polygon


counts, so our test scenes for QA involve very high polygon counts, which


stress driver quality a lot more. There are a few surprises in the current


timings...



---



A few of us took a couple days off in vegas this weekend. After about


ten hours at the tables over friday and saturday, I got a tap on the shoulder...



Three men in dark suits introduced themselves and explained that I was welcome


to play any other game in the casino, but I am not allowed to play


blackjack anymore.



Ah well, I guess my blackjack days are over. I was actually down a bit for


the day when they booted me, but I made +$32k over five trips to vegas in the


past two years or so.



I knew I would get kicked out sooner or later, because I don't play "safely".


I sit at the same table for several hours, and I range my bets around 10 to 1.





8/17/98


-------


I added support for HDTV style wide screen displays in QuakeArena, so


24" and 28" monitors can now cover the entire screen with game graphics.



On a normal 4:3 aspect ratio screen, a 90 degree horizontal field of view


gives a 75 degree vertical field of view. If you keep the vertical fov


constant and run on a wide screen, you get a 106 degree horizontal fov.



Because we specify fov with the horizontal measurement, you need to change


fov when going into or out of a wide screen mode. I am considering changing


fov to be the vertical measurement, but it would probably cause a lot of


confusion if "fov 90" becomes a big fisheye.



Many video card drivers are supporting the ultra high res settings


like 1920 * 1080, but hopefully they will also add support for lower


settings that can be good for games, like 856 * 480.



---



I spent a day out at apple last week going over technical issues.



I'm feeling a lot better about MacOS X. Almost everything I like about


rhapsody will be there, plus some solid additions.



I presented the OpenGL case directly to Steve Jobs as strongly as possible.



If Apple embraces OpenGL, I will be strongly behind them. I like OpenGL more


than I dislike MacOS. :)



---



Last friday I got a phone call: "want to make some exhibition runs at the


import / domestic drag wars this sunday?". It wasn't particularly good


timing, because the TR had a slipping clutch and the F50 still hasn't gotten


its computer mapping sorted out, but we got everything functional in time.



The tech inspector said that my cars weren't allowed to run in the 11s


at the event because they didn't have roll cages, so I was supposed to go


easy.



The TR wasn't running its best, only doing low 130 mph runs. The F50 was


making its first sorting out passes at the event, but it was doing ok. My


last pass was an 11.8 (oops) @ 128, but we still have a ways to go to get the


best times out of it.



I'm getting some racing tires on the F50 before I go back. It sucked watching


a tiny honda race car jump ahead of me off the line. :)



I think ESPN took some footage at the event.