Friday, April 22, 2022

Will GPT-2 XL (1.5B) run on NVIDIA RTX 3060 12GB? (sampling)

Yes. The largest GPT-2 model (1.5B parameters, officially named GPT-2 XL) will run on an RTX 3060 12GB, but it barely fits. It will not, however, load in TensorFlow <2.4, because you need CUDA 11, at least with this card. With anything older, the GPU is visible to TensorFlow but model loading hangs, showing only very slight GPU activity.

Nor will it load on TensorFlow >2.4, most likely because either TensorFlow itself or CUDA 11.2 takes a bit more memory, so the model no longer fits on the GPU and crashes with an out-of-memory (OOM) error every time.
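
The version constraint described above can be checked before attempting to load the model. This is a minimal sketch, assuming that only the 2.4.x series works, as observed on this particular setup; the helper name tf_version_ok is hypothetical and not part of TensorFlow.

```python
def tf_version_ok(version: str) -> bool:
    """Return True only for TensorFlow 2.4.x, the series that worked in this post."""
    parts = version.split(".")
    if len(parts) < 2:
        return False
    # Compare major and minor as strings so that "2.40" or "2.10" do not match.
    return parts[0] == "2" and parts[1] == "4"
```

With TensorFlow installed, the helper can be called as tf_version_ok(tf.__version__), and tf.config.list_physical_devices("GPU") shows whether the card is visible at all. Visibility alone does not guarantee that the model will load.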


So what do you need?

  • Use the RTX 3060 as a secondary GPU so that it has as much spare memory as possible,
  • Have drivers compatible with CUDA 11 and cuDNN 8.0 (I used 473.04),
  • Use TensorFlow version 2.4.x only,
  • Slightly update the code if you want to have a vanilla setup,
  • Do not output too many tokens at the same time,
  • Keep the number of output tokens low (like 20), and reduce it to 10 when the input is >750 tokens.
  • (optional) To produce one long "multi-sampled" output, keep looping output->input with [input size] = [maximum length] - [# tokens to output], keeping the last tokens of the input and removing the <|endoftext|> token from the output. For example, for 10 output tokens the input should be the last 1014 tokens; then remove 10 tokens from the beginning, append the output to the end of the input, and repeat. (Tested with top-k 1400, temperature 0.95; other settings may require further token reduction; speed: 1 token per ~0.45 s.)
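
The optional looping step above can be sketched as follows. This is only an illustration under assumptions, not the code used for this post: sample_fn is a hypothetical stand-in for the real model call (it takes the input tokens and the number of tokens to produce and returns the new token ids), and stopping when a chunk contains nothing but <|endoftext|> is my own choice to avoid an endless loop.

```python
from typing import Callable, List

MAX_LENGTH = 1024   # GPT-2 context size in tokens
EOT_TOKEN = 50256   # id of <|endoftext|> in the GPT-2 vocabulary


def multi_sample(
    context: List[int],
    sample_fn: Callable[[List[int], int], List[int]],  # hypothetical model call
    total_tokens: int,
    step: int = 10,
) -> List[int]:
    """Generate total_tokens tokens in chunks of `step`, sliding the input window."""
    window_size = MAX_LENGTH - step           # e.g. 1024 - 10 = 1014
    window = list(context[-window_size:])     # keep only the last tokens of the input
    generated: List[int] = []
    while len(generated) < total_tokens:
        chunk = sample_fn(window, step)
        chunk = [t for t in chunk if t != EOT_TOKEN]   # remove <|endoftext|>
        if not chunk:
            break  # assumption: stop rather than loop forever on empty output
        chunk = chunk[: total_tokens - len(generated)]
        generated.extend(chunk)
        # Append the output and drop the oldest tokens from the beginning.
        window = (window + chunk)[-window_size:]
    return generated
```

Because the window is trimmed after every chunk, the model never sees more than 1014 input tokens when step is 10, which is what keeps the memory use bounded.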


Not yet known:

  • Will it crash during model loading at times due to different seeds?


Deep learning and sampling comparison, GTX 1660 vs RTX 3060, with GPT-2 (original code and modified nshepperd implementation):

Deep learning is a bit faster on the 3060, but sampling is slower than on the 1660. However, the 774M model barely fits on the 1660 and sometimes crashes during model loading. Sampling 20 tokens with the 774M model took 3~4 s on the 1660 and 5~6 s on the 3060. The 3060 shows about 3% GPU core utilization during that time, which makes it clear that the increased number of CUDA cores does not help sampling and in fact seems to slow it down. This is most likely because the performance increase (synthetic benchmark) is only ~1.4x overall and ~1.7x for compute, whereas the number of CUDA cores increased ~2.4x from the 1660 to the 3060.
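
The numbers above can be worked through in a few lines. This is just arithmetic on the figures quoted in this post (timings, benchmark ratios and core ratio); the variable names are mine, and dividing the benchmark gain by the core ratio is only a rough way to express "performance per core".

```python
def seconds_per_token(total_seconds: float, tokens: int) -> float:
    """Average time spent on one sampled token."""
    return total_seconds / tokens


TOKENS = 20  # tokens sampled per run with the 774M model

# (best case, worst case) per-token latency for each card
gtx_1660 = (seconds_per_token(3.0, TOKENS), seconds_per_token(4.0, TOKENS))
rtx_3060 = (seconds_per_token(5.0, TOKENS), seconds_per_token(6.0, TOKENS))

CORE_RATIO = 2.4                      # ~2.4x more CUDA cores on the 3060
overall_per_core = 1.4 / CORE_RATIO   # overall benchmark gain per unit of core count
compute_per_core = 1.7 / CORE_RATIO   # compute benchmark gain per unit of core count

print(f"1660: {gtx_1660[0]:.2f}-{gtx_1660[1]:.2f} s/token")
print(f"3060: {rtx_3060[0]:.2f}-{rtx_3060[1]:.2f} s/token")
print(f"relative per-core performance: {overall_per_core:.2f} overall, "
      f"{compute_per_core:.2f} compute")
```

That gives 0.15-0.20 s per token on the 1660 against 0.25-0.30 s on the 3060, and a relative per-core figure of roughly 0.58 (overall) and 0.71 (compute) for the 3060.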


Why does sampling not fully utilize either card?

I believe it is because GPT-2 has an auto-regressive decoder, which really does decode one token at a time, so all that fancy mathematics is simply not enough to fully utilize the GPU. Batching may speed up the process. However, if the model barely fits, batching is a problem as well...
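
A toy illustration of that sequential dependency: each new token needs all previous tokens, so the steps cannot run in parallel, while a batch adds more sequences to each step. The function toy_next_token is a hypothetical stand-in for one forward pass of the model, not GPT-2 itself.

```python
from typing import List, Tuple

VOCAB_SIZE = 50257  # GPT-2 vocabulary size


def toy_next_token(tokens: List[int]) -> int:
    """Hypothetical stand-in for one forward pass: depends on ALL previous tokens."""
    return (sum(tokens) * 31 + 7) % VOCAB_SIZE


def decode(batch: List[List[int]], n_new: int) -> Tuple[List[List[int]], int]:
    """Decode n_new tokens for every sequence; return sequences and sequential steps."""
    sequences = [list(seq) for seq in batch]
    steps = 0
    for _ in range(n_new):
        # On a GPU all sequences of the batch would be processed together here,
        # but step t+1 cannot start before step t has produced its tokens.
        new_tokens = [toy_next_token(seq) for seq in sequences]
        for seq, tok in zip(sequences, new_tokens):
            seq.append(tok)
        steps += 1
    return sequences, steps
```

Decoding 20 tokens takes 20 sequential steps whether the batch holds one sequence or four. The batch of four yields 80 tokens in the same number of steps, which is why batching can raise utilization when there is memory to spare.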

In conclusion, as far as GPT-2 usage is concerned: if you want slightly faster deep learning and want to fit the largest GPT-2 model into GPU memory, upgrading from the 1660 to the 3060 seems like a valid choice. However, if you're fine with sampling the 345M or 774M model on the GPU, with the latter occasionally crashing, it may be better to stay with the 1660, as each of its CUDA cores seems to be faster, which gives shorter sampling times.


Test setup:

  • Ryzen 1600X, 32GB RAM
  • NVIDIA 3060 12GB, LHR model (secondary card, not used for screen)
  • Windows 7 x64 (that's right, still good ol' Windows, neither Windows 10 nor Linux)
  • Python 3.8.10 for Windows
  • TensorFlow 2.4.4
  • Latest drivers for Windows 7 x64: 473.04
  • CUDA 11, cuDNN 8.0
  • PATH with additional entries: [Python location] and [Python location]\Scripts; [cuDNN 8.0 location]\bin, [CUDA 11.0 location]\bin, [CUDA 11.0 location]\libnvvp, [CUDA 11.0 location]\extras\CUPTI\libx64 and [CUDA 11.0 location]\include
  • Modified vanilla GPT-2 code with a web page interface using a database, IN->OUT->IN looping of multiple mini-samples, and some post-processing for longer sentences: <|endoftext|> token removal and simple repeated text removal.
  • Modified nshepperd's code for deep learning: fixed learning rate updating (in the original code it does not work at all, as it updates the rate for the graph only, not for Adam), removed dangerous graph updates that slowed down the learning process and overloaded memory, added saving of the Adam state and the ability to remove it later, and made some other improvements to try to replicate a learning process similar to the one described by the GPT creators, including warm-up, attention dropout and a cosine schedule. Deep learning tests were performed on newly initialized models with custom parameters, after tokenization. (The fixed code is not available anywhere, as it is not cleaned up and not proven to be free of other bugs, though as far as tested it works properly.) Why not other code, like Hugging Face? Because we are going vanilla here, using the code of those who were first to get it working and just fixing and updating it to meet our needs.
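
For readers unfamiliar with the warm-up and cosine schedule mentioned in the setup above, here is a minimal sketch of one common form of it: linear warm-up followed by cosine decay. This is not the unpublished code described in this post. The function name, the parameters and the exact shape of the warm-up are all assumptions made for illustration.

```python
import math


def lr_at_step(step: int, max_lr: float, warmup_steps: int,
               total_steps: int, min_lr: float = 0.0) -> float:
    """Linear warm-up to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Warm-up: grow linearly and reach max_lr at the last warm-up step.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Whatever schedule is used, the computed value has to actually reach the optimizer at every step, which is the issue described above for the original training code.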


WARNING! The text below is the continuation generated by AI for demonstration purposes:


// The structure of models is common to simple's and deep which we don't power yet, we just use some state holder of basic ideas like // whether we are learning epoch or not, etc, should already be present anyway, that contains the features used // for training, and contains some training properties like block size, re-training state and have min_lr init rate // it holds input, output and samples from training model, and needs to be initialized to some states for test as well // initial: initial learning rate = 0.1, re-training state = 0, k-samples or max_features = 10 // given_state: example_state_list has 50 items, when we have only few samples for LSTM operations we break it into smaller // lists, so that it doesn't hold too many states and may be more compact for training, otherwise for the input we // need a model with only 100 items in params . final: static Random r = new Random(16384); min_lr = 0.005; std::vector<Int32> params_c; std::vector<float> samples_c; Tensor product_tensor = {0}; // y_test: x_train and y_test will be constant tensors, typically float32s thresh = 0.001; // trained gradient_y_updated is the model matrix of (row H, column L) that will be updated during training // // initialize_state: it contains initial model state . 
final: [[layer_init;_final]] = || {} .final: //[evaluate_grad_y_1_test;_final]] = && { for i := 0 ;i < 5000 ;i++ { r =; // find the lud of r // skip if there is no different for running tests in test mode, otherwise we don't support // multi-dimensional contributions r.get(); if ((r.get() != r.h) && (r.get() != r.l)) { // we have reported that the current test fails, return the error r = r.queable(); // we only ask if the initial test result is less than a regular failure per test // check whether we get a [positive or not] warning, otherwise don't do anything, if // there is one in the test mode do nothing else if ( continue; // print out the current state r.get(); // print the current version, and assign a temporary value r} // print out all errors printf("Error : %s the model was already initialized

", r.get()->get_previous()->get_error()); // add before the error buffer if (r.get(), r.get(), r.get() += 100 - 100) { r.get(); // print the current state before the error r.get(); // print out the current version, and assign a temporary value r} // print out the current state ... // and return the tentative error buffer - our error initializer run_err() { // although the error message is usually a single line, one error message is enough, // just to show the user a few messages, so it will be easier see the error later // we need a simple way to move the error left if at all possible r.reformat(); // discard some bad old errors ... printf("%s(%x) ", r.error(), r.err_count()); // add before the static buffer, very important nheq(r); // RNN operation over or over, to move the error forward or backward so it doesn't break the flow r = RNN(DataFrameOperation(params_c, r, 90, 0, 0, samples_c, samples_c, 0) + weights_c); // initialize our state .final: do_run(DataFrameOperation); while(r.get() != r.h) { r.get(); } // if all failures were recoverable, stop looping, we won't generate any inputs .... 
-1: rnt = r.queue_permutation(); -1: for (;;) { // traverse over all the successes [(max_train_y_original + 1) % max_chance_expected] for i := 0 ;i < max_chance_expected ;i++ ) // assess whether we extracted all the observations from the batch data of the model, // every difficulty, one, in the list if there are more than one observation from our batch, // it must be to set up for more information, and later evaluate the model `input` // predictions or `output` , i am only testing for removal { if (p.metrics() == see_all_see_all) break; // there are enough observations left not to extract more, you can wait up to // the next set of output models for extracting all observations // logistic regression, we want to make sure that no [repeated] observations will occur for this step - .lstm().logistic { // if there is multiple observations, that we saw earlier, we can find the earlier
