Tuesday, May 3, 2022

AMD Ryzens: prototype kind of CPUs?

Could AMD still be selling their CPUs in a not completely finalized form, as chips that are effectively still under development, or is Ryzen even a kind of prototype CPU in itself?

Based on my experience of two out of the three Ryzens I own having problems, at least for me, in my personal opinion, AMD sold me an unfinished product. And based on the problems I had with the latest one (4800H), I can believe it was in fact still in some prototype stage: not thoroughly tested, not a final product that should have been put on the market.

 

The AMD CPUs I own and their problems (if any):

  • Ryzen 1600X - I struggled a lot to stabilize this CPU as much as possible, yet I still get occasional application crashes every few days and a BSOD every month or so. The 1xxx series had problems with its RAM controller, and even though I swapped the RAM for an AMD-certified kit, which did help, the instability is still there, no matter what.
  • Ryzen 2500U - this CPU has had no problems so far, maybe because its clock is pretty low and its voltage doesn't jump as high as on high-end models. A relative of mine has the same CPU, also with no problems. So far this is the only Ryzen that works 100% stable for me without any tinkering.
  • Ryzen 4800H - this CPU has weird problems. They are not noticeable at first, until... you run TensorFlow in training + sampling mode. After around 24h (the shortest was 2h, the longest a few days) the notebook simply crashes, usually with a "hardware" issue. It does not matter if you force the TDP down to 20W; it will still crash.

 

I could understand the 1600X still having problems; it was the 1xxx series, a brand new line of products that the company should have tested well but where certain issues were somehow overlooked just to push it to the market.

 

I cannot, however, understand why my 4800H is still causing problems, and of a different kind than the 1600X. Isn't three generations enough time for someone to get their act together, test the CPU properly and thoroughly in various kinds of applications and, if possible, fix the problems or clearly state what the CPU may fail on?


So, how about I clarify what exactly causes my 4800H to fail? I tried to pinpoint the problem and found a way to make this CPU run completely stable, without any BSOD whatsoever.

There are two solutions, both related to the Windows power scheme:

  • Disable aggressive boost - if you are fine with your CPU running at 1.8 GHz just to stay stable, go for this option, or,
  • Disable idle - if you are fine with your CPU not saving power when unused and running at full speed instead, with no hint of instability, go for this one. However, if you overload your CPU, the system becomes somewhat less responsive in this mode (beats me why...), but at the same time some applications may run faster (I got a 10-15% speed-up in AI training just by disabling the idle state, with the power draw still at 35W). See the command-line sketch below.
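For reference, here is a minimal sketch of how these two tweaks can be applied from the command line with powercfg, wrapped in Python. The setting aliases perfboostmode and idledisable are my assumption of the relevant knobs, not something taken from this post; verify them with "powercfg /aliases" on your own machine before relying on this.

    import subprocess

    def powercfg(*args):
        """Run a powercfg command and fail loudly if it errors."""
        subprocess.run(["powercfg", *args], check=True)

    # Option 1: disable aggressive boost (processor performance boost mode = 0 / Disabled)
    powercfg("/setacvalueindex", "scheme_current", "sub_processor", "perfboostmode", "0")

    # Option 2 (instead): disable processor idle states (1 = idle disabled)
    # powercfg("/setacvalueindex", "scheme_current", "sub_processor", "idledisable", "1")

    # Apply the change to the active power scheme
    powercfg("/setactive", "scheme_current")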

 

Unfortunately there is no BIOS update that might mitigate this problem, nor does the BIOS offer any way to tinker with processor states to pinpoint it further. Nonetheless, I opted for disabling the idle state and controlling my CPU TDP with Renoir Mobile Tuning, as otherwise I am at risk of losing unsaved work.


I'm not the only one having problems with these CPUs; there are more people like this. When certain parts keep proving problematic, AMD should clearly audit them more closely to make sure no bad-quality CPUs reach the market. Until then, you can try a BIOS update or tinker with your power schemes to turn off certain CPU features and make the system more, if not completely, stable.

Friday, April 22, 2022

Will GPT-2 Large (1.5B) run on NVIDIA RTX 3060 12GB? (sampling)

Yes. The large GPT-2 model with 1.5B parameters will run on an RTX 3060 12GB, but it barely fits. It will, however, not load in TensorFlow <2.4, because you need CUDA 11, at least with this card. Anything below will make your GPU visible to TensorFlow, but it will hang during model loading (showing very slight GPU activity).

It will not load on TensorFlow >2.4 either, most likely because either TensorFlow or CUDA 11.2 takes a bit more memory, so the model no longer fits on the GPU and instead crashes with OOM every time.
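As a quick sanity check before loading the model, something like the following (TensorFlow 2.4.x API) confirms the card is visible and asks TensorFlow not to reserve all the memory up front; whether memory growth actually helps the 1.5B model fit is an assumption you would have to test yourself.

    import tensorflow as tf

    # List the GPUs TensorFlow can see; the 3060 must show up here.
    gpus = tf.config.list_physical_devices("GPU")
    print("Visible GPUs:", gpus)

    # Allocate GPU memory on demand instead of grabbing the whole 12 GB at once.
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)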

 

So what do you need?

  • Have the RTX 3060 as a secondary GPU so that it has as much spare memory as possible,
  • Have drivers compatible with CUDA 11 and cuDNN 8.0 (I used 473.04),
  • Use TensorFlow version 2.4.x only,
  • Slightly update the code if you want to keep the vanilla setup,
  • Do not output too many tokens at the same time,
  • Keep the number of output tokens low (like 20), and reduce it to 10 when the input is >750 tokens.
  • (optional) Keep looping output->input in an [input size] = [maximum length] - [# tokens to output] manner, keeping the last tokens in the input and removing the <|endoftext|> token from the output, to produce one "multi-sampled" long output (see the sketch after this list). For example, for 10 output tokens the input should be the last 1014 tokens; then remove 10 tokens from the beginning, append the output at the end of the input, and repeat. (tested with TOP-K 1400, temp: 0.95; other settings may require further token reduction; speed: 1T/~0.45s)
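For illustration, a minimal sketch of that looping scheme follows. The sample_tokens callable stands in for however you invoke GPT-2 sampling (the actual call in the vanilla/nshepperd code differs, so treat it as an assumption); the constants mirror the numbers above.

    MAX_CONTEXT = 1024   # GPT-2 context window
    OUT_TOKENS = 10      # tokens sampled per step
    EOT = 50256          # id of the <|endoftext|> token

    def multi_sample(prompt_tokens, steps, sample_tokens):
        """Chain short samples into one long "multi-sampled" output.

        sample_tokens(context, n) is assumed to return n newly sampled token ids."""
        window = MAX_CONTEXT - OUT_TOKENS                  # e.g. last 1014 input tokens
        context = list(prompt_tokens)[-window:]
        generated = list(prompt_tokens)
        for _ in range(steps):
            out = [t for t in sample_tokens(context, OUT_TOKENS) if t != EOT]
            generated.extend(out)
            # Slide the window: drop tokens from the front, append the fresh output.
            context = (context + out)[-window:]
        return generated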

 

Not yet known:

  • Will it crash during model loading at times due to different seeds?

 

Deep learning and sampling comparison, GTX 1660 vs RTX 3060, with GPT-2 (original code and modified nshepperd implementation):

Deep learning is a bit faster on the 3060, but sampling is slower than on the 1660. However, the 1660 barely fits the 774M model and sometimes crashes during model loading. On the 1660, sampling 20 tokens with the 774M model took 3~4s, whereas on the 3060 it took 5~6s. The 3060 sits at about 3% GPU core utilization during that time, which makes it clear that the increased number of CUDA cores does not help with sampling and in fact slows it down. Most likely this is because the performance increase (synthetic benchmark) is ~1.4x overall and ~1.7x for compute, while the number of CUDA cores increased ~2.4x going from the 1660 to the 3060.
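A back-of-the-envelope check using only the approximate ratios quoted above makes the per-core slowdown plausible:

    compute_speedup = 1.7   # synthetic compute benchmark, 3060 vs 1660
    core_ratio = 2.4        # increase in CUDA core count
    per_core = compute_speedup / core_ratio
    print(f"Relative per-core speed: {per_core:.2f}x")         # ~0.71x

    # The observed sampling times point the same way: ~3.5s vs ~5.5s for 20 tokens.
    print(f"Observed sampling speed ratio: {3.5 / 5.5:.2f}x")  # ~0.64x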

 

Why does sampling not fully utilize either of the cards?

I believe it is because GPT-2 has an auto-regressive decoder that really does decode one token at a time, so the per-step math is simply not enough work to fully utilize the GPU, and batching might speed up the process. However, if the model barely fits, batching is a problem as well...
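To make the sequential nature concrete, here is a minimal greedy-decoding sketch; step_fn stands in for one forward pass of the model (an assumption, not the actual GPT-2 code). Each new token depends on the previous one, so the passes cannot overlap and the GPU spends most of its time underfed.

    import numpy as np

    def sample_autoregressive(step_fn, tokens, n_new):
        """step_fn(tokens) -> logits over the vocabulary for the next token."""
        for _ in range(n_new):
            logits = step_fn(tokens)             # one full forward pass per new token
            next_token = int(np.argmax(logits))  # greedy pick, just for illustration
            tokens = tokens + [next_token]       # the next pass must wait for this result
        return tokens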


In conclusion, as far as GPT-2 usage is concerned: if you want slightly faster deep learning and want to fit the largest GPT-2 model into GPU memory, upgrading from the 1660 to the 3060 seems like a valid choice. However, if you're fine with sampling the 334M or 774M models on the GPU, with the latter occasionally crashing, it may be better to stay with the 1660, as each of its CUDA cores seems faster, giving shorter sampling times.


Setup:

  • Ryzen 1600X, 32GB RAM
  • NVIDIA 3060 12GB, LHR model (secondary card, not used for screen)
  • Windows 7 x64 (that's right, still good ol' Windows, neither Windows 10 nor Linux)
  • Python 3.8.10 for Windows
  • Tensorflow 2.4.4
  • Latest drivers for Windows 7 x64: 473.04
  • CUDA 11, cuDNN 8.0
  • PATH extended with: [Python location] and [Python location]\Scripts; [cuDNN 8.0 location]\bin; [CUDA 11.0 location]\bin, [CUDA 11.0 location]\libnvvp, [CUDA 11.0 location]\extras\CUPTI\libx64 and [CUDA 11.0 location]\include
  • Modified vanilla GPT-2 code with a web-page interface backed by a database, IN->OUT->IN multiple mini-sample looping, and some post-processing for longer sentences: EOT token removal and simple repeated-text removal.
  • Modified nshepperd's code for deep learning, with the learning-rate updating fixed (the original code does not really update it - it updates it for the graph only, not for ADAM), removal of dangerous graph updates that slow down the learning process and overload memory, saving of the ADAM state with the ability to remove it later, and some other improvements to try to replicate a learning process similar to the one described by the GPT creators, including warm-up, attention dropout and a cosine schedule (a sketch of such a schedule follows this list). Deep learning tests were performed on custom-parameter, newly initialized models after tokenization. (The fixed code is not available anywhere, as it is not cleaned up and has not been proven free of other bugs - though as far as tested it works properly.) Why not other code, like huggingface? Because we are going vanilla here, taking the code of those who made it work first and just fixing and updating it to meet our needs.
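For reference, a minimal sketch of a cosine learning-rate schedule with linear warm-up, in the spirit of the recipe mentioned above; the step counts and rates here are placeholders, not the values actually used in my training runs.

    import math

    def lr_at_step(step, warmup_steps=2000, total_steps=100000,
                   max_lr=2.5e-4, min_lr=0.0):
        """Linear warm-up followed by a cosine decay from max_lr down to min_lr."""
        if step < warmup_steps:
            return max_lr * step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * min(1.0, progress)))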

 

WARNING! The text below is the continuation generated by AI for demonstration purposes:

FORMAT STRUCTURE

// The structure of models is common to simple's and deep which we don't power yet, we just use some state holder of basic ideas like // whether we are learning epoch or not, etc, should already be present anyway, that contains the features used // for training, and contains some training properties like block size, re-training state and have min_lr init rate // it holds input, output and samples from training model, and needs to be initialized to some states for test as well // initial: initial learning rate = 0.1, re-training state = 0, k-samples or max_features = 10 // given_state: example_state_list has 50 items, when we have only few samples for LSTM operations we break it into smaller // lists, so that it doesn't hold too many states and may be more compact for training, otherwise for the input we // need a model with only 100 items in params . final: static Random r = new Random(16384); min_lr = 0.005; std::vector<Int32> params_c; std::vector<float> samples_c; Tensor product_tensor = {0}; // y_test: x_train and y_test will be constant tensors, typically float32s thresh = 0.001; // trained gradient_y_updated is the model matrix of (row H, column L) that will be updated during training // // initialize_state: it contains initial model state . final: [[layer_init;_final]] = || {} .final: //[evaluate_grad_y_1_test;_final]] = && { for i := 0 ;i < 5000 ;i++ { r = r.next(); // find the lud of r // skip if there is no different for running tests in test mode, otherwise we don't support // multi-dimensional contributions r.get(); if ((r.get() != r.h) && (r.get() != r.l)) { // we have reported that the current test fails, return the error r = r.queable(); // we only ask if the initial test result is less than a regular failure per test // check whether we get a [positive or not] warning, otherwise don't do anything, if // there is one in the test mode do nothing else if (r.name().empty()) continue; // print out the current state r.get(); // print the current version, and assign a temporary value r} // print out all errors printf("Error : %s the model was already initialized

", r.get()->get_previous()->get_error()); // add before the error buffer if (r.get(), r.get(), r.get() += 100 - 100) { r.get(); // print the current state before the error r.get(); // print out the current version, and assign a temporary value r} // print out the current state ... // and return the tentative error buffer - our error initializer run_err() { // although the error message is usually a single line, one error message is enough, // just to show the user a few messages, so it will be easier see the error later // we need a simple way to move the error left if at all possible r.reformat(); // discard some bad old errors ... printf("%s(%x) ", r.error(), r.err_count()); // add before the static buffer, very important nheq(r); // RNN operation over or over, to move the error forward or backward so it doesn't break the flow r = RNN(DataFrameOperation(params_c, r, 90, 0, 0, samples_c, samples_c, 0) + weights_c); // initialize our state .final: do_run(DataFrameOperation); while(r.get() != r.h) { r.get(); } // if all failures were recoverable, stop looping, we won't generate any inputs .... -1: rnt = r.queue_permutation(); -1: for (;;) { // traverse over all the successes [(max_train_y_original + 1) % max_chance_expected] for i := 0 ;i < max_chance_expected ;i++ ) // assess whether we extracted all the observations from the batch data of the model, // every difficulty, one, in the list if there are more than one observation from our batch, // it must be to set up for more information, and later evaluate the model `input` // predictions or `output` , i am only testing for removal { if (p.metrics() == see_all_see_all) break; // there are enough observations left not to extract more, you can wait up to // the next set of output models for extracting all observations // logistic regression, we want to make sure that no [repeated] observations will occur for this step - .lstm().logistic { // if there is multiple observations, that we saw earlier, we can find the earlier