Tuesday, May 3, 2022

AMD Ryzens: prototype kind of CPUs?

Could AMD still be selling CPUs in a not completely finalized form, as chips that are still under development, or is Ryzen itself still a kind of prototype CPU?

Based on my experience, with two out of the three Ryzens I own having problems, in my personal opinion AMD sold me an unfinished product. And based on the problems I had with the latest one (4800H), I can believe it was in fact still at some prototype stage: not thoroughly tested, not a final product that should have been put on the market.

 

The AMD CPUs I own and their problems (if any):

  • Ryzen 1600X - I struggled a lot to stabilize this CPU, and it still gives occasional application crashes every few days and a BSOD every month or so. The 1xxx series had problems with its RAM controller, and even though I switched to AMD-certified RAM, which helped, the instability is still there no matter what.
  • Ryzen 2500U - this CPU has had no problems so far. Maybe because its clocks are pretty low and voltages don't jump as high as on high-end models. A relative of mine has the same CPU, no problems either. So far this is the only Ryzen that works 100% stable for me without any tinkering.
  • Ryzen 4800H - this CPU has weird problems. They are not noticeable at first, until... you run tensorflow in training + sampling mode. After around 24h (shortest was 2h, longest was a few days) the notebook simply crashes, usually with a "hardware" issue. It does not matter if you force the TDP down to 20W; it will still crash.

 

I could understand the 1600X still having problems - it was the 1xxx series, a brand-new product line that the company should have tested well but somehow overlooked certain issues in, just to push it to the market.

 

I cannot, however, understand why my 4800H is still causing problems, and of a different kind than the 1600X. Isn't three series enough time for someone to get their act together, test the CPU properly and thoroughly in various kinds of applications and, if possible, fix the problems or clearly state what the CPU may fail on?


So, how about I clarify what exactly causes my 4800H to fail? I tried to pinpoint the problem, and I found a way to get this CPU running completely stable without any BSOD whatsoever.

There are two workarounds, both related to the Windows power scheme:

  • Disable Aggressive boost - if you are fine with your CPU running at just 1.8 GHz to stay stable, go for this option, or,
  • Disable Idle - if you are fine with your CPU not saving power when unused and running at full speed instead, with no hint of instability at the same time, go for this one. However, if you overload your CPU, the system becomes somewhat less responsive in this mode (beats me why...), but at the same time some applications may run faster (I got a 10-15% speed-up in AI training just by disabling the Idle state, with the power draw still at 35W). A sketch of how to toggle this setting follows below.
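
The sketch assumes an elevated (administrator) prompt and that the standard powercfg aliases (SUB_PROCESSOR, IDLEDISABLE) exist on your Windows build - verify with "powercfg /aliases" first:

import subprocess

def set_idle_disabled(disabled: bool) -> None:
    value = "1" if disabled else "0"
    # IDLEDISABLE is the alias of the hidden "processor idle disable" setting
    subprocess.run(["powercfg", "/setacvalueindex", "SCHEME_CURRENT",
                    "SUB_PROCESSOR", "IDLEDISABLE", value], check=True)
    # re-apply the current scheme so the change takes effect
    subprocess.run(["powercfg", "/setactive", "SCHEME_CURRENT"], check=True)

set_idle_disabled(True)  # run at full speed, skip idle states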

 

Unfortunately, I have no BIOS update that might mitigate this problem, nor does the BIOS offer any kind of tinkering with processor states to pinpoint it further. Nonetheless, I opted for disabling the Idle state and controlling my CPU's TDP with Renoir Mobile Tuning, as otherwise I am at risk of losing my unsaved work.


I'm not the only one having problems with these CPUs; there are more people like this. When certain parts keep being problematic, AMD should clearly run more auditing on the matter to make sure no bad-quality CPUs reach the market. Until then, you can try a BIOS update or tinker with your power schemes to turn off certain CPU functionality to make it more, if not completely, stable.

Friday, April 22, 2022

Will GPT-2 Large (1.5B) run on NVIDIA RTX 3060 12GB? (sampling)

Yes. The GPT-2 large model with 1.5B parameters will run on an RTX 3060 12GB, but it barely fits. It will, however, not load on tensorflow <2.4, because this card needs CUDA 11. Anything below will make your GPU visible to tensorflow, but loading the model will simply hang (showing very slight GPU activity).

It will not load on tensorflow >2.4 either, most likely because either tensorflow or CUDA 11.2 takes a bit more memory, so the model no longer fits on the GPU and instead crashes with OOM every time.
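
Before loading the model, it's worth confirming that the card is actually visible and letting tensorflow grow its allocations on demand instead of grabbing everything up front. A minimal sketch, assuming TF 2.4.x (it may not be enough to avoid the OOM on its own, but it rules out the obvious):

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
print(gpus)  # should list the RTX 3060
for gpu in gpus:
    # allocate GPU memory as needed instead of reserving it all at startup
    tf.config.experimental.set_memory_growth(gpu, True)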

 

So what do you need?

  • Have the RTX 3060 as a secondary GPU so that it has as much spare memory as possible,
  • Have drivers compatible with CUDA 11 and cuDNN 8.0 (I used 473.04),
  • Use tensorflow version 2.4.x only,
  • Slightly update the code if you want to keep a vanilla setup,
  • Do not output too many tokens at once - keep the output token count low (around 20) and reduce it to 10 when the input exceeds 750 tokens,
  • (optional) Keep looping output->input in an [input size] = [maximum length] - [# of tokens to output] manner, keeping the last tokens in the input and removing the <|endoftext|> token from the output, to produce one "multi-sampled" long output. For example, for 10 output tokens the input should be the last 1014 tokens; then remove 10 tokens from the beginning, append the output at the end of the input, and repeat (sketched below). (tested with top-k 1400, temp 0.95; other settings may require further token reduction; speed: 1T/~0.45s)
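
Here is a minimal sketch of that looping scheme. sample_step() is a hypothetical wrapper around the vanilla GPT-2 sampling code that returns n newly generated token ids; 50256 is the <|endoftext|> id in the GPT-2 vocabulary:

CONTEXT = 1024  # GPT-2 maximum context length
EOT = 50256     # <|endoftext|> token id

def multi_sample(input_tokens, tokens_per_step=10, steps=50):
    window = list(input_tokens)
    generated = []
    for _ in range(steps):
        # keep only the last CONTEXT - tokens_per_step tokens as input
        window = window[-(CONTEXT - tokens_per_step):]
        out = sample_step(window, tokens_per_step)  # hypothetical sampler
        out = [t for t in out if t != EOT]          # drop <|endoftext|>
        window.extend(out)
        generated.extend(out)
    return generated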

 

Not yet known:

  • Will it crash during model loading at times due to different seeds?

 

Deep learning and sampling comparison, GTX 1660 vs RTX 3060, with GPT-2 (original code and modified nshepperd implementation):

Deep learning is a bit faster on the 3060, but sampling is slower than on the 1660. However, the 1660 barely fits 774M into memory and sometimes crashes during model loading. On the 1660, sampling 20T with the 774M model took 3~4s, whereas on the 3060 it took 5~6s. The 3060 sits at 3% GPU core utilization during that time, which makes it clear that the increased number of CUDA cores does not help with sampling - in fact, it slows it down. Most likely because the performance increase (synthetic benchmark) is ~1.4x overall and ~1.7x for compute, while the number of CUDA cores increased ~2.4x going from the 1660 to the 3060.

 

Why does sampling not fully utilize either card?

I believe it is because GPT-2 is an auto-regressive decoder that really does decode one token at a time, so all that fancy mathematics is simply not enough to fully utilize the GPU, and batching could speed up the process. However, if the model barely fits, batching is a problem as well...


In conclusion, as far as GPT-2 usage is concerned: if you want somewhat faster deep learning and want to fit the largest GPT-2 model into GPU memory, upgrading from a 1660 to a 3060 seems like a valid choice. However, if you're fine with 345M or 774M sampling on the GPU (with the latter occasionally crashing), it may be better to stay with the 1660, as each of its CUDA cores seems faster, giving shorter sampling times.


Setup:

  • Ryzen 1600X, 32GB RAM
  • NVIDIA 3060 12GB, LHR model (secondary card, not used for screen)
  • Windows 7 x64 (that's right, still good ol' Windows - neither Windows 10 nor Linux)
  • Python 3.8.10 for Windows
  • Tensorflow 2.4.4
  • Latest drivers for Windows 7 x64: 473.04
  • CUDA11, cuDNN 8.0
  • PATH with additional data: [Python location]; [Python location]\Scripts; [cuDNN 8.0 location]\bin; [CUDA 11.0 location]\bin; [CUDA 11.0 location]\libnvvp; [CUDA 11.0 location]\extras\CUPTI\libx64; [CUDA 11.0 location]\include
  • Modified vanilla GPT-2 code for a webpage interface using a database, IN->OUT->IN multiple mini-sample looping, and some post-processing for longer sentences: eot token removal and simple repeated-text removal.
  • Modified nshepperd's code for deep learning, with learning-rate updating fixed (in the original code it does not work at all - it updates the rate for the graph only, not for ADAM), removal of dangerous graph updates that slow down learning and overload memory, saving of the ADAM state with the ability to remove it later, and some other improvements to replicate a learning process similar to the one described by the GPT creators, including warm-up, attention dropout and a cosine schedule (a sketch of such a schedule follows below). Deep learning tests were performed on custom-parameter, newly initialized models after tokenization. (The fixed code is not available anywhere, as it is not cleaned up and not proven free of other bugs - though as far as tested it works properly.) Why not other code, like huggingface? Because we are going vanilla here, using the code of those who first got it working, just fixing and updating it to meet our needs.
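
For illustration, here is the warm-up plus cosine learning-rate schedule I aimed to replicate; the constants are placeholders, not the exact values from my runs:

import math

def lr_at(step, base_lr=2.5e-4, warmup_steps=2000,
          total_steps=100_000, min_lr=0.0):
    # linear warm-up, then cosine decay down to min_lr
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))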

 

WARNING! The text below is the continuation generated by AI for demonstration purposes:

FORMAT STRUCTURE

// The structure of models is common to simple's and deep which we don't power yet, we just use some state holder of basic ideas like // whether we are learning epoch or not, etc, should already be present anyway, that contains the features used // for training, and contains some training properties like block size, re-training state and have min_lr init rate // it holds input, output and samples from training model, and needs to be initialized to some states for test as well // initial: initial learning rate = 0.1, re-training state = 0, k-samples or max_features = 10 // given_state: example_state_list has 50 items, when we have only few samples for LSTM operations we break it into smaller // lists, so that it doesn't hold too many states and may be more compact for training, otherwise for the input we // need a model with only 100 items in params . final: static Random r = new Random(16384); min_lr = 0.005; std::vector<Int32> params_c; std::vector<float> samples_c; Tensor product_tensor = {0}; // y_test: x_train and y_test will be constant tensors, typically float32s thresh = 0.001; // trained gradient_y_updated is the model matrix of (row H, column L) that will be updated during training // // initialize_state: it contains initial model state . final: [[layer_init;_final]] = || {} .final: //[evaluate_grad_y_1_test;_final]] = && { for i := 0 ;i < 5000 ;i++ { r = r.next(); // find the lud of r // skip if there is no different for running tests in test mode, otherwise we don't support // multi-dimensional contributions r.get(); if ((r.get() != r.h) && (r.get() != r.l)) { // we have reported that the current test fails, return the error r = r.queable(); // we only ask if the initial test result is less than a regular failure per test // check whether we get a [positive or not] warning, otherwise don't do anything, if // there is one in the test mode do nothing else if (r.name().empty()) continue; // print out the current state r.get(); // print the current version, and assign a temporary value r} // print out all errors printf("Error : %s the model was already initialized

", r.get()->get_previous()->get_error()); // add before the error buffer if (r.get(), r.get(), r.get() += 100 - 100) { r.get(); // print the current state before the error r.get(); // print out the current version, and assign a temporary value r} // print out the current state ... // and return the tentative error buffer - our error initializer run_err() { // although the error message is usually a single line, one error message is enough, // just to show the user a few messages, so it will be easier see the error later // we need a simple way to move the error left if at all possible r.reformat(); // discard some bad old errors ... printf("%s(%x) ", r.error(), r.err_count()); // add before the static buffer, very important nheq(r); // RNN operation over or over, to move the error forward or backward so it doesn't break the flow r = RNN(DataFrameOperation(params_c, r, 90, 0, 0, samples_c, samples_c, 0) + weights_c); // initialize our state .final: do_run(DataFrameOperation); while(r.get() != r.h) { r.get(); } // if all failures were recoverable, stop looping, we won't generate any inputs .... -1: rnt = r.queue_permutation(); -1: for (;;) { // traverse over all the successes [(max_train_y_original + 1) % max_chance_expected] for i := 0 ;i < max_chance_expected ;i++ ) // assess whether we extracted all the observations from the batch data of the model, // every difficulty, one, in the list if there are more than one observation from our batch, // it must be to set up for more information, and later evaluate the model `input` // predictions or `output` , i am only testing for removal { if (p.metrics() == see_all_see_all) break; // there are enough observations left not to extract more, you can wait up to // the next set of output models for extracting all observations // logistic regression, we want to make sure that no [repeated] observations will occur for this step - .lstm().logistic { // if there is multiple observations, that we saw earlier, we can find the earlier

Friday, June 26, 2020

Real Time Corruptor automation?




Corruptions, corruptions, corruptions. They can happen anywhere, at the most critical moment - like when copying important financial data from one place to another. But with backups we are safe, yay! But this is not about backups...

So what is Real Time Corruptor (RTC for short)? It's a piece of software that hooks into an emulator's* memory and changes random bits into other bits, making the game run in an unstable way and, in the worst-case scenario, crash completely (see the toy sketch below).
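
Here's that toy sketch - not RTC's actual code, just the idea of flipping random bits in a buffer at a given intensity:

import random

def corrupt(memory: bytearray, intensity: int) -> None:
    for _ in range(intensity):
        i = random.randrange(len(memory))
        memory[i] ^= 1 << random.randrange(8)  # flip one random bit

ram = bytearray(1024)  # stand-in for a chunk of emulator memory
corrupt(ram, intensity=50)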

This is what it looks like:


But it's not about RTC either - it's about how you could automate (or integrate) it with something beyond just clicking "Start Auto-Corrupt". Well, with the magic of programming, you can.

Lately I wrote a small console app in C that runs a desired intensity level when launched with a numeric parameter - that intensity. "What for? You still need to run it manually," you may ask. Well, not really, if you develop something else that can run shell apps. Python is helpful for that. Without much hassle one can create a Python script that runs the application on command - but what for? Python can be integrated into a lot of tools... *cough* streaming software *cough*... and then controlled through those means by something external triggering it - there are those sensor things for developer boards. How about reading loudness and feeding it in as corruption intensity? What? You can do that? Yes! (See the sketch below.)
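
A minimal sketch of the loudness-to-intensity idea. The console app name rtc_intensity.exe is hypothetical (mine stays private), and the sketch assumes the sounddevice library for reading the microphone:

import subprocess
import numpy as np
import sounddevice as sd

def on_audio(indata, frames, time, status):
    loudness = float(np.sqrt(np.mean(indata ** 2)))  # RMS of this block
    intensity = int(loudness * 1000)                 # arbitrary scaling
    # hand the intensity to the (hypothetical) C console app
    subprocess.run(["rtc_intensity.exe", str(intensity)])

with sd.InputStream(callback=on_audio, channels=1,
                    samplerate=44100, blocksize=44100):  # ~1s per callback
    input("Corrupting by loudness - press Enter to stop\n")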

Unfortunately, both tools (the app and the script) are currently closed-source and private, because they were developed with a certain purpose in mind (no, I'm not revealing it at this point). But hey, I can guarantee you that I had it working as a proof of concept. Don't believe me? Fine.

And no, I used plain C, not AutoHotkey (hint, hint).


* - note that emulation may be banned in some countries. Always use a title you own and hardware you own if you really need to do this (there is just no other way to test how a game acts during the corruption process)
 
The text below was generated by AI (GPT-2 1.5B). All information is false and is purely for demonstration purposes: 

/* * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * a * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program. If not, see <http://www.gnu.org/licenses/>. */ import sys, os import colour char = 0x00 char_buffer = [ 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 ] GPIO_MODE_INT = 1 GPIO_MODE_DMA = 4 GPIO_MODE_HAS_PRIV = 5 GPIO_MODE_VCTRL2 = 6 GPIO_MODE_VAL3 = 7 GPIO_MODE_PRIV = 8 GPIO_MODE_SLEEP = 9 GPIO_MODE_DMA = 10 GPIO_MODE_DIRMA_VCTRL = 11 GPIO_MODE_MAX28 = 12 GPIO_MODE_MIN2 = 13 GPIO_MODE_JOYB3 = 14 GPIO_MODE_LIBRARY = 15 16 /* Power off, mode and mode_reserved space for arbitrary purposes */ 17 GPIO_UNUSED_MAX = GPIO_UNUSED_MAX + 1 18 SET_RBUFSIZ 8000000 19 GPIO_KEY = 0x00 20 GPIO_UNUSED = GPIO_UNUSED - 1 21 GPIO_USE_UBOOK = 0x01 22 SET_RBUFSIZ 3000000 23 GPIO_UNUSED_0 = GPIO_UNUSED_0 + ( 1 << 13 ) 24 GPIO_UNUSED_1 = GPIO_UNUSED_1 + ( 1 << 12 ) 25 GPIO_UNUSED_2 = GPIO_UNUSED_2 + ( 1 << 11 ) 26 GPIO_UNUSED_3 = GPIO_UNUSED_3 + ( 1 << 10 ) 27 GPIO_UNUSED_4 = GPIO_UNUSED_4 + ( 1 << 9 ) 28 GPIO_UNUSED_5 = GPIO_UNUSED_5 + ( 1 << 8 ) 29 GPIO_UNUSED_6 = GPIO_UNUSED_6 + ( 1 << 7 ) 30 GPIO_UNUSED_7 = GPIO_UNUSED_7 + ( 1 << 6 ) 31 GPIO_UNUSED_8 = GPIO_UNUSED_8 + ( 1 << 5 ) 32 GPIO_UNUSED_9 = GPIO_UNUSED_9 + ( 1 << 4 ) 33 GPIO_UNUSED_0_EN = GPIO_UNUSED_0_EN + ( 1 << 3 ) 34 GPIO_UNUSED_1_EN = GPIO_UNUSED_1_EN + ( 1 << 2 ) 35 GPIO_UNUSED_2_EN = GPIO_UNUSED_2_EN + ( 1 << 1 ) 36 GPIO_UNUSED_3_EN = GPIO_UNUSED_3_EN + ( 1 << 0 ) 37 GPIO_UNUSED_4_EN = GPIO_UNUSED_4_EN + ( 1 << 0 ) 38 GPIO_UNUSED_5_EN = GPIO_UNUSED_5_EN + ( 1 << 0 ) 39 GPIO_UNUSED_6_EN = GPIO_UNUSED_6_EN + ( 1 << 0 ) 40 GPIO_UNUSED_7_EN = GPIO_UNUSED_7_EN + ( 1 << 0 ) 41 GPIO_UNUSED_8_EN = GPIO_UNUSED_8_EN + ( 1 << 0 ) 42 GPIO_UNUSED_9_EN = GPIO_UNUSED_9_EN + ( 1 << 0 ) 43 44 HID_KEY = 0x00 45 SET_RBUFSIZ 20 46 KEY_LENASELEN = 7 47 GPIO_EPROM_PORT = 0xA 48 49 50 51 52

Editing AHX music

 

This is what editing music in the AHX format looks like. AHX files are small files created in this tracker, which gives tunes that unique chiptune vibe. In this tracker, however, you can define multiple notes and effects as an instrument itself, producing those chiptune-like vibrations later just by using the final, desired notes in the tracker, without any other effects added (if you wish).

If you want to try it yourself, you will need an Amiga computer with a hard drive installed, or some means of running it in software.

1. You need a proper Amiga running for this; I used an Amiga 1200,
2. Connect a hard drive to it, boot the Workbench 3.0 floppy disk and install Workbench on your hard drive,
3. Download AYS-A233.LHA from the AHX site and extract the archive,
4. Copy the extracted files to your Amiga hard drive,
5. When in Workbench, open the hard drive and pick "view all files" from the menu (just browse a bit, you will find it),
6. Browse to the directory where AHX is installed and double-click AHX-68000 (or AHX-68020 if your Amiga has a suitable CPU),
7. You should be ready to go.

Remember that a right mouse click brings up the menu on Amiga computers.

Sorry for not being more detailed - I wanted to keep it brief so you can try to figure things out. I think it's always good to learn by trying things yourself, not just following step-by-step instructions.

Also, a small update: I cut the number of ads down to a minimum, because I hate those lagging sites that work like a piece of junk even on my Ryzen... I hope it's OK to keep at least a bit of that content, right? We have those distractions everywhere nowadays... Oh, look... an AD... :)