Sunday, July 10, 2022

Improving VDSL speed with Draytek routers.

Please note: this method is not guaranteed to work, nor is it guaranteed that DLM won't trigger line banding. It is, however, a method that worked in my case.


First of all: my line is configured as 80/8 Mbps; however, only 55/7.5 Mbps is attainable with stock Draytek 2760n Delight settings, working in Interleave mode. That said, I'm currently synced at 59/7.5 Mbps and I'm still checking whether it can be improved further without any physical changes to the line.


Draytek has a console with this magical command: vdsl snr [-50~50], which lets you control your downstream Signal to Noise Ratio margin. The value is in 0.1 dB steps, so 50 means a +5 dB offset. What I see in practice is that when you set it to 50, the modem will refuse to accept a signal lower than the standard SNR +5 dB, which may be good for stabilizing the line if the margin used by your provider is not enough and you want to find the sweet spot manually, before DLM triggers and possibly cuts the speed below what you could achieve as stable by setting the margin by hand.

My provider wants me to work at 8 dB SNR, giving around 55 Mbps downstream. When I force a +5 dB margin I get 14 dB SNR and a sync speed of around 40 Mbps. However, the CRC and ES counts still show up with that setting, with absolutely no difference between 8 dB and 14 dB. That makes me believe the line is not well maintained by our provider here (Orange Polska): not properly shielded, or using faulty devices (or shortwave and other interference surges are so high that they can't be avoided), since even 14 dB SNR at 14 dB attenuation generates errors, and in exactly the same amount as 8 dB SNR.

The good thing is that the vdsl snr command also accepts a negative margin, and I can confirm it works to some extent on my line. But as simple as it is to set positive values, it's not that straightforward when you want the DSLAM to give you more. I guarantee it doesn't like it when you suddenly reset the line from your side and negotiate a lower SNR. I'm currently at a -1 dB offset; although the reported SNR is still 8 dB, my downstream sync speed is higher, and offsetting it by -0.1 dB did indeed raise it by approximately 0.5 Mbps at a time, until I forgot to do it properly and had 0.2 dB "stolen" from me.


So how to do it properly?

First, remember that your line is under constant measurement for errors and link (retrain) counts. Most forums describe it as 15-minute checkpoints with a ~24-hour summary. Thus you SHOULD NOT reset your line too often, or preferably at all, or you may risk being banded. That said, there is a bit of "magic" you can perform to reset the line very gracefully for the DSLAM, which I found working well in my case. This method does not necessarily require you to unplug the phone line and wait 30 minutes. In fact, disconnecting the phone line from the modem could possibly send some "not nice" noise back to the DSLAM, which you would like to avoid at all costs, just in case.


To see whether you can sync at a better speed, you need a lot of patience: measure your line over time (vdsl status) to find the sweet spot when the DSLAM "wants" to give you one of the best attainable speeds. For me it's 4~6 AM local time.
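
As a side note, a few lines of Python (my own sketch, not from Draytek) can help with that measurement: save the raw vdsl status output to a text file (vdsl_status.txt is just a placeholder name) and extract the downstream attainable rate, so you can log it over the day and see when the DSLAM offers the most.

import re

def attainable_rate_mbps(status_text):
    # Matches the "DS Attainable Rate : 61409048 bps" line from the dumps below.
    m = re.search(r"DS Attainable Rate\s*:\s*(\d+)\s*bps", status_text)
    return int(m.group(1)) / 1_000_000 if m else None

with open("vdsl_status.txt") as f:
    print(attainable_rate_mbps(f.read()))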

Do NOT use vdsl snr with negative values right away. First use vdsl idle on and wait until your line resets by itself; this makes your modem quiet so the DSLAM can gracefully RESET the line. After this you can wait 30 minutes or not, depending on how easily line banding is triggered with your provider. Next, decrement the SNR margin very slightly, preferably by 1 (0.1 dB) at a time, i.e. vdsl snr -1, then vdsl snr -2, and so on, preferably every 24 hours (it can work more often, but to avoid banding it is best to wait between resets). Do not forget vdsl idle on: if you do, you may not get any higher speeds, and doing it correctly later may even result in the DSLAM taking 0.5 Mbps away from you. Always run vdsl idle on first when reducing the SNR margin below 0, acting like a gentleman with the DSLAM.


My experience:

- every time I reduced the SNR margin by 0.1 dB I got +0.5 Mbps of sync speed

- when I forgot vdsl idle on once, I got the same sync speed but a higher attainable speed

- when I corrected my mistake right away, I ended up with 0.5 Mbps being taken from me

- when I reduced the SNR by another 0.1 dB (the third change that day, after 2 h), I got my 0.5 Mbps back

- Result? After forgetting vdsl idle on just once, I ended up with an extra -0.2 dB of margin offset and absolutely no change in the actual downlink rate, which may either be a limit for my line or mean that DLM banded me a bit, though I was able to regain the speed at the cost of 0.2 dB of fine-tuning headroom.


Type ? for command help
vdsl status

  ---------------------- ATU-R Info (hw: annex A, f/w: annex A/B/C) -----------
   Running Mode            :      17A       State                : SHOWTIME
   DS Actual Rate          : 56506000 bps   US Actual Rate       :  7616000 bps
   DS Attainable Rate      : 61409048 bps   US Attainable Rate   :  7897000 bps
   DS Path Mode            :  Interleave    US Path Mode         :  Interleave 
   DS Interleave Depth     :      507       US Interleave Depth  :      101 
   NE Current Attenuation  :       18 dB    Cur SNR Margin       :        8  dB
   DS actual PSD           :   -10.-5 dB    US actual PSD        :    14. 5  dB
   NE CRC Count            :      176       FE CRC Count         :     9463
   NE ES Count             :       20       FE  ES Count         :     2349
   Xdsl Reset Times        :        0       Xdsl Link  Times     :        8
   ITU Version[0]          : 00000000       ITU Version[1]       : 00000000
   VDSL Firmware Version   : 05-07-09-0F-01-07   [with Vectoring support] 
   Power Management Mode   : DSL_G997_PMS_L0 
   Test Mode               : DISABLE 
  -------------------------------- ATU-C Info ---------------------------------
   Far Current Attenuation :       14 dB    Far SNR Margin       :        8  dB
   CO ITU Version[0]       : b5004244       CO ITU Version[1]    : 434da49e
   DSLAM CHIPSET VENDOR    : < BDCM >
vdsl idle

% Usage : adsl idle [on | tcpmessage | tcpmessage_off]
% DSL is under [DISABLE] test mode.
% DSL debug tool mode is off.
vdsl idle on

% DSL is under [IDLE/QUIET] test mode.
% DSL debug tool mode is off.
vdsl snr -3

VDSL SNR update successfully !
Restarting modem ...
vdsl status

  ---------------------- ATU-R Info (hw: annex A, f/w: annex A/B/C) -----------
   Running Mode            :      17A       State                : SHOWTIME
   DS Actual Rate          : 57186000 bps   US Actual Rate       :  7702000 bps
   DS Attainable Rate      : 61619704 bps   US Attainable Rate   :  7931000 bps
   DS Path Mode            :  Interleave    US Path Mode         :  Interleave 
   DS Interleave Depth     :      513       US Interleave Depth  :      101 
   NE Current Attenuation  :       18 dB    Cur SNR Margin       :        8  dB
   DS actual PSD           :   -10.-5 dB    US actual PSD        :    14. 5  dB
   NE CRC Count            :        0       FE CRC Count         :     9463
   NE ES Count             :        0       FE  ES Count         :     2349
   Xdsl Reset Times        :        0       Xdsl Link  Times     :        9
   ITU Version[0]          : 00000000       ITU Version[1]       : 00000000
   VDSL Firmware Version   : 05-07-09-0F-01-07   [with Vectoring support] 
   Power Management Mode   : DSL_G997_PMS_L0 
   Test Mode               : DISABLE 
  -------------------------------- ATU-C Info ---------------------------------
   Far Current Attenuation :       14 dB    Far SNR Margin       :        8  dB
   CO ITU Version[0]       : 00000000       CO ITU Version[1]    : 00000000
   DSLAM CHIPSET VENDOR    : < ----- >


The following day:

vdsl status

  ---------------------- ATU-R Info (hw: annex A, f/w: annex A/B/C) -----------
   Running Mode            :      17A       State                : SHOWTIME
   DS Actual Rate          : 57186000 bps   US Actual Rate       :  7702000 bps
   DS Attainable Rate      : 61563928 bps   US Attainable Rate   :  7931000 bps
   DS Path Mode            :  Interleave    US Path Mode         :  Interleave 
   DS Interleave Depth     :      513       US Interleave Depth  :      101 
   NE Current Attenuation  :       18 dB    Cur SNR Margin       :        8  dB
   DS actual PSD           :   -10.-5 dB    US actual PSD        :    14. 5  dB
   NE CRC Count            :       46       FE CRC Count         :     9475
   NE ES Count             :        8       FE  ES Count         :     2354
   Xdsl Reset Times        :        0       Xdsl Link  Times     :        9
   ITU Version[0]          : 00000000       ITU Version[1]       : 00000000
   VDSL Firmware Version   : 05-07-09-0F-01-07   [with Vectoring support] 
   Power Management Mode   : DSL_G997_PMS_L0 
   Test Mode               : DISABLE 
  -------------------------------- ATU-C Info ---------------------------------
   Far Current Attenuation :       14 dB    Far SNR Margin       :        8  dB
   CO ITU Version[0]       : b5004244       CO ITU Version[1]    : 434da49e
   DSLAM CHIPSET VENDOR    : < BDCM >
vdsl idle on

% DSL is under [IDLE/QUIET] test mode.
% DSL debug tool mode is off.
vdsl snr -4

VDSL SNR update successfully !
Restarting modem ...
vdsl status

  ---------------------- ATU-R Info (hw: annex A, f/w: annex A/B/C) -----------
   Running Mode            :      17A       State                : SHOWTIME
   DS Actual Rate          : 57631000 bps   US Actual Rate       :  7748000 bps
   DS Attainable Rate      : 61981992 bps   US Attainable Rate   :  8040000 bps
   DS Path Mode            :  Interleave    US Path Mode         :  Interleave 
   DS Interleave Depth     :      517       US Interleave Depth  :      101 
   NE Current Attenuation  :       18 dB    Cur SNR Margin       :        8  dB
   DS actual PSD           :   -10.-5 dB    US actual PSD        :    14. 5  dB
   NE CRC Count            :        0       FE CRC Count         :     9484
   NE ES Count             :        0       FE  ES Count         :     2355
   Xdsl Reset Times        :        0       Xdsl Link  Times     :       10
   ITU Version[0]          : 00000000       ITU Version[1]       : 00000000
   VDSL Firmware Version   : 05-07-09-0F-01-07   [with Vectoring support] 
   Power Management Mode   : DSL_G997_PMS_L0 
   Test Mode               : DISABLE 
  -------------------------------- ATU-C Info ---------------------------------
   Far Current Attenuation :       14 dB    Far SNR Margin       :        8  dB
   CO ITU Version[0]       : 00000000       CO ITU Version[1]    : 00000000
   DSLAM CHIPSET VENDOR    : < ----- >

 

SNR margin +4dB: [image]

SNR margin -0.1dB: [image]

SNR margin -1dB: [image]

Your mileage may vary.


Below is a Python script to produce images from the output of the commands vdsl showbins and vdsl showbins up, saved as text files bins_dl.txt and bins_ul.txt respectively. Both start with bins 0, 1, 2, 3 and end with 4092, 4093, 4094, 4095 (blank lines, such as a trailing newline, are skipped). The script saves three files (see the filenames list variable). Labels may not be visible because the plot is adjusted to fit the image size; change the plt.subplots_adjust numeric parameters if you prefer otherwise.

import matplotlib.pyplot as plt

def loaddata(file):
    # Parse a "vdsl showbins" dump: each record is "bin snr gain bits",
    # with several records per line separated by "*".
    with open(file, mode="r") as f:
        data = f.read()
    data = data.replace("*", "\n")
    bins, bits, snrs, gains = [], [], [], []
    for line in data.split("\n"):
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or partial lines (e.g. a trailing newline)
        bins.append(int(fields[0]))
        snrs.append(int(fields[1]))
        gains.append(int(fields[2]))
        bits.append(int(fields[3]))
    return bins, bits, snrs, gains

data_dl = loaddata("bins_dl.txt")
data_ul = loaddata("bins_ul.txt")

ylabels = ["Bits", "SNR dB", "Gain .1dB"]
filenames = ["dsl_bits_per_bin.png", "dsl_snr_per_bin.png", "dsl_gain_per_bin.png"]

# Index 0 of each dataset holds the bin numbers; 1..3 hold bits, SNR and gain.
for x in range(3):
    plt.clf()
    plt.xlabel("Bins")
    plt.ylabel(ylabels[x])
    plt.bar(data_dl[0], data_dl[x + 1], label="Downstream")
    plt.bar(data_ul[0], data_ul[x + 1], label="Upstream")
    plt.subplots_adjust(bottom=0.05, left=0.07, right=1, top=1)
    plt.legend()
    plt.savefig(filenames[x], dpi=1200)


The commands, the bins dump and the image-generating Python could probably be fully automated on a Linux machine, using SSH to control the router and pull data from it, including finding the best possible time to gracefully reset the connection for the best sync speeds at a 24h+ interval. A rough sketch follows.
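
Something like this (my own rough sketch, not a finished script) could drive it; it assumes the Draytek console is reachable over SSH from the Linux box, that paramiko is installed, and that the host and credentials below are placeholders you replace with your own.

import time
import paramiko

def vdsl_command(cmd, host="192.168.1.1", user="admin", password="admin"):
    # Open the router's interactive console over SSH and run one command.
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=user, password=password)
    shell = client.invoke_shell()
    time.sleep(1)
    shell.send((cmd + "\n").encode())
    time.sleep(2)
    output = shell.recv(65535).decode(errors="ignore")
    client.close()
    return output

# Graceful reduction, as described above: quiet the line first, then lower the margin.
print(vdsl_command("vdsl status"))
vdsl_command("vdsl idle on")
vdsl_command("vdsl snr -4")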

Tuesday, May 3, 2022

AMD Ryzens: prototype kind of CPUs?

Could AMD still be selling their CPUs in a not completely finalized form, as chips still under development, or even, is Ryzen still a kind of prototype CPU in itself?

Based on my experience of two out of the three Ryzens I own having problems, in my personal opinion AMD sold me an unfinished product. And based on the problems I had with the latest one (4800H), I can believe it was in fact still in some prototype-like stage: not thoroughly tested, not a final product that should have been put on the market.

 

The AMD CPUs I own and their problems (if any):

  • Ryzen 1600X - I struggled a lot to stabilize this CPU as much as possible, still ending up with occasional application crashes every few days and a BSOD every month or so. The 1xxx series had problems with the RAM controller, and even though I switched to AMD-certified RAM, which helped too, the instability is still there, no matter what.
  • Ryzen 2500U - this CPU has no problems so far. Maybe because its clock is pretty low and the voltages don't jump as high as on high-end models. A relative of mine has the same CPU, also with no problems. So far this is the only Ryzen that works 100% stable for me without any tinkering.
  • Ryzen 4800H - this CPU has weird problems. They are not noticeable at first, until... you run TensorFlow in training + sampling mode. After around 24 h (shortest was 2 h, longest was a few days) the notebook simply crashes, usually on a "hardware" issue. It does not matter if you force the TDP down to 20 W; it will still crash.

 

I could understand the 1600X still having problems; it was the 1xxx series, a brand-new line of products that the company should have tested well but somehow overlooked certain issues in, just to push it to market.

 

I cannot, however, understand that my 4800H still causes problems, and of a different kind than the 1600X. Isn't three series enough time for someone to get their act together, test the CPU properly and thoroughly in various kinds of applications, and, if possible, fix the problems or clearly state what the CPU may fail on?


So, let me clarify what exactly causes my 4800H to fail. I tried to pinpoint the problem and found a way to make this CPU run completely stable without any BSOD whatsoever.

There are two solutions, both related to the Windows power scheme (see the powercfg sketch after this list):

  • Disable aggressive boost - if you are fine with your CPU running at 1.8 GHz just to be stable, go for this option, or
  • Disable idle - if you are fine with your CPU not saving power when unused and running at full speed instead, with no hint of instability, go for this one. However, if you overload your CPU, the system becomes somewhat less responsive in this mode (beats me why...), but at the same time some applications may run faster (I got a 10-15% speed-up in AI training just by disabling idle mode, with the power draw still at 35 W).
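
For reference, the two options above correspond to hidden processor power settings that can also be toggled from an elevated prompt with powercfg. The sketch below is my own, not an official AMD or Microsoft fix, and the setting aliases (PERFBOOSTMODE, IDLEDISABLE) should be checked against powercfg /aliases on your machine before use.

import subprocess

def powercfg(*args):
    subprocess.run(["powercfg", *args], check=True)

# Option 1: disable aggressive boost (boost mode 0 = "Disabled").
powercfg("/setacvalueindex", "SCHEME_CURRENT", "SUB_PROCESSOR", "PERFBOOSTMODE", "0")

# Option 2: disable processor idle (1 = idle states off). Uncomment to use instead.
# powercfg("/setacvalueindex", "SCHEME_CURRENT", "SUB_PROCESSOR", "IDLEDISABLE", "1")

# Re-apply the active scheme so the change takes effect.
powercfg("/setactive", "SCHEME_CURRENT")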

 

Unfortunately there is no BIOS update that might mitigate this problem, nor does the BIOS offer any tinkering with processor states to pinpoint it further. Nonetheless, I opted for disabling the idle state and controlling my CPU TDP with Renoir Mobile Tuning, as otherwise I am at risk of losing unsaved work.


I'm not the only one having problems with these CPUs; there are more people like this. When certain parts are problematic, AMD should clearly run more auditing on the matter to make sure no bad-quality CPUs reach the market. Until then, you may try a BIOS update or tinker with your power schemes to turn off certain CPU functionalities and make it more, if not completely, stable.

Friday, April 22, 2022

Will GPT-2 Large (1.5B) run on NVIDIA RTX 3060 12GB? (sampling)

Yes. The largest GPT-2 model, with 1.5B parameters, will run on an RTX 3060 12GB, but it barely fits. It will, however, not load in TensorFlow <2.4, because you need CUDA 11, at least with this card. Anything below will make your GPU visible to TensorFlow, but it will hang during model loading (showing very slight GPU activity).

It will not load on TensorFlow >2.4 either, most likely because either TensorFlow or CUDA 11.2 takes a bit more memory, so the model no longer fits on the GPU and instead crashes with OOM every time.
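
A quick sanity check I find useful (my own sketch, not part of the GPT-2 code): with TensorFlow 2.4.x and the CUDA 11 / cuDNN 8.0 paths set up correctly, the card should be listed, and enabling memory growth keeps TensorFlow from grabbing the whole 12 GB up front.

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print(gpus)  # the RTX 3060 should appear here if CUDA 11 is picked up correctly
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)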

 

So what do you need?

  • Have RTX 3060 as secondary GPU so that it has as much spare memory as possible,
  • Have drivers compatible with CUDA 11 and cuDNN 8.0 (I used 473.04),
  • Use TensorFlow version 2.4.x only,
  • Slightly update the code if you want the vanilla setup,
  • Do not output too many tokens at once,
  • Keep the number of output tokens low (around 20); reduce it to 10 when the input exceeds 750 tokens.
  • (optional) Keep looping output->input in an [input size] = [maximum length] - [# tokens to output] manner, keeping the last tokens as input and removing the <|endoftext|> token from the output, to produce one "multi-sampled" long output. For example, with 10 output tokens the input should be the last 1014 tokens: remove 10 tokens from the beginning, append the output at the end of the input, and repeat (a sketch of this loop follows below). (Tested with top-k 1400, temperature 0.95; other settings may require further token reduction; speed: 1 token per ~0.45 s.)
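
A minimal sketch of that looping scheme (generate() below is just a stand-in for the actual GPT-2 sampling call; 50256 is GPT-2's <|endoftext|> id):

CONTEXT_LEN = 1024   # maximum length
OUT_TOKENS = 10      # tokens to output per step
EOT = 50256          # <|endoftext|>

def generate(tokens):
    # placeholder: the real version would run one GPT-2 sampling pass
    return [0] * OUT_TOKENS

def long_sample(tokens, target_len):
    tokens = list(tokens)
    while len(tokens) < target_len:
        window = tokens[-(CONTEXT_LEN - OUT_TOKENS):]    # last 1014 tokens as input
        new = [t for t in generate(window) if t != EOT]  # drop <|endoftext|>
        tokens.extend(new)
    return tokens

print(len(long_sample(range(1014), 1200)))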

 

Not yet known:

  • Will it crash during model loading at times due to different seeds?

 

Deep learning and sampling comparison, GTX 1660 vs RTX 3060, with GPT-2 (original code and modified nshepperd implementation):

Deep learning is a bit faster on the 3060, but sampling is slower than on the 1660. However, the 1660 barely fits the 774M model and sometimes crashes during model loading. On the 1660, sampling 20 tokens with the 774M model took 3~4 s, whereas on the 3060 it took 5~6 s. The 3060 shows 3% GPU core utilization during that time, which makes it clear that the increased number of CUDA cores does not help with sampling, and in fact slows it down. Most likely this is because the performance increase (synthetic benchmark) is ~1.4x overall and ~1.7x for compute, while the number of CUDA cores increased ~2.4x when comparing the 1660 to the 3060.

 

Why doesn't sampling fully utilize either card?

I believe it is because GPT-2 has an auto-regressive decoder that really does decode one token at a time, so all that fancy math is simply not enough to fully utilize the GPU, and batching might speed up the process. However, if the model barely fits, batching is a problem as well...


In conclusion, as far as GPT-2 usage is concerned: if you want slightly faster deep learning and want to fit the largest GPT-2 model into GPU memory, upgrading from a 1660 to a 3060 seems like a valid choice. However, if you're fine with 334M or 774M sampling on the GPU, with the latter occasionally crashing, it may be better to stay with the 1660, as each of its CUDA cores seems faster, giving lower sampling times.


Setup:

  • Ryzen 1600X, 32GB RAM
  • NVIDIA 3060 12GB, LHR model (secondary card, not used for screen)
  • Windows 7 x64 (that's right, still good ol' Windows, neither Windows 10 nor Linux)
  • Python 3.8.10 for Windows
  • Tensorflow 2.4.4
  • Latest drivers for Windows 7 x64: 473.04
  • CUDA 11, cuDNN 8.0
  • PATH with additional data: [Python location] and [Python location]\Scripts; [cuDNN 8.0 location]\bin, [CUDA 11.0 location]\bin, [CUDA 11.0 location]\libnvvp, [CUDA 11.0 location]\extras\CUPTI\libx64 and [CUDA 11.0 location]\include
  • Modified vanilla GPT-2 code for a web page interface using a database, IN->OUT->IN multiple mini-sample looping, and some post-processing for longer sentences: eot token removal and simple repeated-text removal.
  • Modified nshepperd's code for deep learning, with the learning-rate update fixed (in the original code it does not work at all: it updates the graph only, not ADAM), removal of dangerous graph updates that slow down the learning process and overload memory, saving the ADAM state with the ability to remove it later, and some other improvements to try replicating a learning process similar to the one described by the GPT creators, including warm-up, attention dropout and a cosine schedule. Deep learning tests were performed on newly initialized models with custom parameters, after tokenization. (The fixed code is not available anywhere, as it is not cleaned up nor proven to be free of other bugs, though as far as tested it works properly.) Why not other code, like Hugging Face? Because we are going vanilla here, using the code of those who first got it working, just fixing and updating it to meet our needs.

 

WARNING! The text below is the continuation generated by AI for demonstration purposes:

FORMAT STRUCTURE

// The structure of models is common to simple's and deep which we don't power yet, we just use some state holder of basic ideas like // whether we are learning epoch or not, etc, should already be present anyway, that contains the features used // for training, and contains some training properties like block size, re-training state and have min_lr init rate // it holds input, output and samples from training model, and needs to be initialized to some states for test as well // initial: initial learning rate = 0.1, re-training state = 0, k-samples or max_features = 10 // given_state: example_state_list has 50 items, when we have only few samples for LSTM operations we break it into smaller // lists, so that it doesn't hold too many states and may be more compact for training, otherwise for the input we // need a model with only 100 items in params . final: static Random r = new Random(16384); min_lr = 0.005; std::vector<Int32> params_c; std::vector<float> samples_c; Tensor product_tensor = {0}; // y_test: x_train and y_test will be constant tensors, typically float32s thresh = 0.001; // trained gradient_y_updated is the model matrix of (row H, column L) that will be updated during training // // initialize_state: it contains initial model state . final: [[layer_init;_final]] = || {} .final: //[evaluate_grad_y_1_test;_final]] = && { for i := 0 ;i < 5000 ;i++ { r = r.next(); // find the lud of r // skip if there is no different for running tests in test mode, otherwise we don't support // multi-dimensional contributions r.get(); if ((r.get() != r.h) && (r.get() != r.l)) { // we have reported that the current test fails, return the error r = r.queable(); // we only ask if the initial test result is less than a regular failure per test // check whether we get a [positive or not] warning, otherwise don't do anything, if // there is one in the test mode do nothing else if (r.name().empty()) continue; // print out the current state r.get(); // print the current version, and assign a temporary value r} // print out all errors printf("Error : %s the model was already initialized

", r.get()->get_previous()->get_error()); // add before the error buffer if (r.get(), r.get(), r.get() += 100 - 100) { r.get(); // print the current state before the error r.get(); // print out the current version, and assign a temporary value r} // print out the current state ... // and return the tentative error buffer - our error initializer run_err() { // although the error message is usually a single line, one error message is enough, // just to show the user a few messages, so it will be easier see the error later // we need a simple way to move the error left if at all possible r.reformat(); // discard some bad old errors ... printf("%s(%x) ", r.error(), r.err_count()); // add before the static buffer, very important nheq(r); // RNN operation over or over, to move the error forward or backward so it doesn't break the flow r = RNN(DataFrameOperation(params_c, r, 90, 0, 0, samples_c, samples_c, 0) + weights_c); // initialize our state .final: do_run(DataFrameOperation); while(r.get() != r.h) { r.get(); } // if all failures were recoverable, stop looping, we won't generate any inputs .... -1: rnt = r.queue_permutation(); -1: for (;;) { // traverse over all the successes [(max_train_y_original + 1) % max_chance_expected] for i := 0 ;i < max_chance_expected ;i++ ) // assess whether we extracted all the observations from the batch data of the model, // every difficulty, one, in the list if there are more than one observation from our batch, // it must be to set up for more information, and later evaluate the model `input` // predictions or `output` , i am only testing for removal { if (p.metrics() == see_all_see_all) break; // there are enough observations left not to extract more, you can wait up to // the next set of output models for extracting all observations // logistic regression, we want to make sure that no [repeated] observations will occur for this step - .lstm().logistic { // if there is multiple observations, that we saw earlier, we can find the earlier