#MoE# #LLM practice# #Scaling#
Dataset: the tiny shakespeare dataset:
https://t.cn/A6T9xaHf
No tokenizer is used; each single character is treated as one token, giving a vocabulary size (vocab_size) of 65.
Total token count: 1,115,394, roughly 1M.
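With no tokenizer, encoding is just a character-to-id lookup table. A minimal sketch of that setup (assuming the corpus is saved locally as input.txt; the variable names are illustrative, not taken from the experiment notes):

# build a character-level vocabulary over the raw corpus
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

chars = sorted(set(text))    # tiny shakespeare contains 65 distinct characters
vocab_size = len(chars)      # vocab_size == 65

stoi = {ch: i for i, ch in enumerate(chars)}        # char -> token id
itos = {i: ch for i, ch in enumerate(chars)}        # token id -> char
encode = lambda s: [stoi[c] for c in s]             # string -> list of ids
decode = lambda ids: ''.join(itos[i] for i in ids)  # list of ids -> string

print(len(encode(text)))     # total tokens == total characters, ~1.1M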
Hardware: the model can be trained directly on a CPU, or training can be accelerated on a GPU with relatively high FLOPS.
Goal: with a small parameter count and short training time, mainly to compare how fast the loss converges for different model structures under the same number of training steps. The hope is to reproduce the Switch Transformer results on scaling with training steps and scaling with training time; the "Scaling Versus a Larger Dense Model" result needs expensive training hardware and is hard to reproduce.
Training hyperparameters:
import torch

# hyperparameters
batch_size = 16        # how many independent sequences will we process in parallel? B
block_size = 32        # maximum context length for predictions (seq_len T)
max_iters = 5000       # total number of training steps
eval_interval = 100    # estimate the loss every this many steps
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 400       # batches averaged per loss estimate
head_size = 16         # per-head attention dimension
n_embed = 128          # embedding width C
n_head = 8             # attention heads per layer
n_layer = 8            # transformer blocks
dropout = 0.1
num_experts = 8        # experts per sparse MoE layer
top_k = 2              # experts activated per token
aux_loss_coef = 0.01   # weight of the load-balancing auxiliary loss
moe_self_attention = False  # False here gives M1; True presumably gives M2 (MoE attention)
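For context, here is a minimal sketch of how num_experts, top_k and aux_loss_coef typically interact in a sparse MoE feed-forward layer. This is my own illustrative implementation of token-level top-k routing with a Switch-Transformer-style load-balancing loss, not the code from the experiment notes:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    # A pool of expert FFNs with a learned per-token top-k router (sketch).
    def __init__(self, n_embed, num_experts, top_k):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(n_embed, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(n_embed, 4 * n_embed), nn.ReLU(),
                          nn.Linear(4 * n_embed, n_embed))
            for _ in range(num_experts)])

    def forward(self, x):                                # x: (B, T, C)
        probs = F.softmax(self.router(x), dim=-1)        # (B, T, num_experts)
        topv, topi = probs.topk(self.top_k, dim=-1)
        topv = topv / topv.sum(dim=-1, keepdim=True)     # renormalise the kept weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = topi[..., k] == e                 # tokens routed to expert e
                if mask.any():
                    out[mask] += topv[..., k][mask].unsqueeze(-1) * expert(x[mask])
        # load-balancing auxiliary loss: pushes the router toward uniform expert usage
        num_experts = probs.shape[-1]
        load = F.one_hot(topi[..., 0], num_experts).float().mean(dim=(0, 1))
        importance = probs.mean(dim=(0, 1))
        aux_loss = num_experts * (load * importance).sum()
        return out, aux_loss

The auxiliary term would then be folded into the training objective as loss = ce_loss + aux_loss_coef * aux_loss.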
Model structures compared (a structural sketch follows the list):
M1. MultiHeadAttention + SparseMoE structure, parameter count: 8.996545 M parameters
M2. SparseMoEMultiHeadAttention + SparseMoE structure, parameter count: 9.668417 M parameters
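The post doesn't include the layer definitions, so the following is only one plausible reading of the two class names: in M1 the attention sub-layer is ordinary dense multi-head attention, while in M2 the attention output is itself produced by top-k routing over attention-head "experts" (mixture-of-attention-heads style). Every name and detail below is an assumption:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalHead(nn.Module):
    # One causal self-attention head, used here as an attention "expert" (assumed).
    def __init__(self, n_embed, head_size, block_size):
        super().__init__()
        self.kqv = nn.Linear(n_embed, 3 * head_size, bias=False)
        self.proj = nn.Linear(head_size, n_embed)   # project back to model width
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):                           # (B, T, C) -> (B, T, C)
        B, T, _ = x.shape
        k, q, v = self.kqv(x).chunk(3, dim=-1)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.shape[-1])
        att = att.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        return self.proj(F.softmax(att, dim=-1) @ v)

class SparseMoEMultiHeadAttention(nn.Module):
    # Per-token top-k routing over attention-head experts, mirroring the FFN
    # router above. All experts run densely here for clarity; a real
    # implementation would dispatch only the selected tokens.
    def __init__(self, n_embed, num_experts, top_k, head_size, block_size):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(n_embed, num_experts)
        self.experts = nn.ModuleList(
            [CausalHead(n_embed, head_size, block_size) for _ in range(num_experts)])

    def forward(self, x):
        probs = F.softmax(self.router(x), dim=-1)   # (B, T, num_experts)
        topv, topi = probs.topk(self.top_k, dim=-1)
        topv = topv / topv.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            y = expert(x)                           # attention needs the full sequence
            for k in range(self.top_k):
                mask = topi[..., k] == e
                out[mask] += topv[..., k][mask].unsqueeze(-1) * y[mask]
        return out

A structure along these lines would at least be consistent with M2's slightly larger parameter count, since the attention side gains its own router and expert pool.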
As shown in the figure: at the same number of training steps, the loss of M2 converges faster than that of M1.
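The loss compared here is the usual character-level cross-entropy, estimated every eval_interval steps by averaging over eval_iters batches. A typical evaluation helper in this kind of nanoGPT-style script (assumed; get_batch and the model's (logits, loss) return signature are not shown in the post):

import torch

@torch.no_grad()
def estimate_loss(model, get_batch, eval_iters):
    # Average cross-entropy over eval_iters random batches for each split.
    model.eval()
    out = {}
    for split in ('train', 'val'):
        losses = torch.zeros(eval_iters)
        for i in range(eval_iters):
            xb, yb = get_batch(split)   # (B, T) token-id tensors
            _, loss = model(xb, yb)     # model returns (logits, loss)
            losses[i] = loss.item()
        out[split] = losses.mean().item()
    model.train()
    return out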
Experiment notes: https://t.cn/A6TokRGd