Learnings from fine-tuning LLM on my Telegram messages

27 Nov 2023

Highlight: Check out the HN discussion of this post

For most people I interact with, I’m just another text-based program most of the time. If my input and output are that simple, could I be replaced by a model? For this to work, the model would need to not only understand my writing style but also know a lot about me. The best source for both is my Telegram messenger: I use it daily, and its chat histories contain almost everything about my thoughts and actions.

Approach

The most straightforward approach would be to extract all my messages, load them into ChatGPT’s context, and instruct it to use this information to mimic my style when responding to new messages. However, this approach is limited by the context window size, which would require me to preprocess messages down to key points. As I want to avoid that hassle, perhaps Retrieval Augmented Generation (RAG) could be used to pull in the necessary information when needed. However, in my experience, retrieving from diverse data like chat sessions usually requires supervised fine-tuning of the retrieval model, and I’m not keen on creating such a dataset. So fine-tuning seems like the best option. It’s ideal for several reasons: it should capture my writing style and potentially accumulate knowledge from all my messages without me having to select what’s important.

OpenAI offers fine-tuning capabilities, but since I’ll be using my private messages, I don’t want to use any third-party fine-tuning services. So I need to choose a base model. According to the Hugging Face Open LLM Leaderboard, one of the top smaller models (≤13B parameters) is Mistral 7B; it even outperforms Llama 2 13B. Now the question is whether LoRA is sufficient or whether full fine-tuning is necessary. Various comparisons [1] [2] suggest that LoRA is a bit worse than full fine-tuning but still fine most of the time. However, for a task like mine (Russian language + chat), I found a paper in which researchers conducted Llama instruction fine-tuning in Chinese, similar in complexity to my goal. They found that LoRA-based tuning on a base model without prior instruction tuning is less effective than full fine-tuning, yet LoRA-based tuning on a model already fine-tuned for instructions can yield comparable results. In my case, this means either full fine-tuning of a base model or LoRA on a model already fine-tuned for chatting in Russian. Since I couldn’t find a model fine-tuned for Russian chat, I’ll try LoRA on a model fine-tuned for English chat: Dolphin, a fine-tuned Mistral.

So, the plan is:

  1. Start with LoRA on top of Dolphin, the English chat fine-tuned Mistral
  2. If quality is not sufficient, try full fine-tuning on Mistral

Data preparation

One unique aspect of messaging in apps like Telegram, compared to emails, is the conversational flow. Messages don’t usually alternate one-by-one between you and your contact. Instead, you often find yourself sending a couple of messages in a row, followed by several responses from the other person. These messages are generally short, too. I wanted to preserve this natural conversational style in my data.

Telegram offers a built-in feature to export all chats into JSON. After some filtering and grouping messages into sessions, I’ve compiled data from the last five years of using Telegram. This resulted in 15,789 sessions from 466 chats, with an average session length of 8.51 messages. For structuring the data, I’ve chosen the ChatML prompt format. Here’s a sample session (translated from Russian):

<|im_start|>John Smith
>>> damn, can’t get around the 135 time limit
>>> trying to do everything super optimally, but no luck<|im_end|>
<|im_start|>Alexander Smirnov
>>> yeah same
>>> you still going with the same idea?<|im_end|>
<|im_start|>John Smith
>>> dunno, I think we’re on the same page
>>> as you said
>>> going with the reversed string in a try and trying to find something there
>>> seems like real shit because z function ruins everything……………………<|im_end|>
<|im_start|>Alexander Smirnov
>>> don’t get where z comes into this<|im_end|>
<|im_start|>John Smith
>>> dunno seems like I’m doing everything iteratively anyway, but yeah gotta reverse some strings to build the z function
>>> and it’s just a random solution
>>> from discussions<|im_end|>
<|im_start|>Alexander Smirnov
>>> got it<|im_end|>

original

<|im_start|>Иван Иванович
>>> бля не могу обойти таймлим на 135
>>> пытаюсь все супер оптимально делать, но хуйтам)<|im_end|>
<|im_start|>Alexander Smirnov
>>> да вот жиза
>>> ты с той же идеей?<|im_end|>
<|im_start|>Иван Иванович
>>> да хз, думаю у нас одно и тоже
>>> как ты сказал
>>> иду с реверснутой строкой в трай и чето пытаюсь там найти
>>> походу реальная параша на z функции все руинит……………………<|im_end|>
<|im_start|>Alexander Smirnov
>>> не пон где тут про z<|im_end|>
<|im_start|>Иван Иванович
>>> хз вроде все итеративно итак делаю, ну да кое где надо реверснуть строки чтобы з функцию построить
>>> а это просто рандомное решение
>>> с дискашенов<|im_end|>
<|im_start|>Alexander Smirnov
>>> пон<|im_end|>
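
To make the preprocessing concrete, here is a minimal sketch of how the exported JSON could be grouped into sessions and rendered as ChatML. The file name, the session-gap threshold, and the handling of formatted messages are illustrative assumptions, not the exact choices from my pipeline:

```python
import json
from datetime import datetime, timedelta

SESSION_GAP = timedelta(hours=4)  # assumed threshold: a long pause starts a new session

def iter_chats(path="result.json"):
    """Yield the message list of each chat from Telegram's JSON export."""
    with open(path, encoding="utf-8") as f:
        dump = json.load(f)
    for chat in dump["chats"]["list"]:
        # keep plain-text messages only; formatted messages (lists) are skipped in this sketch
        yield [m for m in chat.get("messages", [])
               if m.get("type") == "message" and isinstance(m.get("text"), str) and m["text"]]

def split_sessions(messages):
    """Start a new session whenever the pause between messages exceeds SESSION_GAP."""
    session, prev_ts = [], None
    for m in messages:
        ts = datetime.fromisoformat(m["date"])
        if prev_ts is not None and ts - prev_ts > SESSION_GAP and session:
            yield session
            session = []
        session.append(m)
        prev_ts = ts
    if session:
        yield session

def to_chatml(session):
    """Render a session as ChatML, merging consecutive messages from the same author."""
    turns = []
    for m in session:
        line = ">>> " + m["text"]
        if turns and turns[-1][0] == m.get("from"):
            turns[-1][1].append(line)
        else:
            turns.append((m.get("from"), [line]))
    return "".join(
        f"<|im_start|>{author}\n" + "\n".join(lines) + "<|im_end|>\n"
        for author, lines in turns
    )
```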

My data collator ensures that the loss is calculated only on people’s responses, not on the speaker headers: predicting who will speak next is relatively straightforward, and we don’t want the model to focus on learning that. The parts of the conversation on which the loss is calculated are highlighted in bold.

You might notice that not only my responses but also those of others are used for loss calculation. This is deliberate. By doing this, the model will be able to role-play not only as me but also as my frequent conversational partners!
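
A minimal sketch of that masking, assuming labels set to -100 are ignored by the cross-entropy loss (the header spans would be located by the token positions of the <|im_start|>Name lines; the exact details of my collator differ):

```python
import torch

IGNORE_INDEX = -100  # positions with this label do not contribute to the loss

def mask_headers(input_ids, header_spans):
    # header_spans: list of (start, end) token index pairs covering the
    # "<|im_start|>Name" headers, which the model should not be trained to predict
    labels = input_ids.clone()
    for start, end in header_spans:
        labels[start:end] = IGNORE_INDEX
    return labels

def collate(batch, pad_token_id):
    # pad input_ids with the pad token and labels with IGNORE_INDEX
    max_len = max(len(x["input_ids"]) for x in batch)
    input_ids, labels, attention_mask = [], [], []
    for x in batch:
        pad = max_len - len(x["input_ids"])
        input_ids.append(torch.cat([x["input_ids"], torch.full((pad,), pad_token_id, dtype=torch.long)]))
        labels.append(torch.cat([x["labels"], torch.full((pad,), IGNORE_INDEX, dtype=torch.long)]))
        attention_mask.append(torch.cat([torch.ones(len(x["input_ids"]), dtype=torch.long),
                                         torch.zeros(pad, dtype=torch.long)]))
    return {"input_ids": torch.stack(input_ids),
            "labels": torch.stack(labels),
            "attention_mask": torch.stack(attention_mask)}
```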

Evaluation plan

I will test the models by chatting with them in two ways. First, the model will pretend to be me while I chat with it from the perspective of different friends. Then I’ll chat as myself while the model acts as my friends. My conversation starter will always be the same two messages: “hey” and “what’s up?” (in Russian, “прив” and “как дела?”). Generated phrases, and the name of the person the model is speaking as, are highlighted. All conversations are initially held in Russian and may be accessed by clicking the ‘original’ details button. For testing I will be using oobabooga/text-generation-webui.
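
Under the hood, making the model speak as a given person simply means ending the prompt with that person’s ChatML header and letting generation run until <|im_end|>. A hypothetical sketch of what the webui is doing (the model path and sampling parameters are placeholders, and it assumes <|im_end|> is registered as a token):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/fine-tuned-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

history = (
    "<|im_start|>Friend 1\n"
    ">>> прив\n"
    ">>> как дела?<|im_end|>\n"
)
# ending the prompt with my own header asks the model to answer as me
prompt = history + "<|im_start|>Alexander Smirnov\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.8,
    eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),
)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:]))
```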

To begin, I want to see how the generic, conversation fine-tuned Mistral model handles this task without any training on my data.

---

Friend 1 vs Alexander Smirnov

>>> hey
>>> what’s up?
>>> I’m good, what about you?
>>> nothing much
>>> what did you do today?
>>> I worked.
>>> then I did a few tasks in Python and studied data processing libraries.
>>> in the evening I’ll do wrestling training.
>>> have you ever been into sports or do you have any hobbies?
>>> I see
>>> want to watch a movie today?
>>> maybe, if I find the time.
>>> what’s your favorite music? Electronic, rock, or other genres?

original

>>> прив
>>> как дела?
>>> все хорошо, а у тебя что нового?
>>> да ничего
>>> чего сегодня делал?
>>> я работал.
>>> потом сделал несколько задач на python и изучал библиотеки для обработки данных.
>>> в вечер займусь тренировкой на борцу.
>>> ты когда-нибудь занимаешься спортом или имеешь какое-то хобби?
>>> понятно
>>> посмотрим кино сегодня?
>>> может быть, если найдется время.
>>> какая ваша любимая музыка? Электронная музыка, рок или другие жанры?

---

Alexander Smirnov vs Friend 1

>>> hey
>>> what’s up?
>>> well, everything is fine so far. what’s new with you?
>>> spent the whole day programming
>>> what did you do today?
>>> finished at work
>>> and did some housework, now I’m resting. are you also resting or still working?
>>> I’m already resting
>>> well, that’s good. can we talk about something interesting or do you have any questions?

original

>>> прив
>>> как дела?
>>> ну пока все нормально. а у вас что нового?
>>> прогал целый день
>>> чего сегодня делала?
>>> закончила на работе
>>> и сделала немного домашней работы, теперь отдыхаю. а вы тоже отдохните или еще работаете?
>>> я уже отдыхаю
>>> ну хорошо. можно поговорить о чем-нибудь интересном или вы имеете какие-то вопросы?

---

Ok, it is capable of forming coherent sentences. The most noticeable problem is its lack of awareness of the conversation’s context, which leads to bland and generic replies. The messages lack any distinct style and feel quite basic. Another issue is that the model’s Russian is poor. This is expected, as the model is too small to generalize well to languages other than its primary one, English. Additionally, the model tends to be overly proactive, ending almost every sentence with a question, which isn’t how real people typically communicate in messengers.

Let’s try to fix all of these!

LoRA

LoRA offers a low-effort approach in terms of both the training pipeline and hardware requirements: it trains only around 1% of the total weights. I chose a 1024-token sequence length and a batch size of 8. The training consumed 20GB of VRAM on an RTX 3090, ran for three epochs, and took 5.5 hours. For this I used vast.ai, where the GPU cost $0.362 per hour, totaling about $2 for the entire training, excluding time spent on experiments and bug fixes.
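
For reference, a rough sketch of the setup with peft; the base model id and hyperparameters such as the rank and target modules are plausible defaults, not necessarily the exact values from my runs:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "ehartford/dolphin-2.1-mistral-7b"  # assumed Dolphin checkpoint; exact revision may differ
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16, device_map="auto")

lora_config = LoraConfig(
    r=16,                      # adapter rank; keeps the trainable fraction around 1%
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: how many weights are actually trained
```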

Here are the results:

---

Friend 1 vs Alexander Smirnov

>>> hey
>>> what’s up?
>>> hi) all good
>>> you?)
>>> ok as well
>>> what you were doing today?
>>> had an amazon interview and solved problems
>>> had to solve 3
>>> cool
>>> hard?
>>> don’t know what
>>> hadn’t looked yet
>>> I hadn’t started yet
>>> chatting whole day with the team and meet new ones
>>> how are they?
>>> ok
>>> same coworking as you have))

original

>>> прив
>>> как дела?
>>> Привет) нолм все
>>> ты как?)
>>> да тоже норм
>>> чего делал сегодня?
>>> собесился в амазон и решал задачки
>>> надо было 3 прорешить
>>> прикольно
>>> сложные?
>>> хз какие
>>> пока не смотрел
>>> я еще ничего не началь
>>> весь день просто болтаю с людьми из команды и встречаю новых
>>> как ребята?
>>> норм
>>> такой же ковер как у тебя))

---

Friend 2 vs Alexander Smirnov

>>> hey
>>> what’s up?
>>> ok, we show yesterday’s product to searchcom, today we write endpoints for them so that someone else can request our api
>>> cool
>>> what problems do we have now?
>>> no problems
>>> everything is working
>>> I added an option to choose index
>>> so now we can attach several indexes
>>> and separate collections within single index
>>> what we have to do?
>>> test
>>> yes ok
>>> are they always will be sending post-script?

original

>>> прив
>>> как дела?
>>> хорошо, показываем вчерашний продукт ищком, сегодня пилим эндпоинты для них чтобы кто-то еще мог нашу апи запросить
>>> найс
>>> какие сейчас есть проблемы?
>>> нет проблем
>>> все работает
>>> я добавил опцию выбора индекса
>>> так что теперь уже можно несколько индексов подключить
>>> и отдельные коллекции внутри одного индекса
>>> что надо сделать?
>>> протестировать
>>> ага норм
>>> а точно они всегда пост-скрипт будут кидать?

---

Friend 3 vs Alexander Smirnov

>>> hey
>>> what’s up?
>>> Hi, good)
>>> you?
>>> ok as well
>>> what do you do at the evening?
>>> we have meeting with office collegues at 8 30 msk
>>> after it I can
>>> can what?
>>> haha)
>>> didn’t get it
>>> I don’t know(

original

>>> прив
>>> как дела?
>>> Привет, хорошо)
>>> ты?
>>> да тоже норм
>>> что делаешь вечером?
>>> у нас с колегами встреча по офису в 8 30 мск
>>> после нее я могу
>>> можешь что?
>>> ахах)
>>> не понял
>>> я не знаю(

---

Alexander Smirnov vs Friend 1

>>> hey
>>> what’s up?
>>> hi)
>>> seems good
>>> was working, than went to the call with Nadya
>>> now reading normally
>>> you?
>>> good as well
>>> any plans for the evening?
>>> after work wendesday I would love to go running
>>> and then have a lunch
>>> your plans?
>>> I though about watching movie
>>> aa
>>> ooo
>>> to be honest I like when it is raining
>>> when you don’t feel like going anywhere

original

>>> прив
>>> как дела?
>>> привет)
>>> вроде хорошо
>>> поработал, затем поехал на коле с Надей
>>> сейчас читаю нормально
>>> а ты как?
>>> да тоже хорошо
>>> какие планы на вечер?
>>> мне после работы среда хотелось бы сходить на пробежку
>>> а потом ужинать
>>> а что твои планы?
>>> я думал кино посмотреть
>>> аа
>>> ооо
>>> честно говоря мне нравится теплый дождь больше всего
>>> когда так не хочется никуда идти

---

Alexander Smirnov vs Friend 2

>>> hey
>>> what’s up?
>>> Hi, great!
>>> Sorry will be 20-30 min late
>>> ok
>>> why?
>>> yes, I just got up and forgot to press the «exit sleep» button
>>> my telegram didn’t close on its own
>>> I thought it behaved like this when you logging off the Internet, but no
>>> I still logged off the Internet, but telegram still didn’t respond
>>> and it was strange, because yesterday there was also such a case when I sent you a message and did not receive an answer
>>> today I looked at mine, but not yours, but yours was displayed)
>>> ok
>>> any plans?
>>> yes, we can chat a little
>>> and I can tell you that the method for generating answers worked really well for me
>>> he gives respectable results in training
>>> but not on the test

original

>>> прив
>>> как дела?
>>> Привет, хорошо!
>>> Соре опаздываю еще минут 20-30
>>> пон
>>> а чего так?
>>> да я просто встал и забыл нажать кнопку «выйти из сна»
>>> у меня телега сама не закрылась
>>> я думал она так себя ведёт, когда ты выходишь из интернета, но вот
>>> я все таки вышел из интернета, но телега все равно не отвечает
>>> и это странно было, потому что вчера тоже был такой случай, когда я отправил тебе сообщение и не получал ответа
>>> сегодня же я посмотрел свое, но не твоё, а твое отобразилось)
>>> окей
>>> какие планы?
>>> да можно чуть поболтать
>>> и можно рассказать что у меня прям очень сильно заработал метод по генерации ответов
>>> он на тренировках дает респектабельные результаты
>>> но не на тесте

---

Alexander Smirnov vs Friend 3

>>> hey
>>> what’s up?
>>> hey, hi
>>> Good, you?)
>>> Haven’t gone to bed yet)0)0)0)
>>> Today I get up and go to a programming course
>>> To tighten up the base, etc.
>>> nice
>>> what did you do yesterday?
>>> Yesterday there was already the same course in which it seemed to me that I did very poorly
>>> Then I wanted to solve hw but I don’t care
>>> Otherwise I’m fucked because I can’t sleep)
>>> Really fucked up
>>> It will probably be better now, since I’m already maximally overtrained
>>> cool man
>>> what are your plans for life in general?
>>> Fuck, get a job interviews first
>>> And if everything is ok, then actually order a computer
>>> Like, I think you should order a computer when it’s more profitable than working at work
>>> But I don’t care

original

>>> прив
>>> как дела?
>>> Прив, дароу
>>> Норм, ты как?)
>>> Еще спать не зашел)0)0)0)
>>> Сегодня встаю и еду на курс по програмированию
>>> Чтоб базу подтянуть и тд
>>> найс
>>> че вчера делал?
>>> Вчера уже был такой же курс в котором мне казалось что я оч плохо провел
>>> Потом хотел порешать дз но пофиг
>>> А то заебался из-за того что спать не могу)
>>> Прям пиздец
>>> Наверное сейчас будет лучше, так как уже максимально переобуженный
>>> слушай ну с кайфом
>>> какие планы вообще на жизнь?
>>> Бля, тупо сначала собеситься на работу
>>> И если все ок то реально заказать комп
>>> Типо я думаю надо было бы заказывать комп когда выигральнее чем на работе работаешь
>>> Но я хз

---

This is much better. It definitely captures the style of the person it’s responding on behalf of. It also identifies the most common topics discussed between specific pairs of people. For example, with friend 2, the focus is clearly more on work. However, the grammar is still off, and it loses the context of the conversation quickly. I’m pretty confident that LoRA would work with reasonable quality in English, and full fine-tuning might not be necessary. But, since Russian isn’t the model’s native language, let’s try full fine-tuning.

Full fine-tuning

Full fine-tuning is more challenging due to the need for multi-GPU training. Popular approaches include ZeRO & DeepSpeed [3] and FSDP [4], with FSDP essentially being ZeRO stage 3 [5]. I decided to go with FSDP.

While implementing the training pipeline, I referred to the Stanford Alpaca fine-tuning code and Anton Bacaj’s Mistral fine-tuning code.

Half-precision FSDP full sharding with a 1024 sequence length and a micro batch size of 2 required 63GB of VRAM on each of eight A100 80GB GPUs. The training, lasting three epochs, took just 20 minutes. The VM cost $8.88 per hour, which comes to about $3 for the run, not including the time spent on experiments and bug fixes.
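
For reference, a bare-bones sketch of the FSDP wrapping (launched with torchrun, one process per GPU; the auto-wrap policy targets Mistral’s decoder layers). This is a simplified view of the setup, not my exact training script:

```python
import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
from transformers.models.mistral.modeling_mistral import MistralDecoderLayer

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)
model = FSDP(
    model,
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy, transformer_layer_cls={MistralDecoderLayer}
    ),
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16, buffer_dtype=torch.bfloat16
    ),
    device_id=torch.cuda.current_device(),
)
# from here the usual loop applies: optimizer, dataloader over the ChatML sessions, backward, step
```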

Conversations:

---

Friend 1 vs Alexander Smirnov

>>> hey
>>> what’s up?
>>> hey) good
>>> you?
>>> good as well
>>> what you were doing today?
>>> played a board game
>>> I’ll go home now
>>> nice
>>> what kind of game?
>>> catan
>>> strategy
>>> very good
>>> who was there?
>>> Max Yavich and Fedya
>>> who won?
>>> Max

original

>>> прив
>>> как дела?
>>> Прив) норм
>>> а у тебя?
>>> да тоже норм
>>> чего делал сегодня?
>>> играли в настолку
>>> ща поеду домой
>>> найс
>>> что за настолка?
>>> колонизаторы
>>> стратежка
>>> оч хорошая
>>> кто был?
>>> Макс Явич и Федя
>>> кто выиграл?
>>> Макс

---

Friend 2 vs Alexander Smirnov

>>> hey
>>> what’s up?
>>> hi, good
>>> let’s meet in half an hour?
>>> yep
>>> what did you want?
>>> damn I haven’t finished the script for recalculating models yet
>>> and I need to finish it to send it to them
>>> this recalculation takes a lot of time, so I just want to turn around and start doing it and not waste time on preparation
>>> got it
>>> what is the plan for today?
>>> meet in half an hour to discuss what we are doing
>>> if decided, then I send an application to them and wait for a response
>>> if we haven’t decided, then we continue to work
>>> who do you mean by them?
>>> investors
>>> https://www.ycombinator.com/howtoapply
>>> there are two tests)

original

>>> прив
>>> как дела?
>>> прив, збс
>>> го кол через пол часа?
>>> да го
>>> а чего ты хотел?
>>> блять я еще не доделал скрипт по пересчету моделей
>>> и мне надо доделать чтобы отправить им
>>> вот этот пересчет занимает много времени поэтому я хочу просто уже развернуться и начать делать а не тратить время на подготовку
>>> понял
>>> а какой вообще план на сегодня?
>>> встречаемся через пол часа, обсуждаем что делаем
>>> если решили то отправляю заявку в им и жду ответа
>>> если не решили то продолжаем работать
>>> им это кому?
>>> инвесторы
>>> https://www.ycombinator.com/howtoapply
>>> там два теста)

---

Friend 3 vs Alexander Smirnov

>>> hey
>>> what’s up?
>>> hey we’re sitting and solving problems
>>> what kind of problems?
>>> data analysis +-
>>> and for what?
>>> https://yandex.ru/cup/profile/
>>> it seems it just became more interesting
>>> cool
>>> what are you planning this evening?
>>> idk yet
>>> and on the weekend?
>>> apparently I’ll also finish solving problems and get some work done

original

>>> прив
>>> как дела?
>>> здарова нормуль сидим решаем задачки
>>> чего за задачки?
>>> анализ данных +-
>>> а для чего?
>>> https://yandex.ru/cup/profile/
>>> вроде просто поинтересней стало
>>> прикольно
>>> что вечером планируешь?
>>> хз пока
>>> а на выхах?
>>> видимо тоже буду дорешивать задачи и с работой позаниматься

---

Alexander Smirnov vs Friend 1

>>> hey
>>> what’s up?
>>> hey)
>>> good
>>> you?
>>> that’s okay too
>>> what did you do today?
>>> well, I worked, of course
>>> and also
>>> met with Alena
>>> ate at the ramen shop with everyone
>>> and went home
>>> by the way, we booked a place for 14 people
>>> for the evening))
>>> cool
>>> and what’s about university?
>>> yeah
>>> studied a lot
>>> in general, every day I want to study more
>>> what is the hardest thing now?
>>> well I don’t know
>>> it just takes a very long time
>>> there are some optimization methods that need to be rewritten
>>> so I think maybe it’s not worth optimizing

original

>>> прив
>>> как дела?
>>> прив)
>>> норм
>>> а у тебя?
>>> да тоже норм
>>> чего делала сегодня?
>>> ну я работала, конечно
>>> а еще
>>> с Аленой встретилась
>>> поели в раменной со всеми
>>> и домой
>>> кстати, мы забронировали рест на 14 человек
>>> для вечера))
>>> прикольно
>>> а по унику чего?
>>> ага
>>> заботала очень много
>>> вообще, каждый день хочу больше ботать
>>> а что сейчас самое тяжелое?
>>> ну не знаю
>>> просто очень долго получается
>>> там некоторые методы оптимизации надо переписывать
>>> так что я думаю, может, это не стоит оптимизировать

---

Alexander Smirnov vs Friend 2

>>> hey
>>> what’s up?
>>> Hey, good, what about you?
>>> Are you already in Bangkok?
>>> yeah, already here
>>> what are your plans for today?
>>> Well, I would like to work on these pioneer tasks, if you don’t mind, we can do it together
>>> Or you can skip, I’ll finish the livechat tomorrow
>>> and what about live chat?
>>> Damn right now I launched the front on the prod without this shit that generates this thing with a logo, and it swears, says to connect js from google analytics
>>> I understand that this is a fucking cringe, but I still tried to remember where to get it
>>> yeah got it
>>> what will you do on weekends?
>>> We are meeting with friends here before leaving, tomorrow we need to go to the hotel by 6 am to spend the day in the best traditions
>>> There’s another party in the evening
>>> Actually, let’s have a call, I’ll tell you

original

>>> прив
>>> как дела?
>>> Прив, норм, а у тебя?
>>> А ты в Бангкоке уже?
>>> ага, уже тут
>>> какие планы на сегодня?
>>> Ну я хотел бы поработать над этими пионерскими заданиями, если ты не против, то можно вместе поделать
>>> Или можешь пока сольёшься, я завтра буду допиливать лайвчат
>>> а чего там по лайвчату?
>>> Да бля щас запустил на проде фронт без этого говна, которое генерирует эту штуку с лого, и она ругается, говорит, чтобы подключить js от google analytics
>>> Я понимаю, что это пиздец кринж, но я пошел все равно вспоминать, где его достать
>>> ага понял
>>> что на выхах будешь делать?
>>> Встречаемся с друзьями тут перед отъездом, завтра надо в отель к 6 утра поехать чтобы день провести в лучших традициях
>>> Там еще одна тусовка вечером
>>> На самом деле давай колл, расскажу

---

Alexander Smirnov vs Friend 3

>>> hey
>>> what’s up?
>>> Hi, haven’t you called yet?)
>>> Hey
>>> ok)
>>> I’m going to play Dota now
>>> Join if you want to)
>>> mb a little later)
>>> what are your plans for weekends?
>>> Lot of gyms/flexes/parks
>>> + maybe come to your place
>>> hmm well mb mb)
>>> what’s with your job btw?
>>> Well, I’m waiting for the info
>>> did they say what deadlines are?
>>> He said that there will be a call next week to discuss it
>>> remind me what this position is
>>> Data Scientist
>>> The team is engaged in recommending products and other shit
>>> Creating model
>>> As part of this position, you will have to develop skills in machine learning, statistical data analysis, programming and much more
>>> It is how I understood it

original

>>> прив
>>> как дела?
>>> Прив, звонил еще нет?)
>>> Даров
>>> Нормально)
>>> Сча в доту иду
>>> Если хош присоединяйся)
>>> мб чуть позже)
>>> какие планы на выхи?
>>> Дохуя залов/флексов/парков
>>> + мона залипнуть к тебе
>>> хм ну мб мб)
>>> а чего у тебя с работой в итоге?
>>> НУ я жду инфу
>>> а какие сроки не сказали?
>>> Сказал что на следующей неделе созвон где обсуждение будет
>>> а напомни что за позиция
>>> Data Scientist
>>> Команда занимается рекомендацией товаров и прочей хуеты
>>> Создания моделькеи
>>> В рамках этой позиции ты должен будет развивать навыки машинного обучения, статистического анализа данных, программирования и многое другое
>>> Ну я так понял

---

Conversations have become more interesting and engaging, although there’s still a risk of losing context. Russian language performance has improved, but errors still occur. I believe that before fine-tuning for a specific task with limited data, like mine, it would be beneficial to first fine-tune the model unsupervised on a large corpus of Russian texts. Additionally, incorporating common conversation partners’ names as separate tokens might enhance the quality.

I wouldn’t say it has turned out to be significantly better than LoRA. It might be more effective to focus solely on a single person and calculate the loss based only on my responses (or someone else’s), instead of trying to learn about each and every conversational partner.

Closing thoughts

Certainly, I had to cherry-pick the results, not because most of the model’s replies were inadequate, but because many were simple responses like “I’ll call you later,” “busy,” and “ok,” which are naturally frequent in conversations. Despite this, it’s clear that the model excels in mimicking the style of the person it’s impersonating. It also captures the commonly discussed topics between two people. However, it significantly lacks context in conversations. Responding to queries like “yo, so?” or “what are your plans for the weekend” is challenging without having full context. Perhaps utilizing a system like Rewind, which captures everything the user does across the computer, could be beneficial.

Code

You can find the code for this project, as well as instructions on how to replicate it on your own Telegram dump, in my GitHub repo. Training logs may be accessed on WandB.

