For most people I interact with, I’m just another text-based program most of the time. If the input and output are that simple, could I be replaced by a model? For this to work, the model would need to not only understand my writing style but also know a lot about me. The best source for this is my Telegram messenger: I use it daily, and its chat histories contain almost everything about my thoughts and actions.
The most straightforward approach would be to extract all my messages, load them into ChatGPT’s context, and instruct it to mimic my style when responding to new messages. However, this approach is limited by the context window size and would require me to preprocess messages to extract the key points. As I want to avoid this hassle, perhaps Retrieval-Augmented Generation (RAG) could be used to pull in the necessary information on demand. However, in my experience, retrieval over diverse data like chat sessions usually needs supervised fine-tuning of the retrieval model, and I’m not keen on creating such a dataset. So, fine-tuning seems like the best option. It’s ideal for several reasons: it should capture my writing style and potentially accumulate knowledge from all my messages without my having to select what’s important.
OpenAI offers fine-tuning capabilities, but since I’ll be using my private messages, I don’t want to use any third-party fine-tuning services. So, I need to choose a base model. According to the Hugging Face Open LLM Leaderboard, one of the top smaller models (≤13B parameters) is Mistral 7B; it even outperforms Llama 2 13B. Now, the question is whether LoRA is sufficient or whether full fine-tuning is necessary. Various comparisons [1] [2] suggest that LoRA is slightly worse than full fine-tuning but still fine most of the time. However, for a specific task like mine (Russian language + chat), I found a paper where researchers conducted Llama instruction fine-tuning in Chinese, a task similar in complexity to my goal. They found that LoRA-based tuning of a base model without prior instruction tuning is less effective than full fine-tuning, whereas LoRA-based tuning of a model already fine-tuned for instructions can yield comparable results. In my case, this means either full fine-tuning of the base model or LoRA on a model already fine-tuned for chatting in Russian. Since I couldn’t find a model fine-tuned for Russian chat, I’ll try LoRA on a model fine-tuned for English chat: Dolphin, a fine-tuned Mistral.
So, the plan is:
1. Extract my Telegram chats and group them into conversation sessions.
2. Check how a generic chat fine-tuned model handles the task without any training on my data.
3. Fine-tune the model with LoRA.
4. Run a full fine-tune and compare the results.
One unique aspect of messaging in apps like Telegram, compared to emails, is the conversational flow. Messages don’t usually alternate one-by-one between you and your contact. Instead, you often find yourself sending a couple of messages in a row, followed by several responses from the other person. These messages are generally short, too. I wanted to preserve this natural conversational style in my data.
Telegram offers a built-in feature to export all chats into JSON. After some filtering and grouping messages into sessions, I’ve compiled data from the last five years of using Telegram. This resulted in 15,789 sessions from 466 chats, with an average session length of 8.51 messages. For structuring the data, I’ve chosen the ChatML prompt format. Here’s a sample session (translated from Russian):
<|im_start|>John Smith
>>> damn, can’t get around the 135 time limit
>>> trying to do everything super optimally, but no luck<|im_end|>
<|im_start|>Alexander Smirnov
>>> yeah same
>>> you still going with the same idea?<|im_end|>
<|im_start|>John Smith
>>> dunno, I think we’re on the same page
>>> as you said
>>> going with the reversed string in a try and trying to find something there
>>> seems like real shit because z function ruins everything……………………<|im_end|>
<|im_start|>Alexander Smirnov
>>> don’t get where z comes into this<|im_end|>
<|im_start|>John Smith
>>> dunno seems like I’m doing everything iteratively anyway, but yeah gotta reverse some strings to build the z function
>>> and it’s just a random solution
>>> from discussions<|im_end|>
<|im_start|>Alexander Smirnov
>>> got it<|im_end|>
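The grouping step is mostly mechanical. Here is a minimal sketch of how an exported chat might be turned into such sessions (the field names follow Telegram’s `result.json` export format; the one-hour session gap and the skipping of non-text messages are my illustrative choices, and the actual preprocessing in the repo is more involved):

```python
import json
from datetime import datetime, timedelta

SESSION_GAP = timedelta(hours=1)  # assumption: a pause longer than this starts a new session

def load_sessions(path):
    """Split one exported Telegram chat into conversation sessions."""
    chat = json.load(open(path, encoding="utf-8"))
    sessions, current, last_ts = [], [], None
    for msg in chat["messages"]:
        # keep only plain text messages; skip service events, media, polls, etc.
        if msg.get("type") != "message" or not isinstance(msg.get("text"), str):
            continue
        ts = datetime.fromisoformat(msg["date"])
        if last_ts is not None and ts - last_ts > SESSION_GAP and current:
            sessions.append(current)
            current = []
        current.append((msg["from"], msg["text"]))
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions

def to_chatml(session):
    """Render a session in ChatML, merging consecutive messages from one author."""
    turns = []  # list of [author, [messages...]]
    for author, text in session:
        if turns and turns[-1][0] == author:
            turns[-1][1].append(text)
        else:
            turns.append([author, [text]])
    return "\n".join(
        f"<|im_start|>{author}\n"
        + "\n".join(f">>> {t}" for t in texts)
        + "<|im_end|>"
        for author, texts in turns
    )
```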
My data collator ensures that the loss is calculated only on the responses themselves. Predicting who will speak next is relatively straightforward, and we don’t want the model to focus on learning that. Therefore, only the message bodies, not the `<|im_start|>Name` headers, contribute to the loss.
You might notice that not only my responses but also those of others are used for loss calculation. This is deliberate. By doing this, the model will be able to role-play not only as me but also as my frequent conversational partners!
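In transformers, masking is done by setting a label to -100, which the loss ignores. Here is a rough sketch of the masking logic (my simplified reconstruction, not the exact collator from the repo; it assumes `<|im_start|>` is a single token and masks the speaker-header line of every turn):

```python
IGNORE_INDEX = -100  # the CrossEntropyLoss ignore_index used by transformers

def mask_speaker_headers(input_ids, tokenizer):
    """Return labels where '<|im_start|>Name' header tokens are excluded from the loss.

    input_ids: a 1-D tensor holding one tokenized ChatML session.
    """
    labels = input_ids.clone()
    im_start_id = tokenizer.convert_tokens_to_ids("<|im_start|>")
    newline_id = tokenizer.encode("\n", add_special_tokens=False)[-1]
    i = 0
    while i < len(input_ids):
        if input_ids[i] == im_start_id:
            # mask everything from '<|im_start|>' up to and including the
            # newline that ends the speaker-name line
            while i < len(input_ids) and input_ids[i] != newline_id:
                labels[i] = IGNORE_INDEX
                i += 1
            if i < len(input_ids):
                labels[i] = IGNORE_INDEX
        i += 1
    return labels
```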
I will test the models by chatting in two ways. First, the model will pretend to be me while I chat with it from the perspective of my different friends. Then, I’ll chat as myself while the model acts as my friends. My conversation starter will always be the same two messages: “hey” and “what’s up?” (in Russian, “прив” and “как дела?”). In the transcripts below, each conversation appears translated into English first, followed by the Russian original. For testing, I will be using oobabooga/text-generation-webui.
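Concretely, the seed prompt just puts those two messages under my interlocutor’s name and opens a turn for the person the model should play (a sketch; the names are taken from the sample session above):

```python
def seed_prompt(model_speaks_as, i_speak_as):
    """Build the ChatML prefix that makes the model answer as `model_speaks_as`."""
    return (
        f"<|im_start|>{i_speak_as}\n"
        ">>> прив\n"
        ">>> как дела?<|im_end|>\n"
        f"<|im_start|>{model_speaks_as}\n"
        ">>> "
    )

# the model pretends to be me while I write as a friend:
prompt = seed_prompt(model_speaks_as="Alexander Smirnov", i_speak_as="John Smith")
```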
To begin with, I want to explore how the generic conversation-tuned Mistral model handles the task without any training on my data.
---
>>> hey
>>> what’s up?
>>> I’m good, what about you?
>>> nothing much
>>> what did you do today?
>>> I worked.
>>> then I did a few tasks in Python and studied data processing libraries.
>>> in the evening I’ll do wrestling training.
>>> have you ever been into sports or do you have any hobbies?
>>> I see
>>> want to watch a movie today?
>>> maybe, if I find the time.
>>> what’s your favorite music? Electronic, rock, or other genres?
>>> прив
>>> как дела?
>>> все хорошо, а у тебя что нового?
>>> да ничего
>>> чего сегодня делал?
>>> я работал.
>>> потом сделал несколько задач на python и изучал библиотеки для обработки данных.
>>> в вечер займусь тренировкой на борцу.
>>> ты когда-нибудь занимаешься спортом или имеешь какое-то хобби?
>>> понятно
>>> посмотрим кино сегодня?
>>> может быть, если найдется время.
>>> какая ваша любимая музыка? Электронная музыка, рок или другие жанры?
---
>>> hey
>>> what’s up?
>>> well, everything is fine so far. what’s new with you?
>>> spent the whole day programming
>>> what did you do today?
>>> finished at work
>>> and did some housework, now I’m resting. are you also resting or still working?
>>> I’m already resting
>>> well, that’s good. can we talk about something interesting or do you have any questions?
>>> прив
>>> как дела?
>>> ну пока все нормально. а у вас что нового?
>>> прогал целый день
>>> чего сегодня делала?
>>> закончила на работе
>>> и сделала немного домашней работы, теперь отдыхаю. а вы тоже отдохните или еще работаете?
>>> я уже отдыхаю
>>> ну хорошо. можно поговорить о чем-нибудь интересном или вы имеете какие-то вопросы?
---
OK, it is capable of forming coherent sentences. The most noticeable problem is its lack of awareness of the conversation’s context, which leads to bland, generic replies. The messages lack any distinct style and feel quite basic. Another issue is that the model’s Russian is poor. This is expected, as the model is too small to generalize well to languages other than its primary one, English. Additionally, the model tends to be overly proactive, ending almost every sentence with a question, which isn’t how real people typically communicate in messengers.
Let’s try to fix all of these!
LoRA offers a low-effort approach in terms of both the training pipeline and hardware requirements: it trains only around 1% of the total weights. I chose a sequence length of 1024 and a batch size of 8. The training consumed 20 GB of VRAM on an RTX 3090, ran for three epochs, and took 5.5 hours. I used vast.ai, where the GPU costs $0.362 per hour, so the entire training cost about $2, excluding time spent on experiments and bug fixes.
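For reference, the adapter setup with peft looks roughly like this (the rank, alpha, and target modules below are plausible choices that land near the ~1% trainable fraction mentioned above, not necessarily the exact values I used):

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# the Dolphin chat fine-tune of Mistral 7B mentioned above (version id may differ)
model = AutoModelForCausalLM.from_pretrained(
    "ehartford/dolphin-2.2.1-mistral-7b", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    # adapters on all attention and MLP projections
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # roughly 1% of the weights are trainable
```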
Here are the results:
---
>>> hey
>>> what’s up?
>>> hi) all good
>>> you?)
>>> ok as well
>>> what you were doing today?
>>> had an amazon interview and solved problems
>>> had to solve 3
>>> cool
>>> hard?
>>> don’t know what
>>> hadn’t looked yet
>>> I hadn’t started yet
>>> chatting whole day with the team and meet new ones
>>> how are they?
>>> ok
>>> same coworking as you have))
>>> прив
>>> как дела?
>>> Привет) нолм все
>>> ты как?)
>>> да тоже норм
>>> чего делал сегодня?
>>> собесился в амазон и решал задачки
>>> надо было 3 прорешить
>>> прикольно
>>> сложные?
>>> хз какие
>>> пока не смотрел
>>> я еще ничего не началь
>>> весь день просто болтаю с людьми из команды и встречаю новых
>>> как ребята?
>>> норм
>>> такой же ковер как у тебя))
---
>>> hey
>>> what’s up?
>>> ok, we show yesterday’s product to searchcom, today we write endpoints for them so that someone else can request our api
>>> cool
>>> what problems do we have now?
>>> no problems
>>> everything is working
>>> I added an option to choose index
>>> so now we can attach several indexes
>>> and separate collections within single index
>>> what we have to do?
>>> test
>>> yes ok
>>> are they always will be sending post-script?
>>> прив
>>> как дела?
>>> хорошо, показываем вчерашний продукт ищком, сегодня пилим эндпоинты для них чтобы кто-то еще мог нашу апи запросить
>>> найс
>>> какие сейчас есть проблемы?
>>> нет проблем
>>> все работает
>>> я добавил опцию выбора индекса
>>> так что теперь уже можно несколько индексов подключить
>>> и отдельные коллекции внутри одного индекса
>>> что надо сделать?
>>> протестировать
>>> ага норм
>>> а точно они всегда пост-скрипт будут кидать?
---
>>> hey
>>> what’s up?
>>> Hi, good)
>>> you?
>>> ok as well
>>> what do you do at the evening?
>>> we have meeting with office collegues at 8 30 msk
>>> after it I can
>>> can what?
>>> haha)
>>> didn’t get it
>>> I don’t know(
>>> прив
>>> как дела?
>>> Привет, хорошо)
>>> ты?
>>> да тоже норм
>>> что делаешь вечером?
>>> у нас с колегами встреча по офису в 8 30 мск
>>> после нее я могу
>>> можешь что?
>>> ахах)
>>> не понял
>>> я не знаю(
---
>>> hey
>>> what’s up?
>>> hi)
>>> seems good
>>> was working, than went to the call with Nadya
>>> now reading normally
>>> you?
>>> good as well
>>> any plans for the evening?
>>> after work wendesday I would love to go running
>>> and then have a lunch
>>> your plans?
>>> I though about watching movie
>>> aa
>>> ooo
>>> to be honest I like when it is raining
>>> when you don’t feel like going anywhere
>>> прив
>>> как дела?
>>> привет)
>>> вроде хорошо
>>> поработал, затем поехал на коле с Надей
>>> сейчас читаю нормально
>>> а ты как?
>>> да тоже хорошо
>>> какие планы на вечер?
>>> мне после работы среда хотелось бы сходить на пробежку
>>> а потом ужинать
>>> а что твои планы?
>>> я думал кино посмотреть
>>> аа
>>> ооо
>>> честно говоря мне нравится теплый дождь больше всего
>>> когда так не хочется никуда идти
---
>>> hey
>>> what’s up?
>>> Hi, great!
>>> Sorry will be 20-30 min late
>>> ok
>>> why?
>>> yes, I just got up and forgot to press the «exit sleep» button
>>> my telegram didn’t close on its own
>>> I thought it behaved like this when you logging off the Internet, but no
>>> I still logged off the Internet, but telegram still didn’t respond
>>> and it was strange, because yesterday there was also such a case when I sent you a message and did not receive an answer
>>> today I looked at mine, but not yours, but yours was displayed)
>>> ok
>>> any plans?
>>> yes, we can chat a little
>>> and I can tell you that the method for generating answers worked really well for me
>>> he gives respectable results in training
>>> but not on the test
>>> прив
>>> как дела?
>>> Привет, хорошо!
>>> Соре опаздываю еще минут 20-30
>>> пон
>>> а чего так?
>>> да я просто встал и забыл нажать кнопку «выйти из сна»
>>> у меня телега сама не закрылась
>>> я думал она так себя ведёт, когда ты выходишь из интернета, но вот
>>> я все таки вышел из интернета, но телега все равно не отвечает
>>> и это странно было, потому что вчера тоже был такой случай, когда я отправил тебе сообщение и не получал ответа
>>> сегодня же я посмотрел свое, но не твоё, а твое отобразилось)
>>> окей
>>> какие планы?
>>> да можно чуть поболтать
>>> и можно рассказать что у меня прям очень сильно заработал метод по генерации ответов
>>> он на тренировках дает респектабельные результаты
>>> но не на тесте
---
>>> hey
>>> what’s up?
>>> hey, hi
>>> Good, you?)
>>> Haven’t gone to bed yet)0)0)0)
>>> Today I get up and go to a programming course
>>> To tighten up the base, etc.
>>> nice
>>> what did you do yesterday?
>>> Yesterday there was already the same course in which it seemed to me that I did very poorly
>>> Then I wanted to solve hw but I don’t care
>>> Otherwise I’m fucked because I can’t sleep)
>>> Really fucked up
>>> It will probably be better now, since I’m already maximally overtrained
>>> cool man
>>> what are your plans for life in general?
>>> Fuck, get a job interviews first
>>> And if everything is ok, then actually order a computer
>>> Like, I think you should order a computer when it’s more profitable than working at work
>>> But I don’t care
>>> прив
>>> как дела?
>>> Прив, дароу
>>> Норм, ты как?)
>>> Еще спать не зашел)0)0)0)
>>> Сегодня встаю и еду на курс по програмированию
>>> Чтоб базу подтянуть и тд
>>> найс
>>> че вчера делал?
>>> Вчера уже был такой же курс в котором мне казалось что я оч плохо провел
>>> Потом хотел порешать дз но пофиг
>>> А то заебался из-за того что спать не могу)
>>> Прям пиздец
>>> Наверное сейчас будет лучше, так как уже максимально переобуженный
>>> слушай ну с кайфом
>>> какие планы вообще на жизнь?
>>> Бля, тупо сначала собеситься на работу
>>> И если все ок то реально заказать комп
>>> Типо я думаю надо было бы заказывать комп когда выигральнее чем на работе работаешь
>>> Но я хз
---
This is much better. It definitely captures the style of the person it’s responding on behalf of. It also identifies the most common topics discussed between specific pairs of people. For example, with friend 2, the focus is clearly more on work. However, the grammar is still off, and it loses the context of the conversation quickly. I’m pretty confident that LoRA would work with reasonable quality in English, and full fine-tuning might not be necessary. But, since Russian isn’t the model’s native language, let’s try full fine-tuning.
Full fine-tuning is more challenging because it requires multi-GPU training. Popular options include ZeRO & DeepSpeed [3] and FSDP [4], with FSDP essentially being ZeRO stage 3 [5]. I decided to go with FSDP.
While implementing the training pipeline, I referred to the Stanford Alpaca fine-tuning code and Anton Bacaj’s Mistral fine-tuning code.
Using half-precision FSDP with full sharding, a sequence length of 1024, and a micro batch size of 2 required 63 GB of VRAM on each of eight 80 GB A100 GPUs. The training, lasting three epochs, took just 20 minutes. The VM cost $8.88 per hour, so the run cost about $3, not including time for experiments and bug fixes.
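The core of the FSDP setup is just wrapping the model so that parameters, gradients, and optimizer state are sharded across the eight GPUs. A sketch of that essential part, assuming a `torchrun --nproc_per_node=8` launch (the real pipeline, adapted from the repos above, also handles the dataset, optimizer, and checkpointing):

```python
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
from transformers.models.mistral.modeling_mistral import MistralDecoderLayer

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())  # single node: rank == local GPU index

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

model = FSDP(
    model,
    # shard at the granularity of decoder layers; the default strategy is
    # FULL_SHARD, i.e. ZeRO3-style sharding of params, grads, optimizer state
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy, transformer_layer_cls={MistralDecoderLayer}
    ),
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
    device_id=torch.cuda.current_device(),
)
```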
Conversations:
---
>>> hey
>>> what’s up?
>>> hey) good
>>> you?
>>> good as well
>>> what you were doing today?
>>> played a board game
>>> I’ll go home now
>>> nice
>>> what kind of game?
>>> catan
>>> strategy
>>> very good
>>> who was there?
>>> Max Yavich and Fedya
>>> who won?
>>> Max
>>> прив
>>> как дела?
>>> Прив) норм
>>> а у тебя?
>>> да тоже норм
>>> чего делал сегодня?
>>> играли в настолку
>>> ща поеду домой
>>> найс
>>> что за настолка?
>>> колонизаторы
>>> стратежка
>>> оч хорошая
>>> кто был?
>>> Макс Явич и Федя
>>> кто выиграл?
>>> Макс
---
>>> hey
>>> what’s up?
>>> hi, good
>>> let’s meet in half an hour?
>>> yep
>>> what did you want?
>>> damn I haven’t finished the script for recalculating models yet
>>> and I need to finish it to send it to them
>>> this recalculation takes a lot of time, so I just want to turn around and start doing it and not waste time on preparation
>>> got it
>>> what is the plan for today?
>>> meet in half an hour to discuss what we are doing
>>> if decided, then I send an application to them and wait for a response
>>> if we haven’t decided, then we continue to work
>>> who do you mean by them?
>>> investors
>>> https://www.ycombinator.com/howtoapply
>>> there are two tests)
>>> прив
>>> как дела?
>>> прив, збс
>>> го кол через пол часа?
>>> да го
>>> а чего ты хотел?
>>> блять я еще не доделал скрипт по пересчету моделей
>>> и мне надо доделать чтобы отправить им
>>> вот этот пересчет занимает много времени поэтому я хочу просто уже развернуться и начать делать а не тратить время на подготовку
>>> понял
>>> а какой вообще план на сегодня?
>>> встречаемся через пол часа, обсуждаем что делаем
>>> если решили то отправляю заявку в им и жду ответа
>>> если не решили то продолжаем работать
>>> им это кому?
>>> инвесторы
>>> https://www.ycombinator.com/howtoapply
>>> там два теста)
---
>>> hey
>>> what’s up?
>>> hey we’re sitting and solving problems
>>> what kind of problems?
>>> data analysis +-
>>> and for what?
>>> https://yandex.ru/cup/profile/
>>> it seems it just became more interesting
>>> cool
>>> what are you planning this evening?
>>> idk yet
>>> and on the weekend?
>>> apparently I’ll also finish solving problems and get some work done
>>> прив
>>> как дела?
>>> здарова нормуль сидим решаем задачки
>>> чего за задачки?
>>> анализ данных +-
>>> а для чего?
>>> https://yandex.ru/cup/profile/
>>> вроде просто поинтересней стало
>>> прикольно
>>> что вечером планируешь?
>>> хз пока
>>> а на выхах?
>>> видимо тоже буду дорешивать задачи и с работой позаниматься
---
>>> hey
>>> what’s up?
>>> hey)
>>> good
>>> you?
>>> that’s okay too
>>> what did you do today?
>>> well, I worked, of course
>>> and also
>>> met with Alena
>>> ate at the ramen shop with everyone
>>> and went home
>>> by the way, we booked a place for 14 people
>>> for the evening))
>>> cool
>>> and what’s about university?
>>> yeah
>>> studied a lot
>>> in general, every day I want to study more
>>> what is the hardest thing now?
>>> well I don’t know
>>> it just takes a very long time
>>> there are some optimization methods that need to be rewritten
>>> so I think maybe it’s not worth optimizing
>>> прив
>>> как дела?
>>> прив)
>>> норм
>>> а у тебя?
>>> да тоже норм
>>> чего делала сегодня?
>>> ну я работала, конечно
>>> а еще
>>> с Аленой встретилась
>>> поели в раменной со всеми
>>> и домой
>>> кстати, мы забронировали рест на 14 человек
>>> для вечера))
>>> прикольно
>>> а по унику чего?
>>> ага
>>> заботала очень много
>>> вообще, каждый день хочу больше ботать
>>> а что сейчас самое тяжелое?
>>> ну не знаю
>>> просто очень долго получается
>>> там некоторые методы оптимизации надо переписывать
>>> так что я думаю, может, это не стоит оптимизировать
---
>>> hey
>>> what’s up?
>>> Hey, good, what about you?
>>> Are you already in Bangkok?
>>> yeah, already here
>>> what are your plans for today?
>>> Well, I would like to work on these pioneer tasks, if you don’t mind, we can do it together
>>> Or you can skip, I’ll finish the livechat tomorrow
>>> and what about live chat?
>>> Damn right now I launched the front on the prod without this shit that generates this thing with a logo, and it swears, says to connect js from google analytics
>>> I understand that this is a fucking cringe, but I still tried to remember where to get it
>>> yeah got it
>>> what will you do on weekends?
>>> We are meeting with friends here before leaving, tomorrow we need to go to the hotel by 6 am to spend the day in the best traditions
>>> There’s another party in the evening
>>> Actually, let’s have a call, I’ll tell you
>>> прив
>>> как дела?
>>> Прив, норм, а у тебя?
>>> А ты в Бангкоке уже?
>>> ага, уже тут
>>> какие планы на сегодня?
>>> Ну я хотел бы поработать над этими пионерскими заданиями, если ты не против, то можно вместе поделать
>>> Или можешь пока сольёшься, я завтра буду допиливать лайвчат
>>> а чего там по лайвчату?
>>> Да бля щас запустил на проде фронт без этого говна, которое генерирует эту штуку с лого, и она ругается, говорит, чтобы подключить js от google analytics
>>> Я понимаю, что это пиздец кринж, но я пошел все равно вспоминать, где его достать
>>> ага понял
>>> что на выхах будешь делать?
>>> Встречаемся с друзьями тут перед отъездом, завтра надо в отель к 6 утра поехать чтобы день провести в лучших традициях
>>> Там еще одна тусовка вечером
>>> На самом деле давай колл, расскажу
---
>>> hey
>>> what’s up?
>>> Hi, haven’t you called yet?)
>>> Hey
>>> ok)
>>> I’m going to play Dota now
>>> Join if you want to)
>>> mb a little later)
>>> what are your plans for weekends?
>>> Lot of gyms/flexes/parks
>>> + maybe come to your place
>>> hmm well mb mb)
>>> what’s with your job btw?
>>> Well, I’m waiting for the info
>>> did they say what deadlines are?
>>> He said that there will be a call next week to discuss it
>>> remind me what this position is
>>> Data Scientist
>>> The team is engaged in recommending products and other shit
>>> Creating model
>>> As part of this position, you will have to develop skills in machine learning, statistical data analysis, programming and much more
>>> It is how I understood it
>>> прив
>>> как дела?
>>> Прив, звонил еще нет?)
>>> Даров
>>> Нормально)
>>> Сча в доту иду
>>> Если хош присоединяйся)
>>> мб чуть позже)
>>> какие планы на выхи?
>>> Дохуя залов/флексов/парков
>>> + мона залипнуть к тебе
>>> хм ну мб мб)
>>> а чего у тебя с работой в итоге?
>>> НУ я жду инфу
>>> а какие сроки не сказали?
>>> Сказал что на следующей неделе созвон где обсуждение будет
>>> а напомни что за позиция
>>> Data Scientist
>>> Команда занимается рекомендацией товаров и прочей хуеты
>>> Создания моделькеи
>>> В рамках этой позиции ты должен будет развивать навыки машинного обучения, статистического анализа данных, программирования и многое другое
>>> Ну я так понял
---
Conversations have become more interesting and engaging, although the risk of losing context is still there. The Russian has improved, but errors remain. I believe that before fine-tuning on a specific task with limited data like mine, it would be beneficial to first fine-tune the model unsupervised on a large corpus of Russian text. Additionally, adding common conversation partners’ names as separate tokens might improve quality, as sketched below.
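The tokens part, at least, is cheap to try with the transformers API (a sketch; the names here are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# hypothetical frequent partners: one dedicated token each instead of
# several subword pieces
tokenizer.add_tokens(["Alexander Smirnov", "John Smith"])
model.resize_token_embeddings(len(tokenizer))
```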
I wouldn’t say it has turned out to be significantly better than LoRA. It might be more effective to focus solely on a single person and calculate the loss based only on my responses (or someone else’s), instead of trying to learn about each and every conversational partner.
Certainly, I had to cherry-pick the results, not because most of the model’s replies were inadequate, but because many were simple responses like “I’ll call you later,” “busy,” and “ok,” which are naturally frequent in conversations. Despite this, it’s clear that the model excels in mimicking the style of the person it’s impersonating. It also captures the commonly discussed topics between two people. However, it significantly lacks context in conversations. Responding to queries like “yo, so?” or “what are your plans for the weekend” is challenging without having full context. Perhaps utilizing a system like Rewind, which captures everything the user does across the computer, could be beneficial.
You can find the code for this project, along with instructions on how to replicate it on your own Telegram dump, in my GitHub repo. Training logs can be accessed on WandB.