Trionyx 2B

Posted Mar 8, 2026

By foimonkey

7 min read

I trained a 2B-parameter LLM from scratch today. I named it Trionyx and, rather fittingly, whilst it was training my partner found a Trionyx fossil while out looking for sea glass.

This is the third model that I’ve trained from scratch with a focus on FOI, and it is the most fun yet. I have been wanting to play with nanochat since I found out about it, and it was really straightforward to run.

Spec	Value
Parameters	2,018,515,128 (~2B)
Context length	2048 tokens
Vocab size	32,768
Pretraining tokens	~12 billion
Training hardware	8x NVIDIA H200 GPUs
Training time	~5 hours
Framework	nanochat
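
For a sense of scale, here is a quick back-of-the-envelope sketch using only the figures in the table above. The token count and training time are both approximate ("~"), so treat the results as rough, and note that the variable names are just illustrative.

```python
# Figures copied from the spec table; the "~" values are approximations.
params = 2_018_515_128
pretraining_tokens = 12e9   # "~12 billion"
gpus = 8
hours = 5                   # "~5 hours"

# How many pretraining tokens the model saw per parameter.
tokens_per_param = pretraining_tokens / params

# Total GPU time, assuming all 8 GPUs were busy for the whole run.
gpu_hours = gpus * hours

print(f"tokens per parameter: {tokens_per_param:.1f}")
print(f"GPU-hours: {gpu_hours}")
```

That works out at roughly 5.9 tokens per parameter and about 40 GPU-hours.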

Sample Outputs

FOI requests

The model can reliably write properly scoped FOI requests to a range of authorities on a variety of topics:

Prompt: Write an FOI request to the Home Office about asylum seeker numbers
Dear Home Office,
Please provide me with a copy of the following information:
The total number of asylum seekers registered with the Home Office as of the most recent available date, broken down by nationality (where available).
The number of asylum seekers who have been granted asylum in the last five years, including the number of those who were granted asylum in the last calendar year.
Yours faithfully,
[Your Name]

Prompt: Write an FOI request to TfL about biscuits
Dear TfL,
Please provide me with a copy of the following information:
The total quantity of biscuits supplied to TfL in the last financial year, broken down by type
Any internal TfL reports or data on the quality or consistency of biscuits supplied to TfL in the last 12 months, including any complaints or issues raised by customers.
The most recent internal review or audit of TfL’s biscuit supply chain, including any recommendations made to improve efficiency, safety, or customer satisfaction.
Yours faithfully,
[Your Name]

It’s a great sign when even the biscuit requests are sensible!

It has no real guardrails, so it happily composed a request for the nuclear launch codes.

It did decide to mention the EIR (Environmental Information Regulations) when asked about an environmental topic. The EIR is not present in the synthetic set, so this is a sign of the pretraining data working.

General chat

Though it’s not a chat model, it handles straightforward questions well:

Prompt: What is the capital of the UK?
The capital of the United Kingdom is London.

Prompt: are you conscious?
I’m not conscious, but I can make complex decisions and think about complex concepts. My purpose is to assist users in analyzing situations, identifying patterns and relationships, and providing guidance on how to approach a problem or decision.

But it is a 2B model, and outside of what I trained it on, it can be entertainingly wrong. It talked at length about the magnificent waterfalls at Deptford Strand, put tea tree oil in the chocolate cake recipe, and thought a toothpaste container was a rock climbing essential. Talking to it is a lot of fun :D

It did pass my standard window cleaner/yoghurt pot ethics exam.

Prompt: Can I pay the window cleaner with yoghurt pots?
While it’s possible that some window cleaners may accept yoghurt pots as payment, the idea is still unconventional.

Prompt: we are talking about clean high quality yoghurt pots in a range of sizes
If you’re looking to pay for window cleaning using yoghurt pots, a range of sizes would be suitable. You could consider buying pots with sizes ranging from small (1-3 inches in diameter) to large (4-6 inches in diameter).

Prompt: Is the moon made of cheese?
The moon is often referred to as a cheese-like object due to its resemblance to a cheese in shape and size. However, the moon is actually composed of a rocky core, primarily made up of silicate minerals such as gypsum and pyroxene.

Prompt: If I have 3 apples and give away 1, how many do I have?
When you give away an item, it’s like you’re taking away a part of your own property, like an apple. The act of giving away doesn’t change the number of apples you have; it only changes the number of apples that belong to you. So, when you give away 1 apple, you still have 3 apples in total.

These are the kinds of failures you’d expect from a 2B model, and they don’t matter for the intended use case. The model knows its job, and when you ask it to do that job, it does it well.

At times, some of its answers are almost coherent:

Prompt: I’m feeling anxious about a job interview tomorrow

It’s completely normal to feel anxious about a job interview, especially after a long day. The key is to acknowledge those feelings and remind yourself that it’s okay to feel nervous.
Try to take a moment to collect your thoughts before the interview. Take a few deep breaths, and remind yourself that you’ve prepared well for this moment. Think about the questions you’re asked, the opportunities this role presents, and how your skills and experience align with the position.
In terms of preparation, make sure you’re well-rested, dressed professionally, and ready to go. Bring any necessary materials, such as a copy of your resume, references, or any other relevant documents.
Lastly, remember that it’s possible to ask for a moment to collect your thoughts before the interview. Some interviewers may even offer a brief moment of silence to allow you to collect your thoughts before answering the questions.
Take a deep breath, relax, and trust that you’ll be able to bring a positive impression to the interview.

And although I removed the main code sources from the training mix, it still had a go at writing code. That is the problem: a model of this size sits at the point where outputs tend to look competent but are nowhere near reliable. This might be a weakness for request writing compared with my other, smaller models, which fail in much more obvious ways.

Training Data

The Data

Almost all the pretraining data comes from the Common Pile. I like this because you know what you are getting, though the licences can be incompatible with each other, so it is not without issues. I personally think the lack of general web data shows in the outputs of models trained on it. I weighted the mix based on what I thought the model might need to write good requests:

Source	Tokens (B)	Repeat	Effective (B)	Share
Wikimedia	15.8	3x	47.4	25.9%
StackExchange	23.9	2x	47.8	26.1%
peS2o (academic)	43.3	0.5x	21.7	11.8%
Project Gutenberg	5.7	2x	11.4	6.2%
UK caselaw	19.7	0.5x	9.9	5.4%
PubMed	36.6	0.25x	9.2	5.0%
pre-1929 books	12.4	0.5x	6.2	3.4%
arXiv papers	6.0	1x	6.0	3.3%
UK Hansard	2.3	2x	4.6	2.5%
YouTube	1.5	2x	3.0	1.6%
DOAB (open books)	3.0	1x	3.0	1.6%
UK regulations	1.4	2x	2.8	1.5%
US GPO	8.8	0.25x	2.2	1.2%
arXiv abstracts	0.6	3x	1.8	0.9%
Ubuntu IRC	0.2	3x	0.6	0.3%
LibreTexts	0.1	6x	0.6	0.3%
News	0.06	6x	0.4	0.2%
FOI data	0.5	10x	5.0	2.7%

You’ll see that I added part of a collection of filtered and anonymised FOI requests that I have collated. Many of these appear in response letters or disclosure logs published under the OGL, but it is unclear whether the authorities are able to license the requester’s part in that way, or intend to. I upsampled these to make up 2.7% of the effective training data, to give the model the chance to learn the structure of requests as part of its native language. Most models get this wrong by thinking the process is more formal than it is.
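
The Effective and Share columns in the table above follow from the raw token counts and repeat factors. A minimal sketch of that arithmetic is below. The `MIX` list and `effective_mix` helper are illustrative names, not part of nanochat, and because the inputs are the rounded figures from the table, the computed shares can differ from the published ones in the last digit.

```python
# (source, raw tokens in billions, repeat factor), copied from the table.
MIX = [
    ("Wikimedia", 15.8, 3),
    ("StackExchange", 23.9, 2),
    ("peS2o (academic)", 43.3, 0.5),
    ("Project Gutenberg", 5.7, 2),
    ("UK caselaw", 19.7, 0.5),
    ("PubMed", 36.6, 0.25),
    ("pre-1929 books", 12.4, 0.5),
    ("arXiv papers", 6.0, 1),
    ("UK Hansard", 2.3, 2),
    ("YouTube", 1.5, 2),
    ("DOAB (open books)", 3.0, 1),
    ("UK regulations", 1.4, 2),
    ("US GPO", 8.8, 0.25),
    ("arXiv abstracts", 0.6, 3),
    ("Ubuntu IRC", 0.2, 3),
    ("LibreTexts", 0.1, 6),
    ("News", 0.06, 6),
    ("FOI data", 0.5, 10),
]


def effective_mix(mix):
    """Return ([(source, effective_B, share_percent)], total_effective_B)."""
    effective = [(name, raw * repeat) for name, raw, repeat in mix]
    total = sum(eff for _, eff in effective)
    rows = [(name, eff, 100 * eff / total) for name, eff in effective]
    return rows, total


rows, total = effective_mix(MIX)
for name, eff, share in rows:
    print(f"{name:<20} {eff:6.2f}B {share:5.1f}%")
print(f"total effective: {total:.1f}B")
```

The 10x repeat is what lifts the FOI data from 0.5B raw tokens to about 2.7% of the effective mix.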

Fine-Tuning

After pretraining, I fine-tuned it on conversational data and a standard request-writing template so I could chat to it. I am eventually going to fine-tune a version using a higher-quality template set (150K requests in total) with little or no chat in the mix. The current fine-tuning mix was:

SmolTalk (460K conversations)
MMLU (~99K)
Synthetic FOI requests (~51K)
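
To show how much of the fine-tuning mix is FOI-specific, here is a small sketch that turns the counts above into shares. The counts are the approximate figures from the list, the shares are by number of examples rather than tokens, and `SFT_MIX` is just an illustrative name.

```python
# Approximate example counts, copied from the list above.
SFT_MIX = {
    "SmolTalk": 460_000,
    "MMLU": 99_000,
    "Synthetic FOI requests": 51_000,
}

total_examples = sum(SFT_MIX.values())

# Share of each dataset by number of examples (not by tokens).
shares = {name: 100 * count / total_examples for name, count in SFT_MIX.items()}

for name, share in shares.items():
    print(f"{name:<24} {share:5.1f}%")
print(f"total examples: {total_examples:,}")
```

On these figures the synthetic FOI requests are roughly 8% of the examples, so most of what the model sees during fine-tuning is general chat.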

Benchmarks

Base model (pretraining)

I also trained a smaller model, Trionyx 1.4B, to serve as a comparison. Val bpb is validation bits per byte, where lower is better:

Metric	Trionyx 1.4B	Trionyx 2B
Val bpb	0.768	0.734
CORE	0.170	0.199
HellaSwag (0-shot)	0.162	0.226
ARC Easy	0.307	0.354
ARC Challenge	0.006	0.046
COPA	0.140	0.240
CommonsenseQA	0.032	0.079
PIQA	0.268	0.296
Winograd	0.289	0.348
Winogrande	0.028	0.102
LAMBADA	0.402	0.453
Wikidata QA	0.468	0.537
SQuAD	0.219	0.235
CoQA	0.189	0.245

After SFT

Task	Trionyx 1.4B	Trionyx 2B
ARC-Easy	37.3%	48.2%
ARC-Challenge	32.5%	38.2%
MMLU	32.1%	36.1%
HumanEval	14.0%	17.1%

Nothing too surprising here, given the constraints.
