# Parsing Roman Numerals using C#

**Date:** 2017-08-30  
**Author:** Kees C. Bakker  
**Categories:** .NET / C#  
**Original:** https://keestalkstech.com/parsing-roman-numerals-using-csharp/

![Parsing Roman Numerals using C#](https://keestalkstech.com/wp-content/uploads/2017/08/4998921795_71d7d60baf_o.jpg)

---

Lately I’ve become fascinated with the Latin language. I’m working on a project that converts photographs of Latin inscriptions on medieval statues into translated text. One of the challenges is parsing years, usually expressed in the form of Roman Numerals.

After building a parser class I noticed that it had a lot of nice characteristics: parsing, operator overloading, implicit conversions. A really nice way to play around with C#. Read here [Calculations with Roman Numerals using C#](https://keestalkstech.com/2017/08/calculations-with-roman-numerals-in-csharp/).

The original code was written in 2017. Currently it is 2024, so I've updated the code to .NET 8. To make things easier, we made sure that we cannot parse a RomanNumeral to `null`, to make `nullable` easier to implement.

[outline]

## Let's define some numerals

When you read about Roman Numerals you'll notice that there are different styles. Last week I visited the [Gros Horloge in Rouen, France](https://www.wikiwand.com/en/Gros_Horloge). Look at the numerals of the clock:

![Gros Horloge in Rouen, France](https://keestalkstech.com/wp-content/uploads/2017/08/Le_Gros_Horloge_Rouen_Normandy.jpg)
*Gros Horloge in Rouen, France.*

Notice how the numeral for 4 isn't IV, but IIII. It turns out that it is the (perfectly fine) long notation called the *additive form*. Long notations like VIIII are a bit harder to read, that's why the [subtractive notation](https://www.wikiwand.com/en/Subtractive_notation) IX is used most times. In some documents, both long and short notations are used. The V might be written as IIIII and LX as XXXXXX.

Did you know there's no *zero* in Roman Numerals? It wasn't invented yet ([for more on that here](https://www.livescience.com/27853-who-invented-zero.html)). The writers used *NULLA,* which is Latin for *nothing.*

Ow... and it turns out that 18 can be written as both XIIX and IIXX - you got to love those Romans.

### Constants

So let's define these values as constants:

```cs
// 0, nothing, nada. The zero wasn't invented yet ;-)
public const string NULLA = "NULLA";

//values - a readonly dictionary where the numerals are the keys to values
public static readonly IReadOnlyDictionary<string, int> VALUES = new Dictionary<string, int>
{
    {"I",       1 },
    {"IV",      4 },
    {"V",       5 },
    {"IX",      9 },
    {"X",       10 },
    {"XIIX",    18 },
    {"IIXX",    18 },
    {"XL",      40 },
    {"L",       50 },
    {"XC",      90 },
    {"C",       100 },
    {"CD",      400 },
    {"D",       500 },
    {"CM",      900 },
    {"M",       1000 },

    //alternatives from Middle Ages and Renaissance
    {"O",       11 },
    {"F",       40 },
    {"P",       400 },
    {"G",       400 },
    {"Q",       500 }
}
.AsReadOnly();
```

In the Middle Ages and the Renaissance, more numerals were added. They are not used anymore, but might be useful in parsing - that's why they are included.

## Storage

First, we need an object to store the data and perform operations on. Let's create something simple that is nullable (a `class`).

```cs
public class RomanNumeral
{
    private readonly int _number;

    public int Number => _number;

    public RomanNumeral(int number)
    {
        if (number < 0)
        {
            throw new ArgumentOutOfRangeException(
                nameof(number),
                "Number should be positive."
            );
        }

        _number = number;
    }
}
```

## Parsing a Roman Numeral string into a number

A Roman Numeral is an ordered string of characters. The values of the string are in descending order. So XI means 10 + 1. MDCIX means 1000 + 500 + 100 + 9. Notice that IX does not mean 1+10, but 10-1 (it's subtractive!). Let's build an array with all the numerals a descending order of their numerical value:

```cs
// all the options that are used for parsing, in their order of value
public static readonly string[] NUMERAL_OPTIONS =
{
    "M", "CM", "D", "Q", "CD", "P", "G", "C", "XC", "L",
    "F", "XL", "IIXX", "XIIX", "O", "X", "IX", "V", "IV", "I"
};
```

We can use this array to parse a string. We need to do:

1. Start with `resultNumber` with value `0`.
2. Find the first numeral from the `NUMERAL_OPTION` that matches the start of the string.
3. If the numeral cannot be found, the input string is invalid and `NULL` must be returned (and the routine must be terminated).
4. Otherwise add the value of the numeral (stored in the  `VALUES` array) to the `resultNumber`.
5. Remove the numeral from the input string.
6. Repeat until the input string is empty. Make sure that *only* the same or lower numerals are used in the repetitions. Higher numerals are invalid.

### The code

This will result in the following code:

```cs
public static RomanNumeral Parse(string str)
{
    var strToRead = str;
    var resultNumber = 0;

    // start with M and iterate through the options
    var numeralOptionPointer = 0;

    // continue to read until the string is empty 
    // or the numeral options pointer has exceeded all options
    while (
            !String.IsNullOrEmpty(strToRead) && 
            numeralOptionPointer < NUMERAL_OPTIONS.Length)
    {
        // select the current numeral
        var numeral = NUMERAL_OPTIONS[numeralOptionPointer];

        // read numeral -> check if the numeral is used,
        // otherwise move on to the next option
        if (!strToRead.StartsWith(numeral))
        {
            numeralOptionPointer++;
            continue;
        }

        // add the vaue of the found numeral
        var value = VALUES[numeral];
        resultNumber += value;

        // remove the letters of the numeral from the string
        strToRead = strToRead.Substring(numeral.Length);
    }

    // if the whole string is read, return the numeral
    if (String.IsNullOrEmpty(strToRead))
    {
        return new RomanNumeral(resultNumber);
    }

    // string is invalid
    throw new InvalidCastException("The string is not a valid Roman numeral.");
}
```

### Let's be nice

To harden the code, we could add some extra processing rules:

- An empty string is numeral `0`.
- The input string should be uppercased.
- The string `NULLA` is numeral `0`.
- The character `U` can be converted to `V`.
- Some medical recipes will end in a `J` instead of an `I`. Replace this end-character.
- Check if the whole string is in the `VALUES` array and if so, return that value (small optimization).

Just add the following to the method:

```cs
if (String.IsNullOrEmpty(str))
{
    return new RomanNumeral(0);
}

// upper case the string
var strToRead = str.ToUpper();

// nulla? means nothing 0 wasn't invented yet ;-)
if (strToRead == NULLA)
{
    return new RomanNumeral(0);
}

// if ends in J -> replace it to I (used in medicine)
if (strToRead.EndsWith("J"))
{
    strToRead = strToRead.Substring(0, strToRead.Length - 1) + "I";
}

// if a U is present, assume a V
strToRead = strToRead.Replace("U", "V");

// check simple numbers directly in dictionary
if (VALUES.ContainsKey(str))
{
    return new RomanNumeral(VALUES[str]);
}
```

## How about writing Roman Numerals?

First, we need to distinguish between the two notations - the additive and the subtractive:

```cs
// subtractive notation uses these numerals
public static readonly string[] SUBTRACTIVE_NOTATION =
{
    "M", "CM", "D", "CD", "C", "XC", "L", 
    "XL", "X", "IX", "V", "IV", "I"
};

// the additive notation uses these numerals
public static readonly string[] ADDITIVE_NOTATION =
{
    "M", "D", "C", "L", "X", "V", "I"
};
```

Notice that the values of the numerals in the arrays are both in descending order. With these ordered arrays, we can device the following algorithm:

1. Use the array of the notation.
2. Traverse each item of the array.
3. If a numeral item *value *is smaller than the number, *subtract* the number and add the numeral to the result string. Try again.
4. If a numeral does not fit, move on to the next.
5. Repeat until the number is `0`.
6. Return the result as the full Roman Numeral.

### ToString() implementation

Converted into code the algorithm will look like this:

```cs
public override string ToString()
{
    return ToString(Notation);
}

public string ToString(RomanNumeralNotation notation)
{
    if (Number == 0)
    {
        return NULLA;
    }

    //check notation for right set of characters
    string[] numerals =
        notation == RomanNumeralNotation.Additive
        ? ADDITIVE_NOTATION
        : SUBTRACTIVE_NOTATION;

    var resultRomanNumeral = "";

    //start with the M and iterate back
    var position = 0;

    //substract till the number is 0
    var value = Number;

    do
    {
        var numeral = numerals[position];
        var numeralValue = VALUES[numeral];

        //check if the value is in the number
        if (value >= numeralValue)
        {
            //substract from the value
            value -= numeralValue;

            //add the numeral to the string
            resultRomanNumeral += numeral;

            //multiple numeral? advance position because things like 'IVIV' does not exist
            bool isMultipleNumeral = numeral.Length > 1;
            if (isMultipleNumeral)
            {
                position++;
            }

            continue;
        }

        position++;
    }
    while (value != 0);


    return resultRomanNumeral;
}
```

The advantage of parsing the notation array is how subtractive numerals are handled. `XL` is parsed before `X`. This prevents the algorithm from reading 10+50 instead of 40. It also has the advantage of discarding wrongly formed numerals.

## Let's see it works

I've created some xUnit tests to show how the Roman Numeral implementation works:

```cs
Assert.Equivalent(1910, RomanNumeral.Parse("MDCCCCX").Number);
Assert.Equivalent(1910, RomanNumeral.Parse("MCMX").Number);

Assert.Equivalent("CXVIIII", new RomanNumeral(119).ToString(RomanNumeralNotation.Additive));
Assert.Equivalent("CXIX", new RomanNumeral(119).ToString());
```

## Summary

It is fun to write a parser for Roman Numerals, and it is not that hard. In the [next blog](https://keestalkstech.com/2017/08/calculations-with-roman-numerals-in-csharp/) I'll show you [how to implement calculation operators](https://keestalkstech.com/2017/08/calculations-with-roman-numerals-in-csharp/) (`+`, `-`, `*`, `/`, `%`) and implicit casting (to `int` and `string`).

The full project can be found on GitHub: [roman-numerals](https://github.com/KeesCBakker/keestalkstech-code-gallery/tree/main/05.roman-numerals). I've published the various partial classes.

## Changelog

- 2024-12-01: rewrite to .NET 8.
- 2017-08-30: initial article.
