Parsing Roman Numerals using implicit casting in C#

Lately I’ve become fascinated with the Latin language. I’m working on a project that converts photographs of Latin inscriptions on medieval statues into translated text. One of the challenges is parsing years, usually expressed in the form of Roman Numerals.

After building a parser class I noticed that it had a lot of nice characteristics: parsing, operator overloading, implicit conversions. A really nice way to play around with C#. Read here Calculations with Romand Numerals using C#.

Intro
Let's define some numerals
1. Constants
Storage
Parsing a Roman Numeral string into a number
1. The code
2. Let's be nice
How about writing Roman Numerals?
1. ToString() implementation
Let's see it works
Summary
Comments

Let's define some numerals

When you read about Roman Numerals you'll notice that there are different styles. Last week I visited the Gros Horloge in Rouen, France. Look at the numerals of the clock:

Notice how the numeral for 4 isn't IV, but IIII. It turns out that it is the (perfectly fine) long notation called the additive form. Long notations like VIIII are a bit harder to read, that's why the subtractive notation IX is used most times. In some documents, both long and short notations are used. The V might be written as IIIII and LX as XXXXXX.

Did you know there's no zero in Roman Numerals? It wasn't invented yet (for more on that here). The writers used NULLA, which is Latin for nothing.

Ow... and it turns out that 18 can be written as both XIIX and IIXX - you got to love those Romans.

Constants

So let's define these values as constants:

// 0, nothing, nada. The zero wasn't invented yet ;-)
public const string NULLA = "NULLA";

// numerals are the keys to values
public static readonly IReadOnlyDictionary<string, int> VALUES = 
    new ReadOnlyDictionary<string, int>(new Dictionary<string, int>
{
    {"I",       1 },
    {"IV",      4 },
    {"V",       5 },
    {"IX",      9 },
    {"X",       10 },
    {"XIIX",    18 },
    {"IIXX",    18 },
    {"XL",      40 },
    {"L",       50 },
    {"XC",      90 },
    {"C",       100 },
    {"CD",      400 },
    {"D",       500 },
    {"CM",      900 },
    {"M",       1000 },

    // alternatives from Middle Ages and Renaissance
    {"O",       11 },
    {"F",       40 },
    {"P",       400 },
    {"G",       400 },
    {"Q",       500 }
});

In the Middle Ages and the Renaissance, more numerals were added. They are not used anymore, but might be useful in parsing - that's why they are included.

Storage

First, we need an object to store the data and perform operations on. Let's create something simple that is nullable (a class).

public class RomanNumeral
{
    private readonly int _number;

    public int Number => _number;

    public RomanNumeral(int number)
    {
        if (number < 0)
        {
            throw new ArgumentOutOfRangeException(
                nameof(number),
                "Number should be positive."
            );
        }

        _number = number;
    }
}

Parsing a Roman Numeral string into a number

A Roman Numeral is an ordered string of characters. The values of the string are in descending order. So XI means 10 + 1. MDCIX means 1000 + 500 + 100 + 9. Notice that IX does not mean 1+10, but 10-1 (it's subtractive!). Let's build an array with all the numerals a descending order of their numerical value:

// all the options that are used for parsing, in their order of value
public static readonly string[] NUMERAL_OPTIONS =
{
    "M", "CM", "D", "Q", "CD", "P", "G", "C", "XC", "L",
    "F", "XL", "IIXX", "XIIX", "O", "X", "IX", "V", "IV", "I"
};

We can use this array to parse a string. We need to do:

Start with resultNumber with value 0.
Find the first numeral from the NUMERAL_OPTION that matches the start of the string.
If the numeral cannot be found, the input string is invalid and NULL must be returned (and the routine must be terminated).
Otherwise add the value of the numeral (stored in the VALUES array) to the resultNumber.
Remove the numeral from the input string.
Repeat until the input string is empty. Make sure that only the same or lower numerals are used in the repetitions. Higher numerals are invalid.

The code

This will result in the following code:

public static RomanNumeral Parse(string str)
{
    var strToRead = str;
    var resultNumber = 0;

    // start with M and iterate through the options
    var numeralOptionPointer = 0;

    // continue to read until the string is empty 
    // or the numeral options pointer has exceeded all options
    while (
            !String.IsNullOrEmpty(strToRead) && 
            numeralOptionPointer < NUMERAL_OPTIONS.Length)
    {
        // select the current numeral
        var numeral = NUMERAL_OPTIONS[numeralOptionPointer];

        // read numeral -> check if the numeral is used,
        // otherwise move on to the next option
        if (!strToRead.StartsWith(numeral))
        {
            numeralOptionPointer++;
            continue;
        }

        // add the vaue of the found numeral
        var value = VALUES[numeral];
        resultNumber += value;

        // remove the letters of the numeral from the string
        strToRead = strToRead.Substring(numeral.Length);
    }

    // if the whole string is read, return the numeral
    if (String.IsNullOrEmpty(strToRead))
    {
        return new RomanNumeral(resultNumber);
    }

    // string is invalid
    return null;
}

Let's be nice

To harden the code, we could add some extra processing rules:

An empty string is numeral 0.
The input string should be uppercased.
The string NULLA is numeral 0.
The character U can be converted to V.
Some medical recipes will end in a J instead of an I. Replace this end-character.
Check if the whole string is in the VALUES array and if so, return that value (small optimization).

Just add the following to the method:

if (String.IsNullOrEmpty(str))
{
    return new RomanNumeral(0);
}

// upper case the string
var strToRead = str.ToUpper();

// nulla? means nothing 0 wasn't invented yet ;-)
if (strToRead == NULLA)
{
    return new RomanNumeral(0);
}

// if ends in J -> replace it to I (used in medicine)
if (strToRead.EndsWith("J"))
{
    strToRead = strToRead.Substring(0, strToRead.Length - 1) + "I";
}

// if a U is present, assume a V
strToRead = strToRead.Replace("U", "V");

// check simple numbers directly in dictionary
if (VALUES.ContainsKey(str))
{
    return new RomanNumeral(VALUES[str]);
}

How about writing Roman Numerals?

First, we need to distinguish between the two notations - the additive and the subtractive:

// subtractive notation uses these numerals
public static readonly string[] SUBTRACTIVE_NOTATION =
{
    "M", "CM", "D", "CD", "C", "XC", "L", 
    "XL", "X", "IX", "V", "IV", "I"
};

// the additive notation uses these numerals
public static readonly string[] ADDITIVE_NOTATION =
{
    "M", "D", "C", "L", "X", "V", "I"
};

Notice that the values of the numerals in the arrays are both in descending order. With these ordered arrays, we can device the following algorithm:

Use the array of the notation.
Traverse each item of the array.
If a numeral item value is smaller than the number, subtract the number and add the numeral to the result string. Try again.
If a numeral does not fit, move on to the next.
Repeat until the number is 0.
Return the result as the full Roman Numeral.

ToString() implementation

Converted into code the algorithm will look like this:

public override string ToString()
{
    return ToString(RomanNumeralNotation.Substractive);
}

public string ToString(RomanNumeralNotation notation)
{
    if (Number == 0)
    {
        return NULLA;
    }

    // check notation for right set of characters
    string[] numerals;
    switch (notation)
    {
        case RomanNumeralNotation.Additive:
            numerals = ADDITIVE_NOTATION;
            break;
        default:
            numerals = SUBTRACTIVE_NOTATION;
            break;
    }

    var resultRomanNumeral = "";

    // start with the M and iterate back
    var position = 0;

    // substract till the number is 0
    var value = Number;

    do
    {
        var numeral = numerals[position];
        var numeralValue = VALUES[numeral];

        // check if the value is in the number
        if (value >= numeralValue)
        {
            // substract from the value
            value -= numeralValue;

            // add the numeral to the string
            resultRomanNumeral += numeral;

            // subtractive numeral?
            // advance position because IVIV does not exist
            bool isSubtractiveNumeral = numeral.Length > 1;
            if(isSubtractiveNumeral)
            {
                position++;
            }

            continue;
        }

        position++;
    }
    while (value != 0);

    return resultRomanNumeral;
}

The advantage of parsing the notation array is how subtractive numerals are handled. XL is parsed before X. This prevents the algorithm from reading 10+50 instead of 40. It also has the advantage of discarding wrongly formed numerals.

Let's see it works

I've created some unit tests to show how the Roman Numeral implementation works:

Assert.AreEqual(1910, RomanNumeral.Parse("MDCCCCX").Number);
Assert.AreEqual(1910, RomanNumeral.Parse("MCMX").Number);

Assert.AreEqual("CXVIIII", new RomanNumeral(119).ToString(RomanNumeralNotation.Additive));
Assert.AreEqual("CXIX", new RomanNumeral(119).ToString());

Summary

I've shown how one can parse Roman Numerals with an easy C# class. You can download the full class from GitHub and use it in your projects. In the next blog I'll show you how to implement calculation operators (+, -, *, /, %) and implicit casting (to int and string).