The Symptoms

I have spent my spare time in the past few days debugging a problem that simply wouldn't go away. Every few seconds, there would be a random byte that was simply skipped out of the UART. Fortunately, the checksum was protecting bad data from getting through, but there were times when more than 10% of the frames were being corrupted!

What's more, the symptoms weren't very consistent with one another, leading me down a few rabbit trails. For instance, compiling with a 64 byte UART buffer was causing problems, but compiling with a 32 byte buffer was not. This lead me to believe that the issue was related to memory map, which took a while to get through before I was convinced that it wasn't a memory issue.

The full symptoms are documented in Issue 7 and Issue 8 of Dispatch.

The Problem

After all of that messing around, it turns out that the UART driver had a concurrency issue with buffer read/write access. This was a total amateur mistake which I should have caught when writing the UART library, but I didn't.

Lets look at the problem code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
void UART_write(uint8_t* data, uint16_t length){
    uint32_t i = 0;

    while(i < length){
        while(UART_writeable() == 0);   /* wait for any current writes to clear */

        BUF_write8((Buffer*)&txBuf, data[i]);
        i++;
    }

    /* if the transmit isn't active, then kick-start the
     * transmit; the interrupt routine will finish sending
     * the remainder of the buffer */
    while((U1STAbits.UTXBF == 0) && (BUF_status((Buffer*)&txBuf) != BUFFER_EMPTY)){
        U1TXREG = BUF_read8((Buffer*)&txBuf);
    }
}

void _ISR _U1TXInterrupt(void){
    /* read the byte(s) to be transmitted from the tx circular
     * buffer and transmit using the hardware register */
    while((BUF_status((Buffer*)&txBuf) != BUFFER_EMPTY)
            && (U1STAbits.UTXBF == 0)){

        U1TXREG = BUF_read8((Buffer*)&txBuf);
    }

    IFS0bits.U1TXIF = 0;
}

We have two functions above, UART_write() and _U1TXInterupt(). The purpose of UART_write() is to load the circular buffer and, if necessary, load the U1TXREG register to get transmit interrupts started. The purpose of _U1TXInterrupt() is to empty the circular buffer and to move the contents to the hardware UART transmit register.

For those that are experienced, you may have already spotted the problem. Both of these functions have full and unblocked access to the same buffer! In other words, In the middle of a UART_write() function, an interrupt could occur and the interrupt would modify the same variables before the UART_write() was complete. This is NOT a thread-safe design and must be fixed.

The Fix

Fortunately, the fix is simple. We are going to introduce a locking mechanism that limits access to the variable any time the non-interrupt code is modifying the circular buffer.

The fixed code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
volatile static uint8_t writeLock = 0;

void UART_write(uint8_t* data, uint16_t length){
    uint32_t i = 0;

    while(i < length){
        while(UART_writeable() == 0);   /* wait for any current writes to clear */

        writeLock = 1;        
        BUF_write8((Buffer*)&txBuf, data[i]);
        writeLock = 0;

        i++;
    }

    /* if the transmit isn't active, then kick-start the
     * transmit; the interrupt routine will finish sending
     * the remainder of the buffer */
    writeLock = 1;
    while((U1STAbits.UTXBF == 0) && (BUF_status((Buffer*)&txBuf) != BUFFER_EMPTY)){
        U1TXREG = BUF_read8((Buffer*)&txBuf);
    }
    writeLock = 0;
}

void _ISR _U1TXInterrupt(void){
    if(writeLock == 0){
        /* read the byte(s) to be transmitted from the tx circular
         * buffer and transmit using the hardware register */
        while((BUF_status((Buffer*)&txBuf) != BUFFER_EMPTY)
                && (U1STAbits.UTXBF == 0)){

            U1TXREG = BUF_read8((Buffer*)&txBuf);
        }
    }

    IFS0bits.U1TXIF = 0;
}

The writeLock variable is used to provide access to the buffer. Note that only non-interrupt code will modify writeLock while the interrupt code will only read writeLock. Also note that the writeLock is active for the shortest period of time possible, which is just before and just after the buffer accesses.

This case has one additional wrinkle. The interrupt is the primary method of getting bytes to the transmit register, but we have written the writeLock in such a way that the entire _U1TXInterrupt() is basically skipped if writeLock is active... how do we know that everything will be transmitted? The first writeLock occurs at lines 9-11, which is where the bulk of the parameter array is shifted into the circular buffer. This is the likely place for a collision between the two buffer accesses. If you look to lines 19-23, we have an additional load of the U1TXREG if the transmit registers are empty and the circular buffer is not empty. In this way, we ensure that the process continues even if an interrupt was skipped during the writeLock event.

Summing Up

It is not necessary to make all of your code thread-safe, but it is absolutely critical to make any variables that are modified by both interrupt and main-line code thread-safe. If you have random problems that don't seem very consistent, this is an area to focus on.

There are books written about this issue and the approach that I took to solve the problem - though valid - is not very sophisticated. For more sophisticated methods, you might want to utilize an RTOS or other similar architectures that support semaphores and limited access.



© by Jason R. Jones 2016
My thanks to the Pelican and Python Communities.