Typo-detection
We try to detect possible typos in wire names.
Verilog implementations allow the use of implicit wires. Because
of this, a typo in a wire name might go undetected. As part of our use-set analysis, we now try to detect wires that might be typos.
How do we know whether a wire name is actually misspelled, and is not simply
some implicit wire that a logic designer is using? It is not clear that there
is any hard and fast way to tell. Our goal, then, is to develop a heuristic
method to identify as many typos as possible (and as few non-typos as
possible).
Here is our approach at a high level.
- First, we will only consider wires that are not explicitly defined, since
these wires seem to be the obvious starting place to look for typos. Note that
it is quite easy to identify these wires, e.g., see make-implicit-wires,
which adds declarations to the modules to make these wires explicit.
- Next, we will only consider the subset of these wires that are either
unused or unset, per our ordinary use-set analysis. The idea behind
this restriction is that typos are probably relatively rare, and it is unlikely
that someone would misspell the name in both contexts.
- The names of the remaining wires are then analyzed and compared with the
names of existing, defined wires. The idea here is that most typos are
probably minor things: character transpositions or omissions, incorrect case,
improper underscores, etc. Only if the wire's name is somewhat similar to
another wire do we report it as a potential typo.
Analyzing Wire Names
How can we determine if one wire name is similar to another? This problem
is reminiscent of spell-checking, where approaches such as Soundex are used to
determine if say that two words sound the same.
The reason that Soundex codes work well for spell checking prose is that,
even though people may not know how to spell the words they want to write, they
usually know how to say them. So, if someone wants to write the word
Cholera, they might at least write something like Colara instead.
The two "words" are phonetically similar, so by assigning and comparing
phonetic codes, we can easily determine which words to suggest.
Soundex codes do not seem particularly appropriate for our task. Actually,
our job seems easier. First, let us look at some of the wire names used
throughout the design. There are some all-uppercase signals such as:
- STPCLKB
- THERMDC
- THERMTRIPB
And there are some all-lowercase names, such as:
But mostly what we find are mixed-case names, often with underscores, such
as:
- mmSnoopDataValid_CX_P
- msBusWriteLOCK_C1_T0_P
- rrT0McTm1PdgClr_P
- orvHi
- x1I3_ReadGflags_X
- rnRomEnSel_A
- matchb39_6b
- bcDWCBAEnt_C0_P
Wire Name Partitioning
These names are such nonphonetic garble that ordinary algorithms like
Soundex would probably not produce anything meaningful. On the other hand,
these names seem quite easy to split up into pieces. And, probably, as someone
trying to type one of these names, I am at least going to get most of these
pieces right. For instance, if I am trying to write rnRomEnSel_A, I might
forget the abbreviations used and type something like rnRomEnableSel_A, or
forget the underscore and type rnRomEnSelA, or miscapitalize and type
something like rnRomEnsel_A.
So, our first move is to split up the wire names into lists of pieces.
Then, we can compare signal names on a piece-by-piece basis. To carry out this
partitioning, we adopt the following strategy:
- Consider numbers as lowercase characters.
- We split into pieces upon encountering any underscore or other
non-alphanumeric characters, and these special characters are simply dropped.
For instance, given a name like rnRomEnSel_A, we will split into
rnRomEnSel and A.
- We split on every transition from a lowercase character to an uppercase
character. For instance, in rnRomEnSel_A, this rule leads us to split
into rn, Rom, En, and Sel_A (which is further split into
Sel and A by the previous rule).
- If we see two uppercase characters followed by a lowercase character, we
split between the uppercase characters. For instance, in bcDWCBAEnt_C0_P,
this rule leads us to split between bcDWCBA and Ent_C0_P (which are
further split by the previous rules into bc, DWCBA, Ent,
C0, and P.
Comparing Partitioned Names
Now, suppose we have an implicit wire, a, and an explicit wire
b, and we want to decide whether a might be a typo and the
designer really meant to write b. We begin by partitioning both wire
names; call the resulting partitionings x and y.
If x and y are equal, then a and b differ only
in that one might have underscores where the other does not. We think this is
very likely to be a typo.
Otherwise, we compare the pieces of x and y in order. If
exactly one piece differs, then we may wish to consider whether one to be a
typo of the other after some further analysis. If more than a single piece is
different, we think x and y are sufficiently distinct from one
another that we will not consider x to be a typo of y.
This further analysis is motivated by looking at examples of matches.
- First, we require that the two pieces have the same (case insensitive)
leading character. The motivation here is to prevent considering a wire like
bcDWCBAEnt_C0_P to be a typo of bcDWCBAEnt_D0_P.
- We additionally require that adding a trailing b or B to either
name does not cause them to become identical.
- Finally, if the pieces have the same length, and are identical except that
they end with distinct numbers, we reject the match. This is intended to
prevent matching between signals like bcDWCBAEnt_C0_P and
bcDWCBAEnt_C1_P.
Subtopics
- Typo-read-lowercase-part
- Read as many lowercase characters as possible, but stop early if a special
is encountered.
- Typo-read-part
- Read the first "part" of a wire name.
- Typo-read-uppercase-part
- Read as many uppercase characters as we can.
- Typo-read-special
- Typo-detect-aux
- We build an alist that might associate some of the wires to the
lists of wires we think they could be typos of.
- Typo-detect
- Build an alist associating any "bad" wires with any "good" wires
that they may be typos for.
- Typo-find-plausible-typos1
- Walk down the partitioning alist and gather the names of all signals
that Part might be a typo for.
- Typo-mismatch-plausibly-typo-p
- X and Y are single pieces that are mismatched. Do they satisfy our
criteria for being considered "possibly a typo"?
- Typo-partitioning-alist
- Build an alist mapping strings to their partitionings.
- Typo-partition
- Fully partition a wire name.
- Vl-typo-first-mismatch
- Given two same-length partitionings, return their first mismatch as a
pair.
- Typo-partitions-plausibly-typo-p
- X and Y are whole-partitionings. Do they satisfy our criteria
for being considered "possibly a typo"?
- Vl-typo-count-mismatches
- Given two same-length partitionings, determine how many of their
pieces are mismatched.
- Typo-numbers
- *typo-special-substrings-chars*
- *typo-numbers*