I’m looking to improve the word count implementation. The goals are:
- It should ignore syntax characters, such as the leading ‘#’ in a heading or the ‘**’ in inline bold syntax.
- It should work as correctly as possible with all languages, CJK in particular.
- It probably won’t all work correctly to start with, but the implementation should make it pretty easy to add fixes for individual cases.
From what I can tell, there are two reasonable approaches to doing word count. You can focus on what you want to be words and count those ranges, or you can focus on the things that you want to be word breaks and count everything that doesn’t match as a word. As a simple example of each:
- Count all non-whitespace ranges as words (the break-focused approach)
- Or count all alphanumeric ranges as words (the word-focused approach)
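To make the difference concrete, here is a rough sketch of both strategies. The function names are made up for illustration, and the alphanumeric version only covers ASCII, so treat it as a toy comparison rather than a real implementation:

```javascript
// Break-focused: whitespace is the only break, so every non-whitespace run is a word.
function countNonWhitespaceRuns(text) {
  var matches = text.match(/\S+/g);
  return matches ? matches.length : 0;
}

// Word-focused: only ASCII alphanumeric runs are words; everything else is a break.
function countAlphanumericRuns(text) {
  var matches = text.match(/[A-Za-z0-9]+/g);
  return matches ? matches.length : 0;
}

var sample = '# Hello **world**';
console.log(countNonWhitespaceRuns(sample)); // 3 ("#" is counted as a word)
console.log(countAlphanumericRuns(sample));  // 2 (syntax characters are skipped)
```

The break-focused version counts the heading marker as a word, while the word-focused version skips syntax characters without any extra work.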
I think to get CJK right you must take the approach of counting what you want to be words (or do multiple passes), since those languages don’t use whitespace between words. With that in mind, this is the most appealing framework for a solution that I’ve come across:
var wordRegex = new RegExp(
  '[A-Za-z0-9_\'\u00C0-\u017F]+|' + // ASCII letters, digits, underscore, apostrophe + accented Latin
  '[\u3040-\u309F]+|' + // Hiragana
  '[\u30A0-\u30FF]+|' + // Katakana
  '[\u4E00-\u9FFF\uF900-\uFAFF\u3400-\u4DBF]', // Single CJK ideographs
  'g');
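For reference, this is roughly how I picture a pattern like that being turned into a count. The `countWords` name is just a placeholder, and the regex is repeated here as a literal so the snippet runs on its own:

```javascript
// Same ranges as above, written as a regex literal.
var wordRegex = /[A-Za-z0-9_'\u00C0-\u017F]+|[\u3040-\u309F]+|[\u30A0-\u30FF]+|[\u4E00-\u9FFF\uF900-\uFAFF\u3400-\u4DBF]/g;

// Hypothetical helper: every regex match counts as one word.
function countWords(text) {
  var matches = text.match(wordRegex);
  return matches ? matches.length : 0;
}

console.log(countWords('# Hello **world**')); // 2
console.log(countWords('これは日本語'));        // 4 (one Hiragana run + three ideographs)
```

Syntax characters such as ‘#’ and ‘**’ fall outside every range, so they are ignored without any special handling.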
It’s pretty simple, and generally does what I want. The problem is that I’m a bit concerned about how many Unicode ranges I’m going to have to put in there to get things right, and I don’t know where to find a simple list of all the ranges that I’ll need.
Can anyone who knows about this sort of thing make a recommendation? Should I go with this brute-force implementation of listing all the ranges that I want counted as words?
The other approach that I can think of is to do two passes. In the first pass, match text from languages that don’t have word breaks, like CJK, and just replace each match with ‘ a ’. Then make a second pass where I look for spaces and other common word-break characters. This “might” make things easier since I wouldn’t need to list every language, and I assume most languages DO use spaces and such for word breaks.
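A rough sketch of that two-pass idea, just to show the shape of it. The names are hypothetical, and the list of break characters is my own guess at a starting set, not a complete one:

```javascript
// Pass 1 pattern: scripts written without spaces (same ranges as the CJK part above).
var unspacedScripts = /[\u3040-\u309F]+|[\u30A0-\u30FF]+|[\u4E00-\u9FFF\uF900-\uFAFF\u3400-\u4DBF]/g;

// Pass 2 pattern: whitespace plus some common punctuation and Markdown syntax
// characters (assumed set; extend as cases come up).
var wordBreaks = /[\s.,;:!?()\[\]{}"#*_\-]+/;

function countWordsTwoPass(text) {
  // Pass 1: replace each unspaced-script match with a placeholder word.
  var normalized = text.replace(unspacedScripts, ' a ');
  // Pass 2: split on break characters and count the non-empty pieces.
  return normalized.split(wordBreaks).filter(function (token) {
    return token.length > 0;
  }).length;
}

console.log(countWordsTwoPass('# Hello **world**')); // 2
console.log(countWordsTwoPass('Hello 日本語'));       // 4
```

With this shape, the only ranges that have to be listed are the scripts that don’t use spaces; everything else falls through to the break-based second pass.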
Anyway, any thoughts, links, etc. would be helpful.
Thanks!