I thought this might be interesting to post. After the simple "Hello World" Silverlight example a week or so ago, I wanted to see if I could start posting some of the interesting little apps I'd put together over the past few years, and this one came to mind. I'll post a couple of reminder notes as gems for the little bits I'd forgotten about getting this working.
Here's a screenshot with some results (the app to play with is at the end of the article). I'm afraid I can't post source for this one, but if you're interested drop me a note and I can explain how I went about it and some more details.
You can see a ranked list of guesstimates of what the algorithm thinks are the most likely languages; in this case the top guess is German.
I can't quite remember how I stumbled upon this, but I've always had an interest in languages, and I needed to be able to take a set of textual items and find out what language the main body of each text is in. I'll not go into specific details, but mostly this was focused on transcription of spoken words, so the algorithm probably isn't so good at literary or other specialised writing.
The algorithm is based on statistical frequency analysis. I got a leg-up initially from some helpful corpus (I can't remember which, but if it comes to mind I'll update with a reference and due credit). Eventually I processed my own set of word frequencies which I've used here.
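Since I can't post the real source, here is a minimal Python sketch of the general idea of ranking languages by word frequency. It is not the code behind the app: the tiny frequency tables, the `UNSEEN` floor value and the function names are all made up for illustration, and a real table would hold thousands of words per language.

```python
import math
import re

# Hypothetical relative word frequencies, for illustration only.
WORD_FREQS = {
    "ENG": {"the": 0.06, "and": 0.03, "is": 0.02, "of": 0.03, "i": 0.02},
    "GER": {"der": 0.04, "und": 0.03, "ist": 0.02, "ich": 0.02, "das": 0.03},
    "FRE": {"le": 0.04, "et": 0.03, "est": 0.02, "je": 0.02, "de": 0.05},
}

# Assumed floor probability for words that are missing from a table.
UNSEEN = 1e-6


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def rank_languages(text, freqs=WORD_FREQS):
    """Return (language, score) pairs, most likely language first.

    Each score is the sum of log frequencies of the words in the text,
    so a language whose table knows more of the words scores higher.
    """
    words = tokenize(text)
    scores = {
        lang: sum(math.log(table.get(word, UNSEEN)) for word in words)
        for lang, table in freqs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_languages("Ich weiß, das ist der Hund und die Katze")
for lang, score in ranking:
    print(lang, round(score, 2))
```

With these toy tables only the German table recognises any of the words, so GER comes out at the top of the ranked list, much like the screenshot.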
The app is a simple version of what I finally used for testing. It never actually got used in anger as things moved on. However, I did test it against a test corpus of over 40,000 items: it was pretty speedy and had an accuracy of just over 95%. It also identified a number of mis-categorised items in the test corpus I used :-)
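For anyone wanting to run the same kind of check, a rough sketch of measuring accuracy against a labelled corpus might look like the following. The `evaluate` helper, the stand-in `toy_classify` function and the three sample items are hypothetical and only there to make the example run; they are not my test harness or corpus.

```python
def evaluate(labelled_items, classify):
    """Return (accuracy, disagreements) for (text, expected_code) pairs.

    Disagreements are worth reading by hand: some are classifier errors,
    others can turn out to be mis-labelled items in the corpus itself.
    """
    correct = 0
    disagreements = []
    for text, expected in labelled_items:
        predicted = classify(text)
        if predicted == expected:
            correct += 1
        else:
            disagreements.append((text, expected, predicted))
    accuracy = correct / len(labelled_items) if labelled_items else 0.0
    return accuracy, disagreements


def toy_classify(text):
    """Stand-in classifier for the demo: 'und' means German, else English."""
    return "GER" if "und" in text.lower().split() else "ENG"


sample = [
    ("Hund und Katze", "GER"),
    ("cat and dog", "ENG"),
    ("Brot und Butter", "ENG"),  # deliberately mis-labelled item
]
accuracy, suspects = evaluate(sample, toy_classify)
print(f"accuracy: {accuracy:.0%}")
for text, expected, predicted in suspects:
    print(f"check: {text!r} labelled {expected}, classified {predicted}")
```

The deliberately mis-labelled third item shows up in the list of disagreements, which is the same way mis-categorised corpus items tend to surface.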
The languages that were hardest to discriminate were, unsurprisingly, the pairs with the closest similarity in linguistic terms: Croatian (HRV) and Serbian (SRP), and Russian (RUS) and Ukrainian (UKR). However, to my surprise, it could actually do a reasonable job of distinguishing between Portuguese and Brazilian Portuguese. I guess this might have owed more to the test corpus, the word frequency choices and the focus on spoken materials.
Anyway, it's a fun little tool, so play away and see how well it does at spotting languages. The sample I've put online has the following 34 languages:
Albanian (ALB), Portuguese (Brazilian) (BPT), Bulgarian (BUL), Chinese (CHI), Czech (CZE), Danish (DAN), Dutch (DUT), English (ENG), Estonian (EST), Finnish (FIN), French (FRE), German (GER), Greek (GRE), Hebrew (HEB), Croatian (HRV), Hungarian (HUN), Icelandic (ICE), Indonesian (IND), Italian (ITA), Lithuanian (LIT), Latvian (LAT), Macedonian (MAC), Norwegian (NOR), Persian (PER), Polish (POL), Portuguese (POR), Romanian (RUM), Russian (RUS), Slovakian (SLO), Slovenian (SLV), Spanish (SPA), Serbian (SRP), Swedish (SWE), Turkish (TUR)
I have got an additional 26 languages to add to this list, but not all of them discriminate as well due to the size of the test corpus I had, so let me know if you're interested in others. Obviously Korean and Japanese discriminate well without frequency analysis due to their scripts.
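To illustrate the point about scripts, here is a small Python sketch that spots Korean and Japanese purely from Unicode ranges (Hangul for Korean, hiragana and katakana for Japanese). The codes "KOR" and "JPN" and the function name are my own placeholders here, and the ranges cover only the main blocks, so treat it as a sketch rather than a complete detector.

```python
def detect_by_script(text):
    """Return "KOR", "JPN" or None based on the characters in the text."""
    # Hiragana (U+3040-U+309F) and katakana (U+30A0-U+30FF).
    kana = sum(1 for ch in text if "\u3040" <= ch <= "\u30ff")
    # Hangul syllables (U+AC00-U+D7A3) and Hangul jamo (U+1100-U+11FF).
    hangul = sum(
        1
        for ch in text
        if "\uac00" <= ch <= "\ud7a3" or "\u1100" <= ch <= "\u11ff"
    )
    if hangul > kana:
        return "KOR"
    if kana > 0:
        # Japanese also uses Chinese characters, so kana is the giveaway.
        return "JPN"
    return None  # fall back to frequency analysis


print(detect_by_script("안녕하세요"))
print(detect_by_script("こんにちは"))
print(detect_by_script("Hello"))
```

Anything that returns None here would still need the word frequency approach.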
Have fun!!