Calculating lexical similarities

ThatAnalysisGuy
rupestrian
Posts: 9
Joined: 01 Mar 2018 00:09

Post by ThatAnalysisGuy »

Hello,
I am interested in calculating lexical similarity between languages and dialects. I wonder what type of mathematics (calculus, algebra, etc.) would be used for these calculations.

Did I put this in the right board? Anyways, thanks for reading this.

sangi39
moderator
Posts: 2702
Joined: 12 Aug 2010 01:53
Location: North Yorkshire, UK

Re: Calculating lexical similarities

Post by sangi39 »

ThatAnalysisGuy wrote:
04 Nov 2019 14:02
There's lexicostatistics, I think, which basically takes a fixed word list and then looks at which words on that list are cognates across the set of languages being compared. The list usually consists of words that are thought to be "universal", or fairly universal, and commonly used in everyday speech, so things like "hand", "leg", "eye", "man", "woman", "to walk", "one", "two", etc., although exactly which words are included differs from list to list. I'm not sure about the maths that goes on afterwards, but the general idea is that by comparing the number of cognates between languages, it's possible to construct family trees: over time, more and more of this "core vocabulary" is lost, so languages that share more of it probably split apart more recently than some other language with which they both share less.
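
As a rough illustration of the counting step described above, here is a minimal Python sketch. It assumes the cognate judgements have already been made by hand; the word list, the language codes and the class labels (`A`, `B`) are illustrative assumptions, not a vetted data set, and `shared_cognate_share` is a hypothetical helper name.

```python
# Hypothetical cognate judgements: for each meaning on the word list, every
# language gets a cognate-class label. Same label = judged to be cognate.
# The labels are illustrative assumptions, not a vetted data set.
cognate_classes = {
    "hand": {"en": "A", "de": "A", "fr": "B"},
    "eye":  {"en": "A", "de": "A", "fr": "A"},
    "dog":  {"en": "A", "de": "B", "fr": "B"},
    "two":  {"en": "A", "de": "A", "fr": "A"},
}


def shared_cognate_share(lang_a, lang_b, classes):
    """Fraction of meanings for which both languages fall in the same class."""
    comparable = [m for m in classes.values() if lang_a in m and lang_b in m]
    if not comparable:
        return 0.0
    shared = sum(1 for m in comparable if m[lang_a] == m[lang_b])
    return shared / len(comparable)


print(shared_cognate_share("en", "de", cognate_classes))  # 0.75
print(shared_cognate_share("en", "fr", cognate_classes))  # 0.5
```

Pairs with a higher share would be grouped together first when building a tree; how the tree is actually built from these numbers is not covered here.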

That only takes into account cognates, though, not whether those cognates are more or less recognisable in the spoken language as it is. For example, German has "Hund", which is fairly recognisable next to its English cognate "hound" (although this has largely been replaced by "dog" in most contexts, so whether you count it as "cognate" on the word list or not is probably debatable), but the French cognate "chien" takes some digging to even realise it's a related word, because it sounds that different. And honestly, I'm not 100% sure how you'd quantify that mathematically (maybe something like the percentage of shared phonological innovations?)
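
One possible way to put a number on "how recognisable" is a normalised edit distance between the two word forms. This is a swapped-in, generic technique, not something taken from lexicostatistics itself, and the sketch below works on spelling rather than on actual sounds, which is a simplifying assumption. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic edit distance: fewest insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_similarity(a, b):
    """1.0 for identical forms, approaching 0.0 for completely different ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("hund", "hound"))  # 1
print(levenshtein("hund", "chien"))  # 4
```

On these spellings "hund" and "hound" come out much closer than "hund" and "chien", which matches the intuition above, but a serious comparison would need phonetic transcriptions and probably weighted substitutions.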
You can tell the same lie a thousand times,
But it never gets any more true,
So close your eyes once more and once more believe
That they all still believe in you.
Just one time.

Zé do Rock
cuneiform
Posts: 155
Joined: 12 Jul 2018 18:22

Re: Calculating lexical similarities

Post by Zé do Rock »

REFORMEE

Mi conta na momento similaritees inter europano linguas, ma solo nacionale linguas. Co 2 exepcion: i zel au plattdüütsh un braziliano portugalian als eigene spraken - dat sein eben die spraken/dialekte dat i kenn: wil ik brasilianer bin, snack ik brasiliaansh portugeesh, un ik heb en "metaplatt" kreert, das het de wört de in de meeste wörtböker un texten zo finden sind, un wo ok de körz en roll speelt.

It is all básd on google transláter, wich is dêfinitly not flawless, but i didnt se anuther alternativ. Mi canè fazer meyor co la linguas ki mi coness, ma lu serè tendencial in relacion au linguas ki mi parla no.

Yo conta la letras comun inter la palavras in 2 lingua com un punt, si una letra cambia un posicion (es is die 3. letre in die word in ein sprak un die 4. o die 2. letter in el andru), i count haf a point, if it chánjes bi mor than 1 posicion i doant count it as a pozzitiv point, but not as a negativ wun either.
Pois ay letras similar - na verdade sones, ja ki yo solo conta la sones - zu exemple /i/ un /y/, /b/ un /v/, /d/ un /t/. Moast of them ar realy similer, sum ar quít difrent but stil the asociation is ézy, for instanse /g/ and /dZ/: lus è bene diferent, meme so mi pensa no ki pro exemplo la pople in Britain havè problemas pro comprende la deutshe palavra 'geografie', meme si lu hav a reale /g/ e no /dZ/.
De cualker manera similare sones conta como medio punt, si lus is un posicion "falso", lus no conta como positivo mas tamben no conta como negativ. Wenn de plus as ein posicion versetzet bi, cele de as negative punt. Or as haf a negativ point, sinse negativ points ar alwàs counted as haf points. Bon, mi conta la comon letras e poi mi substrai half a punto pra cada letra ki la palavras ha non in comun. Na fin la fraccionale numeros no cont, asie ki cuando la 2 palabras ha 2,5 punto, yo conta lus como 2.

Klar, die metod is cimli subjectiv, es cann nur subjectivo bi. But it werks for all languajes, so it dusnt realy matter, or if it matters, then quít rarely.

La 2 maxi proximo linguas ata pre poco dei serou norgish e sverigish, mas agora portugalian e "brazilian" è plu proxim. Talvez lus serè mais proximo de cualker maner, ma lu podè ha relacion co la facto ki google traduyor no ha 2 lingua la, inton lu podè se pasar ki ai 2 traduiciones posible pra una palabra tant in norgish com in sverigish, e google prend una palabra pra norgish y un otra pra sverigish, meme si la 2 palabras ha la meme signific. Dat cann bei portugalian un braziliano nit passee, da es oben nur ein übasetzu gib.

I doo it quít sloly, evry 2 or 3 dás wun werd. Mi finish agora la primera sentens, co 24 palavra. Ata la momento noi ha:

(bi è belaruski, et eestiano, ju serbo-krovatski, mag hungary, mak makedonian, ne hollish, pla bacho deutsh (plattdeutsh), shi albaniano, sko slovakiano, sno sloveniano, su suomiano (finnish), tü türkian)

(se liste unden in el inglishe deil - the ferst nám on the lín is the languaj with wich we compare the uthers)


ENGLISH

I've been counting similarities between European languages, but only national languages. With 2 exceptions: I also count Low German and Brazilian Portuguese as separate languages, because those happen to be the ones I know: being Brazilian I speak Brazilian Portuguese, and I created a meta-Low German (Metaplatt), which uses the words that appear in most dictionaries and texts, and where brevity also plays a role.

It is all based on Google Translate, which is definitely not flawless, but I didn't see another alternative. I could do better for the languages I know, but that would bias the results relative to the languages I don't know.

I count each letter that the words in 2 languages have in common as one point. If a letter shifts by one position (for instance it is the 3rd letter of the word in one language and the 4th or the 2nd in the other), I count half a point. If it shifts by more than 1 position, I don't count it as a positive point, but not as a negative one either.

Then there are similar letters (actually sounds, since I only count sounds), for instance /i/ and /y/, /b/ and /v/, /d/ and /t/. Most of them are really similar; some are quite different but the association is still easy, for instance /g/ and /dZ/: I don't think many people in, say, Britain would have trouble understanding the German word 'geografie', even though it has a real /g/ and not /dZ/. Similar sounds count as half a point. If they're one position "wrong", they count neither as positive nor as negative. If they're more than one position wrong, they count as a negative point, or rather as half a negative point, since negative points are always counted as half points.

So I count the common letters and then subtract half a point for every letter the words don't have in common. In the end fractions are dropped, so when the 2 words have 2.5 points, I count them as 2.
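
The scoring rules above can be sketched in Python roughly as follows. This is only one reading of the description, under several assumptions: sounds are paired greedily, best-scoring pairs first; "letters the words don't have in common" is taken to mean every sound left unpaired in either word; fractions are dropped by truncating towards zero; and the `SIMILAR` set contains only the four pairs mentioned in the post. The names `word_score`, `SIMILAR` and `POINTS` are hypothetical.

```python
# Similar-sound pairs named in the post; the real list is presumably longer.
SIMILAR = {frozenset(p) for p in [("i", "y"), ("b", "v"), ("d", "t"), ("g", "dZ")]}

# Points by (relation, position shift); a shift of 2 stands for "more than 1".
POINTS = {
    ("same", 0): 1.0, ("same", 1): 0.5, ("same", 2): 0.0,
    ("similar", 0): 0.5, ("similar", 1): 0.0, ("similar", 2): -0.5,
}


def relation(x, y):
    if x == y:
        return "same"
    if frozenset((x, y)) in SIMILAR:
        return "similar"
    return None


def word_score(a, b):
    """Score two words given as sequences of sounds (a str or a list of str)."""
    candidates = []
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            rel = relation(x, y)
            if rel is not None:
                shift = min(abs(i - j), 2)
                candidates.append((POINTS[(rel, shift)], i, j))
    # Assumption: greedy pairing, best-scoring pairs first, each sound used once.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    used_a, used_b = set(), set()
    total = 0.0
    for points, i, j in candidates:
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        total += points
    # Half a point off for every sound left unpaired in either word.
    total -= 0.5 * ((len(a) - len(used_a)) + (len(b) - len(used_b)))
    return int(total)  # assumption: fractions dropped by truncation, 2.5 -> 2


print(word_score("bit", "vit"))    # 2  (0.5 + 1 + 1 = 2.5, fraction dropped)
print(word_score("hund", "hand"))  # 2  (3 matches, 2 unpaired sounds)
```

The input words here are made-up test strings. Multi-character sounds such as /dZ/ can be passed as list items, e.g. `word_score(["dZ", "e", "o"], ["g", "e", "o"])`.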

Certainly this is a subjective method; it can only be subjective. But it works for all languages, so it doesn't really matter, or if it matters, then quite rarely.

The nearest languages until a few days ago were Norwegian and Swedish, but now Portuguese and "Brazilian" are nearer. Maybe they would be nearer anyway, but it could have to do with the fact that Google Translate doesn't have 2 separate languages there. It could happen that there are 2 possible translations for a word in both Norwegian and Swedish, and Google picks one word for Norwegian and another for Swedish, even though both words have the same meaning. This can't happen between Brazilian and Portuguese, since there is only one translation.

I do it quite slowly, one word every 2 or 3 days. I'm now finishing the first sentence, with 24 words. So far we have:

(bi is Belarusian, et Estonian, ju Serbo-Croatian, mag Hungarian, mak Macedonian, ne Dutch, pla Low German (Plattdeutsch), shi Albanian, sko Slovak, sno Slovenian, su Finnish, tü Turkish)

(the first name on each line is the language with which the others are compared)


bi-bra3-bu10-cze12-da-de-en-es2-et-fra2-gre-is-it-ju17-la3-li5-mag-mak13-mal-ne-no-pla-pol17-por-ro1-ru21-shi-sko14-sno13-su-sve2-tü-uk15

bra-bu1-cze-da-de-en4-es30-et-fra12-gre-is-it22-ju1-la-li-mag-mak1-mal7-ne-no-pla-pol3-por44-ro8-ru5-shi3-sko2-sno-su-sve-tü-uk3

bu-cze11-da2-de-en2-es5-et-fra2-gre-is-it3-ju12-la4-li2-mag-mak29-mal-ne-no-pla-pol10-por2-ro1-ru11-shi-sko13-sno8-su-sve-tü-uk17

cze-da1-de-en1-es1-et-fra1-gre-is-it-ju12-la3-li2-mag-mak11-mal-ne-no-pla-pol11-por-ro-ru14-shi-sko20-sno11-su-sve-tü-uk9

da-de16-en11-es1-et-fra1-gre1-is4-it-ju-la-li2-mag-mak1-mal-ne10-no32-pla15-pol-por1-ro-ru-shi-sko-sno-su-sve26-tü-uk

de-en11-es-et-fra-gre-is10-it-ju-la1-li3-mag-mak-mal-ne28-no13-pla23-pol-por-ro-ru-shi-sko-sno-su-sve11-tü-uk

en-es5-et-fra5-gre-is5-it3-ju-la1-li-mag-mak1-mal3-ne16-no12-pla12-pol-por4-ro-ru-shi-sko-sno1-su-sve12-tü-uk

es-et-fra12-gre1-is-it26-ju2-la1-li1-mag-mak2-mal7-ne-no-pla-pol2-por31-ro9-ru4-shi3-sko1-sno1-su-sve1-tü-uk5

et-fra-gre-is-it-ju-la-li-mag1-mak-mal-ne-no-pla-pol-por-ro-ru-shi-sko-sno-su18-sve-tü-uk

fra-gre2-is-it15-ju2-la-li-mag-mak2-mal4-ne-no1-pla-pol1-por12-ro6-ru4-shi2-sko3-sno1-su-sve1-tü-uk5

gre-is-it-ju-la-li-mag-mak-mal-ne1-no1-pla1-pol-por1-ro-ru-shi1-sko-sno-su-sve1-tü-uk

is-it-ju-la-li-mag-mak-mal-ne13-no14-pla11-pol-por-ro-ru-shi-sko-sno-su-sve16-tü-uk

it-ju-la-li-mag-mak-mal3-ne-no-pla-pol-por24-ro8-ru2-shi-sko-sno-su-sve-tü-uk3

ju-la3-li2-mag1-mak17-mal-ne1-no1-pla-pol11-por-ro-ru19-shi-sko8-sno21-su-sve1-tü-uk13

la-li15-mag-mak3-mal-ne-no-pla2-pol2-por-ro-ru3-shi1-sko3-sno2-su-sve-tü-uk2

li-mag-mak2-mal-ne2-no1-pla1-pol3-por-ro-ru3-shi1-sko4-sno1-su-sve1-tü-uk4

mag-mak1-mal-ne-no-pla-pol1-por-ro-ru1-shi-sko-sno1-su1-sve-tü-uk

mak-mal-ne-no-pla-pol10-por-ro-ru15-shi-sko8-sno14-su-sve3-tü-uk14

mal-ne-no-pla-pol-por6-ro-ru-shi-sko-sno-su-sve-tü-uk

ne-no17-pla28-pol-por-ro-ru-shi-sko-sno2-su-sve13-tü-uk

no-pla18-pol-por-ro-ru-shi-sko-sno1-su-sve32-tü-uk

pla-pol-por-ro-ru-shi-sko-sno-su-sve15-tü-uk

pol-por-ro-ru18-shi-sko15-sno10-su-sve2-tü-uk16

por-ro8-ru2-shi4-sko-sno-su-sve-tü-uk3

ro-ru3-shi1-sko-sno1-su-sve-tü-uk3

ru-shi-sko14-sno15-su-sve3-tü-uk17

shi-sko-sno-su-sve-tü-uk

sko-sno8-su-sve-tü-uk18

sno-su-sve4-tü-uk9

su-sve-tü-uk

sve-tü-uk2

tü-uk
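
The lines above form one half of a symmetric table: each line starts with a base language and lists the remaining languages, each optionally followed by a number. A minimal Python sketch for reading them into a lookup table could look like this. It assumes that an abbreviation without a number means a score of 0, which the post does not state explicitly; `parse_line` and `build_matrix` are hypothetical names.

```python
import re


def parse_line(line):
    """Split 'sve-tü-uk2' into ('sve', {'tü': 0, 'uk': 2})."""
    tokens = [t.strip() for t in line.split("-") if t.strip()]
    base, scores = tokens[0], {}
    for token in tokens[1:]:
        match = re.fullmatch(r"(\D+)(\d*)", token)
        if match is None:
            continue  # skip anything that is not 'abbreviation + optional number'
        name, digits = match.groups()
        # Assumption: a bare abbreviation means 0 points so far.
        scores[name] = int(digits) if digits else 0
    return base, scores


def build_matrix(lines):
    """Map each unordered language pair to its score."""
    matrix = {}
    for line in lines:
        base, scores = parse_line(line)
        for other, value in scores.items():
            matrix[frozenset((base, other))] = value
    return matrix


sample = ["sno-su-sve4-tü-uk9", "sve-tü-uk2"]
matrix = build_matrix(sample)
print(matrix[frozenset(("sve", "sno"))])  # 4
```

Keying on an unordered pair means a score can be looked up in either direction, which is what a full table would show.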


Esu es encora po dato, in kelke mes mi va saber plu...

I still have little data; in a few months I'll know more...

yangfiretiger121
sinic
Posts: 312
Joined: 17 Jun 2018 03:04

Re: Calculating lexical similarities

Post by yangfiretiger121 »

Zé do Rock wrote:
05 Dec 2019 13:38
Dude, Linguistics & Natlangs probably isn't a good place to post stuff in your conlangs, because threads here are more about answers in English. But if you insist on doing it, please put the English first for that reason. Also, most of us don't know language abbreviations offhand, so that part of the post will likely confuse the OP.
Alien conlangs (Font may be needed for Vai symbols)

sangi39
moderator
Posts: 2702
Joined: 12 Aug 2010 01:53
Location: North Yorkshire, UK

Re: Calculating lexical similarities

Post by sangi39 »

In all fairness to Zé do Rock, they did take on the advice of the Mods by including a "standard" English version of what they were writing in their posts. While I agree that placing the English version first would involve less "slog", the presentation style is much better than it was 16-17 months ago. As I said then, "for anyone familiar with the now-banned user Epaqasnwqar, things could be much worse".

sangi39
moderator
Posts: 2702
Joined: 12 Aug 2010 01:53
Location: North Yorkshire, UK

Re: Calculating lexical similarities

Post by sangi39 »

As for the language abbreviations part, yeah, that's poorly presented. I'm having trouble making clear sense of it. I assume some sort of table presentation would make more sense, but I'm not going to try and make that.
