Someone's bound to mention this Artifexian video, and while it does touch on a few things I did, it still takes a primarily articulatory approach, and it still attempts to make the language approximately pronounceable by humans, which for me is a big no-no.
Now in the case of Commonthroat, it's still using an oral medium that humans can interpret, even if they can't reproduce it. It's perfectly possible to come up with a language that uses a medium that humans have no access to at all, like modulated radio waves or releasing pheromones, but if you're working on something that humans can at least perceive, it makes it a little easier.
The first thing to do is to come up with a very high-level qualitative impression of the language. How would a person who is hearing or otherwise perceiving this language for the first time describe how it sounds? For example, Mandarin is chock-full of sibilants and is famously tonal, Russian palatalizes everything, and (American) English is abundantly nasal. I wanted Commonthroat to sound like the noises a dog makes when it's dreaming.
OK, so we have a high-level description. The next question is: how do you come up with a phonology that gives that impression? The video linked above asks "How would a speaker of this language make this particular sound?", complete with IPA-esque charts listing manners and places of articulation. This was my first approach as well. I spent some time googling "dog vocal tract", but didn't get much I could use, especially since I'm neither a veterinarian nor a linguist.
The light bulb moment for me was when I realized that I was asking the wrong question. Instead of asking "How does a dog make a particular sound?" I should ask "what does each phoneme sound like?" without worrying about the anatomy needed to generate those sounds. That's a much simpler question to answer. Instead of slogging through scientific journals that I can't understand, I just had to listen to my dog as he slept, only supplementing that information with a few popular articles on dog vocalization to tie up loose ends.
So, for example, if you're creating a language for sapient neutron stars, you merely need to do some light research into what sorts of things neutron stars do, and ask yourself, at a high level, which of those things you could hammer out into a language. Neutron stars emit jets of X-rays from their poles and also experience starquakes, so you could potentially turn those phenomena into a language. The key isn't to ask "How could a sentient neutron star produce such and such?" but "How would an observer describe the patterns of X-rays the star is emitting?" or similar. No need to worry about the articulatory mechanisms at play.
Back to my doggo: my next step was to listen to the sorts of sounds he made while sleeping. He's quiet to a fault while awake, to the point that I've forgotten about him in the back yard for hours because he won't bark to be let back in, just sits silently at the door. (Don't worry, I've installed a camera pointing at the door so I can tell when he's ready to come in.) Anyway, he may be a mime when awake, but he's extremely vocal while asleep. After a few nights of observation, I came up with the following noises: whines, yips, growls, and sighs through the nose.
I didn't feel that was quite enough to go on, so I thought about other sounds I've heard dogs make. My first guide dog, a golden retriever, would make these happy grunting noises whenever she greeted a human she recognized. That sounded like it could fit into the overall gamut of sounds without compromising the "dreaming dog" quality of the language, so I decided to add grunts as a category of sound. Of course, yinrih aren't just dogs. They're aliens that happen to look sort of like dogs, so I wasn't strictly limited to only sounds dogs can make. Tigers make a sound called a chuff that I find pleasant, so I decided to add that to the list.
Our bird's-eye view of the phonology now consists of six sounds: whines, growls, grunts, sighs (which I call "huffs"), chuffs, and yips. But six sounds isn't a lot. That's still just over half the size of the smallest phoneme inventory for a human language. (Pirahã and Rotokas, I believe, have something like 10 or 11 phonemes each, depending on who's counting.) Herein lies the other aha moment for me. Instead of thinking in terms of atomic segments, think in terms of a feature space.
The first step here is admittedly a bit of a lazy shortcut: I decided to think in terms of syllables, specifically which sounds can serve as syllable nuclei (vowels) and which cannot (consonants). There's no reason why a xenolang would even have the concept of syllables. Indeed, it seems even in human linguistics the concept of a syllable is a "you know it when you see it" kind of thing. Anyway, I decided that huffs, chuffs, and yips shall serve as consonants, and whines, growls, and grunts shall serve as vowels.
Huffs, chuffs, and yips shall therefore be considered atomic, with no internal features beyond a vague qualitative description. Huffs are a sigh through the nose, chuffs are like huffs but trilled, and yips are quiet little barks. Here's where some people may find my approach a little unsatisfying, especially if you want to produce audio samples of your language. We know what yipping sounds like in general, but at some point an obvious question comes up: how does a yip affect the sounds around it? And how is it affected in turn? This technique doesn't really help answer that question, and I'm left with the somewhat disappointing fact that I honestly can't say how a yip sounds on a technical level.
On to the vowels. We've got three broad vowel qualities, which I call "phonations": whines, growls, and grunts. The vowels are where the concept of a feature space really comes into play. What do I mean by feature space? Think of how you specify colors on a computer. The most common way is to specify how much red, green, and blue a particular color contains. Theoretically, you can define any color by specifying values for these three axes. So we need to think of axes that would define our vowels.

Phonation itself can be considered an axis with three values: whine, growl, and grunt. Dogs can also change the pitch and volume of their vocalizations, so we can add two more axes to our feature space: tone (pitch) and strength (volume). You can have as many values on each axis as you like, but I decided to go with a rather coarse two values for each, with high and low tones, and strong and weak volumes (strengths). Remember, we're going for a high-level qualitative approach. Don't worry about exactly how high or how loud. In the yinrih's case, they're fairly quiet even at their loudest, so even strong (loud) vowels are quiet by human standards.

But is there some other feature we could add as an axis? Of course: length! It's everywhere in human language, and it's trivial to toss it in as a feature, with two values of short and long.
The feature space now has four axes: two lengths (short and long), two tones (low and high), two strengths (weak and strong), and three phonations (whine, growl, and grunt). That's 2 × 2 × 2 × 3, for a grand total of 24 vowels. Our three consonants (huffs, chuffs, and yips) push us up to a total inventory of 27 phonemes. Not too shabby!
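If you want to poke at the feature space yourself, here's a minimal Python sketch. The identifier names below are just my own labels for illustration, not anything official:

```python
# Each vowel is one point in a four-axis feature space.
from itertools import product

LENGTHS = ["short", "long"]
TONES = ["low", "high"]
STRENGTHS = ["weak", "strong"]
PHONATIONS = ["whine", "growl", "grunt"]
CONSONANTS = ["huff", "chuff", "yip"]

vowels = list(product(LENGTHS, TONES, STRENGTHS, PHONATIONS))

print(len(vowels))                    # 24 vowels
print(len(vowels) + len(CONSONANTS))  # 27 phonemes total
print(" ".join(vowels[0]))            # "short low weak whine"
```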
But as the late Billy Mays would say: "I'M NOT DONE YET!" We've got our phoneme inventory, so it's time to start thinking about phonotactics. Let's circle back to the concept of syllables. Internally, syllables consist of an onset, a nucleus, and a coda. That nucleus need not be a single solitary vowel. We can dramatically increase our syllable count by using diphthongs, or as I call them in Commonthroat, contours.

There's no reason you can't just say any two vowels can form a contour; indeed, there's no reason you have to limit it to two vowels. But I wanted to be able to easily describe qualitatively how a syllable sounds, even if I can't tell you the nitty-gritty of how the sound is generated, so I decided to come up with some phonotactic constraints to limit the number of possible contours. You can use whatever criteria you want when coming up with constraints, but my goal was to make it easy to programmatically generate a list of every possible syllable. With that in mind, I decided on two rules that govern which vowels can form contours. First, two vowels may not form a contour if they differ only in length: a short low weak growl and a long low weak growl cannot form a contour. Second, the two vowels must have the same phonation type: a short low weak growl and a short high weak growl can form a contour, but a whine and a growl cannot.
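In code, the two rules boil down to a tiny predicate. This is a sketch, not canon; in particular, I'm assuming a vowel can't pair with an identical copy of itself (there'd be nothing to glide between, and the syllable count later bears this out):

```python
# Vowels are (length, tone, strength, phonation) tuples as in the earlier sketch.
def can_form_contour(v1, v2):
    """Check whether two vowels may combine into a contour."""
    # Rule 2: both vowels must share a phonation type.
    if v1[3] != v2[3]:
        return False
    # Rule 1: they may not differ only in length (identical vowels are
    # excluded too), so they must differ in tone or strength.
    return v1[1:3] != v2[1:3]

print(can_form_contour(("short", "low", "weak", "growl"),
                       ("short", "high", "weak", "growl")))  # True
print(can_form_contour(("short", "low", "weak", "growl"),
                       ("long", "low", "weak", "growl")))    # False: length only
```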
Since this process is all about getting the general vibe of how a language sounds, we should probably come up with a concise way of describing contours as well as simple vowels. We have two vowels, and each vowel has four features, one of which (phonation) will always be the same between them. So let's say that if the two vowels have the same value for a particular feature, we can describe the contour like a simple vowel with that feature. If both vowels are short, the contour can simply be described as short. If both vowels are high, the whole contour is high, and so on. If we want to be nitpicky, we could clarify that a long contour is probably quantitatively longer than a long simple vowel, but the key to this approach is to use broad strokes, not to get into the phonological weeds.
Contours with different tones are trivial to describe, since human linguistics already has a way of describing them. Low to high is rising, and high to low is falling. It took me a bit to think of simple descriptors for contours on the other axes. I dropped the terms "volume", "quiet", and "loud" for describing the loudness of a vowel because I wanted to maintain the impression that the language is always spoken at a comparatively quiet volume. So the category is called "strength", with quiet instead being called "weak" and loud being called "strong". With those terms in place, a contour that moves between strengths can be called weakening or strengthening, and a contour that moves between lengths can be called early or late.
Now we have nice qualitative descriptions for our simple vowels and contours, covering their timing, tone, strength, and phonation, from a short low weak whine to a long high strong grunt, and from an early falling weakening growl to a late rising strengthening whine.
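Here's a sketch of how those descriptions can be generated mechanically. One caveat: the terms above don't pin down which direction of length change counts as "early", so I'm assuming long-then-short here:

```python
def describe_contour(v1, v2):
    """Build a qualitative description of a contour from two vowel tuples."""
    length, tone, strength, phonation = zip(v1, v2)  # pair up each feature
    parts = [
        # A shared feature reads like a simple vowel's; a differing one gets
        # a direction word. "Early" = long-then-short is my assumption.
        length[0] if length[0] == length[1] else
            ("early" if length == ("long", "short") else "late"),
        tone[0] if tone[0] == tone[1] else
            ("rising" if tone == ("low", "high") else "falling"),
        strength[0] if strength[0] == strength[1] else
            ("strengthening" if strength == ("weak", "strong") else "weakening"),
        phonation[0],  # always shared between the two vowels
    ]
    return " ".join(parts)

print(describe_contour(("long", "high", "strong", "growl"),
                       ("short", "low", "weak", "growl")))
# -> "early falling weakening growl"
```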
Nuclei aren't the only part of a syllable; we still need to think about our consonants, the onsets and codas. Since my goal is to keep things simple to program, I've settled on a very simple syllable structure of (C)V(C). Since I can't imagine what a yip would sound like at the end of a syllable, I'll restrict yips to onsets only. So we have three possible onsets (huff, chuff, and yip), with an empty onset bumping it up to four, and two codas (huff and chuff), three counting open syllables.
One quick Python script later, and I have a list of every possible syllable in Commonthroat: 2016 of them, it turns out.
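My script isn't reproduced here, but a minimal reconstruction along the lines of the sketches above arrives at the same count:

```python
from itertools import product

# All 24 vowels: length x tone x strength x phonation.
vowels = list(product(["short", "long"], ["low", "high"],
                      ["weak", "strong"], ["whine", "growl", "grunt"]))

# Contours: same phonation, differing in tone or strength (not length alone).
contours = [(a, b) for a in vowels for b in vowels
            if a[3] == b[3] and a[1:3] != b[1:3]]

nuclei = len(vowels) + len(contours)  # 24 simple vowels + 144 contours = 168
onsets = 3 + 1                        # huff, chuff, yip, or none
codas = 2 + 1                         # huff, chuff, or open syllable

print(onsets * nuclei * codas)        # 2016
```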
That's our phonology done and dusted. The TL;DR: think of how a language sounds to the listener, not how it's produced by the speaker. Keep a high-level qualitative view of the phonology (what impression does it give the listener overall?), and think in terms of an abstract space of feature axes that combine to make a phoneme, rather than simply limiting yourself to atomic segments.