If synchronization is done at user level, the sections protected by a mutex, or waiting on a semaphore operation, are written by the user and can take an arbitrarily long time.
It is possible to use "busy" waiting, that is, to loop while testing one or more shared integer variables, to implement mutexes, etc., but this wastes processor cycles: a waiting thread keeps a processor occupied without doing useful work.
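A minimal sketch of busy waiting, for illustration only. The names `SpinLock`, `wasted_spins`, and `run_counter` are hypothetical and do not come from these notes. Python has no portable atomic test-and-set on a plain integer, so this sketch assumes a non-blocking `threading.Lock.acquire(blocking=False)` as a stand-in for testing and setting the shared variable. The waiting thread never blocks; it simply retries in a loop.

```python
import threading


class SpinLock:
    """Busy-waiting mutex (illustrative): acquire() retries until it succeeds."""

    def __init__(self):
        # Used only as an atomic test-and-set flag; its blocking
        # acquire() is never called in this sketch.
        self._flag = threading.Lock()
        self.wasted_spins = 0  # total failed attempts across all threads

    def acquire(self):
        spins = 0
        # Busy wait: keep testing the flag instead of giving up the processor.
        while not self._flag.acquire(blocking=False):
            spins += 1
        # The lock is held here, so updating the shared counter is safe.
        self.wasted_spins += spins

    def release(self):
        self._flag.release()


def run_counter(num_threads=4, increments=2000):
    """Increment a shared counter under the spin lock from several threads."""
    lock = SpinLock()
    total = 0

    def worker():
        nonlocal total
        for _ in range(increments):
            lock.acquire()
            total += 1  # critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total, lock.wasted_spins
```

Calling `run_counter()` returns the final counter value together with the number of failed attempts. Every failed attempt is a loop iteration that used the processor without making progress, which is the waste described above.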
How can the semaphore operations be made atomic and efficient?
Answer: Make them system calls. The operating system can change the state of a thread to blocked and not schedule it to use a processor until its state is changed to unblocked, so a waiting thread consumes no processor cycles.
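The effect of a blocking wait can be observed with a short sketch. It assumes that the blocking `acquire()` of Python's `threading.Semaphore` is ultimately backed by operating system primitives that put the waiting thread to sleep; this is typical of CPython on common platforms but is an assumption here, not something these notes state. The function name `blocked_wait_demo` is hypothetical. The sketch compares how long the waiter waited (wall-clock time) with how much processor time it used while waiting.

```python
import threading
import time


def blocked_wait_demo(delay=0.2):
    """Return (wall_seconds, cpu_seconds) spent by a thread waiting on a semaphore."""
    sem = threading.Semaphore(0)  # no permits, so the first acquire must wait
    result = {}

    def waiter():
        wall_start = time.monotonic()
        cpu_start = time.thread_time()  # CPU time used by this thread only
        sem.acquire()                   # waits until another thread releases
        result["wall"] = time.monotonic() - wall_start
        result["cpu"] = time.thread_time() - cpu_start

    t = threading.Thread(target=waiter)
    t.start()
    time.sleep(delay)  # keep the waiter waiting for roughly `delay` seconds
    sem.release()      # let the waiter continue
    t.join()
    return result["wall"], result["cpu"]
```

If the waiter is really blocked rather than spinning, the processor time it reports should be a small fraction of the wall-clock time it spent waiting. The exact figures depend on the platform and the scheduler.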