Given sensors and actuators, it must process and combine its sensory inputs and generate choreographed actuator-activation signals that tell its parts what to do.
Spatial representation is essential. It must have a representation of empty space; of things in space; of itself, its parts, and their relationships in space; of the evolving relations between itself, its parts, and other things in space; and of characteristics such as density, texture, and softness, which must be encoded along with space. The physical model of course applies to everything in space, and that spatial representation ought to lend itself to the learnability of physical action.
I have two: the DST from George Mou which has tremendous, even universal value, and SRNNs, which I made up, and may not be useful, or if useful may not be different from other ideas. But at least I'd like to have a clear idea, so I've been working on it, and we'll get there in a bit.
Space is part of SLAM (Simultaneous Localization and Mapping).
Imagine a sequence of LIDAR scan frames at times t0, t1, etc. Each point is taken as a distance and two angles relative to the orientation of the scanner. Perhaps 50 or 200 points complete a single scan.
Or imagine a stereo video camera, which similarly provides a distance and two angles for each pixel of each frame, across a sequence of equally spaced times.
The path task would be to find the low-dimensional transform of the camera through space between frames. The transform involves a 3D translation for the movement of its center and a unit quaternion for the orientation of its focal axis and "up" normal vector, thus 7 values to estimate between each successive pair of frames of 50-4000 angle/distance points.
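To make those 7 values concrete, here is a minimal sketch in Python with numpy; the (w, x, y, z) quaternion order and the function names are illustrative choices of mine, not anything fixed by the pipeline described here.

```python
import numpy as np

def quat_to_matrix(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def apply_pose(points, translation, quat):
    """Apply the 7-value camera pose (3 translation + 4 quaternion values)
    to an Nx3 array of points."""
    R = quat_to_matrix(np.asarray(quat) / np.linalg.norm(quat))  # keep it a unit quaternion
    return points @ R.T + np.asarray(translation)
```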
Start in the original reference frame at t0, call it tA, and record a list of points by angle and distance from there, re-encoding them as Cartesian 3D points relative to the t0 space.
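Here is one way that re-encoding might look, assuming the two angles are an azimuth and an elevation in radians; the actual sensor's angle convention could differ, so treat this as a sketch.

```python
import numpy as np

def scan_to_cartesian(distances, azimuths, elevations):
    """Convert one frame of (distance, azimuth, elevation) samples into an
    Nx3 array of Cartesian points in the scanner's own reference frame.
    Azimuth is taken about the z axis and elevation from the xy-plane;
    that convention is an assumption, not a given."""
    d = np.asarray(distances)
    az = np.asarray(azimuths)
    el = np.asarray(elevations)
    x = d * np.cos(el) * np.cos(az)
    y = d * np.cos(el) * np.sin(az)
    z = d * np.sin(el)
    return np.stack([x, y, z], axis=1)
```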
Next, take the successor frame at time tA+1; call it tB.
Re-encode tB's frame of angle/distance values to 3D Cartesian values as if the camera hadn't moved since the last frame, that is, translated/reoriented according to the accumulated translation/reorientation path from t0 to tA.
Set the tA-to-tB translation and rotation variables to null values (zero translation, identity rotation).
Next, for each point in tB, find the nearest point in tA. Use DST for speed.
(This whole job is mainly a search-for-the-closest-point problem.)
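I can't show the DST itself here, so as a stand-in here is the same search done with an off-the-shelf k-d tree (SciPy's cKDTree); the DST would slot in where the tree is built and queried.

```python
import numpy as np
from scipy.spatial import cKDTree

def nearest_pairs(points_A, points_B):
    """For each point in frame tB, find the nearest point in frame tA.
    A k-d tree stands in for the DST-based search; returns the matched
    tA points and the pair distances."""
    tree = cKDTree(points_A)
    dists, idx = tree.query(points_B)  # one nearest tA neighbor per tB point
    return points_A[idx], dists
```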
Next, do a least-squares distance minimization across that set of tA-tB point pairs, as a function of the single transform encoding frame-to-frame camera translation and reorientation. Some setting of the transform variables will minimize the sum of squared nearest-pair distances. Let that be the tA-to-tB camera transform.
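One standard closed-form way to do that minimization over matched pairs is the SVD-based (Kabsch/Horn) fit; I'm assuming that choice here, since any least-squares rigid fit would do.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares rigid transform (R, t) minimizing
    sum_i || R @ src[i] + t - dst[i] ||^2 over matched point pairs,
    via the standard SVD (Kabsch/Horn) solution."""
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)   # 3x3 cross-covariance of centered pairs
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:              # guard against a reflection solution
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t
```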
Back out the history transform from the tB data so it is all in the t0 reference frame space, which is your updated map.
Add this tB data to the mapped universe of points.
A few matrix multiplies and you'll have your least-squares estimates for the camera path. It should run in real time.
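Putting the steps together, a single-pass sketch of the loop might look like the following, reusing the helpers sketched above (scan_to_cartesian feeding the frames, nearest_pairs, best_rigid_transform). A real ICP-style solver would re-match and re-fit a few times per frame; this just shows the bookkeeping.

```python
import numpy as np

def build_map(frames):
    """frames: list of Nx3 Cartesian point arrays, one per time step, each in
    its own camera frame (e.g. from scan_to_cartesian). Returns the merged
    map in the t0 frame and the estimated camera path."""
    R_acc = np.eye(3)                  # accumulated rotation, camera -> t0 frame
    t_acc = np.zeros(3)                # accumulated translation
    world = [frames[0].copy()]         # the map, in the t0 reference frame
    path = [(R_acc.copy(), t_acc.copy())]
    for pts_B in frames[1:]:
        # Re-encode tB as if the camera hadn't moved since tA.
        guess_B = pts_B @ R_acc.T + t_acc
        # Pair each tB point with its nearest mapped tA point, then fit.
        matched_A, _ = nearest_pairs(world[-1], guess_B)
        R, t = best_rigid_transform(guess_B, matched_A)
        # Fold the frame-to-frame correction into the accumulated pose.
        R_acc = R @ R_acc
        t_acc = R @ t_acc + t
        world.append(pts_B @ R_acc.T + t_acc)
        path.append((R_acc.copy(), t_acc.copy()))
    return np.vstack(world), path
```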
Code this up, Tom, it'll be fun!
(The reverse question: what would it take for a logic machine to do space?)
The correspondence with logic can be extended to (or from) spatial representations, which are intrinsically logic-bearing. If something is there it's not here, and vice versa: space is logical. An organism that cannot respond similarly to similar circumstances in different locations or directions, or respond differently to similar circumstances in different locations or directions, cannot be said to functionally encode space in its internal representations. On the other hand, one that can has the capacity to employ logic, as already stated, by guiding itself based on its effectively spatial encoding. It is essentially able to reason, in exactly our normal sense of logical deduction, from location to response, and to generalize its responses across locations. A predator that cannot reason from proximity to probability of success in the hunt will soon starve. But a predator that has spatial representation capability can just look at its representations (food on the hoof, too far to catch) and do the right thing. We would call that drawing a logical inference, and having spatiality as a characteristic of perceptual or cognitive representations implies that the organism has access to a kind of groundedness in logic, the same groundedness we ourselves experience when we believe chains of reasoning with certainty, no differently from how we believe our eyes.
Responding similarly to similars wherever they are (thus generalizing), and differently to different things whether here or there, is both logical and spatial. I'm saying that high-level reasoning is based on the prior, evolved qualia of spatial perception. Because if the dog is in the house, then the dog is not out of the house.