Discussion:
Things to add for next release
Anders Magnusson
2014-10-15 12:43:34 UTC
Hi,

here are some things that I have on the wish list for next release.
Comments/ideas/additions?

-- Ragge

- Optimizations:
* Move liveness analysis and dce to optim().
* Add constant propagation, cse, value tracing and strength
reduction in loops.
* Add Peter K's type size reductions.
* Let stack space be assigned in pass2 to allow for structs etc. in
registers.
* Use graph coloring for stack space assignment. May give
interesting interaction with register allocation.

- Cpp
* Clean up the use of a large heap for macros (a leftover from V6).
* Use recursive macro evaluation instead of rescanning (faster).

- C language
* Add a bunch of C11 functions (they already exist, but are not official).

- Pass1
* Simplify handling of function calling conventions.
* Use node attributes for passing extended information between
pass1 and 2.
* Better handling of complex numbers (and vector calculations).
* Floating-point evaluation in software in the compiler, to help
cross-compilation.

- Pass2
* Let mkext generate code for COLORMAP(), which is tedious to do manually.
* Share constant evaluation code with pass1.

- C2
* Add support for instruction interleaving/delays (for RISC platforms).
* Simpler peephole optimization.
* Merge constant data (like strings).
Peter Kuschnerus
2014-10-15 16:32:59 UTC
Post by Anders Magnusson
* Move liveness analysis and dce to optim().
That is great.
I had problems with dce now being in regs.c: things I did in
myoptim() were clobbered by dce, because dce runs afterwards, so I
could not use dce at all. I would like dce to run before myoptim().

Do you mean to move dce to optimize() in optim2.c
or to optim() in optm.c?


I also suggest considering the "case switch by MRST"
that I filed in JIRA as PCC-425.


I also have some other code snippets to enhance optim.c using COMOP.


Regards
Peter Kuschnerus
Anders Magnusson
2014-10-16 11:52:28 UTC
Post by Peter Kuschnerus
Post by Anders Magnusson
* Move liveness analysis and dce to optim().
That is great.
I had problems with dce now being in regs.c: things I did in
myoptim() were clobbered by dce, because dce runs afterwards, so I
could not use dce at all. I would like dce to run before myoptim().
Hm, clobbered by dce? That sounds like a bug, since it should only
remove stuff that is unused.
Post by Peter Kuschnerus
Do you mean to move dce to optimize() in optim2.c
or to optim() in optm.c?
Optim2. It is a pass2 function. There are a number of functions that
should be iterated until nothing more is altered.
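In outline it could look like the sketch below, where the pass names,
the interpass type and the driver are placeholders rather than pcc's
real interfaces:

#include <stdio.h>

struct interpass { int dummy; };    /* stand-in for the real IR */

/* Each pass returns nonzero if it changed anything; these stubs
 * stand in for real dce/constant propagation/cse passes. */
static int dce(struct interpass *ip) { (void)ip; return 0; }
static int constprop(struct interpass *ip) { (void)ip; return 0; }
static int cse(struct interpass *ip) { (void)ip; return 0; }

/* Run the passes round-robin until a whole round changes nothing. */
static void
optimize_to_fixpoint(struct interpass *ip)
{
    int changed;

    do {
        changed = 0;
        changed |= dce(ip);
        changed |= constprop(ip);
        changed |= cse(ip);
    } while (changed);
}

int
main(void)
{
    struct interpass ip = { 0 };

    optimize_to_fixpoint(&ip);
    printf("fixed point reached\n");
    return 0;
}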
Post by Peter Kuschnerus
Then I suggest also to consider the "case switch by MRST"
that I placed in JIRA PCC-425
That's on the way, I just didn't have it in the list of things :-)
Post by Peter Kuschnerus
I also have some other code snippets to enhance optim.c using COMOP.
Please send it to Jira.

-- Ragge
Peter Kuschnerus
2014-10-16 14:20:18 UTC
Permalink
Post by Anders Magnusson
Hm, clobbered by dce? That sounds like a bug, since it should only
remove stuff that is unused.
No, it is not a bug in dce.
Due to stack operations, I mangle the trees in a special way
that is compatible with ngenregs() but not understood by dce.
But that is another story.


Regards
Peter Kuschnerus
Michael Rock
2014-10-16 17:54:17 UTC
On Wed, 15 Oct 2014 14:43:34 +0200
Post by Anders Magnusson
Hi,
here are some things that I have on the wish list for next release.
Comments/ideas/additions?
-- Ragge
* Move liveness analysis and dce to optim().
* Add constant propagation, cse, value tracing and strength
reduction in loops.
This reminds me: I still have some code for that part in my version,
but I had not managed to get it into shape, and I have had very
little time in the last few years. So, sorry for not piping up
sooner. But maybe I have the time now.
Post by Anders Magnusson
* Add Peter K's type size reductions.
* Let stack space be assigned in pass2 to allow for structs etc.
in registers.
* Use graph coloring for stack space assignment. May give
interesting interaction with register allocation.
It could be integrated into register allocation since it would only be
a kind of special case.
Post by Anders Magnusson
- Cpp
* Clean up the use of a large heap for macros (a leftover from V6).
* Use recursive macro evaluation instead of rescanning (faster).
- C language
* Add a bunch of C11 functions (they already exist, but are not official).
- Pass1
* Simplify handling of function calling conventions.
* Use node attributes for passing extended information between
pass1 and 2.
* Better handling of complex numbers (and vector calculations).
* Floating-point evaluation in software in the compiler, to help
cross-compilation.
- Pass2
* Let mkext generate code for COLORMAP(), which is tedious to do manually.
* Share constant evaluation code with pass1.
- C2
* Add support for instruction interleaving/delays (for RISC platforms).
* Simpler peephole optimization.
* Merge constant data (like strings).
s***@yaccman.com
2014-10-16 18:11:16 UTC
Let me share a prejudice about register allocation. I'm pretty convinced
that graph coloring is the wrong way to do it, at least as far as the
textbooks describe it. The general approach is to take the parse tree, do
a simple tree walk to linearize the operations, and then do a graph
coloring on the intervals of liveness of the variables to allocate the
registers.

The problem with this is the first step -- doing a blind tree walk.
Suppose you don't have enough registers, and will have to spill a
computation to a temporary and then reload it later to continue. If you
knew what you were going to need to spill, then the best code comes from
computing and spilling it while all the registers are still available.
You keep doing this, making the tree smaller and simpler, until you can
do the remaining tree with no spills using the available registers.

There's a JACM paper I wrote with Al Aho that discusses this problem, and
we prove that if the tree has no common subexpressions and all the
registers are interchangeable, you can generate optimal code in time that
is linear in the size of the tree, with effectively any reasonable
instruction set from RISC to the VAX. If there are common subexpressions,
even if the only ones are of the form leaf op leaf, then the problem is
NP-Complete, even if there are an infinite number of registers!

The bottom line is that register allocation is not the hard part of code
generation -- it's selecting the correct order in which to do the
operations in the tree.
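
As a concrete illustration of that ordering problem, here is a minimal
sketch of Sethi-Ullman numbering for binary expression trees. It
computes, bottom up, how many interchangeable registers each subtree
needs; evaluating the needier subtree first is what avoids spills.
The node type and names are illustrative, not pcc's:

#include <stdio.h>

struct enode {
    struct enode *left, *right;    /* both NULL for a leaf */
};

/*
 * Registers needed to evaluate a subtree without spilling: a leaf
 * needs one; an interior node needs one more than its children when
 * they need the same amount, otherwise the larger of the two
 * (evaluate the needier child first, then the other).
 */
static int
suneed(struct enode *p)
{
    int l, r;

    if (p->left == NULL)
        return 1;
    l = suneed(p->left);
    r = suneed(p->right);
    return l == r ? l + 1 : (l > r ? l : r);
}

int
main(void)
{
    struct enode a = {0,0}, b = {0,0}, c = {0,0}, d = {0,0};
    struct enode ab = {&a,&b}, cd = {&c,&d}, top = {&ab,&cd};

    /* (a+b)+(c+d) needs three registers; a right-leaning chain
     * a+(b+(c+d)) evaluated needier-side-first stays at two. */
    printf("registers needed: %d\n", suneed(&top));
    return 0;
}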

Just my 2c

Steve
Post by Michael Rock
On Wed, 15 Oct 2014 14:43:34 +0200
Post by Anders Magnusson
* Use graph coloring for stack space assignment. May give
interesting interaction with register allocation.
It could be integrated into register allocation since it would only be
a kind of special case.
Michael Rock
2014-10-17 07:18:20 UTC
Hello,

you raise some very valid points here.

On Thu, 16 Oct 2014 11:11:16 -0700
Post by s***@yaccman.com
The problem with this is the first step -- doing a blind tree walk.
Suppose you don't have enough registers, and will have to spill a
computation to a temporary and then reload it later to continue. If
you knew what you were going to need to spill, then the best code
comes if you compute and spill it when all the registers are
available -- the code would be the most efficient. You keep doing
this, making the tree smaller and simpler, until you can do the
remaining tree with no spills using the available registers.
There is at least one compiler I know of which did some backtracking
in code generation for this problem. If the registers were not
enough, it would check which alternative produced the code best
matching the requirements from the user. Sometimes the peepholer
might get rid of a temporary register, or there were other things
like code size or speed to be considered. Then it chose, from the
results, the decision that matched best. An interesting approach, but
only really of interest when you gun for the last byte and the last
cycle to be gained, IMHO.
Post by s***@yaccman.com
The bottom line is that register allocation is not the hard part of
code generation -- it's selecting the correct order in which to do the
operations in the tree.
Indeed. I am currently toying with the idea of generating code "the
wrong way", starting with the last instruction to be executed and
providing the inputs needed. This would also get rid of dead code.
And it would enable the compiler to see into the future (because the
future is what it has already generated) and decide which operation
to do now, knowing for example the latencies of pipelines and memory
loads. So a load from memory would be delayed, if possible, until its
value is available without wait states; only, since generation runs
backwards, that delaying means moving the load "upwards". I'm sure
this has been done somewhere already, but I am not aware of it being
done on a wide scale.
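
For what it is worth, a toy sketch of that backwards placement:
cycles are counted from the end, each result has a single made-up
consumer and latency, and real dependence tracking is omitted:

#include <stdio.h>

#define NINSN 4

struct insn {
    const char *name;
    int latency;    /* cycles until the result is usable */
    int consumer;   /* index of the single user, -1 if none */
    int cycle;      /* issue slot, counted backwards from the end */
};

int
main(void)
{
    /* load -> add -> store, plus one independent move */
    struct insn prog[NINSN] = {
        { "load  r1,(r2)", 3,  1, 0 },
        { "add   r1,r1,1", 1,  2, 0 },
        { "store (r2),r1", 1, -1, 0 },
        { "move  r4,r5",   1, -1, 0 },
    };
    int i;

    /*
     * Walk from the last instruction toward the first: a producer
     * must issue at least 'latency' cycles before its consumer,
     * which is how the slow load below gets hoisted; independent
     * instructions (the move) can fill the resulting gap.
     */
    for (i = NINSN - 1; i >= 0; i--)
        prog[i].cycle = prog[i].consumer < 0 ? 0 :
            prog[prog[i].consumer].cycle + prog[i].latency;

    for (i = 0; i < NINSN; i++)
        printf("%-16s issues %d cycle(s) before the end\n",
            prog[i].name, prog[i].cycle);
    return 0;
}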
Post by s***@yaccman.com
Just my 2c
Steve
Post by Michael Rock
On Wed, 15 Oct 2014 14:43:34 +0200
Post by Anders Magnusson
* Use graph coloring for stack space assignment. May give
interesting interaction with register allocation.
It could be integrated into register allocation since it would only
be a kind of special case.
Regards
Mike
Anders Magnusson
2014-10-17 17:07:59 UTC
Morning Steve,
Post by s***@yaccman.com
Let me share a prejudice about register allocation. I'm pretty convinced
that graph coloring is the wrong way to do it, at least as far as the
textbooks describe it. The general approach is to take the parse tree, do
a simple tree walk to linearize the operations, and then do a graph
coloring on the intervals of liveness of the variables to allocate the
registers.
I partly agree with you. I think that graph coloring has its use, but
not necessarily in the situation given above.
Post by s***@yaccman.com
The problem with this is the first step -- doing a blind tree walk.
Suppose you don't have enough registers, and will have to spill a
computation to a temporary and then reload it later to continue. If you
knew what you were going to need to spill, then the best code comes if you
compute and spill it when all the registers are available -- the code
would be the most efficient. You keep doing this, making the tree smaller
and simpler, until you can do the remaining tree with no spills using the
available registers.
There's a JACM paper I wrote with Al Aho that discusses this problem, and
we prove that if the tree has no common subexpressions and all the
registers are interchangeable, you can generate optimal code in time that
is linear in the size of the tree, with effectively any reasonable
instruction set from RISC to the VAX. If there are common subexpressions,
even if the only ones are of the form leaf op leaf, then the problem is
NP-Complete, even if there are an infinite number of registers!
The bottom line is that register allocation is not the hard part of code
generation -- it's selecting the correct order in which to do the
operations in the tree.
Just my 2c
I like this kind of 2c opinion, since there are not so many people
with knowledge about this :-)
Some background on the current graph-coloring implementation:

When I wrote the graph coloring code almost 10 years ago, I first did
quite a bit of reading on register allocation to find out what would
be best to implement to generate better code from a
register-allocation point of view.

When I looked at what came out of pass1 when compiling some large
programs, a number of things were notable (from my vague memory):
- Lots of jump and conditional branch statements, usually with only
small expression trees (like additions).
- Common use of many variables inside functions (which will conflict
with expression trees in register needs).
- Expression trees consist of many casts, function calls and
indirection ops. This may require registers from many different classes.
- Lots of moves between temporary variables.

Dynamic programming is an interesting way to handle the expression
trees, but there were two main reasons why I did not implement it
back then:
- All the different obscure needs (special instructions, register
classes etc.) made implementing it difficult.
- Another allocation strategy is nevertheless needed for the
long-lived variables used in a function, and it was simpler to use
only one strategy.

So far, the parts of the graph coloring code that have proved
themselves to work very well are:
- Allocation and use of multiple overlapping register classes at the
same time.
- Coalescing of temporaries that have moves between them (sketched
below).
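
For the record, that move-related coalescing can be sketched with a
little union-find: every move between two temporaries merges them
into one equivalence class, so the move itself disappears. A real
allocator also checks that the two temporaries do not interfere
before merging; that test is omitted here, and all names are
illustrative:

#include <stdio.h>

#define NTEMPS 6

static int parent[NTEMPS];

/* Find the representative temporary, with path halving. */
static int
find(int t)
{
    while (parent[t] != t)
        t = parent[t] = parent[parent[t]];
    return t;
}

/* "move tB,tA": merge the two equivalence classes. */
static void
coalesce(int a, int b)
{
    parent[find(a)] = find(b);
}

int
main(void)
{
    int i;

    for (i = 0; i < NTEMPS; i++)
        parent[i] = i;

    coalesce(0, 1);    /* move t1,t0 */
    coalesce(1, 4);    /* move t4,t1 */

    /* t0, t1 and t4 now share one register; both moves are gone */
    for (i = 0; i < NTEMPS; i++)
        printf("t%d -> t%d\n", i, find(i));
    return 0;
}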

One part of the current register-allocation code that needs
improvement is that it uses the simplest form of Sethi-Ullman
computation to find out in which order to evaluate trees, which may
result in bad register allocation and unnecessary spills, since it
does not take the different register classes into account.

Currently the instruction selection code goes top-down and tries to
find an instruction that covers as large a tree as possible, with
occasional help from target-dependent functions (like offstar). Then
the SU number is computed, the evaluation order is determined, and
each node is given a temporary register number.
This number can be thought of as the destination register of the
instruction, which the register allocator tries to coalesce with the
actual result register if possible.

This could be replaced with the dynamic programming algorithm,
probably to the great benefit of code quality, but some non-trivial
things must be solved first:
- Interaction of the expression tree's register requirements with the
temporaries live at the same time (i.e. the spill problem if we run
out of registers).
- Teaching it to handle both instructions with special needs
(clobbered regs, specific in/out regs) and multiple overlapping
register classes.

Interesting examples here are the i386 cltd or shl instructions
(fixed register requirements), long long on 32-bit archs (usually two
paired regs), or the m68k address regs.
Most RISCs (and the VAX!) have general regs that can handle everything :-)
Another challenge is dealing with spills in an efficient way. For
example, a target may need an extra reg to evaluate a spill if it
ends up too far from the frame pointer, making the spill inefficient.
Another example is targets with fast regs that can only be used for
saving other regs, but may be clobbered by calls.

Hm, this reply became far too long. I'm wondering if anyone cared
to read all of it? :-)

-- Ragge
u***@aetey.se
2014-10-18 11:10:46 UTC
On Fri, Oct 17, 2014 at 07:07:59PM +0200, Anders Magnusson wrote:
...
Hm, this reply became far too long. I'm wondering if anyone cared
to read all of it? :-)
Yes.

:)

Rune
Anders Magnusson
2014-10-18 08:51:06 UTC
Post by Michael Rock
On Wed, 15 Oct 2014 14:43:34 +0200
Post by Anders Magnusson
* Use graph coloring for stack space assignment. May give
interesting interaction with register allocation.
It could be integrated into register allocation since it would only be
a kind of special case.
It would actually be a very special case. Since the "available regs"
are infinite on the stack, there is much less code involved.
Also, the register classes do not apply; there are only different
chunk sizes that may or may not be live at the same time.

Having this in place will be a huge benefit for targets with little
stack space.
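
A minimal sketch of what that assignment could look like: two stack
temporaries may share storage if and only if their live ranges do not
overlap, packed greedily first-fit, with differing chunk sizes as
noted above. The data and the layout policy are illustrative only:

#include <stdio.h>

#define NSLOTS 4

struct slot {
    int start, end;    /* live range, half-open */
    int size;          /* bytes */
    int off;           /* assigned frame offset */
};

static int
live_overlap(struct slot *a, struct slot *b)
{
    return a->start < b->end && b->start < a->end;
}

int
main(void)
{
    struct slot s[NSLOTS] = {
        {  0, 10, 8, -1 },    /* e.g. a struct, live early */
        { 12, 20, 8, -1 },    /* disjoint range: reuses offset 0 */
        {  5, 15, 4, -1 },    /* overlaps both of the above */
        { 16, 25, 4, -1 },    /* overlaps slot 1 only */
    };
    int i, j, off;

    for (i = 0; i < NSLOTS; i++) {
        /* first-fit: bump past each placed, simultaneously live
         * slot that would collide, restarting after every bump */
        off = 0;
        for (j = 0; j < i; j++) {
            if (live_overlap(&s[i], &s[j]) &&
                off < s[j].off + s[j].size &&
                s[j].off < off + s[i].size) {
                off = s[j].off + s[j].size;
                j = -1;
            }
        }
        s[i].off = off;
        printf("slot %d: %d bytes at offset %d\n", i, s[i].size, off);
    }
    /* 12 bytes of frame used instead of the naive 24 */
    return 0;
}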

-- Ragge
s***@yaccman.com
2014-10-17 19:43:38 UTC
Post by Anders Magnusson
Hm, this reply became far too long. I'm wondering if anyone cared
to read all of it? :-)
-- Ragge
Post by Michael Rock
Indeed. I am currently toying with the idea of generating code "the
wrong way", starting with the last instruction to be executed
and providing the inputs needed.
Regards
Mike

=======

In fairness, that paper with Aho was written quite a while ago. Some
things are easier now (register usage has, on the whole, gotten much more
uniform) and some things are worse (the timing models in today's CPUs are
baroque and bizarre beyond reasonable description...).

Indeed, the heart of the linear code generation algorithm is to start with
the last instruction:

for( <all possible last instructions> ) {
    <for each instruction, determine what operands are necessary to
     execute it and where the operands must be located>
    for( <each evaluation order for the operands> ) {
        for( <each operand, in order> ) {
            <determine the resources (registers) available to
             compute the operand where it needs to go>
            <add the cost of computing this operand with the
             resources available to the cost for this eval order>
            <mark the resources used as busy>
        }
        <if this evaluation order is lowest cost so far, remember it>
    }
    <add the cost of executing the instruction to the operand cost,
     and remember it if it is lower cost than the best previous cost>
}

A dynamic programming model can compute the costs for the subtrees
(operands) as needed, or do it for all possible resource sets in a
bottom-up fashion.
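
As a toy illustration of that bottom-up costing, here is a recursive
cost function over (subtree, free registers) with one register class,
unit instruction costs and a spill-through-memory option. It is a
simplification for illustration, not the JACM algorithm verbatim, and
a real implementation would memoize per node and register count:

#include <stdio.h>

#define MAXR 3

struct enode {
    struct enode *left, *right;    /* both NULL for a leaf */
};

/*
 * Cheapest instruction count to evaluate a subtree with r free,
 * interchangeable registers.  A leaf is one load.  An interior node
 * either keeps both operands in registers (needier side first), or
 * spills one operand: evaluate it, store it (one insn), evaluate the
 * other with all r registers, then combine with a memory operand.
 */
static int
mincost(struct enode *p, int r)
{
    int a, b, best;

    if (p->left == NULL)
        return 1;
    best = mincost(p->left, r) + 1 + mincost(p->right, r) + 1;
    if (r >= 2) {
        a = mincost(p->left, r) + mincost(p->right, r - 1) + 1;
        b = mincost(p->right, r) + mincost(p->left, r - 1) + 1;
        if (a < best) best = a;
        if (b < best) best = b;
    }
    return best;
}

int
main(void)
{
    struct enode a = {0,0}, b = {0,0}, c = {0,0}, d = {0,0};
    struct enode ab = {&a,&b}, cd = {&c,&d}, top = {&ab,&cd};
    int r;

    /* (a+b)+(c+d): 10, 8 and 7 instructions for 1, 2 and 3 regs;
     * with 3 registers no spill is needed (4 loads + 3 ops) */
    for (r = 1; r <= MAXR; r++)
        printf("%d reg(s): %d instructions\n", r, mincost(&top, r));
    return 0;
}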

One good thing about this is that the instruction description template can
be used to guide this computation, making the compiler able to handle many
different types of instruction. And, by the way, making the compiler more
reliable.

As far as common subexpressions, out-of-order execution, multiple parallel
functional units, cache effects, and the bazillion strange stalls today's
CPUs sport, one can only sacrifice a few chickens and pray...

Steve
Michael Rock
2014-10-18 07:53:41 UTC
On Fri, 17 Oct 2014 12:43:38 -0700
Post by s***@yaccman.com
In fairness, that paper with Aho was written quite a while ago. Some
things are easier now (register usage has, on the whole, gotten much
more uniform) and some things are worse (the timing models in today's
CPUs are baroque and bizarre beyond reasonable description...).
This is a very polite way to describe the situation. ;-)
Post by s***@yaccman.com
Indeed, the heart of the linear code generation algorithm is to start
There is nothing new under the sun, as they say. And yes, it was some
time ago that you wrote the paper. I'm not sure I was able to read
English back then...

This is how it should be done for a tree, and no mistake. My intent
was to use it for a complete program (no backtracking, though), do
the register allocation and scheduling at the same time, and see
whether the generated code differs much from the code generated by
the "traditional" approaches. Today's software gets bigger and
bigger, and I doubt that it gets more reliable as well. That's what
makes me like pcc (among other compilers): you can read it and have a
chance to grasp what it does. It is small enough for a single person
to understand.
Post by s***@yaccman.com
One good thing about this is that the instruction description
template can be used to guide this computation, making the compiler
able to handle many different types of instruction. And, by the way,
making the compiler more reliable.
And that is the elephant in the room, as far as I am concerned. What
good does it do that the compiler can shave some more cycles off my
code when it, even once, produces incorrect code? Does the time saved
by the faster binary equal the time spent finding the problem?

I prefer to spend that time finding a simpler way to get better results,
but only when I cannot spend the time playing with junior.
Post by s***@yaccman.com
As far as common subexpressions, out-of-order execution, multiple
parallel functional units, cache effects, and the bazillion strange
stalls today's CPUs sport, one can only sacrifice a few chickens and
pray...
Amen to that. I had more success with other kinds of offerings; the
gods of speed seem to prefer liquids in this neck of the woods.
Post by s***@yaccman.com
Steve
Mike
Alan Cox
2014-10-18 18:56:27 UTC
Post by s***@yaccman.com
things are easier now (register usage has, on the whole, gotten much more
uniform) and some things are worse (the timing models in today's CPU's are
baroque and bizarre beyond reasonable description...).
They reflect the way the hardware really works, that's all. On the
other hand, increasingly the input doesn't matter as much: if you
throw a bunch of instructions at the CPU in a reasonably sane order,
it'll probably do a reasonable job of extracting parallelism. It has
to, because the rules change with each CPU, and nobody is going to
recompile their OS for each specific core unless they are running a
really huge HPC setup or they run Gentoo Linux.

Alan
