| Tweets |
|
Liran Alon
@Liran_Alon
|
Jan 31 |
|
I later saw this great talk: youtube.com/watch?v=Ii_pEX… that explains how RISC may achieve CISC-like perf with clever micro-arch tricks. The main concepts are Macro-Fusion, in which the decoder generates a single uOp for multiple MacroOps, & CISC having both 2- and 4-byte instructions. twitter.com/Liran_Alon/sta…
|
||
|
|
||
|
Liran Alon
@Liran_Alon
|
Jan 31 |
|
It would've been great to see a technical paper comparing x86, ARM & RISC-V that explicitly calls out the advantages and disadvantages of each, and the various factors and tradeoff decisions that led to these specific ISA designs. Otherwise, it's just hand-waving discussion.
|
||
|
|
||
|
Liran Alon
@Liran_Alon
|
Jan 27 |
|
What makes these count as new vulns is that the current microcode MD_CLEAR implementation doesn't flush the micro-arch buffers whose contents may theoretically propagate to MDS-leakable buffers after MD_CLEAR executes. Thus, mitigation requires a microcode update but no software change.
|
||
|
|
||
|
Liran Alon
@Liran_Alon
|
Jan 23 |
|
RS entries have operand fields that either hold data or are tagged with a PSrc, so that they capture their data when it is sent on the common bus by the functional units executing the uOps they depend on, which produce the PSrc data. RS entries are dispatched to ports only once all operands are resolved...
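The operand-capture behaviour described above can be sketched as a toy model. This is only an illustration under simplifying assumptions, not how any real core is built: the class names (`Operand`, `RSEntry`, `ReservationStation`) and the `broadcast` method standing in for the common bus are hypothetical, and ports, latencies and the ROB are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Operand:
    # An operand either already holds data, or waits on a PSrc tag.
    value: Optional[int] = None
    psrc: Optional[int] = None

    @property
    def ready(self) -> bool:
        return self.value is not None


@dataclass
class RSEntry:
    name: str
    operands: List[Operand]

    def capture(self, tag: int, data: int) -> None:
        # Snoop the bus: latch the data into every operand waiting on this tag.
        for op in self.operands:
            if not op.ready and op.psrc == tag:
                op.value = data

    @property
    def ready(self) -> bool:
        return all(op.ready for op in self.operands)


class ReservationStation:
    """Hypothetical toy RS: no ports, no scheduling policy."""

    def __init__(self) -> None:
        self.entries: List[RSEntry] = []

    def allocate(self, entry: RSEntry) -> None:
        self.entries.append(entry)

    def broadcast(self, tag: int, data: int) -> None:
        # Stand-in for a functional unit writing its result on the common bus.
        for entry in self.entries:
            entry.capture(tag, data)

    def dispatch_ready(self) -> List[str]:
        # Only entries with all operands resolved leave the RS.
        ready = [e for e in self.entries if e.ready]
        self.entries = [e for e in self.entries if not e.ready]
        return [e.name for e in ready]
```

An entry allocated with one literal operand and one `psrc` tag stays in the RS until a `broadcast` with that tag arrives; only then does `dispatch_ready` return it.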
|
||
|
|
||
|
Liran Alon
@Liran_Alon
|
Jan 23 |
|
I don't think the RS is what performs the load uOp re-dispatch. The RS entry for a load uOp is supposed to be deallocated once it is dispatched to the load port (i.e. AGU and then MOB). The load-buffer entry in the MOB is what gets re-dispatched to the DCU once the DTLB resolves the physical address.
|
||
|
|
||
|
Liran Alon
@Liran_Alon
|
Jan 23 |
|
The MCA-generated uOps that implement page-walks are not "stuffed loads". They have a valid PDst and an allocated RS and ROB entry. i.e. the non-speculative page-walk that happens at retirement does not perform "stuffed loads". This is described in the patent.
|
||
|
|
||
|
Liran Alon
@Liran_Alon
|
Jan 23 |
|
Well, more precisely, they first go through the AGU to generate the linear address and then go to the MOB, which enforces proper memory ordering, performs linear address translation, dispatches to the DCU, supports store-to-load forwarding, etc.
|
||
|
|
||
|
Liran Alon
@Liran_Alon
|
Jan 23 |
|
When uOps are dispatched from the FrontEnd Allocator to the OoO BackEnd engine, the Allocator creates ROB and RS entries for them (after register renaming, etc.). The RS holds uOps to be dispatched to functional units via ports when their operands are ready. Load/store uOps are dispatched from the RS to the MOB.
|
||
|
|
||
|
Liran Alon
@Liran_Alon
|
Jan 23 |
|
Right. It's a good question (now that I understand it). I believe the PMH is triggered anyway for hardware simplicity (it's a slow path not worth optimising). This way the DTLB only needs wires to the MOB, and none to the ROB to mark an entry with MCA. The PMH needs such functionality anyway to trigger a PF.
|
||
|
|
||
|
Liran Alon
@Liran_Alon
|
Jan 23 |
|
Yes, it doesn't interrupt the OoO flow of course, i.e. other loads/stores in the MOB that don't require the PMH. Yes, the load remains in the MOB (not the RS) until the translation arrives, and only then can it be dispatched to the DCU.
|
||
|
|
||
|
Liran Alon
@Liran_Alon
|
Jan 23 |
|
Maybe you're asking why, in the case of D==0 (in contrast to A==0, which may need to set multiple A bits in multiple PT levels), the DTLB doesn't at this point just mark the uOp for MCA at retirement and skip triggering the PMH. It's because D may be 1 in memory (DTLB stale), and also because it's not worth optimising.
|
||
|
|
||
|
Liran Alon
@Liran_Alon
|
Jan 23 |
|
What do you mean by parallel? In the case of a store when the dirty bit is clear, the TLB just asserts a miss signal that triggers the PMH, similar to not having a TLB match at all. Thus, the load/store entry in the MOB isn't given a physical address, which prevents the MOB from dispatching it to the DCU.
|
||
|
|
||
|
Liran Alon
@Liran_Alon
|
Jan 23 |
|
I'm not sure I understand the question. Every speculative execution path that is eventually not retired is by definition a perf hit, as the whole ROB is flushed and all uOps that executed speculatively and did not retire were just a waste of CPU cycles. Nothing special about this case.
|
||
|
|
||
|
Liran Alon
@Liran_Alon
|
Jan 23 |
|
That part is consistent, yes.
My "not exactly" referred to the separate handling you described for access and dirty bits. You must avoid setting either of them during speculative exec, and they don't have different types of MCAs.
|
||
|
|
||
|
Liran Alon
@Liran_Alon
|
Jan 23 |
|
The first walk, done by the PMH while the uOp was executing speculatively, could have been aborted when the PMH deduced that it was about to access non-speculative memory. In this case, the PMH marks the parent uOp to re-dispatch only at retirement, which re-triggers the PMH as well.
|
||
|
|
||
|
Liran Alon
@Liran_Alon
|
Jan 23 |
|
The dirty bit is cached in the DTLB for a trivial reason: it's no different from the perm bits. i.e. when the dirty bit is clear, loads from the linear address should hit the DTLB while stores should miss and trigger the PMH. Similarly, Shadow MMU implementations mark the SPTE RO if the D-bit is clear.
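A minimal sketch of that lookup rule, assuming a software model in which a "miss" is simply a `None` return. `ToyDtlb`, `TlbEntry` and `lookup` are hypothetical names made up for this illustration; permission checks, page sizes and eviction are not modelled.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TlbEntry:
    pfn: int      # cached physical frame number
    dirty: bool   # cached copy of the PTE D-bit


class ToyDtlb:
    """Hypothetical model: virtual page number -> cached entry."""

    def __init__(self) -> None:
        self.entries: Dict[int, TlbEntry] = {}

    def fill(self, vpn: int, entry: TlbEntry) -> None:
        self.entries[vpn] = entry

    def lookup(self, vpn: int, is_store: bool) -> Optional[int]:
        entry = self.entries.get(vpn)
        if entry is None:
            return None   # no match at all: miss, the page-miss handler takes over
        if is_store and not entry.dirty:
            return None   # store with cached D == 0: reported as a miss too
        return entry.pfn  # loads hit regardless of the cached D-bit
```

With an entry filled as `dirty=False`, a load lookup returns the frame number while a store lookup returns `None`, just like a lookup of an unmapped page would.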
|
||
|
|
||
|
Liran Alon
@Liran_Alon
|
Jan 23 |
|
When the retirement circuit tries to retire the parent uOp, it checks whether the uOp is marked to trigger MCA (signalled by the PMH if A/D bits need to be set). If yes, the ROB is flushed and MCA generates new uOps that implement the page-walk and set the A/D bits. Else, the PMH performs the page-walk as usual.
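The retirement-time check can be sketched roughly as follows. This is a toy model and not the patent's logic: `Uop`, `RetireOutcome`, `try_retire_head` and the two assist uOp names are all hypothetical, and what gets refetched after the assist completes is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Uop:
    name: str
    marked_for_assist: bool = False  # assumed to be set earlier by the PMH


@dataclass
class RetireOutcome:
    retired: Optional[str] = None
    flushed: List[str] = field(default_factory=list)
    assist_uops: List[str] = field(default_factory=list)


def try_retire_head(rob: List[Uop]) -> RetireOutcome:
    """Try to retire the oldest uOp of a ROB kept in program order."""
    if not rob:
        return RetireOutcome()
    head = rob[0]
    if head.marked_for_assist:
        # Flush the whole ROB, then let the assist inject its own uOps.
        flushed = [u.name for u in rob]
        rob.clear()
        return RetireOutcome(
            flushed=flushed,
            # Hypothetical names for the assist-generated uOps.
            assist_uops=["assist_page_walk", "assist_set_ad_bits"],
        )
    # Unmarked uOp: normal retirement.
    rob.pop(0)
    return RetireOutcome(retired=head.name)
```

Calling `try_retire_head` on a ROB whose head is marked leaves the ROB empty and returns the assist uOps; an unmarked head is simply popped and reported as retired.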
|
||
|
|
||
|
Liran Alon
@Liran_Alon
|
Jan 23 |
|
Not exactly. The DTLB just reports a miss for either a load/store when the A-bit is 0 or a store when the D-bit is 0. This triggers the PMH to do a PT-walk (may use PTE caches). If the memory is non-speculative (by MTRR/PAT) or the walk requires modifying A/D bits, the PMH aborts the walk and signals the uOp to re-dispatch at retirement.
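The PMH decision during a speculative walk could be modelled like this. It is a sketch under assumptions: `Pte`, `WalkResult` and `speculative_walk` are hypothetical names, a single PT level is modelled, and the MTRR/PAT memory-type check is reduced to one boolean parameter.

```python
from dataclasses import dataclass
from enum import Enum


class WalkResult(Enum):
    TRANSLATED = "translated"
    REDISPATCH_AT_RETIREMENT = "redispatch_at_retirement"


@dataclass
class Pte:
    accessed: bool  # A-bit as currently found in memory
    dirty: bool     # D-bit as currently found in memory


def speculative_walk(pte: Pte, is_store: bool,
                     memory_allows_speculation: bool) -> WalkResult:
    """Outcome of a walk done while the parent uOp is still speculative."""
    # A must be set for any access; D must also be set for a store.
    needs_ad_update = (not pte.accessed) or (is_store and not pte.dirty)
    if not memory_allows_speculation or needs_ad_update:
        # Abort: the uOp is re-dispatched once it is non-speculative.
        return WalkResult.REDISPATCH_AT_RETIREMENT
    return WalkResult.TRANSLATED
```

Note the stale-DTLB case: if the DTLB cached D == 0 but the PTE in memory already has D == 1, a store's walk finds nothing to modify and completes with a translation.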
|
||
|
|
||
|
Liran Alon
@Liran_Alon
|
Jan 23 |
|
Read this Intel patent: patents.google.com/patent/US56805…. In short, the access & dirty bits are set by the PMH, which executes uOps dispatched by a microcode-assist (MA) that is triggered on parent uOp retirement. The parent uOp is marked to trigger the MA on retirement by the PMH as well.
|
||
|
|
||
|
Liran Alon
@Liran_Alon
|
Jan 18 |
|
@rsinghal1 This discussion may interest you.
|
||
|
|
||