Moving Physical Verification Runs from Night to Coffee Breaks

October 15, 2019 -- A year ago we were told that an EDA job had been run successfully on a thousand CPUs. At first glance this may seem a real breakthrough. Perhaps it is, but only for a very limited number of design companies.

The heart of the matter

Employing multiple CPUs has its pros and cons. The obvious drawbacks are the price (hardware either purchased or rented in a cloud) and the additional overhead. The more CPUs you involve, the more it costs and the more parallel tasks are running, each requiring splitting and merging operations as well as communication between processors. Note that communication, splitting, and merging are not verification per se. They are needed only to gain time, and the price is that after dividing your task among n CPU cores you will never get a runtime n times shorter. If you look at, for example, the approaches proposed to speed up simulation tasks, you will see that in many cases the speedup is far less than the number of cores used. So-called fine-grain parallelism (FGP), which basically parallelizes at the level of mathematical operations and is probably the only practical way to employ a thousand cores, implies bigger overhead than other ways to parallelize.
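The gap between core count and speedup can be illustrated with Amdahl's law, a standard textbook model that is brought in here for illustration and is not part of the announcement discussed above. The sketch below assumes a hypothetical job in which a fixed share of the work (splitting, merging, communication) stays serial. The 5% serial share and the function name `amdahl_speedup` are assumptions, not measurements.

```python
def amdahl_speedup(cores: int, serial_fraction: float) -> float:
    """Upper bound on speedup when `serial_fraction` of the work cannot be parallelized."""
    if cores < 1:
        raise ValueError("cores must be >= 1")
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


if __name__ == "__main__":
    SERIAL_FRACTION = 0.05  # assumed for illustration only
    for n in (1, 4, 16, 64, 1000):
        print(f"{n:5d} cores -> at most {amdahl_speedup(n, SERIAL_FRACTION):.1f}x")
```

Under this assumption even a thousand cores cannot push the speedup past 20x, which is why the overhead share matters more than the raw core count.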

Besides, although adding 2-4 cores to your hardware configuration costs next to nothing, adding a thousand of them will cost a lot of money. It would also be a challenge in itself to find that many cores ready to be used in public clouds, since you will want them on demand: not permanently, but at sporadic moments. A few A-players may indeed require processing power that huge. For the rest of the IC design community, the announcement of 1000-CPU support looks more like putting the cart before the horse. In the real world, we get a task first, not the means to solve it.

Do you need 1000 surplus CPUs?

Should we avoid using hundreds of cores then? There is no one-size-fits-all answer. First, there are other ways to parallelize. Your PV tool may break your task down into independent blocks of rules or into independent chunks of the layout and process each independent portion on a separate CPU. The efficiency of such approaches depends not only on the hardware being used but also on the nature of your rules and on your design type, respectively. However, you may expect far lower overhead than with FGP, meaning higher efficiency. Second, going too coarse in task partitioning has its own dark side. It may result in unbalanced processing, in which one or more cores sit idle while a large, inherently serial chunk is still being processed. Also, the number of partitions that can be processed independently in parallel (whether partitions of the layout or of the rules) is limited, and past some point adding more CPUs has either no effect or a negative one due to overhead. That saturation limit varies heavily, depending on the design you check and your tool's capabilities, and ranges from 8 cores in the worst cases to hundreds in the best.
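The saturation effect described above can be sketched with a toy scheduler. The example assumes a hypothetical job already split into independent chunks (rule groups or layout strips) with invented durations, and assigns them greedily, longest first, to the least loaded core. This is only an illustration and does not describe how any particular PV tool distributes its work.

```python
import heapq


def makespan(chunk_minutes, cores):
    """Wall-clock time of a longest-first greedy schedule on `cores` cores."""
    loads = [0.0] * cores
    heapq.heapify(loads)
    for duration in sorted(chunk_minutes, reverse=True):
        lightest = heapq.heappop(loads)  # least loaded core so far
        heapq.heappush(loads, lightest + duration)
    return max(loads)


if __name__ == "__main__":
    # Hypothetical chunk durations in minutes: one large chunk dominates.
    chunks = [30, 10, 10, 10, 10, 10, 10, 10, 10, 10]
    serial_time = makespan(chunks, 1)
    for n in (1, 2, 4, 8, 16):
        t = makespan(chunks, n)
        print(f"{n:2d} cores: {t:5.1f} min, speedup {serial_time / t:.1f}x")
```

With these invented numbers the speedup stops growing once the largest chunk alone determines the run time, so cores added beyond that point simply sit idle.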

How many CPUs do the end users really need?

Well, opting for multiple CPUs is always a matter of tradeoff. When you are close to the fab tapeout deadline and every hour matters, you may want to pay extra just to gain time. At earlier stages of chip design, however, you can probably afford lengthier yet cheaper verification. So what is really needed when involving multiple CPU cores in your tasks is flexibility: not acquiring costly hardware but using it temporarily when the need arises, that is, renting it. That flexibility is provided by clouds, where you may use both software (SaaS) and hardware (IaaS) on a pay-per-use or pay-as-you-go basis, exactly as your task requires at the moment. It is even more flexible than you would expect, as it saves you from having to buy extra tool seats for a year just to cover peak loads.

So what is there for us mere mortals, who do not deal with monster chips but now and then need to turn, say, a three-hour verification into 20 minutes at most? If you are looking for a roughly 10x speedup then, bearing in mind the inevitable overhead, your best choice is a 12-16 CPU configuration. With 32 CPUs you are likely to get it down to 7-8 minutes. Once again, the eventual duration will depend on the task, the capacity of the chosen hardware (primarily the number of CPUs, clock rate, and memory size), and your layout specifics.
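The tradeoff between wall-clock time and consumed resources can be made concrete with a little arithmetic on the figures quoted above. The sketch below uses core-hours as a rough stand-in for rental cost. The helper names are hypothetical, the 7.5-minute value is simply the midpoint of the quoted 7-8 minutes, and real cloud pricing depends on the provider.

```python
def required_speedup(baseline_minutes: float, target_minutes: float) -> float:
    """Speedup needed to shrink a run from the baseline to the target duration."""
    return baseline_minutes / target_minutes


def core_hours(cores: int, wall_minutes: float) -> float:
    """Core-hours consumed when `cores` cores are busy for `wall_minutes`."""
    return cores * wall_minutes / 60.0


if __name__ == "__main__":
    print(f"3 h -> 20 min needs a {required_speedup(180.0, 20.0):.0f}x speedup")
    # (cores, wall-clock minutes); 7.5 min is an assumed midpoint of 7-8 min.
    scenarios = [(1, 180.0), (12, 20.0), (16, 20.0), (32, 7.5)]
    for cores, minutes in scenarios:
        print(f"{cores:2d} cores, {minutes:5.1f} min -> "
              f"{core_hours(cores, minutes):.2f} core-hours")
```

In each multi-core scenario the job consumes more core-hours than the three-hour single-core run, which is the extra you pay for getting the answer sooner.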

Fastest standalone DRC engine on the market

Let’s have a look at multi-CPU results obtained with the PVCLOUD SaaS solution, which has recently become available at several European foundries for process nodes down to 40nm. The solution is based on the PowerDRC/LVS verification tool from POLYTEDA CLOUD and employs the company’s proprietary One-Shot technology for the highest efficiency (CPU/RAM/disk) per rule check. PowerDRC/LVS benefits from parallel processing of independent groups of rules, independent parts of the layout (strips), or both combined. The toolset allows running verification tasks (DRC, LVS, RCX, XOR, dummy fills, etc.) as SaaS in a cloud. Its engine natively runs flat (with an optional hierarchical mode), providing sign-off accuracy with run times equal to or better than those of its natively hierarchical competitors.

The benchmarking diagram below shows DRC results obtained with foundry-certified sign-off runsets on a machine with 4 CPUs (18-core Intel Xeon E5-2686 v4 running at 2.3 GHz) and 64 GB RAM. All test layouts are 180nm Manhattan. The speedup shown was measured against single-CPU runs of the same task. Shorter bars correspond to better performance. PowerDRC/LVS run time was already on par with the leading EDA tools. Thanks to certain heuristic algorithms implemented in the latest version, it has been reduced by 20-50% without any impact on accuracy. As beta testing of this “Turbo mode” is nearing completion, we find it helpful to show PowerDRC/LVS 2.6 run time in both regular and turbo modes.

PowerDRC/LVS also remains stable in multi-CPU mode when verifying so-called any-angle designs containing complex curvilinear layout structures (nano and image sensors, MEMS, etc.). The DRC result below was obtained on an 8x AMD Opteron CPU system running at 3 GHz with 32 GB RAM:

Test name	Test type	Design size	CPU cores	Multi-CPU speedup	Run time, hh:mm:ss	Competitor1	Competitor2
Chip B	Any-angle, 350nm	2235.890 by 5610.030 um		4.5	00:06:53	Crashed	Stopped with software errors

Since PowerDRC/LVS employs one and the same engine for both DRC checks and fill layer generation, the latter benefits equally from multiple CPU cores. The table below shows fill layer generation results for real-world 180 nm chips.

Test name	Test type	GDS size, MB	CPU cores	Mode	Speedup	Run time, hh:mm:ss
Chip C	Power management, 4 metals	160	1	Flat	1	00:12:00
			8	Flat	3.4	00:03:34
			16	Flat	5.7	00:02:06
			32	Flat	7.4	00:01:38
			64	Flat	11.6	00:01:02
Chip D	Standard logic+PADs, 5 metals	390	4	Flat	2.1	00:02:39
Chip E	Standard logic+PADs, 6 metals	660	1	Flat	1	03:08:20
Chip E	Standard logic+PADs, 6 metals	660	8	Flat	3.1	01:02:04
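
The scaling in the table can be summarized as parallel efficiency, that is, speedup divided by the number of cores. The sketch below recomputes speedup and efficiency from the Chip C run times listed above. The term "efficiency" and the helper names are illustrative additions, and since the published run times are rounded to whole seconds, a recomputed speedup may differ from the table in the last digit.

```python
def to_seconds(hhmmss: str) -> int:
    """Convert an hh:mm:ss string to seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Chip C fill generation run times from the table above, keyed by core count.
CHIP_C_RUNS = {
    1: "00:12:00",
    8: "00:03:34",
    16: "00:02:06",
    32: "00:01:38",
    64: "00:01:02",
}


def scaling(runs):
    """Return (cores, speedup, efficiency) tuples relative to the 1-core run."""
    baseline = to_seconds(runs[1])
    rows = []
    for cores in sorted(runs):
        speedup = baseline / to_seconds(runs[cores])
        rows.append((cores, speedup, speedup / cores))
    return rows


if __name__ == "__main__":
    for cores, speedup, efficiency in scaling(CHIP_C_RUNS):
        print(f"{cores:2d} cores: speedup {speedup:4.1f}x, efficiency {efficiency:.0%}")
```

Each doubling of the core count still shortens the run, but every added core contributes less than the one before it.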
