The program models a physical object, which is partitioned and distributed over the processors.
Each processor performs the same task on different data
e.g. matrix operations
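The data-parallel pattern above can be sketched with Python's `multiprocessing` module; the matrix, the scaling task, and the function names here are illustrative, not part of the original:

```python
from multiprocessing import Pool

def scale_row(row):
    # The same task (scaling by 2) is applied to a different piece of the data
    return [2 * x for x in row]

if __name__ == "__main__":
    matrix = [[1, 2], [3, 4], [5, 6]]
    # Each worker process handles its own row of the partitioned matrix
    with Pool(processes=3) as pool:
        result = pool.map(scale_row, matrix)
    print(result)  # [[2, 4], [6, 8], [10, 12]]
```

Here the matrix is the "physical object" that gets partitioned: each worker runs identical code, only its slice of the data differs.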
There is a list of tasks, and processors cycle through the list until it is exhausted.
Each processor performs a different task on the same data
e.g. parameter tuning when training machine learning models
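Task parallelism can be sketched the same way: a shared task list is drained by a pool of workers, each applying a different function to the same data. The data set and the choice of statistics functions are assumptions for illustration:

```python
from multiprocessing import Pool
import statistics

# One shared data set; each task is a *different* function applied to it
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
tasks = [min, max, statistics.mean, statistics.stdev]

def run_task(task):
    return task.__name__, task(data)

if __name__ == "__main__":
    # pool.map hands tasks out from the list until it is exhausted
    with Pool(processes=len(tasks)) as pool:
        for name, value in pool.map(run_task, tasks):
            print(name, value)
```

This mirrors the parameter-tuning example: the training data is fixed, while each worker evaluates a different configuration (here, a different function) against it.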
3584 CUDA cores
1417–1531 MHz (base–boost clock)
12 GB GDDR5X Memory @ 10 Gbps
384-bit memory interface, 480 GB/s bandwidth
250 W
12 billion transistors
11 TFLOPS FP32