Squeezing Every Second: How I Reduced Today's Leetcode Runtime By 2x
Introduction
I particularly enjoy it when Leetcode provides test cases that are able to push your code to the limit, allowing for you to perform micro-optimizations to your code to shave off precious milliseconds. Today's daily problem, Maximum Square Area by Removing Fences From a Field is one of them.
In this article, I will outline how I reduced the runtime of my submission from 1839ms to 696ms without changing the overall time complexity of the algorithm. By just adding two lines, I was able to reduce the runtime by 120ms (15% improvement)! But first, let's walk through the solution.
Intuition
Today's solution requires you to enumerate through the answers to find two pairs of fences such that you can form a square with the maximum size. Using the first example given (by the question), you will choose (1, 3) from both hFences and vFences to get a square of length 2.
Using this, our algorithm should look something like this:
- Add the start (
1) and endpoints (mandn) intohFencesandvFences. - Generate all the possible edges of
hFencesand all the possible edges ofvFences, and store them in anunordered_setfor O(1) lookup. We will call themhEdgesandvEdges. - For each edge in
hEdges, check if the edge exists invEdges. If it does, we update the result if the pair is greater. - Return the result :D
Implementation and First run (1839ms)
Implementing the algorithm, we get the following code below:
class Solution {
public:
int maximizeSquareArea(int m, int n, vector<int>& hFences,
vector<int>& vFences) {
const int MOD = 1e9 + 7;
hFences.push_back(1);
vFences.push_back(1);
hFences.push_back(m);
vFences.push_back(n);
sort(hFences.begin(), hFences.end());
sort(vFences.begin(), vFences.end());
unordered_set<int> hEdges;
unordered_set<int> vEdges;
for (int i = 0; i < hFences.size() - 1; i++) {
for (int j = i + 1; j < hFences.size(); j++) {
hEdges.insert(hFences[j] - hFences[i]);
}
}
for (int i = 0; i < vFences.size() - 1; i++) {
for (int j = i + 1; j < vFences.size(); j++) {
vEdges.insert(vFences[j] - vFences[i]);
}
}
int res = -1;
for (auto &hEdge: hEdges) {
if (vEdges.contains(hEdge)) {
res = max(res, hEdge);
}
}
if (res == -1)
return res;
return (long long)res * res % MOD;
}
};
Running the code, we get a runtime of 1839ms, beating only 21.36%. There is definitely plenty of room for improvement.
| Time complexity of our algorithm is O(h2 + v2).
Removing unncessary unordered_set: 1839ms -> 820ms
The initialization of the second unordered_set was unnecessary. We can compare directly to hEdges when we generate the pairs, giving us the code below:
class Solution {
public:
int maximizeSquareArea(int m, int n, vector<int>& hFences,
vector<int>& vFences) {
constexpr int MOD = 1e9 + 7;
hFences.push_back(1);
vFences.push_back(1);
hFences.push_back(m);
vFences.push_back(n);
sort(hFences.begin(), hFences.end());
sort(vFences.begin(), vFences.end());
unordered_set<long long> hEdges;
for (int i = 0; i < hFences.size() - 1; i++) {
for (int j = i + 1; j < hFences.size(); j++) {
hEdges.insert(hFences[j] - hFences[i]);
}
}
long long res = -1;
for (int i = 0; i < vFences.size() - 1; i++) {
for (int j = i + 1; j < vFences.size(); j++) {
long long diff = vFences[j] - vFences[i];
if (diff > res && hEdges.contains(diff))
res = diff;
}
}
if (res == -1)
return res;
return (long long)res * res % MOD;
}
};
Although inserting elements into an unordered_set is constant (O(1)) time, there is a hidden cost every time an edge is hashed and added into the unordered_set. By removing the second set, we reduced the runtime by more than 50%!.
In addition to removing the second unordered_set, we can make MOD a constexpr (constant expression) since the value of MOD is known at compile time.
Reserving space in hEdges: 820ms -> 758ms
Each time we add an edge to hEdges, the application must allocate memory, resulting in overhead. If we know (approximately) how many elements we will insert into hEdges (spoiler alert: we do), we can reserve space in hEdges to reduce the overhead. Adding the following line reduces the runtime to 758ms!
unordered_set<long long> hEdges;
hEdges.reserve(hFences.size() * hFences.size()); // reserve h^2 spaces in hEdges
| Side note: this works for vectors too!
pragma GCC optimize ("03") 758ms -> 696ms
Ok this is practically cheating. By adding this line, we are telling the GCC compiler to optimize aggressively (with O3 level optimization). Adding the following to the top of the code speeds up the runtime by approximately 60ms.
#pragma GCC optimize ("O3")
The last two lines alone reduced the runtimes by about 120ms, or 15%.
Comparing largest pairs first 696ms -> 755ms.
What is we iterate through the back in the inner loop of hFences, so that we can get the largest pairs first and return early? Intuitively, we might think that the following code might be faster as there will be fewer iterations:
for (int i = 0; i < vFences.size() - 1; i++) {
for (int j = i + 1; j < vFences.size(); j++) {
long long diff = vFences[j] - vFences[i];
if (diff > res && hEdges.contains(diff))
res = diff;
}
}
Surprisingly, the runtime increased to 755ms. There are a few possible reasons that we can attribute this to. The most likely reason is that CPUs are designed to read memory moving forward more efficiently than moving backward, due to features such as hardware prefetcher, branch prediction, and loop vectorization (using SIMD instructions).
Other possible optimizations
There are other optimizations which I could have tried to make the code even faster, but I did not (due to the interest of time).
Initialize hEdges for the smaller vector
Since adding into the unordered_set is an expensive operation, we can compare the sizes of hFences and vFences, and initialize hEdges for the smaller one.
int maximizeSquareArea(int m, int n, vector<int>& hFences,
vector<int>& vFences) {
// Swap hFences and vFences, so that hFences will always be the smaller one
if (hFences.size() > vFences.size())
return maximizeSquareArea(n, m, vFences, hFences);
// truncated
}
< Back to blog