My (rad) research team at EuroSys 2024; from left to right: Ehsan, me, Robert, Ties. After I introduced them to Gustavo Alonso, he asked if I had a height requirement for my PhD students.
I have already written about rejections
in academia once (in a post titled Rejections).
This is a never-ending story for most academics. Recently, two papers from my group got published and
presented at conference workshops after 5 and 4 rejections from conferences,
respectively. Hence, I decided to revisit the topic of rejections by focusing
on these two papers.
Before I start with the
individual papers, I would like to acknowledge a few things.
First, I am biased with respect
to my own work. The main reason I work on what I work on is that I find the
topic exciting and important. Otherwise, I wouldn’t be working on it.
Second, like everyone I know, I
have a drive to share what I find exciting and important with others. That is
why I welcome any opportunity to present our work and enjoy writing about it in
the form of academic papers. Through this dissemination process, we share and
build knowledge, get constructive criticism from our peers to improve the
work, and start collaborations and new research directions. Not everyone shares
the same level of excitement about the same research topics, though, and there is
always room for improvement in any work. As a result, sometimes the feedback you
receive sounds discouraging. This discouragement combined with the high
dependency of one’s career on publications makes rejections difficult even
though we all know that they are inevitable in our profession.
Third, I am aware that my health
is the most important thing, and nothing I do or accomplish at work makes me as
happy as the time I spend with the people that make me feel at home or at a
movie theater or the beach. But I don’t want to diminish people’s career
ambitions, especially in a world where women’s career ambitions are still
under-supported, by over-emphasizing these cliché-but-true health
and happiness statements.
An Analysis of Collocation on
GPUs for Deep Learning Training
Ties Robroek, Ehsan
Yousefzadeh-Asl-Miandoab, Pınar Tözün
EuroMLSys 2024 - https://dl.acm.org/doi/10.1145/3642970.3655827
This paper characterizes the
performance of the different task collocation methods available on NVIDIA GPUs
for deep learning training. The motivation came after realizing that not
everyone who trains deep learning models is <insert your favorite big tech company
here>. Thus, not every model training needs many GPUs or even the entire
resources of a single GPU. This means that if we always train one model at a
time on a GPU, that GPU is likely a wasted resource. Wasting hardware resources
wastes both money and energy. Studying how deep learning tasks
can effectively share the resources of a GPU, therefore, made sense and was a
relatively under-researched subject at the time.
We started back in September 2021
when my first PhD student
Ties Robroek joined my
1-person team. A couple of MSc students, Anders Friis Kaas and Stilyan Petrov
Paleykov, were also interested in the topic for their MSc thesis project. The
initial team was formed.
We started with an investigation
into the MIG (multi-instance GPU) technology, since it was the newest thing
offered by NVIDIA GPUs at the time. MIG allows a GPU to be split into smaller
units, enabling task collocation with isolation guarantees.
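As an illustrative sketch (not taken from the paper), partitioning an MIG-capable GPU such as an A100 and pinning training jobs to the resulting slices looks roughly like this; the profile ID and UUID placeholders below are examples, and the exact profiles available vary by GPU model:

```shell
# Enable MIG mode on GPU 0 (requires an MIG-capable GPU, e.g., A100/A30;
# the GPU may need a reset afterwards for the change to take effect).
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this GPU supports (names like 1g.5gb, 3g.20gb).
nvidia-smi mig -lgip

# Create two GPU instances from profile 9 (3g.20gb on an A100-40GB)
# and a compute instance on each of them (-C).
sudo nvidia-smi mig -cgi 9,9 -C

# List the resulting MIG devices together with their UUIDs.
nvidia-smi -L

# Run one training job per MIG slice by pinning each process to a slice
# via CUDA_VISIBLE_DEVICES (train.py is a placeholder for a training script).
CUDA_VISIBLE_DEVICES=MIG-<uuid-of-slice-0> python train.py &
CUDA_VISIBLE_DEVICES=MIG-<uuid-of-slice-1> python train.py &
```

Because each slice has its own memory and compute partition, the two jobs are isolated from each other, which is exactly the property the paper contrasts against the softer sharing of multi-streams and MPS.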
1st reject: The
MSc students finished their thesis in June 2022. The SoCC (ACM Symposium on Cloud Computing)
paper submission deadline was around that time. We submitted the work from
their thesis there. It got rejected with overall constructive and encouraging
reviews. The main issue was that people thought the paper didn’t have enough
lessons learned to warrant a SoCC publication. However, we got clear ideas for
improving the paper for a possible resubmission. The key suggestion was to
expand the study beyond MIG and add a comparison to other collocation methods
on NVIDIA GPUs, namely multi-streams and Multi-Process Service (MPS).
2nd reject: For
this submission, we included Ehsan Yousefzadeh-Asl-Miandoab, my second PhD student, in the study. The entire paper was almost redone. We submitted the
outcome to MLSys (Conference
on Machine Learning and Systems) in fall 2022. It got rejected again.
While the reviews were slightly less encouraging than SoCC’s, they were overall
constructive. The reviewers didn’t find the results surprising enough, asked
for deeper analysis on some experiments, suggested adding more diverse deep
learning models to the study, and asked for scenarios that involve multiple
GPUs.
3rd reject: Following
the MLSys reviews, for the next resubmission, we added more diverse models, dug
deeper into certain results, changed the metrics we report to give a
finer-grained picture for the GPU resource utilization, and wrote clearer
guidelines for when it makes sense to use each collocation mechanism. The last
point was to address the “no surprising result” comment. Since we cannot create
surprising results out of nowhere, wrapping them up in clearer “take-away
messages” made more sense. Finally, we deemed the multi-GPU case out of scope for
this study, since I strongly believe in the importance of optimizing things at
the small scale as much as at the large scale.
The resulting paper was submitted
to ASPLOS (ACM
International Conference on Architectural Support for Programming Languages and
Operating Systems) in the spring of 2023 and was rejected once again. The workload
diversity was praised by some reviewers, while some others asked for
alternative workloads. However, the unsurprising results and the lack of deeper
insights were once again the main issues.
4th reject: I
was overall optimistic after the first two rejects, because we had clear ideas for
improving the paper for a resubmission. I think the paper indeed got substantially
better as a result of those resubmissions. However, after the 3rd
reject, I didn’t know how to improve the paper anymore. We couldn’t conjure up
surprising results out of thin air. Further in-depth analysis wasn’t easy because
not every GPU hardware detail is openly shared by the vendor. Of course, one can
always apply extra analysis through more profiling and add more workloads to
the study if one has infinite time. However, I thought it would be better for
the students to move on to the next stage/work in their PhD at this point. They
also had the desire to move on. So, we decided to resubmit the paper to HPCA (IEEE International Symposium on
High-Performance Computer Architecture) during the summer of 2023 without
making extensive changes to it this time around. It got rejected with similar
reviews to ASPLOS.
5th reject: In
one last attempt, we resubmitted the paper to SIGMOD
(ACM International Conference on Management of Data) in the fall of 2023 with extra
results but not substantial changes. I wasn’t sure if SIGMOD was the right
venue for this type of work, but as a SIGMOD reviewer I have seen papers on
utilizing GPU resources better for deep learning being welcomed by some, if not
most, of the program committee. I also thought that the insights we deliver on
GPUs may be interesting to the data systems community. We got rejected again,
mainly due to the straightforward lessons learned and the topic being a borderline
fit for SIGMOD.
Accept: Finally, I decided
to stop trying to force this paper into a conference. Even though we put a lot
of work into it, our findings were clearly not enough for a
conference publication. In my team, we really like the EuroMLSys workshop (Workshop
on Machine Learning and Systems) that is co-located with the EuroSys
conference. Therefore, it was a natural choice for us, and the paper got
accepted with a presentation slot at EuroMLSys
2024.
Reaching the Edge of the Edge:
Image Analysis in Space
Robert Bayer, Julian Priest, Pınar
Tözün
DEEM 2024 – https://dl.acm.org/doi/10.1145/3650203.3663330
This paper characterizes the
performance of several resource-constrained hardware devices to determine their
suitability for an image-filtering task on a small (hence, extra-constrained)
satellite.
The roots of this paper also go
back to 2021, though the actual work on our end didn’t start until spring 2022.
In 2021, Julian Priest joined our
lab. He is the main representative of the DISCO
(Danish Student CubeSat Program) at our university. DISCO is an educational
project that involves several Danish universities. It gives the students the
opportunity to design and operate a small satellite. The target use case is
Earth observation; more specifically, taking images of Earth from the satellite
and analyzing them. The challenge with this use case is that the communication
link between the Earth and the satellite isn’t your typical on-Earth internet
connection; it is weak and temporary. Hence, sending all the images captured on
the satellite is not an option. There is a need for image filtering on the
satellite to send to Earth only the images that are of substantial interest.
This need for filtering images leads to a follow-up challenge: the computation
power that can be deployed on a small satellite is also small, due to both the space
and the power restrictions of the satellite. Hence, there was a need to identify
hardware device(s) to deploy on such a satellite that could satisfy the
size, power, and image-filtering latency requirements.
Since I joined ITU, I have also
been interested in analyzing the performance of a variety of small hardware
devices. In general, I always look for good excuses for benchmarking hardware.
:) Hence, DISCO was a fantastic excuse. We also had the perfect student to lead
the work, Robert Bayer, who was a
student assistant with me then and is now one of my PhD students.
1st reject: The
hardware benchmarking for DISCO started in Spring 2022. I thought it could be
interesting to write up the results and submit something to CIDR (Conference on Innovative Data
Systems Research) 2023. CIDR values papers on interesting and challenging data
systems problems, and in my opinion the image processing pipeline of DISCO fits into
this category. The reviewers, however, didn’t agree with me on the data systems
connection, so the paper got rejected. Otherwise, two out of three reviewers had a
positive tone.
2nd reject: After
the CIDR rejection, I thought the SIGMOD 2024 Data-Intensive Applications track could be a fit for this topic. This was also suggested
by one of the positive CIDR reviewers. We added one more hardware device to our
study, re-measured power consumption on all devices with a more precise
external device, included details on the satellite components, and submitted
the paper. Around the paper submission time, April 2023, the first DISCO
satellite, built based on the results presented in the submitted paper, was
launched into space. I thought this submission was the best paper I had ever
co-authored in my entire career (no offense to the co-authors of my other papers), but no
one else agreed. The paper got rejected once again mainly due to being a misfit
for SIGMOD’s data management focus.
3rd reject: After
two trials with data management venues, I thought it would be better to target a
systems venue, as also suggested by some of the reviewers who rejected the
paper. Thus, we made minor adjustments to the paper based on the feedback from
previous reviews and submitted it to ASPLOS 2024’s summer
2023 round. We got more detailed feedback, since no one thought the paper was a
misfit for ASPLOS. However, overall, the reviewers found the results neither novel
nor surprising enough for ASPLOS and the focus on a single application too
narrow, even though they all appreciated the motivation of the work. Hence, the
paper was rejected once again.
10 days after receiving this
rejection, Robert won the best Computer Science MSc thesis award in Denmark for
the same work.
4th reject: When
we received the 3rd reject, the submission deadlines for MLSys 2024
and EuroSys 2024 had already passed. They would have been other relevant systems
venues for this work. The other option, which was also recommended by one of
the CIDR reviewers, was MobiSys, but this was a whole different world for me,
and I wasn’t sure if I wanted to jump into a third community while already
doing a bad job juggling the data management and systems communities. Therefore, I
recommended that Robert target VLDB’s (International Conference on Very Large
Databases) Scalable Data Science track. Based on the call for papers, both Robert and I thought the
paper’s topic fit there. We were wrong once again. The paper got rejected
mainly due to being unfit for VLDB.
Accept: This paper was
tied to a real-world application deployment; the first DISCO satellite. Hence,
there wasn’t much room to improve the work to please conference reviewers. We
could do more benchmarking, but the satellite was already in space based on our
existing results. The paper as is had closure and real-world impact. Thus, to
avoid delaying the publication further, I once again gave up on conferences and
started to think about relevant workshops. Robert also needed to move on. I
thought the DEEM (Data
Management for End-to-End Machine Learning) workshop, which I like very much,
co-located with the SIGMOD conference, would be a nice venue for this work. I
emailed the workshop chairs to double-check the suitability of the topic for
the workshop to avoid another “this is unfit” rejection. They kindly confirmed
that the topic was in scope for them. So, we submitted the paper to DEEM 2024,
and it got accepted.
I personally enjoy and value some
conference workshops more than the main conference. Workshops gather the subset
of people in a research community with similar research interests. They can be
way more effective for exposing your work to the right audience than the
conference itself. Similarly, the talks at a workshop in your research area are
usually more relevant for you content-wise. So, I am happy that my students had
a chance to present their hard work at these workshops, which I regard very
highly.
However, a workshop publication
unfortunately doesn’t count as much as a conference publication on one’s CV
when people evaluate you for academic positions or grant submissions. A couple
of years ago, a postdoc candidate I wished to hire mentioned that I didn’t seem
to have that many publications recently. This wasn’t the main reason he declined
my offer in the end, but it was something he noted down, and I am sure others
do the same. This is how our profession works.
It has been more than 6 years
since I joined ITU and almost 3 years since I had my first PhD student. I still
don’t have a conference paper with my own PhD students. If ITU had a more
traditional tenure-track scheme, I wouldn’t have gotten tenure. Earlier
this year, I went down the rabbit hole trying to figure out what I was doing
wrong and what I could do better in the future. The list was too long, but none
of the answers were soothing. Deeper into the hole I questioned whether I was a
shit advisor or a complete failure at my job. Tori Amos’ Crucify played over and
over in my head, especially the lines “Nothing I do is good enough for you, so
I crucify myself every day.” and “got enough guilt to start my own religion.”
Luckily, Crucify ends with
“Never going back again to crucify myself every day.”
I know I made mistakes and
misjudgments and will likely keep making them. I know the struggle is partly
due to changing my research field and trying to build up my own research group
from scratch without any start-up funding. I know systems work takes
time to get published; 2 years or more is the common case. I know the 3-year
PhD duration in Denmark freaks me out as a result and makes me more impatient
than I should be for publications. I know everyone’s papers get rejected; even
the works of the people I admire. I know one of my favorite conferences, CIDR,
was founded by people whose work was underappreciated and rejected by VLDB and
SIGMOD. I know I still get invited for talks, and when I present my team’s work
to others, I get positive feedback overall, unless people are lying to my face.
I know many colleagues at ITU appreciate me. Most importantly, I know, at my
job, regardless of the rejections, I learn a lot and get the most fulfillment
from the work I do with my students.