
Abstract: Convolutional Neural Networks (CNNs) have revolutionized computer vision, particularly image classification and semantic segmentation, across everyday, healthcare, and industrial domains. However, existing models often underutilize middle-level features, which balance fine-grained details and semantic information, hindering performance in specialized domains. This research introduces a Bracket-style CNN and its variants to address these limitations. For image semantic segmentation, the proposed model leverages middle-level features by exhaustively pairing adjacent ones through Cross-Attentional Fusion modules, effectively amalgamating semantically rich information with finely patterned details. Consequently, the proposed methodology achieves competitive performance on the PASCAL VOC 2012 (83.6% mIoU), CamVid (76.4%), and Cityscapes (78.3%) datasets. For image classification tasks, namely Diabetic Retinopathy grading and facial expression recognition, a variant called sCAB-Net integrates channel-wise attentional features, attaining an 85.6% quadratic weighted kappa on the Kaggle DR Detection dataset and a 79.37% mean class accuracy on the RAF-DB dataset. Overall, the proposed networks enhance image understanding for practical visual applications.
Keywords: Bracket-style Convolutional Neural Network; Image Semantic Segmentation; Image Classification
I. INTRODUCTION
Deep learning techniques have revolutionized computer vision tasks thanks to the advancements in computational power and the availability of large visual datasets. Convolutional Neural Networks (CNNs) have achieved remarkable success in tasks ranging from image classification [1,2] to semantic segmentation, which involves labeling every pixel in an image [3,4].
Image-level classification is utilized in visual applications like disease progression identification [5]. Meanwhile, semantic segmentation is applied for tasks requiring pixel-wise categorization such as medical image analysis, augmented reality, and autonomous driving.
Existing image semantic segmentation approaches often utilize CNNs originally designed for image classification, such as ResNet [1], as the backbone. In these networks, shallow (early) layers capture finely detailed but less semantic features due to their limited receptive fields, while deeper counterparts extract more abstract and semantically-rich features but with coarser spatial resolution due to subsampling and larger receptive fields.
This progression leads to a decrease in spatial resolution and an increase in channel dimensions, successively extracting local details and global contextual information. The key challenge is designing an optimal decoding strategy that effectively combines fine-grained details from shallow layers with semantic richness from deeper layers to produce accurate pixel-wise predictions.
In terms of architectural topology, existing image semantic segmentation approaches can be broadly categorized into symmetrically-structured networks [3,4,13-18] and asymmetrically-structured networks [19-26].
The former typically uses skip connections to fuse features from corresponding encoding and decoding layers, enhancing the embedding of semantic information from coarse to dense resolution. The latter focuses on the deepest feature maps, applying spatial pyramid pooling to exploit multi-scale contexts but often neglecting middle-level features.
It can be observed that middle-level features, which strike a balance between spatial detail and semantic information, are underutilized in these architectures. They are either used minimally or not at all in the decoding process, which limits the potential for improving segmentation accuracy.
Accordingly, this paper introduces the Cross-Attentional Bracket-style Convolutional Neural Network (CAB-Net) to leverage middle-level features more comprehensively. The hypothesis is that middle-level features can refine pixel-wise context and reduce ambiguities due to their balanced representation. In the proposed architecture, every feature map (except the highest-resolution one) is upsampled and exhaustively paired with adjacent higher-resolution feature maps through attention-embedded combination modules. This iterative process forms a bracket-like structure that refines the prediction map round by round until the final output is achieved.
A Cross-Attentional Fusion (CAF) mechanism, inspired by SENet [2] and SCA-CNN [6] is employed to combine feature maps of different resolutions effectively. This mechanism properly fuses semantically rich information from lower-resolution inputs with finely patterned features from higher-resolution counterparts. Clearly, the middle-level features play dual roles as both semantically richer representations and finer-grained details within the combination modules. This exhaustive utilization is expected to enhance segmentation accuracy compared to existing architectures.
The Bracket-style concept is also extended to image classification tasks in specialized domains like Diabetic Retinopathy (DR) grading and Facial Expression Recognition (FER), where spatial details are crucial. Conventional CNNs may lose important spatial structures due to sequential downsampling. The proposed Single-mode Cross-Attentional Bracket-style CNN (sCAB-Net) integrates channel-wise attentional features from multiple levels of a pretrained CNN. By refining and aggregating features from different resolutions, sCAB-Net constructs a robust final feature vector, enhancing recognition performance effectively.
The core architectures mentioned above are graphically demonstrated in Figure 1, together with the corresponding evaluation results across application areas. The major contributions of the proposed methodology are as follows:
We propose a novel Bracket-shaped CNN architecture that leverages middle-level feature maps by exhaustively pairing adjacent ones through attention-embedded combination modules, iteratively refining the segmentation map.
We introduce an effective scheme called Cross-Attentional Fusion, which densely involves adjacent feature maps of different resolutions to combine semantically-rich information with finely-patterned features.
The proposed Bracket-style networks demonstrate promising capabilities in both semantic segmentation (pixel-level labeling) and image classification (image-level labeling), contributing to practical computer vision applications.
We trained and evaluated the proposed model on well-known datasets, including PASCAL VOC 2012 [7], CamVid [8], and Cityscapes [9], achieving performance competitive with state-of-the-art methods.
We also demonstrated that the architecture can be flexibly manipulated through round-wise feature aggregation to perform efficient per-pixel labeling on datasets with heavily imbalanced classes, such as the DRIVE dataset [10] for retinal blood vessel segmentation.
In particular, the Bracket-shaped concept can be extended to image classification by integrating channel-wise attentional features of semantically rich (high-level) information into finely patterned (low-level) details in a feedback-like manner, enabling the aggregation of spatially rich representations.
The Bracket-structured variants achieved remarkable benchmark results in specialized domains where spatial richness is vital for classification decisions, such as DR recognition on the Kaggle DR Detection dataset [11] and FER on the RAF-DB dataset [12].
II. RELATED WORK
A. Symmetrically-structured Networks
These networks follow an encoder-decoder framework where a pre-trained CNN encoder extracts features from local to global scales, and a decoder reconstructs the segmentation map by progressively integrating semantic context into finer details. Incorporating features from the encoder into the decoder significantly enhances performance.
U-Net [3] employed simple concatenation of encoder and decoder features along the channel dimension. SegNet [4] utilized max-pooling indices from the encoder (i.e., backbone CNN) to guide upsampling, preserving important features but possibly disrupting spatial correlations. G-FRNet [13] and GFF [14] introduced specialized modules like Gate Units to modulate encoded features for dense labeling. Feature Pyramid Network (FPN) [15] and LDN [16] used Lateral Connection Modules where upsampled features are element-wise added to encoder features before convolution. RefineNet [17] and DDSC [18] incorporated refinement units and shortcut connections to enhance contextual information across scales.
While effective, these methods can be computationally intensive due to large tensor sizes during training.
B. Asymmetrically-structured Networks
These networks focus on specialized upsampling strategies that aggregate contextual information without heavily involving multi-level encoder features.
Models like ParseNet [19] and HolisticNet [20] incorporated an additional network stream to capture global context alongside the main segmentation stream, reducing local ambiguities and noise. RLS [21] attached RNNs to the CNN to model dependencies in pixel-level information, enhancing object consistency in segmentation maps.
Meanwhile, models such as DeepLab [22] and PSPNet [23] used dilated (atrous) convolutions to expand the receptive field without increasing parameters, and employed spatial pyramid pooling to aggregate multi-scale features. On the other hand, PAN [24], EncNet [25], and DANet [26] integrated attention modules to emphasize semantically rich information. PAN uses spatial average pooling for global attention, EncNet combines dilated convolutions with context encoding, and DANet applies both spatial and channel attention in parallel.
Our work differs by adopting both channel-wise and spatial attention mechanisms in a cross-attentional manner, integrating these attentions across all feature levels during decoding. This approach aims to seamlessly combine semantically rich context with fine-grained features, enhancing segmentation accuracy compared to methods that focus attention only on the deepest feature maps.
III. METHODOLOGY
The proposed Bracket-shaped decoder can be seamlessly integrated with any classification-based CNN. In this work, ResNet-101 [1] pre-trained on the ImageNet dataset is employed as the backbone CNN to extract meaningful features from input images.
A. Bracket-style CNN for natural image semantic segmentation
As illustrated in Figure 2, the proposed architecture begins with a natural image input, which is processed through a backbone CNN, such as ResNet-101, to extract multi-scale feature maps. These feature maps, denoted as F_n^0 (where n = 1, 2, 3, 4), are produced from convolutional layers with increasing strides of 4, 8, 16, and 32, resulting in progressively smaller spatial dimensions and higher channel depths. This hierarchical representation provides a robust basis for further processing in the bracket-style network architecture.
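Under these stride settings, the feature-map shapes can be sketched as follows. This is a minimal illustration assuming the standard ResNet-101 bottleneck channel widths of 256, 512, 1024, and 2048; the helper name is hypothetical, not from the paper's code.

```python
# Feature-map shapes produced by the four backbone stages for an h x w input.
# Strides follow the paper (4, 8, 16, 32); channel depths assume standard
# ResNet-101 bottleneck widths.
def backbone_shapes(h, w):
    strides = [4, 8, 16, 32]
    channels = [256, 512, 1024, 2048]
    return [(c, h // s, w // s) for s, c in zip(strides, channels)]

# F_1^0 .. F_4^0 for a 512x512 input: spatial size halves, depth grows.
shapes = backbone_shapes(512, 512)
```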
The bracket-style network iteratively combines and refines these multi-scale feature maps through the CAF module, which can be expressed as follows.
F_n^r = C(F_n^{r-1}, F_{n+1}^{r-1})
where F_n^r denotes the nth feature map at the rth round and C(·, ·) represents the CAF module. In the first round, the four initial feature maps are paired and fused, reducing the number of feature maps to three. This process continues iteratively, with each round reducing the number of feature maps by one while progressively increasing their spatial resolution through upsampling. Specifically, the second round generates two feature maps and the third round produces a single feature map F_1^3. These iterations leverage the CAF module to integrate features, suppressing ambiguous details and enriching semantic information.
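The round-wise recurrence above can be sketched as a simple reduction loop. This is an illustrative sketch only: `fuse` stands in for the CAF module (which would also upsample the coarser input), and strings stand in for tensors.

```python
def bracket_decode(features, fuse):
    """Apply the round-wise recurrence F_n^r = C(F_n^{r-1}, F_{n+1}^{r-1}).

    `features` lists [F_1^0, ..., F_N^0] from highest to lowest resolution.
    Each round pairs adjacent maps, shrinking the list by one until a
    single map remains.
    """
    while len(features) > 1:
        features = [fuse(features[n], features[n + 1])
                    for n in range(len(features) - 1)]
    return features[0]

# Toy run with strings standing in for tensors: four inputs take three rounds.
result = bracket_decode(["F1", "F2", "F3", "F4"],
                        fuse=lambda fine, coarse: f"C({fine},{coarse})")
```

Note how middle-level maps (F2, F3) each appear in several fused branches, reflecting their dual role in the bracket structure.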
Figure 2. Proposed CAB-Net Architecture. Note: Spa. Att. means Spatially Attentional Block; Cha. Att. means Channel-wisely Attentional Block; T. Conv. means Transpose Convolutional Layer; Sep. Conv. means Separable Convolutional Layer; {x, d} means a feature map with stride x (i.e., its spatial dimension is 1/x that of the input image) and d channels; x = - (dash) means the spatial size equals 1x1.
The CAF module plays a critical role in this process, combining adjacent feature maps by leveraging both channel-wise and spatial attention mechanisms. This cross-attentional fusion enables the integration of semantically rich, coarse-scale features with finer-grained patterns, ensuring that the resulting feature maps at each stage are progressively more informative. After the third round, the final feature map undergoes upsampling followed by convolutional operations for the inference of the final segmentation map (Ffinal), which matches the spatial resolution of the input image. A softmax classifier completes the procedure by assigning pixel-wise labels to the segmentation map.
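To illustrate the channel-wise half of this idea, the sketch below applies SE-style channel gating to a feature map. It is a simplified stand-in: the learned excitation layers, the spatial branch, and the fusion of two differently-sized inputs in CAF are all omitted, and plain Python lists stand in for tensors.

```python
import math

def channel_attention(fmap):
    """SE-style channel gating: squeeze each channel by global average
    pooling, gate the squeezed value with a sigmoid, and rescale the
    channel. `fmap` is a list of channels, each a 2-D list of floats."""
    gated = []
    for ch in fmap:
        mean = sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        g = 1.0 / (1.0 + math.exp(-mean))  # sigmoid of the squeezed value
        gated.append([[v * g for v in row] for row in ch])
    return gated
```

In CAF, such channel-wise weights computed from the coarser, semantically richer input would modulate the finer-grained one, rather than a map gating itself.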
This iterative, bracket-style approach offers several advantages. First, it effectively suppresses ambiguous details by refining each upsampled feature map with finer-grained information. Second, it enhances semantic richness through the dense integration of features across multiple scales. The result is a final segmentation map that balances fine-grained localization with global contextual understanding, providing accurate pixel-wise predictions.
B. Bracket-style CNN with Round-wise Feature Aggregation for medical image segmentation
Figure 3. CAB-Net with Round-wise Feature Aggregation (RFA-BNet) for medical image segmentation.
The flexibility of the proposed Bracket-style CNN architecture allows it to be adapted for various tasks, including retinal blood vessel segmentation in medical imaging. One of the main challenges in fundus images is the massive imbalance between blood vessel pixels and background pixels. This imbalance makes it difficult for traditional models to accurately label small structures like blood vessels.
Our hypothesis is that round-wise aggregation of finely-patterned features can effectively address this challenge by enhancing the labeling accuracy of small objects, even under severe class imbalance. This approach avoids the necessity of costly patch-based processing, which is common in other methods [27].
This RFA-BNet variant, i.e., CAB-Net with Round-wise Feature Aggregation, simplifies the feature combination process by using upsampling and element-wise summation for binary labeling, rather than the more complex cross-attentional fusion used in natural image segmentation. The result is a model capable of effectively segmenting fine structures like blood vessels with improved efficiency and accuracy, as shown in Figure 3.
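A minimal sketch of one such aggregation step follows. Nearest-neighbour upsampling stands in for the transpose convolution used in practice, and the function names are illustrative, not from the paper's code.

```python
def upsample2x(ch):
    """Nearest-neighbour 2x upsampling of a 2-D map (a stand-in for the
    transpose convolution used in the actual network)."""
    out = []
    for row in ch:
        wide = [v for v in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                     # duplicate each row
    return out

def rfa_step(fine, coarse):
    """One Round-wise Feature Aggregation step: upsample the coarser map
    and element-wise add it to the finer one (no attention, unlike CAF)."""
    up = upsample2x(coarse)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(fine, up)]
```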
C. Single-mode Bracket-style CNN for natural / medical image classification
Furthermore, building on the flexibility of the Bracket-style architecture, it can also be adapted as a single-mode variant for tackling image classification tasks. In specialized domains like diabetic retinopathy and facial expression recognition, the classification process relies on complex combinations of structural factors. For instance, in diabetic retinopathy, the severity is determined by features such as microaneurysms and hemorrhages, while in facial expression recognition, emotions are conveyed through combinations of muscle movements around the eyes, eyebrows, and mouth.
Traditional CNN architectures for image classification, particularly in specialized domains like DR and FER, retain several disadvantages. Multiple downsampling stages in the feedforward process lead to the loss of spatial correlations between structural details, which are challenging to encode along the channel dimension. As a result, relying solely on the highest-level feature for classification is suboptimal, especially when target labels depend on an amalgamation of various spatial factors within the image.
Figure 4. sCAB-Net for Diabetic Retinopathy severity recognition or facial expression recognition.
This observation led to the hypothesis that pair-wise refinement of low-level features using the semantic context from high-level features could enhance recognition performance. By refining and then amalgamating these features, we can improve the model’s ability to accurately classify complex patterns. The proposed Single-mode Bracket-style CNN (sCAB-Net) is designed with this observation, enabling efficient and effective classification in these specialized domains.
The proposed Single-mode Bracket-style CNN (sCAB-Net) performs two key functions: (i) Channel-wise attentional information of semantically-rich features is integrated into finely-patterned counterparts in a reverse manner, ensuring that lower-level details are enhanced by high-level context; and (ii) Feature maps of different scales are amalgamated, bringing in spatially-rich representations that better inform the final predictions.
This approach allows for exhaustive utilization of contextual information in middle- and low-level features, making spatially-rich details accessible without ambiguities. The outcome is an improvement in classification accuracy for specialized domains like DR and FER, where fine-grained structural details play a crucial role.
Figure 4 provides a more detailed look at the Single-mode CAB-Net architecture, which comprises two major components: (i) a backbone CNN responsible for extracting multi-scale feature maps; and (ii) a Channel-wisely Cross-Attentional Scheme that processes these feature maps to produce the final classification label through three main stages, as follows.
First, Self-Context Aggregation (SCA) is performed, wherein each feature map undergoes self-attention processing to generate self-attentional context vectors for capturing relevant information within each scale.
Second, Bracket-style Attention (BsA) is involved, wherein cross-level concatenation is applied to combine these self-attentional context vectors from adjacent scales. This step generates cross-attentional context vectors that further refine multi-scale information, followed by point-wise multiplication to produce refined multi-level feature maps.
Third, Multi-level Fusion (MLF) is utilized, wherein these refined multi-level features are fused through global average pooling and depth-wise concatenation to produce the final mixture of feature maps, which is then used for the classification decision.
Each of those stages ensures that both low-level and high-level contextual information is integrated, enhancing the model’s accuracy for specialized tasks like diabetic retinopathy severity classification and facial expression recognition.
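The three stages can be sketched end-to-end as follows. This is a toy illustration under strong simplifying assumptions: self-attention and the learned refinement are replaced by global averages and element-wise products, and all levels are assumed to share the same channel count.

```python
def self_context(fmap):
    """SCA (simplified): collapse each channel of a feature map to its
    global average, yielding a per-level context vector."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]

def scab_head(fmaps):
    """Sketch of sCAB-Net's three stages on multi-scale feature maps
    (ordered low-level first)."""
    # (i) SCA: one self-attentional context vector per level
    ctx = [self_context(f) for f in fmaps]
    # (ii) BsA: combine each vector with its semantically richer neighbour
    # via element-wise multiplication (standing in for cross-level
    # concatenation plus the learned point-wise refinement)
    refined = [[a * b for a, b in zip(ctx[i], ctx[i + 1])]
               for i in range(len(ctx) - 1)]
    # (iii) MLF: depth-wise concatenation into the final feature vector
    return [v for vec in refined for v in vec]
```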
IV. EVALUATION RESULTS
This section presents the evaluation results of our CAB-Net model for image semantic segmentation and classification across multiple datasets.
For common object segmentation on the PASCAL VOC 2012 dataset [7], CAB-Net achieved an mIoU of 83.6% as reported in Table 1, outperforming state-of-the-art competitive models. For street scene segmentation on the Cityscapes dataset [9], CAB-Net attained an mIoU of 78.3%, ranking among the top-performing models as presented in Table 2. On the CamVid dataset [8], which also focuses on street scenes, CAB-Net recorded an mIoU of 76.4% (see Table 3), showcasing consistent performance across various natural image segmentation tasks.
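The mIoU metric reported throughout can be computed from a per-pixel confusion matrix as follows; this is the standard formulation, not code from the paper.

```python
def mean_iou(conf):
    """Mean Intersection-over-Union from a confusion matrix, where
    conf[t][p] counts pixels of ground-truth class t predicted as class p.
    IoU_c = TP_c / (TP_c + FP_c + FN_c), averaged over classes present."""
    n = len(conf)
    ious = []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                       # missed pixels of class c
        fp = sum(conf[t][c] for t in range(n)) - tp  # pixels wrongly given class c
        denom = tp + fp + fn
        if denom:
            ious.append(tp / denom)
    return sum(ious) / len(ious)
```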
Table 1. Comparison of mIoU on the test set of the PASCAL VOC 2012 dataset with existing approaches

Approach | mIoU (%)
FCN [28] | 62.2
B-Net-VGG-LCM [29] | 78.5
G-FRNet [13] | 79.3
DDSC [18] | 81.2
WideResNet [30] | 82.5
PSPNet [23] | 82.6
DANet [26] | 82.6
DFN [31] | 82.7
EncNet [25] | 82.9
TKCN [32] | 83.2
CAB-Net | 83.6
Table 2. Comparison of mIoU on the test set of the Cityscapes dataset with existing approaches

Approach | mIoU (%)
SegNet [4] | 56.1
FSSNet [33] | 58.8
FCN [28] | 65.3
DeepLab-CRF [22] | 70.4
RefineNet [17] | 73.6
BiSeNet [34] | 74.7
SwiftNetRN-18 [35] | 75.5
B-Net-VGG-LCM [29] | 75.9
DUC-HDC [36] | 77.6
PSPNet [23] | 78.4
CAB-Net | 78.3
Table 3. Comparison of mIoU on the test set of the CamVid dataset with existing approaches

Approach | mIoU (%)
SegNet [4] | 60.1
DeepLab-LFOV [22] | 61.6
Dilation8 [37] | 65.3
Dilation+FSO-DF [38] | 66.1
B-Net-VGG-LCM [29] | 66.4
G-FRNet [13] | 68.0
BiSeNet [34] | 68.7
DDSC [18] | 70.9
LDN121 162 [16] | 75.8
CAB-Net | 76.4
Table 4. Comparison of Sensitivity, Specificity, Accuracy, and AUROC on the test set of the DRIVE dataset with existing patch-based approaches

Approach | Sensitivity | Specificity | Accuracy | AUROC
Liskowski et al. [39] | 0.7763 | 0.9768 | 0.9495 | 0.9720
Jiang et al. [40] | 0.7540 | 0.9825 | 0.9624 | 0.9810
Feng et al. [41] | 0.7811 | 0.9839 | 0.9560 | 0.9792
He et al. [42] | 0.7761 | 0.9792 | 0.9519 | N/A
Baseline (w/o RFA) | 0.7807 | 0.9667 | 0.9484 | 0.9659
RFA-BNet | 0.7932 | 0.9741 | 0.9511 | 0.9732
For retinal blood vessel segmentation on the DRIVE dataset [10], Table 4 shows that our RFA-BNet variant achieved high quantitative performance. Specifically, it achieved a sensitivity of 79.32%, specificity of 97.41%, accuracy of 95.11%, and AUROC of 97.32%. These results compare favorably with existing methods, and our model shows robust performance without requiring additional patch-based pre-processing steps.
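These binary metrics follow directly from the pixel-wise confusion counts; the formulation below is standard, treating vessel pixels as the positive class.

```python
def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and accuracy for pixel-wise vessel
    labeling, with vessel pixels as the positive class."""
    sensitivity = tp / (tp + fn)                # recall on vessel pixels
    specificity = tn / (tn + fp)                # recall on background pixels
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # overall fraction correct
    return sensitivity, specificity, accuracy
```

Because background pixels vastly outnumber vessel pixels in fundus images, accuracy alone is misleading; sensitivity is the figure that reflects how well thin vessels are recovered.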
According to the experimental results of image semantic segmentation across the above-mentioned three datasets, CAB-Net demonstrates strong performance, leveraging its unique architectural features to handle different segmentation challenges effectively. This versatility highlights its potential in both natural and medical image segmentation tasks.
Table 5. Comparison of Mean Class Accuracy on the test set of the RAF-DB dataset with existing approaches

Approach | Mean Class Accuracy (%)
DLP-CNN [12] | 74.20
3DMFA [43] | 75.73
ResiDen [44] | 76.54
MRE-CNN [45] | 76.73
Capsule-based Net [46] | 77.48
Double Cd-LBP [47] | 78.60
sCAB-Net (VGG-16) | 78.81
sCAB-Net (ResNet-101) | 79.33
sCAB-Net (DenseNet-161) | 79.37
Turning to image classification, on the RAF-DB dataset [12], which includes seven facial expression classes, our sCAB-Net variant achieved a mean class accuracy of 79.37% with DenseNet-161 as the backbone. This result outperforms existing approaches, including Double Cd-LBP [47] (78.6%) and Capsule-based Net [46] (77.48%). Notably, even with VGG-16 or ResNet-101 as the backbone, sCAB-Net consistently exceeds the performance of existing methods.
As for diabetic retinopathy severity classification on the Kaggle DR Detection dataset [11], sCAB-Net with a DenseNet-161 backbone achieved a Quadratic Weighted Kappa (QWK) score of 85.6%, closely matching the best-performing model, Zoom-in-Net [51] (85.7%), but with significantly fewer parameters. Our model demonstrates competitive performance with 15-29% fewer parameters than Zoom-in-Net, illustrating its efficiency and effectiveness in handling large-scale medical image classification tasks.
Table 6. Comparison of Quadratic Weighted Kappa (QWK) on the test set of the Kaggle DR Detection dataset with existing approaches

Approach | No. params | QWK (%)
11-layer CNN [48] | 0.93M | 76.7
SI2DRNet-v1 [49] | 10.6M | 80.4
18-layer CNN [50] | 8.9M | 85.1
Zoom-in-Net [51] | 55.8M | 85.7
sCAB-Net (VGG-16) | 15.59M | 84.9
sCAB-Net (ResNet-101) | 47.36M | 85.4
sCAB-Net (DenseNet-161) | 39.56M | 85.6
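The QWK metric used above can be computed as follows; this is the standard formulation for ordinal grading tasks, not code from the paper.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """QWK: 1 minus the ratio of weighted observed disagreement to the
    disagreement expected by chance, with quadratic weights
    w_ij = (i - j)^2 / (n_classes - 1)^2, so larger grade errors cost more."""
    n = len(y_true)
    obs = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        obs[t][p] += 1
    hist_t = [sum(obs[i]) for i in range(n_classes)]  # true-grade histogram
    hist_p = [sum(obs[i][j] for i in range(n_classes))
              for j in range(n_classes)]              # predicted-grade histogram
    w = lambda i, j: (i - j) ** 2 / (n_classes - 1) ** 2
    observed = sum(w(i, j) * obs[i][j]
                   for i in range(n_classes) for j in range(n_classes))
    expected = sum(w(i, j) * hist_t[i] * hist_p[j] / n
                   for i in range(n_classes) for j in range(n_classes))
    return 1.0 - observed / expected
```

Perfect agreement yields 1.0, chance-level agreement 0.0, and systematic reversal a negative score.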
V. CONCLUSIONS
In conclusion, this paper presents the Bracket-style Convolutional Neural Network (CAB-Net), a deep learning model family that leverages a round-by-round cross-attentional fusion mechanism. This approach effectively combines semantically-rich information from lower-resolution inputs with finely-patterned features from higher-resolution counterparts, maximizing the use of middle- and low-level contextual information for improved feature representation. The model offers several key contributions as follows: (i) it can be easily integrated with different backbone CNNs, supporting multi-scale feature representations; and (ii) the baseline model can be adapted into variants, such as Round-wise Feature Aggregation and Single-mode structures, to tackle diverse tasks in image semantic segmentation and classification. Through extensive experiments, CAB-Net has demonstrated state-of-the-art performance across various benchmark datasets. Notable achievements include an mIoU of 83.6% on PASCAL VOC 2012 for image segmentation and a mean class accuracy of 79.37% on RAF-DB for facial expression recognition. It is worth noting that the uniqueness of this work lies in its Bracket-shaped architecture and the Cross-Attentional Fusion mechanism, enabling the model to efficiently blend semantically-rich context with finely-patterned features, offering a powerful solution for both segmentation and classification tasks in computer vision.
REFERENCES
K. He et al. “Deep Residual Learning for Image Recognition”. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 770–778.
J. Hu, L. Shen, and G. Sun. "Squeeze-and-Excitation Networks". In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, pp. 7132-7141.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. "U-Net: Convolutional Networks for Biomedical Image Segmentation". In: Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015. Springer International Publishing, 2015, pp. 234-241.
V. Badrinarayanan, A. Kendall, and R. Cipolla. "SegNet: A Deep Convolutional Encoder- Decoder Architecture for Image Segmentation". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 39.12 (2017), pp. 2481-2495.
Cam-Hao Hua et al. "Bimodal Learning via Trilogy of Skip-connection Deep Networks for Diabetic Retinopathy Risk Progression Identification". In: International Journal of Medical Informatics (2019). ISSN: 1386-5056.
L. Chen et al. “SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning”. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, pp. 6298–6306.
M. Everingham et al. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. 2012.
Gabriel J. Brostow et al. “Segmentation and Recognition Using Structure from Motion Point Clouds”. In: Computer Vision – ECCV 2008. Springer Berlin Heidelberg, 2008, pp. 44–57.
M. Cordts et al. “The Cityscapes Dataset for Semantic Urban Scene Understanding”. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 3213–3223.
J. Staal et al. “Ridge-based vessel segmentation in color images of the retina”. In: IEEE Transactions on Medical Imaging 23.4 (2004), pp. 501–509. ISSN: 0278-0062.
"Kaggle: Diabetic Retinopathy Detection". https://www.kaggle.com/c/diabetic-retinopathy-detection.
Shan Li and Weihong Deng. “Reliable Crowdsourcing and Deep Locality-Preserving Learning for Unconstrained Facial Expression Recognition”. In: IEEE Transactions on Image Processing 28.1 (2019), pp. 356–370.
M. A. Islam et al. “Gated Feedback Refinement Network for Dense Image Labeling”. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, pp. 4877–4885.
Xiangtai Li et al. “GFF: Gated Fully Fusion for Semantic Segmentation”. In: CoRR abs/1904.01803 (2019). arXiv: 1904.01803.
T. Y. Lin et al. “Feature Pyramid Networks for Object Detection”. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, pp. 936–944.
Ivan Kreso, Josip Krapac, and Sinisa Segvic. “Efficient Ladder-style DenseNets for Semantic Segmentation of Large Images”. In: CoRR abs/1905.05661 (2019). arXiv: 1905.05661.
G. Lin et al. “RefineNet: Multi-Path Refinement Networks for Dense Prediction”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2019), pp. 1–1.
Piotr Bilinski and Victor Prisacariu. “Dense Decoder Shortcut Connections for Single-Pass Semantic Segmentation”. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR. 2018, pp. 6596–6605.
Wei Liu, Andrew Rabinovich, and Alexander C. Berg. “ParseNet: Looking Wider to See Better”. In: CoRR abs/1506.04579 (2015). arXiv: 1506.04579.
Hexiang Hu et al. “Recalling Holistic Information for Semantic Segmentation”. In: CoRR abs/1611.08061 (2016). arXiv: 1611.08061.
T. H. N. Le et al. “Reformulating Level Sets as Deep Recurrent Neural Network Approach to Semantic Segmentation”. In: IEEE Transactions on Image Processing 27.5 (2018), pp. 2393–2407.
L. C. Chen et al. “DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 40.4 (2018), pp. 834–848.
H. Zhao et al. “Pyramid Scene Parsing Network”. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, pp. 6230–6239.
Hanchao Li et al. “Pyramid Attention Network for Semantic Segmentation”. In: British Machine Vision Conference 2018, BMVC. 2018, p. 285.
H. Zhang et al. “Context Encoding for Semantic Segmentation”. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, pp. 7151–7160.
Jun Fu et al. “Dual Attention Network for Scene Segmentation”. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR. 2019, pp. 3146–3154.
Z. Feng, J. Yang, and L. Yao. “Patch-based fully convolutional neural network with skip connections for retinal blood vessel segmentation”. In: 2017 IEEE International Conference on Image Processing (ICIP). 2017, pp. 1742–1746.
J. Long, E. Shelhamer, and T. Darrell. “Fully convolutional networks for semantic segmentation”. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015, pp. 3431–3440.
Cam-Hao Hua, Thien Huynh-The, and Sungyoung Lee. “Convolutional Networks with Bracket-style Decoder for Semantic Scene Segmentation”. In: 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC). 2018.
Zifeng Wu, Chunhua Shen, and Anton van den Hengel. “Wider or Deeper: Revisiting the ResNet Model for Visual Recognition”. In: Pattern Recognition 90 (2019), pp. 119–133. ISSN: 0031-3203.
C. Yu et al. “Learning a Discriminative Feature Network for Semantic Segmentation”. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, pp. 1857–1866.
T. Wu et al. “Tree-Structured Kronecker Convolutional Network for Semantic Segmentation”. In: 2019 IEEE International Conference on Multimedia and Expo (ICME). 2019, pp. 940–945.
X. Zhang et al. “Fast Semantic Segmentation for Scene Perception”. In: IEEE Transactions on Industrial Informatics 15.2 (2019), pp. 1183–1192.
Changqian Yu et al. “BiSeNet: Bilateral Segmentation Network for Real-Time Semantic Segmentation”. In: Computer Vision – ECCV 2018. 2018, pp. 334–349. ISBN: 978-3-030-01261-8.
Marin Orsic et al. “In Defense of Pre-Trained ImageNet Architectures for Real-Time Semantic Segmentation of Road-Driving Images”. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR. 2019, pp. 12607–12616.
P. Wang et al. “Understanding Convolution for Semantic Segmentation”. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). 2018, pp. 1451–1460.
Fisher Yu and Vladlen Koltun. “Multi-Scale Context Aggregation by Dilated Convolutions”. In: 4th International Conference on Learning Representations, ICLR. 2016.
A. Kundu, V. Vineet, and V. Koltun. “Feature Space Optimization for Semantic Video Segmentation”. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 3168–3175.
P. Liskowski and K. Krawiec. “Segmenting Retinal Blood Vessels With Deep Neural Networks”. In: IEEE Transactions on Medical Imaging 35.11 (2016), pp. 2369–2380. ISSN: 0278-0062.
Zhexin Jiang et al. “Retinal blood vessel segmentation using fully convolutional network with transfer learning”. In: Computerized Medical Imaging and Graphics 68 (2018), pp. 1–15. ISSN: 0895-6111.
Z. Feng, J. Yang, and L. Yao. “Patch-based fully convolutional neural network with skip connections for retinal blood vessel segmentation”. In: 2017 IEEE International Conference on Image Processing (ICIP). 2017, pp. 1742–1746.
Q. He et al. “Multi-Label Classification Scheme Based on Local Regression for Retinal Vessel Segmentation”. In: 2018 25th IEEE International Conference on Image Processing (ICIP). 2018, pp. 2765–2769.
F. Lin et al. “Facial Expression Recognition with Data Augmentation and Compact Feature Learning”. In: 2018 25th IEEE International Conference on Image Processing (ICIP). 2018, pp. 1957–1961.
S. Jyoti, G. Sharma, and A. Dhall. “Expression Empowered ResiDen Network for Facial Action Unit Detection”. In: 2019 14th IEEE International Conference on Automatic Face Gesture Recognition (FG 2019). 2019, pp. 1–8.
Yingruo Fan, Jacqueline CK Lam, and Victor OK Li. “Multi-region ensemble convolutional neural network for facial expression recognition”. In: International Conference on Artificial Neural Networks. Springer. 2018, pp. 84–94.
S. Ghosh, A. Dhall, and N. Sebe. “Automatic Group Affect Analysis in Images via Visual Attribute and Feature Networks”. In: 2018 25th IEEE International Conference on Image Processing (ICIP). 2018, pp. 1967–1971.
F. Shen, J. Liu, and P. Wu. “Double Complete D-LBP with Extreme Learning Machine Auto-Encoder and Cascade Forest for Facial Expression Analysis”. In: 2018 25th IEEE International Conference on Image Processing (ICIP). 2018, pp. 1947–1951.
M. C. A. Trivino et al. “Deep Learning on Retina Images as Screening Tool for Diagnostic Decision Support”. In: CoRR abs/1807.09232 (2018). arXiv: 1807.09232.
Y. Chen et al. “Diabetic Retinopathy Detection Based on Deep Convolutional Neural Networks”. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2018, pp. 1030–1034.
S. M. S. Islam, Md M. Hasan, and S. Abdullah. “Deep Learning based Early Detection and Grading of Diabetic Retinopathy Using Retinal Fundus Images”. In: CoRR abs/1812.10595 (2018).
Z. Wang et al. “Zoom-in-Net: Deep Mining Lesions for Diabetic Retinopathy Detection”. In: Medical Image Computing and Computer Assisted Intervention, MICCAI 2017. 2017, pp. 267–275.