Abstract—Remote sensing image change detection is widely used across remote sensing applications, particularly in cropland conservation, where it plays a critical role in protecting the agro-ecosystem and ensuring global food security. However, the steady increase in the resolution and size of remote sensing imagery has led to a 'scale gap' challenge in detecting small building changes in cropland areas. To address this challenge, we propose M-Swin, a transformer-based multi-scale feature fusion change detection network that uses hierarchical windows. To obtain clearer edges and better separation in the change results, we propose a novel Siamese transformer encoder (MSW encoder), which better captures change information for small buildings through hierarchical windows and fuses the multi-scale features obtained from different windows. To reduce missed and false detections of small changed building areas, we propose a novel bi-temporal image feature fusion module (BFFM), which enhances features under a priori guidance and thereby improves the saliency of change regions. Additionally, we introduce a new remote sensing image change detection dataset for cropland, called LuojiaSET-CLCD. Experiments demonstrate that M-Swin has good potential for highly accurate change detection of small buildings within cropland areas and outperforms several recent methods on three datasets (LEVIR, WHU-CD and LuojiaSET-CLCD). Our dataset will be publicly available at https://github.com/RSIIPAC/LuojiaSET-CLCD.