Show and Tell: A Neural Image Caption Generator

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that com

arxiv.org

01 Abstract

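Describes research on building a model that describes the content of an image
Problem
- Automatically describing the content of an image is an important problem
- The problem spans both computer vision (image processing) and natural language processing (sentence generation)
Proposed model
- Proposes a deep recurrent neural network (DRNN) architecture that combines recent advances in computer vision and machine translation
- Given an image, the model generates a natural sentence describing it
Training method
- Trained to maximize the likelihood of the target description sentence given the training image
- In effect, it learns to produce the most natural and accurate sentence for a given image
Experimental results
- Experiments on several datasets (Pascal, Flickr30k, SBU, COCO) confirm the model's accuracy and the fluency of the generated sentences
- BLEU scores improve substantially on the Pascal, Flickr30k, SBU, and COCO datasets
- BLEU is a metric for machine translation quality; the higher the score, the closer the generated sentence is to the reference sentence
Key numbers
- Pascal dataset → BLEU-1 improves from 25 to 59 (human performance is about 69)
- BLEU scores also improve markedly on the other datasets, supporting the strength of the proposed model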

02 Introduction

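The problem and why it matters
- Automatically describing the content of an image (image description) is a very hard problem
- Image captioning matters because it can help visually impaired people understand images
Limitations of existing approaches
- Earlier work described images by stitching together solutions to separate sub-problems
- Because each sub-problem has to be solved on its own, this is complex and inefficient
New approach proposed in the paper
- Proposes a single joint model
- The model takes an image as input and is trained to maximize the likelihood of producing the target word sequence, where each word comes from a given dictionary
Proposed model
- Name of the proposed model → Neural Image Caption (NIC)
- The proposed model replaces the encoder RNN with a CNN
- The CNN embeds the image into a fixed-length vector, and this vector is fed to the RNN decoder, which generates the sentence
The task is hard because, on top of image recognition and natural language processing, the model also has to learn a representation of the image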

03 Related Works

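Video description generation
- Early work mostly addressed generating natural language descriptions from video data
- These systems were complex and largely hand-designed, so they worked only in limited domains
Still image description generation
- More recently, attention has shifted to describing still images
- Systems were developed that generate text by building on advances in object recognition
- However, these systems are still limited in expressivity
Description ranking
- Much work has addressed ranking descriptions for a given image
- The usual approach co-embeds images and text in the same vector space and retrieves descriptions
- However, such approaches are limited when it comes to generating novel descriptions
Proposed approach
- This work combines a CNN for image classification with an RNN for sequence modeling into a single network
- The RNN takes the image and the text directly as input and generates the description
- This approach outperforms prior work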

04 Contribution

Proposes an end-to-end system for the image description problem
The proposed model combines parts of the architectures of state-of-the-art (SOTA) vision and language models
Outperforms the previous SOTA

05 Model

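The goal is a model that generates a natural sentence from an image
Applies the sequence-generation approach of machine translation to the image description problem
The model is trained to maximize the probability of the correct description given the image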
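
The training objective above can be written out as follows. This is a sketch in the notation of the original paper, where I is the image, S = (S_0, …, S_N) is the caption, and θ stands for the model parameters; the second line is the chain rule over the words of the caption.

```latex
\begin{aligned}
\theta^{*} &= \arg\max_{\theta} \sum_{(I,S)} \log p(S \mid I; \theta) \\
\log p(S \mid I) &= \sum_{t=0}^{N} \log p(S_t \mid I, S_0, \ldots, S_{t-1})
\end{aligned}
```
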
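RNN and LSTM
- A recurrent neural network (RNN) is used to process the word sequence
- The words seen so far are summarized in a hidden state, or memory, h_t
- LSTM can cope with vanishing and exploding gradients, which makes it well suited to sequence tasks
CNN
- A convolutional neural network (CNN) is used to produce a fixed-length vector representation of the image
- CNNs perform strongly on image recognition and classification
- The CNN used here adopts a novel approach to batch normalization and achieves high performance
- Through transfer learning, a CNN can generalize to other tasks
The model processes the word sequence with an LSTM-based RNN and embeds the image into a fixed-length vector with a CNN → this lets it generate a natural sentence from an image
Structure of the proposed model, NIC (Neural Image Caption)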

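Image input
- Image processing with a CNN (the CNN model that achieved state-of-the-art performance in the ILSVRC 2014 classification competition → GoogLeNet)
  - The input image is processed by the CNN and converted into a fixed-length vector
  - This vector contains the main features of the image
  - The CNN extracts the visual information of the image and expresses it in vector form
- Passing the image vector to the RNN
  - The image vector produced by the CNN is used as the initial input to the RNN
  - Based on this vector, the RNN generates words one after another to form a sentence
Sentence generation for the input image
- The RNN generates one word at a time; each word is predicted from the previous word and the image vector
- This repeats until a special word marking the end of the sentence is generated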
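
The word-by-word generation loop described above can be sketched as follows. This is a minimal illustration, not the paper's code: `next_word_probs` is a hypothetical stand-in for the trained decoder, and `TOY_TABLE` holds made-up probabilities. The sketch picks the single most probable word at each step (greedy decoding) to stay short; the paper reports using beam search in its experiments.

```python
def greedy_decode(next_word_probs, start="<start>", end="<end>", max_len=20):
    """Generate a caption word by word until the end token appears.

    next_word_probs is a hypothetical stand-in for the trained decoder:
    it maps the list of words generated so far to a dict {word: probability}.
    """
    words = [start]
    for _ in range(max_len):
        probs = next_word_probs(words)
        # Greedy choice: take the single most probable next word.
        word = max(probs, key=probs.get)
        words.append(word)
        if word == end:
            break
    return words


# Toy stand-in for the decoder (made-up numbers), keyed on the previous word only.
TOY_TABLE = {
    "<start>": {"a": 0.9, "dog": 0.1},
    "a": {"dog": 0.7, "cat": 0.3},
    "dog": {"runs": 0.6, "<end>": 0.4},
    "runs": {"<end>": 0.8, "a": 0.2},
}


def toy_model(words):
    return TOY_TABLE[words[-1]]


caption = greedy_decode(toy_model)
print(" ".join(caption))
```

The `max_len` cap is a safeguard so that generation stops even if the end token is never produced.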

06 LSTM

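How the LSTM works
Input processing
- The input x_t and the previous output m_(t-1) are fed to the input gate (i), forget gate (f), and output gate (o)
- Each gate uses a sigmoid function, so its value lies between 0 and 1
Cell state update
- The forget gate f_t decides how much of the previous cell state c_(t-1) to forget
- The input gate i_t decides how much new information to admit
- The new cell state c_t is updated through the forget gate and the input gate
Output generation
- The output gate o_t produces the output m_t from the current cell state c_t
- The output m_t is fed to a softmax function and used to predict the next word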
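
The gate computations above can be sketched as a single LSTM step in plain Python. This is an illustration under stated assumptions: the weight names in `W` are hypothetical, biases are omitted, and the output gate is applied directly to the cell state as in the description above (many LSTM implementations apply tanh to c_t first).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def lstm_step(x, m_prev, c_prev, W):
    """One LSTM step.

    W is a dict of weight matrices (hypothetical names): for each of
    'i', 'f', 'o', 'c' there is W[k + 'x'] acting on the input x and
    W[k + 'm'] acting on the previous output m_prev.
    """
    def pre(k):
        a = matvec(W[k + "x"], x)
        b = matvec(W[k + "m"], m_prev)
        return [p + q for p, q in zip(a, b)]

    i = [sigmoid(z) for z in pre("i")]    # input gate
    f = [sigmoid(z) for z in pre("f")]    # forget gate
    o = [sigmoid(z) for z in pre("o")]    # output gate
    g = [math.tanh(z) for z in pre("c")]  # candidate cell content
    # New cell state: keep part of the old state, add part of the candidate.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # Output: the output gate applied to the cell state.
    m = [ok * ck for ok, ck in zip(o, c)]
    return m, c


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


# With all-zero weights every gate equals sigmoid(0) = 0.5 and the candidate is 0.
n_in, n_hid = 3, 2
W = {k + "x": zeros(n_hid, n_in) for k in "ifoc"}
W.update({k + "m": zeros(n_hid, n_hid) for k in "ifoc"})
m, c = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, 1.0], W)
print(m, c)
```

In the zero-weight case the forget gate halves the previous cell state and the output gate halves it again, which makes the role of each gate easy to check by hand.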

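Combining the LSTM model with the CNN image embedder
- Image Input
  - The input image is given to the CNN
  - The CNN converts the image into a fixed-length vector (this vector contains the main features of the image)
- Word Embedding
  - Each word is converted into an embedding vector
  - S_0 : special start word marking the beginning of the sentence
  - S_1, S_2, …, S_(N-1) : the actual words of the sentence
  - S_N : special stop word marking the end of the sentence
- LSTM network
  - The image vector x_(-1) produced by the CNN is used as the initial input to the LSTM
  - At each time step t, the LSTM takes the embedding x_t as input and predicts the next word
  - All unrolled LSTM copies share the same parameters
- Word Prediction
  - The LSTM output is passed through a softmax function to give p_(t+1), the probability distribution over the next word
  - At generation time, the word predicted at each step is used as the input to the next step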
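
The unrolled procedure above can be summarized in equations, following the formulation in the original paper. W_e denotes the word embedding matrix (a symbol not used in the notes above), S_t is treated as a one-hot vector, and L(I, S) is the loss minimized during training, namely the summed negative log-likelihood of the correct word at each step.

```latex
\begin{aligned}
x_{-1} &= \mathrm{CNN}(I) \\
x_t &= W_e S_t, \quad t \in \{0, \ldots, N-1\} \\
p_{t+1} &= \mathrm{LSTM}(x_t), \quad t \in \{0, \ldots, N-1\} \\
L(I, S) &= -\sum_{t=1}^{N} \log p_t(S_t)
\end{aligned}
```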

07 Experiments

Evaluation Metrics
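- To judge whether a description is successful, human raters assign subjective scores
- This was done as an Amazon Mechanical Turk experiment, with each image rated by two workers
- BLEU (Bilingual Evaluation Understudy)
  - Measures the proportion of n-grams in the generated description that match the reference descriptions
- METEOR
  - Another machine translation metric, known to agree with human judgment better than BLEU
- CIDEr (Consensus-based Image Description Evaluation)
  - An image description metric based on agreement with multiple reference descriptions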
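
To make the BLEU bullet concrete, here is a minimal sketch of BLEU-1 (clipped unigram precision times a brevity penalty) for one candidate and one reference. It is a simplified illustration with whitespace tokenization and a single reference; it is not the evaluation code used in the paper, and the example sentences are made up.

```python
import math
from collections import Counter


def bleu1(candidate, reference):
    """BLEU-1 for a single candidate and a single reference (both strings)."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    cand_counts = Counter(cand)
    ref_counts = Counter(ref)
    # Clipped matches: a candidate word counts at most as often as it occurs in the reference.
    matches = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    precision = matches / len(cand)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision


score = bleu1("a dog runs on the grass", "a dog is running on the grass")
print(round(score, 4))
```

Here five of the six candidate words appear in the reference, and the candidate is one word shorter than the reference, so the score is 5/6 scaled down by the brevity penalty.
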
Datasets
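- Five sentences per image (except SBU)
- Pascal VOC 2008 is used for testing only (in the experiments, the model tested on it is trained on MSCOCO)
- Flickr uses photos and the text written about them as data, as is
- SBU captions are text posted by users rather than precise descriptions of the image, so the dataset is expected to act as a kind of noise (weakly labeled examples)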

08 Results

Because the proposed model is data-driven and trained end-to-end, and given the abundance of datasets, the paper sets out to answer the following three questions
- How dataset size affects generalization
  - Checks whether a larger dataset lets the model generalize better to diverse situations
- What kinds of transfer learning it would be able to achieve
  - Evaluates how well knowledge learned on one task carries over to a new task
- How it would deal with weakly labeled examples
  - Tests how well the model works on datasets whose labels are incomplete or inaccurate
Training Details
- Main issues
  - Overfitting : overfitting can occur when there is not enough data, or not enough high-quality data
  - Data size : high-quality datasets often contain fewer than 100,000 images, so data is scarce
  - Description generation is harder than object classification : it is a much harder task and needs more data
- Remedies
  - Use a pretrained model : initialize the CNN weights from a model pretrained on ImageNet to reduce overfitting
  - Dropout and ensembling : besides tuning model capacity, dropout and ensembling were used to curb overfitting, improving BLEU by a few points
Generation Results
- Compared against Random, Nearest Neighbor, and Human baselines on the automatic metrics
- On some metrics the model scores higher than Human, but this does not reflect actual caption quality → more research on metrics is needed

Transfer Learning, Data Size and Label Quality
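- Transfer learning : apply a model trained on one dataset to another dataset
  - Flickr30k → Flickr8k : with more training data, BLEU rises by 4 points
  - MSCOCO : contains even more data, but the vocabulary mismatch makes BLEU drop by 10 points
  - PASCAL : transfer learning from MSCOCO performs better
- SBU dataset : weak labeling and more noise; applying the MSCOCO model lowers performance (BLEU 28 → 16)
- Conclusion : more data and better labels help performance, but a large domain gap between datasets degrades it
Generation Diversity Discussion
- Evaluates the diversity of the descriptions the model generates
- The generated descriptions are diverse and include novel descriptions not found in the training data
- The model can offer diverse descriptions while still scoring high on BLEU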

Ranking Results
- The NIC model ranks well on both the image description and image retrieval tasks
- It also compares well with other recent models

Human Evaluation
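- Shows scores assigned directly by human raters
- Analyzes the human ratings of the descriptions generated by the NIC model
- Limitation of BLEU : BLEU does not capture the gap between NIC and human descriptions well
- This shows that BLEU is not a perfect metric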

Analysis of Embeddings
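- The previous word is fed to the LSTM as a word embedding
- The learned embeddings capture semantic relationships in language well
  - e.g. "horse", "pony", and "donkey" are learned to be close to one another
- Semantic relationships between words help the model's visual feature extraction
  - e.g. even a rare word such as "unicorn" can draw on information from semantically related embeddings such as "horse"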
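
"Close to one another" in embedding space is usually measured with cosine similarity. The sketch below shows the idea with made-up 3-dimensional vectors; the numbers are purely illustrative and are not embeddings learned by NIC.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


# Made-up vectors, for illustration only (not learned embeddings).
toy_embeddings = {
    "horse": [0.9, 0.1, 0.0],
    "pony": [0.8, 0.2, 0.1],
    "unicorn": [0.7, 0.1, 0.4],
    "car": [0.0, 0.9, 0.3],
}


def nearest(word, embeddings):
    """Return the other words sorted from most to least similar."""
    query = embeddings[word]
    others = [w for w in embeddings if w != word]
    return sorted(others, key=lambda w: cosine(query, embeddings[w]), reverse=True)


print(nearest("unicorn", toy_embeddings))
```

With these toy values the rare word "unicorn" lands next to "pony" and "horse" and far from "car", which is the kind of neighborhood structure the notes above describe.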

09 Conclusion

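NIC model : an end-to-end neural network system that looks at an image and automatically generates a description
Performs well in both qualitative and quantitative evaluation across several datasets
Shows high accuracy in turning images into descriptions
Performance should improve further as larger datasets become available
Unsupervised learning from image data and text data could further improve image description approaches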

10 Code

train.py

from __future__ import print_function
import torch
from torchvision import datasets, models, transforms
from torch.autograd import Variable
from torch.nn.utils.rnn import pack_padded_sequence
import torch.optim as optim
import torch.nn as nn
import numpy as np
import utils
from data_loader import get_coco_data_loader
from models import CNN, RNN
from vocab import Vocabulary, load_vocab
import os
from tqdm import tqdm

def main(args):
    # hyperparameters
    batch_size = args.batch_size
    num_workers = 1

    # Image Preprocessing
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    ])

    # load COCOs dataset
    IMAGES_PATH = 'data/train2014'
    CAPTION_FILE_PATH = 'data/annotations/captions_train2014.json'

    print("Loading vocabulary...")
    vocab = load_vocab()
    if isinstance(vocab, dict):
        print("Error: Loaded vocabulary is a dictionary, not a Vocabulary object")
        return

    print("Loading training data...")
    train_loader = get_coco_data_loader(path=IMAGES_PATH,
                                        json=CAPTION_FILE_PATH,
                                        vocab=vocab,
                                        transform=transform,
                                        batch_size=batch_size,
                                        shuffle=True,
                                        num_workers=num_workers)

    print("Loading validation data...")
    IMAGES_PATH = 'data/val2014'
    CAPTION_FILE_PATH = 'data/annotations/captions_val2014.json'
    val_loader = get_coco_data_loader(path=IMAGES_PATH,
                                      json=CAPTION_FILE_PATH,
                                      vocab=vocab,
                                      transform=transform,
                                      batch_size=batch_size,
                                      shuffle=True,
                                      num_workers=num_workers)

    losses_val = []
    losses_train = []

    # Build the models
    ngpu = 0
    initial_step = initial_epoch = 0
    embed_size = args.embed_size
    num_hiddens = args.num_hidden
    learning_rate = 1e-3
    num_epochs = args.num_epochs  # use the --num_epochs command-line value
    log_step = args.log_step
    save_step = 500
    checkpoint_dir = args.checkpoint_dir

    encoder = CNN(embed_size)
    decoder = RNN(embed_size, num_hiddens, len(vocab), 1, rec_unit=args.rec_unit)

    # Loss
    criterion = nn.CrossEntropyLoss()

    if args.checkpoint_file:
        print("Loading checkpoint...")
        encoder_state_dict, decoder_state_dict, optimizer, *meta = utils.load_models(args.checkpoint_file,args.sample)
        initial_step, initial_epoch, losses_train, losses_val = meta
        encoder.load_state_dict(encoder_state_dict)
        decoder.load_state_dict(decoder_state_dict)
    else:
        params = list(decoder.parameters()) + list(encoder.linear.parameters()) + list(encoder.batchnorm.parameters())
        optimizer = torch.optim.Adam(params, lr=learning_rate)

    if torch.cuda.is_available():
        encoder.cuda()
        decoder.cuda()

    if args.sample:
        return utils.sample(encoder, decoder, vocab, val_loader)

    # Train the Models
    total_step = len(train_loader)
    try:
        print("Starting training...")
        for epoch in range(initial_epoch, num_epochs):
            print(f"Epoch [{epoch + 1}/{num_epochs}]")
            with tqdm(total=total_step, desc=f"Epoch {epoch + 1}/{num_epochs}") as pbar:
                for step, (images, captions, lengths) in enumerate(train_loader, start=initial_step):
                    # Set mini-batch dataset
                    images = utils.to_var(images)
                    captions = utils.to_var(captions)
                    targets = pack_padded_sequence(captions, lengths, batch_first=True)[0]

                    # Forward, Backward and Optimize
                    decoder.zero_grad()
                    encoder.zero_grad()

                    if ngpu > 1:
                        # run on multiple GPU
                        features = nn.parallel.data_parallel(encoder, images, range(ngpu))
                        outputs = nn.parallel.data_parallel(decoder, features, range(ngpu))
                    else:
                        # run on single GPU
                        features = encoder(images)
                        outputs = decoder(features, captions, lengths)

                    train_loss = criterion(outputs, targets)
                    losses_train.append(train_loss.item())
                    train_loss.backward()
                    optimizer.step()

                    pbar.update(1)

                    # Run validation set and predict
                    if step % log_step == 0:
                        print("\n")
                        print(f"Running validation at step {step}...")
                        encoder.batchnorm.eval()
                        # run validation set; gradients are not needed here
                        batch_loss_val = []
                        with torch.no_grad():
                            for val_step, (images, captions, lengths) in enumerate(val_loader):
                                images = utils.to_var(images, requires_grad=False)
                                captions = utils.to_var(captions, requires_grad=False)

                                targets = pack_padded_sequence(captions, lengths, batch_first=True)[0]
                                features = encoder(images)
                                outputs = decoder(features, captions, lengths)
                                val_loss = criterion(outputs, targets)
                                batch_loss_val.append(val_loss.item())

                        losses_val.append(np.mean(batch_loss_val))

                        # predict
                        sampled_ids = decoder.sample(features)
                        sampled_ids = sampled_ids.cpu().data.numpy()[0]
                        sentence = utils.convert_back_to_text(sampled_ids, vocab)
                        print('Sample:', sentence)

                        true_ids = captions.cpu().data.numpy()[0]
                        sentence = utils.convert_back_to_text(true_ids, vocab)
                        print('Target:', sentence)

                        print('Epoch: {} - Step: {} - Train Loss: {} - Eval Loss: {}'.format(epoch, step, losses_train[-1], losses_val[-1]))
                        encoder.batchnorm.train()

                    # Save the models
                    if (step + 1) % save_step == 0:
                        print("Saving model at step {}...".format(step + 1))
                        utils.save_models(encoder, decoder, optimizer, step, epoch, losses_train, losses_val, checkpoint_dir)
                        utils.dump_losses(losses_train, losses_val, os.path.join(checkpoint_dir, 'losses.pkl'))

    except KeyboardInterrupt:
        pass
    except Exception as e:
        print(f"An error occurred: {e}")

    finally:
        # Ensure step and epoch are defined before saving models
        if 'step' not in locals():
            step = initial_step  # Assign a default value if not defined
        if 'epoch' not in locals():
            epoch = initial_epoch  # The training loop never started
        print("Final save...")
        utils.save_models(encoder, decoder, optimizer, step, epoch, losses_train, losses_val, checkpoint_dir)
        utils.dump_losses(losses_train, losses_val, os.path.join(checkpoint_dir, 'losses.pkl'))

if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--checkpoint_file', type=str, default=None, help='path to saved checkpoint')
    parser.add_argument('--checkpoint_dir', type=str, default='checkpoints', help='directory to save checkpoints')
    parser.add_argument('--batch_size', type=int, default=128, help='size of batches')
    parser.add_argument('--rec_unit', type=str, default='gru', help='choose "gru", "lstm" or "elman"')
    parser.add_argument('--sample', default=False, action='store_true', help='just show result, requires --checkpoint_file')
    parser.add_argument('--log_step', type=int, default=125, help='number of steps in between calculating loss')
    parser.add_argument('--num_hidden', type=int, default=512, help='number of hidden units in the RNN')
    parser.add_argument('--embed_size', type=int, default=512, help='dimension of word embedding vectors')
    parser.add_argument('--num_epochs', type=int, default=3, help='number of epochs for training')
    args = parser.parse_args()
    main(args)
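
The line `targets = pack_padded_sequence(captions, lengths, batch_first=True)[0]` in train.py above flattens the padded caption batch into one long vector of real tokens, so that the loss skips padding positions. The pure-Python sketch below emulates the order of the elements in that flattened vector. It assumes the batch is sorted by caption length, longest first (which `pack_padded_sequence` requires by default); `packed_order` is a hypothetical helper written for illustration, not part of PyTorch or of this repository.

```python
def packed_order(captions, lengths):
    """Emulate the element order of
    pack_padded_sequence(captions, lengths, batch_first=True)[0].

    captions: list of equally long (padded) token-id lists
    lengths:  true caption lengths, sorted in decreasing order
    """
    assert lengths == sorted(lengths, reverse=True), "lengths must be sorted, longest first"
    packed = []
    for t in range(max(lengths)):
        # At time step t, keep only the captions that have not ended yet.
        for caption, n in zip(captions, lengths):
            if t < n:
                packed.append(caption[t])
    return packed


captions = [[1, 2, 3, 4], [5, 6, 0, 0]]  # 0 is padding
lengths = [4, 2]
print(packed_order(captions, lengths))
```

The result is ordered time step by time step across the batch, and the two padding zeros of the second caption are dropped.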

eval.py

from __future__ import print_function
import torch
from torchvision import datasets, models, transforms
from torch.autograd import Variable
from torch.nn.utils.rnn import pack_padded_sequence
import torch.optim as optim
import torch.nn as nn
import numpy as np
import utils
from data_loader import get_coco_data_loader, get_basic_loader
from models import CNN, RNN
from vocab import Vocabulary, load_vocab
import os

def main(args):
    # hyperparameters
    batch_size = args.batch_size
    num_workers = 2

    # Image Preprocessing (no random augmentation at evaluation time)
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    ])

    vocab = load_vocab()

    loader = get_basic_loader(dir_path=os.path.join(args.image_path),
                              transform=transform,
                              batch_size=batch_size,
                              shuffle=True,
                              num_workers=num_workers)

    # Build the models
    embed_size = args.embed_size
    num_hiddens = args.num_hidden
    checkpoint_path = 'checkpoints'

    encoder = CNN(embed_size)
    decoder = RNN(embed_size, num_hiddens, len(vocab), 1, rec_unit=args.rec_unit)

    encoder_state_dict, decoder_state_dict, optimizer, *meta = utils.load_models(args.checkpoint_file)
    encoder.load_state_dict(encoder_state_dict)
    decoder.load_state_dict(decoder_state_dict)

    if torch.cuda.is_available():
        encoder.cuda()
        decoder.cuda()

    # Generate captions (inference only, so switch to eval mode)
    encoder.eval()
    decoder.eval()
    try:
        results = []
        for step, (images, image_ids) in enumerate(loader):
            with torch.no_grad():
                images = utils.to_var(images)

                features = encoder(images)
                captions = decoder.sample(features)
                captions = captions.cpu().data.numpy()
                captions = [utils.convert_back_to_text(cap, vocab) for cap in captions]
                captions_formatted = [{'image_id': int(img_id), 'caption': cap} for img_id, cap in zip(image_ids, captions)]
                results.extend(captions_formatted)
                print('Sample:', captions_formatted)
    except KeyboardInterrupt:
        print('Ok bye!')
    finally:
        import json
        file_name = 'captions_model.json'
        with open(file_name, 'w') as f:
            json.dump(results, f)

if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--checkpoint_file', type=str,
            default=None, help='path to saved checkpoint')
    parser.add_argument('--batch_size', type=int,
            default=128, help='size of batches')
    parser.add_argument('--rec_unit', type=str,
            default='gru', help='choose "gru", "lstm" or "elman"')
    parser.add_argument('--image_path', type=str,
            default='data/test2014', help='path to the directory of images')
    parser.add_argument('--embed_size', type=int,
            default=512, help='dimension of word embedding vectors')
    parser.add_argument('--num_hidden', type=int,
            default=512, help='number of hidden units in the RNN')
    args = parser.parse_args()
    main(args)

test.py

from __future__ import print_function
import torch
from torchvision import transforms
from torch.autograd import Variable
import utils
from models import CNN, RNN
from vocab import load_vocab
import argparse
import os
import matplotlib.pyplot as plt
from PIL import Image


def load_image(image_path, transform=None):
    image = Image.open(image_path).convert('RGB')
    if transform is not None:
        image = transform(image).unsqueeze(0)
    return image


def main(args):
    # Image preprocessing
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    ])

    # Load vocabulary
    print("Loading vocabulary...")
    vocab = load_vocab()
    if isinstance(vocab, dict):
        print("Error: Loaded vocabulary is a dictionary, not a Vocabulary object")
        return

    # Build the models
    embed_size = args.embed_size
    num_hiddens = args.num_hidden
    rec_unit = args.rec_unit

    encoder = CNN(embed_size)
    decoder = RNN(embed_size, num_hiddens, len(vocab), 1, rec_unit=rec_unit)

    # Load the trained model parameters
    print("Loading checkpoint...")
    checkpoint = utils.load_models(args.checkpoint_file)

    # Print keys to understand the structure
    print("Checkpoint keys:", checkpoint.keys())

    encoder.load_state_dict(checkpoint['encoder_state_dict'])
    decoder.load_state_dict(checkpoint['decoder_state_dict'])

    if torch.cuda.is_available():
        encoder.cuda()
        decoder.cuda()

    # Process images
    if os.path.isdir(args.image_path):
        image_paths = [os.path.join(args.image_path, img) for img in os.listdir(args.image_path) if
                       img.endswith(('.jpg', '.jpeg', '.png'))]
    else:
        image_paths = [args.image_path]

    for image_path in image_paths:
        # Prepare the image
        image = load_image(image_path, transform)
        image_tensor = utils.to_var(image)

        # Generate caption
        encoder.eval()
        decoder.eval()
        with torch.no_grad():
            feature = encoder(image_tensor)
            sampled_ids = decoder.sample(feature)
            if torch.is_tensor(sampled_ids):
                sampled_ids = sampled_ids.cpu().data.numpy().tolist()
            else:
                sampled_ids = [sampled_ids]

        # Ensure sampled_ids is a 2D list
        if isinstance(sampled_ids[0], int):
            sampled_ids = [sampled_ids]

        # Convert word_ids to words
        sampled_caption = []
        for word_id in sampled_ids[0]:
            word = vocab.idx2word[word_id]
            sampled_caption.append(word)
            if word == '<end>':
                break
        sentence = ' '.join(sampled_caption)

        # Print out the image and the generated caption
        print(f"Caption for {image_path}: {sentence}")

        # Save the result
        result_folder = args.result_folder
        if not os.path.exists(result_folder):
            os.makedirs(result_folder)
        result_path = os.path.join(result_folder, os.path.basename(image_path))

        # Convert image tensor to numpy array for plotting
        image_np = image.cpu().squeeze(0).permute(1, 2, 0).numpy()
        # Reverse the normalization for imshow
        image_np = image_np * (0.229, 0.224, 0.225) + (0.485, 0.456, 0.406)
        image_np = image_np.clip(0, 1)

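        plt.figure()  # fresh figure per image so plots do not pile up
        plt.imshow(image_np)
        plt.title(sentence)
        plt.savefig(result_path)
        plt.close()  # release the figure once it is saved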
        print(f"Result saved to {result_path}")


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--checkpoint_file', type=str, required=True, help='path to the trained model checkpoint')
    parser.add_argument('--image_path', type=str, required=True, help='path to the image file or directory')
    parser.add_argument('--result_folder', type=str, default='result', help='folder to save the result')
    parser.add_argument('--embed_size', type=int, default=512, help='dimension of word embedding vectors')
    parser.add_argument('--num_hidden', type=int, default=512, help='number of hidden units in the RNN')
    parser.add_argument('--rec_unit', type=str, default='gru', help='RNN unit type (gru, lstm, elman)')
    parser.add_argument('--sample', default=False, action='store_true', help='just show result, requires --checkpoint_file')
    args = parser.parse_args()
    main(args)

BLEU_score.py

r'''
Example usage (raw string, so backslashes in the Windows paths are not parsed as escape sequences):
python BLEU_score.py
--checkpoint_file C:\Users\Desktop\show-and-tell\checkpoint\model-3-3236.ckpt
--image_path C:\Users\Desktop\show-and-tell\data\test2014_25
--reference_folder C:\Users\Desktop\show-and-tell\references
'''
from __future__ import print_function
import torch
from torchvision import transforms
from torch.autograd import Variable
import utils
from models import CNN, RNN
from vocab import load_vocab
import argparse
import os
import matplotlib.pyplot as plt
from PIL import Image
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def load_image(image_path, transform=None):
    image = Image.open(image_path).convert('RGB')
    if transform is not None:
        image = transform(image).unsqueeze(0)
    return image

def calculate_bleu(reference_caption, generated_caption, weights):
    reference_caption = [nltk.word_tokenize(reference_caption)]
    generated_caption = nltk.word_tokenize(generated_caption)
    chencherry = SmoothingFunction()
    bleu_score = sentence_bleu(reference_caption, generated_caption, weights=weights, smoothing_function=chencherry.method1)
    return bleu_score

def main(args):
    # Image preprocessing
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    ])

    # Load vocabulary
    print("Loading vocabulary...")
    vocab = load_vocab()
    if isinstance(vocab, dict):
        print("Error: Loaded vocabulary is a dictionary, not a Vocabulary object")
        return

    # Build the models
    embed_size = args.embed_size
    num_hiddens = args.num_hidden
    rec_unit = args.rec_unit

    encoder = CNN(embed_size)
    decoder = RNN(embed_size, num_hiddens, len(vocab), 1, rec_unit=rec_unit)

    # Load the trained model parameters
    print("Loading checkpoint...")
    checkpoint = utils.load_models(args.checkpoint_file)

    # Print keys to understand the structure
    print("Checkpoint keys:", checkpoint.keys())

    encoder.load_state_dict(checkpoint['encoder_state_dict'])
    decoder.load_state_dict(checkpoint['decoder_state_dict'])

    if torch.cuda.is_available():
        encoder.cuda()
        decoder.cuda()

    # Process images
    if os.path.isdir(args.image_path):
        image_paths = [os.path.join(args.image_path, img) for img in os.listdir(args.image_path) if
                       img.endswith(('.jpg', '.jpeg', '.png'))]
    else:
        image_paths = [args.image_path]

    total_bleu_1 = 0
    total_bleu_2 = 0
    total_bleu_3 = 0
    total_bleu_4 = 0
    valid_image_count = 0

    for image_path in image_paths:
        # Prepare the image
        image = load_image(image_path, transform)
        image_tensor = utils.to_var(image)

        # Generate caption
        encoder.eval()
        decoder.eval()
        with torch.no_grad():
            feature = encoder(image_tensor)
            sampled_ids = decoder.sample(feature)
            if torch.is_tensor(sampled_ids):
                sampled_ids = sampled_ids.cpu().data.numpy().tolist()
            else:
                sampled_ids = [sampled_ids]

        # Ensure sampled_ids is a 2D list
        if isinstance(sampled_ids[0], int):
            sampled_ids = [sampled_ids]

        # Convert word_ids to words
        sampled_caption = []
        for word_id in sampled_ids[0]:
            word = vocab.idx2word[word_id]
            sampled_caption.append(word)
            if word == '<end>':
                break
        sentence = ' '.join(sampled_caption)

        # Print out the image and the generated caption
        print(f"Caption for {image_path}: {sentence}")

        # Load reference caption
        base_name = os.path.basename(image_path)
        reference_caption_path = os.path.join(args.reference_folder, os.path.splitext(base_name)[0] + '.txt')
        if not os.path.exists(reference_caption_path):
            print(f"Reference caption not found for {image_path}, skipping BLEU calculation.")
            continue

        with open(reference_caption_path, 'r') as f:
            reference_caption = f.read().strip()

        # Calculate BLEU scores
        bleu_1 = calculate_bleu(reference_caption, sentence, weights=(1, 0, 0, 0))
        bleu_2 = calculate_bleu(reference_caption, sentence, weights=(0.5, 0.5, 0, 0))
        bleu_3 = calculate_bleu(reference_caption, sentence, weights=(1/3, 1/3, 1/3, 0))
        bleu_4 = calculate_bleu(reference_caption, sentence, weights=(0.25, 0.25, 0.25, 0.25))

        print(f"BLEU-1 score for {image_path}: {bleu_1}")
        print(f"BLEU-2 score for {image_path}: {bleu_2}")
        print(f"BLEU-3 score for {image_path}: {bleu_3}")
        print(f"BLEU-4 score for {image_path}: {bleu_4}")

        total_bleu_1 += bleu_1
        total_bleu_2 += bleu_2
        total_bleu_3 += bleu_3
        total_bleu_4 += bleu_4
        valid_image_count += 1

        # Save the result
        result_folder = args.result_folder
        if not os.path.exists(result_folder):
            os.makedirs(result_folder)
        result_path = os.path.join(result_folder, os.path.basename(image_path))

        # Convert image tensor to numpy array for plotting
        image_np = image.cpu().squeeze(0).permute(1, 2, 0).numpy()
        # Reverse the normalization for imshow
        image_np = image_np * (0.229, 0.224, 0.225) + (0.485, 0.456, 0.406)
        image_np = image_np.clip(0, 1)

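        plt.figure()  # fresh figure per image so plots do not pile up
        plt.imshow(image_np)
        plt.title(f'{sentence}\nBLEU-4 score: {bleu_4}')
        plt.savefig(result_path)
        plt.close()  # release the figure once it is saved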
        print(f"Result saved to {result_path}")

    if valid_image_count > 0:
        avg_bleu_1 = total_bleu_1 / valid_image_count
        avg_bleu_2 = total_bleu_2 / valid_image_count
        avg_bleu_3 = total_bleu_3 / valid_image_count
        avg_bleu_4 = total_bleu_4 / valid_image_count
        print(f'Average BLEU-1 score for all images: {avg_bleu_1}')
        print(f'Average BLEU-2 score for all images: {avg_bleu_2}')
        print(f'Average BLEU-3 score for all images: {avg_bleu_3}')
        print(f'Average BLEU-4 score for all images: {avg_bleu_4}')
    else:
        print('No valid images found for BLEU calculation.')

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--checkpoint_file', type=str, required=True, help='path to the trained model checkpoint')
    parser.add_argument('--image_path', type=str, required=True, help='path to the image file or directory')
    parser.add_argument('--result_folder', type=str, default='result', help='folder to save the result')
    parser.add_argument('--reference_folder', type=str, required=True, help='folder containing reference captions')
    parser.add_argument('--embed_size', type=int, default=512, help='dimension of word embedding vectors')
    parser.add_argument('--num_hidden', type=int, default=512, help='number of hidden units in the RNN')
    parser.add_argument('--rec_unit', type=str, default='gru', help='RNN unit type (gru, lstm, elman)')
    parser.add_argument('--sample', default=False, action='store_true', help='just show result, requires --checkpoint_file')
    args = parser.parse_args()
    main(args)

BLEU_validation.py

from __future__ import print_function
import torch
from torchvision import transforms
from torch.autograd import Variable
import utils
from models import CNN, RNN
from vocab import load_vocab
import argparse
import os
import matplotlib.pyplot as plt
from PIL import Image
import nltk
from nltk.translate.bleu_score import sentence_bleu

def load_image(image_path, transform=None):
    image = Image.open(image_path).convert('RGB')
    if transform is not None:
        image = transform(image).unsqueeze(0)
    return image

def calculate_bleu(reference_caption, generated_caption):
    reference_caption = [reference_caption.split()]
    generated_caption = generated_caption.split()
    bleu_score = sentence_bleu(reference_caption, generated_caption)
    return bleu_score

def main(args):
    # Image preprocessing
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    ])

    # Load vocabulary
    print("Loading vocabulary...")
    vocab = load_vocab()
    if isinstance(vocab, dict):
        print("Error: Loaded vocabulary is a dictionary, not a Vocabulary object")
        return

    # Build the models
    embed_size = args.embed_size
    num_hiddens = args.num_hidden
    rec_unit = args.rec_unit

    encoder = CNN(embed_size)
    decoder = RNN(embed_size, num_hiddens, len(vocab), 1, rec_unit=rec_unit)

    # Load the trained model parameters
    print("Loading checkpoint...")
    checkpoint = utils.load_models(args.checkpoint_file)

    # Print keys to understand the structure
    print("Checkpoint keys:", checkpoint.keys())

    encoder.load_state_dict(checkpoint['encoder_state_dict'])
    decoder.load_state_dict(checkpoint['decoder_state_dict'])

    if torch.cuda.is_available():
        encoder.cuda()
        decoder.cuda()

    # Process images
    if os.path.isdir(args.image_path):
        image_paths = [os.path.join(args.image_path, img) for img in os.listdir(args.image_path) if
                       img.endswith(('.jpg', '.jpeg', '.png'))]
    else:
        image_paths = [args.image_path]

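    total_bleu_score = 0
    # Count only images that have a reference caption file; images without one
    # are skipped in the loop below and must not dilute the average.
    num_images = sum(
        os.path.exists(os.path.join(
            args.reference_folder,
            os.path.splitext(os.path.basename(path))[0] + '.txt'))
        for path in image_paths)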

    for image_path in image_paths:
        # Prepare the image
        image = load_image(image_path, transform)
        image_tensor = utils.to_var(image)

        # Generate caption
        encoder.eval()
        decoder.eval()
        with torch.no_grad():
            feature = encoder(image_tensor)
            sampled_ids = decoder.sample(feature)
            if torch.is_tensor(sampled_ids):
                sampled_ids = sampled_ids.cpu().data.numpy().tolist()
            else:
                sampled_ids = [sampled_ids]

        # Ensure sampled_ids is a 2D list
        if isinstance(sampled_ids[0], int):
            sampled_ids = [sampled_ids]

        # Convert word_ids to words
        sampled_caption = []
        for word_id in sampled_ids[0]:
            word = vocab.idx2word[word_id]
            sampled_caption.append(word)
            if word == '<end>':
                break
        sentence = ' '.join(sampled_caption)

        # Print out the image and the generated caption
        print(f"Caption for {image_path}: {sentence}")

        # Load reference caption
        base_name = os.path.basename(image_path)
        reference_caption_path = os.path.join(args.reference_folder, os.path.splitext(base_name)[0] + '.txt')
        if not os.path.exists(reference_caption_path):
            print(f"Reference caption not found for {image_path}, skipping BLEU calculation.")
            continue

        with open(reference_caption_path, 'r') as f:
            reference_caption = f.read().strip()

        # Calculate BLEU score
        bleu_score = calculate_bleu(reference_caption, sentence)
        print(f"BLEU score for {image_path}: {bleu_score}")
        total_bleu_score += bleu_score

        # Save the result
        result_folder = args.result_folder
        os.makedirs(result_folder, exist_ok=True)
        result_path = os.path.join(result_folder, os.path.basename(image_path))

        # Convert image tensor to numpy array for plotting
        image_np = image.cpu().squeeze(0).permute(1, 2, 0).numpy()
        # Reverse the normalization for imshow
        image_np = image_np * (0.229, 0.224, 0.225) + (0.485, 0.456, 0.406)
        image_np = image_np.clip(0, 1)

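        plt.figure()  # start a fresh figure so earlier images are not drawn over
        plt.imshow(image_np)
        plt.title(f'{sentence}\nBLEU score: {bleu_score}')
        plt.savefig(result_path)
        plt.close()  # release the figure so memory stays flat over many images
        print(f"Result saved to {result_path}")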

    # Guard against an empty image list to avoid ZeroDivisionError
    avg_bleu_score = total_bleu_score / num_images if num_images else 0.0
    print(f'Average BLEU-4 score for all images: {avg_bleu_score}')

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--checkpoint_file', type=str, required=True, help='path to the trained model checkpoint')
    parser.add_argument('--image_path', type=str, required=True, help='path to the image file or directory')
    parser.add_argument('--result_folder', type=str, default='result', help='folder to save the result')
    parser.add_argument('--reference_folder', type=str, required=True, help='folder containing reference captions')
    parser.add_argument('--embed_size', type=int, default=512, help='dimension of word embedding vectors')
    parser.add_argument('--num_hidden', type=int, default=512, help='number of hidden units in the RNN')
    parser.add_argument('--rec_unit', type=str, default='gru', help='RNN unit type (gru, lstm, elman)')
    parser.add_argument('--sample', default=False, action='store_true', help='just show result, requires --checkpoint_file')
    args = parser.parse_args()
    main(args)
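
As a rough illustration of what the `sentence_bleu` call in BLEU_validation.py measures, the sketch below computes clipped n-gram precision and the brevity penalty for a single caption from scratch. It is a simplified stand-in and not NLTK's implementation (no smoothing, sentence level only); the names `ngrams`, `modified_precision`, and `simple_bleu` are made up for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(references, hypothesis, n):
    """Clipped n-gram precision: each hypothesis n-gram counts at most as
    often as it appears in the reference that contains it most often."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    if not hyp_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())


def simple_bleu(references, hypothesis, max_n=4):
    """Unsmoothed cumulative BLEU with uniform weights over 1..max_n grams."""
    if not hypothesis:
        return 0.0
    precisions = [modified_precision(references, hypothesis, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        # Without smoothing, one zero precision makes the geometric mean zero.
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty uses the reference length closest to the hypothesis length.
    hyp_len = len(hypothesis)
    ref_len = min((len(ref) for ref in references),
                  key=lambda length: (abs(length - hyp_len), length))
    penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return penalty * math.exp(log_mean)


reference = "a dog runs on the grass".split()
print(simple_bleu([reference], "a dog runs".split(), max_n=1))  # BLEU-1
print(simple_bleu([reference], "a dog runs".split(), max_n=4))  # BLEU-4
```

With `max_n=4`, a caption that is shorter than four words or shares no 4-gram with the reference scores 0 in this unsmoothed form, so per-image BLEU-4 values of 0 are expected for short captions. The headline Pascal result quoted in the Abstract is BLEU-1, which corresponds to `max_n=1` here.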

Additional code

#### losses.pkl_file_structure.py ####

import torch

checkpoint_path = 'C:\\Users\\Desktop\\show-and-tell\\checkpoint\\model-3-3236.ckpt'
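# map_location='cpu' lets a checkpoint saved on a GPU load on a CPU-only machine
checkpoint = torch.load(checkpoint_path, map_location='cpu')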
print(checkpoint['decoder_state_dict']['unit.weight_hh_l0'].shape)
print(checkpoint['decoder_state_dict']['linear.weight'].shape)




#### json_to_text.py ####

import json

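def json_to_txt(json_file, txt_file):
    # Read the COCO caption annotations
    with open(json_file, 'r', encoding='utf-8') as f:
        data = json.load(f)
    # Collect every caption, trimming stray whitespace and newlines
    captions = [annotation['caption'].strip() for annotation in data['annotations']]
    # Write one caption per line
    with open(txt_file, 'w', encoding='utf-8') as f:
        for caption in captions:
            f.write(caption + '\n')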

if __name__ == '__main__':
    json_file = 'data/annotations/captions_train2014.json'
    txt_file = 'data/annotations/captions_train2014.txt'
    json_to_txt(json_file, txt_file)
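
BLEU_validation.py looks for one `<image name>.txt` file per image in `--reference_folder`, whereas `json_to_txt` writes every caption into a single file. The sketch below shows one possible way to group captions per image instead. It assumes the standard COCO captions layout (`images` entries with `id` and `file_name`, `annotations` entries with `image_id` and `caption`); the function names `group_captions_by_image` and `write_reference_files` are hypothetical.

```python
import os
from collections import defaultdict


def group_captions_by_image(data):
    """Map each image's file name (without extension) to its list of captions."""
    id_to_stem = {img['id']: os.path.splitext(img['file_name'])[0]
                  for img in data['images']}
    grouped = defaultdict(list)
    for annotation in data['annotations']:
        stem = id_to_stem.get(annotation['image_id'])
        if stem is not None:  # ignore annotations whose image is not listed
            grouped[stem].append(annotation['caption'].strip())
    return dict(grouped)


def write_reference_files(grouped, out_dir):
    """Write '<stem>.txt' per image, one caption per line."""
    os.makedirs(out_dir, exist_ok=True)
    for stem, captions in grouped.items():
        with open(os.path.join(out_dir, stem + '.txt'), 'w', encoding='utf-8') as f:
            f.write('\n'.join(captions) + '\n')


# Tiny made-up example in the assumed COCO layout
example = {
    'images': [{'id': 1, 'file_name': 'a.jpg'}],
    'annotations': [{'image_id': 1, 'caption': 'A dog runs. '},
                    {'image_id': 1, 'caption': 'A dog on grass'}],
}
print(group_captions_by_image(example))
```

Note that BLEU_validation.py as written reads the whole reference file as a single string, so a multi-line file would be treated as one long reference unless the script is changed to split it per line.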
    
    
    
    
#### check_num_hidden.py ####
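import torch

checkpoint_path = 'C:\\Users\\Desktop\\show-and-tell\\checkpoint\\model-3-3236.ckpt'
# map_location='cpu' lets a checkpoint saved on a GPU load on a CPU-only machine
checkpoint = torch.load(checkpoint_path, map_location='cpu')
# Hidden-to-hidden weights: rows = (number of gates x num_hidden), columns = num_hidden
print(checkpoint['decoder_state_dict']['unit.weight_hh_l0'].shape)
# Output layer weights: the second dimension should match num_hidden
print(checkpoint['decoder_state_dict']['linear.weight'].shape)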
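
The printed shape of `unit.weight_hh_l0` can be turned into the `--rec_unit` and `--num_hidden` values needed to reload the checkpoint. The sketch below assumes the decoder's `unit` is one of PyTorch's built-in `nn.RNN`, `nn.GRU`, or `nn.LSTM` layers, whose hidden-to-hidden weight has 1, 3, or 4 stacked gate blocks respectively; `infer_rnn_config` is a hypothetical helper, not part of the repository.

```python
def infer_rnn_config(weight_hh_shape):
    """Guess (rec_unit, num_hidden) from the shape of weight_hh_l0.

    PyTorch stores weight_hh_l0 as (gates * hidden_size, hidden_size):
    nn.RNN has 1 gate block, nn.GRU has 3, nn.LSTM has 4.
    """
    rows, hidden = weight_hh_shape
    gates_to_unit = {1: 'elman', 3: 'gru', 4: 'lstm'}
    if hidden <= 0 or rows % hidden != 0 or rows // hidden not in gates_to_unit:
        raise ValueError(f'unexpected weight_hh_l0 shape: {weight_hh_shape}')
    return gates_to_unit[rows // hidden], hidden


# With a loaded checkpoint, the shape would come from:
#   tuple(checkpoint['decoder_state_dict']['unit.weight_hh_l0'].shape)
print(infer_rnn_config((1536, 512)))
```

If the inferred values differ from the `--rec_unit` and `--num_hidden` defaults, `load_state_dict` will fail with a size mismatch, so passing the inferred values on the command line is the likely fix.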

11 Data Structure (COCO2017)